By now—unless you have been living under a rock—you have seen the reporting on the myriad problems of the Affordable Care Act (ACA) website. Even President Obama has begun to publicly acknowledge that the problems stem from more than simply “unexpected volume.”
From a technology point of view, it would be easy to “Monday Morning Quarterback” the website’s design (technical and UX), execution, integration, testing, operation and support, especially if you work in Health IT, eCommerce, Systems Integration or about a dozen other fields. However, we at Oulixeus thought we would take a step back and look at what you should do if you are ever faced with a big consumer web/eCommerce rollout that is riddled with bugs and problems. The result was a 7-Step Program for Responding to Bad Web Rollouts:
Step 1: Recognize that you CANNOT hide
When faced with something like this, the natural response may be akin to a “bunker mentality.” This is entirely futile. You cannot hide problems that any web user (including any reporter or competitor) can see in any web browser. Similarly, you cannot hide the fact that you are dropping transactions to supply chain partners.
Everyone knows things are not working well on your site. Acknowledge this and start to plan and respond accordingly. Bad problems happen everywhere (even Google, Amazon and Facebook have outages sometimes). How you respond will determine whether you start to rebuild trust with your customers—or lose even more.
Step 2: Ban Fingerpointyness
It will also be really easy to start pointing fingers: the product manager did not tell the developers X; the developers made assumption Y; no one tested Z; the project manager reclassified a “Critical” bug to a “Minor” one so we could roll out the site; the CxO said we had to go live; and so on…
All of this fingerpointyness not only burns time and energy, it also prevents people from sharing information that is critical to resolving all the problems you are facing. Ban all finger pointing. Focus on solving the problems at hand. After all is said and done, you can hold an After Action Review to learn from your mistakes and figure out how to avoid similar problems in the future.
Step 3: Adopt the POV of your customer
As you try to figure out what is broken and what is working, it will be natural for teams to start to get defensive: pointing out that their specific code, infrastructure, etc. is working as intended. While this is not hurtful per se, it is also not constructive. Your customers do not care that the front end of registration works if they cannot receive an email with a link to confirm their registration (or worse, if your system task queues are so backed up that customer confirmation events cannot even reach your database).
Trace the transactions through your platform from the POV of your customers. The places that break their ability to continue are where you need to focus your fixes. This will drive a systems approach to problem resolution (e.g., you may fix a bug and add queue capacity and improve database connection pooling and clarify onscreen messaging, all to solve an issue where X% of your customers cannot register on your site).
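One lightweight way to find where customers get stuck is to count how many of them reach each step of a transaction and look for the biggest drop between adjacent steps. The sketch below assumes you can pull such counts from your logs or analytics; the step names and numbers are made up for illustration only.

```python
# Hypothetical funnel counts: how many customers reached each step.
# Step names and numbers are illustrative, not real data.
FUNNEL = [
    ("landing_page", 10000),
    ("registration_form_submitted", 6200),
    ("confirmation_email_received", 2100),
    ("email_link_confirmed", 1900),
    ("first_login", 1850),
]


def drop_offs(funnel):
    """Return (from_step, to_step, lost, loss_rate) for each adjacent pair of steps."""
    results = []
    for (prev_name, prev_count), (name, count) in zip(funnel, funnel[1:]):
        lost = prev_count - count
        rate = lost / prev_count if prev_count else 0.0
        results.append((prev_name, name, lost, rate))
    return results


def worst_break(funnel):
    """Return the transition that loses the most customers."""
    return max(drop_offs(funnel), key=lambda r: r[2])


worst = worst_break(FUNNEL)
print(f"Biggest break: {worst[0]} -> {worst[1]} ({worst[2]} customers lost)")
```

With these sample numbers, the biggest break sits between submitting the registration form and receiving the confirmation email, which is exactly the kind of cross-team problem that no single component owner would see on their own.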
Step 4: Focus on those issues Critical to Quality
If you just start hacking away at problems as you find them, you run the risk of devolving into chaos. While it may “feel” like you are solving issues (especially if you have been up for days without sleep), in reality you are only addressing the “low hanging fruit.” It is unlikely that you are really focusing on the issues that are most critical to your customers and continued operation.
Starting from the POV of the customer, log what is not working (not just what is broken but also what is too slow, what is throwing lots of errors to your logs and what is causing massive drops in conversion). Prioritize these “Defects” by importance (i.e., net impact per incident x number of users/transactions/etc.). This will give you a prioritized plan of attack.
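The importance score described above is simple enough to compute in a spreadsheet, but here is a minimal Python sketch of the same arithmetic. The Defect fields, the 1 to 10 impact scale and the sample defects are all assumptions for illustration; use whatever unit of impact makes sense for your business.

```python
from dataclasses import dataclass


@dataclass
class Defect:
    name: str
    impact_per_incident: float  # assumed scale: 1 (cosmetic) to 10 (customer cannot proceed)
    affected_count: int         # users or transactions hit during the period you measured

    @property
    def importance(self) -> float:
        # importance = net impact per incident x number of users/transactions
        return self.impact_per_incident * self.affected_count


def prioritize(defects):
    """Return defects sorted from most to least important."""
    return sorted(defects, key=lambda d: d.importance, reverse=True)


# Hypothetical defect log; names and numbers are made up.
DEFECTS = [
    Defect("typo in footer", 1, 10000),
    Defect("payment callback dropped", 10, 300),
    Defect("confirmation emails not delivered", 8, 4100),
    Defect("slow comparison page", 3, 9000),
]

for d in prioritize(DEFECTS):
    print(f"{d.importance:>8.0f}  {d.name}")
```

Note that a severe defect that hits few customers can rank below a milder one that hits nearly everyone. That is the point of multiplying the two factors rather than sorting by severity alone.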
Step 5: Set up an SQR/Tiger Team/Rapid Response Team for Response
Your number one priority is getting your website (back) to the point where customers are able to use your site with ease and success. Everything else is secondary.
Set up a Rapid Response Team of dedicated individuals from all departments (Product, Engineering, QA, Operations, Support AND PR, Marketing, Customer Service, and – if you are in enterprise software – Sales and Account Management). The purpose of this team is to bang through your Critical to Quality list in non-blocking priority order: assign resources to attack the most important problems that are not blocked by others first. If your problems are severe enough, make this a multi-shift team, so you can address problems around-the-clock without burning people out.
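To make “non-blocking priority order” concrete, here is a small sketch of how a team might pick its next assignments: skip anything still waiting on an unfinished fix, then take the highest-priority items that remain. The backlog entries, the blocked_by field and the next_work helper are hypothetical names invented for this example.

```python
def next_work(backlog, done, team_size):
    """Pick up to team_size unblocked items, highest priority first.

    backlog: list of dicts with "name", "priority" and "blocked_by" (a set of names).
    done: set of names of items that are already fixed.
    """
    ready = [
        item for item in backlog
        if item["name"] not in done and item["blocked_by"] <= done
    ]
    ready.sort(key=lambda item: item["priority"], reverse=True)
    return [item["name"] for item in ready[:team_size]]


# Hypothetical backlog; names and priorities are made up.
BACKLOG = [
    {"name": "fix email delivery", "priority": 32800, "blocked_by": {"add queue capacity"}},
    {"name": "add queue capacity", "priority": 15000, "blocked_by": set()},
    {"name": "speed up slow page", "priority": 27000, "blocked_by": set()},
    {"name": "fix footer typo", "priority": 10000, "blocked_by": set()},
]

print(next_work(BACKLOG, done=set(), team_size=2))
print(next_work(BACKLOG, done={"add queue capacity"}, team_size=2))
```

This sketch ranks a blocking item by its own priority only. In practice you may want to bump up anything that blocks a top-priority fix, which is the kind of call the team can make at its shift meetings.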
Meet often, at the start and end of every shift, to go over the status of fixes (including adding newly detected issues and re-prioritizing existing ones based on new data). Empower the team with the authority to release fixes; having a team with all departments involved will short-circuit classic delays in communication and approval. Appoint a senior executive to run the team, one who can call in favors and specialists as needed to “get things done.”
Step 6: Share what is going on: Good and Bad
You may have the inclination to do all of the above steps internally and wait until “things are fixed” before providing an external update. Avoid this inclination.
Instead, regularly share summary updates with your customers (summary metrics on the defects you have found and the fixes you have released). Ideally, you should do this on a web page (GitHub Status is an outstanding practitioner of this; poke around the current and historical information they share). This transparency will not get you into trouble; it will actually begin to rebuild trust.
Step 7: Wash, Rinse, Repeat
As things get better, you may have the urge to stop looking for problems (you may even believe that finding new problems is a bad thing).
Avoid this urge and repeat this entire process iteratively. Keep looking for new issues from the customer’s POV. Prioritize and fix them based on customer impact. Share results (internally and externally).
Don’t just repeat this for the duration of your crisis. Instead make continuous improvement part of your operations. If you do, you will reduce your defects exponentially over time (this is Six Sigma DMAIC at work) and can eventually “dial down” Rapid Response Team Scrum Huddles to Weekly System Quality Meetings.
***
These Seven Steps may seem like a lot of work. However, they are something you can set up in an hour in the case of an emergency. They do not require expensive system or incident management software. They can be set up and run with tools as lightweight as JIRA On-demand, Microsoft Excel or Google Spreadsheets.