On testing in production

Posted by Albert Gareev on May 25, 2017 | Categories: Discussions, Notes, Quora

Based on my Quora answer to “What is ‘Testing in production’ as a concept?”

“What is testing in production?” –

Follow along..

  1. Product owners and project managers need to make decisions about software and the software development process.
  2. Effective decision making is informed decision making.
  3. Effective testing provides information directly relevant to the risks and concerns that decision makers have and might have.
  4. Each decision has an “opportunity window” and associated costs.
  5. Testing must operate effectively and efficiently within these constraints, i.e.
    • Provide “good enough” volume of information within the given resources and limitations;
    • Balance the need to mitigate risks against the risk of gathering so much information that the decision becomes too late, too costly, or too risky to make.

Don’t you think this test was a little.. redundant?


When the code is promoted to production, it’s too late to address many risks.

Many, but not all.

Testing in production happens in some of the following ways:

  • Monitoring and diagnostics – gathering and analyzing information about the state and performance of a system. A well-developed software system has a sophisticated logging system that may flag many emerging problems.
  • Proactive investigation of customer-reported feedback and concerns. Yes, some companies even track Twitter rants!
  • A/B Testing – presenting variations of a feature to defined user groups to learn which implementation is better accepted. It might be as trivial as the color and positioning of a button. Facebook does that a lot with advertisements. Google does that at large scale.
  • Incremental release (slippery slope begins) – rolling out a new release to a limited group of users to employ them, with or without their knowledge, as “beta testers”; if the release is too buggy, only a limited population of users will experience it, and the update (hopefully) is rolled back.
  • Slapdash cases – uncommon, risky, and generally not recommended cases of testing in production include:
    • Catching up on testing that should have been done during development, before the release;
    • Testing in production conditions because the test environment was not configured to resemble the required characteristics.
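
To make the A/B testing and incremental release items a bit more concrete, here is a minimal sketch of how users might be assigned to groups. It assumes a stable string user ID and hashes it into one of 100 buckets; the function names (`bucket`, `in_rollout`, `ab_variant`) and the release and experiment labels are hypothetical, not taken from any particular product or from the original answer.

```python
import hashlib


def bucket(user_id: str, experiment: str, buckets: int = 100) -> int:
    """Map a user to a stable bucket in [0, buckets) for one experiment.

    Hashing the experiment name together with the user ID keeps the
    assignment deterministic, and different per experiment.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets


def in_rollout(user_id: str, release: str, percent: int) -> bool:
    """Incremental release: True if the user is in the first `percent` buckets."""
    return bucket(user_id, release) < percent


def ab_variant(user_id: str, experiment: str, variants=("A", "B")) -> str:
    """A/B testing: pick one variant per user, the same one on every call."""
    return variants[bucket(user_id, experiment, len(variants))]


if __name__ == "__main__":
    # Hypothetical labels, for illustration only.
    print(in_rollout("user-1", "release-42", 10))
    print(ab_variant("user-1", "button-color"))
```

In this sketch, widening the release means raising `percent`, and a user who was already included stays included. Rolling back means setting `percent` to 0, after which nobody receives the new release.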

But there’s one more..

There are risks associated with promotion of builds, resources, and configuration. Whether done mostly manually or in a highly automated fashion, the process may fail in many ways.

To address that, companies practice post-promotion testing in production.

Such testing is typically characterized by broad yet shallow coverage and a focus on the most business-critical areas.

Such testing also carries some risks – like interfering with a real business flow or impacting real business data.

For that reason, certain precautionary measures are good practice:

  • Isolation – conducting operations with accounts never used by customers, and on mock branches never accessed by real customers.
  • Using pre-defined accounts and objects that are normally marked as “disabled” and made accessible only during post-promotion testing.
  • Well-established data back-up and roll-back procedures to ensure recovery in case of a problem.
  • Timing – reducing risks by promoting outside of operational hours or during a scheduled outage, or during off-peak hours for 24/7 systems.
  • Whenever using real production data, perform only lower-risk tests – those that do not create new data or modify existing data, e.g. search, view, reports.
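
A couple of these precautions can be enforced by the test harness itself. The sketch below is one possible way to do it, under assumptions of my own: every post-promotion check declares the action it performs and the account it runs under, and anything that would write under a non-test account is skipped rather than executed. The names (`SmokeCheck`, `run_smoke_suite`, the account and check labels) are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Set

# Assumption: these are the actions that never create or modify data.
READ_ONLY_ACTIONS = {"search", "view", "report"}


@dataclass(frozen=True)
class SmokeCheck:
    name: str
    action: str                # e.g. "search", "view", "create"
    account: str               # account the check runs under
    run: Callable[[], bool]    # returns True when the check passes


def is_allowed(check: SmokeCheck, test_accounts: Set[str]) -> bool:
    """Dedicated test accounts may write; everything else is read-only."""
    return check.account in test_accounts or check.action in READ_ONLY_ACTIONS


def run_smoke_suite(checks: Iterable[SmokeCheck],
                    test_accounts: Set[str]) -> Dict[str, str]:
    """Run broad, shallow checks; skip anything that could touch real data."""
    results: Dict[str, str] = {}
    for check in checks:
        if not is_allowed(check, test_accounts):
            results[check.name] = "skipped"
            continue
        try:
            results[check.name] = "passed" if check.run() else "failed"
        except Exception:
            # An error during a check counts as a failure of that check.
            results[check.name] = "failed"
    return results


if __name__ == "__main__":
    checks = [
        SmokeCheck("search-catalog", "search", "real-customer", lambda: True),
        SmokeCheck("create-order", "create", "real-customer", lambda: True),
        SmokeCheck("create-test-order", "create", "qa-smoke-01", lambda: True),
    ]
    print(run_smoke_suite(checks, {"qa-smoke-01"}))
```

Here the write against the real customer account is reported as "skipped" and never runs, while the same write under the dedicated test account goes ahead.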

This work by Albert Gareev is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported.