Risk Management Theatre: On Show At An Organization Near You

Published 05 August 2013

Translations: Korean

One of the concepts that will feature in the new book I am working on is “risk management theatre”. This is the name I coined for the commonly encountered control apparatus, imposed in a top-down way, which makes life painful for the innocent but can be circumvented by the guilty (the name comes by analogy with security theatre). Risk management theatre is the outcome of optimizing processes for the case that somebody will do something stupid or bad, because (to quote Bjarte Bogsnes talking about management) “there might be someone who cannot be trusted. The strategy seems to be preventative control on everybody instead of damage control on those few.”

Unfortunately, risk management theatre is everywhere in large organizations, and reflects the continuing dominance of the Theory X management paradigm. The alternative to the top-down control approach is what I have called adaptive risk management, informed by human-centred management theories (for example the work of Ohno, Deming, Drucker, Denning and Dweck) and the study of how complex systems behave, particularly when they drift into failure. Adaptive risk management is based on systems thinking, transparency, experimentation, and fast feedback loops.

Here are some examples of the differences between the two approaches in the field of IT:

Adaptive risk management: people work to detect problems through improving transparency and feedback, and solve them through improvisation and experimentation.
Risk management theatre: management imposes controls and processes which make life painful for the innocent but can be circumvented by the guilty.

Adaptive: Continuous code review, in which engineers ask a colleague to look over their changes before check-in, technical leads review all check-ins made by their team, and code review tools allow people to comment on each other’s work once it is in trunk.
Theatre: Mandatory code review enforced by check-in gates, where a tool requires changes to be signed off by somebody else before they can be merged into trunk. This is inefficient and delays feedback on non-trivial regressions (including performance regressions).

Adaptive: Fast, automated unit and acceptance tests which inform engineers within minutes (for unit tests) or tens of minutes (for acceptance tests) if they have introduced a known regression into trunk, and which can be run on workstations before commit (a minimal sketch of such a pre-commit check appears after this list).
Theatre: Manual testing as a precondition for integration, especially when performed by a different team or in a different location. Like mandatory code review, this delays feedback on the effect of the change on the system as a whole.

Adaptive: A deployment pipeline which provides complete traceability of all changes from check-in to release, and which detects and rejects risky changes automatically through a combination of automated tests and manual validations (the second sketch after this list illustrates such a stage gate).
Theatre: A comprehensive documentation trail, so that in the event of a failure we can discover the human error that is the “root cause”, a notion that makes sense only in the mechanistic, Cartesian paradigm that applies to systems that are not complex.

Adaptive: Situational awareness created through tools which make it easy to monitor, analyze and correlate relevant data, including process, business and systems-level metrics as well as the discussion threads around events.
Theatre: Segregation of duties, which acts as a barrier to knowledge sharing, feedback and collaboration, and reduces the situational awareness that is essential to an effective response in the event of an incident.
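
To make the fast-feedback entry concrete, here is a minimal sketch of a pre-commit check that runs the fast unit test suite on the engineer’s workstation before check-in. It assumes a pytest-based suite under tests/unit and git hooks; the path and the tool choice are my own illustrative assumptions, not a prescription.

    #!/usr/bin/env python3
    """Hypothetical git pre-commit hook: run the fast unit tests before
    check-in so feedback arrives in minutes. Assumes a pytest suite under
    tests/unit; both the path and the tool are illustrative."""

    import subprocess
    import sys


    def main() -> int:
        # Run only the fast unit tests; the slower acceptance tests run
        # later, in the deployment pipeline, within tens of minutes.
        result = subprocess.run(["pytest", "tests/unit", "-q", "--maxfail=1"])
        if result.returncode != 0:
            print(
                "Unit tests failed; commit aborted. Fix the regression, or use "
                "`git commit --no-verify` to bypass this check deliberately.",
                file=sys.stderr,
            )
        return result.returncode


    if __name__ == "__main__":
        sys.exit(main())

Installed as .git/hooks/pre-commit (and made executable), this gives the engineer feedback on known regressions before the change reaches trunk, while --no-verify leaves a deliberate escape hatch: the point is fast feedback, not a gate that can only be circumvented covertly.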
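
The deployment pipeline entry also lends itself to a sketch. The fragment below models a pipeline as a sequence of automated stage gates that rejects risky changes and records, for every change, which stages it passed, so traceability is a by-product of the pipeline rather than a separate documentation trail. All of the names (Change, run_stage, promote, the stand-in checks) are hypothetical.

    """Hypothetical deployment pipeline stage gate: promote a change
    through successive automated checks, recording each result so every
    change is traceable from check-in to release. Real checks would
    invoke the test suites; these stand-ins always pass."""

    from dataclasses import dataclass, field


    @dataclass
    class Change:
        revision: str                          # version control identifier
        audit_trail: list = field(default_factory=list)


    def run_stage(change, stage, check):
        """Run one automated check and record the outcome (traceability)."""
        ok = check(change)
        change.audit_trail.append((stage, "passed" if ok else "failed"))
        return ok


    def promote(change):
        """Stop at the first failing gate so feedback is fast and the
        failing stage is on record; only fully green changes are released."""
        stages = [
            ("unit-tests", lambda c: True),        # stand-in checks
            ("acceptance-tests", lambda c: True),
            ("performance-tests", lambda c: True),
        ]
        return all(run_stage(change, name, check) for name, check in stages)


    change = Change(revision="abc123")
    released = promote(change)
    print(change.revision, change.audit_trail, "released" if released else "rejected")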

It’s important to emphasize that there are circumstances in which the risk management theatre countermeasures above are appropriate. If your delivery and operational processes are chaotic and undisciplined, imposing controls can be an effective way to improve, so long as we understand that they are a temporary countermeasure rather than an end in themselves, and provided they are applied with the consent of the people who must work within them.

Here are some more general differences between the two approaches:

Adaptive risk management: people work to detect problems through improving transparency and feedback, and solve them through improvisation and experimentation.
Risk management theatre: management imposes controls and processes which make life painful for the innocent but can be circumvented by the guilty.

Adaptive: Principle-based and dynamic: principles can be applied to situations that were not envisaged when the principles were created.
Theatre: Rule-based and static: when we encounter new technologies and processes (for example, cloud computing) we need to rewrite the rules.

Adaptive: Uses transparency to prevent accidents and bad behaviour. When it’s easy for anybody to see what anybody else is doing, people are more careful. As Louis Brandeis said, “Publicity is justly commended as a remedy for social and industrial diseases. Sunlight is said to be the best of disinfectants; electric light the most efficient policeman.”
Theatre: Uses controls to prevent accidents and bad behaviour. This approach is the default for legislators as a way to prove they have taken action in response to a disaster. But controls limit our ability to adapt quickly to unexpected problems, which introduces a new class of risks, for example over-reliance on emergency change processes because the standard change process is too slow and bureaucratic.

Adaptive: Accepts that systems drift into failure. Our systems and the environment are constantly changing, and there will never be sufficient information to make globally rational decisions. Humans solve our problems, and we must rely on them to make judgement calls.
Theatre: Assumes humans are the problem: if people always follow the processes correctly, nothing bad can happen. Controls are put in place to manage “bad apples”, ignoring the fact that process specifications always require interpretation and adaptation in reality.

Adaptive: Rewards people for collaboration, experimentation, and system-level improvements. People collaborate to improve system-level metrics such as lead time and time to restore service (the sketch after this list shows how these might be computed). There are no rewards for “productivity” at the individual or functional level, and it is accepted that locally rational decisions can lead to system-level failures.
Theatre: Rewards people based on personal “productivity” and local optimization, for example operations people optimizing for stability at the expense of throughput, or developers optimizing for velocity at the expense of quality (even though these are false dichotomies).

Adaptive: Creates a culture of continuous learning and experimentation: people openly discuss mistakes in order to learn from them, and conduct blameless post-mortems after outages or customer service problems with the goal of improving the system. People are encouraged to try things out and experiment (with the expectation that many hypotheses will be invalidated) in order to get better.
Theatre: Creates a culture of fear and mistrust, which encourages finger-pointing and a lack of ownership for errors, omissions and failure to get things done. As in: “If I don’t do anything unless someone tells me to, I won’t be held responsible for any resulting failure.”

Adaptive: Failures are a learning opportunity. They occur in controlled circumstances, their effects are appropriately mitigated, and they are encouraged as an opportunity to learn how to improve.
Theatre: Failures are caused by human error (usually a failure to follow some process correctly); the primary response is to find the person responsible and punish them, and then to rely on further controls and processes as the main strategy for preventing future problems.
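
To make the system-level metrics in the rewards entry concrete, here is a small sketch that computes median lead time (check-in to release) and median time to restore service from event records. The record layout, field values, and function name are invented for the example; any real implementation would read this data from version control and incident tooling.

    """Hypothetical system-level metrics: median lead time (check-in to
    release) and median time to restore service. The event records are
    invented for the example."""

    from datetime import datetime, timedelta
    from statistics import median


    def median_duration(intervals):
        """Median of (start, end) intervals, returned as a timedelta."""
        seconds = [(end - start).total_seconds() for start, end in intervals]
        return timedelta(seconds=median(seconds))


    changes = [  # when each change was checked in, and when it was released
        (datetime(2013, 8, 1, 9, 0), datetime(2013, 8, 1, 15, 30)),
        (datetime(2013, 8, 2, 11, 0), datetime(2013, 8, 3, 10, 0)),
    ]
    incidents = [  # when each incident was detected, and when service was restored
        (datetime(2013, 8, 2, 14, 0), datetime(2013, 8, 2, 14, 25)),
    ]

    print("median lead time:", median_duration(changes))
    print("median time to restore:", median_duration(incidents))

The design choice worth noting is that both are end-to-end, system-level measures: no individual or single function can game them, which is exactly why they reward collaboration rather than local optimization.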

Risk management theatre is not just painful and a barrier to the adoption of continuous delivery (and indeed to continuous improvement in general). It is actually dangerous, primarily because it creates a culture of fear and mistrust. As Bogsnes says, “if the entire management model reeks of mistrust and control mechanisms against unwanted behavior, the result might actually be more, not less, of what we try to prevent. The more people are treated as criminals, the more we risk that they will behave as such.”

This kind of organizational culture is a major factor whenever we see people who are scared of losing their jobs, who engage in activities designed to protect themselves in case something goes wrong, or who attempt to make themselves indispensable by hoarding information.

I’m certainly not suggesting that controls, IT governance frameworks, and oversight are bad in and of themselves. Indeed, applied correctly, they are essential for effective risk management. ITIL, for example, allows for a lightweight change management process that is completely compatible with an adaptive approach to risk management. What’s decisive is how these frameworks are implemented: the way they are used and applied is determined by, and in turn perpetuates, organizational culture.