Often, what seems like a simple change, or several simple changes, can quickly create complexity, making it hard to determine the actual risk of delivering many small changes at once or in quick succession.
When do small, fast changes present a bigger risk than a large release? (Especially if your infrastructure isn’t optimized for small, fast changes.)
Agile (and that’s with a capital A), Lean, and eXtreme Programming principles advocate for small and fast changes. Teams and businesses often grasp that concept and run with it, yet forget that to make it a prosperous reality, you have to build out a lot of infrastructure to support those quick changes.
I like small, fast development cycles myself. Easy to see exactly what changed. Easy to pinpoint a problem if it happens, provided the infrastructure is in place to let you do that; otherwise, a small change can’t really be measured correctly.
Let’s work through an example of a smallish change made without that infrastructure taken into account:
I have a story where I have a profile template, and customers have asked for better-quality images in their profiles. It seems like a really simple change. Instead of limiting customers to a certain image type (like JPEG), we change the profile to accept more types. We also change the resolution requirements and size up the div to show a larger image.
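In code, the whole change might amount to something like this sketch; the names, formats, and limits here, such as `ALLOWED_IMAGE_TYPES` and `MAX_DIMENSIONS`, are hypothetical stand-ins for whatever the real validation looks like:

```typescript
// Hypothetical upload validation for the profile image. Before the change it
// might have read: only "image/jpeg", capped at 800x800.
const ALLOWED_IMAGE_TYPES = ["image/jpeg", "image/png", "image/webp", "image/gif"];
const MAX_DIMENSIONS = { width: 2400, height: 2400 };

// Accept the file if its MIME type is allowed and it fits within the new,
// more generous resolution ceiling.
function isAcceptedProfileImage(file: File, width: number, height: number): boolean {
  return (
    ALLOWED_IMAGE_TYPES.includes(file.type) &&
    width <= MAX_DIMENSIONS.width &&
    height <= MAX_DIMENSIONS.height
  );
}
```

A few constants, one function, and a wider div. On its own, nothing here looks risky.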
These all seem like simple changes, mostly UI-related. We make the code updates and push them out. Customers are notified, and nothing much happens in the first few weeks.
THEN
Profile pages start to slow down, especially when larger image files with higher resolutions are loaded, all within the allowed parameters, yet it isn’t detected until customers report it. (Lack of early detection/analytics, lack of A/B testing in prod, lack of measured rollout infrastructure to manage user updates.)
Infrastructure is now struggling to keep up with storage for the larger images. The only warning is the cloud provider’s data and storage usage tripling from the previous months. (Lack of observability or early logging. Lack of impact analysis or cost analysis around cloud storage.)
A lag develops in the search display function: the profile images returned in the search list take too long to load because their size and resolution weren’t accounted for when search was built. Some display only a white box, with no indication of what happened to the image. (Lack of error handling, unknown or hidden logic/inheritance.)
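As a concrete illustration of the kind of guardrail that was missing, here is a minimal sketch of a search thumbnail with a load-time metric and an error fallback; `logMetric` and the placeholder image path are hypothetical stand-ins for whatever observability pipeline and assets actually exist:

```typescript
// Stand-in for whatever analytics/observability pipeline is in place.
function logMetric(name: string, value: number): void {
  console.log(`[metric] ${name}=${value}`);
}

// Render a search-result thumbnail with a load-time metric and an error
// fallback, so slowdowns show up in dashboards and failures never leave a
// blank white box.
function renderThumbnail(container: HTMLElement, src: string): void {
  const img = new Image();
  const startedAt = performance.now();

  img.onload = () => {
    logMetric("profile_thumbnail_load_ms", performance.now() - startedAt);
  };

  img.onerror = () => {
    // Clear the handlers so a failing placeholder can't loop, then show it
    // instead of an empty box and record that something went wrong.
    img.onload = null;
    img.onerror = null;
    img.src = "/images/profile-placeholder.png"; // hypothetical asset
    logMetric("profile_thumbnail_error", 1);
  };

  img.src = src;
  container.appendChild(img);
}
```

None of this is exotic; the point is that without something like it, the effect of a “small” change is invisible until a customer complains.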
Granted, these are hypotheticals, and in many cases, if you have a team focused on quality and communication, they would likely bring up some of these examples at the beginning of the project.
It’s the iceberg/onion problem. It looks pretty simple until you pull things apart or look below the surface. And that’s just a few small changes, three to be exact: image type, size, and resolution.
What if the changes were bigger? Changes like creating multiple profile types, monetizing the new profile types, and providing search preferences based on the profile types that didn’t previously exist.
Each one of those has its own complexities, but like the example above, they could be mitigated. However, these three small-to-medium changes to the profile, the search function, and data transmission could have hidden issues as well.
How do you get at the risks when overlapping systems, some of which you might not even have information about, could present problems?
I’m reminded of a model I learned early in my career, something worth doing when discussions of product or feature development begin.
The Rumsfeld Matrix. Introduced in 2002 and originally applied to security and intelligence gathering, the matrix was later adapted for software development. It still holds up today as a way to identify and mitigate risks in projects.
You can order these however you prefer, but generally the matrix reads like this:
Known Knowns: Anything that is currently agreed to and understood by the collective.
Known Unknowns: Anything the collective is aware of but does not understand.
Unknown Knowns: Anything the collective would understand if encountered, but didn’t predict or identify immediately.
Unknown Unknowns: Anything the collective isn’t aware of and wouldn’t understand.
Applying this to a project becomes a quick whiteboard session with a group.
Here’s an example of what a matrix session might look like:
Known Knowns: The features we’re building out.
Known Unknowns: How the work will be deployed, how it will go live, how it will be marketed.
Unknown Knowns: Events that have happened on other projects and already have processes, but may or may not affect this one.
Unknown Unknowns: Critical events, unique and unexpected, with no process yet.
You’ll note that we know what features we’re building out. We have questions about how the work will be deployed, how it will go live, and how it will be marketed; that is, we know what we need to do to turn these into known knowns. The unknown knowns are things that have happened with other projects, and we don’t know if they will affect this project, yet we have processes for these events if and when they are necessary.
And in the unknown unknowns field I added critical events. These are unique, unexpected, never-seen-before things that have no process and have never happened, or have never been detected, in a project before. They can become known knowns through a root cause analysis (RCA), and on the next project the event becomes an unknown known if the RCA produces a process to follow should it happen again.
This is a very high-level version of how someone could break down a project. Ideally, we’d want to take each feature or requirement and create a matrix for it.
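Even a lightweight record of the session helps the whiteboard outlive the meeting. Here’s a sketch of one way to capture a per-feature matrix; the shape and the sample entries (drawn from the profile-image example above) are illustrative, not a prescribed format:

```typescript
// One matrix per feature or requirement. The quadrant definitions mirror the
// list above; the entries are examples, not an exhaustive risk register.
interface RiskMatrix {
  knownKnowns: string[];     // agreed to and understood by the collective
  knownUnknowns: string[];   // known about, but not yet understood
  unknownKnowns: string[];   // would be recognized if encountered, not predicted up front
  unknownUnknowns: string[]; // critical events with no process yet
}

const profileImageUpgrade: RiskMatrix = {
  knownKnowns: [
    "Accept more image types",
    "Raise the resolution limit",
    "Enlarge the profile image display",
  ],
  knownUnknowns: [
    "Impact on search result load times",
    "Cloud storage cost at higher resolutions",
  ],
  unknownKnowns: [
    "Incidents from past projects that already have runbooks",
  ],
  unknownUnknowns: [], // filled in only after an RCA turns a critical event into a process
};
```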
For those of us focused on quality, a lot of this happens in our heads. Sometimes we ask the necessary questions, and sometimes we prioritize just the knowns, relying on the hope that someone else has addressed the known unknowns and the unknown knowns.
If only for peace of mind, especially when there are a lot of small things happening fairly quickly, it would be good to do this exercise with a team to make sure everyone agrees on what’s known and what’s not. Actually, it’s not a bad exercise to do by yourself, just to see what you know about the project you’re working on.
If you’re currently using a method like this or something similar, feel free to share and comment below!