Argentina was plunged into darkness that affected 44 million people Sunday. And Target says its stores are operating normally after glitches prevented customers from paying with some credit cards over the weekend.
These are some of the latest examples of complex systems failing catastrophically, which has become a fixture of modern life. But David Woods, a professor in integrated systems engineering at Ohio State University, says the news media often misses the point when it comes to the reason behind these failures.
“People … in the news are missing the way complex systems fail because they think of this as having a single root cause,” Woods tells Here & Now’s Jeremy Hobson. “Often they think of it as a system that works all the time, and then occasionally someone screwed it up — because they’re looking for someone to blame. But complexity … derives from the success we have at building systems at new scales.”
In fact, Woods says small problems happen more often than we might notice; people just step in to fix them before a minor issue turns into a major failure.
“So whether it’s the IT services and infrastructure behind Target or the power systems, actually small problems are happening all the time,” he says. “The reason we don’t see big outages is because people succeed. People are the key source of resilience.”
Interview Highlights
On what causes complex systems to fail so spectacularly
“Our world today is full of complex systems, and this has been a trend that’s been going on and increasing over the last 40 years. We’re just noticing it now in all kinds of service areas that used to be simpler.
“So the Target case, for example, is [an] integrated system of many dynamic processes that are highly automated in order to provide these digital services. And even though it’s a retail store, it’s now really [an] e-enterprise store. And in all of these cases, complexity means there’s not a single failure. In all of these cases, a few small things go wrong, build one upon another and create a significant outage. They cascade. And this happened in the energy system in South America. We’re not quite sure what led to the main transmission line failure, but typically, it’s several small things that quickly cascaded. The normal mechanisms that block the cascade somehow didn’t come into play.”
On how people step in to stop most of these small problems
“They recognize emerging anomalies. They recruit other people with other forms of expertise to come into this situation. They start taking actions to mitigate the spread of the disturbances and block the, at least initial, cascade. So what we find is that people are the hidden source of resilience that make these systems work much more often than we would expect, given how brittle they are.”
On the idea that outages are caused by a security breach
“Usually, actually it’s the other way around. Rather than jump to the idea that there was an adversary or some deliberate act, we find all the time that complex systems have this signature that these small failures build up and combine to create a much larger consequence. In fact, that really is the signature of a complex world. That we see signs of improvement — new capabilities, new technologies — that allows us to work at bigger and bigger scales. And so the research and the formal analysis keeps revealing that these are natural parts of complexity penalties that come with the success. And what we’ve been working on with companies is what do you do to outmaneuver the complexity? You can’t retreat into simplicity. You give up too much competitive advantage. But you can outmaneuver the complexity, and the way you do that is by being poised to adapt.”
On how to prevent system failures as they become more complex
“Well, what we find in organizations that have performed poorly in these challenge cases is they’ve actually undermined the ability of the key people in the system to provide this resilient performance. And they did that inadvertently. Why? Because they never recognized that people were an essential ingredient. Now in today’s technology-driven world, there’s another angle on your question, which is are people the only way to get this kind of resilient performance? My position is the people building more automation and artificial intelligence don’t try. They don’t understand the special capabilities that people bring. And so they don’t build it into their machines, and they don’t consider it as part of how they build artificial intelligence algorithms. So we might be able to rebuild these, and we’re trying to do that with companies like Target in our industry consortia.”
Chris Bentley produced and edited this interview for broadcast with Tinku Ray. Samantha Raphelson adapted it for the web.
This segment aired on June 18, 2019.
https://www.wbur.org/hereandnow/2019/06/18/why-do-systems-fail