One idea behind building large systems is that, assuming no one is maliciously trying to make things break, any system failures don’t fall entirely on one person’s head. I’ve written about this briefly in Making Computers Feel Safe:
Humans are fallible, and the lesson to learn [from when things fail] is not that we just need to try harder [or attempt to “hold someone accountable”, neither of which actually solve the problem], but rather look at how we put someone in a position to make a mistake like this with no guard rails.
Anyways, recently HBO Max accidentally sent this email out to a fairly large fraction of their mailing list. It looked something like this:
Most people, when they came across the strange email, wondered if it was some sort of scam attempt or decided it was just an accident. It’s a fairly inconsequential mistake in the grand scheme of things: no one had private information leaked or was misled about anything substantial. Something to chuckle about, maybe, if it accidentally got sent to a few people.
Well, actually, as it turns out, this email got sent out to several million people. Oops.
Naturally, with something of this scale, the internet jumped on it immediately and Twitter was flooded with comments about it. In response, HBO tweeted this out:
Then the replies started pouring in:
That’s the key. The intern’s mistake isn’t the only thing worth looking at here; so is the environment they were working in, which let one person email millions of people with no safeguard to stop them. It’s never just one thing or one person, and pinning the failure on an individual doesn’t improve the system. If you go read the discussion on the topic, you’ll hear story after story saying the same thing.
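To make "guard rails" a little more concrete, here is a minimal sketch in Python of the kind of check that could sit in front of a bulk send. Everything in it is hypothetical: the `Campaign` type, the `check_send` function, and both thresholds are invented for illustration and say nothing about how HBO's actual tooling works.

```python
from dataclasses import dataclass, field

# Hypothetical limits, chosen only for illustration.
APPROVAL_THRESHOLD = 1_000   # sends larger than this need a second person
MAX_TEST_RECIPIENTS = 10     # test emails should stay on a tiny internal list


class SendBlocked(Exception):
    """Raised when a guard rail refuses to let a send go out."""


@dataclass
class Campaign:
    name: str
    recipients: list[str]
    author: str
    is_test: bool = False
    approved_by: set[str] = field(default_factory=set)


def check_send(campaign: Campaign) -> None:
    """Raise SendBlocked if the campaign should not go out as-is."""
    count = len(campaign.recipients)

    # A test template aimed at a large list is almost certainly a mistake.
    if campaign.is_test and count > MAX_TEST_RECIPIENTS:
        raise SendBlocked(f"test campaign targets {count} recipients")

    # Large sends need sign-off from someone other than the author.
    reviewers = campaign.approved_by - {campaign.author}
    if count > APPROVAL_THRESHOLD and not reviewers:
        raise SendBlocked(f"{count} recipients requires a second approver")
```

The design choice is that the check blocks the action rather than the person: a single mistake now needs a second mistake, by someone else, before it reaches anyone's inbox.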
What gets mentioned less often, but also has important implications, is the role that safety plays in how we live. Most of the time we think of safety as preventing or avoiding failures, but read enough of these stories and you notice that many of the people telling them came back better and more confident in what they were doing. So many slip-ups left very little, if any, real long-term damage.
The point of safety is to enable risk-taking and more ambitious endeavors. Seat belts and airbags in cars don’t just protect the people inside; they let us drive with confidence faster than a snail’s pace. When you’re engineering or operating complex systems, your ability to change things depends on how safe it is to change things. One of the big advantages of working with computers is that so many small actions are easily reversible. Mistakes happen to the best of us. It’ll be ok! All it means is that our work is not yet done.