Wednesday, December 06, 2023


In May 2009, Google hosted an internal "Design Wizardry" panel, with talks by Jeff Dean, Mike Burrows, Paul Haahr, Alfred Spector, Bill Coughran, and myself. Here is a lightly edited transcript of my talk. Some of the details have aged out, but the themes live on, now perhaps more than ever.


Simplicity is better than complexity.

Simpler things are easier to understand, easier to build, easier to debug, and easier to maintain. Easier to understand is the most important, because it leads to the others. Look at the web page for One text box. Type your query, get useful results. That's brilliantly simple design and a major reason for Google's success. Earlier search engines had much more complicated interfaces. Today they have either mimicked ours, or feel really hard to use.

That's But what about what's behind it? What about GWS? How you do you invoke it? I looked at the argument list of a running GWS (Google Web Server) instance. XX,XXX characters of configuration flags. XXX arguments. A few name backend machines. Some configure backends. Some enable or disable properties. Most of them are probably correct. I guarantee some of them are wrong or at least obsolete.

So, here's my question: How can the company that designed be the same company that designed GWS? The answer is that GWS configuration structure was not really designed. It grew organically. Organic growth is not simple; it generates fantastic complexity. Each piece, each change may be simple, but put together the complexity becomes overwhelming.

Complexity is multiplicative. In a system, like Google, that is assembled from components, every time you make one part more complex, some of the added complexity is reflected in the other components. It's complexity runaway.

It's also endemic.

Many years ago, Tom Cargill took a year off from Bell Labs Research to work in development. He joined a group where every subsystem's code was printed in a separate binder and stored on a shelf in each office. Tom discovered that one of those subsystems was almost completely redundant; most of its services were implemented elsewhere. So he spent a few months making it completely redundant. He deleted 15,000 lines of code. When he was done, he removed an entire binder from everybody's shelf. He reduced the complexity of the system. Less code, less to test, less to maintain. His coworkers loved it.

But there was a catch. During his performance review, he learned that management had a metric for productivity: lines of code. Tom had negative productivity. In fact, because he was so successful, his entire group had negative productivity. He returned to Research with his tail between his legs.

And he learned his lesson: complexity is endemic. Simplicity is not rewarded.

You can laugh at that story. We don't do performance review based on lines of code.

But we're actually not far off. Who ever got promoted for deleting Google code? We revel in the code we have. It's huge and complex. New hires struggle to grasp it and we spend enormous resources training and mentoring them so they can cope. We pride ourselves in being able to understand it and in the freedom to change it.

Google is a democracy; the code is there for all to see, to modify, to improve, to add to. But every time you add something, you add complexity. Add a new library, you add complexity. Add a new storage wrapper, you add complexity. Add an option to a subsystem, you complicate the configuration. And when you complicate something central, such as a networking library, you complicate everything.

Complexity just happens and its costs are literally exponential.

On the other hand, simplicity takes work—but it's all up front. Simplicity is very hard to design, but it's easier to build and much easier to maintain. By avoiding complexity, simplicity's benefits are exponential.

Pardon the solipsism but look at the query logging system. It's far from perfect but it was designed to be—and still is—the only system at Google that solves the particular, central problem it was designed to solve. Because it is the only one, it guarantees stability, security, uniformity of use, and all the economies of scale. There is no way Google would be where it is today if every team rolled out its own logging infrastructure.

But the lesson didn't spread. Teams are constantly proposing new storage systems, new workflow managers, new libraries, new infrastructure.

All that duplication and proliferation is far too complex and it is killing us because the complexity is slowing us down.

We have a number of engineering principles at Google. Make code readable. Make things testable. Don't piss off the SREs. Make things fast.

Simplicity has never been on that list. But here's the thing: Simplicity is more important than any of them. Simpler designs are more readable. Simpler code is easier to test. Simpler systems are easier to explain to the SREs, and easier to fix when they fail.

Plus, simpler systems run faster.

Notice I said systems there, not code. Sometimes—not always—to make code fast you need to complicate it; that can be unavoidable. But complex systems are NEVER fast—they have more pieces and their interactions are too poorly understood to make them fast. Complexity generates inefficiency.

Simplicity is even more important than performance. Because of the multiplicative effects of complexity, getting 2% performance improvement by adding 2% complexity—or 1% or maybe even .1%—isn't worth it.

But hold on! What about our Utilization Code Red?

We don't have utilization problems because our systems are too slow. We have utilization problems because our systems are too complex. We don't understand how they perform, individually or together. We don't know how to characterize their interactions.

The app writers don't fully understand the infrastructure.

The infrastructure writers don't fully understand the networks.

Or the apps for that matter. And so on and so on.

To compensate, everyone overprovisions and adds zillions of configuration options and adjustments. That makes everything even harder to understand.

Products manage to launch only by building walls around their products to isolate them from the complexity—which just adds more complexity.

It's a vicious cycle.

So think hard about what you're working on. Can it be simpler? Do you really need that feature? Can you make something better by simplifying, deleting, combining, or sharing? Sit down with the groups you depend on and understand how you can combine forces with them to design a simpler, shared architecture that doesn't involve defending against each other.

Learn about the systems that already exist, and build on them rather than around them. If an existing system doesn't do what you want, maybe the problem is in the design of your system, not that one.

If you do build a new component, make sure it's of general utility. Don't build infrastructure that solves only the problems of your own team.

It's easy to build complexity. In the rush to launch, it's quicker and easier to code than to redesign. But the costs accumulate and you lose in the long run.

The code repository contains 50% more lines of code than it did a year ago. Where will we be in another year? In 5 years?

If we don't bring the complexity under control, one day it won't be a Utilization Code Red. Things will get so complex, so slow, they'll just grind to a halt. That's called a Code Black.

What We Got Right, What We Got Wrong

  This is my closing talk ( video ) from the GopherConAU conference in Sydney, given November 10, 2023, the 14th anniversary of Go being lau...