
One of the fascinating technical advancements of recent years has been in distributed data and distributed data processing. The whole “Big Data” revolution rode in on the back of Hadoop. Hadoop is built on a simple paradigm: combine two operations borrowed from functional programming, map and reduce, into one clean way to heavily parallelize the processing of massive data sets. Granted, the data and the problem have to fit the solution, and not all data and problems do. Nonetheless, if one’s problem space can be fit to the solution shape, the payoff is huge. Newer techniques follow the same broad pattern, and the underlying principle is the same: problems are split into pieces that don’t rely on each other and can thus be run in or out of order, on the same computer or a different one, while still eventually arriving at meaningful results.
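To make the map and reduce idea concrete, here is a toy sketch in plain Python. It is not Hadoop's API, and the names (`map_chunk`, `merge_counts`, `word_count`, `CHUNKS`) are made up for illustration. Each chunk is counted without any knowledge of the others, and the partial results are merged with an operation that does not care about order.

```python
import random
from collections import Counter
from functools import reduce


def map_chunk(chunk):
    """Map step: count the words in one chunk, knowing nothing about the others."""
    return Counter(chunk.lower().split())


def merge_counts(left, right):
    """Reduce step: combine two partial results.

    Adding counters is associative and commutative, so the order in which
    partial results arrive does not change the final answer.
    """
    return left + right


def word_count(chunks, shuffle=False):
    """Count words across all chunks, optionally processing them out of order."""
    chunks = list(chunks)
    if shuffle:
        random.shuffle(chunks)  # simulate pieces finishing in arbitrary order
    # Each map_chunk call is independent, so each could run on a different machine.
    partials = [map_chunk(chunk) for chunk in chunks]
    return reduce(merge_counts, partials, Counter())


# Hypothetical sample data, standing in for a massive data set.
CHUNKS = ["big data big ideas", "data teams", "big teams big data"]

in_order = word_count(CHUNKS)
out_of_order = word_count(CHUNKS, shuffle=True)
print(in_order == out_of_order)
```

The point of the sketch is the shape of the problem rather than the word counting: because no chunk depends on another, the work can be scattered and reassembled freely.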

Data storage has been advancing along largely the same lines. Several years ago, “NoSQL” tools started to crop up, offering technical trade-offs that weren’t available when we relied solely on relational databases for our production data storage. Most interestingly, they allowed for a thing called “eventual consistency”. The dogma throughout my early years as a software geek was that data should always be what it’s supposed to be: completely and utterly consistent, as quickly as possible. This turns out to be unnecessary for large swaths of problems, and so our beliefs, and the tools we build for ourselves, have evolved to match. The general outcome is that if we accept that our database can run on a number of geographically dispersed computers that don’t have to be immediately consistent with each other, storing that data suddenly becomes far less burdensome. We can scale by throwing hardware at the problem, so we don’t have to spend so much expensive thinking time making the data storage as optimal as possible. And now it is common to store more than a petabyte of data.
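For a feel of what eventual consistency looks like, here is a minimal sketch of one well-known technique, a grow-only counter. I am not claiming any particular database works this way, and the class, the replica names, and `demo` are all hypothetical. Two replicas accept writes independently, briefly disagree, and then converge once they exchange state.

```python
class GCounter:
    """A grow-only counter that can be replicated without coordination."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}  # replica id -> increments that replica has accepted

    def increment(self, amount=1):
        """Record a local write; no other replica needs to be contacted."""
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def merge(self, other):
        """Fold in another replica's state.

        Taking the per-replica maximum makes merging safe to repeat and
        safe to apply in any order.
        """
        for replica_id, count in other.counts.items():
            self.counts[replica_id] = max(self.counts.get(replica_id, 0), count)

    def value(self):
        """The counter's value as far as this replica currently knows."""
        return sum(self.counts.values())


def demo():
    """Two hypothetical replicas diverge, then converge after syncing."""
    london, tokyo = GCounter("london"), GCounter("tokyo")
    london.increment(3)
    tokyo.increment(2)
    before = (london.value(), tokyo.value())  # replicas disagree for a while
    london.merge(tokyo)
    tokyo.merge(london)
    after = (london.value(), tokyo.value())  # both now see every write
    return before, after


print(demo())  # → ((3, 2), (5, 5))
```

Notice what the application has to tolerate: for a while, the two replicas give different answers. If that is acceptable, neither one ever has to wait on the other.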

I want to highlight that, just as with Hadoop and distributed processing, the distributed storage problem has to match the solution shape. In this case, eventual consistency has to be OK (and please forgive the radical simplification I’m making to keep this blog post under a petabyte itself).

And this is the big idea – the thing that big data teaches us about team organization:

In order to scale out massively, we have to place particular limits on the shape of the problem we are solving.

Demanding 100% control over our data, so that we always know it is entirely consistent, is comforting in a way. But it demands a lot of the technical structure of the database: you can’t parallelize operations, you can’t do them out of order, and you can’t distribute them. The comfort comes at the price of those restrictions. Conversely, if you shape your application and your business logic so that they are useful regardless of the immediate consistency of the data, you can scale broadly.

On the data processing side, if we serialize processing, depend on the results of previous operations, and share state, we limit ourselves to roughly one thread of processing, with shared, controlled memory to hold state, and the only way to scale is to add more power to that single computer.

In general we give up immediate control for immediate chaos – with an understanding that the answer will emerge from the chaos, because of how we’ve shaped it.

With a computer it’s easy to see why this is true: memory and processing on a single computer, or on a few tightly coordinated computers, have an upper limit. They always will, though that limit will rise slightly over long periods of time. Designing a solution that can scale to an arbitrary number of computers means we are limited only by the number of computers we can get hold of, which is a completely different, and much easier, ceiling to push up.

With humans, it’s even worse. Until we make a breakthrough and figure out how to implant cybernetic interfaces directly into our brains, our capacity is what it has been for thousands of years, and, more discouragingly, it will remain the same for many years to come. We cannot buy a more powerful processor; we cannot add memory. Some models come installed with a trivially larger amount of one or the other or both, but it’s not enough to really make a difference.

To solve large problems, you need more than one person. To solve problems requiring creativity and human ingenuity, humans need autonomy to explore with their intuition. This means that every human being involved in solving the problem will have to saturate their ability to think. No matter how smart the genius, we significantly stunt the effort if we pretend that all the thinking can be serialized through them, even if it’s a couple of geniuses or a small, tightly coordinated team of them. To really achieve anything great in a modern software organization (or rather, in a modern information-work-based organization, which is actually EVERY organization), the thinking has to be fully distributed.

The reason is EXACTLY the same as with distributed computing and distributed storage: the problem size has a limit if we want full control. It doesn’t if we can allow the solution to emerge from the chaos of coordinated but distributed activity.

And just like the distributed examples above, this simply means we have to choose to forge our problems to match the solution shape. Organizationally, this means realizing at every level that the leadership and initiative necessary to drive something to completion (the very thing that drives us so quickly to top-down, command-and-control structures) has to be embedded in every doer. It also means that positional leaders in the organization should be few and should be “Clock Builders”, to use Jim Collins’ brilliant phrase, shaping the system and providing just enough structure so things don’t fly apart.

And when we get down to the software, the teams have to be shaped so that no team depends on another: not for architecture, database programming, testing, or moving something into production. The small software team that works together day to day, fully distributing the innovation and creation that we humans are designed so well for, should have everything it needs to handle the software from cradle to grave, including but not limited to the necessary authority and skill sets, with every team member possessing the leadership and initiative to drive anything through, even if the rest of the organization burned to the ground.

Failure to distribute our creativity and innovation in this way will rightly be viewed as just as backward as it would seem today for a software development team to commit to storing only megabytes of data in a fully consistent relational database.

So let’s get organized and build some amazing software!!