Ian Wilkes, a writer for ArsTechnica.com and apparently a former employee of Linden Lab, has just written a good article on how one should approach large systems design for applications that scale quickly.  While there’s little real technology discussion in the article, it underscores that the team, the development philosophy and the deployment method are all just as important as the technology.

A few years back, I had a similar experience at a company I was working for as its application scaled to heavier usage.  At this company, we were adding clients left and right.  Initially, the system was designed to run a series of jobs on a job server that compiled data daily for the company’s clients.  Over the year that I was there, the job stream for each client grew in complexity and the clients grew in number.  We were scaling in two directions at once.  Small errors that had been handled by an on-call person once or twice a night quickly turned into all-night affairs.  Management didn’t listen to the cries of the developers, the same people who served on-call duty.  On-call became a dreaded nightmare because you were expected to work a full week and be on-call for the week.  80- to 90-hour weeks ensued, with little to no sleep between pager rings.

It wasn’t until I implemented a simple system that captured the state of the job server nightly that we began to see the full breadth of the problem.  All the developers knew we had a growing problem, but we couldn’t quantify it for management, so it was perceived as nothing but complaining for the sake of complaining.  We grew from 10,000 jobs to over 25,000 jobs while I was there.  The system continued to grow to over 40,000 jobs after I left.

On day one, we were operating at about 99% efficiency each night: 99% of our jobs would finish without failure.  The other 1% required input from an operator or the on-call person.  1% of 10,000 is 100 jobs that failed every night.  About 95% of those jobs simply needed to be restarted, leaving 5 issues that needed to be solved nightly.  Not bad considering the code was decades old.  But it didn’t scale.  As 10,000 jobs became 20,000 jobs, we increased our total number of failures and then began to see load issues on the same servers, which caused even more failures.

The web application that I built made it painfully obvious just how badly our system was performing because it charted failure rates on a daily basis and painted 99% or less reliability red, 99.1% – 99.9% yellow and 99.91% or more green.  Only when presented with lots of red numbers did management figure out that something needed to be done.  By then it was too late, and I suspect that we lost customers during that time because we couldn’t be attentive to customers’ needs while fighting fires in our job system.  From what I heard following my time with the company, they eventually achieved reliability in the range of 99.99%, though shortly after that they architected a new version of the platform on a new software stack that wasn’t instrumented by my web application.  For the sake of their on-call staff, I hope the new architecture included instrumentation.
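
To make the arithmetic and the red/yellow/green banding described above concrete, here is a minimal Python sketch.  It is not the original web application, which I no longer have; the function names are hypothetical, and apart from the 10,000-job night with 100 failures, the sample nights are made-up numbers for illustration only.  The bands as I described them leave small gaps (for example, between 99% and 99.1%), so the sketch assumes anything above 99% up to 99.9% is yellow and anything above 99.9% is green.

```python
def reliability_percent(total_jobs: int, failed_jobs: int) -> float:
    """Percentage of the night's jobs that finished without failure."""
    if total_jobs <= 0:
        raise ValueError("total_jobs must be positive")
    if not 0 <= failed_jobs <= total_jobs:
        raise ValueError("failed_jobs must be between 0 and total_jobs")
    return 100.0 * (total_jobs - failed_jobs) / total_jobs


def band(reliability: float) -> str:
    """Map a nightly reliability percentage to a dashboard color.

    Assumed cutoffs: red at 99% or less, yellow above 99% up to 99.9%,
    green above 99.9%.
    """
    if reliability <= 99.0:
        return "red"
    if reliability <= 99.9:
        return "yellow"
    return "green"


def needing_investigation(failed_jobs: int, restarted_jobs: int) -> int:
    """Failures left over once the simple restarts are handled."""
    if not 0 <= restarted_jobs <= failed_jobs:
        raise ValueError("restarted_jobs must be between 0 and failed_jobs")
    return failed_jobs - restarted_jobs


if __name__ == "__main__":
    # (total jobs, failed jobs); only the first row comes from the story above.
    nights = [(10_000, 100), (20_000, 300), (40_000, 4)]
    for total, failed in nights:
        pct = reliability_percent(total, failed)
        print(f"{total:>6} jobs, {failed:>3} failed: {pct:.2f}% -> {band(pct)}")
```

The point of the banding is that a night that sounds fine in conversation ("we were at 99%") still shows up red, which is exactly what finally got management’s attention.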

Needless to say, it was the one and only IT job I ever had where I was fired.  In showing the extent of the wounds to the management of the time, I slit my own throat.  That act was political suicide, but I hated working 80- to 90-hour weeks and despised the toll it took on my family.  It was made all the worse because management was focused on selling to more clients and didn’t see or care about the problem.  If they cared, their actions never showed it, because any mention of improving system reliability was stuffed behind expanding the system for new customers.  I wasn’t getting any sort of benefit from working the extra hours either.  In fact, the company went on to freeze bonuses during this time.  Still, I learned more about large systems development from that experience than I have in any other position I’ve been in.

So what does this have to do with Ian’s article?  It has everything to do with his article.  It touches on understanding the requirements and the need to ask the question, “How will new feature x impact system load?”  It touches on the need to focus development around error conditions and handling errors gracefully.  Most importantly, it talks about the need to instrument these large systems so that you can keep tabs on how well they are performing.  The company I worked for was a prime example of how a small startup can grow too fast for its own good, and of the incredibly difficult challenge its leadership faces in guiding the company through the scaling process.