Ian Wilkes Discusses Scaling Second Life
Ian Wilkes, a writer for ArsTechnica.com and apparently a former employee of Linden Lab, has just written a good article on how to approach the design of large systems for applications that scale quickly. While the article contains little discussion of specific technology, it underscores that the team, the development philosophy and the deployment method are all just as important as the technology.
A few years back, I had a similar experience at a company whose application was scaling to ever higher usage. We were adding clients left and right. Initially, the system was designed to run a series of jobs on a job server that compiled data daily for each client. Over the year that I was there, the job stream for each client grew in complexity and the clients grew in number. We were scaling in two directions at once. Small errors that an on-call person had handled once or twice a night quickly turned into all-night affairs. Management didn’t listen to the cries of the developers, the same people who served on-call duty. On-call became a dreaded nightmare because you were expected to work a full week and be on-call for that week as well. 80 to 90 hour weeks ensued, with little to no sleep between pager rings.
It wasn’t until I implemented a simple system that captured the state of the job server nightly that we began to see the full breadth of the problem. All the developers knew we had a growing problem, but we couldn’t quantify it for management, so it was perceived as nothing but complaining for the sake of complaining. We grew from 10,000 jobs to over 25,000 jobs while I was there. The system continued to grow to over 40,000 jobs after I left.
On day one, we were operating at about 99% efficiency each night: 99% of our jobs would finish without failure, and the other 1% required input from an operator or the on-call person. 1% of 10,000 is 100 failed jobs every night. About 95% of those jobs simply needed to be restarted, leaving 5 issues that needed to be solved nightly. Not bad considering the code was decades old, but it didn’t scale. As 10,000 jobs became 20,000 jobs, our total number of failures increased, and then we began to see load issues on the same servers that caused even more failures.
The web application that I built made it painfully obvious just how badly our system was performing, because it charted failure rates on a daily basis and painted 99% or less reliability red, 99.1% to 99.9% yellow and 99.91% or more green. Only when presented with lots of red numbers did management figure out that something needed to be done. By then it was too late, and I suspect that we lost customers during that time because we couldn’t be attentive to customers’ needs while fighting fires in our job system. From what I heard after my time with the company, they eventually achieved reliability in the range of 99.99%, though shortly after that they architected a new version of the platform on a new software stack that wasn’t instrumented by my web application. For the sake of their on-call staff, I hope the new architecture included instrumentation.
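To make the arithmetic and the color bands above concrete, here is a minimal sketch in Python. It is not the original web application; the function names and the 95% restart fraction as a parameter are my own illustration. The quoted bands leave small gaps (between 99% and 99.1%, and between 99.9% and 99.91%), so this sketch assumes anything strictly between the red and green cutoffs is yellow.

```python
def reliability_percent(total_jobs, failed_jobs):
    """Percentage of jobs that finished without failure on a given night."""
    if total_jobs <= 0:
        raise ValueError("total_jobs must be positive")
    return 100.0 * (total_jobs - failed_jobs) / total_jobs


def status_color(percent):
    """Map a nightly reliability percentage to a color band.

    Cutoffs follow the post: 99% or less is red, 99.91% or more is green.
    Assumption: every value in between is yellow.
    """
    if percent <= 99.0:
        return "red"
    if percent >= 99.91:
        return "green"
    return "yellow"


def nightly_summary(total_jobs, failed_jobs, restart_fraction=0.95):
    """Summarize one night: reliability, color, and failures needing real work.

    restart_fraction is the share of failures fixed by a simple restart
    (about 95% in the post); the remainder need investigation.
    """
    percent = reliability_percent(total_jobs, failed_jobs)
    needs_investigation = round(failed_jobs * (1 - restart_fraction))
    return {
        "reliability": percent,
        "color": status_color(percent),
        "needs_investigation": needs_investigation,
    }


if __name__ == "__main__":
    # Same 1% failure rate at two job counts: the percentage is unchanged,
    # but the number of pages the on-call person receives doubles.
    for total in (10_000, 20_000):
        print(total, nightly_summary(total, failed_jobs=total // 100))
```

The point of the sketch is that a constant percentage hides a growing absolute workload: at a steady 1% failure rate the color never changes, yet the number of issues needing a human doubles along with the job count.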
Needless to say, it was the one and only IT job from which I was ever fired. In showing the extent of the wounds to the management of the time, I slit my own throat. That act was political suicide, but I hated working 80 to 90 hour weeks and despised the toll it took on my family. It was made all the worse because management was focused on selling to more clients and didn’t see or care about the problem. If they cared, their actions never showed it, because any mention of improving system reliability was pushed behind expanding the system for new customers. I wasn’t getting any sort of benefit from working the extra hours either. In fact, the company went on to freeze bonuses during this time. I learned more about large systems development from that experience than I have in any other position I’ve been in.
So what does this have to do with Ian’s article? Everything. It touches on understanding the requirements and the need to ask the question, “How will x new feature impact system load?” It touches on the need to focus development around error conditions and on handling errors gracefully. Most importantly, it talks about the need to instrument these large systems so that you can keep tabs on how well they are performing. The company I worked for was a prime example of how a small startup can grow too fast for its own good, and of the incredibly difficult challenge that leading a company through the scaling process presents to its leadership.
2 Comments
-
I found my way over here after reading the Ian Wilkes article about the difficulties inherent in scaling the systems supporting Linden Lab's Second Life virtual world. His article was excellent, and yours is too. I've dealt with processing performance and storage device contention throughout my working days. And I still do!
Operations is often a thankless job. No one notices you unless something goes wrong. Regarding the 80 hour work weeks and pager madness, well, that is just so short-sighted of company management. It is impossible to work effectively and productively in a sustained crisis mode, with no end in sight. It's a shame you got fired because you were the one who finally communicated the extent of your shop's situation to business management. Unfortunately, that happens too, and is also a product of short-term crisis thinking. Actually, someone who stepped forward like you did should've been retained and rewarded for going beyond the scope of your job, and taking action to get things turned around, finally!
-
Thanks for the words of encouragement. I sit here today, some 6 or 7 years later, knowing that the work I started infected more of the people who remained with the company. Later I learned of all sorts of bad things going on with management when there was a clearing of house at my old job. The funny thing about the whole situation is that the web site I put up ran for a year or more after I left. The developers wouldn't let go of it. Management had to engineer a complete move to a new platform to get the web site to stop reporting what was failing and when. Now, they needed to move to a more robust platform regardless, but from the word on the street, management was all too happy to move the processing to a system that wasn't monitored by the web site that got me canned. :)
In the end, I think I did the rest of the group that I worked with a good service, so I sleep easily at night. Call it one of life's lessons. When I think back on it, I don't think I would have done anything differently, even knowing the outcome.
7.10.2010 at 5:40 AM