January 4, 2011 / cohodo

ScaleCamp 2010

On the 10th of December 2010 a couple of us from the Platform Engineering team attended ScaleCamp 2010 at the Guardian offices in London. Very much like its bigger, older (second?) cousin Velocity, ScaleCamp is a gathering of developers, operations folk and other people with an interest in scaling systems to support increasing numbers of data-hungry users in the post-Web 2.0 age. ScaleCamp aims to fill the gap for UK-based peeps who want to get in on the scalability chin-wagging and knowledge-sharing act. Smaller than Velocity or new-kid-on-the-block Surge, ScaleCamp is now in its second year and still small enough to use the unconference format, allowing attendees to self-organise around whatever subjects float their scalability boats.


Pastries & Scaling your team

The day began with an empty timetable with slots for 40-minute sessions across 5 rooms of varying sizes. And some cheeky pastries. By lunchtime the board was pretty much full, with some intriguing sessions on the cards. The first one to tickle my personal fancy was a discussion on how to scale teams. Talk of scaling teams made me remember the phrase “meat cloud”, which still makes me giggle. Like many engineering teams, we pretty much always have more work to do than we can get through, at least for some value of “now”. Adding a good engineer or two (and if you’re a good engineer, we’d love to hear from you) would help us go a little bit faster, and who doesn’t want that? So we’re certainly searching for the mythical “elastic meat cloud”: turn up the dial, add a few more people, and hey presto, you’re a team-scaling guru!

Hmmmm, pastries!

The discussion touched on areas including technical architecture, how to attract and retain good people, and which working practices scale up best in different environments. We pretty much unanimously preferred a modular architecture to a monolithic “big ball of mud”. Loosely coupled components and services make it easier for multiple developers to work concurrently on the same system. An additional benefit is that you don’t need to understand the whole system before you can start to work on part of it, making it easier for new people to contribute earlier.

Good unit and acceptance test suites were also raised as technical factors that can reduce the friction of adding new people to a project. The lurking fear of silently breaking something you don’t yet understand will certainly slow down new hires.

Handily, we managed to avoid any serious dogma wars while discussing process and methodology, although most of the talk was about various forms of agile approach and what size of team they scale to. It was interesting to hear from people who had been using Scrum with teams of around 20 developers, which, judging from their testimony, appears to be pushing the limits a bit. Also discussed was the question of when you need to introduce some form of line management, whether technical, admin-focused or both. How many people can usefully report directly to the same person? At what point does this start to become unworkable?

File Systems are shiny too!

Next up was a man standing in front of a room full of techies and inviting them to pull his system architecture to pieces. In a nice way. Richard Jones is building a browser-based IRC client that maintains user sessions even when the browser is closed. Richard outlined the requirements and characteristics of his app: append only (no edits), no joining between users, no search, lets users download logs and page back to see chat they missed, and so on. His goal was to get some ideas to help him scale the app, which he expected might entail replacing the PostgreSQL back-end with something else.

The architecture currently uses table inheritance in Postgres to achieve horizontal partitioning. There is one RDBMS table per day’s worth of data, so the data is effectively sharded by day. This makes deletes cheap: expiring a day’s data is a SQL “DROP TABLE” rather than a “DELETE FROM”.
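The pattern is easy to sketch. The snippet below (Python, with invented table and column names; Richard’s actual schema wasn’t shared) generates the kind of DDL this day-sharding scheme implies: each day gets its own child table via Postgres table inheritance, and expiry is a single DROP TABLE.

```python
from datetime import date, timedelta

def table_name(day):
    # Hypothetical naming scheme: one child table per day
    return f"chatlog_{day:%Y%m%d}"

def create_shard_sql(day):
    # The child table inherits from a parent "chatlog" table; a CHECK
    # constraint pins it to a single day's rows.
    next_day = day + timedelta(days=1)
    return (
        f"CREATE TABLE {table_name(day)} "
        f"(CHECK (logged_at >= DATE '{day}' AND logged_at < DATE '{next_day}')) "
        f"INHERITS (chatlog);"
    )

def drop_shard_sql(day):
    # Expiring a whole day is a cheap DROP TABLE, not a row-by-row DELETE
    return f"DROP TABLE {table_name(day)};"

print(create_shard_sql(date(2010, 12, 10)))
print(drop_shard_sql(date(2010, 12, 10)))
```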


A brief discussion of various sharding strategies took place. The well-documented foursquare outage was mentioned to illustrate the potential pitfalls of sharding randomly on user name; this can lead to hotspots in the cluster that can be tricky to manage. There was a certain irony in the fact that I was expecting this discussion to focus on one or more of the shiny new NoSQL databases as a replacement for Postgres, but ultimately it took a turn towards solutions that used good old file systems to manage data storage. Clearly there is shiny new work in the file system space too, but I suppose the takeaway here is to use whatever tool does the specific job you need, shiny or otherwise.
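To see why sharding directly on a skewed key creates hotspots, here is a small, self-contained illustration (Python; the user names and shard count are made up): real-world names cluster, so a naive mapping piles keys onto a few shards, while hashing the key first spreads them evenly.

```python
import hashlib
from collections import Counter

def shard_naive(user, n):
    # Shard by first letter: name distributions are skewed, so a few
    # shards run hot (the foursquare-style pitfall)
    return ord(user[0].lower()) % n

def shard_hashed(user, n):
    # Hash first: spreads keys evenly across shards, at the cost of
    # losing any locality between related keys
    return int(hashlib.md5(user.encode()).hexdigest(), 16) % n

users = [f"john{i}" for i in range(500)] + [f"zoe{i}" for i in range(20)]
print(Counter(shard_naive(u, 4) for u in users))   # one shard takes everything
print(Counter(shard_hashed(u, 4) for u in users))  # roughly even split
```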

Analysing droppings using Hadoop

Matt Biddulph of Nokia hosted a session where he outlined work he has been doing to analyse massive datasets about cities. Matt described the process of collecting log files from assorted Nokia applications and analysing them as “inspecting their droppings”. Using these “droppings”, Matt has been able to do things like produce heat maps that visualise which map locations people inspect most regularly on their phones. In general terms, the approach he has used for this is to analyse these massive datasets in Hadoop, then take the resulting, much smaller data and load that into an RDBMS for querying. This approach seems to be the most popular one right now for finding interesting relationships and patterns in big data, although we were all hoping somebody in the room had been doing something different and funky we could learn about, analysing massive data in a more online fashion. Maybe next year.
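As a toy illustration of that aggregate-then-load pattern (a plain Python reduction standing in for the Hadoop step, with invented log lines): boil the raw “droppings” down to small summary rows, then push only those into an RDBMS for querying.

```python
import sqlite3
from collections import Counter

# Invented stand-in for raw application logs of map-tile views
raw_logs = [
    "user1 viewed tile:soho",
    "user2 viewed tile:soho",
    "user3 viewed tile:camden",
]

# "Hadoop" step: reduce the big raw dataset to per-location counts
counts = Counter(line.rsplit(":", 1)[1] for line in raw_logs)

# Load the much smaller aggregate into an RDBMS for ad-hoc querying
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE tile_views (tile TEXT PRIMARY KEY, views INTEGER)")
db.executemany("INSERT INTO tile_views VALUES (?, ?)", counts.items())
top = db.execute(
    "SELECT tile FROM tile_views ORDER BY views DESC LIMIT 1"
).fetchone()[0]
print(top)  # the heat-map hotspot
```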

Eventually Matt wants to be able to use Hadoop to calculate various types of ground truth offline, for example the “normal” number of active Nokia devices in the Notting Hill area. Comparing streaming data against these ground truths could then highlight interesting patterns, for example how much busier various locations in Notting Hill are during carnival weekend. The possibilities of the streaming data could extend even further, for example to answering questions like “Which bars in the area are currently too crowded to bother going to, and which are worth a visit?”. Now that’s an app I’d snap up from the Android Market without a second thought.
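A sketch of what that baseline comparison might look like, with every number invented for illustration:

```python
# Offline "ground truth" baselines come from the Hadoop step; live counts
# come from the streaming side. All figures here are made up.

def busyness(live, baseline):
    # How many times the normal level of activity we're seeing
    return live / baseline

def worth_a_visit(live, baseline, max_ratio=2.0):
    # A bar is worth visiting if it's no more than max_ratio times
    # its normal crowd
    return busyness(live, baseline) <= max_ratio

print(busyness(9600, 1200))  # hypothetical carnival-weekend Notting Hill

bars = {"The Crown": (40, 30), "The Anchor": (95, 25)}  # name: (live, baseline)
print([b for b, (live, base) in bars.items() if worth_a_visit(live, base)])
```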

Gentlemen, let’s broaden our minds

As a developer who has spent most of his career working on various back-end applications, I enjoyed attending a couple of sessions that covered subject matter outside my usual domain. Firstly, Spike Morelli described an approach to managing the configuration of a cluster of several thousand nodes by using a config management tool to roll out only entire system images. The QA department apparently loved this, because the release as rolled out was exactly the same as the thing they signed off after testing.

Secondly, Premasagar Rose hosted a session on design patterns for JavaScript performance. Topics covered included jQuery tips, caching data in the browser as JSON values, and making as few DOM calls as possible. Some interesting tools were mentioned, including Web Inspector.

Fail at failing

I also enjoyed Andrew Betts’ session on handling errors at scale. Although the session was initially PHP-focused, there was a lot of general wisdom covered in the discussion. People compared notes on logging strategies, monitoring tools, and assorted low-level nitty-gritty. One such hard-won nugget was the value of assigning a unique ID to each request in a distributed system so you can follow it as it moves from one component to the next. We have learned this the hard way here at Talis while attempting to trace SPARQL queries from the Platform web servers through to the RDF stores at the back-end. The “X-TALIS-RESPONSE-ID” header you see in your HTTP response to a SPARQL query is a unique identifier that enables us to see what went on with an individual request all the way through the Platform’s stack. Big Brother sees all, innit?
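A minimal sketch of the request-ID idea in Python. The plumbing and names here are illustrative, not Talis’s actual implementation, though the response header name mirrors the one mentioned above: mint an ID at the edge (or reuse one passed in), stamp it on every log line, hand it to downstream components, and echo it back in the response.

```python
import uuid

def handle_request(environ, downstream):
    # Reuse an incoming ID if one was supplied, otherwise mint a fresh one
    request_id = environ.get("HTTP_X_REQUEST_ID") or uuid.uuid4().hex

    # Every log line carries the ID, so grepping for it reconstructs
    # the request's path through the stack
    print(f"[{request_id}] query received")
    body = downstream(request_id)  # downstream components log the same ID
    print(f"[{request_id}] response sent")

    # Echo the ID back to the client for support/debugging
    return {"X-TALIS-RESPONSE-ID": request_id}, body

headers, body = handle_request({}, lambda rid: f"results for {rid}")
```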

That’s all very well, but when do I get the X-Ray glasses & exploding cigars?

ScaleCamp organiser Michael Brunton-Spall, who deserves enormous credit for his creation, hosted a session at the tail end of the day. Michael introduced an approach used by the tech team at the Guardian to analyse a technical crisis after the event. Analysis of Competing Hypotheses is a technique formulated by the CIA in the 1970s to help identify a wide set of hypotheses and provide a means of evaluating each when looking for explanations of complex problems. Interestingly, there is an open source project providing software to help you do this. The CIA and open source – strange bedfellows indeed, no? Whatever next, the FBI opening a sustainable hemp farm?

A spy

To illustrate the process, Michael used a real example from the Guardian so fresh it was still warm. A week or so before ScaleCamp, the Guardian’s website had slowed to a crawl just before a scheduled live Q&A with WikiLeaks’ Julian Assange. We were asked to shout out possible causes, e.g. “Denial of service attack”, “Too many comments on a page”, and so on. Then we attempted to think of what evidence would prove or disprove each: a lightweight version of the full CIA methodology. Our own root cause analysis usually incorporates the 5 whys, but ACH looks like another useful tool to have at our disposal. Plus, we get to pretend we’re spies, although we’ll probably stop just short of the waterboarding.
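The lightweight ACH exercise can be sketched in a few lines of code. The hypotheses, evidence and scoring below are illustrative, not the Guardian’s actual analysis: each piece of evidence is marked consistent (+1), inconsistent (-1) or neutral (0) against each hypothesis, and ACH’s focus on disproof means the hypothesis with the fewest inconsistencies wins.

```python
# Hypothetical ACH matrix for a site slowdown. Evidence rows and scores
# are invented for illustration.
hypotheses = ["Denial of service attack", "Too many comments on a page"]

# evidence -> {hypothesis: +1 consistent, -1 inconsistent, 0 neutral}
matrix = {
    "Traffic volume normal for time of day": {
        "Denial of service attack": -1,
        "Too many comments on a page": 0,
    },
    "One article page far slower than the rest": {
        "Denial of service attack": -1,
        "Too many comments on a page": +1,
    },
}

def rank(hypotheses, matrix):
    # ACH ranks by counting inconsistencies; fewest inconsistencies first
    inconsistencies = {
        h: sum(1 for ev in matrix.values() if ev[h] == -1) for h in hypotheses
    }
    return sorted(hypotheses, key=lambda h: inconsistencies[h])

print(rank(hypotheses, matrix))
```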

One Comment

  1. Michael Fitzmaurice / Jan 10 2011 1:28 pm

    Murray Rowan of Yahoo Developer Network also blogged about ScaleCamp.
