January 4, 2011 / cohodo

ScaleCamp 2010

On the 10th of December 2010 a couple of us from the Platform Engineering team attended ScaleCamp 2010 at the Guardian offices in London. Very much like its bigger, older (second?) cousin Velocity, ScaleCamp is a gathering of developers, operations folk and other people with an interest in scaling systems to support increasing numbers of data-hungry users in the post-Web 2.0 age. ScaleCamp aims to fill the gap for UK-based peeps who want to get in on the scalability chin-wagging and knowledge-sharing act. Smaller than Velocity or new-kid-on-the-block Surge, ScaleCamp is now in its second year and still small enough to use the unconference format, allowing attendees to self-organise around whatever subjects float their scalability boats.


Pastries & Scaling your team

The day began with an empty timetable with slots for 40-minute sessions across 5 rooms of varying sizes. And some cheeky pastries. By lunchtime the board was pretty much full, with some intriguing sessions on the cards. First one to tickle my personal fancy was a discussion on how to scale teams. Talk of scaling teams made me remember the phrase “meat cloud”, which still makes me giggle. Like many engineering teams, we pretty much always have more work to do than we can get through “now”, for some value of “now”. Adding a good engineer or two (and if you’re a good engineer, we’d love to hear from you) would help us to go a little bit faster, and who doesn’t want that? So we’re certainly searching for the mythical “elastic meat cloud”; turn up the dial, add a few more people, and hey presto, you’re a team scaling guru!

Hmmmm, pastries!

The discussion touched on areas including technical architecture, how to attract and retain good people, and which working practices scale up best in different environments. We pretty much unanimously preferred a modular architecture to a monolithic “big ball of mud”. Loosely coupled components and services make it easier for multiple developers to work concurrently on the same system. An additional benefit is that you don’t need to understand the whole system before you can start to work on part of it, making it easier for new people to contribute earlier.

Good unit and acceptance test suites were also raised as technical concerns that can reduce the friction of adding new people to a project. The lurking fear of silently breaking something you don’t yet understand will certainly slow down new hires.

Handily, we managed to avoid any serious dogma wars while discussing process and methodology, although most of the talk was about various forms of agile approach and what size of team they scale to. Interesting to hear the experiences of people who had been using Scrum with teams of around 20 developers, which appears to be pushing the limits a bit, judging from their testimony. Also discussed was the question of when you need to start some form of line management, whether technical, admin-focused or both. How many people can usefully report directly to the same person? At what point does this start to become unworkable?

File Systems are shiny too!

Next up was a man standing in front of a room full of techies and inviting them to pull his system architecture to pieces. In a nice way. Richard Jones is building a browser-based IRC client that maintains user sessions even when the browser is closed. Richard outlined the requirements and characteristics of his app: append-only (no edits), no joining between users, no search, allows users to download logs, page back to see chat they missed, and so on. His goal was to get some ideas to help him scale the app, which he expected might entail replacing the PostgreSQL back-end with something else.

The architecture currently uses table inheritance in Postgres to achieve horizontal partitioning. There is one RDBMS table per day’s worth of data, so the data is basically sharded by day. This allows cheap deletes via SQL “DROP TABLE”, as opposed to “DELETE FROM”.
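To make this concrete, a day-sharded store can route writes and expiry by date alone. The sketch below is purely illustrative (the table naming scheme and class are my own invention, not Richard’s actual schema):

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class ShardRouter {
    private static final DateTimeFormatter FMT = DateTimeFormatter.ofPattern("yyyyMMdd");

    // Name of the child table holding one day's worth of chat lines.
    static String tableFor(LocalDate day) {
        return "chatlog_" + day.format(FMT);
    }

    // Expiring a whole day is a single O(1) DROP TABLE,
    // versus a row-by-row DELETE FROM scan.
    static String expireSql(LocalDate day) {
        return "DROP TABLE " + tableFor(day) + ";";
    }

    public static void main(String[] args) {
        LocalDate day = LocalDate.of(2010, 12, 10);
        System.out.println(tableFor(day));   // chatlog_20101210
        System.out.println(expireSql(day));  // DROP TABLE chatlog_20101210;
    }
}
```

Because each day lives in its own table, expiry cost stays constant no matter how many rows a day holds.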


A brief discussion of various sharding strategies took place. The well documented foursquare outage was mentioned to illustrate the potential pitfalls of sharding randomly on user name; this can lead to hotspots in the cluster that can be tricky to manage. There was a certain irony in the fact that I was expecting this discussion to focus on one or more of the shiny new NoSQL databases as a replacement for Postgres, but ultimately it took a turn towards solutions that used good old file systems to manage data storage. Clearly we can also find shiny new work in the file system space too, but I suppose the takeaway here is to use whatever tool does the specific job you need, shiny or otherwise.
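To make the hotspot pitfall concrete, here is a small illustrative sketch (the user names and data sizes are invented): even when keys hash evenly across shards, skewed per-user data volumes can still leave one shard far hotter than the rest.

```java
import java.util.HashMap;
import java.util.Map;

public class ShardBalance {
    // Bytes stored per shard when users are placed by hashing their names.
    static Map<Integer, Long> bytesPerShard(Map<String, Long> userBytes, int shards) {
        Map<Integer, Long> load = new HashMap<>();
        for (Map.Entry<String, Long> e : userBytes.entrySet()) {
            // floorMod guards against negative hashCode values
            int shard = Math.floorMod(e.getKey().hashCode(), shards);
            load.merge(shard, e.getValue(), Long::sum);
        }
        return load;
    }

    public static void main(String[] args) {
        Map<String, Long> users = new HashMap<>();
        users.put("alice", 10L);
        users.put("bob", 12L);
        users.put("carol", 9L);
        users.put("celebrity", 100_000L); // one very heavy user
        // The keys hash evenly, but whichever shard holds "celebrity"
        // becomes a hotspot the hash function can do nothing about.
        System.out.println(bytesPerShard(users, 3));
    }
}
```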

Analysing droppings using Hadoop

Matt Biddulph of Nokia hosted a session where he outlined work he has been doing to analyse massive datasets about cities. Matt described the process of collecting log files from assorted Nokia applications and analysing them as “inspecting their droppings”. Using these “droppings”, Matt has been able to do things like produce heat maps that visualise which map locations people inspect most regularly on their phones. In general terms, the approach he has used for this is to analyse these massive datasets in Hadoop, then take the resulting, much smaller data and load that into an RDBMS for querying. This approach seems to be the most popular one right now for finding interesting relationships and patterns in big data, although we were all hoping somebody in the room had been doing something different and funky we could learn about, analysing massive data in a more online fashion. Maybe next year.

Eventually Matt wants to be able to use Hadoop to calculate various types of ground truths offline, for example the “normal” number of active Nokia devices in the Notting Hill area. A comparison of streaming data against these ground truths could then highlight interesting patterns, for example how much busier are various locations in Notting Hill during carnival weekend? The possibilities of using the streaming data could extend even further, for example to answer questions like “Which bars in the area are currently too crowded to bother going to, and which are worth a visit?”. Now that’s an app I’d snap up from the Android market place without a second thought.
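The comparison against an offline baseline could be as simple as a ratio; a hedged sketch with made-up numbers (none of these figures come from Matt’s talk):

```java
public class GroundTruth {
    // Hypothetical baseline: the "normal" count of active devices for an
    // area, precomputed offline (e.g. by a Hadoop job over historical logs).
    static double busynessRatio(long observed, long baseline) {
        if (baseline <= 0) throw new IllegalArgumentException("baseline must be positive");
        return (double) observed / baseline;
    }

    public static void main(String[] args) {
        long baseline = 1200;   // normal active devices in the area (invented)
        long carnival = 5400;   // observed during carnival weekend (invented)
        System.out.printf("%.1fx busier than normal%n",
                          busynessRatio(carnival, baseline));
    }
}
```

Comparing a live stream against a cheap precomputed number like this is what lets the expensive analysis stay offline.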

Gentlemen, let’s broaden our minds

As a developer who has spent most of his career working on various back-end applications, I enjoyed attending a couple of sessions that covered subject matter outside my usual domain. Firstly, Spike Morelli described an approach to managing a cluster of several thousand nodes by using a config management tool to roll out only entire system images, rather than incremental changes. The QA department apparently loved this, because the release as rolled out was exactly the same as the thing they signed off after testing.

Secondly, Premasagar Rose hosted a session on design patterns for JavaScript performance. Topics covered included jQuery tips, caching data in the browser as JSON values, and making as few DOM calls as possible. A couple of interesting tools were mentioned, among them Web Inspector.

Fail at failing

I also enjoyed Andrew Betts’ session on handling errors at scale. Although initially PHP-focused, there was a lot of general wisdom covered in the discussion. People compared notes on logging strategies, monitoring tools, and assorted low-level nitty-gritty. One such hard-won nugget was the value of assigning a unique ID to each request in a distributed system so you can follow it as it moves from one component to the next. We have learned this the hard way here at Talis while attempting to trace SPARQL queries from the Platform web servers through to the RDF stores at the back-end. The “X-TALIS-RESPONSE-ID” header you see in your HTTP response to a SPARQL query is a unique identifier that enables us to see what went on with an individual request all the way through the Platform’s stack. Big Brother sees all, innit?
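As a purely illustrative sketch (not the actual Talis implementation), threading one generated ID through every component’s log lines looks something like this; grepping the logs for the ID then reconstructs the request’s whole journey:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

public class RequestTracer {
    // Assign once at the edge; every downstream component logs the same ID.
    static String newRequestId() {
        return UUID.randomUUID().toString();
    }

    // Every component stamps the shared ID into its log lines.
    static String logLine(String component, String requestId, String message) {
        return String.format("[%s] request=%s %s", component, requestId, message);
    }

    public static void main(String[] args) {
        String id = newRequestId();
        List<String> log = new ArrayList<>();
        log.add(logLine("web-server", id, "received SPARQL query"));
        log.add(logLine("rdf-store", id, "executing query"));
        log.add(logLine("web-server", id, "returned 200"));
        // grep the logs for the ID and the whole journey falls out
        log.forEach(System.out::println);
    }
}
```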

That’s all very well, but when do I get the X-Ray glasses & exploding cigars?

ScaleCamp organiser Michael Brunton-Spall, who deserves enormous credit for his creation, hosted a session at the tail-end of the day. Michael introduced an approach used by the tech team at the Guardian to analyse a technical crisis after the event. The Analysis of Competing Hypotheses is a technique formulated by the CIA in the 1970s to help identify a wide set of hypotheses and provide a means of evaluating each when looking for explanations of complex problems. Interestingly, there is an open source project providing software to help you do this. The CIA and open source – strange bedfellows indeed, no? Whatever next, the FBI opening a sustainable hemp farm?

A spy

To illustrate the process, Michael used a real example from the Guardian so fresh it was still warm. A week or so before Scalecamp, the Guardian’s website had slowed to a crawl just before a scheduled live Q & A with WikiLeaks’ Julian Assange. We were asked to shout out possible causes, e.g. “Denial of service attack”, “Too many comments on a page”, and so on. Then we attempted to think of what evidence would prove or disprove each. A lightweight version of the full CIA methodology. Our own root cause analysis usually incorporates the 5 whys, but ACH looks like another useful tool to have at our disposal. Plus, we get to pretend we’re spies, although we’ll probably stop just short of the water boarding.

October 11, 2010 / cohodo

Surge 2010

Having now finally gotten over my jetlag, I’ve had a few minutes to write up my notes from Surge 2010, which was a really great couple of days, perfectly filling its niche. It also had probably the best lineup of speakers at any conference I’ve attended. Aside from the content, the whole thing was brilliantly organised and run by OmniTI, who deserve a massive amount of credit for initiating such an awesome event. Mostly for my own benefit, I’ve collected a few writeups from other folk who attended, and videos & slides from pretty much all of the sessions are due to be published any day now.

The main message coming through was read more, learn more, share more. This theme ran through a number of talks, from John Allspaw & Bryan Cantrill‘s opening keynotes to Theo’s closing plenary where he delivered the 11 Commandments of Scaling. There’s a huge body of literature out there constantly being produced by the academic and research communities. In general, we in industry are not particularly good at putting it to use and building on top of it – all too often we’re found re-inventing the wheel, making the same mistakes over and over, and then perpetuating this vicious circle by not sharing our experiences with our peers.

Standout sessions for me included Allspaw’s keynote, delivered with customary insight and aplomb, where he talked of the absolute immaturity of Web Operations as a discipline, and of the huge amount that we can learn from more established disciplines like civil & mechanical engineering, and from the aerospace and utilities industries, which have been tackling similar-shaped problems for decades, if not centuries.

Another highlight for me was Basho CTO Justin Sheehy‘s session on concurrency in distributed systems. Here, we got right to the nub of the issue – in any complex system, both in the real universe and in computer systems, it’s usually not correct to think of time as a single linear flow of events occurring in lockstep. Any software system, particularly any distributed system, that attempts to hide the underlying asynchronicity that this entails is fundamentally flawed. There are no strong guarantees of consistency in the physical world, and certain domains, like banking for example, have long recognised this and built compensating mechanisms into their systems. A great soundbite is that we shouldn’t aim to build reliable systems (i.e. ones that do not fail), but that we should aim to make our systems resilient to the failures that they will inevitably encounter.

There were also some great case studies and war stories including Artur Bergman‘s deep dive into operations at Wikia, Ruslan Belkin‘s ‘Going 0 to 60: Scaling LinkedIn’ and Geir Magnusson’s detailed walkthrough of how his company scaled up from a typical n-tier application by building out a loosely coupled, service-oriented back end.

I definitely learned a lot, had a bunch of things reaffirmed, and also found a lot of great validation for the stuff we’re doing on our Platform. Can’t wait for next year.

October 8, 2010 / cohodo

Royal Society Web Science

I’ve just spent the last couple of days at the Royal Society Web Science discussion meeting, which I felt was a very special event. Web Science (the internet/www as an object of scientific study) is emerging as a new interdisciplinary field of activity with collaborators from both science and the humanities. This crossover of ideas from many different disciplines (physics, mathematics, computer science, politics, philosophy, sociology) could prove fruitful, and indeed there were speakers at the event from all these disciplines. All of the speakers were very good indeed, some excellent, and all with high-calibre backgrounds and good credentials; people who have obviously paid their dues with years of hard work and good research.
A few common themes emerged. More than one person mentioned Frigyes Karinthy and the six degrees of separation concept. Another theme was the value to researchers of having at their disposal unprecedentedly vast amounts of rich data, the “digital traces” (Kleinberg) of all of our interactions on the web. With this kind of data, sociologists and other students of humanity have the ability to examine human behaviour, and may be able to prove and disprove theories by empirical studies at a scale not possible before. A final theme was the value of the internet and the web themselves: maintaining the structure of the internet, ensuring its security and scalability, and keeping the web democratic and open.
The presentations should be available to view from
September 28, 2010 / cohodo

Heading out to Surge 2010

A couple of us will be flying out to Baltimore tomorrow for the inaugural Surge Conference. Billed as “more than an event, it’s a chance to identify emerging trends and meet the architects behind established technologies”, the speaker list includes some real heavyweights and it’s hard (really hard) to pick which sessions to miss.

If you’re going to be there and fancy meeting up, feel free to ping either of us: @beobal & @daveiw.

August 1, 2010 / cohodo

Configuring Guice Dependencies Post-Deployment

In a number of our projects, the Platform engineering team use Guice as a dependency injection framework. The benefits of DI with regard to increasing modularity, lowering coupling and facilitating reuse are well documented, and a killer feature for us is the vast improvement in testability. One of the reasons we like Guice is that all of your dependency wiring is done in code and so is checked by the compiler. Guice also seems to strike just the right balance between features and bloat: the core library makes it easy to do the things you really need, without including lots of stuff you don’t want. There’s also an active community developing extensions and additions to integrate or adapt Guice for specific uses.

Sometimes, we do want the ability to control the composition of an app at deploy time, which for us means specifying which combination of Guice Modules to configure our Injector with. Ordinarily, the main method (or something called early on in the application lifecycle) would contain some code to initialise the Injector with a list of Modules. Like so:

Injector injector = Guice.createInjector(new NetworkModule(),
                                         new SequencingModule(),
                                         new MySQLModule(),
                                         new JMSModule());
SomeThing thing = injector.getInstance(SomeThing.class);

Our use case was this: we wanted to deploy the same distribution of an application to multiple places and configure which implementations of various internal services were used in each environment. So in the example above, we wanted to be able to choose between the bindings specified in MySQLModule and PostgresModule after deployment. Initially, it didn’t seem that there was an existing solution, until we ran into java.util.ServiceLoader. This enables multiple concrete implementations of abstract services (i.e. interfaces/abstract classes) to be specified at runtime using a simple descriptor file on the classpath (the javadocs have a much fuller explanation). So, in this case the abstract service that we want to load is defined by com.google.inject.Module, and the concrete implementations are the specific combination of modules we want to use to configure our app. The hardcoded Injector bootstrapping is replaced with this one-liner:

Injector injector = Guice.createInjector(ServiceLoader.load(Module.class));

The spec of which modules to load is contained in a classpath resource named META-INF/services/com.google.inject.Module (ServiceLoader’s convention is to name the file after the fully qualified name of the service type), and is just a simple list of fully qualified class names.
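For illustration, assuming the modules from the earlier example lived in a hypothetical com.example.app package, the descriptor selecting the Postgres variant might read:

```
com.example.app.NetworkModule
com.example.app.SequencingModule
com.example.app.PostgresModule
com.example.app.JMSModule
```

Swapping this file (e.g. substituting MySQLModule for PostgresModule) changes the app’s composition without touching the compiled distribution.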

It’s possible to provide the service configuration file over HTTP by specifying remote URLs on the classpath, but at the moment we’re controlling which config gets deployed where using our regular deployment tool, Puppet.

July 13, 2010 / cohodo

Automatically Creating Inverse Changesets and When They Don't Behave as Expected

The Talis Platform uses changesets as a mechanism for updating RDF. As the configuration of the Platform is itself stored as RDF, we also use changesets to modify its configuration. This can be as part of a release or to make requested changes to a customer’s store.

I recently needed to apply a large number of changesets to the Platform configuration. But before applying them, I wanted to create another set of changesets which would, if necessary, reverse all the changes – I wanted to be able to rollback if anything went wrong.

So my changesets looked something like this:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:cs="http://purl.org/vocab/changeset/schema#">
  <cs:ChangeSet rdf:about="">
    <cs:subjectOfChange rdf:resource=""/>
    <cs:removal>
      <rdf:Statement>
        <rdf:subject rdf:resource=""/>
        <rdf:predicate rdf:resource=""/>
        <rdf:object rdf:resource=""/>
      </rdf:Statement>
    </cs:removal>
    <cs:addition>
      <rdf:Statement>
        <rdf:subject rdf:resource=""/>
        <rdf:predicate rdf:resource=""/>
        <rdf:object rdf:resource=""/>
      </rdf:Statement>
    </cs:addition>
  </cs:ChangeSet>
</rdf:RDF>

This changeset can be reversed by changing the removals to additions and changing the additions to removals. This is easy to achieve with sed:

mkdir -p rollback
for f in changesetdirectory/* ; do
  sed -e 's/cs:addition/TOBEAREMOVAL/g' -e 's/cs:removal/TOBEANADDITION/g' \
      -e 's/TOBEAREMOVAL/cs:removal/g' -e 's/TOBEANADDITION/cs:addition/g' \
      "$f" > "rollback/$(basename "$f")"
done

The above script creates an inverse of every changeset in the specified changesetdirectory and places them in the rollback directory. The inverse of the example changeset above is created as below:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:cs="http://purl.org/vocab/changeset/schema#">
  <cs:ChangeSet rdf:about="">
    <cs:subjectOfChange rdf:resource=""/>
    <cs:addition>
      <rdf:Statement>
        <rdf:subject rdf:resource=""/>
        <rdf:predicate rdf:resource=""/>
        <rdf:object rdf:resource=""/>
      </rdf:Statement>
    </cs:addition>
    <cs:removal>
      <rdf:Statement>
        <rdf:subject rdf:resource=""/>
        <rdf:predicate rdf:resource=""/>
        <rdf:object rdf:resource=""/>
      </rdf:Statement>
    </cs:removal>
  </cs:ChangeSet>
</rdf:RDF>

So the original changeset removes the triple:

and replaces it with:

The inverse changeset removes the triple:

and replaces the original:

Using this technique, I successfully created inverse changesets which, if I had needed to, would have rolled back the changes to the configuration.

However, there is a caveat. The set semantics of a triplestore can be a gotcha.

Suppose the following triple already exists:

The following changeset could be applied:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:cs="http://purl.org/vocab/changeset/schema#">
  <cs:ChangeSet rdf:about="">
    <cs:subjectOfChange rdf:resource=""/>
    <cs:addition>
      <rdf:Statement>
        <rdf:subject rdf:resource=""/>
        <rdf:predicate rdf:resource=""/>
        <rdf:object rdf:resource=""/>
      </rdf:Statement>
    </cs:addition>
  </cs:ChangeSet>
</rdf:RDF>

This changeset is accepted but doesn’t actually modify the triples as the triple it adds already existed. Creating an inverse of this changeset gives us:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:cs="http://purl.org/vocab/changeset/schema#">
  <cs:ChangeSet rdf:about="">
    <cs:subjectOfChange rdf:resource=""/>
    <cs:removal>
      <rdf:Statement>
        <rdf:subject rdf:resource=""/>
        <rdf:predicate rdf:resource=""/>
        <rdf:object rdf:resource=""/>
      </rdf:Statement>
    </cs:removal>
  </cs:ChangeSet>
</rdf:RDF>

However, applying the inverse changeset removes the triple. As the triple existed before applying the first changeset, the inverse did not have the result we were looking for: it ended up deleting a triple which existed before we started.

So creating inverse changesets in this way can be useful, but only when you know with certainty that any triples added in the original changeset did not already exist.
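The gotcha is easy to reproduce with any set data structure; a minimal Java sketch:

```java
import java.util.HashSet;
import java.util.Set;

public class SetSemantics {
    // Does a pre-existing triple survive an "add" followed by the
    // mechanically inverted "remove"? Under set semantics, it does not.
    static boolean survivesAddThenInverseRemove() {
        Set<String> store = new HashSet<>();
        String triple = "<s> <p> <o>"; // stand-in for an RDF triple

        store.add(triple);    // the triple already exists in the store
        store.add(triple);    // original changeset: a no-op under set semantics
        store.remove(triple); // inverse changeset: deletes the original data

        return store.contains(triple);
    }

    public static void main(String[] args) {
        System.out.println(survivesAddThenInverseRemove()); // false
    }
}
```

The add is silently absorbed, but the remove is not, which is exactly the asymmetry described above.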

July 2, 2010 / cohodo

Velocity 2010

Two Planes, an IT-related Sitcom, and a Shuttle-related Ruckus (a Shuckus?)

After setting off from Digbeth coach station some 25+ hours earlier, 2 tired Talisians (Matt and me) finally arrived at the Hyatt Regency, Santa Clara, California for O’Reilly’s Velocity 2010 conference for Web Performance and Operations. We first flew from Heathrow to Dallas Fort Worth on a surprisingly cramped American Airlines flight, then on to San Francisco; by the time we hit SFO we were looking forward to getting to the hotel to freshen up and sleep. Night had fallen as our plane approached SFO, which made a nice change after nothing but daylight since waking up in Birmingham well over a day before. Not a journey for vampires, I can tells ya.

On the flight from Dallas to San Francisco I sat next to Patrick Wilson, a CEO on his way back from a conference in Florida. The Valley and Bay Area are chock-full of people involved in one area or another of the tech industry; you can’t swing a cat without hitting a developer, an ops guy and a couple of venture capitalists. And probably doing some damage to the cat.

IT Crowd's "Moss"

Patrick has a pretty extensive knowledge of the area and served as a high-altitude tour guide of sorts as we made our approach over the lights of San Francisco (“There’s the Googleplex. That complex there is Sun” and so on). He also revealed a recently acquired taste for the Channel 4 sitcom “The I.T. Crowd”, even going as far as to fire up his laptop and share an episode. I was more than happy to help him with his query as to the “exchange rate between pounds and quid?”, given his obvious delight at one of our most surprising cultural exports.

We knew a cab from SFO to Santa Clara would run us well over 100 dollars; Irish genes kicked in and we started looking for a cheaper alternative, despite our by now zombie-like state of fatigue. At this point we discovered the magic of SFO “Shuttles”. These are basically big truck-like minivans that provide a (relatively) cheap way to get from the airport to anywhere in the general Bay Area. When enough people want to go in roughly the same direction, you jump in and off you go. Cheaper than a cab and organised to cover broad geographical regions (e.g. the South Bay area), there is an element of pot luck in how long it takes to reach your destination, depending on the route the driver chooses in order to encompass all stops.

SFO Shuttle

SFO employees are on hand to ensure fair play from Shuttle drivers. The lady we spoke to attempted to find a way to tactfully express the fact (“…it depends on…err… the… size of the people…”) that a family of very large Americans waiting expectantly inside the shuttle we needed meant there should really be 2 spare seats, but realistically there was now only 1 seat with access to a seatbelt. This was causing a problem.  Presumably potential litigation in the case of an accident was a worry. A brief Mexican (or at least Californian) stand-off ensued. SFO lady went to find a supervisor.

The driver attempted to slyly squeeze us in anyway now SFO lady’s back was turned. Oversized family took immediate exception to this blatant disregard for their safety (if only they showed the same concern for their health when passing a Dunkin’ Donuts). Dummies were spat, rattles were thrown, they retrieved their baggage and stormed (actually more like waddled) off in a huff (“You ****** up, Buddy! **** you!”). Too tired to find this episode as amusing as we really should have, we slid into the space vacated by their ample collective bulk and an hour or so later we were in Santa Clara, via Oakland. The driver was relying heavily on his sat nav, which caused some concern when it failed spectacularly. He switched it for an identical one that worked, prompting lame, tired jokes from us about redundancy and switching to a warm standby. First performance lesson of the trip: geek humour degrades dramatically after 26 hours without sleep.

A Serendipitous Meeting between Scribes

On Sunday we had dinner with fellow Velocity attendees Patrick Debois and Torben Graversen. Matt met Patrick at Puppet Camp a few weeks ago. Patrick is the originator of the term “Devops”; his blog is highly recommended. As we chatted in the bar afterwards, Sean Power approached and asked if we were here for Velocity. He introduced himself and his friend Tracy Lee.

We chewed the fat for a while and talked lean start-ups, performance monitoring, Silicon Valley, and Sean’s upcoming talk at Velocity. Sean mentioned that he had contributed to an O’Reilly book that was due out any day now; Patrick asked which one. It turned out to be “Web Operations”; this book contains a chapter on Monitoring written by none other than one Mr. Patrick Debois! You could have knocked us over with an O’Reilly “In a Nutshell” book, such was the strength of the minor coincidence.

Lack of nerve scuppers a tour of Twitter (or possibly just a beating from security)

Since the conference didn’t start until Tuesday, we hired a car on Monday and drove to San Francisco. Whilst there we met with a couple of friends of Patrick, one of whom was something of a veteran of the Bay Area Tech scene. He told us he had calculated that there were at least 400 tech companies in a 3 block area of San Francisco; pretty mind blowing when you think of it. He also informed us the Twitter offices were just around the corner and suggested we should go in, ask at reception for his friend John Adams, tell him we were here for Velocity, (“Mike told us to ask for you”) and see if he would give us a tour.

We were all pretty surprised to find we were able to stroll into the Twitter building unannounced, take the lift up to the 6th floor, wander into reception and hang around without anybody once challenging us; I kept expecting to be thrown out any second. Ultimately, we were all far too European and reserved to ask for someone we didn’t know, tell him we were friends of somebody else we didn’t really know (“Does anybody know that guy Mike’s surname?”), and cheekily ask for a tour of Twitter, so we just hung around for a bit looking goofy and then left. So much for the meek inheriting the Earth; we couldn’t even blag a tour of Twitter.

Down to Business

Velocity traditionally covers two broad areas:

  • performance of Web applications
  • operations

At first glance some folks may not see the connection between these two topics, but they are increasingly intertwined as engineers seek to build highly available, scalable and fast applications that operate at Internet scale. Here in Platform Engineering & Operations, our development and operations functions work together closely in the same team, so it made perfect sense to us that these tracks had been combined into a single conference looking at performance in a holistic way. It also seemed fitting for us to send one developer and one ops person.

The conference was sold out, with over 1200 people in attendance, and up to 3 tracks at once at various times. Between the 2 of us, we tried to arrange our schedules to cover as many of the presentations and sessions as we could. Some of the sessions were billed as “workshops”, but in reality they were way too big to be anything other than long presentations; 400 people is far too many for anything “workshoppy”. Nevertheless, the content was generally of a very high standard; informative and well presented.

DO go chasing waterfalls…

Quite a number of sessions focussed on optimisation of applications that are delivered to the browser. Although not a problem we face directly in delivering the Platform (which is an API), this is an area that has come on in leaps and bounds over the last couple of years and it was interesting to see the current state of the art. Annie Sullivan of Google gave a very good presentation covering many of the techniques engineers turn to when tuning the performance of their web pages from the point after server-side processing is complete.

Waterfall charts are a common tool for analysing performance; during the course of the conference we saw many variations created by assorted tools including Webpagetest, Google Page Speed, DynaTrace, WebKit, Gomez, and a host of others. In fact, Steve Souders mentioned that there are twice as many of these types of tools as there were this time last year, which underlines the growth of this area of performance tuning. Performance is arguably more important now than ever before; even Google page ranking is now partially dependent on the speed of your site.

Techniques mentioned by Annie and others included various ways to optimise and minify JavaScript, CSS & HTML, including getting JavaScript into a build system to help you identify dead code, code that can be modularised, and code that could be loaded asynchronously. Asynchrony, along with progressive rendering techniques to ensure the most important parts of the page load first, also featured heavily in an exploration of how Facebook made their site twice as fast.

Engineering for the win!

One of my favourite presentations came from Theo Schlossnagle. Theo’s “Scalable Internet Architectures” was 90 minutes of wisdom covering a vast array of material, from analysing network packet size, to choosing between SQL and NoSQL databases, to version control, caching, monitoring, service decoupling, mastering tools and the importance of engineering maths. A truly wide-ranging and ambitious presentation, skilfully delivered. Unfortunately there appears to be no video, so you can’t really appreciate the moments when Theo worked himself into a righteous engineering rage as he dismissed various bone-headed architectural decisions. However, the slides are still well worth a look.

How do they do that?

Undoubtedly the most over-subscribed session of the week was “A Day in the Life of Facebook Operations” by Tom Cook; I literally had to watch this one while standing in the doorway to the lecture theatre. The room was full to bursting point and Tom did not disappoint. The sheer scale of the job at Facebook is daunting; more than 400 million users, tens of thousands of servers, 300+ TB of data served from RAM alone via Memcached, and multiple software releases and configuration changes every single day across this gigantic stack. A great example of operating on a massive scale and yet still moving quickly and keeping risk small.

Similarly popular and insightful were John Adams’ “In the Belly of the Whale: Operations at Twitter“, John Allspaw’s “Ops Meta-Metrics: The Currency You Use to Pay For Change” and Paul Hammond’s “Always Ship Trunk: Managing Change In Complex Websites”. All of these presenters have real, in-the-trenches experience of managing development and operations in very large, very fast moving Web applications, servicing mind-boggling numbers of users via staggering amounts of code and infrastructure. Much can be learned from them.

Dev what now?

I have become increasingly aware of the Devops movement over the last few months. I believe this kind of thinking has the potential to change the face of Operations the way agile approaches have changed Software Development over the last 10 years, so it was good to see it well represented at Velocity. I particularly enjoyed Andrew Shafer’s “Change Management: A Scientific Classification”, which sounds almost like it could be espousing a very buttoned-down, paperwork and process heavy approach to managing change, but in fact stresses the importance of agile thinking (high-bandwidth communication, version control, small changes deployed regularly and monitored heavily, automation and configuration management tools) in safely managing change. Adam Jacob also touched on Devops during his innovative “Choose Your Own Adventure” session.

There is no spoon

Wedged in amongst all the good stuff on performance in the browser there were a couple of sessions that took different approaches to looking at performance. Firstly, there was Yahoo Search’s Stoyan Stefanov with “The Psychology of Performance”, offering fascinating insights into how humans perceive the duration of various things and what that means for web applications.

Secondly, Neil Gunther and Shanti Subramanyam used performance testing analysis of Memcached in “Hidden Scalability Gotchas in Memcached and Friends” to introduce Neil’s Universal Scalability Law and explain what mathematical modelling can do to help performance tuning in the Brave New World of multi core machines. This was truly eye-opening stuff; the material was accessible enough to pull you in, but deep enough that I will be digesting bits of it and delving into this further for a long time to come. It was also good to see that server-side performance was being addressed at Velocity, albeit on a much smaller scale than the browser-side.
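For reference, the Universal Scalability Law models the relative capacity C(N) of a system at concurrency N using two fitted parameters: one for contention (serialisation) and one for coherency delay (crosstalk). A minimal sketch in Python, using Gunther's usual notation:

```python
def usl_capacity(n, sigma, kappa):
    """Relative capacity C(N) under the Universal Scalability Law.

    sigma models contention (serialisation), kappa models coherency
    delay (crosstalk); both are typically fitted from measured
    throughput data rather than chosen by hand.
    """
    return n / (1 + sigma * (n - 1) + kappa * n * (n - 1))

# With no contention or coherency cost, scaling is linear:
assert usl_capacity(8, 0.0, 0.0) == 8.0

# Even a small kappa means throughput eventually peaks and then
# retrogrades as concurrency grows - the "hidden gotcha" of the title:
assert usl_capacity(16, 0.05, 0.01) > usl_capacity(64, 0.05, 0.01)
```

The interesting part in practice is fitting sigma and kappa to benchmark results (as the session did for Memcached), because the fitted values tell you whether you are limited by serialisation or by crosstalk between cores.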

A recurring theme for me was the additional material Velocity has pointed me towards; the performance-related blogs of Neil and Shanti being great examples.

What has all that got to do with the Platform?

Common high-level threads amongst all these cool kids on the Tech block were being process-light but review-heavy, and making frequent small changes with enough testing, automation and monitoring around them to keep the risk of change minimal yet keep the pace punchy. I found it encouraging to see how much of this stuff we already do in Platform Engineering & Operations, e.g.:

  • version controlling everything (Subversion and Git)
  • always shipping trunk
  • using configuration management tools (Puppet)
  • stressing peer review
  • extensive automated testing (JUnit, Grinder)
  • monitoring and alerting (Ganglia, Nagios, Cacti, etc.)
  • Continuous Integration (Hudson)
  • dark deployment
  • service decoupling
  • using switches in code to enable/disable features
  • frequent small releases
  • appropriate use of asynchrony
  • judicious use of cloud technologies (EC2, S3 and various other bits of AWS)
  • having ops and devs work closely together.
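To illustrate the feature-switch idea from the list above (the switch names and lookup here are a hypothetical sketch, not our actual implementation), the pattern amounts to a guarded lookup that fails safe:

```python
# Hypothetical sketch of a feature switch: a config-driven toggle that
# lets code for a new feature ship dark and be turned on (or off again)
# without a redeploy.
FEATURES = {
    "new_search_ranking": False,  # dark-deployed, off by default
    "async_indexing": True,
}

def feature_enabled(name):
    # Unknown switches default to off, so a typo fails safe.
    return FEATURES.get(name, False)

def handle_query(query):
    if feature_enabled("new_search_ranking"):
        return "ranked:" + query   # new code path
    return "legacy:" + query       # existing behaviour
```

The payoff is that deploying code and enabling behaviour become separate, individually reversible steps, which is what keeps frequent small releases low-risk.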

We don’t yet face the problems of scale that have led Facebook and Twitter to turn to BitTorrent as a means to roll out software quickly to thousands of servers (that would be one of those “nice” problems to have, given what it would represent in terms of take-up of the Platform), and we have some way to go before we can truly say we deploy continuously. However, I left feeling confident in the way we work, primed with new areas for us to explore, and inspired at having gained an insight into how some of the leading lights of Internet-scale engineering make it all hang together.

July 2, 2010 / cohodo

Putting Structure Into Application Logs

Most of us know that application logs can be a fantastically rich source of data, and if you can mine them effectively they can be an extremely valuable resource in lots of contexts. We make use of the app logs from the various Platform components in a number of ways – for trend analysis, monitoring, tracking deployments of new features, fault diagnosis and post-mortem analysis.

Unfortunately for us ops peeps, application logging is a big, moving target. We might dark deploy a version of the software with super verbose output, or we might introduce a new feature which renders our previous perl + regex home-brew tool for pulling out important values pretty useless. This means the important data we wanted to trend on in March 2009 needs a different regex in June 2010.

Standards! I hear you cry! It seems (to me) that existing standards are good at getting the message into the “right” place in the log, which makes messages easy to parse and pass around the network, but they rely on you saying the same thing in the logs over and over again – they rely on rigid structure, where a change to the log format is like adding a new column to an SQL database. This means that if you are matching values by their position, adding a new variable can render your previous match useless – and worse still, you might not know about it until it is too late!

Basically, string matching just isn’t robust or portable enough for what we want to do, so we’ve been adding structure to the data in our logs. We’re currently outputting logs from some of our services as JSON. This means we can dynamically add and remove variables as we wish, and we can potentially send the logs straight to an indexer, load them into a db, convert them to RDF, or process them using Hadoop-based tools. We can still do our perl-based graph building – by converting the JSON to an array or a hash and pulling out the right fields – irrespective of where they appear in the output.
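As a rough illustration (in Python, with made-up field names rather than our real log schema), addressing fields by name instead of by position is what makes the extraction robust:

```python
import json

# A structured log line: fields can appear in any order, and new
# fields can be added without breaking existing consumers.
line = '{"host": "somehost.talis", "used_mb": 26, "max_mb": 864}'

# A positional regex would break if a new field were inserted before
# "used_mb"; decoding the JSON makes the extraction order-proof.
record = json.loads(line)
print(record["used_mb"])   # 26, wherever the field appears in the line
```

The same idea carries over to any consumer of the logs: each tool picks out the keys it cares about and ignores the rest.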

To get a fully JSONised log line we need to work with three main areas: the application itself, the application’s logging handler, and finally the centralised logging tool. Here is some real output:

    "syslogDate" : "2010-06-28T15:24:45+01:00",
    "syslogFacility" : "local0",
    "syslogPriority" : "info",
    "host" : "somehost.talis",
    "Message" : {
        "Process": "WS",
        "Date": "2010-06-28 15:24:45,834",
        "Priority": "INFO",
        "Category": "MemoryUsageLogger",
        "Thread": "Memory-Usage-Logger-Thread",
        "id": "",
        "Message": {
            "Memory Usage" : {
                "Heap" : {
                    "init" : "60",
                    "committed" : "73",
                    "used" : "26",
                    "max" : "864"
                "Non-Heap" : {
                    "init" : "23",
                    "committed" : "24",
                    "used" : "22",
                    "max" : "216"

The first part of the JSON comes from syslog-ng, which puts in the details of how syslog sees the log message; the first nested “Message” comes from Log4J, and is how Log4J sees the message; and the final “Message” is from the application itself, which in this instance gives us some general information about memory usage.

As well as making our lives easier, it also means we can correlate events much more quickly and easily. From the simple JSON above we could write a rule that identifies discrepancies between “syslogDate” and “Date”, which could indicate a problem with the system clock on the application server, or even identify the syslog server unexpectedly slowing down.
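A rough sketch of such a rule (in Python, using the field formats from the example output above; the tolerance is an illustrative choice, not anything we actually run):

```python
from datetime import datetime

def clock_skew_seconds(entry):
    """Seconds between syslog's receive time and the app's own timestamp.

    Assumes both clocks report the same local timezone, as in our logs:
    "syslogDate" is ISO 8601 with an offset, the Log4J "Date" is local
    time with millisecond precision.
    """
    syslog_ts = datetime.fromisoformat(entry["syslogDate"]).replace(tzinfo=None)
    app_ts = datetime.strptime(entry["Message"]["Date"], "%Y-%m-%d %H:%M:%S,%f")
    return abs((syslog_ts - app_ts).total_seconds())

entry = {
    "syslogDate": "2010-06-28T15:24:45+01:00",
    "Message": {"Date": "2010-06-28 15:24:45,834"},
}
# Anything beyond a small tolerance is worth alerting on - it points at
# clock drift on the app server or a syslog server falling behind.
assert clock_skew_seconds(entry) < 5
```

Because both timestamps travel in the same structured record, the check needs no cross-referencing of separate log files.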

One of the uses we’re putting this to is building real-time reporting of coarse-grained application events to complement the views we get from Ganglia etc.

February 17, 2010 / cohodo

Talis Hackday 1.0

A couple of weeks ago we held our first hackday. Basically, this involved taking over one of the larger rooms at Talis HQ for the day, filling it with hackers and pizza then baking for several hours. Hackdays tend to be aimed squarely at developers, but taking a leaf from events like Hacks & Hackers we wanted to be a bit more inclusive, so we tried to make it interesting and accessible to non-techies. For a week or so before the day, everyone who had an idea, pet project, or itch to scratch was encouraged to post it up on a whiteboard and ‘pitch’ it to other people, who might be interested in finding out more or even pitching in to help out. There were only 2 rules – that no idea was dismissed out of hand and that no-one was allowed to hack on stuff from their day job (because that’s what we do, like, every other day).
Talis is an organisation full of hackers, so there was no shortage of ideas or participants. In fact, the number of hacks posted on the board far exceeded our hacking capacity for a single day.

The day was a great success and we’re already planning future events with lots of ideas on how to tweak the format. We’d love to open these up for wider participation, and hope to be doing this in the next few months, so watch this space. There were some really cool projects being worked on, so see if anything tickles your fancy and let us know what you think.

Recording Environmental Data as RDF
Über-cool mashup of Arduino and RDF: Rob built a device to take temperature readings at regular intervals, represent the data in RDF and post it to a Platform Store. It’s now sat on his windowsill, keeping us informed of the ambient temperature in Rob’s general vicinity.

Twitter-enabled PVR Plugin
A twitter-enabled plugin for PVRs (primarily MythTV, but hopefully with support for other distros in the pipeline). Triggered when you record a TV show, it queries various datasources, integrates the data and publishes it for the world to see. Perfect for advertising your love of Carry On films or afternoon soap operas.

Store Activity Visualisation
Julian built a cool visualisation of activity on a Platform Store using the built-in OAI-PMH service which graphs updates made to both the Metabox and Contentbox over a specified period. The IRC logs for #talis are persisted in a Store, so we’re going to use this tool to graph activity on the channel.

Using Pig and Amazon Elastic MapReduce to Analyze Webserver Logs
We have a lot of logs, and as you can imagine, they contain lots of truly invaluable data. Some members of our Platform engineering team wanted to explore this a bit more deeply, and so spent the day hacking up Pig Latin scripts to do this. Since they managed to chomp so many logfiles, we let them get away with breaking hackday rule #2.

Android Life Tracker
Talis CSO Justin hacked up an Android app to record events as RDF direct from mobile devices. Surprisingly, he’s chosen to store the trail of these events in a Platform Store for post-hoc analysis & data mining 🙂

SPARQL 1.1 HTTP Update Protocol Implementation
Paolo spent the day working on a reference implementation of the current draft of the SPARQL Working Group’s RESTful update protocol using Jena and Jersey/JAX-RS. Paolo plans to open-source this and contribute it back to the Jena project once he’s done.

Data Integration for Business Intelligence
John spent the day working on modelling data extracted from library loans services. Using RDF to integrate data from disparate sources like this is just the sort of job we built our Platform for.

Like a school sports day, there were no prizes awarded on the day. But if there had been, the gold medal would have undoubtedly gone to Ian Corns for his LittleBigPlanet hack – Sackboy Explains the Semantic Web.

There’s just no way you can top a platform-based romp through the bowels of CERN where the eponymous hero meets TimBL to explore the origins of the document and data web.