Sunday, September 23, 2012

Crash & Burn 2012

When I was studying computer science we spent a lot of time learning about programming and theory, but from what I can recall almost nothing was said about deployment and administration, which is kind of absurd. What good are programs that can't be run properly? Talking to others, I find this experience is common among developers, and it is exacerbated by the habit some organizations have of keeping developers far away from the sysadmins. The Crash & Burn conference is about the important topics of continuous integration, testing, deployment and virtualization. It was held at KTH Forum in Kista, Stockholm, on March 2, 2012, and this is a summary of my notes from the conference.

Keynote: Why DevOps? by Morten Nielsen from RemoteX
Morten discussed how, when someone at their hosting company made a mistake and the site became inaccessible, it still reflected badly on the developers and their company. A few years ago they therefore chose to host the application themselves. This initially gave them increased quality of service, but as the number of versions of the system in production grew, they started feeling the pain of maintaining and upgrading the system. Despite a comprehensive manual, deployment took time and was error prone.

The solution was to throw away the deployment manual and automate everything. This reduced the cost of an upgrade to four minutes, unsupervised, which made frequent releases possible. That in turn gave developers more time for developing new features, increased focus, and increased confidence that they could react quickly to security threats or other emergencies.

Morten especially recommended the book Continuous Delivery.

Designing for Rapid Release by Sam Newman at ThoughtWorks
A lot of time in software projects is spent worrying about requirements, performance and compliance. Sam Newman argued that to be successful, more projects should prioritize how easy it is to release. To do this, three things are needed:
  • Make it quick to change
  • Make it quick to release
  • Make it safe to release
Each release is a rollback point - a small, incremental code change. Sam spent some time explaining Blue/Green deployment. Essentially, instead of taking down a service and replacing it, you deploy the new version alongside the old one; after ensuring it has been correctly deployed, you switch clients over to the new service. A major blocker for this otherwise nice technique is "session serialization", especially in the Java/.NET world. You either have to wipe session data (not a good way to keep your customers happy, especially if you are, for instance, a shop - lost sales!) or do a complex session migration (especially tricky with static types). Sam argued that session serialization is an anti-pattern anyway. Many shops routinely throw lots of objects into the session along with their entire object graphs, and are then surprised that each user session consumes several megabytes - a performance killer. He argued for stateless sessions and the use of classic cookies instead.
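
As a concrete illustration of the stateless approach, here is a rough sketch of my own (not from the talk) of keeping a minimal session payload in a tamper-evident cookie instead of in server-side session storage, using only the Python standard library; the secret and payload are placeholders:

    import base64
    import hashlib
    import hmac
    import json

    SECRET = b"replace-with-a-real-secret"  # placeholder key

    def make_session_cookie(data):
        # Pack a small session payload into a signed cookie value.
        payload = base64.urlsafe_b64encode(json.dumps(data).encode("utf-8"))
        signature = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
        return payload.decode("ascii") + "." + signature

    def read_session_cookie(cookie):
        # Verify the signature; return None if the cookie was tampered with.
        payload, _, signature = cookie.rpartition(".")
        expected = hmac.new(SECRET, payload.encode("ascii"), hashlib.sha256).hexdigest()
        if not hmac.compare_digest(signature, expected):
            return None
        return json.loads(base64.urlsafe_b64decode(payload))

    cookie = make_session_cookie({"user_id": 42, "cart": [1337]})
    print(read_session_cookie(cookie))  # {'user_id': 42, 'cart': [1337]}

With the state living in the cookie, nothing has to be migrated or wiped when a new version of the service is rolled out, at the cost of keeping the payload small.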

He mentioned "Dark Launches" - launching a new version of a service in secret, but not migrating users to it. Instead you play back transactions from the current service to do testing. Facebook were famous for doing this, adding JavaScript hooks that sent all user transactions both to the live site, and the Dark Launched version.

He then described the need to be able to degrade quality of service whenever you do an upgrade. This is complex, but it is something you should take into consideration anyway, since it is also what makes the site tolerant to failure by degrading gracefully. He mentioned the Circuit Breaker design pattern, and for further reading recommended the book Patterns of Enterprise Application Architecture by Martin Fowler.
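
For reference, the core of the Circuit Breaker idea can be sketched in a few lines; this is a minimal illustration of the pattern in Python, not something from the talk:

    import time

    class CircuitBreaker:
        """Wraps a callable and stops invoking it after repeated failures."""

        def __init__(self, func, max_failures=3, reset_timeout=30.0):
            self.func = func
            self.max_failures = max_failures
            self.reset_timeout = reset_timeout
            self.failures = 0
            self.opened_at = None  # time the breaker tripped, None while closed

        def call(self, *args, **kwargs):
            if self.opened_at is not None:
                if time.time() - self.opened_at < self.reset_timeout:
                    raise RuntimeError("circuit open - failing fast")
                # Timeout expired: allow one trial call ("half-open").
                self.opened_at = None
                self.failures = 0
            try:
                result = self.func(*args, **kwargs)
            except Exception:
                self.failures += 1
                if self.failures >= self.max_failures:
                    self.opened_at = time.time()  # trip the breaker
                raise
            self.failures = 0
            return result

Wrapping calls to a flaky downstream service (such as one in the middle of an upgrade) in something like this lets the rest of the site fail fast and degrade gracefully instead of piling up blocked requests.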

An alternative design is to use asynchronous behaviour - but this is also complex, perhaps too complex for most use cases? If you use this architecture, it becomes even more important to degrade gracefully and to keep users informed about progress.

Sam then brought up a couple of anti-patterns that make updating and deploying especially difficult. One he called "The Trifle". I would have liked an explanation of the choice of name, because the meaning is not obvious to me. From my understanding, he used it to describe the common fallacy of dividing the development of an application into, for instance, "the web layer" and "the persistence layer" and putting separate teams to work on them concurrently. There is then just the little trifle of merging the two. He described a project he had been on where two geographically separate teams developed those tiers, with disastrous results. The problem is that code changes almost always cross boundaries - both tiers need to be upgraded at the same time. Furthermore, if you divide your application like this, you often end up with a lot of chatter between the services.

Another common anti-pattern is The Spider - a single central service, which usually grows into a sort of "god object" that just polls a lot of dumb services in the periphery.

As a rule of thumb for designing services correctly, Sam recommended thinking of a service as "a set of capabilities at an endpoint" - something an end user would recognize as relevant to them, for instance a "music recommendation engine". You should model your services on your business domain (see also: DDD).

Whatever your architecture, you should beware of shared serialization protocols. He quoted "Be conservative in what you do, be liberal in what you expect". As a cautionary example, he mentioned XML binding libraries which, while giving apparent convenience to programmers, tightly couple services to each other. He said that ThoughtWorks almost universally recommend using XPath or similar technologies to consume SOAP services rather than XML-object binding libraries. This is less brittle because domain object changes don't affect the protocol, and if the protocol changes, fewer consumer changes are needed.
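
To make the contrast concrete, here is a small sketch using Python's standard library and a made-up message format: the consumer picks out only the fields it cares about, so unrelated changes to the document do not break it.

    import xml.etree.ElementTree as ET

    # A hypothetical response from an "order" service.
    response = """
    <order>
      <id>42</id>
      <customer>
        <name>Ada</name>
        <email>ada@example.com</email>
      </customer>
      <total currency="SEK">199.00</total>
    </order>
    """

    root = ET.fromstring(response)

    # Pull out only the fields this consumer actually needs; new or renamed
    # elements elsewhere in the document will not break anything here.
    order_id = root.findtext("id")
    customer = root.findtext("customer/name")
    total = float(root.findtext("total"))

    print(order_id, customer, total)  # 42 Ada 199.0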

At this point in the presentation I had started to think about my own experiences of upgrading applications, and the fact that databases are often a pain point. I had planned to ask several questions about how he would mitigate this, but he was ahead of me, as the last part of the presentation covered databases and persistence. He offered the great quote "Data is cool, databases are generally evil". The problem with databases is that you can't version the schemas easily. There is one terrible anti-pattern that often pops up in shops that try out Service Oriented Architecture: the services are decoupled when sending messages, but all too often they share the same database (because of architectural mistakes, sending all the necessary state is expensive). Sam urged us, "If you only take one thing from this presentation - PLEASE USE SEPARATE DB SCHEMAS FOR YOUR SERVICES".

A great talk, and an obviously experienced presenter with great flow and nice slides.

Glu: Open Source deployment automation platform by Yan Pujante
Yan Pujante is one of the founders of LinkedIn, and one of the core developers of Glu, an Apache-licensed library/tool for deployment. At LinkedIn, many developers and admins who experienced the pain of redeployment had developed their own scripts to automate it, in various languages. This library grew out of the need for one solid way to redeploy. Essentially, you install the JVM and deploy one Glu agent (written in Groovy) on each node you want to manage. You can then communicate with the agents through a ready-made web GUI and a REST interface, and deploy, restart services, or even send OS-level signals to any live processes. Yan walked through the architecture and the configuration, and ended with a demo. It was certainly interesting and something I want to look at more, but my suggestion would be to start the talk with a short "WOW!"-style demo. In a technically in-depth presentation it is important to start strong to arouse interest, because once listeners start to drift it is very difficult for them to get back. Unfortunately he didn't have time to go into security, although he had prepared slides for it.

The Ops side of Dev with Mårten Gustavsson
Mårten brought up how much of his professional time has been spent maintaining systems and tracking down bugs. This is valuable experience - too often developers only work on projects that are cancelled, or that are dumped on a maintenance crew while "the next version which will fix everything" is developed. Leonard Axelsson observed on Twitter: "every single dev ever should serve in an operations team for at least a few months." Mårten's talk was oriented around how developers can make the operations side easier, no matter who is in charge of it.

Most of the talk revolved around logging, something that is surprisingly hard to do right (or at least most developers are surprisingly poor at it). Some important pointers:
  • Make all developers agree on consistent log levels.
  • Have an action plan ready - who responds to each log level?
  • Rotation and retention. It is no fun trying to grep through gigabyte-sized files (or, even worse, trying to open them in a GUI). It is better to have more and smaller files: put an upper size limit on them, and compress older files as they become less relevant.
  • Spend some time on formatting - make the log files easy on the eyes and on tools.
  • Destinations - you should definitely consider having multiple logger outputs. If you have clusters of services, you want to correlate logging between them, not have to ssh to dozens of servers in turn to locate the one where the customer's problem occurred. You probably also want error/critical log events pushed to, for instance, IRC or XMPP services. Remember the fallacies of distributed computing though: you can't assume the network or the service is always available. Always, always have fallback logging to a local file, or error information critical to debugging may be lost (see the sketch after this list).
  • Logging and other services should be reconfigurable at runtime. You don't want to take your services down just to increase log levels after the fact. Many JVM frameworks have good logging and can be instrumented through JMX, but you have to remember to enable this in configuration before you start the server.
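As a minimal illustration of a couple of these points (a size-capped rotating file, plus a second network destination with the local file as fallback), here is a sketch using Python's standard logging module; the paths, hostnames and sizes are placeholders:

    import logging
    import logging.handlers

    log = logging.getLogger("orders")
    log.setLevel(logging.INFO)

    # Rotate at a fixed size cap and keep a limited number of old files,
    # so no single log file grows into an ungreppable monster.
    file_handler = logging.handlers.RotatingFileHandler(
        "/var/log/myapp/orders.log", maxBytes=50 * 1024 * 1024, backupCount=10)
    file_handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s"))
    log.addHandler(file_handler)

    # Second destination: ship warnings and above to a central collector.
    # SocketHandler connects lazily and drops records it cannot deliver,
    # so the local file above remains the fallback record.
    net_handler = logging.handlers.SocketHandler("logcollector.example.com", 9020)
    net_handler.setLevel(logging.WARNING)
    log.addHandler(net_handler)

    log.info("order 42 accepted")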
The second big topic was metrics. You should instrument your code - from the beginning, because it is during early development that it is cheapest to fix the stability or performance problems you may discover. Always add some sort of smoke tests or "health" methods to your services so that you can ping them manually or automatically to check that they are ok. Make instrumentation a habit. Mårten recommended http://metrics.codahale.com/ as a great library for JVM based projects.
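The library he mentioned is JVM-based; purely to illustrate the habit of instrumenting from day one, here is a rough Python sketch of my own with a timing decorator and a trivial health check (the metric and function names are made up):

    import time
    from collections import defaultdict
    from functools import wraps

    # Extremely naive in-process metrics registry: metric name -> list of durations.
    timings = defaultdict(list)

    def timed(name):
        # Decorator that records how long each call to the wrapped function takes.
        def wrap(func):
            @wraps(func)
            def inner(*args, **kwargs):
                start = time.time()
                try:
                    return func(*args, **kwargs)
                finally:
                    timings[name].append(time.time() - start)
            return inner
        return wrap

    @timed("recommendations.lookup")
    def lookup_recommendations(user_id):
        ...  # real work goes here

    def health():
        # Cheap smoke test that an operator or monitor can call at any time.
        return {"status": "ok", "tracked_metrics": len(timings)}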
I asked whether he instruments manually or uses aspect-oriented libraries like AspectJ, since logging and metrics are classic cross-cutting concerns and often used as examples in aspect tutorials. He said his experience of declarative aspects was so-so: they tended to be too fine-grained and spew too much information into the logs. Instead he favored using annotations together with something like Google Guice, for good control and minimal code impact.
He also mentioned striving for zero-touch deployment configuration, and preferring a Maven plugin like Shade to produce an über-jar for deployment rather than expecting a server with all runtime dependencies correctly set up and maintained.

DNS in the Spotify Infrastructure with John Stäck from Spotify
Another technically in-depth talk which was interesting, but difficult to take notes on. Some of the things I remember:
  • "You would think DNS resolver libraries are clever. Most of them are not". He mentioned some of them cashing infinitely, requiring a server restart when they had changed DNS configurations. But they had managed to eliminate almost all those problematic servers.
  • He mentioned using Geo DNS to route users to servers geographically near them, but said that the database behind Geo DNS is bad and getting worse all the time, as the shrinking IPv4 address space leads to fragmentation. Big domains such as Google also give strange results. His recommendation was to "very carefully consider if you really need Geo DNS".
  • DNS can handle great loads, don't be afraid to utilize it.
  • When it came to deployment, his advice was "Eliminate humans, they are not to be trusted".
This was probably the most sysadmin-oriented talk. Interesting, but it could perhaps have done with a bit more explanation for developers like me who have little interaction with DNS as long as it is working.

Intuitive and distributed load testing with Locust by Carl Byström
Locust is a load testing framework written in Python. When evaluating existing load testing frameworks, Carl decided he wanted to avoid some common features in his own library:
  • Using a GUI to define tests. Programming through a GUI is a pain (something I totally agree with).
  • Declarative. (This surprised me, since declarative is usually considered a good thing. It seems he meant XML and similar test definition formats that are ONLY declarative and are not easy to parameterize or mix with code).
  • Expensive scaling.
Features he DID want when he designed Locust:
  • Configuration using POPC - Plain Old Python Code
  • Small and hackable source
  • Distributed & scalable
  • An intuitive web GUI to monitor and control test runs.
As a rule of thumb when testing, his recommendation was not to measure requests per second, as this is almost meaningless; you want to measure response times and failure ratios. He then demoed running a test through the GUI.
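For a feel of the "plain old Python code" configuration, here is roughly what a minimal Locust test file looks like; note that this uses the API of recent Locust versions, which has changed since 2012, and the endpoints are made up:

    from locust import HttpUser, task, between

    class WebsiteUser(HttpUser):
        # Each simulated user waits 1-5 seconds between tasks.
        wait_time = between(1, 5)

        @task(3)
        def index(self):
            self.client.get("/")

        @task
        def product_page(self):
            self.client.get("/products/42")  # hypothetical endpoint

You start the locust command-line tool pointed at a file like this and then control and monitor the run from the web GUI.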
It did look very impressive, but a little too much of the presentation was spent talking about companies using Locust. Who they are and what they do is not that interesting - only how they use Locust! I would have preferred that this time had been spent on clustering of clients and especially test orchestration: how easy is it to start 50 machines sending requests at the same time, and to summarize their results?

Graphite – The village pump of your team by Leonard Axelsson and Ville Svärd
Graphite is an open source tool written in Python. You instrument your code to write to the Carbon component (based on Twisted), which in turn writes to the Whisper database component. You then use a Django-based GUI to visualize the data. The talk was split into two parts: a demo of how to configure the visualization, and the story of how introducing Graphite into a project changed it. Leonard was initially just running it on his local machine, but when he was looking through it each morning, other developers and eventually also managers gathered around and asked about it. It eventually became a highly valuable tool for the team.
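
Instrumenting code for Graphite can be as simple as writing to Carbon's plaintext protocol, one "<path> <value> <timestamp>" line per data point; here is a minimal sketch, with the hostname and metric name as placeholders:

    import socket
    import time

    def send_metric(name, value, host="carbon.example.com", port=2003):
        # Carbon's plaintext protocol: "<metric path> <value> <unix timestamp>\n"
        line = "%s %f %d\n" % (name, value, int(time.time()))
        sock = socket.create_connection((host, port), timeout=2)
        try:
            sock.sendall(line.encode("ascii"))
        finally:
            sock.close()

    send_metric("webshop.checkout.response_ms", 123.0)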

Continuous Integration – The Good, Bad and Ugly with Brian Riddle
The good:
Brian Riddle works at TV4 and shared his journey of introducing continuous integration using Hudson (most now use the fork called Jenkins) to a very messy CMS project. It is easy to get started: you can download it and run it locally with just "java -jar jenkins.war" instead of asking for permission to set up a new server. Brian highly recommended the book Working Effectively with Legacy Code by Michael Feathers. They used Emma to track code coverage, and while the coverage percentage increased, "it was a lie". They depended on a large number of JSP pages which could fail to compile at runtime, causing redeployments to fail. Eventually they introduced an Ant task which pre-compiled the JSP pages, which at least caught compilation errors ahead of time. But testing them remained problematic, not to mention Flash, which was a large component of the site.
Brian's advice is to "keep deploying until it doesn't hurt". Previously they had scheduled redeployments every six weeks, but they often went wrong. When they started to seriously fix the underlying problems of redeploying, the first major redeploy was an all-nighter before it worked. The second took one redeploy. After that things started working smoothly. Years ago it took over two hours to deploy; now they are down to less than 25 minutes, unsupervised, for the slowest project, and most projects take less than 10 minutes to run all tests and deploy.
They later replaced much of the JSP with Rails, and with Ruby they could use rcov to measure code coverage. They used GitHub to host the source, with hooks so that the whole team was notified of changes via Yammer. GitHub has a feature where people with the right access rights can edit the code directly through the web site and commit changes. Brian said this was a feature that kept him awake at night at first, but after a while it became valuable, as managers and other non-coders could easily edit FAQs, web pages and so on. They could also redeploy themselves. This meant the right people (product owners) could decide when redeployment should occur, and developers were no longer a limiting resource. Developers also experienced less stress and could dedicate more time to development.
The Bad:
The CMS can't be deployed at just any time: 20 minutes of downtime is disastrous if major news breaks during the downtime. PHP and WordPress could be problematic to deploy, and especially WordPress themes introduced by users led to trouble. Mule, the ESB, can hot-redeploy, which is nice, but you can't redeploy the whole system often (?), and it is highly problematic to test through it.
Notifications from GitHub, Jenkins and testing systems are critical, but sometimes devs experience notification overload and start to ignore them.
The ugly:
The closer you are to the user, the harder things become to test. Brian mentioned the special problems of testing:
  • Firefox, IE, Chrome and Safari and all their versions. You can test manually, but that is highly time consuming. Most test frameworks that run a browser and scrape pages are brittle, and work with only a few browsers.
  • qt-webkit
  • Flash. A nightmare to test, especially through Jenkins. The Flash plugin only works on 32-bit Linux, and since Jenkins and all their other servers run on Linux, they either have to downgrade just for Flash testing or else choose not to test.
Very good presentation. I especially liked the slides where he had calculated how much deployments cost before they started working on quick, painless redeploys (a lot) compared with after (just a few dollars).

Scaling GitHub with Zach Holman
There are two problems with scaling - technological and organizational (or human). Zach chose to focus on the latter in his presentation. Happy employees are productive employees, and vice versa, so how does GitHub encourage excitement and reduce toxicity (i.e. keep their people happy)? They have more or less eliminated meetings. People are not forced to be in the office but are allowed to work from wherever they feel they are most productive, as long as they contribute and keep in touch through their chats.
They have a number of perks (though some of those mentioned, like dental plans and generous vacation, are mandatory in Sweden, so those perhaps didn't feel special to us). Employees can spend time learning new things and get paid to talk at conferences (like this one!). They like to hire independent and self-motivated people, but even so they make sure to help new hires get started as quickly as possible by automating and documenting as much as possible. Internal presentations are recorded (using a neat Arduino+Kinect hack) and made available for all to view on the intranet.
Hiring poorly is at least as dangerous as losing experienced people; before you know it, the poor hires are influencing who gets hired in turn. How do you find good people to hire? Zach stressed the importance of extending your personal contact network. You get to know good people by, for instance:
  • Contributing to Open Source
  • Arranging conferences
  • Blogging and other tech posts
  • Sponsorships
  • Hack nights and other smaller meetups
  • Talking at conferences
When it came to the technological side, he echoed Sam Newman on the importance of small, incremental upgrades. At GitHub they can have as many as 5-30 deploys per day! That everything can be automated is key: they have internally developed the Hubot chat bot, which can be scripted through CoffeeScript. Anyone in the chatroom can trigger a test+redeploy (of their big, clustered, mission-critical system) just by typing "hubot deploy github to production".

This was a fantastic presentation: funny, fast moving, with great slides. At the Q&A I asked if they had more or less eliminated managers since they didn't have any meetings, and the answer was yes - they have an extremely flat organization even though they now have over 60 employees.

Summary
This was a small conference, but very well worth attending, with a nice mix of experienced, internationally famous speakers like Sam Newman and Zach Holman and local talent. If this conference returns next year you should try to go; it deserves to grow.

Edit: Links to presentation slides from Brian Riddle available.