We had an outage – here’s what happened

At the beginning of February we experienced a database outage that lasted for a couple of days. In this post we will go through the different stages of the incident, how we approached them at the time, and how we plan to prevent similar issues in the future.

We are deeply sorry for this episode and the interruption in service it caused, and we hope we can address your concerns through this post and, above all, through our future work.

A timeline of the database problems

Thursday, January 30

For the first time, an ActiveRecord::ConnectionTimeoutError popped up in our exception tracking. This exception is raised if ActiveRecord, Rails' ORM library, can't obtain a database connection within a certain timeframe (5 seconds by default). Usually this indicates that the connection pool ActiveRecord manages is too small: all of the pooled connections are already in use, so ActiveRecord isn't able to acquire one.

Our first approach was to increase the connection pool limit and monitor the situation.
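
For reference, raising the pool limit in a Rails app is a small change. The initializer below is a minimal sketch of the idea, assuming the pool size comes from a hypothetical DB_POOL environment variable; it is not our exact code.

    # config/initializers/database_pool.rb -- illustrative sketch only.
    # Re-establish the ActiveRecord connection with a pool sized from the
    # environment, so the limit can be tuned per process without a code change.
    ActiveSupport.on_load(:active_record) do
      db_config = ActiveRecord::Base.connection_config.dup   # current database settings
      db_config[:pool] = Integer(ENV.fetch("DB_POOL", 5))    # Rails defaults to 5
      ActiveRecord::Base.establish_connection(db_config)
    end

A useful rule of thumb: each process needs at least as many pooled connections as it runs threads, plus a little headroom.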

Monday, February 3rd and Tuesday, February 4th

At first, it looked like increasing the connection pool had solved the problem, but eventually the exception popped up again. We tweaked the code responsible for calculating the connection pool size for each of our Rails processes and audited our code to find out whether any database transactions were neither committed nor rolled back. By then the exception was appearing every hour, and the only way we could temporarily resolve it was by restarting the whole web application. This affected many of our users and customers, as regular error messages started to show up.
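
A related thing worth auditing for is code that checks out a connection and never hands it back, which slowly drains the pool. A simplified illustration of the pattern (the Build model is hypothetical, and this is not taken from our codebase):

    # Risky: a manually spawned thread implicitly checks out a connection for its
    # query; if it is never returned, the pool drains until checkouts time out.
    Thread.new do
      Build.where(state: "running").count   # "Build" is a hypothetical model
    end

    # Safer: borrow a connection only for the duration of the block. It is returned
    # to the pool when the block exits, even if an exception is raised.
    Thread.new do
      ActiveRecord::Base.connection_pool.with_connection do
        Build.where(state: "running").count
      end
    end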

Wednesday, February 5th

We were already using the Heroku integration from Librato and had built a special dashboard focused on our most important database and Heroku dyno metrics so we could quickly spot any errors. We noticed that database system load was high and the memory available for caching was low. Looking at a live view of running database queries, we saw that certain queries seemed to run for minutes and appeared to be blocking database connections. We decided to upgrade our database to increase memory and hoped that this would solve the connection problem.
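
A similar live view can be pulled straight from Postgres' pg_stat_activity view, roughly like this from a Rails console (assuming Postgres 9.2 or newer column names):

    # List currently running queries, longest-running first.
    long_running_sql = <<-SQL
      SELECT pid,
             now() - query_start AS runtime,
             state,
             left(query, 80)     AS query
        FROM pg_stat_activity
       WHERE state <> 'idle'
       ORDER BY runtime DESC
    SQL

    ActiveRecord::Base.connection.select_all(long_running_sql).each do |row|
      puts format("%-8s %-22s %-8s %s", row["pid"], row["runtime"], row["state"], row["query"])
    end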

Heroku offers a nice feature called "database followers" which lets you provision a new database and tell it to follow an existing one. This "follower database" is an identical copy of the original, and all changes to the original database become available on the follower, usually within a split second.

It was around 9pm UTC when the follower database was ready. This is the time of day when our application receives the most traffic, but we had no choice but to switch over to the new database. The switch took around one minute of downtime.

We watched our metrics right after the switch: the application seemed much slower than before. This was expected, as the database caches still needed to warm up a bit. After an hour it was still slow. Additionally, we didn't receive any database metrics from Heroku. Since these problems didn't get better, we decided to create another follower database at around 12am UTC, because the current one seemed flawed.

Thursday, February 6th

After switching to the second follower database at around 2am UTC and monitoring it for a few hours, everything seemed to be resolved.

A few hours later, another outage occurred. The new database didn't perform as expected. We noticed that certain queries were slower than others, in particular queries that involved certain data records. We ran ANALYZE table_name on the tables containing these records.

ANALYZE collects statistics about the contents of tables in the database, and stores the results in the pg_statistic system catalog. Subsequently, the query planner uses these statistics to help determine the most efficient execution plans for queries. – http://www.postgresql.org/docs/9.3/static/sql-analyze.html
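
For anyone following along, the table statistics can also be inspected before re-running ANALYZE. A minimal sketch from a Rails console; the "builds" table name is made up:

    # When did Postgres last gather statistics for each table, manually or via autovacuum?
    stats_sql = <<-SQL
      SELECT relname, last_analyze, last_autoanalyze
        FROM pg_stat_user_tables
       ORDER BY relname
    SQL

    ActiveRecord::Base.connection.select_all(stats_sql).each do |row|
      puts format("%-30s analyzed: %s / auto: %s",
                  row["relname"], row["last_analyze"], row["last_autoanalyze"])
    end

    # Refresh planner statistics for a suspect table ("builds" is a hypothetical name).
    ActiveRecord::Base.connection.execute("ANALYZE builds")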

Friday, February 7th

During the early hours of February 7th we received a notification from Heroku that the machine our database was running on had died and that the database had been moved to a new machine. This was unrelated to our previous database problems.

During the day we again saw long-running queries involving certain data records. It looked like some records were damaged, so we decided to recreate our database completely and scheduled a maintenance window for the next day.

Saturday, February 8th

We put our application into maintenance mode and pulled a backup of our database. We created a new database of the next larger size on Heroku and restored the backup into it. We monitored the database over the following days and didn't see any more database errors.

Over the last few days we've run further load tests to determine whether higher load or more long-running requests could cause further problems. Those tests aren't finished, but so far the application has kept up with much higher load. There are certain optimisations in our codebase we will work on in the future, to reduce the load placed on specific parts of our system or to reimplement those infrastructure parts so they can deal with much higher load.
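
Even a small script can generate useful load against a staging system while those bigger tests take shape. A rough sketch in plain Ruby; the endpoint and numbers are placeholders, not our actual test setup:

    # Fire concurrent requests at a staging endpoint and report slow responses.
    require "net/http"
    require "uri"

    TARGET   = URI("https://staging.example.com/health")   # hypothetical endpoint
    THREADS  = 20                                           # concurrent clients
    REQUESTS = 50                                           # requests per client

    workers = THREADS.times.map do
      Thread.new do
        REQUESTS.times do
          started  = Time.now
          response = Net::HTTP.get_response(TARGET)
          elapsed  = Time.now - started
          puts "slow response (#{response.code}): #{elapsed.round(2)}s" if elapsed > 1.0
        end
      end
    end
    workers.each(&:join)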

Our learnings and improvements

We agreed on a number of measures to prevent this from happening again:

  • We’ve now got a better understanding of problematic areas in our application and how we can fix them.

  • Better and stricter monitoring is in place that will notify us if any of those or similar errors happen again in the future so we can act immediately.

  • Our practices for dealing with database errors have improved, so in case of larger problems we can switch to a different database or perform maintenance more efficiently and with minimal interruption.

  • Regular, stronger load tests will put our application under high load to uncover possible points of failure.

Conclusion

Heroku and its hosted Postgres solution have helped us tremendously in the past to bring our product to market and iterate quickly. We still consider them the best service for us at this point. But to deal with faulty databases or problems in our infrastructure, we needed and still need to improve our practices and monitoring so we can keep providing the extremely high quality of service we want our customers to have. This is especially true because we do not have full access to those Postgres instances and need to rely on other data to detect a faulty state.

We will keep pushing our tools and workflows so that these interruptions don't happen in the first place, and so that they are dealt with swiftly if they do occur.

We are very sorry that this happened and sincerely hope that through our efforts we can rebuild any trust that might have been lost during this episode. As always, we would love to get your feedback and suggestions on how we can improve, and to hear what your expectations and worries are.

Ship long and prosper!

Join the Discussion

Leave us a comment on what you think about this topic or if you'd like to add something.

  • PeanutGallery6

    If you are using activerecord, it does not allow you to use the best optimizations in SQL databases.

    • While ActiveRecord is an abstraction that will not necessarily produce the most optimized queries for every request, there is definitely a lot you can do to optimize, which we will do in the future.

  • cphoenix

    This is not the same problem as the random-request-routing problem Heroku had/has, but it “feels” similar – including the lack of accurate/available metrics. You might want to check all your assumptions about exactly what Heroku may be doing to create such a high load.

    It might be worth setting up your own Postgres database on your own machine, with the same data and usage, and see for yourself whether and why it takes so much time. And… don’t believe any measurements Heroku gives you, if your own experience contradicts them. The second link below documents how Heroku’s monitoring app gave disastrously incorrect information about response times and delays.

    http://tiwatson.com/blog/2011-2-17-heroku-no-longer-using-a-global-request-queue
    http://news.rapgenius.com/James-somers-herokus-ugly-secret-annotated

    • gumaflux

      @cphoenix great input, very useful, thanks

    • So far we’ve been very happy with the responses and data that Heroku provides us and don’t really have any doubts that the data we receive from them is correct.

      That the data somehow seemed to get corrupted is a concern we will definitely keep checking in the future to see if it happens again.

      • cphoenix

        I was referring to “Additionaly we didn’t receive any database metrics from Heroku.” And the story of how the database that had been working suddenly started working much worse than expected without a clear explanation in the available metrics. I’m just wondering if Heroku changed something, and the change had an effect that the metrics aren’t catching.

        • Yes, not having complete control of and insight into every part of the technology stack is definitely a problem, but on the other hand it is also part of the trade-off of using a hosted service. There were definitely a few hiccups along the way that made using it a little harder than it could be.

  • Aravind

    Well, to be fair, I thought you were going to end this post on how you optimized those blood-sucking queries... Not complaining though!

    • We have started to look into optimizing some of the queries, but the main problem wasn't really the speed of the queries; it was faulty database indexes, which we have since fixed.