On July 18, 2017, from 11:26 AM EDT to 4:37 PM EDT, Codeship Basic builds were delayed in processing, and there was an elevated level of system-initiated build failures. Codeship users saw that their build pipelines did not start running immediately after they were created because of contention for servers in our elastic build infrastructure. In some cases, builds were marked as failed with the system error state.
The Codeship Basic elastic build infrastructure relies on compute resources provided by Amazon Web Services EC2. We maintain a dynamic fleet of instances that we scale up and down automatically based on system demand. Demand is driven by the number of concurrent build pipelines across our customer base, which naturally fluctuates over the course of a day. When our infrastructure decides it needs to scale up, it issues API calls to AWS to start additional EC2 instances.
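The fleet-sizing decision described above can be sketched as a simple function. The packing density and fleet floor below are illustrative assumptions, not Codeship's real parameters:

```python
import math

# Illustrative sketch of a demand-driven fleet-sizing decision.
# BUILDS_PER_INSTANCE and MIN_FLEET_SIZE are assumed values, not
# Codeship's actual configuration.
BUILDS_PER_INSTANCE = 4   # assumed: concurrent builds one instance can run
MIN_FLEET_SIZE = 10       # assumed: floor kept warm to absorb small spikes

def desired_fleet_size(concurrent_builds: int) -> int:
    """How many EC2 instances the fleet should hold for the current load."""
    needed = math.ceil(concurrent_builds / BUILDS_PER_INSTANCE)
    return max(MIN_FLEET_SIZE, needed)
```

A fleet manager would compare this target to the running instance count and issue EC2 launch or terminate calls to close the gap.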
At the outset of this event, we started receiving unexpected and unhandled responses to our API calls indicating that AWS had no available capacity for the instance type we requested. As a result, our fleet stayed at a static size that was too small for the number of builds waiting to run.
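EC2 signals this condition with the `InsufficientInstanceCapacity` error code. A minimal sketch of recognizing it, assuming an error payload shaped like the EC2 API's error structure (the handler itself is illustrative, not our production code):

```python
# Real EC2 error code for the "no capacity" condition; the handler and the
# dict shape it inspects are an illustrative sketch.
CAPACITY_ERROR = "InsufficientInstanceCapacity"

def is_capacity_error(error_response: dict) -> bool:
    """True when a RunInstances call failed for lack of EC2 capacity."""
    return error_response.get("Error", {}).get("Code") == CAPACITY_ERROR
```

Treating this code as an expected, alertable condition (rather than an unhandled failure) is what we were missing at the time.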
What Did We Do About It?
With some persistence, manual intervention, and a switch to a different but compatible instance type that wasn't short on capacity at AWS, we were able to scale the fleet to a size appropriate for handling the backlog of builds waiting to run. We were fairly aggressive in this effort, as we were eager to get our customers' builds running as quickly as possible.
The downside of our aggressive scale-up was that we found a limit to our scaling: the number of connections we could make to our in-memory data store (Redis). When we reached this limit, many of our systems lost connectivity to this critical component, and we had to scale back down to stabilize. In the process of scaling down, we caused a spike of build failures because instances terminated before they had cleaned up.
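The ceiling we hit comes down to simple arithmetic: connection demand grows linearly with fleet size until it crosses the server's limit. The per-instance connection count below is an assumed value, and 10000 is Redis's default `maxclients`, not necessarily our configured limit:

```python
# Illustrative connection budget; both constants are assumptions.
CONNS_PER_INSTANCE = 8      # assumed: processes per instance x conns each
REDIS_MAXCLIENTS = 10000    # Redis's default maxclients; actual limit may differ

def connection_demand(fleet_size: int) -> int:
    """Total Redis connections a fleet of this size would open."""
    return fleet_size * CONNS_PER_INSTANCE

def max_safe_fleet_size() -> int:
    """Largest fleet that stays within the Redis connection limit."""
    return REDIS_MAXCLIENTS // CONNS_PER_INSTANCE
```

Under these assumed numbers, scaling past `max_safe_fleet_size()` instances is what tips the whole system into losing Redis connectivity.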
Once things stabilized and everything reconnected to Redis, we turned control back over to our elastic build infrastructure, as AWS was once again responding successfully to our API requests for EC2 instances. Unfortunately, the system automatically made the same mistake we had made when we scaled up manually: it again exhausted our connections to Redis, causing the same problem as before. Having learned from the first occurrence, we were able to stabilize in a less destructive way, so fewer builds were impacted this time.
After stabilization, the fleet was scaled up in a controlled way to a size appropriate to work through the backlog of waiting builds. This time, we did so without exceeding the connection limit, and the system returned to its normal state with automatic scaling.
What Did We Learn and What Will We Do for the Future?
We immediately added triggers to our monitoring infrastructure to alert us whenever we receive capacity errors from AWS. This was a condition we had not (but should have) anticipated, and its adverse effects can be mitigated with early indication of the problem.
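A trigger like this can be sketched as a sliding-window counter over capacity errors. The window length and threshold below are assumed values, not our production settings:

```python
from collections import deque

# Illustrative sliding-window alert for AWS capacity errors.
WINDOW_SECONDS = 300   # assumed: look at the last five minutes
ERROR_THRESHOLD = 3    # assumed: alert on the third error in the window

class CapacityErrorAlert:
    def __init__(self):
        self._events = deque()  # timestamps of observed capacity errors

    def record(self, now: float) -> bool:
        """Record one capacity error; return True if we should alert."""
        self._events.append(now)
        # Drop errors that have fallen out of the window.
        while self._events and now - self._events[0] > WINDOW_SECONDS:
            self._events.popleft()
        return len(self._events) >= ERROR_THRESHOLD
```

The windowing keeps a single transient error from paging anyone while still catching a sustained capacity shortage early.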
We also added triggers to alert when we come within a safe margin of our Redis connection limit. We’re now investigating whether we need to size up our Redis infrastructure to enable more connections or whether we can tune our clients to use fewer connections with acceptable performance.
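One way to "use fewer connections" is to share a small, bounded pool across workers instead of opening one connection per worker. A minimal sketch using a stdlib queue, where the connection factory is a stand-in for a real Redis client constructor:

```python
import queue

class BoundedPool:
    """Minimal bounded pool sketch: at most max_size connections ever
    exist, and callers block for a free one rather than opening more."""

    def __init__(self, factory, max_size: int):
        # Eagerly create the fixed set of connections; factory is a
        # stand-in for connecting a real Redis client.
        self._pool = queue.Queue()
        for _ in range(max_size):
            self._pool.put(factory())

    def acquire(self, timeout=None):
        """Block until a connection is free, then hand it out."""
        return self._pool.get(timeout=timeout)

    def release(self, conn):
        """Return a connection to the pool for reuse."""
        self._pool.put(conn)
```

With a pool like this, total connection count is fixed per instance regardless of how many workers contend for it, which makes the fleet-wide demand predictable.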
We will also be enhancing our fleet management system to automatically use a diverse set of instance types when needed. Longer term, we would like to leverage a more diverse set of cloud compute resources, either from multiple providers or from multiple resource groups within our existing provider.
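The planned diversification can be sketched as trying a prioritized list of compatible instance types until one launches. The type names, the exception class, and `launch_fn` (a stand-in for the EC2 RunInstances call) are all illustrative:

```python
# Illustrative fallback across compatible instance types; the list is an
# example ordering, not our actual fleet configuration.
PREFERRED_TYPES = ["c4.xlarge", "c3.xlarge", "m4.xlarge"]

class NoCapacityError(Exception):
    """Raised when AWS reports no capacity for a requested type."""

def launch_with_fallback(launch_fn, count: int):
    """Try each compatible type in priority order until one launches."""
    for instance_type in PREFERRED_TYPES:
        try:
            return launch_fn(instance_type, count)
        except NoCapacityError:
            continue  # this type is out of capacity; try the next one
    raise NoCapacityError("no capacity for any compatible instance type")
```

A capacity shortage in a single instance type then degrades to a slightly less preferred type instead of stalling the whole fleet.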
We sincerely regret the impact this issue has had on our customers and are committed to continually improving our systems and infrastructure to prevent similar problems in the future and deliver the highest quality of service.