Reading Time: 8 minutes
During the first week of June, the Codeship Basic platform experienced several incidents where builds were not processing as expected. In total, there were five incidents and we logged 16 hours with our Codeship Basic build-processing system not operating as expected.
All of the incidents had the same symptoms: Codeship Basic builds were getting “stuck.” For our users, this meant that either their queued builds were never reaching a running state or that a currently running build was essentially dead — no log output, no actual test commands running, but the build was still presented running — that consumed a concurrent build slot and therefore led to the pileup of queued builds.
In this post, we’ll talk about how we addressed this instability by stabilizing and repairing our build systems.
Codeship’s engineering team is dedicated to minimizing customer impact by stabilizing our build systems during any incident. The response process for this set of incidents proved to be challenging because the contributing factors were difficult to diagnose when the system was operating normally.
Our system is designed to have queued builds — your account’s concurrent build limit is part of your subscription configuration — which added an extra layer of complexity when attempting to determine the state of the system. During each incident, we were able to gather more data that eventually led us to identifying the contributing factors.
TL;DR A change to our Codeship Basic machine pool configuration uncovered an issue with build container
created_attimestamps, leading to pipeline failures due to programmatic termination of long-running processes by the Codeship Basic build machine agent.
Alerting and Stabilization Timeline
Our engineering team was paged on June 1 at 8:32 UTC due to a high number of waiting builds on Codeship Basic. After assessing the state of the system, the engineering team took steps to stabilize the system — to move the builds that were “stuck” back into a state where they were eligible for processing. After stabilization, the team began investigating the root cause, thinking it was related to a race condition when build jobs enter the queue, as this incident corresponded with a spike in incoming builds.
On June 1 at 12:14 UTC, the same alert was triggered, resulting in a page. At 12:18 UTC, we updated our status page to notify our customers that we were investigating issues with stuck Codeship Basic builds.
The engineering team completed stabilization steps as indicated in the alert runbook, and was able to identify that
lxc-attach command failures were a contributing factor to the instability. Those failures were not being retried, and our main account service was not being notified properly when the failure did happen.
A brief aside to share that Codeship Basic builds run inside LXC containers.
The engineering team rolled out a new Codeship Basic machine image to address the
lxc-attach failures, and the incident was resolved.
Another alert for too many queued builds fired on June 2 at 20:01 UTC, and our status page was updated at 20:02. The previous update to the Codeship Basic build machine image also enabled more logging around
lxc-attach failures, so the engineering team was able to gather more data points that connected the stuck builds to specific build machines. Stabilization steps were taken according to the alert runbook, matching the actions that were taken the previous day.
The engineering team was paged by a Customer Success engineer at 5:57 UTC and at 10:49 UTC on Saturday, June 3. The CS team proactively notified the engineering team before the automated alert fired, in order to minimize customer impact. Our status page noted the degraded performance at 10:53 UTC. In these two cases, our infrastructure management service was not performing as expected, leading to reduced capacity to process Codeship Basic builds.
The on-call engineer identified a problematic control flow statement in our infrastructure management service that allowed the service to downscale too aggressively. A fix was pushed out, the system was observed processing builds normally, and the incident was resolved.
This incident was not associated with the
lxc-attachfailures, but caused the same symptoms on the Codeship Basic build platform.
On June 5 at 14:01 UTC, our CS team marked our Builds component as degraded on our status page, and an automated page followed at 14:19 UTC.
During this incident, because of additional logging introduced on June 1, we were able to definitively see a connection between
lxc-attach failures, specific build machines, and specific builds that were in a bad state. We did observe that build pipelines that experienced an
lxc-attach error got retried, and after stabilization steps, the system showed no further signs of “stuckness.”
At this point, the full scope of root causes had not been fully discovered or corrected. However, the corrective action we took earlier in the week, along with stabilization steps, allowed the Codeship Basic platform to process builds as normal.
No further alerts fired as a result of queued Basic builds.
Identifying Contributing Factors
Fully understanding the scenarios that resulted in interrupted build processing proved difficult. As noted earlier in this post, our system is designed to have queued builds, as concurrent builds are a factor that a user can control via subscription configuration. Some number of queued builds is expected at any given point, so it’s an unreliable metric for build-processing health (though higher numbers have historically been correlated with processing issues). The pattern of queued builds diverged quite a bit from our normal usage patterns.
Because of the changes made on June 1, the team was able to gather better data during each incident. After each alert and corresponding response time, the team took stabilization actions that were documented in our runbook for the alert. Additionally, the team documented affected machine IDs, builds, and projects and kept track of relevant log output from build machines in our incidents Slack channel, where all incident response communication is logged.
During the June 2 incident, one of the engineers remarked: “idk why these things always seem to happen at 4pm on a Friday.” Though the comment was made in jest, once we began connecting the dots, it was clear that 1) the failures were isolated to certain build machines and 2) the failures seemed more likely to happen during or just after periods of low usage (weekends, Friday afternoons, and during EU morning hours).
Beginning on June 3, a subset of our engineering team paused normal development work to devote 100 percent of its capacity to investigating the systemic causes of these build-processing issues. We plan a bit of “padding” in each sprint in order to address incidents and other interruptions, but it was clear that all other feature development should not continue until these stability issues were addressed.
As part of discovery efforts, one engineer noticed that our build-machine agent is configured to terminate build pipelines that have been running for over five hours. The log output was not specific enough, so he pushed a change out that would record more data about the termination, specifically the timestamps of the container process.
The Main Contributing Factor
This log output led the team to identify the main contributing factor: containers running build pipelines were given a
created_at timestamp of the build machine host, not of their own creation. This was due to our build-machine agent setting the timestamp via
/usr/bin/uptime, which uses the time the host machine came online, not when the container started, even when running inside a container.
In short, pipelines were being given the wrong
created_at time. Because of the behavior of the machine agent, all pipelines that were started after the build machine reached five hours old were programmatically terminated. Through a stroke of luck, a machine had been online for nearly five hours, and the team was able to confirm their hypothesis by observing that machine terminating build pipelines.
Reporting the timestamp of the host instead of the guest with
/usr/bin/uptime is a known behavior, but due to the design of our build platform, this had not presented any problems until that week. We use an elastic machine pool to run Codeship Basic builds, and up until this string of incidents, machines averaged about three hours online before being terminated.
A few days earlier, we made changes to our infrastructure management system to reduce frequency of upscaling and downscaling events, making the pool a bit more consistent. With these changes, build machines simply had the opportunity to stay online longer than before, sometimes reaching five or six hours in age.
With five hours being the threshold for programmatic termination of build pipelines, we saw a dramatic increase in failures after we made changes to stabilize the machine pool. A change in one of our services caused a dramatic increase in failures in another area of our build-platform infrastructure.
Over the course of the following week, the team responsible for the build platform took several corrective actions to not only stabilize the system but to correct the known root causes and increase resiliency in the face of infrastructure failures or
- Container uptime is reported by a new
created_atmethod that is loosely based on this known workaround to
lxc-attachjobs are now retried more aggressively and can be retried on different machines in the case of failure. We allow for booting containers to time out or to have problems in the boot process and will resiliently handle them by rescheduling those jobs.
- Certain errors that used to cause builds to finish with an
infrastructure_errorfailure have been given appropriate names and are retried when appropriate.
- Pipeline IDs are passed back to our central account and project management service in the case of an unexpected build termination. We will automatically rerun the build in some cases (ie, provided no deployment step had started).
- Our infrastructure-management system has a minimum threshold for downscaling events to avoid a situation where low build activity means the possibility of downscaling.
What We’ll Do Differently Next Time
We chose not to update the Codeship status page twice in this week, as the system was not showing symptoms after stabilization actions were completed. In hindsight, we feel that we should have communicated that we were investigating issues related to stuck builds. Since our system is designed to have queued builds (our users can change their concurrent build limit as part of their subscription configuration), it was difficult to definitively say that no customers were impacted.
To that point, we’ve committed ourselves to publish public post mortems after incidents with significant customer impact. We know that you put a lot of trust in Codeship to test and deploy your applications; we want to keep that trust by staying open and transparent when our system isn’t operating as expected.