In today’s video, I walk you through five suggestions for building an infrastructure geared towards helping teams innovate from the ground up. I share tips on what’s worked at Codeship as well as a whole list of helpful tools and solutions for your team. Check out the list on my personal blog, watch the video or read the transcript, and make sure to tell us what you think!
Discuss this article on Hacker News: https://news.ycombinator.com/item?id=9219848
Every company and product is under constant pressure to change. Competitors might come up with a new feature or product, customers demand changes, and even our own team, processes, and technology need to change constantly, too.
We have to deal with that constant change in every part of our company, but especially in our technology. Bad infrastructure can hinder us from innovating and learning quickly enough.
Bad infrastructure can make us fail. It determines the speed at which we can build our product. Moving fast is a key component for every company. Without it you’re at a considerable disadvantage and probably already dead.
With that in mind I want to give you five suggestions that help us and our customers a lot in building our product and innovating quickly.
First of all, the easiest way to optimise your time is by not doing certain things and letting somebody else take care of them. There are great services out there for anything that isn’t your core product and value proposition. Make use of these services to really focus on building value and finding your product-market fit.
Building and maintaining infrastructure is hard, so unless it’s a core part of your value proposition, it’s a waste of time.
I’ve assembled a list of interesting tools that we use at Codeship or that I personally like a lot. I’ll keep updating it when I find new tools, and you can take a look to get started with your infrastructure easily.
Build to replace
My second suggestion is to build to replace. Too often we start building monoliths that get way too entangled and do everything in one codebase. This is incredibly hard to manage and you will have to resolve this in the longer term.
Instead, start by building microservices that do one thing and one thing only. As those codebases are separated from the get go, you can use queueing or messaging systems to connect them together. These separate codebases have a much clearer separation of concerns and will be a lot easier to manage or replace in the long term as soon as a different implementation becomes necessary.
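To make the idea of queue-connected services concrete, here is a minimal sketch in Python. It is not how Codeship wires its services together: the in-process `queue.Queue` stands in for a real messaging system, and the names `receive_hook`, `build_worker`, and `run_demo` are hypothetical. The point is only that the two sides share a message format and nothing else, so either one can be replaced independently.

```python
import json
import queue
import threading


def receive_hook(outbox, repo, commit):
    """Service A (hypothetical): accepts a webhook and publishes a message.

    It knows the message format, but nothing about who consumes it.
    """
    outbox.put(json.dumps({"repo": repo, "commit": commit}))


def build_worker(inbox, results):
    """Service B (hypothetical): consumes messages until it sees the None sentinel."""
    while True:
        raw = inbox.get()
        if raw is None:
            break
        message = json.loads(raw)
        results.append(f"built {message['repo']}@{message['commit']}")


def run_demo():
    # queue.Queue is an in-process stand-in for a real queueing system.
    messages = queue.Queue()
    results = []
    worker = threading.Thread(target=build_worker, args=(messages, results))
    worker.start()
    receive_hook(messages, "example/app", "abc123")
    receive_hook(messages, "example/api", "def456")
    messages.put(None)  # tell the worker to stop
    worker.join()
    return results
```

Because the only contract is the JSON message, swapping `build_worker` for a different implementation later does not touch the publishing side.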
We’re now at a point where services like Heroku, Elastic Beanstalk, or the recently launched AWS Lambda and EC2 Container Service take away pretty much all of the burden of running infrastructure. We give those services a unit of work that we want to have fulfilled, without being interested in too many of the underlying details. With that we can easily standardise most of our stack while keeping the ability to build specialised infrastructure for the separate services that need it.
A mistake that we’ve certainly made, and that I see repeated far too often, is over-optimising on costs early on. The time I’ve spent on running several of our apps or workers on the same machine just to save a couple of bucks is insane. Setting realistic spending goals and then actually spending that money on a distributed infrastructure would have made much more sense. And it is definitely cheaper in the short run as well as the long run once you factor in the cost of your own work.
Let repositories drive infrastructure
Now that we have many different apps and services, the complexity moves from our infrastructure into our processes to maintain those services. Having a consistent workflow that makes it easy to switch between codebases and work on them is very important.
We want to be able to focus on our code and not get distracted by our infrastructure again. A workflow that has helped us tremendously with that is repository-driven infrastructure. So my third suggestion is letting your repositories drive your infrastructure. Let’s take a look at how this works in our team.
We use GitHub Flow: for every new change, you open a new branch and work on that branch until you want to get feedback. After opening a pull request and receiving feedback from the team, you merge it into the master branch, which will then be deployed automatically.
The deployment can be to staging or production. In our case it’s always to production, but the important thing is that a merge between branches triggers that without any further action necessary.
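The rule "a merge triggers a deployment without further action" can be sketched as a small decision function. This is an illustration under assumptions, not Codeship's implementation: `PushEvent`, `DEPLOY_TARGETS`, and `deployment_for` are hypothetical names, and gating on a passing test run is an assumption added for the example.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical mapping from branch name to deployment target.
# In the workflow described above, master always goes to production.
DEPLOY_TARGETS = {"master": "production"}


@dataclass
class PushEvent:
    repo: str
    branch: str
    commit: str
    tests_passed: bool  # assumption: the system only deploys tested commits


def deployment_for(event: PushEvent) -> Optional[str]:
    """Return the target to deploy to, or None if this push deploys nothing."""
    if not event.tests_passed:
        return None
    return DEPLOY_TARGETS.get(event.branch)
```

A staging setup would only need another entry in the mapping, for example a `staging` branch pointing at a staging target; the workflow for the engineer stays the same.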
We deploy everything in our infrastructure like that, including our DNS, which has made it incredibly easy to onboard new people because it’s trivial to switch between codebases. We hired six more engineers in the last two months, for a total of eighteen people now, and this helped us a lot to get them started.
The same workflow is used everywhere, so new team members don’t need to know the details of how to deploy specific parts of our infrastructure from day one. Those details are taken care of by the system so they are productive early, but they can look into the details anytime later on.
Eventually our infrastructure will be consistent with what is in our repository. And if any problems happen we’ll be notified while we’re already working on the next task.
But then, how do I build an infrastructure that makes it possible to move that quickly and automate everything? Well, one of the keys for us is immutable infrastructure. So suggestion number four is build an immutable infrastructure.
Build an immutable infrastructure
An immutable infrastructure is built out of immutable components, which are replaced on every deployment. Each time we want to deploy, we create a new image that contains everything from the OS up to the application.
We fully test and validate that new image to make sure everything works fine and then roll it out. Let me walk you through an example. [Editor’s note: see video for full example.]
As you can see here, we have a new release that goes through testing, build and validation to create a new image. That image is stored in our image store. It could be an Amazon AMI or a Docker image, for example.
Then, whenever we need to deploy we start a new cluster, move the traffic over to the new cluster, and as soon as everything works fine, we shut down the old machines. There is no in-place update happening. They are really completely new machines.
As machines can vanish anytime, data is effectively siloed into specific parts of your infrastructure. There are no guarantees that data stored in one of the application clusters will still be available the next minute. Effective siloing is important so that state doesn’t creep into other parts of your infrastructure, where it would make that part of the system hard to replace.
Deployment and rollback are essentially the same operation which makes handling bad deployments trivial. You simply start a new cluster with the old image and everything is rolled back.
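The deploy and rollback flow described above can be modelled in a few lines. This is a toy in-memory sketch, not a real provisioning script: `Cluster`, `Platform`, and the `healthy` callback are hypothetical, and a real system would start machines from an image and switch a load balancer instead of reassigning a variable.

```python
import itertools
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Cluster:
    cluster_id: int
    image: str
    running: bool = True


class Platform:
    """Toy model: every deploy creates a new cluster, nothing is updated in place."""

    def __init__(self, healthy: Callable[[Cluster], bool]):
        self._ids = itertools.count(1)
        self._healthy = healthy  # hypothetical validation hook
        self.live: Optional[Cluster] = None
        self.history: List[str] = []

    def deploy(self, image: str) -> bool:
        new = Cluster(next(self._ids), image)
        if not self._healthy(new):
            # The new cluster never receives traffic; the old one stays live.
            new.running = False
            self.history.append(f"rejected {image}")
            return False
        old, self.live = self.live, new  # move traffic to the new cluster
        if old is not None:
            old.running = False  # shut down the old machines
        self.history.append(f"live {image}")
        return True

    def rollback(self, previous_image: str) -> bool:
        # Rolling back is just deploying the old image onto new machines.
        return self.deploy(previous_image)
```

Note that `rollback` contains no logic of its own, which is the property the text describes: there is one code path to test and trust.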
But what are the problems now, with running this kind of small service infrastructure that gets deployed constantly?
We need to have a proper overview and operations in place to be able to determine the current state of the system. Otherwise, due to its fast-changing nature, we won’t be able to cope with that kind of system at all.
This brings me to my fifth and last suggestion: unified logging, monitoring, and alerting.
There needs to be one source of truth for logging and monitoring of all of your applications. Every piece of data that helps paint that picture needs to be fed into the same system, so you can correlate problems between parts of your infrastructure. In a system that changes constantly, the server that caused a problem might not even be around any more.
Logs are an extremely effective, but also an extremely underused tool to understand and monitor our infrastructure. Modern log services let us pump constant data about the status and operations of our infrastructure into a queryable service without us having to build any of that. Even the smallest team can get started easily. Those logs can then be used to build graphs in metrics systems to correlate errors or deep dive into past systems. Start thinking through a proper log strategy so you can follow all processes happening in your system, get constant data about the status of those systems and have the metadata necessary to fix things in the future. If you are looking for services to use for that, check out the blogpost I mentioned before.
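One possible starting point for such a log strategy is to emit every log line as JSON with the metadata attached, so a log service can query on it. The sketch below uses only Python's standard `logging` module; the `JsonFormatter` class, the `make_logger` helper, and the `context` key passed through `extra` are conventions made up for this example, not part of any particular log service.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        entry = {
            "level": record.levelname,
            "service": record.name,
            "message": record.getMessage(),
        }
        # "context" is a hypothetical convention: metadata passed via extra=.
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry, sort_keys=True)


def make_logger(service, stream):
    """Create a logger for one service that writes JSON lines to stream."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    return logger
```

A call such as `log.info("build started", extra={"context": {"build_id": 42}})` then produces a line that carries the build ID alongside the message, which is the kind of metadata that makes later correlation possible.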
While having the data is very important, making it accessible is at least as important. A while ago, we added a link to our admin tools for every build and deployment running on Codeship. That lets us view all system logs corresponding to that build. Thus, we can follow the build throughout our whole system, from the receiving hook on our web application, to our build infrastructure and updates sent back to our database.
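Following a single build through several services comes down to filtering a shared log stream by one identifier. The sketch below assumes, purely for illustration, that every service writes JSON lines containing a `build_id` and an ISO 8601 `timestamp`; the function name `trace_build` and those field names are hypothetical and not taken from Codeship's admin tools.

```python
import json


def trace_build(log_lines, build_id):
    """Return all log entries for one build, ordered by timestamp.

    Lines that are not valid JSON objects are skipped rather than treated as errors.
    """
    events = []
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(entry, dict) and entry.get("build_id") == build_id:
            events.append(entry)
    # ISO 8601 timestamps in the same timezone sort correctly as plain strings.
    return sorted(events, key=lambda e: e.get("timestamp", ""))
```

In practice a hosted log service would run this query for you; the link in the admin tool only needs to carry the build ID.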
All of this is one click away and it made us use our logs a lot more. Especially in times of problems, you want to be able to access all data very quickly and without any overhead. But even in non-critical situations this is incredibly helpful. So now that we’ve got all of this data available at our fingertips, we want to make sure that, when a crisis hits, the whole team bases its discussion on the same data so no miscommunication happens.
We live in Slack, so it’s only natural to pull this data into our Slack chat, leading to ChatOps. We’re not even close to being the first team doing this. It’s been incredibly helpful to move important system data even closer to the team, especially while fixing issues. Again, accessibility is the key to making sure your team can work without overhead in any situation. All of this naturally takes some time to implement, although existing services make it very easy and fast.
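As a rough sketch of pulling system data into chat: Slack's incoming webhooks accept a JSON body with a `text` field, so an alert can be a small HTTP POST. The helper names `build_alert` and `post_to_slack`, the message wording, and the example URLs are assumptions for illustration; the webhook URL itself would come from your own Slack configuration.

```python
import json
import urllib.request


def build_alert(service, status, log_url):
    """Build the JSON payload for a chat alert (message format is made up)."""
    return {"text": f"[{status.upper()}] {service}: see logs at {log_url}"}


def post_to_slack(webhook_url, payload):
    """POST the payload to an incoming webhook and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

Including the link to the relevant logs in the message keeps the "one click away" property: everyone in the channel looks at the same data.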
But not investing enough time into this will cost you tremendously. Not just in how slow you will be in determining and fixing problems, but also in how your customers see the stability of your service. Make sure you invest sufficient time early in setting up the right services so you are aware of what’s happening in your infrastructure.
And the great thing is that once you have all that wealth of data, you can start understanding the problems in your infrastructure, fix them, and also implement automated healing and scaling.
Giving customers a great experience shouldn’t be dependent on somebody sitting in front of a monitor all day. If we can make sure the system works and will get itself into a good state whenever problems happen, we’ve bought ourselves a lot of time to be really focused on building a product that gives a lot of value to our customers.
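A self-healing loop can be very small when instances are disposable. The sketch below is one possible shape, under the assumption that unhealthy instances are replaced rather than repaired; `heal`, `is_healthy`, and `launch_replacement` are hypothetical names, and the two callbacks stand in for whatever monitoring data and provisioning API you actually use.

```python
def heal(instances, is_healthy, launch_replacement):
    """Replace every unhealthy instance and report what was replaced.

    is_healthy and launch_replacement are injected callbacks (hypothetical):
    the first would consult monitoring data, the second would start a
    fresh instance from the current image.
    """
    current = []
    replaced = []
    for instance in instances:
        if is_healthy(instance):
            current.append(instance)
        else:
            current.append(launch_replacement(instance))
            replaced.append(instance)
    return current, replaced
```

Run on a schedule, a loop like this gets the system back into a good state without anybody watching a monitor, and the `replaced` list is exactly what you would want to log and alert on.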
Five to apply
So to recap, the five suggestions:
- Use services
- Build to replace
- Let repositories drive infrastructure
- Build an immutable infrastructure
- Unify logging, monitoring, and alerting
All of those are in support of one higher goal: to focus on your code and product. Losing focus is an incredibly costly problem that many teams, but especially startups, face. Make sure you’re not one of them!