Software development has changed dramatically in recent years, as new technologies such as containers and innovative processes like DevOps have altered how software is planned, implemented, and deployed. Development tools have become more advanced, and the barrier to entry for engineering teams looking to improve their integration and deployment practices has never been lower.
In this installment of Software Development 2.0, seasoned software engineering leader Real Deprez gives his take on how valuable a continuous integration and continuous deployment (CI/CD) system can be for a development team.
Real also takes us through his journey managing and growing development teams at TrueCar, Edmunds, and now as Head of Engineering at Headspace.
I graduated college with a studio art degree, but I took a few computer science classes. The concept of creating user experiences in code and being able to interact with them immediately, combined with the novelty of the web at the time, sparked my interest in web development.
Initially I was drawn to the front-end because of the visual element, and that fast feedback loop. Writing user interface code and instantly seeing it, and being able to make changes in real time was really gratifying. I’ve always been very user focused, and it’s great to be able to interact with a product as a user would, and then refine and iterate on it.
At one point early on I wanted more server-side experience so I started a full-stack side project. I learned what I set out to tackle but also learned a great deal about SEO, which was helpful later in my career. I think anytime you pick up a side project, whether it’s directly related to your work or not, it’s going to be beneficial. Often the most useful knowledge you’ll gain won’t be what you expected.
Automating build tasks or deployments is not new – good developers will always try to automate repetitive tasks to improve their workflow. In a connected (web and SaaS) world, it’s possible to do these things more often. That’s beneficial, as shipping smaller changesets reduces risk, minimizing the impact of failures. Additionally, getting features to customers and feedback faster allows a product team to be more successful.
As far as I know, CI/CD and its related terminology, process definitions, and best practices came from the folks at ThoughtWorks (Jez Humble, David Farley, and others). The amount of work they put into defining and supporting CI/CD, and DevOps culture in general, is awe-inspiring.
There’s a great quote from Jez: “if it hurts, do it more frequently, and bring the pain forward.” That’s a great way to think about it: don’t avoid the hard parts; do them more often, and write code to make them easier. I can think back to doing deployments at a startup: the team huddled around a playbook at midnight on a Friday after weeks of development. That was stressful and nerve-wracking and didn’t always go smoothly. So we started automating.
One of the tenets of CI/CD is to automate everything, because it’s more efficient and less error-prone. Manual tasks are time consuming and stressful, and you also want to eliminate single points of failure: what do you do when the deployment person in your company gets sick or goes on vacation?
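Automating a manual playbook can start very small. As a minimal sketch, a run-book becomes an ordered, fail-fast script; the stage names and stubbed steps below are hypothetical, not from any system described in this piece:

```python
# Minimal sketch of turning a manual deploy playbook into code.
# Stage names and steps are hypothetical placeholders.

def run_playbook(stages):
    """Run ordered stages; stop at the first failure so nothing half-deploys."""
    completed = []
    for name, step in stages:
        try:
            step()
        except Exception as exc:
            return {"status": "failed", "stage": name,
                    "error": str(exc), "completed": completed}
        completed.append(name)
    return {"status": "success", "completed": completed}

# Example stages (stubbed out for illustration):
stages = [
    ("build",  lambda: None),
    ("test",   lambda: None),
    ("deploy", lambda: None),
]
print(run_playbook(stages)["status"])  # success
```

Even a script this simple removes the single point of failure: anyone on the team can run it, and it always runs the steps in the same order.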
I started to get involved with CI/CD more seriously when I went back to Edmunds. There were some tests and deployment scripts, but we took it to the next level – automating everything we could and connecting the pipeline stages using GoCD. The goal we had was getting to continuous deployment, and for a web application I think that makes a lot of sense (for a mobile application, you’re not going to ship every build, but it still makes sense to continuously build and test, then you can decide on how often you want to ship that passing build). Continuous delivery was our first milestone, then once we were satisfied with build quality we enabled continuous deployments. You learn a lot about your deployments and the weaknesses of your systems when you’re shipping multiple times a day.
Initially, as soon as the unit tests passed, we could build, run some end-to-end tests, and ship, and from there we continued to add steps and refine the process. Some of the more important pieces, like multiple environments, were already set up. One of the things we added was ephemeral environments, so code reviewers and product managers could test out new features before merging to mainline (master). You want to be merging feature code to master (that’s the branch you build from) and then that single binary ships across all of your environments.
Those on-demand sandbox environments helped us keep the pipeline healthy by preventing the use of pre-production as a development environment. Any developer could take their current branch, as long as it was checked in, and deploy it onto our cloud infrastructure at a custom URL (we used the branch name as an identifier). Once a feature or fix had been approved and the code reviewed, it got merged to master, then the binary got built and shipped across the other environments. That sandbox was crucial to alleviating the churn of having developers testing new code in the real pipeline.
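Using the branch name as the environment identifier can be sketched as a small slugging function. The domain and slug rules below are assumptions for illustration, not Edmunds’ actual scheme:

```python
import re

# Sketch: derive a DNS-safe sandbox URL from a Git branch name.
# Domain and slug rules are assumptions, not a real deployment's.

def sandbox_url(branch, domain="sandbox.example.com"):
    """Turn a branch name into a subdomain for an ephemeral environment."""
    slug = re.sub(r"[^a-z0-9-]+", "-", branch.lower()).strip("-")
    return f"https://{slug}.{domain}"

print(sandbox_url("feature/JIRA-123_new-checkout"))
# https://feature-jira-123-new-checkout.sandbox.example.com
```

The slug has to be deterministic so that redeploying the same branch updates the same environment rather than creating a new one.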
Yes, we were using a slightly modified version of GitFlow. Rollback is always better than hotfixing, so you want to revert to a previous build when it’s possible. When you have a CI/CD system that’s running efficiently, you can quickly push non-urgent changes and let them flow through on their own. For emergencies, you can always branch off the tag in master, make changes, and then force it through the CI system. We never really had any hair-on-fire emergencies like that because of all the checks in the system. Our canary deployment system was a big help there, as another layer of security that prevented defective builds from being fully deployed to production. It was effectively an auto-rollback mechanism for bad builds.
We looked at metrics like deployment speed and deployment frequency, which give you insight into the health and efficiency of your system. You want builds to be quick, you want failures to happen fast and early in the pipeline, and you want that failure feedback loop to be as short as possible because the earlier you can detect failures the cheaper they are to solve.
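The two metrics mentioned here, deployment speed and deployment frequency, are easy to compute from deployment records. A sketch, with illustrative field names:

```python
from datetime import datetime, timedelta

# Sketch: compute average deployment duration and deploys per day
# from a list of deployment records. Field names are illustrative.

def deployment_metrics(deploys):
    """Return average duration (seconds) and deployment frequency (per day)."""
    if not deploys:
        return {"avg_duration_s": 0.0, "per_day": 0.0}
    durations = [(d["finished"] - d["started"]).total_seconds() for d in deploys]
    first = min(d["started"] for d in deploys)
    last = max(d["finished"] for d in deploys)
    span_days = max((last - first).days, 1)
    return {"avg_duration_s": sum(durations) / len(durations),
            "per_day": len(deploys) / span_days}

t0 = datetime(2023, 1, 1, 12, 0)
deploys = [
    {"started": t0, "finished": t0 + timedelta(minutes=10)},
    {"started": t0 + timedelta(days=2), "finished": t0 + timedelta(days=2, minutes=20)},
]
print(deployment_metrics(deploys))  # {'avg_duration_s': 900.0, 'per_day': 1.0}
```

Trending these numbers over time is what makes them useful: a rising average duration or falling frequency is an early warning that the pipeline is getting unhealthy.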
For web applications, fast can mean a subset of browsers for early testing; for mobile apps, a subset of supported devices. You want the feedback loop to be fast for the device-agnostic, broad-impact bugs. For edge cases and narrow-impact (less probable) bugs, you can make the trade-off of testing them further down the pipeline.
Build failures and deployment failures are also important metrics, and you want to look at time to resolution to make sure teams are acting on those failures. Another big tenet of CI/CD is the whole notion of “don’t go home on a broken build,” because if the pipeline is stuck, it’s stuck for everyone. This was one of the cultural changes that I found hardest to tackle when moving to CI/CD. If there’s a failure in the pipeline, fixing it takes priority over whatever feature you’re working on, because the pipeline has to keep running.
At Edmunds, we didn’t have a QA team, so the developers wrote end-to-end tests. We could determine who was responsible for test failures by using metadata in the test files and the changesets between the last good and failing builds. From there, we could also go back through the Git history to see who was responsible for the code that caused the failure. That made it easy to alert those engineers right away, because you want the people with the most experience with that piece of code addressing the issue. On-call triage is inefficient when it comes to CI/CD; if you can get failure alerts to the right team or person, you’ll turn around fixes much faster.
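Routing a failure to the right engineer, as described above, combines test-file metadata with the changeset between the last good build and the failing one. A sketch, where the data shapes and names are assumptions for illustration:

```python
# Sketch: route a test failure to the engineer most likely responsible.
# Data shapes and names (owners, covers, etc.) are illustrative assumptions.

def find_responsible(failing_test, test_owners, changeset_authors):
    """Prefer whoever changed covered code since the last good build;
    fall back to the test file's declared owner, then to on-call."""
    for path in failing_test.get("covers", []):  # source files the test exercises
        if path in changeset_authors:
            return changeset_authors[path]
    return test_owners.get(failing_test["file"], "on-call")

test_owners = {"tests/test_checkout.py": "alice"}          # from test metadata
changeset_authors = {"src/checkout.py": "bob"}             # from git history
failing = {"file": "tests/test_checkout.py", "covers": ["src/checkout.py"]}
print(find_responsible(failing, test_owners, changeset_authors))  # bob
```

The fallback chain matters: changeset authorship is the strongest signal, but when the failing build touched nothing the test covers, the test owner is still a better first contact than a generic on-call rotation.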
There are also benefits to making the pipeline and these metrics transparent to the product team, to give them more insight into feature status and deployment events. A CI/CD system can even move tickets on sprint boards through existing JIRA integrations. If you need to prove the value of tests, a “bugs caught in pre-production” metric can be exposed to highlight failures that were caught by tests in CI and would otherwise have made it to production.
The mobile world is a bit different in terms of deployment; you can ship anytime on the Play Store, but Apple’s App Store has to approve your builds, so there’s a delay introduced that you have to live with. Not that you want to be shipping mobile applications much faster than once a week, because you want to be mindful about update cadence and bandwidth since you’re shipping binaries to user devices.
Where CI/CD is important with mobile apps is ensuring quality and reducing time spent on bug fixes. One of the big benefits of having a fully fledged CI/CD system is that you’re finding and catching issues early, when they happen and are still top of mind – not two weeks later during regression testing. Also, in a world where a lot of the business logic is on the server (in the APIs), you want to be testing your current build against the APIs that are shipping to production. So it’s another quality gate around existing functionality as APIs change.
In determining release cadence, I don’t like to plan releases around features. I’d much rather have a release cadence that’s predetermined, and then you just ship what’s ready (and tested). If you decide you’re going to ship every week, or every two weeks, you want to ship everything that’s available. With a CI/CD system, if you have a working build daily or better, that’s fantastic. The changesets are smaller, which makes it easier to pinpoint breaking changes and allows you to be much more nimble. When you’re ready to cut that release branch and push, you know you’ve got something working because it’s already passed through quality assurance. Feature flags and feature-driven development are also essential tools to enable a high level of agility, by allowing the codebase to stay up-to-date and preventing time-consuming merges.
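The feature-flag idea mentioned here is what lets merged-but-unfinished code ship dark on a fixed cadence. A minimal sketch, with made-up flag names; real systems usually back this with a remote config service rather than an in-process dict:

```python
# Minimal feature-flag sketch. Flag names and the in-memory store are
# illustrative; production systems typically use a remote config service.

FLAGS = {"new_checkout": False, "dark_mode": True}

def is_enabled(flag, flags=FLAGS):
    """Unknown flags default to off, so missing config fails safe."""
    return flags.get(flag, False)

def render_checkout():
    # Code for the new flow is merged and shipped, but stays dark
    # until the flag flips - no long-lived feature branch needed.
    if is_enabled("new_checkout"):
        return "new checkout flow"
    return "legacy checkout flow"

print(render_checkout())  # legacy checkout flow
```

Flipping the flag is then a config change, decoupled from the deploy itself, which also gives you an instant off switch if the new path misbehaves.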
Ensuring test coverage feels time consuming and it does add more time to feature development – up front. But the benefit of that effort comes in long term savings, because issues in production are more expensive to fix. Time will be spent in triage, issues pile up and increase the complexity of fixes and releases, and you’ll ultimately spend more time fixing bugs that make it to production. Your customer experience team may be involved, on-call engineers will be involved, and those first responders may or may not have insight into the failing code or feature. Failures in production are way more expensive in terms of time, and more importantly they can be detrimental to your business.
The other thing is if you have good test coverage you can move really fast and not be concerned about breaking other parts of the system. Over and over I’ve seen unknown dependencies cause failures, especially with legacy codebases, and it can be hard to avoid that without proper testing in place. If you’re changing code or adding a new feature, you’re usually looking at this local scope of functionality, not thinking about external impact. When you have good global test coverage, that’s taken care of for you.
At the end of the day, good automated test coverage allows you to be more agile. There’s a sweet spot of what coverage looks like for every team, and it doesn’t mean you have to have end-to-end testing around every edge case. You want to devote a certain amount of your time to writing tests, and I can tell you from my experience that teams with more coverage have a faster product velocity.
Again, you can create reports, and I recommend giving read-only access to your CI/CD system dashboard to anyone and everyone. You want to be very transparent about pipeline status, alert on important passes and all failures, and get those to the team through Slack or other means. When you have a functioning CI/CD system, you can add the secondary metrics and make improvements.
Often when setting up a new CI/CD system, there can be a trough-of-despair moment where it seems hopeless and impossible to ship. If you can push through to the inflection point, when the pipelines start flowing together and people are working together to make sure the system is healthy, it’s all worth it. At that point it sells itself: people get excited and become more committed to fixing failures quickly. If you can get to that point, everyone will become an advocate for the CI/CD system.
The first step is to start automating anything that is painful or time consuming for you in your integration and deployment processes. Things that are complicated, dangerous, or that people are nervous about doing are great candidates. Add your unit, functional, and/or end-to-end tests early (it’s table stakes at this point to have automated test coverage). You can run them on a schedule at first, then tie them into your CI/CD pipelines.
There are some great CI/CD tools available now, and many of them are offered by cloud or SaaS providers you’re probably already using. You can get pretty far with GitLab/GitHub and AWS or Google Cloud. There are a lot of resources on the internet that can make it easy to get started, and if you want to go deep I strongly recommend reading Continuous Delivery by Jez Humble and David Farley. It’s the bible of CI/CD.
In Software Development 2.0, experts in the field of software development share their insights and best practices with the community.
Interested in sharing your experiences? Give us a shout at firstname.lastname@example.org.