
Broken Builds

What rules does your team have about broken builds?

How long do you allow a build to remain broken? Who fixes it? Who determines why a test failed?

What options exist for a broken test? Do you identify and fix immediately? Does the team allow ignored tests?

Should you fess up to the team when you break the build? Do you need to declare “I’m working on it” or is that understood?

Do you allow broken features in your code base? If so, why?

I’m not saying you should or shouldn’t – I don’t know your unique pressures – but the question may be warranted if it hasn’t been asked recently.

Is there some implicit agreement allowing broken functionality in the build? If so, wouldn’t it be better to have that on the table so that everyone knows what the expectations are?

8 responses to “Broken Builds”

  1. We triage the tests before 9am each morning. Which means, in my case, I get up at 6:30 and start looking at them. If all green, it is a good day. If not, I look at the logs and try to figure out what went wrong.

    We are permitted to skip tests, but it is a short-term solution, usually considered for less than a day’s work. Sometimes, given the circumstances, a week, but very rarely longer. I have to declare in the morning before 9am what blew up and why. Every morning on the company chat, no exceptions. It is not understood that “I am working on it,” even though peeps know I am.

    As for broken features, we are not a critical software application, so having something break is not as serious as it would be in the medical and financial industries. At the same time, we do not want to lose revenue by upsetting customers.

    • Thanks for telling us how your team operates, Laura! Triage, troubleshooting, and establishing roles and responsibilities are key, aren’t they? Plenty of teams aren’t as far along as yours, and there is hope for them. We work with all different types of teams all the time, and our customers see a lot of success in getting the benefits of Continuous Integration and Continuous Delivery!

  2. Before we had a group chat tool, we used to only have emails sent to those who broke the build. This made it really hard to monitor the build, and we didn’t have any policies around addressing that issue. The build might be yellow or red for days (or even weeks!) on end and finally someone would notice, and announce in standup that we needed to fix it.

    We finally got a group chat tool (first IRC, then Slack) and were able to tie our build statuses into a group chat channel. This made monitoring and fixing the build a collaborative effort. If something broke, we could be easily alerted to the fact and would have quick insight into whose bad code had just been checked in (or determine whether it was an infrastructure problem). Usually, if a dev has added code recently, they will ensure that their code passes with a clean build. We still have Jenkins set up to email those who have committed code since the last broken build, so that is another reminder to fix the build.

    As far as tests go, we do have a tag to skip tests (or WIP tests, etc.). Also, certain tests only get run once a day in a long-form CI build (these are tests that take longer to run; we pull them out so that the build that runs with every check-in is as snappy as possible).

    When it comes time to release, we don’t release without a green build across the board, ensuring that functionality under test that might be broken isn’t included (a policy we’ve enforced only within the past year or so). We did have some issues with flapping tests a few months ago, but through some extra initiatives we were able to mitigate those through better app testability, better waiting functionality in the test framework, etc.

    • That’s great information, Brian! I think separating out long-running tests, or finding ways to divide tests so that the most meaningful information is delivered at the right time, is an under-utilized technique. Sounds like you all have a solid process that works well for your team!
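As a sketch of that separation, assuming Python’s stdlib unittest (the test names here are made up, and Brian’s team may well use a different framework or tag mechanism):

```python
# Illustrative only: tagging tests to skip, and pulling slow tests out of the
# fast per-check-in build so it stays snappy. All names are hypothetical.
import os
import unittest

# The nightly long-form build would set RUN_SLOW=1; the per-check-in build would not.
RUN_SLOW = os.environ.get("RUN_SLOW") == "1"

class CheckoutTests(unittest.TestCase):
    def test_price_rounding(self):
        # Fast unit test: runs on every check-in.
        self.assertEqual(round(19.99 * 2, 2), 39.98)

    @unittest.skipUnless(RUN_SLOW, "slow: runs only in the nightly long-form build")
    def test_full_regression_sweep(self):
        # Long-running test: excluded from the fast build.
        ...

    @unittest.skip("WIP: short-term skip, revisit within a day")
    def test_gift_cards(self):
        ...
```

The fast build runs this as-is and reports the skips; the nightly build sets `RUN_SLOW=1` to pull the slow tests back in.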

  3. I like the “you broke it, you bought it” approach, but with support from the team around you to find and fix it if you need it. I also think that automated builds, tests, and deployment should either be the norm in a culture or something it’s striving toward. Manual deployments and configuration are high-risk and focused on the short term, and the free tools and information for setting up even the basics these days are amazing.

    I have some strong opinions on this:

    * Expectations should be clear on all of these in a culture, or folks tend to default to personal preference and/or norms from other environments

    * Developers should be responsible for what they’re delivering all the way to the customer (even if you’re doing handoffs and you never talk to the customer, you should care about your feature going out into the world and how well it’s adopted, fits its need, etc.)

    * Pushing code that doesn’t compile feels either amateurish or egotistical, unless it’s extremely infrequent, which typically means you’re multitasking too much or something similar (and if you can’t compile locally, that’s a whole different level of problem).

    * Test breaks in CI? Stop the line, fix it. Then figure out why it didn’t break locally/sooner (if possible, immediately: I like tools like NCrunch and Karma that run continuously in the background and tell me when I did something silly). Once upon a time it was me looking at the failure and triaging to see whose change caused it, but with experience and 15-minute build times, folks usually see it and jump on it themselves now (or one of their team members points it out).

    * Test breaks in CI and it’s an emergency? Fix it. If I’m fixing problem #1 and broke something unexpected in location #2, pushing that to production seems like the beginning of a long night, and potentially upsets a completely new set of folks.

    * Ignored tests – Delete them or fix them. They’re not providing value; if they were important we would have rewritten them. It’s not obvious when CI goes from 52 to 53 ignored tests, but it is obvious when it goes from 0 to 1 (and that helps force the conversation around why a test has been disabled).

    * Broken Features – Broken features are going to happen. Discuss the impact on users’ success, prioritize against other broken things in the queue (if they exist), and have a clear expectation defined ahead of time for where the line in the sand is for stop-the-line events, and whether there are exceptions. This can be a hard one with lots of caveats, but I think it all boils down to having the right conversations and deciding what’s most important to the company.

    • Thanks Eli! My personal preferences are very much in line with the ones you’ve shared here. One of the things in working with my clients is meeting them where they are and walking alongside as kind of an “automation sherpa” to help them up their specific path. Some teams are more resistant than others to high-altitude sickness. My team has to know that and understand where the client is and how far we can go together at this point in time. Thank you for your vision of the summit for so many at base camp!
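The “0 to 1 ignored tests” point can even be enforced mechanically: a build step that goes red the moment any skip appears. A minimal sketch, assuming Python’s stdlib unittest (the gate function and example tests are hypothetical):

```python
# Hypothetical gate for a "zero ignored tests" policy: any skipped test
# fails the build, forcing the delete-it-or-fix-it conversation right away.
import unittest

def enforce_no_skips(suite):
    """Return True only if every test ran and passed with zero skips."""
    result = unittest.TestResult()
    suite.run(result)
    if result.skipped:
        print(f"build red: {len(result.skipped)} ignored test(s)")
    return result.wasSuccessful() and not result.skipped

class Example(unittest.TestCase):
    def test_ok(self):
        self.assertTrue(True)

    @unittest.skip("ignored -- this alone should turn the build red")
    def test_ignored(self):
        ...
```

Running `enforce_no_skips` over the suite above returns False because of the single skip; over a suite containing only `test_ok` it returns True.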

  4. It might be important to note that some teams have a completely automated CI process. There is no person who presses the button to deploy; the deploy happens automatically when all the tests pass. I think it is important to separate tests into groupings: unit tests, build acceptance tests, feature tests, regression tests. If a team can automate at least the unit tests and the build acceptance tests, whatever manual testing has to occur before a build is deployed goes much more smoothly.
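A hands-off gate like that might be sketched as follows (Python; the stage commands and deploy callable are placeholders, not any particular CI product’s API):

```python
# Hypothetical fully automated pipeline gate: nobody presses a deploy button.
# Each stage is a command (unit, build acceptance, feature, regression);
# the deploy step runs only if every stage exits 0.
import subprocess

def gate(stages, deploy):
    for cmd in stages:
        if subprocess.run(cmd).returncode != 0:
            print("stage failed, stopping the line:", cmd)  # no deploy
            return False
    deploy()  # all stages green: ship automatically
    return True
```

Usage might look like `gate([["pytest", "tests/unit"], ["pytest", "tests/acceptance"]], lambda: subprocess.run(["./deploy.sh"], check=True))`, with the paths and deploy script being whatever the team actually uses.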
