Code Coverage Considered Harmful?

Recently a colleague asked if we "care" about code coverage. My answer was that we care about it, but that it is not something we actively track. In fact, as I will argue in this post, a focus on code coverage might actually be counterproductive.

What is Code Coverage?

Code coverage refers to a broad set of metrics and tools that provide insight into which parts of a code base are executed by its tests. The metrics range from simple statement coverage to modified condition/decision coverage (MC/DC), which is used in mission-critical systems.
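To make the difference between these metrics concrete, here is a minimal sketch in Python. The function `apply_discount` and its pricing rule are hypothetical, invented purely for illustration. The single test executes every statement, so statement coverage is 100%, yet the `if` is only ever evaluated as true, so a branch coverage tool would report that one of its two outcomes is never exercised.

```python
def apply_discount(price_cents: int, is_member: bool) -> int:
    """Hypothetical rule: members get 10% off, everyone else pays full price."""
    discount = 0
    if is_member:
        discount = price_cents // 10
    return price_cents - discount


def test_member_discount() -> None:
    # Every statement in apply_discount runs, so statement coverage is 100%.
    # The is_member=False outcome of the `if` is never taken, so branch
    # coverage is incomplete.
    assert apply_discount(1000, True) == 900
```

Stricter metrics such as MC/DC go further still and require each condition within a decision to be shown to independently affect the outcome.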

At a surface level, code coverage metrics and tooling appear to be useful as a way to gauge the level of testing in a code base and identify untested code. But there is a fundamental flaw with the use of these metrics in the workflow of an average agile software development team.

The objective of Testing

We use testing in software to verify both the functional requirements and some non-functional requirements of our code. A comprehensive test suite also doubles as an effective form of regression testing and therefore directly affects the maintainability and longevity of a code base.

But the objective of testing is not to ensure that every line, branch, condition, entry or exit is executed. Instead, we want to verify our software's behavior across equivalence classes, particularly meaningful use cases and edge cases.
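As a rough illustration of what testing across equivalence classes looks like, consider the sketch below. The `ticket_price` function and its age bands are assumptions made up for this example. The test picks one representative from each class of input, then probes the boundaries between classes and the invalid class, rather than aiming to execute particular lines.

```python
def ticket_price(age: int) -> int:
    """Hypothetical rule: under 12 pays 5, 65 and over pays 7, others pay 10."""
    if age < 0:
        raise ValueError("age must be non-negative")
    if age < 12:
        return 5
    if age >= 65:
        return 7
    return 10


def test_ticket_price_equivalence_classes() -> None:
    # One representative value from each valid equivalence class.
    assert ticket_price(6) == 5     # child
    assert ticket_price(30) == 10   # adult
    assert ticket_price(80) == 7    # senior

    # Boundaries between classes, where off-by-one mistakes tend to hide.
    assert ticket_price(0) == 5
    assert ticket_price(11) == 5
    assert ticket_price(12) == 10
    assert ticket_price(64) == 10
    assert ticket_price(65) == 7

    # The invalid class: negative ages must be rejected.
    try:
        ticket_price(-1)
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for a negative age")
```

The first three assertions plus the invalid case already execute every line and branch. The boundary checks add nothing to the coverage figure, but they are the ones that would catch a `<` mistyped as `<=`.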

In some ways, our tests are a proxy for these functional and non-functional requirements.

The flaw and danger of Code Coverage

Generally, good coverage is a necessary but not sufficient condition for high-quality tests. Because coverage metrics measure only which structural elements of our code are executed, they cannot tell us whether the tests are high quality or a good proxy for the requirements.

For example, code coverage cannot tell us anything about production code that is missing altogether, such as the handling of a relevant use case or edge case that was never written.
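A small sketch of this blind spot, using a hypothetical `average` function: the one test below gives the function 100% statement and branch coverage, because there is only one path through it. The coverage report has nothing to say about the empty-list case, since the code that should handle it does not exist.

```python
def average(values: list[float]) -> float:
    """Hypothetical example: arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def test_average() -> None:
    # This single test yields 100% statement and branch coverage.
    assert average([2.0, 4.0]) == 3.0


# The unhandled edge case: average([]) raises ZeroDivisionError.
# No coverage metric can flag it, because there is no line to mark as missed.
```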

Additionally, code coverage metrics cannot ensure that your tests have high-quality assertions, or even assertions at all. A test without high-quality assertions has minimal utility and no longer provides the regression testing assurances necessary for maintainability.

Mandating coverage targets or actively checking coverage results can detract from processes that we know are beneficial for both high-quality tests and high-quality code, such as Test-Driven Development and code reviews. A prominent coverage figure might provide a false sense of security and therefore encourage test-last behavior, or reduce the scrutiny that reviewers give to the tests themselves.

Where Code Coverage can be used

Despite these criticisms, code coverage can be a useful tool. I believe there are two main benefits it can provide:

  1. Uncovering untested code, particularly recurring patterns of poorly tested code that present opportunities for education or refactoring.
  2. Validating that existing processes are resulting in sufficient levels of code coverage as expected.

The frequent use of code coverage tools can therefore be useful alongside a very rigorous development workflow, such as one used for mission-critical systems.

But for the average agile software development team that does not have those formal and rigorous processes, these benefits are best gained by infrequent and random inspections of code coverage. Regularly checking this metric, such as with every pull request, can undermine these inspections by shifting developer behavior towards achieving high coverage for its own sake.

In those circumstances, reliance on code coverage can result in a paradoxical situation where the desired outcomes decline as the metric improves, eroding the utility of the metric itself. This is similar to how a machine learning model can overfit to a test data set (itself a proxy for real data) if that data set is reused repeatedly to tune the model.

Summary

Code coverage tools and metrics have their place in software development, but covering production code is not the aim of testing. For the average agile software development team, a reliance on code coverage might actually lead to suboptimal results by detracting from processes like reviews and test-driven development that we know result in high-quality code and tests.