Coverage-guided property-based testing

Posted on Sep 25, 2024

Work in progress, please don’t share, but do feel free to get involved!

Almost ten years ago, back in 2015, Dan Luu wrote a post asking why coverage-guided property-based testing wasn’t a thing.

In this post I’ll survey the coverage-guided landscape, looking at what was there before Dan’s post and what has happened since. I’ll also show how to add basic coverage-guidance to the first version of the original property-based testing tool, QuickCheck, in about 35 lines of code.

Unlike many previous implementations of this idea, the technique used to implement coverage-guidance is programming language agnostic and doesn’t rely on any language-specific instrumentation of the software under test.

Motivation

Before we start, let me try to motivate why one would want to combine coverage-guided fuzzing and property-based testing to begin with.

Consider the following example1, where an error is triggered if some input byte array starts with the bytes "bad!":

func sut(input []byte) {
    if input[0] == 'b' {
        if input[1] == 'a' {
            if input[2] == 'd' {
                if input[3] == '!' {
                    panic("input must not be bad!")
                }
            }
        }
    }
}

What are the odds that a property-based testing tool (without coverage-guidance) would be able to find the error?

To make the calculation easier, let’s say that we always generate arrays of length \(4\). A byte consists of eight bits, so it has \(2^8\) possible values. That means that the probability of hitting the error is \(\frac{1}{2^8} \cdot \frac{1}{2^8} \cdot \frac{1}{2^8} \cdot \frac{1}{2^8} = (\frac{1}{2^8})^4 = \frac{1}{2^{32}}\), which is approximately \(1\) in \(4\) billion. In a realistic test suite we wouldn’t restrict the length of the array to \(4\), and hence the probability would be even worse.

With coverage-guidance we keep track of inputs that resulted in increased coverage. So, for example, if we generate the array []byte{'b'} we get further into the nested ifs, so we take note of that and start generating longer arrays that start with 'b' to see if we get even further, etc. By building on previous successes in getting more coverage, we can effectively reduce the problem to needing about \(2^8 + 2^8 + 2^8 + 2^8 = 4 \cdot 2^8 = 2^{10} = 1024\) attempts, since we can discover the input one byte at a time rather than having to guess all four bytes at once.

In other words, coverage-guidance turns an exponential problem into a polynomial one!

Background and prior work

There’s a lot to cover here, so I’ll split it up into before and after Dan’s post.

Before 2015

Fuzzing has an interesting origin. It started as a class project in an advanced operating systems course taught by Barton Miller at the University of Wisconsin in 1988.

The project was inspired by the observation that, back then, if you logged into your workstation via a dial-up modem from home and it rained, then random characters would frequently appear in the terminal. The line noise wasn’t the surprising thing, but rather that the extra characters would sometimes crash the program being invoked. Among these programs were basic utilities such as vi, mail, cc, make, sed, awk, sort, etc., and it was reasonable to expect that these would give an error message rather than crash and core dump when fed a few extra characters caused by the rain.

So the project set out to basically recreate what the rain did, but more effectively, by essentially generating random noise (a stream of bytes), feeding it to different utilities, and seeing if they crashed. A couple of years later Barton et al. published An empirical study of the reliability of UNIX utilities (1990), which documents their findings.

Inserting random characters was effective in finding corner cases where the programmers forgot to properly validate user input. However, it wouldn’t trigger bugs hiding deeper under the surface, such as the "bad!" example from the previous section.

This changed around 2007, when people started thinking about how fuzzing could be combined with evolutionary algorithms. The idea is that instead of generating random bytes all the time, as in classical fuzzing, we can use coverage information from one test to mutate the input for the next test. Or, to use the evolution metaphor: seeds that lead to better coverage are mutated in the hope that they will lead to even better coverage.

One of the first, and perhaps still most widely known, such coverage-guided fuzzers is Michał Zalewski’s AFL (2013). To get a feel for how effective AFL-style coverage-guidance is, check out the list of bugs that it found and this post about how it manages to figure out the jpeg format on its own2.

Since AFL is the tool that Dan explicitly mentions in his post, let’s pause here and return to his question, before looking at what has happened with coverage-guided fuzzers since.

Recall that Dan was asking why this idea of coverage-guidance wasn’t present in property-based testing tools. I’ve written about the history of property-based testing and explained how it works already, so I won’t take up space by repeating myself here. Let’s just note that the original paper on property-based testing was published in 2000. So when Dan asked his question, property-based testing was fifteen years old and AFL two.

The main difference between property-based testing and fuzzing is that fuzzing requires less work from the user: simply hook up the byte generator to the function that expects bytes as input, and off it goes looking for crashes. Property-based testing, on the other hand, can test functions that take arbitrary data structures as input (not just bytes), but you have to describe how to generate such inputs.

Fuzzing only looks for crashes, while property-based testing lets you specify arbitrary relations that should hold between the input and output of the system under test. For example, we can generate binary search trees and check that after we insert something into an arbitrary binary search tree it remains sorted (when we do an inorder traversal), as sketched below. Fuzzers can’t check such properties, and furthermore, because they generate random bytes, it’s unlikely that they’ll even generate a valid binary search tree to begin with (without lots of coverage-driven testing).
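
To make this concrete, here’s a sketch in Haskell of such a property. The Tree type, insertBST and inorder below are hypothetical stand-ins, and actually testing the property would additionally require Show and Arbitrary instances for Tree:

data Tree a = Leaf | Node (Tree a) a (Tree a)

insertBST :: Ord a => a -> Tree a -> Tree a
insertBST x Leaf = Node Leaf x Leaf
insertBST x t@(Node l y r)
  | x < y     = Node (insertBST x l) y r
  | x > y     = Node l y (insertBST x r)
  | otherwise = t  -- Already present, keep the tree as-is.

inorder :: Tree a -> [a]
inorder Leaf         = []
inorder (Node l x r) = inorder l ++ [x] ++ inorder r

-- Inserting into a valid binary search tree keeps its inorder traversal
-- sorted. This relies on the generator producing valid search trees,
-- which is exactly the structured input a fuzzer is unlikely to stumble
-- upon by generating random bytes.
prop_insertBST :: Int -> Tree Int -> Bool
prop_insertBST x t = isSorted (inorder (insertBST x t))
  where
    isSorted xs = and (zipWith (<=) xs (drop 1 xs))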

On the other hand, the coverage of property-based tests is only as good as the user-provided generators. Corner cases where slightly modified data leads to, e.g., exception handling are not explored automatically, and coverage information is not used to guide the input generation process.

By now we should have enough background to see that the idea of combining coverage-guidance and property-based testing makes sense. Basically what we’d like is to:

  1. Use user provided generators to kick start the exploration in the right direction;
  2. Mutate the generated data, while preserving its type, to surface bugs that user provided generators alone wouldn’t have found;
  3. Use coverage information to iteratively get deeper into the state space of the system under test, like in the “bad!” example from the motivation.

After 2015

Having covered what happened before Dan’s post, let’s have a look at what has happened in the ten years since. First off, it’s worth noting that at some point Dan added an update to his post:

“Update: Dmitry Vyukov’s Go-fuzz, which looks like it was started a month after this post was written, uses the approach from the proof of concept in this post of combining the sort of logic seen in AFL with a QuickCheck-like framework, and has been shown to be quite effective. I believe David R. MacIver is also planning to use this approach in the next version of hypothesis.”

So let’s start there, with Go-fuzz. At first glance, all functions tested with Go-fuzz need to take an array of bytes as input. My initial thought was: how can I write properties which involve generating more interesting data structures? But it turns out it’s possible, as somebody pointed out in the comments of an old post of mine. Furthermore, there are also ways of making the fuzzer aware of more complex data structures. I haven’t played around with this enough to be able to compare how well shrinking works yet, but overall I’d say it ticks the box of being a property-based testing tool that is also coverage-guided.

Next up Dan mentions Python’s Hypothesis. I was searching through the documentation trying to find out how coverage-guidance works, but I couldn’t find anything. Searching through the repository I found the following release note (2018):

“This release deprecates the coverage-guided testing functionality, as it has proven brittle and does not really pull its weight.

We intend to replace it with something more useful in the future, but the feature in its current form does not seem to be worth the cost of using, and whatever replaces it will likely look very different.”

As far as I can tell, it hasn’t been reintroduced since. However, it’s possible to hook Hypothesis up to external fuzzers. Hypothesis already uses random bytes as the basis for its generators, unlike QuickCheck which uses an integer seed, so I suppose that the external fuzzers essentially fuzz the random input bytes that in turn are used to generate more structured input.

Traditional fuzzers are usually designed to target a single binary, whereas a test suite which uses property-based testing typically has many properties. The HypoFuzz tool works around this mismatch by scheduling fuzzing time among the many Hypothesis properties in your test suite.

What else has happened since Dan’s post?

One of the first things I noticed is that AFL is no longer maintained:

“Note: AFL hasn’t been updated for a couple of years; while it should still work fine, a more complex fork with a variety of improvements and additional features, known as AFL++, is available from other members of the community and is worth checking out.”

Whereas AFL is based on a single idea of how the fuzzer does its exploration, with very few knobs, AFL++ (2020) keeps the basic AFL evolutionary algorithm structure but incorporates a lot of new research on other ways to explore the state space. For example, which seed gets scheduled and how many times it gets mutated per round are two new parameters that can be tweaked to achieve different exploration paths through the system under test.

The next thing I did was to search for “coverage-guided property-based testing” in the academic literature.

One of the first papers I found was Coverage guided, property based testing by Leonidas Lampropoulos, Michael Hicks, and Benjamin C. Pierce (2019). The paper presents FuzzChick, a Coq/Rocq library that adds AFL-style coverage instrumentation to QuickChick (a Rocq QuickCheck clone). Unfortunately the only source code I could find lives in an unmaintained branch that doesn’t compile.

The related works section of the paper has a couple of interesting references though.

The main inspiration for FuzzChick seems to have been Stephen Dolan et al’s OCaml library called Crowbar (2017). Crowbar uses a stream of bytes to drive its generators, similar to Hypothesis, and it’s this stream that AFL is hooked up to. This indirection is Crowbar’s (and by extension, I guess, also HypoFuzz’s) biggest weakness.

AFL is good at manipulating this byte stream, but because the bytes are not used directly to test the system under test, but rather to generate data which in turn is used for testing, some of its effectiveness is lost. This becomes particularly obvious with data structures that have sparse pre-conditions, e.g. sorted lists or binary search trees. That’s what the authors of FuzzChick say, at least, while claiming to have addressed this weakness by doing type-aware mutations.

The other libraries that the paper mentions are from the imperative language community.

For example JQF + Zest: Coverage-guided semantic fuzzing for Java, and libFuzzer and its successor FuzzTest (2022?) for C++.

Rust’s cargo fuzz seems to build upon libfuzzer, see the chapter on Structure-aware fuzzing using libfuzzer-sys in Rust in the Rust Fuzz Book.

The FuzzTest README claims “It is a first-of-its-kind tool that bridges the gap between fuzzing and property-based testing”. I can’t tell why they would claim that, given that it appears to have been released in 2022 and many of the tools we looked at above seem to have successfully combined the two approaches before that. For example, how is it different from Go-fuzz?

In my search I also found the paper MUTAGEN: Reliable Coverage-Guided, Property-Based Testing using Exhaustive Mutations by Agustín Mista and Alejandro Russo (2023). This paper builds upon the FuzzChick paper, but swaps out the AFL-style coverage instrumentation for a GHC plugin that annotates source code with coverage information at function clauses, case statements, and each branch of if-then-else expressions.

Imperative languages such as Go, Python, C++, Rust, and Java seem ahead of functional languages when it comes to combining coverage-guided fuzzing and property-based testing.

Let’s try to change that by implementing a small functional programming version, based on the original property-based testing implementation.

Prototype implementation

Getting the coverage information

One key question we need to answer in order to be able to implement anything that’s coverage-guided is: where do we get the coverage information from?

When I’ve thought about how to implement coverage-guided property-based testing in the past, I always got stuck on the idea that parsing the coverage output from the compiler between test case generation rounds would be annoying and slow.

I thought that in the best case scenario the compiler might provide a library which exposes the coverage information3.

It wasn’t until I started researching this post that I realised that AFL, and most coverage-guided fuzzers since, actually inject custom coverage capturing code into compiled programs at every branch point or basic block. It’s explained in more detail in the whitepaper.

The main reason they do it is performance, not because it’s necessarily easier; in fact, I still don’t understand exactly how it works.

It wasn’t until I read about Antithesis’ “sometimes assertions” that I started seeing a simple solution to the problem of collecting coverage information.

To understand how “sometimes assertions” work, let’s first recall how regular assertions, or “always assertions”, work. If we add assert b "message" somewhere in the code base and the boolean b evaluates to false at run-time, then the program will fail with "message".

“Sometimes assertions” are different in that they don’t need to always hold. Consider the example:

for (1..10000) {
  c := flipCoin()
  sometimesAssert (c == Heads) "probably an unfair coin"
}

Above, the “sometimes assertion” will only fail if all 10000 coin flips come up tails.

How is this related to coverage though? If we sprinkle “sometimes assertions” at every branch point:

if b {
  sometimesAssert True "true branch"
} else {
  sometimesAssert True "false branch"
}

Then we’ll get a failure when some branch hasn’t been covered! The neat thing about “sometimes assertions” is that we don’t need to annotate every single branch; we can annotate just the interesting points in our program. That’s why we can think of “sometimes assertions” as generalised coverage.

The final piece of the puzzle, and I think this is the only original idea that this post adds4, is that property-based testing already has functionality for implementing “sometimes assertions”: the machinery for gathering run-time statistics of the generated data!

This machinery is crucial for writing good tests and has been part of the QuickCheck implementation since the very first version5.

So the question is: can we implement coverage-guided property-based testing using the internal notion of coverage that property-based testing already has?

The first version of QuickCheck

Before we answer the above question, let’s remind ourselves of how a property-based testing library is implemented. For the sake of self-containment, let’s reproduce the essential parts of QuickCheck as defined in the appendix of the original paper that first introduced property-based testing (ICFP, 2000).

Generating input data

Let’s start with the generator6, which is used to generate random inputs to the software under test:

newtype Gen a = Gen (Int -> StdGen -> a)

generate :: Int -> StdGen -> Gen a -> a
generate n rnd (Gen m) = m size rnd'
 where
  (size, rnd') = randomR (0, n) rnd

So a Gen a is basically a function from a size and a pseudo-random number generator into a. The pseudo-random number generator and size can be accessed using the following two functions:

rand :: Gen StdGen
rand = Gen (\_n r -> r)

sized :: (Int -> Gen a) -> Gen a
sized fgen = Gen (\n r -> let Gen m = fgen n in m n r)

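Gen also needs Functor, Applicative and Monad instances. Here’s a sketch of what they look like with today’s class hierarchy, close to the paper’s definitions (split comes from System.Random and ap from Control.Monad):

instance Functor Gen where
  fmap f (Gen m) = Gen (\n r -> f (m n r))

instance Applicative Gen where
  pure a = Gen (\_n _r -> a)
  (<*>)  = ap

instance Monad Gen where
  -- Split the pseudo-random number generator so that the two
  -- computations get independent randomness.
  Gen m >>= k = Gen (\n r0 ->
    let (r1, r2) = split r0
        Gen m'   = k (m n r1)
    in m' n r2)
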
Using these together with the Functor, Applicative and Monad instances of Gen we can derive other useful combinators for generating data:

choose :: Random a => (a, a) -> Gen a
choose bounds = (fst . randomR bounds) `fmap` rand

elements :: [a] -> Gen a
elements xs = (xs !!) `fmap` choose (0, length xs - 1)

vector :: Arbitrary a => Int -> Gen [a]
vector n = sequence [ arbitrary | _i <- [1..n] ]

Instead of defining generators directly for different datatypes, QuickCheck first wraps generators in a type class called Arbitrary7:

class Arbitrary a where
  arbitrary :: Gen a

instance Arbitrary Bool where
  arbitrary = elements [True, False]

instance Arbitrary Char where
  -- Avoids generating control characters.
  arbitrary = choose (32,126) >>= \n -> return (chr n)

instance Arbitrary Int where
  arbitrary = sized $ \n -> choose (-n,n)

instance Arbitrary a => Arbitrary [a] where
  arbitrary = sized (\n -> choose (0,n) >>= vector)
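
To see how the monadic combinators compose, here’s a sketch of an instance for pairs in the same style (not part of the excerpt above):

instance (Arbitrary a, Arbitrary b) => Arbitrary (a, b) where
  -- Generate each component independently and pair them up.
  arbitrary = arbitrary >>= \x ->
              arbitrary >>= \y ->
              return (x, y)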

Specifying properties

Next up, let’s look at how properties are expressed. The Property type is a wrapper around Gen Result:

newtype Property
  = Prop (Gen Result)

result :: Result -> Property
result res = Prop (return res)

Where Result is defined as follows:

data Result
  = Result { ok :: Maybe Bool, stamp :: [String], arguments :: [String] }

nothing :: Result
nothing = Result{ ok = Nothing, stamp = [], arguments = [] }

The idea is that ok :: Maybe Bool is Nothing if the input gets discarded, and otherwise the boolean indicates whether the property passed or not. The stamp field is used to collect statistics about the generated test cases, while arguments contains all the generated inputs (or arguments) to the property.

In order to allow the user to write properties of varying arity another type class is introduced:

class Testable a where
  property :: a -> Property

instance Testable () where
  property _ = result nothing

instance Testable Bool where
  property b = result (nothing{ ok = Just b })

instance Testable Result where
  property res = result res

instance Testable Property where
  property prop = prop

instance (Arbitrary a, Show a, Testable b) => Testable (a -> b) where
  property f = forAll arbitrary f

The key ingredient in the function instance of Testable is forAll:

forAll :: (Show a, Testable b) => Gen a -> (a -> b) -> Property
forAll gen body = Prop $
  do a   <- gen
     res <- evaluate (body a)
     return (argument a res)
 where
  argument a res = res{ arguments = show a : arguments res }

Which in turn depends on:

evaluate :: Testable a => a -> Gen Result
evaluate a = gen where Prop gen = property a

One last construct for writing properties that we need is the ability to add assumptions or pre-conditions about the input:

(==>) :: Testable a => Bool -> a -> Property
True  ==> a = property a
False ==> a = property ()

Notice how, if the input doesn’t satisfy the pre-condition, the test case is discarded (ok is set to Nothing).

Collecting statistics

The way we collect statistics about the generated data is through these two functions:

label :: Testable a => String -> a -> Property
label s a = Prop (add `fmap` evaluate a)
 where
  add res = res{ stamp = s : stamp res }

classify :: Testable a => Bool -> String -> a -> Property
classify True  name = label name
classify False _    = property
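
Note that classify is in effect already a “sometimes assertion”: the label shows up in the statistics whenever the condition holds, and a label that never shows up corresponds to a “sometimes assertion” that never fired. As a sketch (the sometimesAssert name is made up here, it’s not part of QuickCheck):

sometimesAssert :: Testable a => Bool -> String -> a -> Property
sometimesAssert = classify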

Running the tests

Finally we have all the pieces we need to be able to actually run the tests. The testing can be configured:

data Config = Config
  { maxTest :: Int
  , maxFail :: Int
  , size    :: Int -> Int
  , every   :: Int -> [String] -> String
  }

quick :: Config
quick = Config
  { maxTest = 100
  , maxFail = 1000
  , size    = (+ 3) . (`div` 2)
  , every   = \n args -> let s = show n in s ++ [ '\b' | _ <- s ]
  }

verbose :: Config
verbose = quick
  { every = \n args -> show n ++ ":\n" ++ unlines args
  }

Where maxTest is the number of passing test cases that will be run, maxFail is the number of test cases that are allowed to be discarded, size determines how the size parameter passed to the generator grows between tests, and every is used to print something (or not) between tests.

The tests themselves can now be run as follows:

test, quickCheck, verboseCheck :: Testable a => a -> IO ()
test         = check quick
quickCheck   = check quick
verboseCheck = check verbose

check :: Testable a => Config -> a -> IO ()
check config a =
  do rnd <- newStdGen
     tests config (evaluate a) rnd 0 0 []

tests :: Config -> Gen Result -> StdGen -> Int -> Int -> [[String]] -> IO ()
tests config gen rnd0 ntest nfail stamps
  | ntest == maxTest config = do done "OK, passed" ntest stamps
  | nfail == maxFail config = do done "Arguments exhausted after" ntest stamps
  | otherwise               =
      do putStr (every config ntest (arguments result))
         case ok result of
           Nothing    ->
             tests config gen rnd1 ntest (nfail+1) stamps
           Just True  ->
             tests config gen rnd1 (ntest+1) nfail (stamp result:stamps)
           Just False ->
             putStr ( "Falsifiable, after "
                   ++ show ntest
                   ++ " tests:\n"
                   ++ unlines (arguments result)
                    )
     where
      result      = generate (size config ntest) rnd2 gen
      (rnd1,rnd2) = split rnd0
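
The done helper, which prints the verdict together with a percentage breakdown of the collected stamps, isn’t shown above. Here’s a version adapted from the paper’s appendix (it needs group, sort and intercalate from Data.List):

done :: String -> Int -> [[String]] -> IO ()
done mesg ntest stamps = putStr (mesg ++ " " ++ show ntest ++ " tests" ++ table)
 where
  -- Count how often each distinct stamp occurs, most frequent first.
  table = display
        . map entry
        . reverse
        . sort
        . map pairLength
        . group
        . sort
        . filter (not . null)
        $ stamps

  display []  = ".\n"
  display [x] = " (" ++ x ++ ").\n"
  display xs  = ".\n" ++ unlines (map (++ ".") xs)

  pairLength xss@(xs:_) = (length xss, xs)
  entry (n, xs)         = percentage n ntest ++ " " ++ intercalate ", " xs
  percentage n m        = show ((100 * n) `div` m) ++ "%"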

Example: traditional use of coverage

Okay, so the above is the first version of the original property-based testing tool, QuickCheck.

Before we extend it with coverage-guidance, let’s have a look at an example property and how collecting statistics is typically used to ensure good coverage.

The example we’ll have a look at is inserting into an already sorted list (from which insertion sort can be implemented):

insert :: Ord a => a -> [a] -> [a]
insert x [] = [x]
insert x (y : xs) | x <= y    = x : y : xs
                  | otherwise = y : insert x xs

If we do so, then the resulting list should remain sorted:

prop_insert :: Int -> [Int] -> Property
prop_insert x xs = isSorted xs ==> isSorted (insert x xs)

isSorted :: Ord a => [a] -> Bool
isSorted xs = sort xs == xs

This test passes:

>>> quickCheck prop_insert
OK, passed 100 tests.

What do the test cases that are generated look like? This is where classify comes in:

prop_insert' :: Int -> [Int] -> Property
prop_insert' x xs = isSorted xs ==>
  classify (null xs) "empty" $
  classify (length xs == 1) "singleton" $
  classify (length xs > 1 && length xs <= 3) "short" $
  classify (length xs > 3) "longer" $
    isSorted (insert x xs)

Running this property, we get some statistics about the generated data:

>>> quickCheck prop_insert'
OK, passed 100 tests.
54% empty.
27% singleton.
19% short.

As we can see, all of the generated lists are at most 3 elements long! This is perhaps not what we expected. However, if we consider that the pre-condition says that the list must be sorted, then it becomes clear that longer lists are unlikely to be sorted purely by chance8.

The extension to add coverage-guidance

Now let’s add coverage-guidance to it using the machinery for collecting statistics about the generated data.

The function that checks a property with coverage-guidance is slightly different from quickCheck9:

coverCheck :: (Arbitrary a, Show a) => Config -> ([a] -> Property) -> IO ()
coverCheck config prop = do
  rnd <- newStdGen
  testsC config arbitrary prop [] 0 rnd 0 0 []

In particular, notice that instead of Testable a we use an explicit predicate on a list of a, [a] -> Property. The reason for using a list in the predicate is so that we can iteratively make progress using the coverage information. We see this more clearly if we look at the coverage-guided analogue of the tests function, in particular the xs parameter:

testsC :: Show a => Config -> Gen a -> ([a] -> Property) -> [a] -> Int
       -> StdGen -> Int -> Int -> [[String]] -> IO ()
testsC config gen prop xs cov rnd0 ntest nfail stamps
  | ntest == maxTest config = do done "OK, passed" ntest stamps
  | nfail == maxFail config = do done "Arguments exhausted after" ntest stamps
  | otherwise               =
      do putStr (every config ntest (arguments result))
         case ok result of
           Nothing    ->
             testsC config gen prop xs cov rnd1 ntest (nfail+1) stamps
           Just True  -> do
             let stamps' = stamp result : stamps
                 cov'    = length (nub (concat stamps'))
             if cov' > cov
             then testsC config gen prop xs' cov' rnd1 (ntest+1) nfail stamps'
             else testsC config gen prop xs  cov  rnd1 (ntest+1) nfail stamps'
           Just False ->
             putStrLn ( "Falsifiable, after "
                   ++ show ntest
                   ++ " tests:\n"
                   ++ head (arguments result)
                    )
     where
       x              = generate (size config ntest) rnd3 gen
       xs'            = xs ++ [x]
       Prop genResult = prop xs'
       result_        = generate (size config ntest) rnd4 genResult
       result         = result_ {arguments = show xs' : arguments result_ }
       (rnd1,rnd2)    = split rnd0
       (rnd3,rnd4)    = split rnd2

The other important difference is the coverage parameter, cov, which counts how many distinct labels have been classified so far; e.g. if the collected stamps are [["b"], ["b", "a"]], then the coverage length (nub (concat stamps)) is 2. Notice how we only keep the newly generated input, x (appended to form xs'), if the coverage increases.

Example: using coverage to guide generation

We now have all the pieces to test the example from the motivation section:

bad :: String -> Property
bad s = coverage 0 'b'
      $ coverage 1 'a'
      $ coverage 2 'd'
      $ coverage 3 '!' $ if s == "bad!" then False else True

  where
    coverage :: Testable a => Int -> Char -> a -> Property
    coverage i ch = classify (s !? i == Just ch) [ch]

    (!?) :: [a] -> Int -> Maybe a
    xs !? i | i < length xs = Just (xs !! i)
            | otherwise     = Nothing

This property basically says that no string is ever equal to "bad!", which is obviously false. If we try to test this property using the unmodified first version of QuickCheck:

testBad :: IO ()
testBad = check config bad
  where
    config = quick { maxTest = (2^8)^4 }

We’ll see it spin away, but not actually find the bad string:

>>> testBad
32301
^CInterrupted.

I stopped it after about 32k tries; in theory we’d need more than 4 billion attempts to find the bad string using this approach.

Whereas if we use coverage-guided generation:

testBad' :: IO ()
testBad' = coverCheck config bad
  where
    config = verbose
      { maxTest = (2^8)*4
      , every = \n args -> show n ++ ": " ++ unlines args
      }

We find the bad string pretty quickly. I’m using verbose output here so you can see how it first finds the "b", then "ba", etc10:

>>> testBad'
0: "n"          23: "#"         46: "bai"       69: "badb"
1: "T"          24: "T"         47: "ba}"       70: "bad~"
2: "L"          25: "T"         48: "bay"       71: "bada"
3: "|"          26: "_"         49: "bac"       72: "bad%"
4: "X"          27: "@"         50: "bak"       73: "bad9"
5: "\""         28: "}"         51: "ba`"       74: "badE"
6: "e"          29: "y"         52: "bad"       75: "bad8"
7: "G"          30: "-"         53: "bad8"      76: "bad{"
8: "R"          31: "s"         54: "bade"      77: "badS"
9: "}"          32: "b"         55: "bad0"      78: "badn"
10: "C"         33: "bx"        56: "badA"      79: "bad?"
11: "3"         34: "b9"        57: "badP"      80: "badn"
12: ">"         35: "bf"        58: "badQ"      81: "badq"
13: "C"         36: "bl"        59: "bad0"      82: "bady"
14: "J"         37: "ba"        60: "bade"      83: "badA"
15: "9"         38: "baC"       61: "bad)"      84: "bad4"
16: "="         39: "baX"       62: "bado"      85: "bad;"
17: "9"         40: "baA"       63: "badE"      86: "bad9"
18: "L"         41: "baE"       64: "bad\""     87: "badU"
19: ")"         42: "bay"       65: "bad@"      88: "bad!"
20: "5"         43: "baX"       66: "bad{"
21: "6"         44: "ba6"       67: "badX"
22: "x"         45: "ba@"       68: "bado"
Falsifiable, after 88 tests:
"bad!"

The full source code is available here.

Conclusion and further work

We’ve seen how to add coverage-guidance to the first version of the original property-based testing tool, QuickCheck, in about 35 lines of code.

Coverage-guidance effectively reduced an exponential problem to a polynomial one, by building on previous test runs’ successes in increasing coverage.

The solution does change the QuickCheck API slightly, by requiring a property over a list of a rather than merely an a, so it’s not suitable for all properties.

I don’t think this limitation is so important, because going forward I’d like to apply coverage-guidance to testing stateful systems. When testing stateful systems, which I’ve written about here, one always generates a list of commands anyway, so the limitation doesn’t matter.

A more serious limitation of the current approach is that it’s too greedy: it seeks to maximise coverage without ever backtracking. This means that it can easily get stuck in local maxima. Consider the example:

if input[0] == 'o'
  if input[1] == 'k'
    return
if input[0] == 'b'
  if input[1] == 'a'
    if input[2] == 'd'
      error

If we generate an input that starts with ‘o’ (rather than ‘b’), then we’ll get stuck and never find the error.

Real coverage-guided tools, like AFL, will not get stuck like that. While I have a variant of the code that can cope with this, I chose to present the above greedy version because it’s simpler.

I might write another post with a more AFL-like solution at some later point, but I’d also like to encourage others to port these ideas to their favourite language and experiment!


  1. This example is due to Dmitry Vyukov, the main author of go-fuzz, but it’s basically an easier-to-understand version of the example from Dan Luu’s post. For comparison, here’s Dan’s example in full:

    // Checks that a number has its bottom bits set
    func some_filter(x int) bool {
        for i := 0; i < 16; i = i + 1 {
            if !(x&1 == 1) {
                return false
            }
            x >>= 1
        }
        return true
    }
    
    // Takes an array and returns a non-zero int
    func dut(a []int) int {
        if len(a) != 4 {
            return 1
        }
    
        if some_filter(a[0]) {
            if some_filter(a[1]) {
                if some_filter(a[2]) {
                    if some_filter(a[3]) {
                        return 0 // A bug! We failed to return non-zero!
                    }
                    return 2
                }
                return 3
            }
            return 4
        }
        return 5
    }
    ↩︎
  2. For more details about how it works, see the AFL “whitepaper” and its mutation heuristics.↩︎

  3. In the case of Haskell, I didn’t know that there was such a library until I read Shae “shapr” Erisson’s post Run property tests until coverage stops increasing (2023). Note that Shae’s post only uses coverage as a stopping condition, not to actually drive the generation of test cases.↩︎

  4. As I was writing this up, I stumbled across the paper Ijon: Exploring Deep State Spaces via Fuzzing (2020), which lets the user add custom coverage annotations. HypoFuzz also has this functionality.↩︎

  5. See the appendix of the original paper that first introduced property-based testing. It’s interesting to note that the collecting statistics functionality is older than shrinking.↩︎

  6. In this post we won’t talk about the coarbitrary method of the Arbitrary type class, which is used to generate functions.↩︎

  7. The reason for wrapping Gen in the Arbitrary type class is so that generators don’t have to be passed explicitly. Not everyone agrees that this is a good idea, as type class instances cannot be managed by the module system.↩︎

  8. The standard workaround here is to introduce a wrapper type with a custom generator that generates a random list and then sorts it before returning. That way no pre-condition is needed, as the input will be sorted by construction, so to speak.
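
    For example, a minimal sketch of such a wrapper, using only the combinators defined earlier (the SortedList name is made up for illustration):

    newtype SortedList a = SortedList [a]
      deriving Show

    instance (Arbitrary a, Ord a) => Arbitrary (SortedList a) where
      -- Generate a random list and sort it, so the invariant holds by
      -- construction and no pre-condition (==>) is needed.
      arbitrary = sized $ \n -> do
        len <- choose (0, n)
        xs  <- vector len
        return (SortedList (sort xs))
    ↩︎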

  9. It might be interesting to note that we can implement this signature using the original combinators:

    testsC' :: Show a => Config -> Gen a -> ([a] -> Property)
            -> StdGen -> Int -> Int -> [[String]] -> IO ()
    testsC' config gen prop = tests config genResult
      where
        Prop genResult = forAll (genList gen) prop
        genList gen = sized $ \len -> replicateM len gen
    ↩︎
  10. To save vertical space I’ve also arranged the output in columns rather than one per line.↩︎