Parallel property-based testing with a deterministic thread scheduler
This post is about how to write tests that can catch race conditions in a reproducible way. The approach is programming language agnostic, and should work in most languages that have a decent multi-threaded story. It’s a white-box testing approach, meaning you will have to modify the software under test.
Background
In my previous post, we had a look at how to mechanically derive parallel tests that can uncover race conditions from a sequential fake1.
One of the nice things about the approach is that it’s a black-box testing technique, i.e. it doesn’t require the user to change the software under test. On the other hand, threads will interleave differently every time we rerun the tests, thereby potentially causing different outcomes. This in turn creates problems for the shrinking of failing test cases2.
As a workaround, I suggested that when a race condition is found in the unmodified code, one could swap the shared memory module for one that introduces sleeps around the operations. This creates less non-determinism, because the jitter of each operation will have less of an impact, and therefore helps shrinking. This isn’t a satisfactory solution, of course, and I left a to-do item to implement a deterministic scheduler, like the authors do in the paper that first introduced parallel property-based testing.
The idea of the deterministic scheduler is that it should be possible to rerun a multi-threaded program and get exactly the same interleaving of threads each time.
The deterministic scheduler from the above-mentioned paper is called PULSE. It was supposedly released under the BSD license, however I’ve not been able to find it. PULSE is written in Erlang and the paper uses it to test Erlang code. In Erlang everything is triggered by message passing, so I think the correct way of thinking about what PULSE does is that it acts as a person-in-the-middle proxy. In other words, an Erlang process doesn’t send a message directly to another process, but instead asks the scheduler to send it to the process. That way all messages go via the scheduler and it can choose the message order. Note that a seed can be used to introduce randomness, without introducing non-determinism.
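To make the proxy idea a bit more concrete, here’s a minimal sketch (this is not PULSE, and all the names are made up for illustration): the scheduler keeps track of the in-flight messages and uses a seeded pseudorandom number generator to pick which one to deliver next, so the same seed always yields the same delivery order.

import System.Random (StdGen, randomR)

-- Hypothetical types, only for illustrating the proxy idea.
type ProcessId = Int

data InFlight msg = InFlight
  { sender   :: ProcessId
  , receiver :: ProcessId
  , payload  :: msg
  }

-- Deterministically pick the next message to deliver: the same generator
-- state always picks the same message, hence reruns give the same order.
-- (Assumes there is at least one in-flight message.)
pickNext :: StdGen -> [InFlight msg] -> (InFlight msg, [InFlight msg], StdGen)
pickNext gen inFlight =
  let (ix, gen') = randomR (0, length inFlight - 1) gen
      msg        = inFlight !! ix
      rest       = take ix inFlight ++ drop (ix + 1) inFlight
  in (msg, rest, gen')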
I implemented a proxy scheduler like this in Haskell (using distributed-process, think Haskell trying to be like Erlang) about 6 years ago, but I didn’t know how to do it in a non-message-passing setting. I was therefore happy to see that my previous post inspired matklad to write a post where he shows how he’d do it in a multi-threaded shared memory setting.
In this post I’ll port matklad’s approach from Rust to Haskell and hook it up to the parallel property-based testing machinery from my previous post.
Another difference between matklad and the approach in this post is that matklad uses an ad hoc correctness criterion, whereas I follow the parallel property-based testing paper and use linearisability. An ad hoc criterion can be faster than linearisability checking, but depending on how complicated your system is, it might be harder to find one. Linearisability checking, on the other hand, follows mechanically (for free) from a sequential (single-threaded) model/fake.
If you know what you are doing, then by all means figure out an ad hoc correctness criterion like matklad does. If on the other hand you haven’t tested much concurrent code before, then I’d recommend starting with the linearisability checking approach that we are about to describe3.
Motivation and overview
In order to explain what we’d like to do, it’s helpful to consider an example of a race condition. The textbook example of a race condition is a counter which is incremented by two threads at the same time.
One possible interleaving of the two threads that yields the correct result is the following:
| Time | Thread 1 | Thread 2 | | Integer value |
|---|---|---|---|---|
| 0 | | | | 0 |
| 1 | read value | | ← | 0 |
| 2 | increase value | | | 0 |
| 3 | write back | | → | 1 |
| 4 | | read value | ← | 1 |
| 5 | | increase value | | 1 |
| 6 | | write back | → | 2 |
However there are other interleavings where one of the threads overwrites the other thread’s increment, yielding an incorrect result:
| Time | Thread 1 | Thread 2 | | Integer value |
|---|---|---|---|---|
| 0 | | | | 0 |
| 1 | read value | | ← | 0 |
| 2 | | read value | ← | 0 |
| 3 | increase value | | | 0 |
| 4 | | increase value | | 0 |
| 5 | write back | | → | 1 |
| 6 | | write back | → | 1 |
In most programming languages the thread interleaving is non-deterministic, and so we get irreproducible failures, also known as “Heisenbugs”.
What we’d like to do is to be able to start a program with some seed and if the same seed is used then we get the same thread interleaving and therefore a reproducible result.
The idea, due to matklad, is to insert pauses around each shared memory operation (the reads and writes) and have a scheduler unpause one thread at a time. The scheduler is parametrised by a seed, which is fed into a pseudorandom number generator, which in turn lets the scheduler deterministically choose which thread to unpause.
In the rest of this post we will port matklad’s deterministic scheduler from Rust to Haskell, hopefully in a way that shows that this can be done in any other language with decent multi-threaded programming primitives. Then we’ll do a short recap of how parallel property-based testing works, and finally we’ll hook up the deterministic scheduler to the parallel property-based testing machinery.
Deterministic scheduler
The implementation of the deterministic scheduler can be split up into three parts. First we’ll implement a way for the spawned threads to communicate with the scheduler; this communication channel will be used to pause and unpause the threads. After that we’ll make a wrapper datatype around Haskell’s threads which also includes the communication channel. Finally, we’ll have all the pieces to implement the deterministic scheduler itself.
Thread-scheduler communication
The scheduler needs to be able to communicate with the running threads, in order to be able to deterministically unpause, or “step”, one thread at a time.
We’ll use Haskell’s TMVars for this, but any kind of shared memory will do. Haskell’s MVars can be thought of as boxes that contain a value, where taking something out of an empty box blocks, and putting something into a full box blocks as well. Here “blocks” means that the run-time will suspend the thread that tries the blocking action and only wake it up when the MVar changes, i.e. it’s an efficient way of waiting compared to busy-waiting or spinning. The T in TMVar adds STM transactions around MVars; we’ll see an example of what these are useful for shortly.
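As a tiny stand-alone illustration of the blocking behaviour (this example is not part of the scheduler code): takeMVar on an empty box blocks until some other thread fills it.

import Control.Concurrent (forkIO, threadDelay)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)

mvarExample :: IO ()
mvarExample = do
  box <- newEmptyMVar
  _ <- forkIO $ do
    threadDelay 100000      -- pretend to do some work
    putMVar box (42 :: Int) -- fill the box, waking up the blocked taker
  x <- takeMVar box         -- blocks until the box is full
  print x                   -- prints 42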
We’ll call our communication channel Signal:

data Signal = SingleThreaded | MultiThreaded (TMVar ())
  deriving Eq
There are two ways to create a Signal, one for single-threaded and another for multi-threaded execution:

newSingleThreadedSignal :: Signal
newSingleThreadedSignal = SingleThreaded

newMultiThreadedSignal :: IO Signal
newMultiThreadedSignal = MultiThreaded <$> newEmptyTMVarIO
The idea being that in the single-threaded case the scheduler shouldn’t be doing anything. In particular pausing a thread in the single-threaded case is a no-op:
pause :: Signal -> IO ()
pause SingleThreaded        = return ()
pause (MultiThreaded tmvar) = atomically (takeTMVar tmvar)
Notice that in the multi-threaded case the pause operation will try to take a value from the TMVar, and also notice that the TMVar starts off being empty, so this will cause the thread to block.
The way the scheduler can unpause the thread is by putting a unit value into the TMVar, which will cause the takeTMVar to finish.
unpause :: Signal -> IO ()
unpause SingleThreaded        = error "unpause: a single thread should never be paused"
unpause (MultiThreaded tmvar) = atomically (putTMVar tmvar ())
For our scheduler implementation we’ll also need a way to check if a thread is paused:
isPaused :: Signal -> STM Bool
isPaused SingleThreaded        = return False
isPaused (MultiThreaded tmvar) = isEmptyTMVar tmvar
It’s also useful to be able to check if all threads are paused:
waitUntilAllPaused :: [Signal] -> IO ()
waitUntilAllPaused signals = atomically $ do
  bs <- mapM isPaused signals
  guard (and bs)
Notice that STM makes this easy as we can do this check atomically.
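For contrast, here’s a hypothetical non-transactional variant (it’s not used anywhere in this post): each signal is checked in its own transaction, so a thread could unpause in between two checks and the combined answer might not correspond to any single point in time, and on top of that we’d have to busy-wait instead of blocking.

waitUntilAllPausedRacy :: [Signal] -> IO ()
waitUntilAllPausedRacy signals = do
  bs <- mapM (atomically . isPaused) signals -- each check is a separate transaction
  if and bs
  then return ()
  else waitUntilAllPausedRacy signals        -- busy-wait instead of blocking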
Managed threads
Having implemented the communication channel between the threads and the scheduler, we are now ready to introduce our “managed” threads (we call them “managed” because they are managed by the scheduler). These threads are basically a wrapper around Haskell’s Async threads that also includes our communication channel, Signal.
data ManagedThreadId a = ManagedThreadId
  { _mtidName   :: String
  , _mtidSignal :: Signal
  , _mtidAsync  :: Async a
  }
  deriving Eq
Our managed thread can be spawned as follows:
spawn :: String -> (Signal -> IO a) -> IO (ManagedThreadId a)
spawn name io = do
  s <- newMultiThreadedSignal
  a <- async (io s)
  return (ManagedThreadId name s a)
Notice that the spawned IO action gets access to the communication channel.
The Async thread API exposes a way to check if a thread is still executing, threw an exception, or finished yielding a result. We’ll extend this by also being able to check if the thread is paused, as follows.
data ThreadStatus a = Paused | Finished a | Threw SomeException
getThreadStatus :: ManagedThreadId a -> IO (ThreadStatus a)
getThreadStatus mtid = atomically go
  where
    go = do
      res <- pollSTM (_mtidAsync mtid)
      case res of
        Nothing -> do
          b <- isPaused (_mtidSignal mtid)
          if b
          then return Paused
          else go
        Just (Left err) -> return (Threw err)
        Just (Right x)  -> return (Finished x)
Scheduler
We’ve now got all the pieces we need to implement our deterministic scheduler.
The idea is to wait until all threads are paused, then step one of them and wait until it either pauses again or finishes. If it pauses again, then repeat the stepping. If it finishes, remove it from the list of stepped threads and continue stepping.
schedule :: RandomGen g => [ManagedThreadId a] -> g -> IO ([a], g)
schedule mtids0 gen0 = do
  res <- timeout 1000000 (waitUntilAllPaused (map _mtidSignal mtids0))
  case res of
    Nothing -> error "schedule: all threads didn't pause within a second"
    Just () -> go mtids0 gen0 []
  where
    go :: RandomGen g => [ManagedThreadId a] -> g -> [a] -> IO ([a], g)
    go []    gen acc = return (reverse acc, gen)
    go mtids gen acc = do
      let (ix, gen') = randomR (0, length mtids - 1) gen
          mtid       = mtids !! ix
      unpause (_mtidSignal mtid)
      status <- getThreadStatus mtid
      case status of
        Finished x -> go (mtids \\ [mtid]) gen' (x : acc)
        Paused     -> go mtids gen' acc
        Threw err  -> error ("schedule: " ++ show err)
We can now also implement a useful higher-level combinator:
mapConcurrently :: RandomGen g => (Signal -> a -> IO b) -> [a] -> g -> IO ([b], g)
mapConcurrently f xs gen = do
  mtids <- forM (zip [0..] xs) $ \(i, x) ->
    spawn ("Thread " ++ show i) (\sig -> f sig x)
  schedule mtids gen
Example: broken atomic counter
To show that our scheduler is indeed deterministic, let’s implement the race condition between two increments from the introduction.
First let’s introduce an interface for shared memory. The idea is that we will use two different instances of this interface: a “real” one which just does what we’d expect from shared memory, and a “fake” one which pauses around the real operations.
data SharedMemory a = SharedMemory
  { memReadIORef  :: IORef a -> IO a
  , memWriteIORef :: IORef a -> a -> IO ()
  }
realMem :: SharedMemory a
realMem = SharedMemory readIORef writeIORef
fakeMem :: Signal -> SharedMemory a
fakeMem signal = SharedMemory
  { memReadIORef = \ref -> do
      pause signal
      x <- readIORef ref
      pause signal
      return x
  , memWriteIORef = \ref x -> do
      pause signal
      writeIORef ref x
      pause signal
  }
The real one will be used when we deploy the actual software and the fake one while we do our testing.
We can now implement our counter example against the shared memory interface:
data AtomicCounter = AtomicCounter (IORef Int)
newCounter :: IO AtomicCounter
newCounter = do
  ref <- newIORef 0
  return (AtomicCounter ref)
incr :: SharedMemory Int -> AtomicCounter -> IO ()
incr mem (AtomicCounter ref) = do
  i <- memReadIORef mem ref
  memWriteIORef mem ref (i + 1)
get :: SharedMemory Int -> AtomicCounter -> IO Int
get mem (AtomicCounter ref) = memReadIORef mem ref
Finally, we can implement the race condition test using the counter and two threads that do increments:
test :: Int -> IO (Int, Bool, Int)
test seed = do
  counter <- newCounter
  mtid1 <- spawn "0" (\signal -> incr (fakeMem signal) counter)
  mtid2 <- spawn "1" (\signal -> incr (fakeMem signal) counter)
  let gen = mkStdGen seed
  _ <- schedule [mtid1, mtid2] gen
  two <- get realMem counter
  return (seed, two == 2, two)
test1 :: IO ()
test1 = mapM_ (\seed -> print =<< test seed) [0..10]
The test is parametrised by a seed for the scheduler. If we run it with different seeds we get different outcomes:
>>> test1
(0,True,2)
(1,True,2)
(2,False,1)
(3,True,2)
(4,False,1)
(5,True,2)
(6,False,1)
(7,False,1)
(8,True,2)
(9,True,2)
(10,False,1)
If we fix the seed to one which makes our test fail:
test2 :: IO ()
test2 = let seed = 2 in replicateM_ 10 (print =<< test seed)
Then we get the same outcome every time:
>>> test2
(2,False,1)
(2,False,1)
(2,False,1)
(2,False,1)
(2,False,1)
(2,False,1)
(2,False,1)
(2,False,1)
(2,False,1)
(2,False,1)
These quick tests seem to suggest that our scheduler is in fact deterministic.
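If we wanted to go a step beyond eyeballing the output, we could also phrase determinism itself as a QuickCheck property over the test function above (this property is a sketch and not part of the original code): running the same seed twice should give exactly the same result.

import Test.QuickCheck (Property)
import Test.QuickCheck.Monadic (assert, monadicIO, run)

prop_schedulerDeterministic :: Int -> Property
prop_schedulerDeterministic seed = monadicIO $ do
  r1 <- run (test seed) -- same seed ...
  r2 <- run (test seed)
  assert (r1 == r2)     -- ... same interleaving, hence the same outcome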
Parallel property-based testing recap
In our counter example above we had two concurrent increments; in this case it’s easy to see what the answer must be (the counter must have the value two, if we start counting from zero and increment by one).
However for more complicated scenarios it gets less clear, consider:
- Two increments and a get operation all happening concurrently, what’s the right return value of the get? It depends, it can be 0, 1 or 2;
- Consider the counter being on a remote server and clients doing the increments and gets via some network. Imagine a client first does an increment and this request times out or the client crashes, then another client does a get operation, what’s the return value of the get? It depends, it can be 0 or 1 depending on if the timeout or crash happened before or after the server received the increment;
- The above gets a lot more complicated with more operations involved or more complicated data structures than a counter, e.g. key-value store with deletes.
Luckily there’s a correctness criterion for concurrent programs like these which is based on a sequential model, Linearizability: a correctness condition for concurrent objects by Herlihy and Wing (1990), which hides the complexity of non-determinism and crashing threads and works on arbitrary data structures. This is what we use in parallel property-based testing.
The idea in a nutshell: execute commands in parallel, collect a concurrent history of when each command started and stopped executing, try to find an interleaving of commands which satisfies the sequential model. For a more detailed explanation see my previous post.
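To give a rough idea of the last step (the actual implementation is in the previous post), here’s a hedged sketch of checking whether any interleaving of a history is explained by a sequential model; the transition function and the list-of-interleavings representation are simplifications for illustration, not the previous post’s real types.

-- Given the sequential model's transition function, its initial state, and
-- all possible sequential orders of the concurrent history, check whether at
-- least one order is explained by the model.
linearisableSketch
  :: Eq resp
  => (model -> cmd -> (model, resp)) -- transition function of the sequential model
  -> model                           -- initial model state
  -> [[(cmd, resp)]]                 -- all interleavings of the concurrent history
  -> Bool
linearisableSketch step model0 = any (go model0)
  where
    go _ [] = True
    go m ((cmd, resp) : rest) =
      let (m', expected) = step m cmd
      in resp == expected && go m' rest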
Integrating the scheduler into the testing
The point is not to reimplement the parallel property-based testing machinery from my previous post here, but merely to show that integrating the deterministic scheduler isn’t too much work.
We need to change the code from the previous post in three different places: the sequential module, the parallel module and the counter example itself.
Changes to sequential module
First we import the library code that we wrote above in this post:
+import qualified ManagedThread2 as Scheduler
Then we move the runCommandMonad method from the ParallelModel class into the StateModel class and change it so that it has access to the communication channel to the scheduler (Signal):
+ runCommandMonad :: proxy state -> CommandMonad state a -> Scheduler.Signal -> IO a
We then change the runCommands function to use runCommandMonad and use a single-threaded Signal, i.e. one that doesn’t do any pauses:
 runCommands :: forall state. StateModel state
-  => Commands state -> PropertyM (CommandMonad state) ()
-runCommands (Commands cmds0) = go initialState emptyEnv cmds0
+  => Commands state -> PropertyM IO ()
+runCommands (Commands cmds0) =
+  hoist (flip (runCommandMonad (Proxy :: Proxy state)) Scheduler.newSingleThreadedSignal) $
+    go initialState emptyEnv cmds0
   where
     go :: state -> Env state -> [Command state (Var (Reference state))] -> PropertyM (CommandMonad state) ()
In order for this to typecheck we need a helper function that changes the underlying monad of a PropertyM (QuickCheck’s monadic properties):
+hoist :: Monad m => (forall x. m x -> IO x) -> PropertyM m a -> PropertyM IO a
+hoist nat (MkPropertyM f) = MkPropertyM $ \g ->
+ let
+ MkGen h = f (fmap (fmap (return . ioProperty)) g)
+ in
+ MkGen (\r n -> nat (h r n))
Changes to parallel module
Again we import the deterministic scheduler that we defined in this post:
+import qualified ManagedThread2 as Scheduler
As we said above, the runCommandMonad method was moved into the sequential testing module:
- runCommandMonad :: proxy state -> CommandMonad state a -> IO a
We’ll reuse QuickCheck’s seed for our scheduler; the following helper function extracts the seed from QuickCheck’s PropertyM:
+getSeed :: PropertyM m QCGen
+getSeed = MkPropertyM (\f -> MkGen (\r n -> unGen (f r) r n))
We now have all the pieces we need to rewrite runParallelCommands to use the deterministic scheduler:
 runParallelCommands :: forall state. ParallelModel state
                     => ParallelCommands state -> PropertyM IO ()
 runParallelCommands cmds0@(ParallelCommands forks0) = do
+  gen <- getSeed
+  liftIO (putStrLn ("Seed: " ++ show gen))
   forM_ (parallelCommands cmds0) $ \cmd -> do
     let name = commandName cmd
     monitor (tabulate "Commands" [name] . classify True name)
   monitor (tabulate "Concurrency" (map (show . length . unFork) forks0))
   q <- liftIO newTQueueIO
   c <- liftIO newAtomicCounter
-  env <- liftIO (runForks q c emptyEnv forks0)
+  env <- liftIO (runForks q c emptyEnv gen forks0)
   hist <- History <$> liftIO (atomically (flushTQueue q))
   let ok = linearisable env (interleavings hist)
   unless ok (monitor (counterexample (show hist)))
   assert ok
   where
-    runForks :: TQueue (Event state) -> AtomicCounter -> Env state -> [Fork state]
-             -> IO (Env state)
-    runForks _q _c env [] = return env
-    runForks q c env (Fork cmds : forks) = do
-      envs <- liftIO $
-        mapConcurrently (runParallelReal q c env) (zip [Pid 0..] cmds)
+    runForks :: RandomGen g => TQueue (Event state) -> AtomicCounter -> Env state -> g
+             -> [Fork state] -> IO (Env state)
+    runForks _q _c env _gen [] = return env
+    runForks q c env gen (Fork cmds : forks) = do
+      (envs, gen') <- liftIO $
+        Scheduler.mapConcurrently (runParallelReal q c env) (zip [Pid 0..] cmds) gen
       let env' = combineEnvs (env : envs)
-      runForks q c env' forks
+      runForks q c env' gen' forks

     runParallelReal :: TQueue (Event state) -> AtomicCounter -> Env state
-                    -> (Pid, Command state (Var (Reference state))) -> IO (Env state)
-    runParallelReal q c env (pid, cmd) = do
+                    -> Scheduler.Signal -> (Pid, Command state (Var (Reference state))) -> IO (Env state)
+    runParallelReal q c env signal (pid, cmd) = do
       atomically (writeTQueue q (Invoke pid cmd))
-      eResp <- try (runCommandMonad (Proxy :: Proxy state) (runReal (fmap (lookupEnv env) cmd)))
+      eResp <- try (runCommandMonad (Proxy :: Proxy state) (runReal (fmap (lookupEnv env) cmd)) signal)
       case eResp of
         Left (err :: SomeException) -> error ("runParallelReal: " ++ displayException err)
Changes to the counter example
We start by replacing our sleeps (threadDelays) and direct manipulation of shared memory ({read,write}IORef) with operations from the shared memory interface:
+import qualified ManagedThread2 as Scheduler
-incrRaceCondition :: IO ()
-incrRaceCondition = do
- n <- readIORef gLOBAL_COUNTER
- threadDelay 100
- writeIORef gLOBAL_COUNTER (n + 1)
- threadDelay 100
+incrRaceCondition :: Scheduler.SharedMemory Int -> IO ()
+incrRaceCondition mem = do
+ n <- liftIO (Scheduler.memReadIORef mem gLOBAL_COUNTER)
+ Scheduler.memWriteIORef mem gLOBAL_COUNTER (n + 1)
-get :: IO Int
-get = readIORef gLOBAL_COUNTER
+get :: Scheduler.SharedMemory Int -> IO Int
+get mem = Scheduler.memReadIORef mem gLOBAL_COUNTER
We need to pass in the Signal communication channel when constructing the shared memory interface. With our change to runCommandMonad we have access to the signal when translating the CommandMonad into IO, so we can simply pass the signal through using the reader monad (recall that ReaderT Scheduler.Signal IO a is isomorphic to Scheduler.Signal -> IO a).
+ type CommandMonad Counter = ReaderT Scheduler.Signal IO
+
+ runCommandMonad _ m sig = runReaderT m sig
The construction of the fake shared memory interface happens in the runReal function, where ask retrieves the signal via the reader monad:
   -- We also need to explain which part of the counter API each command
   -- corresponds to.
-  runReal :: Command Counter r -> IO (Response Counter r)
-  runReal Get  = Get_ <$> get
-  runReal Incr = Incr_ <$> incrRaceCondition
+  runReal :: Command Counter r -> CommandMonad Counter (Response Counter r)
+  runReal cmd = do
+    sig <- ask
+    let mem = Scheduler.fakeMem sig
+    case cmd of
+      Get  -> liftIO (Get_ <$> get mem)
+      Incr -> liftIO (Incr_ <$> incrRaceCondition mem)
The final changes are in the parallel property itself, where we can now remove the replicateM_ 10 that repeats the test 10 times: the thread scheduling is now deterministic, so we don’t need to repeat the test in order to avoid being unlucky and only getting thread interleavings that don’t reveal the bug.
 prop_parallelCounter :: ParallelCommands Counter -> Property
 prop_parallelCounter cmds = monadicIO $ do
-  replicateM_ 10 $ do
-    run reset
-    runParallelCommands cmds
+  run reset
+  runParallelCommands cmds
   assert True
Running the parallel property gives output such as the following:
>>> quickCheck prop_parallelCounter
Seed: SMGen 14250666030628800360 1954972351745194697
Seed: SMGen 13912848539649022280 6105520832690741705
Seed: SMGen 11982463081258021613 5563494797767522969
Seed: SMGen 3766496530906674898 8913882928510646053
Seed: SMGen 9878140450988724144 11431408445192688375
Seed: SMGen 10677049786290338516 2728325560351012375
Seed: SMGen 8857820011662424543 17283242182436244785
Seed: SMGen 8857820011662424543 17283242182436244785
Seed: SMGen 8857820011662424543 17283242182436244785
Seed: SMGen 8857820011662424543 17283242182436244785
Seed: SMGen 8857820011662424543 17283242182436244785
Seed: SMGen 8857820011662424543 17283242182436244785
Seed: SMGen 8857820011662424543 17283242182436244785
Seed: SMGen 8857820011662424543 17283242182436244785
Seed: SMGen 8857820011662424543 17283242182436244785
Seed: SMGen 8857820011662424543 17283242182436244785
Seed: SMGen 8857820011662424543 17283242182436244785
Seed: SMGen 8857820011662424543 17283242182436244785
Seed: SMGen 8857820011662424543 17283242182436244785
Seed: SMGen 8857820011662424543 17283242182436244785
Seed: SMGen 8857820011662424543 17283242182436244785
Seed: SMGen 8857820011662424543 17283242182436244785
Seed: SMGen 8857820011662424543 17283242182436244785
*** Failed! Assertion failed (after 7 tests and 3 shrinks):
ParallelCommands [Fork [Incr,Incr],Fork [Get]]
History [Invoke (Pid 0) Incr,Invoke (Pid 1) Incr,Ok (Pid 0) (Incr_ ()),Ok (Pid 1) (Incr_ ()),Invoke (Pid 0) Get,Ok (Pid 0) (Get_ 1)]
We can see that different seeds are used up until the test fails, then shrinking is done with the same seed.
Conclusion and further work
I hope I’ve managed to give a glimpse of how we can deterministically test multi-threaded code using a deterministic scheduler, and how this technique can be applied to parallel property-based testing.
While this seems to work, there are several ways in which it can be improved upon:
Some seeds don’t give the minimal counterexample (the one we saw above with two concurrent increments followed by a get). While shrinking can be improved, as already pointed out in my previous post, the problem could also be that shrinking changes the interleavings. Let’s say we generated three concurrent increments followed by a get; this triggers the race condition if one of those increments overwrites another’s increment. It could be that trying to shrink away any of the increments (to get to the minimal counterexample) fails, because removing any of them will cause the scheduler to unpause the remaining ones in a different order, thus potentially failing to trigger the race condition.
One possible solution to this problem could be to “tombstone” the shrunk commands/threads rather than removing them, and then change runReal so that tombstoned commands get run using an instance of the shared memory interface in which the pauses happen but not the mutation of the memory (a sketch of such an instance is given below). The idea being that the scheduler will then still use the pseudorandom number generator for the shrunk commands, and thus the original interleavings will be preserved.
Another possible solution, due to Daniel Gustafsson, is to save more information than merely the seed. In particular the scheduler could generate the first interleaving using a seed, but then save the interleaving (the order in which the threads were unpaused), and subsequent reruns can reuse the interleaving (or shrunk subsets of it) instead of the seed, thereby preserving the original interleaving.
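To make the tombstone idea concrete, here’s a hedged sketch of such a memory instance (tombstoneMem is a made-up name, it’s not part of the code above): it keeps the pauses, so the scheduler still draws pseudorandom numbers on behalf of the tombstoned command, but it drops the actual write.

tombstoneMem :: Signal -> SharedMemory a
tombstoneMem signal = SharedMemory
  { memReadIORef = \ref -> do
      pause signal
      x <- readIORef ref
      pause signal
      return x
  , memWriteIORef = \_ref _x -> do
      pause signal
      -- No write: the command is tombstoned, only its scheduling effect remains.
      pause signal
  }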
Currently random interleavings are checked, but we could also imagine enumerating all interleavings up to some depth. This would be more in line with what model checkers do. Perhaps SmallCheck could be used for this? It would also be interesting to compare this approach to what the dejafu library does.
While the approach in this post works in all languages due to its white-box nature, it’s interesting to consider what it would take to turn it into a black-box approach, where by black-box I mean that the programmer doesn’t need to change their code to get the deterministic testing.
Two black-box approaches that I’m aware of are:
- Intercepting and recording the syscalls that the multi-threaded program does and then somehow using the recorded trace to deterministically reproduce the same execution when the program is rerun. I believe this is what Mozilla’s time travelling debugger, rr, and Facebook’s hermit do;
- Antithesis’ deterministic hypervisor.
Both of these approaches involve a lot of engineering work though, and I’m curious if we can get there cheaper?
One thing I’m interested in is: what if we had a programming language that’s able to switch between the fake and real shared memory interface, depending on if we are testing or not? The multi-threaded code that the user writes in that case doesn’t need to be changed to get the deterministic testing, i.e. a black-box approach.
Implementing a new language and rewriting all your code in that language is a lot of work too, though. Perhaps existing languages can be incrementally changed to expose scheduler hooks or allow user-defined schedulers? Either way, it seems to me that this should be solved at the language level rather than the OS level, but maybe that’s partly because I don’t understand the OS-level solutions well enough. I’d be curious to hear other opinions or ideas.
We’ve looked at linearisability, but what about other consistency models? For example, eventual consistency? See John Hughes et al’s Mysteries of Dropbox Property-Based Testing of a Distributed Synchronization Service (2016) as well as matklad’s post for hints.
Partial-order reduction: during concurrent execution we can sometimes commute two operations without changing the outcome, e.g. the interleaving of two increments doesn’t matter, as they all end up in the same state. We can exploit this fact to check fewer histories;
We looked at shared memory, but there are other ways of getting data races, e.g. via concurrent file system access, mmapped memory, etc. All these other ways of concurrently mutating some state would require interfaces with fake implementations that insert pauses around the actual mutation. It would be interesting to take an example of a concurrent (and persisted?) data structure, e.g. the LMAX Disruptor or Aeron’s log buffers, and implement and test it in the same way we tested the counter.
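For example, concurrent file system access could be faked in the same style as SharedMemory above (the names here are made up, this is only a sketch): pauses around the real operations, so that the scheduler decides how concurrent file accesses interleave.

data SharedFileSystem = SharedFileSystem
  { fsReadFile  :: FilePath -> IO String
  , fsWriteFile :: FilePath -> String -> IO ()
  }

fakeFileSystem :: Signal -> SharedFileSystem
fakeFileSystem signal = SharedFileSystem
  { fsReadFile = \path -> do
      pause signal
      s <- readFile path
      pause signal
      return s
  , fsWriteFile = \path s -> do
      pause signal
      writeFile path s
      pause signal
  }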
If you’ve feedback, comments or are interested in working on any of the above, feel free to get in touch.
Acknowledgments
Thanks to Daniel Gustafsson for discussing the tombstone solution to the problem where shrinking can cause different thread interleavings.
If you haven’t heard of fakes before, think of them as a more elaborate test double than a mock. A mock of a component expects to be called in some particular way (i.e. it exposes only some limited subset of the component’s API) and typically throws an exception when called in any other way, while a fake exposes the full API and can be called just like the real component, but unlike the real component it takes some shortcuts. For example a fake might lose all data when restarted (i.e. it keeps all data in memory, while the real component persists the data to some stable storage).↩︎
The reason shrinking doesn’t work so well with non-determinism is that shrinking stops when the property passes. So if some input causes the property to fail and we then rerun the property on a smaller input, it might be the case that the smaller input still contains the race condition, but because the interleaving of threads is non-deterministic we are unlucky, the race condition isn’t triggered, and the property passes, which stops the shrinking process.↩︎
Jepsen’s Knossos checker also uses linearisability checking and has found many bugs, so we are in good company.
Note that the most recent Jepsen analyses use the Elle checker, rather than the Knossos checker. The Elle checker doesn’t do linearisability checking, but rather looks for cycles in the dependencies of database transactions. Checking for cycles is less general than linearisability checking, but also more efficient. See the Elle paper for details.↩︎