Towards human-readable binary encodings
Can we make binary encodings “human-readable”? This post explores this question by means of implementing a library inspired by Erlang’s bit syntax.
Motivation
JSON is probably the most commonly used format for serialising data today. A frequent argument for using it is that JSON is human-readable.
What does that mean exactly? I suppose that people usually mean two
things. First, it’s less verbose than XML, making it easier to read.
Most people would probably still call XML human-readable, but arguebly
less so than JSON. Second, it’s easier to read than binary encodings
produced by MessagePack, ASN.1 or Protobuf, etc. For example, the JSON
string "foo"
is represented by the following byte sequence
when using MessagePack:
+------------ A string of length 3 consisting of ...
| +--------- ... the character 'f', following by ...
| | +--+--- ... two 'o' characters.
| | | |
v v v v
a3 66 6f 6f
If we were to open a file with the above bytes or echo them to the
terminal we’d see £foo
. Which, while one character
shorter1 than the JSON string, is starting to
become unreadable already and it will become worse once the JSON object
is more complicated.
It’s worth noting that all serialised data ends up being bytes once on disk or sent over the network. So in a sense one could argue that the reason JSON is human-readable, is because these bytes get displayed as ASCII or UTF-8 by our editors and the standard terminal utilities. Another way to think about it is that ASCII and UTF-8 are encodings as well, which would be unreadable without tool support. This isn’t a new argument, people like Joe Armstrong and Martin Thompson have separately and on multiple occasions pointed this out. Both stress that we are wasting massive amounts of CPU cycles on parsing JSON.
It’s not just that it’s less space efficient, as we saw with
"foo"
vs £foo
, it’s also because with JSON we
need to inspect every single character after the first "
in
order to determine when the string ends, i.e. finding the closing
"
. Whereas in, for example, the MessagePack case the length
of the string is encoded in the a3
byte so we can jump
forward and just copy the three bytes (without looking at them). Joe
calls this reconstructing as opposed to parsing.
So if JSON is merely human-readable because of our application-level tooling, this raises the question: what would it take to make binary encodings “human-readable”?
For starters I think we’d need to make it easier to work with binary data in our programming languages. I believe Erlang’s bit syntax, which lets us do bit-level pattern-matching, is a good example of what better language support for working with binary data looks like. Even though Erlang’s way ahead most programming languages on this front, there are important use cases which are not possible to express efficiently using bit syntax though, e.g. in-place updates, leaving more to be desired.
In the rest of this post we’ll have first have a look at how Erlang’s bit syntax works, then we’ll turn to its shortcomings and try to start addressing them by means of implementing a library.
Erlang’s bit syntax
Erlang has a feature called bit syntax which allows the user to encode and decode data at the bit-level. Here’s an example, where we encode three integers into two bytes:
1> Red = 2.
2
2> Green = 61.
61
3> Blue = 20.
20
4> Mem = <<Red:5, Green:6, Blue:5>>.
<<23,180>>
Normally, even the smallest integer takes up one byte
(e.g. char
in C or Int8
in Haskell) but
Erlang’s bit syntax lets us encode, e.g., Red
using only 5
bits (rather than the default 8 bits) and thus we can fit all three
integers in \(5 + 6 + 5 = 16\) bits or
two bytes.
We can also pattern match at the bit-level using sizes to get our integers back:
5> <<R1:5, G1:6, B1:5>> = Mem.
<<23,180>>
6> R1.
2
7> G1.
61
8> B1.
20
For larger integer types, e.g. 0x12345678 :: Int32
, we
can also specify the byte order or endianness:
1> {<<16#12345678:32/big>>,<<16#12345678:32/little>>,
<<16#12345678:32/native>>,<<16#12345678:32>>}.
{<<18,52,86,120>>,<<120,86,52,18>>,
<<120,86,52,18>>,<<18,52,86,120>>}
For a slightly larger example, here’s pattern-matching on an IP datagram of IP protocol version 4:
-define(IP_VERSION, 4).
-define(IP_MIN_HDR_LEN, 5).
...
DgramSize = byte_size(Dgram),
case Dgram of
<<?IP_VERSION:4, HLen:4, SrvcType:8, TotLen:16,
ID:16, Flags:3, FragOff:13,
TTL:8, Proto:8, HdrChkSum:16,
SrcIP:32,
DestIP:32, RestDgram/binary>> when HLen >= 5, 4*HLen =< DgramSize ->
OptsLen = 4*(HLen - ?IP_MIN_HDR_LEN),
<<Opts:OptsLen/binary,Data/binary>> = RestDgram,
...
Note how we can match on the header length, HLen
, and
later use the value of that match as the size when pattern matching on
later values.
Usage
We can implement a library that lets us do similar things to Erlang’s bit syntax, but in a more clunky way (it’s difficult to beat native syntax support).
import BitsAndBobs
let
= sized word32 5 ::: sized word32 6 ::: sized word32 5 ::: Nil
pattern0 = byteString [2 :. sized word32 5, 61 :. sized word32 6, 20 :. sized word32 5]
bytestring0
bitMatch pattern0 bytestring0-- => (2,(61,(20,())))
let
= word8 :>>= \sz -> sized bytes sz ::: bytes ::: Nil
pattern1 = byteString [5 :. word8, "hello, rest" :. bytes]
bytestring1
bitMatch pattern1 bytestring1-- => (5,("hello",(", rest",())))
The above is implemented in Haskell, but should be straightforward to port to most languages using the following recipe.
How it works
The high-level idea when encoding a bunch of, possibly sized, values
into a ByteString
is as follows:
- For each value convert the value into a list of booleans (or bits);
- If the value is sized then only take that many bits, otherwise if it
isn’t sized use the default value, e.g.
Int8
= 8 bits,Float
= 32 bits, etc; - Concatenate the lists of booleans for each value into a single list of booleans;
- Split the list in groups of 8 bits;
- Convert each 8 bits into a byte (
UInt8
); - Create a
ByteString
from list ofUInt8
s.
For decoding or pattern-matching a, possibly sized, pattern against a
ByteString
the idea is:
- Convert
ByteString
into list of booleans (or bits); - For each pattern take its size many bits from the list;
- Convert the bits into the value type of the pattern;
- Continue matching the remaining patterns against the remaining bits.
Float
and Double
s get converted into
UInt32
and UInt64
respectively before
converted into bits, and Int
egers are encoding using zigzag
encoding2.
Extending Erlang’s bit syntax
Erlang’s bit syntax makes it possible to decode binary data into the host languague’s types, which can then be manipulated, and finally encoded back to binary.
While already useful, it doesn’t cover some interesting use cases. Let me try to explain the use cases and at the same time sketch possible ways we can extend Erlang’s bit syntax to cover those.
In-place updates
What if we merely want to update some binary in-place without reading it all in and writing it all back out?
For example, the de facto standard for metadata format for mp3 files is called ID3. This was never part of the mp3 specification, but rather added afterwards and so in order to not break backwards-compatibility with old media players they added it at the end of the file.
Lets imagine we wanted to write a metadata editor for mp3 files using Erlang’s bit syntax. I think no matter how smart the Erlang run-time is about bit syntax, it’s hard to imagine that it wouldn’t need to deserialse and serialise more data than necessary. Worst case it would deserialise all of the audio that leads up to where the metadata starts, but even if it’s somehow clever and starts from the back then we’d still probably need to at least deserialise all fields preceding the field we want to update.
Inspired by this problem and how tools like poke
work, I’ve started
another experiment based on Schema
s with this use case in
mind, here’s an example session of editing the metadata of an mp3
file:
$ cabal run mp3 -- /tmp/test.mp3
mp3> help
schema | read <field> | write <field> <value> | list | q(uit)
mp3> schema
audio : Binary
header : Magic "TAG"
title : ByteString (Fixed 30)
artist : ByteString (Fixed 30)
album : ByteString (Fixed 30)
year : ByteString (Fixed 4)
comment : ByteString (Fixed 30)
genre : UInt8
mp3> read title
Unknown
mp3> write title "Bits and Bobs"
mp3> read title
Bits and Bobs
mp3> list
Right (Id3V1 {title = "Bits and Bobs", artist = "", album = "", year = "2023", comment = ""})
mp3> quit
The user needs to specify the Schema
, which is closely
mapped to the ID3v1 specficiation and the rest is provided by the
library. In particular all the offsets to the different fields are
calculated from the schema3, which allow us to jump
straight to the field of interest and reconstruct it without
parsing. The above interactive editor is
completely generic and works for
any Schema
!
If we can read and update fields, it should also be possible to get diffs and patches for cheap.
On-disk data structures
Now that we can edit files in-place on the disk it would be nice to use this in order to implement on-disk data structures. For example imagine we’d like to do some kind of logging. If our schemas could express arrays and records we could define our log to be an a struct with a length field and an array of records field that of size length. In addition to extending the schema with arrays and records, we’d also need atomic increments of the length field so that we can in a thread-safe manner allocate space in our array. B-trees or Aeron’s log buffers would be other interesting on-disk data structures to implement.
The generic editor would be useful for debugging and manipulating
such data structures, but we’d probably want more tooling. For logging
we probably want something like cat
and grep
but generic in Schema
.
Zero-copy
When we read
a ByteString
field in the mp3
metadata example above, we copied the bytes from the underlying file.
Sometimes we might want to avoid doing that.
For example imagine we are implementing some network protocol. We can
use a pre-allocated buffer and recv
bytes from
a socket into this buffer (avoiding allocating memory while handling
requests), once the request is inside our buffer we can decode
individual fields (without parsing) and from that we can determine what
kind of request it is. Let’s imagine it’s some kind of write request
where we want to save the payload of some field to disk. It would be a
waste to copy the bytestring of the payload only to write it disk
immediately after, since the network request consists of raw bytes and
that’s what we want to write to the disk anyway. Instead we’d like to be
able to decode the payload field as a pointer/slice of the buffer which
we pass to write
(thus
avoiding copying aka “zero-copy”).
Backward- and forward-compatiability and migrations
Another big topic is schema evolution. How can we maintain backward- and forward-compatibility as our software evolves? We probably want to be able to migrate old formats into newer ones somehow also.
Compression
Currently our schemas cannot express how to compress fields on disk, or how to avoid sending unnecessary data in consecutive network messages.
An example of the former might be to compress a bytestring field, using say deflate, before writing it to disk. While an example of the former might be to only send the difference or change of some integer field, instead of sending the whole integer again. To make things more concrete, lets say the integer represents epoch time and we send messages several times per second, then by only sending the difference or delta in time since the last message we can save space. Other examples of compression include dictionary compression, run-length encoding, bit packing and Huffman coding.
It would be neat if encoding and decoding fields could be done modulo
compression! Likewise the schema-based cat
and
grep
could also work modulo compression.
A related topic is storing our data in a row-based or columnar fashion. Take the example of a logging library we discussed earlier with a schema that’s an array of records, i.e. each log call adds a new record to the array. This is nice in terms of writing efficiency, but if we wanted to do a lot of grepping or some aggregation on some field in the record then we’d have to jump around a lot in the file (jumping over the other fields that we are not interested in). It could be more efficient to restructure our data into a record of arrays instead, where each array only has data from one field, that way searching or aggregating over that field would be much more efficient (no jumping around). Some compression is also a lot easier to apply on columnar data, e.g. delta and run-length encoding. Perhaps it would make sense if the schema-based tools could do such data transformations in order to optimise for reads or archiving?
Checksums
If we can do encoding and decoding fields modulo compression, why not also handle checksums transparently? When we update a field which is part of a checksum, we’d probably want to check the checksum beforehand and recompute it afterwards.
Validation
What if some input bytes don’t match the schema? Currently all magic tags in a schema get verified, but sometimes we might want to be able to edit incomplete or malformed inputs.
Can we add refinements to the schema which allow us to express things like, integer between 18 and 150 or bytestring containing only alphanumeric characters, etc?
Protocols
So far we’ve looked at how to specify what data our programs use and how it’s transformed to and from bytes on disk or over the network. Another important aspect is what protocol is followed when said data is sent between components in the system.
For example consider some client-server application where our schema describes the request and responses:
flowchart LR
Client -- request --> Server
Server -- response --> Client
The schema doesn’t say anything about in which order requests are
legal. For example, we might want to always requrie a login-like request
at the start of a session. Or let’s say we are describing a POSIX-like
filesystem API, then read
s and write
s must
only be made on open
(and not yet close
d) file
descriptors.
Joe Armstrong wrote a paper called Getting Erlang to talk to the outside world (2002) which discusses this problem. He proposed a language for describing protocols and a dynamic sessions type checker, it never seemed to have got much traction though even though he gave several talks about it. One implementation can be found here.
Pandoc for binary encodings
There’s this neat tool called pandoc
that makes
possible to convert between different text formats, e.g. from Markdown
to HTML.
The list of supported formats to convert from and to is pretty long.
If we were to convert to and from each pair of possibilities would
require \(O(N^2)\) work. So what
pandoc
does instead is to convert each format to and from
its internal abstract representation, thereby reducing the problem to
\(O(N)\).
Could we do something similar for binary encodings?
In the book Development and Deployment of Multiplayer Online
Games, Vol. I by Sergey Ignatchenko (pp. 259-285, 2017) the author
talks about how most IDLs,
e.g. Protobufs, have the same language for describing what the
abstract data which we want to serialise and how we actually
want the data to be serialised. By separating the two, we could change
the binary format “on the wire” without changing the application which
operates on the abstract data (the what part). A clearer
separation between IDL and its encoding could perhaps be useful when
trying to solve the pandoc
problem for binary.
Another way to think of this is: can we make a DSL for IDLs?
Discussion
Q: Why not just use Protobuf?
A: Except for backward- and forward-compatibility, I don’t think Protobufs can handle any of the above listed use cases. Also the way it handles compatibility with it’s numbered and optional fields is quite ugly4.
Q: Writing safely to disk without going via a database is almost impossible!?
A: Dan Luu has written about this on several occasions. Short answer: don’t store anything you are worried about losing using this library. Longer answer: I’d like to revisit this topic from the point of view of testing at some later point in time. In particular I’m interested in how we can make the results from the paper All File Systems Are Not Created Equal: On the Complexity of Crafting Crash-Consistent Applications
- more accessible, especially their tool ALICE: Application-Level Intelligent Crash Explorer.
Contributing
The current implementation is in Haskell, but I’d really like to encourage a discussion beyond specific languages. In order to make binary “human-readable” we need solutions that are universal, i.e. work in any language, or perhaps better yet than libraries – extend programming languages with something like Erlang’s bit syntax.
- Do you have use cases that are not listed above?
- Do you know of tools, libraries or solutions any of the above use cases that have already not been discussed or are not listed below in the “see also” section?
- Do you know if some use cases impossible in general or incompatible with each other?
- Interested in porting any of these ideas to your favorite language?
If so, feel free to get in touch!
See also
- The Erlang reference manual on bit syntax;
- Programming examples of bit syntax from the Erlang user’s guide;
- Joe Armstrong’s PhD thesis
- also has a section on bit syntax on p. 60;
- Native Code Compilation of Erlang’s Bit Syntax (2002);
- Cap’n Proto;
- Simple Binary Encoding (SBE) by Martin Thompson et al;
- GNU poke, extensible editor for structured binary data;
- fq: jq for binary formats also described in this talk;
- Terminal JSON viewer;
- Rust’s binrw crate;
- Designing Data-Intensive Applications by Martin Kleppmann (chapter 3-4, 2017);
- Development and Deployment of Multiplayer Online Games, Vol. I by Sergey Ignatchenko (pp. 200-216 and 259-285, 2017).
The savings are greater for more complicated JSON objects, especially considering JSON doesn’t support binary data which needs to be either escaped or base64 encoded before used as a string.↩︎
I don’t think Erlang uses zig-zag encoding of integers, in fact I’m not sure what it does with them.↩︎
The library tries to calculate the offset of a field from the start of the file, in this case the beginning of the file contains an audio binary “field” of unknown length, so it fails and retries calculating the offset from the end of the file instead.↩︎
Avro has a nicer story for compatibility.↩︎