Encoding Huffman codebooks

This post will assume you have a basic knowledge of the data compression technique known as Huffman coding. Though maybe, since I’m only concerned about decompression, I should call it something like “bit-oriented prefix codes”. Huffman coding is really just one of the algorithms that can produce such a code, but it’s the term everybody uses for this type of code, so I’m going to abuse terminology and call it Huffman coding.

Not many notable compression formats use Huffman coding by itself. It’s usually combined with something else — RLE, LZ77, whatever. In this post, I’m only looking at the Huffman coding component, and ignoring everything else.

Note that this post just looks at some formats that I happen to know about. I didn’t go out of my way to survey lots of formats.

To decompress Huffman-coded data, you use a known codebook of variable-length bit strings. For example, if the original data consists only of the letters A through H, a Huffman codebook for it might look like this:

0111  = A
0000  = B
001   = C
010   = D
1     = E
00011 = F
0110  = G
00010 = H

The decompressor reads bits from the compressed data one by one, matching them to the codes in the codebook, left-to-right. No code is equal to the first part of a longer code, so it knows when it has read a complete code.

But how does the decompressor know what the codebook is, in the first place? It depends on the how the format was designed. There are three main options…

1. Hard-code it.

The easiest way to store the codebook is to not store the codebook. Just always use the same codebook, known to both the compressor and the decompressor. This is sometimes called “fixed” Huffman. It defeats some of the purposes of Huffman coding, but can sometimes be useful.

This strategy is notably used by fax machines, and some image formats that copy from them, such TIFF format’s most common compression methods for bi-level images.

Deflate format can optionally use a fixed Huffman codebook, though in practice it usually does not.

One of the compression methods in StuffIt format (#6, “Fixed Huffman”) uses an interesting variant of this, in which the set of available codes is fixed, but the meaning of the codes is not.

2. Build it on the fly.

A technique called adaptive Huffman coding involves an ever-changing Huffman tree, derived from the frequencies of the symbols encountered so far. Adaptive Huffman coding was once fairly popular, but has been rare since… maybe the early 1990s? One notable use of it is by early versions of the LHarc compression utility.

3. Read it from the file.

The obvious solution, and the topic of the rest of this post, is to record the tree in the compressed file, prior to the compressed data that uses it.

Any Huffman codebook can be drawn as a binary tree. The example codebook above would look like this:

To decompress data using this tree, start at the root (the green dot). Read bits from the compressed data stream, one at a time. When you read a 0 bit, take the left branch. When you read a 1 bit, take the right branch. When you reach a leaf node, interpret the symbol contained in that node (the symbol could be a decompressed byte value, for example), then go back to the root (unless you have a reason to stop).

So, formats that use this type of Huffman coding have to store a tree-like codebook in the file. But a tree structure is not the most natural thing to store in a sequential data file. And because this is all about data compression, it’s desirable to store it as compactly as possible. Here are some of the ways I’ve seen it done.

Non-canonical formats

Most formats do not permit the use of arbitrary Huffman codebooks, but here are two that do.

ARC and Squeeze

The old ARC compression format has a compression method that uses Huffman coding: #4, “Squeezed”. The same basic format is used by an older utility named “Squeeze”.

The file contains a structure I’ll call the “node table”. Each item in the table consists of two coded integers: one telling you what to do if the next bit is 0, and one for if it’s 1. Most of the integer codes represent either the index of the “next node”, or a decompressed byte value.

This is just the sort of data structure a decompressor might use internally to decode Huffman codes, so it’s a simple and obvious solution. But it’s not size-efficient. And it may invite bugs, if the decoder doesn’t carefully check the table for loops, invalid indices, etc.

StuffIt compression method #3

The Macintosh-oriented StuffIt format has an old “Huffman” compression method that records the tree in a recursive way. I’ll call the data structure a “subtree”. A subtree consists of either a 1 bit followed by an 8-bit decompressed byte value, or a 0 bit followed by two subtrees.

Though more size-efficient than the “Squeeze” method, it’s still not that impressive, because it uses at least 9 bits for each byte value. It’s possible to do quite a bit better.

Canonical formats

Consider the example codebook from earlier. The following statements are true about it:

  • There is 1 code of length 1: {E}
  • There are 2 codes of length 3: {C, D}
  • There are 3 codes of length 4: {A, B, G}
  • There are 2 codes of length 5: {F, H}

There are many codebooks that satisfy all of those statements, and they would all result in exactly the same compression ratio. So it doesn’t really matter which one is used to compress the data, as long as the compressor and decompressor can agree on the same one.

With that in mind, instead of directly storing the codebook, many formats store the equivalent of the above statements. The compressor and decompressor use hard-coded rules for how to construct one of the codebooks that satisfy those statements — the canonical codebook.

I’m not going to go into the methods used to encode those statements into bits. In some formats, it gets pretty complicated, as the format designers go to great lengths to squeeze out as many bits as they can.

But it should be easy to believe that storing just the length of each code is likely to require fewer bits than storing the actual code.

For the overall canonical tree structure, the only really sensible thing to do is design it so that all the branches are as far to one side (left or right) as possible, with the leaves on the other side. From a computer programming standpoint, it turns out that there are simple algorithms for assigning the codes without explicitly building such a tree. The tree can then be constructed from the codes, if needed.

Without loss of generality, let’s say we’ll assign codes in the natural order of the Huffman codes. By that, I mean numerical order, if we were to first right-pad the codes with zeroes until they’re all the same length. The code consisting of all 0s will be assigned first.

That still leaves us with four different not-too-unreasonable ways to do it. We have two decisions to make:

  1. We can assign the shortest codes first, or the longest codes first.
  2. For a given code size (i.e., a given tree level, or “row”), we can assign the symbols in their natural ascending order (for our codes of length 4, that’s A, B, G), or the reverse of that order (G, B, A).

Shortest codes first

If we assign the shortest codes first, we get a tree in which the leaves are left-aligned:

If we then put the symbols in each row in ascending order, we get the following assignments:

0     = E
100   = C
101   = D
1100  = A
1101  = B
1110  = G
11110 = F
11111 = H

When someone talks about a canonical Huffman code, this is probably the coding system they’re referring to. It might be wrong for me to refer to anything else as a canonical Huffman code, but what am I supposed to call a Huffman code that, while canonicalized, is not this one? I think of this as the canonical canonical Huffman code, or the orthodox canonical Huffman code.

This coding strategy is used by many formats. Some that I know of are Deflate (ZIP/Deflate, PNG, Gzip, zlib, etc.), JPEG, and LHA’s “lh5” compression.

I don’t know of any formats that do it this way, but with the symbols in each row in descending order.

Longest codes first

If we assign the longest codes first, we get a tree in which the branches are left-aligned.

Putting the symbols in each row in ascending order, we get:

00000 = F
00001 = H
0001  = A
0010  = B
0011  = G
010   = C
011   = D
1     = E

The ancient Unix file compressor named Pack (.z) does it this way.

The PKZIP 1.x “implode” compression method also has the longest codes start with 0, but it assigns the symbols in each row in descending order, like so:

00000 = H
00001 = F
0001  = G
0010  = B
0011  = A
010   = D
011   = C
1     = E

I guess there’s some logic to reversing the symbols, since that way the all-0 code gets assigned to the symbol that sorts last. This is equivalent to the orthodox code, but with all the bits inverted (0 becomes 1, 1 becomes 0)

Other canonical-like formats

I was looking at an old Unix file compressor called “Freeze / Melt”. It uses adaptive Huffman coding, but one part of its algorithm uses a non-adaptive Huffman codebook that can be stored in the compressed file. The codebook consumes just 21 bits.

I’m not sure I understand how it works, but I think it can be that small primarily because, due to the nature of the symbols it’s compressing, the software assumes it knows the approximate relative frequencies of the symbols. It’s not trying to be fully optimized. So it only needs to store the number of symbols having each bit length, not which symbols have that bit length.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s