ARC/PAK’s “Distilled” compression scheme

PAK is an old file compression and archiving program for DOS, developed by NoGate Consulting. (Search the web for “pak251.exe”.) It has a number of features, which include some extensions to ARC format. One such ARC extension is compression method #11, named “Distilled”. It was introduced in PAK v2.0, released in July 1989.

Unlike my posts on LHARK and ARC/Trimmed, this one is not about reverse engineering. Well, there might be a little reverse engineering. But public source code to decompress Distilled format already exists, as part of the software called The Unarchiver. What I haven’t found is any technical documentation of the format, so that’s what I’ll attempt to provide here.

This documentation is entirely unofficial, and may be incorrect or incomplete. The terminology is my own.

This document assumes you have a basic understanding of LZ77 decompression, and of how to interpret Huffman-coded data.

Preliminary notes

Some notes on Distilled compression:

  • It is not segmented. Given the type of Huffman coding it uses, that means it may not be very good at compressing very large files.
  • Its most distinctive characteristic is that the way offsets are encoded depends on how many bytes have been decompressed so far.
  • PAK doesn’t seem to have an option to force it to use Distilled compression. The best you can do is to use the /O option, causing PAK to choose between Distilled and non-compressed. Distilled has a fairly large overhead, so PAK is inclined to store very small files non-compressed. This makes the format a bit more difficult to investigate than it should be.
  • I speculate that it is vaguely related to the classic LZHUF.C source code by Haruyasu Yoshizaki, though LZHUF.C uses an adaptive Huffman coding algorithm that is entirely different from Distilled. (There are many different versions of LZHUF.C floating around. You may find it in a file named OKUMURA.ZIP or LZHSRC10.ZIP.)

Format overview

Distilled is a combination of LZ77, and Huffman-like prefix coding (I’ll just call it “Huffman coding”). Two Huffman codebooks are used. One is stored at the beginning of the compressed data, while the other is hard-coded and is always the same. No adaptive Huffman coding is used. Distilled is in the same general class of compression schemes as ZIP’s “Deflate” scheme, and LHA’s “-lh5-” scheme, though it’s a bit simpler than those.

The compressed data is bit-oriented, and consists of a mix of Huffman codes, and unsigned integers. The bit order is least-significant bit first. In such a format, Huffman codes are stored with their bits in an order that might seem like it’s backwards. You should think of a Huffman code as a sequence of 1-bit integers, to be read one at a time, not as a single multi-bit integer.
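
To make that concrete, here’s a minimal sketch, in Python, of the kind of bit reader needed (the names are mine, not anything from PAK). The way multi-bit integers are assembled in read_uint is my assumption, not something I’ve verified.

# A minimal LSB-first bit reader (all names are mine, not PAK's).
class BitReader:
    def __init__(self, data):
        self.data = data   # bytes containing the compressed stream
        self.pos = 0       # index of the current byte
        self.bit = 0       # next bit to consume within that byte (0 = LSB)

    def read_bit(self):
        b = (self.data[self.pos] >> self.bit) & 1
        self.bit += 1
        if self.bit == 8:
            self.bit = 0
            self.pos += 1
        return b

    def read_uint(self, nbits):
        # Assemble an unsigned integer, first bit read = least-significant bit.
        # (That assembly order is my assumption; I haven't verified it.)
        value = 0
        for i in range(nbits):
            value |= self.read_bit() << i
        return value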

The compressed data consists of a “codebook definition” segment, followed by a “compressed data codes” segment.

The codebook definition segment

The codebook definition segment defines the Huffman codebook used for codes (mainly literals and match-lengths). It is similar to that used by another ARC compression method, “Squeezed” (method #4). Similar, but different in several ways. What it’s practically identical to is the internal data structures in the LZHUF.C source code.

It is structured as follows:

Field name      Field type                                              Remarks
num_nodes       16-bit integer                                          Known valid values are the even numbers from 2 to 628.
bits_per_node   8-bit integer                                           Known valid values are 9 and 10.
node_table      array of num_nodes integers, bits_per_node bits each

The 628 maximum node count is derived from 315 being the maximum number of codes. When constructing a Huffman tree in this fashion, the very first node you create gives you two leaves, and every node you add after that gives you one additional leaf. So you need 314 nodes to make 315 leaves. Each node uses two table entries, and 314×2 is 628.

The items in node_table are used in pairs. The first item in the pair tells what to do if a “0” bit is encountered, and the second is for a “1” bit.

The node_table values are interpreted as follows:

  • 0, 2, 4, … num_nodes−2: A pointer to the first item of a node pair.
  • num_nodes to num_nodes+314: Signifies a leaf node. Subtract num_nodes to get a number from 0 to 314, which is a code for a literal, match-length, or stop code.

Other values are presumed to be invalid.

The root node pair is the last one in the table, at index num_nodes−2 (and num_nodes−1).
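
Here’s a sketch of how a decoder might read this segment and then decode one code by walking node_table. It reuses the BitReader sketch from earlier; the function names are mine, and I’m assuming the num_nodes and bits_per_node fields are simply read as 16 and 8 bits from the same LSB-first bitstream.

# Sketch: parse the codebook definition segment, then decode one code.
# 'br' is a BitReader like the earlier sketch; all names are mine.

def read_codes_codebook(br):
    num_nodes = br.read_uint(16)       # even number, 2 to 628
    bits_per_node = br.read_uint(8)    # 9 or 10 in the files I've seen
    node_table = [br.read_uint(bits_per_node) for _ in range(num_nodes)]
    return num_nodes, node_table

def decode_code(br, num_nodes, node_table):
    pos = num_nodes - 2                          # root pair is the last one in the table
    while True:
        entry = node_table[pos + br.read_bit()]  # first item: "0" bit; second item: "1" bit
        if entry >= num_nodes:
            code = entry - num_nodes
            if code > 314:
                raise ValueError("bad leaf value")
            return code                          # 0..314: literal, match-length, or stop
        pos = entry                              # even value 0..num_nodes-2: next node pair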

The offsets codebook

Below is the hard-coded Huffman codebook used in the calculation of offsets. It gives you the high 6 bits of the match-offset.

While it’s clearly not random, I can’t figure out a simple way to explain the pattern. I can see that some parts of the codes, if left-right reversed, seem to be counting up in binary. But there’s more to it than that.

The code lengths, in bits, are identical to the p_len table in LZHUF.C. I thought the codes might be derived from the corresponding p_code table, and maybe they are, but I can’t figure out how.

000    = 0    100110  = 16   1101010 = 32   11110000 = 48
0100   = 1    101110  = 17   1100110 = 33   11111000 = 49
0010   = 2    101001  = 18   1110110 = 34   11110100 = 50
0011   = 3    100101  = 19   1101110 = 35   11111100 = 51
10000  = 4    101101  = 20   1100001 = 36   11110010 = 52
01100  = 5    101011  = 21   1110001 = 37   11111010 = 53
01010  = 6    100111  = 22   1101001 = 38   11110110 = 54
01110  = 7    101111  = 23   1100101 = 39   11111110 = 55
10001  = 8    1100000 = 24   1110101 = 40   11110001 = 56
01101  = 9    1110000 = 25   1101101 = 41   11111001 = 57
01011  = 10   1101000 = 26   1100011 = 42   11110101 = 58
01111  = 11   1100100 = 27   1110011 = 43   11111101 = 59
101000 = 12   1110100 = 28   1101011 = 44   11110011 = 60
100100 = 13   1101100 = 29   1100111 = 45   11111011 = 61
101100 = 14   1100010 = 30   1110111 = 46   11110111 = 62
101010 = 15   1110010 = 31   1101111 = 47   11111111 = 63
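
For completeness, here’s the same codebook transcribed into a Python dictionary, along with a naive bit-at-a-time decoder (the names are mine). I’m assuming the leftmost bit of each code, as printed above, is the first bit read from the stream; the reversed reading wouldn’t be a prefix-free code.

# The hard-coded offsets codebook, transcribed from the table above.
# Keys are code bits in the order read from the stream (leftmost bit first,
# which is my reading of the table); values are the high 6 bits of the offset.
OFFSET_HIGH_CODES = {
    "000": 0,       "0100": 1,      "0010": 2,      "0011": 3,
    "10000": 4,     "01100": 5,     "01010": 6,     "01110": 7,
    "10001": 8,     "01101": 9,     "01011": 10,    "01111": 11,
    "101000": 12,   "100100": 13,   "101100": 14,   "101010": 15,
    "100110": 16,   "101110": 17,   "101001": 18,   "100101": 19,
    "101101": 20,   "101011": 21,   "100111": 22,   "101111": 23,
    "1100000": 24,  "1110000": 25,  "1101000": 26,  "1100100": 27,
    "1110100": 28,  "1101100": 29,  "1100010": 30,  "1110010": 31,
    "1101010": 32,  "1100110": 33,  "1110110": 34,  "1101110": 35,
    "1100001": 36,  "1110001": 37,  "1101001": 38,  "1100101": 39,
    "1110101": 40,  "1101101": 41,  "1100011": 42,  "1110011": 43,
    "1101011": 44,  "1100111": 45,  "1110111": 46,  "1101111": 47,
    "11110000": 48, "11111000": 49, "11110100": 50, "11111100": 51,
    "11110010": 52, "11111010": 53, "11110110": 54, "11111110": 55,
    "11110001": 56, "11111001": 57, "11110101": 58, "11111101": 59,
    "11110011": 60, "11111011": 61, "11110111": 62, "11111111": 63,
}

def decode_offset_high(br):
    # Read bits one at a time until they match one of the codes above.
    bits = ""
    while bits not in OFFSET_HIGH_CODES:
        bits += str(br.read_bit())
        if len(bits) > 8:
            raise ValueError("invalid offset code")
    return OFFSET_HIGH_CODES[bits]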

The compressed data codes segment

Note that there are no padding bits after the codebook definition segment, so this segment may start in the middle of a byte.

To decompress, repeatedly read a code using the codes codebook, and follow the instructions according to its value.

If the code is from 0 to 255, it is a literal byte. Emit it to the decompressed datastream. (As with all decompressed data, also store it in the history buffer, as one does with LZ77.)

If the code is 256, stop.

If the code is 315 or higher, that’s an error. But you should have caught such an error when reading the codebook definition segment.

If the code is from 257 to 314, do the following:

  • Subtract 254 from it, and use it as the match-length (giving match-lengths from 3 to 60).
  • Read an offset code using the offsets codebook. This will be an integer from 0 to 63, representing the high 6 bits of the offset.
  • Let b be the number of low bits of the offset. Calculate b (see below), then read a b-bit integer directly from the compressed datastream.
  • Combine the high and low bits of the offset, to get an offset from 0 to 8191. (Add 1 to it if you expect it to be in the range 1–8192.)
  • Use the match-length and offset to copy some bytes from the history buffer to the decompressed datastream, in the usual LZ77 way.

To calculate b, let H be 60 plus the number of bytes written so far to the decompressed stream. Then use the following table:

H               b
60 to 63        0
64 to 127       1
128 to 255      2
256 to 511      3
512 to 1023     4
1024 to 2047    5
2048 to 4095    6
4096+           7
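
In code, the b calculation and the offset assembly might look like the following sketch (Python again, reusing the earlier helpers; the names are mine). Exactly how the high and low parts are combined, and how the b low bits are assembled into an integer, are my assumptions about the natural reading of “combine the high and low bits”.

# Sketch of the match-offset handling, reusing the earlier helpers
# (BitReader, decode_offset_high). All names are mine.

def calc_b(bytes_written):
    # b per the table above, with H = 60 + bytes written so far.
    h = 60 + bytes_written
    b = 0
    while b < 7 and h >= (64 << b):
        b += 1
    return b

def read_match_offset(br, bytes_written):
    high = decode_offset_high(br)     # 0..63: high 6 bits of the offset
    b = calc_b(bytes_written)
    low = br.read_uint(b)             # b low bits, read directly from the stream
    offset = (high << b) | low        # 0..8191 (my reading of "combine")
    return offset + 1                 # 1..8192, as a backward distance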

Caution: I’ve never seen PAK use the last five encodable offset codes (8187 through 8191). I did a crude test to see what its decompressor did with code 8191, and it worked as expected. But it’s possible that these five codes are special in some way.

LZ77 details

The LZ77 part of the algorithm uses an 8K (8192-byte) history buffer. Perhaps only 8187 bytes are actually used by PAK’s compressor, but let’s call it 8192.

“Prehistory” offsets that refer to nonexistent bytes before the beginning of the file are sometimes used. These virtual bytes are assumed to be spaces (byte value 0x20). Refer to my LZ77 prehistory post for a discussion of this general issue.

Initializing the history buffer with spaces is another thing I suspect was inherited from LZHUF.C.
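
Putting the pieces together, here’s a sketch of the history-buffer handling and the overall decompression loop, reusing the helpers from the earlier sketches. This is entirely my own reconstruction, and unverified against PAK itself.

# Sketch of the history buffer and the overall decompression loop, reusing
# the earlier helpers (BitReader, read_codes_codebook, decode_code,
# read_match_offset). All names are mine.

HISTORY_SIZE = 8192

def decompress(data):
    br = BitReader(data)
    num_nodes, node_table = read_codes_codebook(br)
    hist = bytearray(b" " * HISTORY_SIZE)   # prehistory is all spaces (0x20)
    pos = 0                                  # next write position in the ring buffer
    out = bytearray()

    while True:
        code = decode_code(br, num_nodes, node_table)
        if code <= 255:                      # literal byte
            out.append(code)
            hist[pos] = code
            pos = (pos + 1) % HISTORY_SIZE
        elif code == 256:                    # stop code
            return bytes(out)
        else:                                # 257..314: a match
            length = code - 254              # 3..60
            distance = read_match_offset(br, len(out))
            src = (pos - distance) % HISTORY_SIZE
            for _ in range(length):          # copy byte by byte, so overlaps work
                byte = hist[src]
                out.append(byte)
                hist[pos] = byte
                pos = (pos + 1) % HISTORY_SIZE
                src = (src + 1) % HISTORY_SIZE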

Due to the maximum match length of 60, there’s no use for more than 60 bytes of prehistory, and the Distilled format seems to be engineered to make sure that at least 60 bytes of prehistory are always addressable. I’ve never actually seen PAK use more than three such bytes, though I didn’t try very hard to get it to use more.

Here is a 47-byte text file that PAK will compress in a way that uses prehistory:

AA   AA0123456789012345678901234567890123456789

PAK compresses it as:

literal: 'A'
literal: 'A'
match: offset=5, length=5
literal: '0'
...

The “match” instruction utilizes three virtual spaces of prehistory to “repeat” the “   AA” sequence, which hadn’t actually appeared before.

By my calculations, the minimum number of addressable bytes of prehistory is actually 61, instead of the expected 60. (For example, when 3 bytes have been written, H is 63 and b is still 0, so a match can reach 64 bytes back, 61 of which are prehistory.) In other words, the value of b is bumped up one byte earlier than necessary. I think either the format’s designers made an off-by-one error, or I did.

Unknowns

  • What is the maximum valid length of a Huffman code in the “codes” codebook? The longest I’ve seen is 15 bits, but I haven’t put any effort into testing this. The format doesn’t force there to be any real limit, though there should be no need for more than, say, 25 or 30 bits.
  • Are match lengths higher than 60 allowed? I very much doubt it, but I didn’t actually test PAK’s decompressor to see if it supports them. There’s nothing about the format that makes them impossible, but they seem never to be used by PAK’s compressor, and aren’t supported by The Unarchiver.
  • Do the last five offset codes have a special meaning, or are they reserved for future expansion, or what?
  • What is the most efficient way to “compress” the zero-length file? Theoretically, you only need one code, for the “stop” code. So, that code could be a zero-length Huffman code, and the node_table would have only one node. But unless PAK supports such a thing as a special case, there’s no way to encode it. The node_table would have to have (at least) two nodes, with only one being used.
