Notes on PKLITE format, Part 3

For an introduction, and a list of the other posts in this series, see Part 1.

Code image compression format

Prerequisites:

  • This section assumes you already understand how to decode (non-adaptive) Huffman-coded (or “prefix-coded”) data, and how to decompress LZ77-compressed data.
  • Please at least read the notes near the end of Part 2 of this series, for some important requirements and limitations.

The information here is mainly derived from a decompressor I wrote, with the help of information from depklite, ModdingWiki, other sources, and some original research.

The goal is to be able to decompress any file that PKLITE could actually create. No attempt is made to figure out how PKLITE’s decompressor would process a file that PKLITE would never create.


The PKLITE algorithm is a combination of LZ77, and Huffman coding with pre-defined codebooks. For what it’s worth, the LZ77 part is basically identical to that used by LZEXE, a slightly older executable compression utility.

Note that if you are constructing a decompressed EXE file, the relocation table needs to be written before the code image. But you have to decompress the code image first, sometimes before you even know how big the relocation table is. There are several ways to solve this chicken-and-egg problem, of course; I suggest that now is the best time to decide how you’re going to do it.


Some particulars about the LZ77 algorithm:

  • The minimum match length is 2.
  • The maximum match length is 262 (small mode) or 277 (large mode)
  • Valid offsets are 1 to 8191. Exception: For a match length of 2, valid offsets are 1 to 255.
  • An offset is not allowed to be larger than the number of bytes decompressed so far.

For a given file, two pre-defined Huffman codebooks are used: one that I’ll call “match-lengths”, and one that I’ll call “offsets”. The match-lengths codebook to use depends on whether the file uses “small”, or “large”, compression mode. There is only one offsets codebook.

Huffman codebooks

Here’s the match-lengths codebook for “small” mode:

Huffman code  Decodes to
------------  ----------
010           2 (i.e., match_length=2)
00            3
100           4
101           5
1100          6
1101          7
1110          8
1111          9
011           special

Here’s the match-lengths codebook for “large” mode:

Huffman code  Decodes to
------------  ----------
10            2 (i.e., match_length=2)
11            3
000           4
0010          5
0011          6
0100          7
01010         8
01011         9
01100         10
011010        11
011011        12
0111010       13
0111011       14
0111100       15
01111010      16
01111011      17
01111100      18
011111010     19
011111011     20
011111100     21
011111101     22
011111110     23
011111111     24
011100        special

Here’s the offsets codebook:

Huffman code  Decodes to
------------  ----------
1             0 (i.e., high-5-bits-of-offset=0)
0000          1
0001          2
00100         3
00101         4
00110         5
00111         6
010000        7
010001        8
010010        9
010011        10
010100        11
010101        12
010110        13
0101110       14
0101111       15
0110000       16
0110001       17
0110010       18
0110011       19
0110100       20
0110101       21
0110110       22
0110111       23
0111000       24
0111001       25
0111010       26
0111011       27
0111100       28
0111101       29
0111110       30
0111111       31

For the offsets, it may be that the format designers intended the first bit to be read as a separate flag, rather than being part of the Huffman codebook. Without the first bit, this would qualify as a “canonical” Huffman code.


For the LZ77 layer, the decompressor needs to remember at least the 8192 most-recently decompressed bytes, using a circular “history” buffer, or other suitable mechanism. (Actually, 8191 is sufficient, but using a power of 2 may be more efficient.)

The decompressor should implement what I’ll call the “bit buffer”, which can just be an unsigned integer of 16 or more bits, along with a bit count. Whenever the algorithm calls for reading a bit, use the least-significant bit in the bit buffer, right-shift the remaining bits, and decrease the bit count by 1.

The bit buffer “wants” to keep the bit count between 1 and 16 at all times. At the start of decompression, and any time the bit count reaches 0, immediately read two bytes of compressed data, and put them in the bit buffer. Use the first byte as the low-order bits, and the second as the high-order bits.

It is slightly suboptimal to preemptively fill the bit buffer like this, instead of filling it only on demand, but that is how PKLITE does it.

Sometimes the algorithm will call for reading a whole byte, instead of a bit (or a Huffman code made of bits). In that case, read the next byte of compressed data directly, without changing the bit buffer.

Pseudocode for code image decompression:

START:
 - Fill the bit buffer.
 - Go to MAIN_LOOP.

MAIN_LOOP:
 - Read a bit.
 - If the bit is 0, go to LITERAL.
 - If the bit is 1, go to MATCH-LEN.

LITERAL:
 - Read a byte (N).
 - If the file uses "extra" compression, let N = N xor {the number
   of bits currently in the bit buffer}. This will be a number from
   0x01 to 0x10.
 - Process N as a literal byte in the usual LZ77 manner. (Emit it to
   the output stream, and append it to the history buffer.)
 - Go to MAIN_LOOP.

MATCH-LEN:
 - Read a value (M) using the "match-lengths" codebook (via the bit
   buffer).
 - If M is the "special" code, go to MATCH-LEN-SPECIAL.
 - Otherwise, let match_length = M, and go to OFFSET.

MATCH-LEN-SPECIAL:
 - Read a byte (N).
 - If N≤252 (0xfc), let match_length = N + {10 if mode=small, 25
   if mode=large}. Go to OFFSET.
 - If N=0xff, STOP. The decompression completed normally.
 - If N=0xfe and mode=large, do nothing, and go to MAIN_LOOP.
 - If N=0xfd and mode=large, or N=0xfe and mode=small, I think
   this is a special code for an uncompressed region. Unless you
   know how to handle it, ERROR(UNSUPPORTED_FEATURE).
 - Otherwise, ERROR.

OFFSET:
 - If the match_length is 2, let high-5-bits-of-offset = 0.
 - If the match_length is not 2, read high-5-bits-of-offset using
   the offsets codebook.
 - Read a byte (low-8-bits-of-offset).
 - Combine high-5-bits-of-offset and low-8-bits-of-offset to get an
   offset from 0 to 8191.
 - If offset is 0, ERROR.
 - If offset is larger than the number of output bytes that have
   been decompressed so far, ERROR.
 - Use offset and match_length in the standard LZ77 manner, to read
   and emit a sequence of bytes from history. Offset 1 refers to the
   most recently decompressed byte, 2 is the second-most recent, and
   so on. (Note that this is unusual. In most LZ77 implementations,
   it's an offset encoded as 0 that refers to the most recently
   decompressed byte.)
 - Go to MAIN_LOOP.

Remarks about code image decompression

When reading a Huffman code, read it one bit at a time (via the bit buffer), until you have a code that exists in the relevant codebook. The first bit you read corresponds the leftmost digit in the codebooks given above.

If the decompression was successful, record the position of the next unread byte in the input file, as this is the position of the compressed relocation table.

If the PKLITEd file contains a copy of the original DOS header, then you can calculate the number of bytes that the compressed code image should decompress to, and trigger an error or warning if you didn’t get the right number. If not, then you can’t — or at least, I don’t know how to.

Relocation table compression

The relocation table is compressed using one of two algorithms, depending on whether “extra compression” was used. Both algorithms are low-tech RLE-style algorithms.

I’ve seen it suggested that the algorithm to use also depends on the PKLITE version, but I have not found that to be the case. But I have very few “extra compression” sample files from early versions of PKLITE to go by, and I could have made an error, so no promises.

To find the start of the compressed relocation table, you must first decompress the “code image” segment. After decompressing the relocation table, record the position of the next unread byte in the input file, as that is the position of the “footer” segment.

Relocation table with normal compression

Pseudocode for decompression:

START:
 - Go to MAIN_LOOP.

MAIN_LOOP:
 - Read a byte (C).
 - If C=0, STOP. The decompression completed normally.
 - Read two bytes (SEGMENT). (You can interpret it as a 2-byte
   little-endian integer, or just leave it as raw bytes.)
 - Loop C times:
   - Read two bytes (OFFSET), and emit them.
   - Emit SEGMENT as two bytes.
 - Go to MAIN_LOOP

Relocation table with extra compression

Note that PKLITE normalizes the relocation addresses by making the segment always be a multiple of 0x0fff. This is possible because DOS only cares about the calculated value SEGMENT×16 + OFFSET.

START:
 - Let SEGMENT=0
 - Go to MAIN_LOOP.

MAIN_LOOP:
 - Read a two-byte little-endian integer (C).
 - If C=0xffff, STOP. The decompression completed normally.
 - Loop C times:
   - Read two bytes (OFFSET), and emit them.
   - Emit SEGMENT, as a two-byte little-endian integer.
 - Let SEGMENT = SEGMENT + 0x0fff.
 - Go to MAIN_LOOP.

The next post in this series is tentatively planned to include some ideas for fingerprinting PKLITE versions, and various odds and ends.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s