Notes on PKLITE format, Part 7: v1.20 compression

This is a continuation of my series on the PKLITE executable compression format for DOS. For a list of other posts, see the first post. In particular, Part 3 is an important prerequisite.

In a previous post, I named a then-unknown compression scheme “PKLC-U”. In this post, I’ll call it “v1.20 compression”. I’ll refer to all non-v1.20 compression schemes as “normal”.

This post will explain the v1.20 compression scheme. It will not cover all of the (difficult) work you have to do to figure out the compression parameters. For what it’s worth, I’m working on a utility to do that automatically.

I thank Sergei Kolzun for figuring out the critical parts of v1.20 compression, and providing the information to me. I would probably have never figured it out myself.

Overview

“V1.20” files were created by various unreleased special versions of PKLITE. These special versions were presumably only used internally by PKWARE, the makers of PKLITE. I’m fairly confident that there was no PKLITE version 1.20 (or 1.10, or 1.11) release. Every legitimate “v1.20” file that I’ve found can be traced back to internal PKWARE software.

V1.20 compression is very similar to normal compression. But it uses different Huffman code tables, and there are two new special codes: one for match-length 2 with larger offsets, and one that is an alternate/optimized form of a literal 0 byte.

As with normal compression, v1.20 has two modes: “small” and “large”. V1.20 format always uses the features of “extra compression”. A few of the newer v1.20 files have an additional feature I’ll call “obfuscated offsets” — more on that later. In almost all v1.20 files, the decompressor is “scrambled”, using the “ADD” algorithm covered in Supplement 1. A few of the oldest files are not scrambled. V1.20 files never use the “XOR” scrambling algorithm, and non-v1.20 files never use “ADD”.

It might have been more correct to name the scheme “v1.10 compression”, since the earliest such files use version number 1.10. But the majority use version number 1.20, and that’s what I’m going with.

A few files are misleadingly labeled “v1.20” but actually just use normal compression. An example is PKLITE.EXE from PKLITE v1.50.

But, in most cases, v1.20 files are labeled accurately. At offset 28, you’ll usually find the bytes 0x14 0x11 (for small mode v1.20), or 0x14 0x31 (for large mode v1.20).
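
As a rough illustration, the label check could be sketched like this in Python. The function name `v120_label` is made up for this example, and a positive result is only a hint: as noted above, some files carry the label but use normal compression.

```python
def v120_label(header: bytes):
    """Return "small" or "large" if bytes 28-29 carry a v1.20 label, else None.

    Hypothetical helper: it only inspects the two label bytes and says
    nothing about which compression scheme the file really uses.
    """
    if len(header) < 30:
        return None
    if header[28] != 0x14:  # 0x14 = 20, the "20" in "1.20"
        return None
    # Second byte as given in the text: 0x11 = small mode, 0x31 = large mode.
    return {0x11: "small", 0x31: "large"}.get(header[29])
```

Any other byte pair returns `None`, which should be read as "not labeled v1.20", not as "not v1.20".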

There are, unfortunately, many different varieties of v1.20 files. But at least the core compression scheme seems to have been quite stable.

Caution: Since I have no way to create new v1.20 files, it’s difficult to be sure that I’ve figured out how to decompress them correctly.

Relocation table compression

For v1.20 files, the relocation table is compressed in one of two ways: either the usual method for files with “extra compression”, or a slight variation of it. The format used seems to correlate with whether the file uses “scrambling”.

For v1.20 files that are not scrambled, the usual format for “extra compression” files is used.

For v1.20 files that are scrambled, there is a difference: the bytes in the two-byte “OFFSET” fields are swapped. Equivalently, they use big-endian byte order, instead of the usual little-endian order.

Another way to look at it is that the swapped relocation table format is used if, and only if, the file is scrambled with the ADD method.
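
Here is a minimal sketch of reading one two-byte “OFFSET” field under that rule. It covers only the byte order; the rest of the relocation table layout is described in Part 3 and is not reproduced here. The name `read_reloc_offset` and its parameters are assumptions for illustration.

```python
def read_reloc_offset(field: bytes, scrambled_add: bool) -> int:
    """Decode one 2-byte OFFSET field from the compressed relocation table.

    scrambled_add: True if the decompressor is scrambled with the ADD
    method, in which case the two bytes are swapped (big-endian).
    """
    if len(field) != 2:
        raise ValueError("OFFSET field must be exactly 2 bytes")
    return int.from_bytes(field, "big" if scrambled_add else "little")
```

For example, the offset 0x1234 would be stored as `34 12` in an unscrambled file and as `12 34` in an ADD-scrambled one.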

Code image compression

I’m going to present both normal and v1.20 compression, so that they can be compared. I’m making a small change to how I present normal compression: I now treat the “match length=2” code as a special code, as opposed to having special logic when the match length is 2.

Huffman codebooks

Here are the small mode “match lengths, etc.” codebooks:

normal-sm  v1.20-sm   Decodes to
---------  --------   ----------
00         11         ml=3 (match length = 3 bytes)
100        000        ml=4
101        0100       ml=5
1100       0101       ml=6
1101       01110      ml=7
1110       011110     ml=8
1111       011111     ml=9
011        0110       special-code-1: long match lengths, etc.
010        10         special-ml2-1: ml=2, offset = 0 to 255
n/a        0011       special-ml2-2: ml=2, offset = 256 to 511
n/a        0010       special-lit0: literal 0x00 byte
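
One way to sanity-check a transcription of these tables is to confirm that each codebook is a complete prefix code: no code is a prefix of another, and the code lengths use up the whole code space (Kraft sum of exactly 1). The sketch below does that for the two small-mode codebooks. The dictionary and function names are mine, and codes are written as strings with the first-read bit on the left, as in the table.

```python
from fractions import Fraction

# Small-mode "match lengths, etc." codebooks, copied from the table above.
NORMAL_SMALL = {
    "00": 3, "100": 4, "101": 5, "1100": 6, "1101": 7, "1110": 8, "1111": 9,
    "011": "special-code-1", "010": "special-ml2-1",
}
V120_SMALL = {
    "11": 3, "000": 4, "0100": 5, "0101": 6, "01110": 7,
    "011110": 8, "011111": 9,
    "0110": "special-code-1", "10": "special-ml2-1",
    "0011": "special-ml2-2", "0010": "special-lit0",
}

def is_prefix_free(codes) -> bool:
    # After sorting, a code that is a prefix of any other code is also a
    # prefix of its immediate successor, so checking neighbors is enough.
    ordered = sorted(codes)
    return not any(b.startswith(a) for a, b in zip(ordered, ordered[1:]))

def kraft_sum(codes) -> Fraction:
    # Sum of 2^-length over all codes; exactly 1 for a complete prefix code.
    return sum(Fraction(1, 2 ** len(c)) for c in codes)
```

Both small-mode tables pass both checks, so every possible bit sequence decodes to exactly one entry.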

Here are the large mode “match lengths, etc.” codebooks:

normal-lg  v1.20-lg   Decodes to
---------  --------   ----------
11         11         ml=3 (match length = 3 bytes)
000        000        ml=4
0010       0101       ml=5
0011       0110       ml=6
0100       00110      ml=7
01010      00111      ml=8
01011      001000     ml=9
01100      001001     ml=10
011010     0100000    ml=11
011011     0100001    ml=12
0111010    0100010    ml=13
0111011    0100011    ml=14
0111100    01001000   ml=15
01111010   01001001   ml=16
01111011   01001010   ml=17
01111100   010010110  ml=18
011111010  010010111  ml=19
011111011  n/a        ml=20
011111100  n/a        ml=21
011111101  n/a        ml=22
011111110  n/a        ml=23
011111111  n/a        ml=24
011100     010011     special-code-1: long match lengths, etc.
10         10         special-ml2-1: ml=2, offset = 0 to 255
n/a        0111       special-ml2-2: ml=2, offset = 256 to 511
n/a        00101      special-lit0: literal 0x00 byte

Here are the “offset high bits” codebooks:

normal     v1.20      Decodes to
---------  --------   ----------
1          1          0x00__
0000       000        0x01__
0001       00100      0x02__
00100      00101      0x03__
00101      00110      0x04__
00110      00111      0x05__
00111      010000     0x06__
010000     010001     0x07__
010001     010010     0x08__
010010     010011     0x09__
010011     010100     0x0a__
010100     010101     0x0b__
010101     0101100    0x0c__
010110     0101101    0x0d__
0101110    0101110    0x0e__
0101111    0101111    0x0f__
0110000    0110000    0x10__
0110001    0110001    0x11__
0110010    0110010    0x12__
0110011    0110011    0x13__
0110100    0110100    0x14__
0110101    0110101    0x15__
0110110    0110110    0x16__
0110111    0110111    0x17__
0111000    0111000    0x18__
0111001    0111001    0x19__
0111010    0111010    0x1a__
0111011    0111011    0x1b__
0111100    0111100    0x1c__
0111101    0111101    0x1d__
0111110    0111110    0x1e__
0111111    0111111    0x1f__
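
A table-driven decoder for these codebooks can be very simple: read one bit at a time and stop as soon as the accumulated bits match a code. The following is a sketch using the v1.20 “offset high bits” table. It assumes the bits arrive from some bit-buffer reader (see Part 3) as an iterator of 0/1 values in read order; `read_code` and the table names are hypothetical.

```python
# v1.20 "offset high bits" codes; the list index is the decoded value (0x00-0x1f).
V120_OFFSET_HI_CODES = [
    "1", "000", "00100", "00101", "00110", "00111",
    "010000", "010001", "010010", "010011", "010100", "010101",
    "0101100", "0101101", "0101110", "0101111",
    "0110000", "0110001", "0110010", "0110011",
    "0110100", "0110101", "0110110", "0110111",
    "0111000", "0111001", "0111010", "0111011",
    "0111100", "0111101", "0111110", "0111111",
]
V120_OFFSET_HI = {code: value for value, code in enumerate(V120_OFFSET_HI_CODES)}

def read_code(bits, codebook):
    """Consume bits (an iterator of 0/1, in read order) until they spell a code."""
    max_len = max(len(code) for code in codebook)
    code = ""
    for bit in bits:
        code += "1" if bit else "0"
        if code in codebook:
            return codebook[code]
        if len(code) >= max_len:
            raise ValueError("bit sequence is not in the codebook")
    raise EOFError("ran out of bits in the middle of a code")
```

The same function works for the “match lengths, etc.” codebooks, since they are prefix codes too. A real decompressor would probably use a faster lookup, but this form is easy to compare against the tables.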

Decompression algorithm

Here’s the updated “code image” decompression algorithm:

START:
 - Let MATCH_LENGTH_BIAS = 10 for small mode, 25 for normal
   large mode, or 20 for v1.20 large mode.
 - Fill the bit buffer.
 - Continue to MAIN_LOOP.

MAIN_LOOP:
 - Read a bit.
 - If the bit is 0, go to LITERAL.
 - If the bit is 1, go to MATCH-LEN-ETC.

LITERAL:
 - Read a byte (N).
 - If the file uses "extra" compression, let N = N xor {the number
   of bits currently in the bit buffer}. That bit count will be a
   number from 0x01 to 0x10.
 - Process N as a literal byte in the usual LZ77 manner. (Emit it to
   the output stream, and append it to the history buffer.)
 - Go to MAIN_LOOP.

MATCH-LEN-ETC:
 - Read a value (M) using the "match-lengths" codebook (via the bit
   buffer).
 - If M is "special-code-1", go to MATCH-LEN-SPECIAL-CODE-1.
 - If M is "special-ml2-1", let high-5-bits-of-offset = 0. Go to
   OFFSET-LO.
 - If M is "special-ml2-2", let high-5-bits-of-offset = 1. Go to
   OFFSET-LO.
 - If M is "special-lit0", treat this as a literal 0x00 byte (as in
   the LITERAL section, without the "xor" step). Go to MAIN_LOOP.
 - Otherwise, let match_length = M, and go to OFFSET-HI.

MATCH-LEN-SPECIAL-CODE-1:
 - Read a byte (N).
 - If N≤252 (0xfc), let match_length = N + MATCH_LENGTH_BIAS. Go to
   OFFSET-HI.
 - If N=0xff, STOP. The decompression completed normally.
 - If N=0xfe and mode=large, do nothing, and go to MAIN_LOOP.
 - If N=0xfd and mode=large, or N=0xfe and mode=small, I think
   this is a special code for an uncompressed region. Unless you
   know how to handle it, ERROR(UNSUPPORTED_FEATURE).
 - Otherwise, ERROR.

OFFSET-HI:
 - Read high-5-bits-of-offset using the offsets codebook.
 - Continue to OFFSET-LO.

OFFSET-LO:
 - Read a byte (low-8-bits-of-offset).
 - If "offsets obfuscation" is enabled, xor low-8-bits-of-offset with
   the key.
 - Combine high-5-bits-of-offset and low-8-bits-of-offset to get an
   offset from 0 to 8191.
 - If offset is 0, ERROR.
 - If offset is larger than the number of output bytes that have
   been decompressed so far, ERROR.
 - Use offset and match_length in the standard LZ77 manner, to read
   and emit a sequence of bytes from history. Offset 1 refers to the
   most recently decompressed byte, 2 is the second-most recent, and
   so on.
 - Go to MAIN_LOOP.
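
To make two of the steps above more concrete, here is a sketch of the MATCH-LEN-SPECIAL-CODE-1 dispatch and the OFFSET-LO copy. It is not a full decompressor: the bit buffer and codebook reading are left out, and the function names, return values, and the `key` parameter are assumptions made for this example.

```python
def special_code_1(n: int, large: bool, bias: int):
    """Classify the byte N read in MATCH-LEN-SPECIAL-CODE-1.

    bias is MATCH_LENGTH_BIAS: 10 (small), 25 (normal large), 20 (v1.20 large).
    Returns an (action, match_length) pair.
    """
    if n <= 0xFC:
        return ("match", n + bias)   # continue to OFFSET-HI
    if n == 0xFF:
        return ("stop", None)        # normal end of compressed data
    if n == 0xFE and large:
        return ("noop", None)        # do nothing, back to MAIN_LOOP
    if (n == 0xFD and large) or (n == 0xFE and not large):
        return ("unsupported", None) # possibly an uncompressed region
    return ("error", None)

def copy_match(out: bytearray, hi5: int, lo8: int, length: int, key=None):
    """OFFSET-LO: build the offset and copy `length` bytes from history."""
    if key is not None:              # "offsets obfuscation" enabled
        lo8 ^= key
    offset = (hi5 << 8) | lo8        # 13 bits: 0 to 8191
    if offset == 0 or offset > len(out):
        raise ValueError("invalid offset")
    for _ in range(length):
        # Byte-by-byte, so a match may overlap the bytes it is producing.
        out.append(out[-offset])
```

For instance, with `out` holding `b"ab"`, calling `copy_match(out, 0, 2, 4)` extends it to `b"ababab"`, because the copy is allowed to run past the original end of the history.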

Offsets obfuscation

A few v1.20 files use “offsets obfuscation”. The only such files that I know of are:

  • Self-extracting ZIP files made by ZIP2EXE from PKZIP v2.50, free edition only. (key=0x98)
  • PKZFIND.EXE from PKZFIND/PKZOOM v1.50 (look for PKZF15.EXE or PKZF15.ZIP). (key=0x02)
  • PKZOOM.EXE from PKZFIND/PKZOOM v1.50. (key=0xd7)

Notes on the start of compressed data

I think I’ve figured out where in PKLITE files the “start of compressed data” pointer is stored, for all non-beta versions of the format. But it’s complicated, and I won’t try to explain it in this post. I do have some comments relevant to v1.20, though.

For small mode v1.20 files, unlike all other PKLITE formats, the compressed data does not have to start on a 16-byte boundary. A common starting offset is 510 bytes from the beginning of the file.

As a point of trivia, for normal compression, the first byte of compressed data is always an even number (disregarding some theoretical edge cases involving special codes). That’s because the first code in the compressed data-stream has to be for a “literal”, as there is no history to copy from. But for v1.20, the first byte can be, and often is, an odd number ending in the hex digit 9. That’s due to the existence of a special code for a literal 0.
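
The “ending in 9” observation can be checked with a little bit arithmetic. This sketch assumes bits are consumed starting from the least significant bit of the first byte, which is consistent with a leading “literal” bit of 0 producing an even byte. The helper name `pack_lsb_first` is made up.

```python
def pack_lsb_first(bits) -> int:
    """Pack bits, listed in the order they are read, with the first bit in bit 0."""
    value = 0
    for position, bit in enumerate(bits):
        value |= bit << position
    return value

# A leading 1 selects MATCH-LEN-ETC, then comes the special-lit0 code.
small_start = pack_lsb_first([1] + [0, 0, 1, 0])     # v1.20-sm lit0 = 0010
large_start = pack_lsb_first([1] + [0, 0, 1, 0, 1])  # v1.20-lg lit0 = 00101

print(hex(small_start & 0x0F), hex(large_start & 0x0F))
```

In both modes the low four bits come out as 1001, so the first byte is odd and its low hex digit is 9 whenever the stream starts with the special literal-0 code.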

With that in mind, I want to correct a wrong guess I made in one of my previous posts, in an annotated hex dump of a self-extracting ZIP file. Here’s the corrected version:
