This is a continuation of my series on the PKLITE executable compression format for DOS. For a list of other posts, see the first post. In particular, Part 3 is an important prerequisite.
In a previous post, I named a then-unknown compression scheme “PKLC-U”. In this post, I’ll call it “v1.20 compression”. I’ll refer to all non-v1.20 compression schemes as “normal”.
This post will explain the v1.20 compression scheme. It will not cover all of the (difficult) work you have to do to figure out the compression parameters. For what it’s worth, I’m working on a utility to do that automatically.
I thank Sergei Kolzun for figuring out the critical parts of v1.20 compression, and providing the information to me. I would probably have never figured it out myself.
Overview
“V1.20” files were created by various unreleased special versions of PKLITE. These special versions were presumably only used internally by PKWARE, the makers of PKLITE. I’m fairly confident that there was no PKLITE version 1.20 (or 1.10, or 1.11) release. Every legitimate “v1.20” file that I’ve found can be traced back to internal PKWARE software.
V1.20 compression is very similar to normal compression. But it uses different Huffman code tables, and there are two new special codes: one for match-length 2 with larger offsets, and one that is an alternate/optimized form of a literal 0 byte.
As with normal compression, v1.20 has two modes: “small” and “large”. V1.20 format always uses the features of “extra compression”. A few of the newer v1.20 files have an additional feature I’ll call “obfuscated offsets” — more on that later. In almost all v1.20 files, the decompressor is “scrambled”, using the “ADD” algorithm covered in Supplement 1. A few of the oldest files are not scrambled. V1.20 files never use the “XOR” scrambling algorithm, and non-v1.20 files never use “ADD”.
It might have been more correct to name the scheme “v1.10 compression”, since the earliest such files use version number 1.10. But the majority use version number 1.20, and that’s what I’m going with.
A few files are misleadingly labeled “v1.20”, but actually just use normal compression. An example is PKLITE.EXE from PKLITE v1.50.
But, in most cases, v1.20 files are labeled accurately. At offset 28, you’ll usually find the bytes 0x14 0x11 (for small mode v1.20), or 0x14 0x31 (for large mode v1.20).
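Here's a minimal Python sketch of that label check. The function name is made up for illustration, and, given the mislabeled files mentioned above, a match should be treated as a hint rather than proof that the file uses v1.20 compression.

```python
from typing import Optional

def v120_label(file_start: bytes) -> Optional[str]:
    """Return "small" or "large" if the two bytes at offset 28 carry the
    usual v1.20 label, else None. This is only a hint: some files are
    labeled v1.20 but use normal compression."""
    if len(file_start) < 30:
        return None
    label = file_start[28:30]
    if label == b"\x14\x11":
        return "small"
    if label == b"\x14\x31":
        return "large"
    return None
```

To use it, pass in at least the first 30 bytes of the file.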
There are, unfortunately, many different varieties of v1.20 files. But at least the core compression scheme seems to have been quite stable.
Caution: Since I have no way to create new v1.20 files, it’s difficult to be sure that I’ve figured out how to decompress them correctly.
Relocation table compression
For v1.20 files, the relocation table is compressed in one of two ways: either the usual method for files with “extra compression”, or a slight variation of it. The format used seems to correlate with whether the file uses “scrambling”.
For v1.20 files that are not scrambled, the usual format for “extra compression” files is used.
For v1.20 files that are scrambled, there is a difference: the bytes in the two-byte “OFFSET” fields are swapped. Equivalently, they use big-endian byte order, instead of the usual little-endian order.
Another way to look at it is that the swapped relocation table format is used if, and only if, the file is scrambled with the ADD method.
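In code, the difference comes down to the byte order used for each two-byte OFFSET field. A small sketch, with a hypothetical function name and flag; the rest of the relocation table format is as described in the earlier posts and isn't repeated here:

```python
def read_reloc_offset(two_bytes: bytes, add_scrambled: bool) -> int:
    """Decode one two-byte OFFSET field from the compressed relocation table.

    add_scrambled: True if the decompressor is scrambled with the ADD
    method, in which case the two bytes appear in swapped (big-endian) order.
    """
    if len(two_bytes) != 2:
        raise ValueError("OFFSET field must be exactly 2 bytes")
    return int.from_bytes(two_bytes, "big" if add_scrambled else "little")
```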
Code image compression
I’m going to present both normal and v1.20 compression, so that they can be compared. I’m making a small change to how I present normal compression: I now treat the “match length=2” code as a special code, as opposed to having special logic when the match length is 2.
Huffman codebooks
Here are the small mode “match lengths, etc.” codebooks:
normal-sm  v1.20-sm  Decodes to
---------  --------  ----------
00         11        ml=3 (match length = 3 bytes)
100        000       ml=4
101        0100      ml=5
1100       0101      ml=6
1101       01110     ml=7
1110       011110    ml=8
1111       011111    ml=9
011        0110      special-code-1: long match lengths, etc.
010        10        special-ml2-1: ml=2, offset = 0 to 255
n/a        0011      special-ml2-2: ml=2, offset = 256 to 511
n/a        0010      special-lit0: literal 0x00 byte
Here are the large mode “match lengths, etc.” codebooks:
normal-lg  v1.20-lg   Decodes to
---------  --------   ----------
11         11         ml=3 (match length = 3 bytes)
000        000        ml=4
0010       0101       ml=5
0011       0110       ml=6
0100       00110      ml=7
01010      00111      ml=8
01011      001000     ml=9
01100      001001     ml=10
011010     0100000    ml=11
011011     0100001    ml=12
0111010    0100010    ml=13
0111011    0100011    ml=14
0111100    01001000   ml=15
01111010   01001001   ml=16
01111011   01001010   ml=17
01111100   010010110  ml=18
011111010  010010111  ml=19
011111011  n/a        ml=20
011111100  n/a        ml=21
011111101  n/a        ml=22
011111110  n/a        ml=23
011111111  n/a        ml=24
011100     010011     special-code-1: long match lengths, etc.
10         10         special-ml2-1: ml=2, offset = 0 to 255
n/a        0111       special-ml2-2: ml=2, offset = 256 to 511
n/a        00101      special-lit0: literal 0x00 byte
Here are the “offset high bits” codebooks:
normal   v1.20    Decodes to
-------  -------  ----------
1        1        0x00__
0000     000      0x01__
0001     00100    0x02__
00100    00101    0x03__
00101    00110    0x04__
00110    00111    0x05__
00111    010000   0x06__
010000   010001   0x07__
010001   010010   0x08__
010010   010011   0x09__
010011   010100   0x0a__
010100   010101   0x0b__
010101   0101100  0x0c__
010110   0101101  0x0d__
0101110  0101110  0x0e__
0101111  0101111  0x0f__
0110000  0110000  0x10__
0110001  0110001  0x11__
0110010  0110010  0x12__
0110011  0110011  0x13__
0110100  0110100  0x14__
0110101  0110101  0x15__
0110110  0110110  0x16__
0110111  0110111  0x17__
0111000  0111000  0x18__
0111001  0111001  0x19__
0111010  0111010  0x1a__
0111011  0111011  0x1b__
0111100  0111100  0x1c__
0111101  0111101  0x1d__
0111110  0111110  0x1e__
0111111  0111111  0x1f__
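As a rough illustration of how a codebook like these gets used, here's a minimal Python sketch that decodes one symbol from the v1.20 small mode “match lengths, etc.” table. The names are mine. It assumes the codes are written in the order the bits are read (leftmost bit first), and that the bits have already been pulled out of the bit buffer described in Part 3.

```python
from typing import Dict, Iterator, Union

Symbol = Union[int, str]

# v1.20 small mode "match lengths, etc." codebook, copied from the table.
# Integer values are match lengths; strings are the special codes.
V120_SMALL: Dict[str, Symbol] = {
    "11": 3, "000": 4, "0100": 5, "0101": 6,
    "01110": 7, "011110": 8, "011111": 9,
    "0110": "special-code-1",
    "10": "special-ml2-1",
    "0011": "special-ml2-2",
    "0010": "special-lit0",
}

def decode_symbol(bits: Iterator[int], codebook: Dict[str, Symbol]) -> Symbol:
    """Consume bits one at a time until they spell out a code in the codebook."""
    longest = max(len(code) for code in codebook)
    code = ""
    for bit in bits:
        code += "1" if bit else "0"
        if code in codebook:
            return codebook[code]
        if len(code) >= longest:
            raise ValueError("bit sequence matches no code: " + code)
    raise ValueError("ran out of bits in the middle of a code")
```

A dictionary lookup per bit is slow but easy to check against the table. It works because no code is a prefix of another.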
Decompression algorithm
Here’s the updated “code image” decompression algorithm:
START:
- Let MATCH_LENGTH_BIAS = 10 for small mode, 25 for normal large mode, or 20 for v1.20 large mode.
- Fill the bit buffer.
- Continue to MAIN_LOOP.

MAIN_LOOP:
- Read a bit.
- If the bit is 0, go to LITERAL.
- If the bit is 1, go to MATCH-LEN-ETC.

LITERAL:
- Read a byte (N).
- If the file uses "extra" compression, let N = N xor {the number of bits currently in the bit buffer}. This will be a number from 0x01 to 0x10.
- Process N as a literal byte in the usual LZ77 manner. (Emit it to the output stream, and append it to the history buffer.)
- Go to MAIN_LOOP.

MATCH-LEN-ETC:
- Read a value (M) using the "match-lengths" codebook (via the bit buffer).
- If M is "special-code-1", go to MATCH-LEN-SPECIAL-CODE-1.
- If M is "special-ml2-1", let high-5-bits-of-offset = 0. Go to OFFSET-LO.
- If M is "special-ml2-2", let high-5-bits-of-offset = 1. Go to OFFSET-LO.
- If M is "special-lit0", treat this as a literal 0x00 byte (as in the LITERAL section, without the "xor" step). Go to MAIN_LOOP.
- Otherwise, let match_length = M, and go to OFFSET-HI.

MATCH-LEN-SPECIAL-CODE-1:
- Read a byte (N).
- If N≤252 (0xfc), let match_length = N + MATCH_LENGTH_BIAS. Go to OFFSET-HI.
- If N=0xff, STOP. The decompression completed normally.
- If N=0xfe and mode=large, do nothing, and go to MAIN_LOOP.
- If N=0xfd and mode=large, or N=0xfe and mode=small, I think this is a special code for an uncompressed region. Unless you know how to handle it, ERROR(UNSUPPORTED_FEATURE).
- Otherwise, ERROR.

OFFSET-HI:
- Read high-5-bits-of-offset using the offsets codebook.
- Continue to OFFSET-LO.

OFFSET-LO:
- Read a byte (low-8-bits-of-offset).
- If "offsets obfuscation" is enabled, xor low-8-bits-of-offset with the key.
- Combine high-5-bits-of-offset and low-8-bits-of-offset to get an offset from 0 to 8191.
- If offset is 0, ERROR.
- If offset is larger than the number of output bytes that have been decompressed so far, ERROR.
- Use offset and match_length in the standard LZ77 manner, to read and emit a sequence of bytes from history. Offset 1 refers to the most recently decompressed byte, 2 is the second-most recent, and so on.
- Go to MAIN_LOOP.
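Two pieces of this algorithm don't depend on the bit buffer, so they're easy to sketch in isolation: interpreting the byte that follows special-code-1, and the final LZ77 copy. This is a Python sketch with made-up function names and return values, not a full decompressor. It leaves out the bit buffer handling from Part 3 entirely.

```python
from typing import Optional, Tuple

def match_length_bias(large: bool, v120: bool) -> int:
    """MATCH_LENGTH_BIAS from the START step."""
    if not large:
        return 10
    return 20 if v120 else 25

def interpret_special_code_1(n: int, large: bool, v120: bool) -> Tuple[str, Optional[int]]:
    """Classify the byte N read in MATCH-LEN-SPECIAL-CODE-1.

    Returns ("match", match_length), ("stop", None), or ("continue", None).
    """
    if n <= 0xfc:
        return ("match", n + match_length_bias(large, v120))
    if n == 0xff:
        return ("stop", None)
    if n == 0xfe and large:
        return ("continue", None)
    if (n == 0xfd and large) or (n == 0xfe and not large):
        # Suspected "uncompressed region" code; not handled in this sketch.
        raise NotImplementedError("unsupported special code")
    raise ValueError("invalid byte after special-code-1")

def copy_match(out: bytearray, offset: int, match_length: int) -> None:
    """Append match_length bytes copied from `offset` bytes back in the output.

    Copying one byte at a time lets a match overlap the bytes it is
    producing, which is how LZ77 encodes repeated runs.
    """
    if offset == 0:
        raise ValueError("offset 0 is invalid")
    if offset > len(out):
        raise ValueError("offset points before the start of the output")
    for _ in range(match_length):
        out.append(out[-offset])
```

For example, starting from the two bytes "ab", a match with offset 2 and length 5 extends the output to "abababa".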
Offsets obfuscation
A few v1.20 files use “offsets obfuscation”, in which the low byte of each match offset is XORed with a one-byte key. The only such files that I know of are:
- Self-extracting ZIP files made by ZIP2EXE from PKZIP v2.50, free edition only. (key=0x98)
- PKZFIND.EXE from PKZFIND/PKZOOM v1.50 (look for PKZF15.EXE or PKZF15.ZIP). (key=0x02)
- PKZOOM.EXE from PKZFIND/PKZOOM v1.50. (key=0xd7)
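The effect on the OFFSET-LO step can be sketched in Python as follows. The function name is made up, and treating key=0 as “no obfuscation” is just a convenience of this sketch (XOR with 0 changes nothing), not something taken from the file format.

```python
def combine_offset(high5: int, low8: int, key: int = 0) -> int:
    """Build the 13-bit match offset from its two parts.

    high5: value decoded with the "offset high bits" codebook (0 to 31)
    low8:  the raw byte read in OFFSET-LO
    key:   the obfuscation key, or 0 if the file has no obfuscation
    """
    if not 0 <= high5 <= 0x1f:
        raise ValueError("high5 out of range")
    if not 0 <= low8 <= 0xff or not 0 <= key <= 0xff:
        raise ValueError("low8 and key must each fit in one byte")
    return (high5 << 8) | (low8 ^ key)
```

With key=0x98, for instance, a high part of 1 and a raw low byte of 0x9a give an offset of 0x102.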
Notes on the start of compressed data
I think I’ve figured out where in PKLITE files the “start of compressed data” pointer is stored, for all non-beta versions of the format. But it’s complicated, and I won’t try to explain it in this post. I do have some comments relevant to v1.20, though.
For small mode v1.20 files, unlike all other PKLITE formats, the compressed data does not have to start on a 16-byte boundary. A common starting offset is 510 bytes from the beginning of the file.
As a point of trivia, for normal compression, the first byte of compressed data is always an even number (disregarding some theoretical edge cases involving special codes). That’s because the first code in the compressed data stream has to be for a “literal”, as there is no history to copy from. But for v1.20, the first byte can be, and often is, an odd number ending in the hex digit 9. That’s due to the existence of a special code for a literal 0.
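To see where the 9 comes from, here's a small Python check. It assumes that bits are taken from the first byte starting at the least significant bit, which is consistent with the even/odd observation above but isn't spelled out in this post. The helper name is mine.

```python
def pack_read_order(bits: str) -> int:
    """Pack a string of bits, given in the order they are read, into an
    integer whose least significant bit is the first bit read."""
    value = 0
    for position, bit in enumerate(bits):
        if bit == "1":
            value |= 1 << position
    return value

# A stream that opens with special-lit0 starts with a 1 bit (not a plain
# literal), followed by the special-lit0 code from the codebooks.
small_start = pack_read_order("1" + "0010")   # v1.20 small mode
large_start = pack_read_order("1" + "00101")  # v1.20 large mode

# In both modes the low four bits of the first byte come out as 0x9.
print(hex(small_start & 0x0f), hex(large_start & 0x0f))
```

The remaining bits of the first byte depend on whatever follows, so only the low hex digit is fixed.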
With that in mind, I want to correct a wrong guess I made in one of my previous posts, in an annotated hex dump of a self-extracting ZIP file. Here’s the corrected version:
