Notes on PKLITE format, Supplement 1: Descrambling

This post is part of a series about PKLITE format. For a list of all the posts, see the first post.

In a previous post, I noted that some PKLITE-compressed executable files are more difficult to deal with, because most of the decompressor is obfuscated. I named the obfuscation format “scrambled”. In this post, I’ll look at this scrambled format in more detail. I’ll correct some errors I made, and explain how to descramble more varieties of PKLITE-compressed files. I’ve also written a Python script to do the descrambling.

I don’t actually care that much about the scrambled format. We don’t necessarily have to know how to descramble, in order to accomplish our presumed goal of undoing the PKLITE executable compression. But descrambling does make that easier.

I know of three general sources of scrambled PKLITE files:

  1. Made by a registered version of PKLITE v1.14 or newer
  2. Made internally at PKWARE, using an unreleased version of PKLITE
  3. Self-extracting ZIP files generated by the ZIP2EXE utility included with PKZIP v2.04c through v2.50

Here’s an annotated hex dump of a self-extracting ZIP file from PKZIP v2.04g, shareware version:

In fact, every standard-mode self-extracting ZIP file made by a given version of ZIP2EXE is identical, except for the ZIP data part.

The scrambled part is shaded with a “\\\” pattern.

I must admit, it’s easy to get confused by these wheels within wheels. The “Compressed code image, etc.” section is a stripped-down version of PKUNZIP, which decompresses and extracts the “Embedded ZIP data” section. It is itself compressed, and will be decompressed and executed by a decompressor primarily in the scrambled part of the “Decompressor” section. The scrambled part is first de-obfuscated by the small unshaded part. As we’ll see, even the unshaded part is sometimes a little bit obfuscated by some code randomization.

Side note: I don’t know for sure that the “Compressed code image, etc.” segment starts at offset 512. That’s just my best guess. And the “etc.” suffix is because, in this diagram, it presumably includes the compressed relocation table and “footer” section.

The compression method used for the “Compressed code image” segment, and the scrambling algorithm, are both different in this file than they are in files created by public versions of PKLITE. I’m assuming that the common part of these self-extracting EXE files was created by a secret internal version of PKLITE. The compression method remains unknown to me, but I do know how to descramble it.

Specific sources of scrambled files

Let’s try to get a handle on what “scrambled” files exist.

There should be a finite number of EXE files made by private PKLITE versions. Here’s a list of the relevant PKWARE software distributions known to me, which might contain such EXE files:

PKZIP 2.04c - free & registered
PKZIP 2.04e - free & registered
PKZIP 2.04g - free & registered
PKZIP 2.06 (special IBM version)
PKZIP 2.50 - free & registered
PKLITE 1.14 - free & registered
PKLITE 1.15 - free & registered
PKLITE 1.50 - free & registered
PKLITE 2.01 - free & registered
PKZMENU 1.04 - free & registered(?)
PKZFIND and PKZOOM 1.50 - free & registered(?)

To some extent, I’m guessing about what registered versions exist. Only a few can be found on the internet, so we’ll have to do without the rest.

As for files created by ZIP2EXE, here are the relevant versions of PKZIP:

PKZIP 2.04c - free & registered
PKZIP 2.04e - free & registered
PKZIP 2.04g - free & registered
PKZIP 2.50 - free & registered

The special PKZIP v2.06 does not have a unique version of ZIP2EXE, so we don’t have to worry about files created by it.

As for files created by public versions of PKLITE, here are the relevant versions. Files created by free versions do not use scrambling.

PKLITE 1.14 - registered
PKLITE 1.15 - registered
PKLITE 1.50 - registered
PKLITE 2.01 - registered

There’s no limit to how many files may be created by these public versions of PKLITE, so we’ll just have to do our best to collect or create sample files, and account for the different versions and options.

Scrambled formats

There are two scrambled formats. I’ll first explain the descrambling algorithms, then go into detail about how to find the scrambled data, detect the algorithm, and read the initial key.

Here, a “word” is a 16-bit unsigned integer. When applicable, it is in little-endian byte order (least significant byte first). The scrambled data is an array of words.
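As a minimal sketch of that definition, here is one way to convert between raw bytes and an array of words in Python. The helper names are made up for illustration, and ignoring a trailing odd byte is an assumption of the sketch.

```python
import struct

def bytes_to_words(data):
    """Split a byte string into 16-bit unsigned little-endian words.

    A trailing odd byte, if present, is ignored (an assumption of this sketch).
    """
    count = len(data) // 2
    return list(struct.unpack(f"<{count}H", data[:count * 2]))

def words_to_bytes(words):
    """Pack a list of 16-bit words back into little-endian bytes."""
    return struct.pack(f"<{len(words)}H", *words)
```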

To descramble XOR format

For each scrambled word:

For the last word, {descrambled word[N]} = {scrambled word[N]} XOR {initial key}.

Otherwise, {descrambled word[N]} = {scrambled word[N]} XOR {scrambled word[N+1]}.
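The XOR rule can be sketched in a few lines of Python. This is only an illustration of the rule as stated, not code from any real tool; the function name is hypothetical, and the input is assumed to be a list of 16-bit integers already read in little-endian order.

```python
def descramble_xor(words, initial_key):
    """Descramble a list of 16-bit words using the XOR rule.

    Every word is XORed with the *scrambled* word that follows it;
    the last word is XORed with the initial key instead.
    """
    out = []
    for n, word in enumerate(words):
        if n + 1 < len(words):
            partner = words[n + 1]
        else:
            partner = initial_key
        out.append(word ^ partner)
    return out
```

Because each output word depends only on scrambled input words, the order in which the words are processed doesn't matter here, as long as the result goes to a separate list.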

To descramble ADD format

The ADD operation here is modular addition: (x + y) mod 65536.

For each scrambled word:

For the last word, {descrambled word[N]} = {scrambled word[N]} ADD {initial key}.

Otherwise, {descrambled word[N]} = {scrambled word[N]} ADD {scrambled word[N+1]}.
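A matching Python sketch of the ADD rule follows. Again, the function name is hypothetical, and the input is assumed to be a list of 16-bit integers; the mask with 0xFFFF is what implements the "mod 65536" part.

```python
def descramble_add(words, initial_key):
    """Descramble a list of 16-bit words using the ADD rule.

    Every word is added (mod 65536) to the *scrambled* word that
    follows it; the last word is added to the initial key instead.
    """
    out = []
    for n, word in enumerate(words):
        partner = words[n + 1] if n + 1 < len(words) else initial_key
        out.append((word + partner) & 0xFFFF)
    return out
```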

Brute force descrambling

If you don’t need much precision, knowing the two algorithms might be good enough. Just try both of them, and see which one produces the standard byte sequences that should be in the descrambled data. I suggest descrambling the first 1000 bytes of the code image segment (the main DOS segment that, in these files, usually starts at offset 96).

For the ADD algorithm, byte alignment matters. While you could try both alignments, I think you can safely assume a scrambled word always starts on a two-byte boundary.
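Here is a rough Python sketch of the brute-force idea. It relies on the fact that each descrambled word depends only on two adjacent scrambled words, so a window can be descrambled without knowing the initial key (only the window's last word is lost). The function name is hypothetical, the marker byte strings are left as a parameter because which sequences to look for is up to you, and the sketch assumes the window starts on a word boundary.

```python
import struct

def guess_algorithms(data, markers):
    """Return the names of the algorithms ("XOR", "ADD") whose output,
    applied to the whole window, contains any of the marker byte strings.

    data    -- bytes from the code image segment (e.g. its first 1000 bytes)
    markers -- byte strings expected somewhere in the descrambled code
    """
    count = len(data) // 2
    words = struct.unpack(f"<{count}H", data[:count * 2])
    # Pair every word with its successor; the final word has no
    # successor and would need the initial key, so it is dropped.
    pairs = list(zip(words, words[1:]))
    candidates = {
        "XOR": [a ^ b for a, b in pairs],
        "ADD": [(a + b) & 0xFFFF for a, b in pairs],
    }
    hits = []
    for name, out in candidates.items():
        blob = struct.pack(f"<{len(out)}H", *out)
        if any(marker in blob for marker in markers):
            hits.append(name)
    return hits
```

A result with exactly one name is the useful case. An empty list, or both names, means the markers or the window need another look.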

The initial key

The code image segment always starts with one of these two byte patterns:

b8 XX XX ba YY YY

50 b8 XX XX ba YY YY

The XX and YY bytes are different in different files. The initial key is, to the best of my knowledge, always the “YY YY” word.
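Reading the key from those two patterns might look like this in Python. It's a sketch that assumes the buffer passed in begins exactly at the start of the code image segment; the function name is made up.

```python
def read_initial_key(code):
    """Return the "YY YY" word as an integer, or None if neither
    of the two known starting patterns matches.

    code -- bytes beginning at the start of the code image segment
    """
    # Pattern: b8 XX XX ba YY YY
    if len(code) >= 6 and code[0] == 0xB8 and code[3] == 0xBA:
        return code[4] | (code[5] << 8)
    # Pattern: 50 b8 XX XX ba YY YY
    if (len(code) >= 7 and code[0] == 0x50 and code[1] == 0xB8
            and code[4] == 0xBA):
        return code[5] | (code[6] << 8)
    return None
```

For the bytes b8 49 10 ba 9a 3d, this returns 0x3d9a.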

I suspect that, in some files, the initial key is irrelevant. But I haven’t really tested it.

Correcting the XOR algorithm

In a previous post, I got the XOR algorithm slightly wrong. I said it was word[N] = word[N-1] XOR word[N], which would make sense if the descrambling happened in the forward direction. But in fact, it’s done backward, starting at the end.

That explains why the final two scrambled bytes in files made by PKLITE v1.14 seemed to be missing. In fact, they are present, if you do it right. In most other files, the scrambled section ends with two bytes of apparent garbage.

The incorrect algorithm makes every descrambled byte’s position be off by 2. If you switch to the correct algorithm, be sure to also adjust the logic for finding the start of the compressed data.
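To make the off-by-2 effect concrete, here is a small Python comparison of the two rules. Both function names are hypothetical, and the way the forward version treats the very first word (left unchanged) is an assumption made just for this illustration.

```python
def xor_forward_wrong(words):
    """The earlier, incorrect rule: word[N] = word[N-1] XOR word[N].

    The first word has no predecessor; this sketch leaves it unchanged.
    """
    if not words:
        return []
    return [words[0]] + [words[n - 1] ^ words[n] for n in range(1, len(words))]

def xor_backward(words, initial_key):
    """The corrected rule: word[N] = word[N] XOR word[N+1],
    with the initial key standing in after the last word."""
    return [
        w ^ (words[n + 1] if n + 1 < len(words) else initial_key)
        for n, w in enumerate(words)
    ]

scrambled = [0x1111, 0x2222, 0x4444, 0x8888]
wrong = xor_forward_wrong(scrambled)
right = xor_backward(scrambled, 0x3D9A)
# Same values, but the wrong version places each one a word (2 bytes) later.
assert wrong[1:] == right[:-1]
```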

Unfortunately, this means that looking for the special byte pattern I called “tables” might not be as reliable as I thought, as a way to find the start of the compressed data. The issue is that it doesn’t tell us exactly where the scrambled data ends — it might be off by 2. In practice, it’s probably good enough. But it’s possible that there are some special cases that we’d have to detect and handle.

The PKLC-U compression format

I need some names for the compression formats, so I’m just going to call them “PKLC”, with various suffixes.

“PKLC-Free” is the standard format used by the free versions of PKLITE. There are two variants of it, “PKLC-Free-Small” and “PKLC-Free-Large”.

“PKLC-Extra” is the “extra compression” format selected by the “-e” option available in the registered versions of PKLITE. It’s almost the same as PKLC-Free, but features a simple obfuscation layer.

“PKLC-U” is the unknown (to me) format or formats used by some of the PKWARE distribution files, and files generated by ZIP2EXE.

The PKLC-U decompressor does not include the characteristic byte pattern I called “tables”. It’s not that it has something else instead; it’s that these bytes are completely absent. It does have the byte pattern that normally appears right before the “tables”, which looks like this:

6b 33 c0 8b d8 8b c8 8b d0 8b e8 8b f0 8b f8 cb

I have not, at this point, made much effort to decode PKLC-U. The absent “tables” data suggests that it’s significantly different from PKLC-Extra, which is a little discouraging. On the plus side, it’s possible that it’s just a simplified form of PKLC-Extra.

(Now, I’d bet that PKLC-U was figured out around 1992, probably about two hours after PKZIP 2.04 was released and propagated through what passed for the internet at the time. There are people who are really good at this sort of thing. Unfortunately, they usually seem to be less good at publicly documenting their knowledge.)

Files using PKLC-U are almost always obfuscated, and when they are, they seem to always use the ADD method. The one and only non-obfuscated file I’ve found is PKZMENU.EXE from PKZMENU 1.04 (search the web for “PKZM104.EXE”), which is labeled as using PKLITE version “1.10”.

Files using PKLC-U with obfuscation closely correspond to those that are labeled with version number “1.20”.

I want to make it clear that decompressing these files is not a problem. You can do it, for example, with DISLITE or UNP running in a DOS emulator. And both of them are open source. But they are “dynamic” decompressors, which use the file to decompress itself, so their source code does not contain any knowledge of the algorithm.

The gory details

Here’s a disassembly of a small part of the annotated self-extracting ZIP file. If you don’t know assembly language, you can probably still follow along.

CS:0100 B84910   MOV   AX,1049
CS:0103 BA9A3D   MOV   DX,3D9A ; Initial key
CS:0106 05060E   ADD   AX,0E06
CS:0109 3B060200 CMP   AX,[0002]
CS:010D 721B     JB    012A
CS:012A 8BFC     MOV   DI,SP
CS:012C 81EF4A03 SUB   DI,034A
CS:0130 57       PUSH  DI
CS:0131 57       PUSH  DI
CS:0132 52       PUSH  DX
CS:0133 B9AD00   MOV   CX,00AD ; # of scrm words (raw)
CS:0136 BE9C02   MOV   SI,029C ; End of scrm data (raw)
CS:0139 8BFE     MOV   DI,SI
CS:013B FD       STD
CS:013C 49       DEC   CX      ; Start of loop
CS:013D 7407     JZ    0146
CS:013F AD       LODSW
CS:0140 92       XCHG  AX,DX   ; = This is the "ADD"
CS:0141 03C2     ADD   AX,DX   ; alg., and uses DX.
CS:0143 AB       STOSW
CS:0144 EBF6     JMP   013C    ; End of loop

While different files have some differences, they all follow this same basic outline.

There’s always a “MOV DX” instruction near the start of the code. I think, though I’m not sure, that it always contains the initial key.

There is a “JB” (0x72) jump instruction that skips over an error handler. Unfortunately, I’ve found one exception to this: the PKZIP.EXE file from PKZIP 2.04c. That might have to be dealt with as a special case.

Then we look for two consecutive “MOV” instructions that reveal where the scrambled data is. The second seems to always be “MOV SI”, but the first is not always “MOV CX”. I’ve found one PKLC-U file where it’s “MOV BX” (the first byte is then 0xbb); the file is CHK4LITE.EXE from PKLITE 2.01-shareware. And for PKLC-Extra files, it seems to always be “MOV AX” (the first byte is then 0xb8).
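A naive Python scan for that pair of instructions could look like the following. It only uses the opcode bytes mentioned here (0xb8, 0xb9, 0xbb for the first MOV, and 0xbe for “MOV SI”), the function name is hypothetical, and a plain byte scan like this could in principle match the wrong bytes, so treat it as a sketch.

```python
def find_mov_pair(code, start=0):
    """Find "MOV AX/CX/BX, imm16" immediately followed by "MOV SI, imm16".

    Returns (position, raw word count, raw SI value), or None if
    no such pair is found at or after `start`.
    """
    first_opcodes = (0xB8, 0xB9, 0xBB)  # MOV AX / MOV CX / MOV BX
    for i in range(start, len(code) - 5):
        if code[i] in first_opcodes and code[i + 3] == 0xBE:
            count_raw = code[i + 1] | (code[i + 2] << 8)
            si_raw = code[i + 4] | (code[i + 5] << 8)
            return i, count_raw, si_raw
    return None
```

Starting the scan after the “JB” instruction, rather than at the very beginning of the code, is one way to reduce the chance of a false match.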

The first MOV instruction tells us the number of scrambled words, but it’s biased by 1. Subtract 1 from it to get the correct number. (Then multiply by 2 if you want the number of bytes.)

The “MOV SI” instruction tells us the relative offset of the start of the last word of scrambled data. To get the byte position of the end of the scrambled data in the file, add 2, then subtract 0x0100 for technical reasons, then add the offset of the DOS code image segment (usually 96 in these files). Then, you can calculate the start of the scrambled data, by subtracting the size in bytes of the scrambled data.
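The arithmetic from the last two paragraphs, written out as a Python sketch. The function and parameter names are made up, and the default of 96 for the code image offset is just the usual value mentioned above, not something to rely on.

```python
def scrambled_region(count_raw, si_raw, code_image_offset=96):
    """Return (start, end) file offsets of the scrambled data.

    count_raw -- operand of the first MOV (word count, biased by 1)
    si_raw    -- operand of MOV SI (offset of the last scrambled word)
    `end` is the offset just past the last scrambled byte.
    """
    num_words = count_raw - 1          # remove the bias of 1
    size_bytes = num_words * 2
    end = si_raw + 2 - 0x0100 + code_image_offset
    start = end - size_bytes
    return start, end
```

With the values from the disassembly (0x00ad and 0x029c) and a code image at offset 96, this gives 172 words, a start of 166, and an end of 510.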

If your program has made it this far, the ending offset of the scrambled data might tell you reliably where the compressed code image starts. If it’s not a multiple of 16 bytes, round up to the next multiple of 16. I can’t promise that this works for PKLC-U format, though.
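The rounding step is a one-liner in Python. The function name is hypothetical.

```python
def round_up_16(offset):
    """Round a file offset up to the next multiple of 16.

    Offsets that are already multiples of 16 are returned unchanged.
    """
    return (offset + 15) & ~15
```

For example, an ending offset of 510 rounds up to 512, which agrees with my guess above about where the compressed data starts in the self-extracting ZIP file.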

A little later, we find the “XCHG” and “ADD” instructions that do the actual descrambling. The “ADD” (0x03) might instead be “XOR” (0x33), and all we really care about is which one it is. The catch is that, due to code randomization techniques used in some of the later files, the “DX” register could instead be something else. I’ve found the following byte patterns, which I suspect form a complete list:

92, 03, c2 = ADD (DX)
91, 33, c1 = XOR (CX)
92, 33, c2 = XOR (DX)
93, 33, c3 = XOR (BX)
96, 33, c6 = XOR (SI)
97, 33, c7 = XOR (DI)
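A Python sketch that looks for these byte patterns and reports which algorithm they indicate. The table is copied from the list above; the function name is hypothetical, and as with any plain byte scan, a match in unrelated code is possible, so it's best to start searching just after the “MOV SI” instruction.

```python
# Three-byte patterns: XCHG AX,reg followed by ADD/XOR AX,reg.
DESCRAMBLE_PATTERNS = {
    bytes([0x92, 0x03, 0xC2]): "ADD",  # DX
    bytes([0x91, 0x33, 0xC1]): "XOR",  # CX
    bytes([0x92, 0x33, 0xC2]): "XOR",  # DX
    bytes([0x93, 0x33, 0xC3]): "XOR",  # BX
    bytes([0x96, 0x33, 0xC6]): "XOR",  # SI
    bytes([0x97, 0x33, 0xC7]): "XOR",  # DI
}

def find_descramble_op(code, start=0):
    """Return (position, "ADD" or "XOR") for the first known pattern
    found at or after `start`, or None if there is no match."""
    for i in range(start, len(code) - 2):
        name = DESCRAMBLE_PATTERNS.get(bytes(code[i:i + 3]))
        if name is not None:
            return i, name
    return None
```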

A descrambler program

I still haven’t given an explicit algorithm for parsing these files. I’m not feeling bold enough to do that right now, in part because whatever I write could be proven wrong by the next PKLITE-compressed file I find.

But as mentioned, I’ve written a Python program that tries to do this descrambling. You can get it here:

Its quality is “quick and dirty”. I haven’t decided if I’m going to try to make it better, and maintain it.

When it descrambles a file, it also patches it so that, when executed, it will not try to descramble the already-descrambled data. I’ve had some success in running descrambled files, but I have to advise against it. A patched file could do anything (within your DOS environment), so make sure you’re not risking anything you care about.
