Notes on some old self-extracting ZIP archives

The old PKZIP compression software for DOS includes a utility named “ZIP2EXE”, which turns a plain ZIP file into a self-extracting executable file in DOS EXE format. Depending on the version of PKZIP/ZIP2EXE, and the options used, there are several different ways in which this EXE file is constructed.

I wanted to learn about these files, and I thought I might be able to do that by analyzing the oldest version, assuming it would be the simplest, then working my way forward through time. That plan didn’t get very far, but I wasted enough time on it that I figured I may as well write about it.

The first public version of PKZIP was 0.80, which doesn’t support self-extracting archives. The second was 0.90, which does. After that came versions 0.92, 1.01, and 1.02. Those are the four versions I’ll discuss here. The next version was 1.10, whose self-extracting EXE format seems to be quite different.

To be clear, this post is only about PKZIP versions 0.90 through 1.02. Basically none of this applies to any other version.

Version 0.90’s release date was 1989-02-01, and v1.10’s was 1990-03-15. So there was only a span of 13.5 months in which this format was the one to use. For that and other reasons, these files are probably not very common. I acknowledge that.

Format identification goal

One of my goals was to find a good way to identify this class of self-extracting ZIP archives for what they are, without executing them. That’s easy to do (due to their formulaic nature), but strangely difficult to do well. The files just don’t seem to have any nice distinctive markings at predictable positions.

Note that if you run such a file in a DOS environment, it will tell you very directly what it is, including its version number. (It will also, by default, extract the files contained in it.) But that’s rarely a convenient thing to do.

Creating a self-extracting archive

In order to create a self-extracting archive using these versions of PKZIP, you must acquire a copy of the original PKZIP distribution file, which should be named PKZ090.EXE, PKZ092.EXE, PKZ101.EXE, or PKZ102.EXE. Any other form will not do. If it’s been renamed, you must rename it back to its original name.

In a DOS environment, create an empty directory, and copy the distribution file to it. “CD” to that directory, and run the EXE file, to extract the component files (PKZIP.EXE, etc.).

Now run MAKESFX.COM (one of the extracted files), and it will create a new file named PKSFX.PRG.

MAKESFX.COM works by copying the first X bytes from the PKZxxx.EXE distribution file. The point is that the self-extractor component is not stored as a member inside the distribution file; it is the leading part of the distribution file itself.

The included ZIP2EXE.EXE utility should now work. Create a ZIP file in the usual way, with PKZIP.EXE. Then run ZIP2EXE on it. The resulting self-extracting EXE file will start with the bytes in PKSFX.PRG.

There’s a slight concern I have here: I can’t really be sure that I have a perfect copy of the original distribution file. The component files have some integrity protection, but the main file does not.

Here are the sizes and “md5sum” hashes of the PKSFX.PRG files that I came up with. I don’t know for sure that these are correct.


The files from v1.01 and 1.02 are identical. To the best of my knowledge, v1.01 and v1.02 produce byte-for-byte identical self-extracting EXE files. Both report their version number as “1.01”.

So, there seem to be only three different versions to worry about, not four. But there may well be others. I don’t know if the registered versions of PKZIP produced different files, or if there are other special versions that do.

The PKSFX.PRG file is in fact a valid EXE file. If we rename it and run it, here’s what happens:

   1 File(s) copied.


PKSFX (R)  FAST!  Self Extract Utility  Version 1.01  07-21-89
Copyright 1989 PKWARE Inc.  All Rights Reserved.
PKSFX Reg. U.S. Pat. and Tm. Off.

Searching EXE: C:/ZIP/101/PKSFX.EXE
PKSFX: C:/ZIP/101/PKSFX.EXE - error in ZIP use PKZipFix

It thinks it’s a self-extracting archive file, but it can’t find the ZIP data that should be contained in it. Because, well, there isn’t any.

Another thing we can do is run it with the -h option, which prints a help screen. Or with the -l option, which prints licensing information.

Clearly, this self-extractor has a lot of textual data embedded in it. But if you look at the file in a hex editor, or with something like the Unix “strings” utility, you will not see any such data. The only relevant text evident in the file is this:

PKZIP(tm) FAST! Create/Update Utility
PKUNZIP(tm) FAST! Extract Utility
PKSFX(R) FAST! Self-Extract Utility
Copyright 1989 PKWARE Inc. All Rights Reserved

It’s just some boilerplate legalese, the same as that which appears in the MANUAL.DOC documentation file. I don’t think the program ever prints this text.

So, the important text is probably obfuscated, encrypted, or compressed in some way. I wanted to find it.

I tried making random changes to an executable copy of PKSFX.PRG, near the plaintext legalese, and soon found a byte that affected one of the characters in the messages that it printed when I ran it.

Working from there, I identified the bytes corresponding to the “error in ZIP” string. To figure out how it’s encrypted, an easy thing to try is to XOR these bytes with the bytes that they decrypt to.

In PKSFX.PRG:      e3 f7 f6 ec f0 a1 e9 11 5e 27 35 2b
Decrypts to:       e  r  r  o  r     i  n     Z  I  P
Decrypts to (hex): 65 72 72 6f 72 20 69 6e 20 5a 49 50
XOR:               86 85 84 83 82 81 80 7f 7e 7d 7c 7b

The XOR key changes with each byte, but with an obvious pattern, simply decreasing by 1.
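
The table above can be reproduced in a few lines. This is only an illustrative sketch in Python; the byte values are copied from the table, and the variable names are made up for the example.

```python
# Bytes as found in PKSFX.PRG, and the text they are known to decrypt to.
cipher = bytes.fromhex("e3 f7 f6 ec f0 a1 e9 11 5e 27 35 2b")
plain = b"error in ZIP"

# XOR each encrypted byte with its plaintext byte to recover the key stream.
keys = [c ^ p for c, p in zip(cipher, plain)]
print(" ".join(f"{k:02x}" for k in keys))
# prints: 86 85 84 83 82 81 80 7f 7e 7d 7c 7b

# Each key is exactly one less than the key before it.
descending = all(a - b == 1 for a, b in zip(keys, keys[1:]))
print(descending)  # prints: True
```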

As it turns out, I was “lucky” that the first string I found happened to use the simplest encryption method. Had I found one of the more difficult strings first, I might have (correctly) decided that this wasn’t worth the trouble.

This error message turned out to be part of a longer block of strings:

0000000 2a 2e 2a 00 00 20 2d 20 65 72 72 6f 72 20 69 6e  >*.*.. - error in<
0000016 20 5a 49 50 00 49 6e 73 75 66 66 69 63 69 65 6e  > ZIP.Insufficien<
0000032 74 20 4d 65 6d 6f 72 79 00 63 61 6e 27 74 20 63  >t Memory.can't c<
0000048 72 65 61 74 65 3a 20 00 63 61 6e 27 74 20 66 69  >reate: .can't fi<
0000064 6e 64 3a 20 00 57 61 72 6e 69 6e 67 21 20 00 0d  >nd: .Warning! ..<
0000080 0a 00 63 61 6e 27 74 20 6f 70 65 6e 3a 20 00 44  >..can't open: .D<
0000096 69 73 6b 20 46 75 6c 6c 2c 20 66 69 6c 65 3a 20  >isk Full, file: <
0000112 00 20 75 73 65 20 50 4b 20 5a 69 70 52 65 63 6f  >. use PK ZipReco<
0000128 76 65 72 00 20 2d 20 00 2f 00 4e 55 4c 00        >ver. - ./.NUL.<

I found one other block of strings encrypted in the same way. But the bulk of the text is different: Changing one byte in the file often messes up two consecutive characters in the output text.

I eventually figured out how to decrypt the other text. First, decrypt the bytes using the descending-XOR-key method. Then, shift the decrypted bytes by some number of bits.

I found four blocks of text encrypted with this more-complex method. To decrypt, shift left by 6, 5, 4, or 3 bits; the amount is different for each block. The bits shifted out of the top of each byte go into the low bits of the previous byte, as if the whole block were one long bit string.

Initializing the key

The starting value of the key is the number of output bytes in that block of encrypted text (mod 256). Consequently, the last few bytes of encrypted text always use keys …0x03, 0x02, 0x01, 0x00. If there is no bit-shifting, then key value 0x00 will not actually be used, but we can still point to the byte that would have used it.
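
As a quick consistency check, here is a small Python sketch of the key rule. It assumes the dump shown earlier is one complete block of 142 bytes (its last row starts at offset 128 and holds 14 bytes), in which the "e" of "error" sits at index 8. The function name key_for_index is invented for this sketch.

```python
def key_for_index(block_length: int, index: int) -> int:
    """XOR key for the byte at `index` (0-based) of a block whose
    decrypted length is `block_length`. The key starts at
    block_length mod 256 and drops by 1 for each following byte."""
    return (block_length - index) % 256

print(hex(key_for_index(142, 8)))    # prints: 0x86
print(hex(key_for_index(142, 141)))  # prints: 0x1
print(hex(key_for_index(436, 0)))    # prints: 0xb4
```

The first result matches the key 0x86 found for the "e" in the XOR table above.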

It is not always obvious exactly where a block of encrypted text begins, but it is very clear where it ends: It ends at key value 0x00. I suggest that a decryption routine should therefore take three parameters:

  1. The offset of the ending byte whose key is 0x00
  2. The length of the string block (measured in number of output bytes)
  3. The number of bits to shift (which can be 0)
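
A minimal Python sketch of such a routine follows. It is not the code from zipsfxdec1.c; the name decrypt_block and its argument order are assumptions made for illustration. It is exercised on the 37-byte sequence quoted later in this post, treated as the tail of a shift-3 block with 36 output bytes.

```python
def decrypt_block(data: bytes, end_offset: int, length: int, shift: int) -> bytes:
    """Decrypt one string block.

    end_offset: offset in `data` of the byte whose key is 0x00
    length:     number of output bytes
    shift:      number of bits to shift left (0 for the simple blocks)
    """
    start = end_offset - length
    if start < 0 or not 0 <= shift <= 7:
        raise ValueError("bad parameters")
    # When shifting, the byte at end_offset supplies the final low bits.
    last = end_offset if shift else end_offset - 1
    if last >= len(data):
        raise ValueError("block runs past end of data")

    # Step 1: XOR with the descending key; the key at end_offset is 0.
    xored = [data[pos] ^ ((end_offset - pos) & 0xFF)
             for pos in range(start, last + 1)]
    if shift == 0:
        return bytes(xored)

    # Step 2: shift left; the top bits of the next byte fill the gap.
    out = bytearray()
    for i in range(length):
        out.append(((xored[i] << shift) | (xored[i + 1] >> (8 - shift))) & 0xFF)
    return bytes(out)


# Tail of the shift-3 block: 36 output bytes, key-0x00 byte at index 36.
tail = bytes.fromhex(
    "85692b4b48d41a1750b7ffdd12b2dc70"
    "d0191e3f95cb0220c08f0084adc30fe9"
    "c8c6c3a040"
)
print(decrypt_block(tail, 36, 36, 3))
# prints: b'\nPKSFX Reg. U.S. Pat. and Tm. Off.\r\n'
```

Because the key is counted back from the end, any trailing portion of a block can be decrypted on its own by passing a shorter length.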

Here are the parameters of the string blocks I found, one list per version.

(end-offset, length, bits-to-shift)

v0.90:

   (12804, 436, 6)
   (13549, 744, 5)
   (14137, 587, 4)
   (14311, 173, 3)
   (14910, 416, 0)
   (15052, 142, 0)

v0.92:

   (12548, 436, 6)
   (13293, 744, 5)
   (13881, 587, 4)
   (14055, 173, 3)
   (14654, 416, 0)
   (14791, 137, 0)

v1.01 and v1.02:

   (12900, 436, 6)
   (13625, 724, 5)
   (14211, 585, 4)
   (14363, 151, 3)
   (14963, 421, 0)
   (15101, 137, 0)

Extraction utility

I’ve written a little C program, zipsfxdec1.c, to de-obfuscate and extract the strings from these files. It can be downloaded here:

Format identification idea

The corresponding string blocks in different versions of self-extracting files are not all identical. And a slight difference can cause all the encrypted bytes before that point, back to the beginning of the string block, to be completely different. But the encrypted bytes after the last difference are the same.
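
This behavior follows from the key being counted back from the end of the block. The Python sketch below illustrates it with a hypothetical encrypt_block function, written here as the inverse of the decryption steps described above (it is not taken from any PKWARE tool, and it fills the unused padding bits with zeros). The two sample texts are shortened stand-ins, not the real string blocks.

```python
def encrypt_block(plain: bytes, shift: int) -> bytes:
    """Hypothetical inverse of the decryption: returns len(plain) + 1
    bytes, the last of which is the key-0x00 byte. Padding bits are 0."""
    n = len(plain)
    # Move the text right by `shift` bits inside an (n + 1)-byte field.
    shifted = (int.from_bytes(plain, "big") << (8 - shift)).to_bytes(n + 1, "big")
    # XOR with the key, which counts down from n to 0.
    return bytes(b ^ ((n - i) & 0xFF) for i, b in enumerate(shifted))


a = encrypt_block(b"Version 0.90\r\nPKSFX Reg.\r\n", 3)
b = encrypt_block(b"Version 1.01  07-21-89\r\nPKSFX Reg.\r\n", 3)

common = len(b"\r\nPKSFX Reg.\r\n")   # 14 plaintext bytes shared at the end
print(a[-common:] == b[-common:])   # prints: True
print(a[:8] == b[:8])               # prints: False
```

The two texts start with the same eight characters, yet their encrypted forms differ there, because the blocks have different lengths and so start from different keys.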

Consider the string block I name “intro”, which appears shortly before the unencrypted legalese text. Here are the three versions of it:

PKSFX (R)   FAST!   Self Extract Utility   Version 0.90   02-01-89
Copyright 1989 PKWARE Inc.  All Rights Reserved.  PKSFX/h for help
PKSFX Reg. U.S. Pat. and Tm. Off.

PKSFX (R)   FAST!   Self Extract Utility   Version 0.92   03-06-89
Copyright 1989 PKWARE Inc.  All Rights Reserved.  PKSFX/h for help
PKSFX Reg. U.S. Pat. and Tm. Off.

PKSFX (R)  FAST!  Self Extract Utility  Version 1.01  07-21-89
Copyright 1989 PKWARE Inc.  All Rights Reserved.
PKSFX Reg. U.S. Pat. and Tm. Off.

The last line is the same in all versions, and indeed we find a sequence of 37 bytes that occurs in all of them:

85 69 2b 4b 48 d4 1a 17 50 b7 ff dd 12 b2 dc 70
d0 19 1e 3f 95 cb 02 20 c0 8f 00 84 ad c3 0f e9
c8 c6 c3 a0 40

This sequence quite likely appears in any hypothetical unknown versions as well — though I would not necessarily trust the very first and last of these bytes, due to the way the bits are shifted.

And 37 bytes is overkill. I suggest that a high-tech identification strategy might be to search a suitable portion of the file for the following 16-byte sequence:

95 cb 02 20 c0 8f 00 84 ad c3 0f e9 c8 c6 c3 a0

In the known versions, it appears starting at offset 14295 (v0.90), 14039 (v0.92), or 14347 (v1.01).
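
A minimal Python sketch of that search is shown below. The function name and the default search window (offsets 12000 to 16000) are assumptions chosen only to bracket the three known offsets; a real identifier would probably want additional checks.

```python
SIGNATURE = bytes.fromhex("95cb0220c08f0084adc30fe9c8c6c3a0")

def find_pksfx_signature(data: bytes, window_start: int = 12000,
                         window_end: int = 16000) -> int:
    """Return the offset of the 16-byte signature if it lies entirely
    within data[window_start:window_end], or -1 if it is not found."""
    return data.find(SIGNATURE, window_start, window_end)


# Synthetic stand-in for a v0.90 file: zeros with the signature planted
# at the offset reported for that version.
fake = bytes(14295) + SIGNATURE + bytes(100)
print(find_pksfx_signature(fake))          # prints: 14295
print(find_pksfx_signature(bytes(20000)))  # prints: -1
```

To test a real file, read its first 16000 bytes or so in binary mode and pass them to find_pksfx_signature.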
