Will the real PKZ110.EXE please stand up?

I’ve been researching the version history of PKZIP, the once-popular compression software that gave us the still-popular ZIP file format. There are two important MS-DOS versions of it:

  1. v1.10, released in March 1990, which was the latest official version for more than 2.5 years, until v2.04c(?) was released in December 1992.
  2. v2.04g, released February 1993, which was the latest version up until PKZIP was no longer relevant.

It’s easy to find copies of the freely-distributable editions of these versions, but an issue arises: Which one is the real one? I downloaded v1.10 from about 10 different places, and got about 6 different files. Other versions of PKZIP (and other software from that era) have the same issue, but I’ll just focus on v1.10. It’s the one I found the most different versions of.

There are some preliminary things to consider. In what form (what file format, etc.) was it originally distributed? For that matter, to what extent did a single official original “shrink-wrapped” version even exist? It’s not unthinkable that the developer might have uploaded it in different forms to different BBSes, or released multiple builds of it under the same version number.

Fortunately, PKZIP doesn’t seem to have any issues like that. I think there is only one (freely-distributable) PKZIP v1.10, and it comes in the form of a self-extracting ZIP file named PKZ110.EXE.

Collecting files

For this experiment, I downloaded most of the files named PKZ110.EXE that I could find on cd.textfiles.com, a site that collects a lot of old files from CD-ROMs and BBSes. I found 10 different ones, to which I’ll assign names based on the unique part of their URLs.

I’ll start by figuring out how many different files there are, and listing them by size. Where I found several identical copies, I picked one arbitrarily to represent the set.

Name           Size   Hash     Count
ibmgopher      149029 b675ab9b   1
1stcanadian    149219 74b5cad3   6
hof91          149248 53c3898c   1
640swstudio    149248 d7299214   1
gigabytesw     149354 66f83fbd   2
successfulbbs  149504 b9bfcb63   1
originalsw     149504 da6bab2f   2
smsharew2      149536 96788f15   2
knowledgemedia 149632 a36daef9   1
powerpakgold   149776 1f994051   1

The “Hash” is the first 32 bits of the file’s MD5 hash, and the “Count” is the number of identical copies I found. Since the hashes are all different, the files must all be different, even though they don’t all have different sizes.
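For anyone who wants to build a similar table, here is a minimal Python sketch. It is not the procedure I actually used; the helper names short_md5 and summarize are made up for illustration, and it assumes each file fits comfortably in memory.

```python
import hashlib
from collections import Counter


def short_md5(data: bytes) -> str:
    """Return the first 32 bits (8 hex digits) of the MD5 hash of data."""
    return hashlib.md5(data).hexdigest()[:8]


def summarize(blobs):
    """Group byte strings by (size, short hash) and count the duplicates.

    Returns a list of (size, short_hash, count) tuples sorted by size.
    """
    counts = Counter((len(b), short_md5(b)) for b in blobs)
    return sorted((size, h, n) for (size, h), n in counts.items())
```

Feeding it the contents of every downloaded copy (read in binary mode) should give one row per distinct file, with the count showing how many copies were identical.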

Just from this much data, it’s easy to guess which one is the original. There were more instances of the one named 1stcanadian than of any other, and it’s one of the smallest files, so I suspect it’s that one. (Modified files would likely have stuff added to them, and thus be larger.) But I’m curious to know what exactly is different about the files.

A nice feature of self-extracting ZIP files is that most of them can be interpreted as valid ZIP files, and be analyzed as such. The first part of the file contains the code to decompress the ZIP part of the file that appears after it. I am not competent at analyzing the code section of the file, but I probably won’t have to. Most or all of the differences probably lie in the ZIP part.

I did some superficial analysis of 1stcanadian, and it doesn’t look like it contains any after-market files, or a ZIP file comment. It should be a good reference against which to compare the other files, even if it doesn’t turn out to be the original. I’ll give this file the original “PKZ110.EXE” filename.

Things that aren’t different

All the files apparently contain the exact same 15 member files, with the possible exception of knowledgemedia, which seems to be seriously corrupted. I was a little surprised that none of them have any extra files added to them. At least according to the ZIP directory information, they all have the same member files, with the same contents (same CRC-32 hashes), same compression methods, same post-compression sizes, and same timestamps. We can see these things with the Info-ZIP unzip utility.

$ unzip -v PKZ110.EXE
Archive:  PKZ110.EXE
 Length   Method    Size  Cmpr    Date    Time   CRC-32   Name
--------  ------  ------- ---- ---------- ----- --------  ----
    2916  Implode    1555  47% 03-15-1990 01:10 2011f29d  WHATSNEW.110
     800  Implode     523  35% 03-15-1990 01:10 7b0db6cf  README.DOC
  140355  Implode   34329  76% 07-21-1989 01:01 1046156f  MANUAL.DOC
   21473  Implode    6519  70% 03-15-1990 01:10 03975389  ADDENDUM.DOC
     720  Implode     434  40% 03-15-1990 01:10 253e799b  DEDICATE.DOC
    9366  Implode    3228  66% 03-15-1990 01:10 c917b5c2  LICENSE.DOC
    4701  Implode    1464  69% 03-15-1990 01:10 6e20e127  ORDER.DOC
   25811  Implode    8390  68% 03-15-1990 01:10 4f35b70d  APPNOTE.TXT
    1744  Implode     866  50% 03-15-1990 01:10 f49e5256  AUTHVERI.FRM
     595  Implode     442  26% 03-15-1990 01:10 167904ac  OMBUDSMN.ASP
   34296  Implode   25523  26% 03-15-1990 01:10 32268bf7  PKZIP.EXE
   23528  Implode   18125  23% 03-15-1990 01:10 4457e783  PKUNZIP.EXE
   22188  Stored    22188   0% 03-15-1990 01:10 6c674772  ZIP2EXE.EXE
    9224  Implode    6692  28% 03-15-1990 01:10 3397d14f  PKZIPFIX.EXE
    4479  Stored     4479   0% 03-15-1990 01:10 479b6f89  PUTAV.EXE
--------          -------  ---                            -------
  302196           134757  55%                            15 files
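The same comparison can be done programmatically. The following is a rough sketch using Python’s standard zipfile module, under the assumption that it can read the central directory of these archives even though it cannot decompress the old “Implode” method. The function names are hypothetical, and this is not how I produced the listing above.

```python
import zipfile


def directory_summary(fileobj):
    """List (name, CRC-32, method, compressed size, timestamp) per member.

    Only the central directory is read; no member data is decompressed.
    """
    with zipfile.ZipFile(fileobj) as zf:
        return [
            (i.filename, i.CRC, i.compress_type, i.compress_size, i.date_time)
            for i in zf.infolist()
        ]


def same_members(a, b):
    """True if both archives list the same members, ignoring their order."""
    return sorted(directory_summary(a)) == sorted(directory_summary(b))


def same_order(a, b):
    """True if member names appear in the same order in both directories."""
    names_a = [row[0] for row in directory_summary(a)]
    names_b = [row[0] for row in directory_summary(b)]
    return names_a == names_b
```

Both functions accept either a path or an open binary file object. Sorting before comparing keeps the “same members” question separate from the “same order” question.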

Things that might be different

Here are the next potential differences I want to look for:

  • The order of the directory entries.
  • The offset of the first byte in the file that differs from the reference file.
  • The offset of the start of the “ZIP” part of the file, which is likely to be the first occurrence of the ‘P’ ‘K’ 0x03 0x04 byte sequence.
  • The offset of the “end of central directory” signature, which is the last and probably only occurrence of the ‘P’ ‘K’ 0x05 0x06 byte sequence.
  • Any differences in the directory information, including any “extra fields” (ZIP format extensions). In particular, whether the “AV” field exists.
  • Whether there is a ZIP file comment, and if so, what it says.
  • Any extra bytes at the end of the file, if not explained by the presence of a ZIP file comment.
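The two signature offsets in this list are easy to locate with a plain byte search. Here is a minimal Python sketch; the function name is made up, and it assumes that the self-extractor code in front of the ZIP data does not happen to contain either byte sequence by chance.

```python
LOCAL_HEADER_SIG = b"PK\x03\x04"  # start of a local file header
END_OF_DIR_SIG = b"PK\x05\x06"    # end-of-central-directory record


def signature_offsets(data: bytes):
    """Return (first PK 03 04 offset, last PK 05 06 offset).

    Offsets are zero-based; None is returned where a signature is missing.
    """
    first_local = data.find(LOCAL_HEADER_SIG)
    last_end = data.rfind(END_OF_DIR_SIG)
    return (
        first_local if first_local >= 0 else None,
        last_end if last_end >= 0 else None,
    )
```

Searching forward for the first signature and backward for the second matches the “first occurrence” and “last occurrence” wording above.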

The “zipinfo” utility from the Info-ZIP unzip software can give us a lot of this information.

$ zipinfo -v PKZ110.EXE

Archive:  PKZ110.EXE
There is no zipfile comment.

End-of-central-directory record:
-------------------------------
  Zip archive file size:                    149219 (00000000000246E3h)
  Actual end-cent-dir record offset:        149197 (00000000000246CDh)
[...]

Central directory entry #1:
---------------------------
  WHATSNEW.110
  offset of local header from start of archive:   12784
[...]
  length of extra field:                          95 bytes
[...]
  The central-directory extra field contains:
  - A subfield with ID 0x0007 (PKWARE AV) and 91 data bytes.  The first
    20 are:   3d f4 d4 d6 c7 5c d5 3b 31 46 38 76 e1 b3 59 da f2 3b c2 31.
[...]

Central directory entry #2:
---------------------------
  README.DOC
[...]
  length of extra field:                          0 bytes
[...]

I note that the first member file, “WHATSNEW.110”, has an “extra field” of type “PKWARE AV”. It turns out that this is the only member file with an extra field of any type. It doesn’t matter to me what “PKWARE AV” actually is (it’s from an old digital signature system), but its existence might be important. The field is 95 bytes in size, but it occurs twice (as does most ZIP directory information, for redundancy), so it accounts for 190 bytes of the file size.
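The 95 bytes break down as a 4-byte subfield header (a 2-byte ID and a 2-byte length) plus the 91 data bytes that zipinfo reports. A small Python sketch of how such an extra field could be split into subfields follows; parse_extra is a hypothetical helper, and I’m assuming the raw bytes come from somewhere like the extra attribute of a zipfile.ZipInfo object.

```python
import struct


def parse_extra(extra: bytes):
    """Split a ZIP extra field into (header ID, data) subfields.

    Each subfield is a 2-byte ID and a 2-byte data length (both
    little-endian), followed by that many data bytes.
    """
    fields = []
    pos = 0
    while pos + 4 <= len(extra):
        header_id, size = struct.unpack_from("<HH", extra, pos)
        fields.append((header_id, extra[pos + 4 : pos + 4 + size]))
        pos += 4 + size
    return fields
```

An entry with ID 0x0007 in the result would correspond to the “PKWARE AV” subfield shown above.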

The “end-cent-dir” offset of 149197 is where the ‘P’ ‘K’ 0x05 0x06 signature appears. That’s what I’m calling the “PK56” offset.

The first file’s local header offset of 12784 probably marks the start of the “ZIP” part of the file. That’s what I’m calling the “PK34” offset.

To find the location of the first differing byte in the file, I can use the Unix “cmp” utility. For example:

$ cmp PKZ110.EXE PKZ110_ibmgopher.EXE
PKZ110.EXE PKZ110_ibmgopher.EXE differ: byte 12792, line 56

Note that cmp thinks the first byte in the file is numbered 1 (cmp was written by cavemen, before the invention of the number 0), so I’ll subtract 1 from the numbers it prints.
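If you would rather skip the subtraction, the same check is a few lines of Python. This is just a sketch with a made-up function name, and it reads both files fully into memory.

```python
def first_difference(a: bytes, b: bytes):
    """Return the zero-based offset of the first differing byte.

    Returns None if the inputs are identical. If one input is a prefix
    of the other, returns the length of the shorter one, which is the
    case where cmp reports EOF instead of a byte number.
    """
    for offset, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return offset
    if len(a) != len(b):
        return min(len(a), len(b))
    return None
```

Because Python offsets start at 0, a result of 12791 here would correspond to cmp printing “byte 12792”.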

If the “PK56” signature is more than 22 bytes from the end of the file, and there is no ZIP file comment, then I’ll just use a hexdump utility to figure out what’s at the end of the file.
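The number 22 is the size of the fixed part of the end-of-central-directory record, whose last two bytes give the length of the ZIP file comment. Here is a hedged Python sketch of separating the comment from any leftover bytes; describe_tail is a hypothetical name, and it assumes the last occurrence of the signature really is the record (a comment containing those four bytes would fool it).

```python
import struct

EOCD_SIG = b"PK\x05\x06"
EOCD_SIZE = 22  # fixed part of the end-of-central-directory record


def describe_tail(data: bytes):
    """Return (comment, leftover) found after the last EOCD record.

    comment is the ZIP file comment; leftover is whatever bytes follow
    it, such as padding. Returns None if no complete record is found.
    """
    pos = data.rfind(EOCD_SIG)
    if pos < 0 or pos + EOCD_SIZE > len(data):
        return None
    # The 2-byte comment length sits at offset 20 within the record.
    (comment_len,) = struct.unpack_from("<H", data, pos + 20)
    start = pos + EOCD_SIZE
    return data[start : start + comment_len], data[start + comment_len :]
```

An empty comment with non-empty leftover bytes is the situation where a hexdump is still needed to see what the padding consists of.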

I’ll also take note of which files have a ZIP file comment. Here’s an example of such a comment:

╔═════════════════════════════════════════════════════════════════════════════╗
║      IDS Magic!    (716)/633-□□□□       127 Megs On-Line 24 Hours 7 Days    ║
║  USRobotics Courier HST Dual Standard  1200/2400/4800/7200/9600/14400 Baud  ║
╚═════════════════════════════════════════════════════════════════════════════╝

Results

Here are my findings:

Name           SzDiff 1stDiff PK34  PK56   Order   AV End-of-file/ZIP comment
ibmgopher      -190   12791   12784 149007 Differs N  Normal
1stcanadian    0      -       12784 149197 Normal  Y  Normal
hof91          +29    EOF     12784 149197 Normal  Y  29 0x1a bytes @ EOF
640swstudio    +29    12790   12784 149197 Differs Y  29 0x1a bytes @ EOF
gigabytesw     +135   12791   12784 149007 Normal  N  Comment "IDS Magic!…"
successfulbbs  +285   EOF     12784 149197 Normal  Y  285 0x1a bytes @ EOF
originalsw     +285   EOF     12784 149197 Normal  Y  285 0x00 bytes @ EOF
smsharew2      +317   149217  12784 149197 Normal  Y  Comment "Exec-PC…"
knowledgemedia +413   2048    13296 None   Normal? Y  Irregular
powerpakgold   +557   12791   12784 149007 Normal  N  Comment "Lion's Den…"

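The “SzDiff” column is the file size, relative to the reference file’s size of 149219. “1stDiff” is the zero-based offset of the first byte that differs from the reference file, with “EOF” meaning that no difference was found before the end of the reference file. “PK34” and “PK56” are the offsets of those signatures, “Order” says whether the directory entries are in the same order as in the reference file, and “AV” says whether the “PKWARE AV” field is present.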

My final assessment of the modifications I found:

Name           Modifications found
ibmgopher      Files re-ordered. “PKWARE AV” data removed.
1stcanadian    None; this is presumably the original.
hof91          Extra (0x1a) padding at end of file, possibly padded to a multiple of 128 bytes.
640swstudio    Files re-ordered. Extra (0x1a) padding at end of file, possibly padded to a multiple of 128 bytes.
gigabytesw     “PKWARE AV” data removed. ZIP file comment “IDS Magic!…” added.
successfulbbs  Extra (0x1a) padding at end of file, probably padded to a multiple of 512 bytes.
originalsw     Extra (0x00) padding at end of file, probably padded to a multiple of 512 bytes.
smsharew2      ZIP file comment “Exec-PC…” added.
knowledgemedia File is corrupted, as if it were chopped up and put back together incorrectly. (Don’t blame the CD-ROM creators; the corruption could have happened later.)
powerpakgold   “PKWARE AV” data removed. ZIP file comment “Lion’s Den…” added.
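The padding guesses come from simple arithmetic on the reference size, which can be checked with a short Python sketch (padded_size is a name I made up):

```python
def padded_size(size: int, block: int) -> int:
    """Round size up to the next multiple of block."""
    return -(-size // block) * block


REFERENCE_SIZE = 149219  # size of the presumed original

for block in (128, 512):
    total = padded_size(REFERENCE_SIZE, block)
    # Prints the block size, the padded size, and the padding added:
    #   128 149248 29
    #   512 149504 285
    print(block, total, total - REFERENCE_SIZE)
```

The results match the sizes of the padded files: 29 extra bytes for hof91 and 640swstudio, and 285 for successfulbbs and originalsw.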

