Investigating an UnZip decompression bug

One of the old compression methods for ZIP format is named “Shrink”. In the process of writing my own Shrink decompressor, I came across a problem with the Info-ZIP UnZip software.

It’s triggered by a small percentage of Shrink-compressed files. As an example, I’ll use the file that you can download here: BLING.WAV.

Now, in DOSBox, I’ll compress it using PKZIP 1.10, using the -es option to tell it to prefer the Shrink method.

C:\ZIPTEST>110\PKZIP.EXE -es BLING.ZIP ORIG\BLING.WAV

Creating ZIP: BLING.ZIP
  Adding: BLING.WAV  shrinking (37%), done.

Testing the resulting BLING.ZIP file, everything seems fine.

C:\ZIPTEST>110\PKUNZIP.EXE -v BLING.ZIP

Searching ZIP: BLING.ZIP
 Length  Method   Size  Ratio   Date    Time   CRC-32  Attr  Name
  ------  ------   ----- -----   ----    ----   ------  ----  ----
   14644  Shrunk    9323  37%  02-03-92  04:44  1a406a54 --w  BLING.WAV
  ------           -----  ---                                 -------
   14644            9323  37%

C:\ZIPTEST>110\PKUNZIP.EXE -t BLING.ZIP

Searching ZIP: BLING.ZIP
Testing: BLING.WAV     OK

It can be successfully decompressed by (I presume) any version of PKZIP, or by 7-Zip, or by Windows 10’s integrated ZIP features. But not by Info-ZIP UnZip. I’ll demonstrate by testing (-t) the file using the Cygwin version of UnZip:

~ $ unzip -v
UnZip 6.00 of 20 April 2009, by Info-ZIP.  [...]

~ $ unzip -t /cygdrive/c/dosprogs/ZIPTEST/BLING.ZIP
Archive:  /cygdrive/c/dosprogs/ZIPTEST/BLING.ZIP
    testing: BLING.WAV
  error:  invalid compressed data to unshrink
At least one error was detected in /cygdrive/c/dosprogs/ZIPTEST/BLING.ZIP.

(There are unreleased versions of UnZip newer than 6.00, but they have the same problem.)

I investigated the issue, and my conclusion is that it is a longstanding bug in UnZip. I reported the bug to the proper authorities, and indications are that it will be fixed eventually. But the Info-Zip software is only barely still being maintained, so I don’t know how long it might take.

The bug is not really worthy of a long discussion. But I’m going to do it anyway. I do think it’s more interesting than most bugs, due to its age, and because it might be fairly widespread.

Shrink is a somewhat unorthodox implementation of the compression scheme known as LZW. Generally speaking, LZW compression is quite straightforward, and easy to understand. LZW decompression, though, is crazy. Trying to understand it turns my brain into a pretzel. When the ZIP format documentation describes Shrink compression, it doesn’t say much of anything about decompression. You have to logically deduce how to do it.

LZW involves maintaining a “code table”, where a code is an integer from 0 up to some limit. Some codes have special purposes. Others are “static” codes that always code for a single byte. But the bulk of the available codes are “dynamic” codes, which (when in use) code for byte sequences encountered in the uncompressed data.

Many LZW implementations have a special “clear” code that tells the decompressor to delete all the dynamic codes from the table. ZIP’s Shrink compression is unusual in that it has a “partial clear” code that deletes some, but not all, of the dynamic codes (based on criteria that’s not important to this discussion).

When a partial clear code is encountered, the decompressor needs to examine all the dynamic codes that are currently in use, to figure out which ones to delete. The UnZip bug happens because UnZip fails to adequately track which codes are potentially in use. It only examines codes up to the most recently added code, but it is possible for there to be codes in use that are higher than that code.

If you look at files Shrunk by PKZIP, you’ll see that, for each input file, PKZIP always does a partial clear when the most recently added code is one less than some pre-selected power of 2. E.g., a file could contain 10 partial clear codes, each occurring when the most recently added code was 2047.

I lied — that’s not quite true. For example, for BLING.WAV, the most recently added code when a partial clear occurs has the following values: 511, 511, [30 more 511’s omitted], 511, 511, 509, 511, 511, 511, 511, 511, 511, 511, 509, 511.

So there are two cases where it does a partial clear when the most recently added code was 509, which is lower than the highest potentially-used code of 511. That’s what leads to the bug.

Fixing the bug

A quick and dirty fix is to change the partial_clear function in unshrink.c so that it always searches the entire code table.

@@ -310,6 +310,8 @@
 {
     register shrint code;

+    lastcodeused = HSIZE - 1;
+
     /* clear all nodes which have no children (i.e., leaf nodes only) */

Better fixes are possible, but would require slightly more effort. Some options:

  • Search up to the highest code that is possible, given the current code size (in bits).
  • Track and search up to the highest code that has ever been used.
  • Precisely track and search up to the highest code that is currently used. This is ideal in theory, but it requires extra logic, and has very little benefit.

There is another potential issue here that I’m not quite sure I’m right about. UnZip’s current unshrink algorithm has been around since about 1996 (though it was disabled in most releases prior to 2005, due to concerns about patents). Before that, UnZip used a different algorithm, which used the following logic:

  • Search up to (and not including) the lowest code that is currently unused.

I think the last version of UnZip with the old algorithm was 5.1, and that version does not have the bug in question. But I suspect it does have a more subtle form of the bug, which would only be triggered if the file was compressed by software that uses different partial-clearing logic than PKZIP does. Specifically, it would only happen if the compressor sometimes issues a partial clear “prematurely”, when fewer codes are in use than there were at the time of some previous partial clear. As far as I can tell, it is legal to do that, and an optimizing compressor might even have a good reason to do it. I have not followed up on this idea.

Impact

The bug is even rarer than I’ve probably made it sound like. Shrink is the main compression method used by PKZIP 0.x, but it is only used by PKZIP 1.x for small files (320 or fewer bytes), or if for some reason you use the -es option. Shrink compression has been pretty much unused since around 1993, when PKZIP 2.x was released.

The bug can only be triggered if the compressed datastream contains at least two partial clear codes. While 320 bytes is (just) enough for one partial clear to be warranted, it seems impossible, or nearly impossible, for PKZIP to decide to use two. So, one would not expect to find problematic files in the wild unless they were created by PKZIP 0.x.

Such 0.x files are pretty rare, but I found a CDROM that has quite a few files that could potentially trigger the bug. I tested them all, but not a single one of them actually does ☹.

Finally, in another place, I found a file in the wild that does trigger the bug: EIW.EXE in EIW130.ZIP. (I hope the appearance of the word “virus” in the URL doesn’t cause any problems. The file doesn’t appear to have anything to do with viruses.) I stopped looking once I found that.

This is not a security bug. It can cause files to be decompressed incorrectly (with a warning), or it can cause the decompressor to give up because its copy of the code table ends up being full when it’s not supposed to be.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s