Corrupting ZIP files

The ZIP (PKZIP) compressed file format is organized in a backwards fashion. To unzip a ZIP file, you start by looking for a special byte sequence near the end of it. This is a marker for a data structure (the end of central directory) that points to other data structures that appear earlier in the file.
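
To make that first step concrete, here is a minimal Python sketch of the backwards search. The helper name find_eocd is my own, and the search window assumes a strict reader that allows for nothing after the record except its optional comment (at most 65535 bytes). Real unzip programs may search differently.

```python
import io
import zipfile

EOCD_SIG = b"PK\x05\x06"  # end of central directory signature
EOCD_MIN_SIZE = 22        # fixed part of the record, without the comment

def find_eocd(data: bytes) -> int:
    """Return the offset of the last EOCD signature, or -1 if none is found."""
    # The record may be followed by a comment of up to 65535 bytes, so a
    # strict reader only has to search that far back from the end.
    start = max(0, len(data) - EOCD_MIN_SIZE - 0xFFFF)
    return data.rfind(EOCD_SIG, start)

# Build a small archive in memory to try it on.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("file1.txt", "file1\n")
    zf.writestr("file2.txt", "file2\n")
data = buf.getvalue()

eocd_pos = find_eocd(data)
# With no archive comment, the record is the last 22 bytes of the file.
print(eocd_pos, len(data) - eocd_pos)
```

A signature match is only a candidate: the same four bytes could also occur inside the comment, so a careful reader would validate the fields that follow.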

(Note: This post only considers plain vanilla ZIP format. This discussion might not apply if features like disk spanning, directory compression, encryption, ZIP64 format, etc., are used.)

So, one should be able to append some junk to the end of a ZIP file, and have it still work. And, assuming the appended data is not too large, it usually does work. I’ll test that theory, using the Info-ZIP zip and unzip programs:

$ echo "file1" > file1.txt
$ echo "file2" > file2.txt
$ zip test.zip file1.txt file2.txt
  adding: file1.txt (stored 0%)
  adding: file2.txt (stored 0%)
$ echo "junkjunkjunkjunkjunkjunk" > junk.txt
$ cat test.zip junk.txt > testbad1.zip
$ unzip -l testbad1.zip
Archive: testbad1.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
        6  06-02-2018 11:27   file1.txt
        6  06-02-2018 11:27   file2.txt
---------                     -------
       12                     2 files

Works fine. The file is readable by 7-Zip as well.

I found that for Info-ZIP, I can append up to 73865 bytes and still have it work. But at 73866, it fails. There is not necessarily anything special about this number, and the limit may be different in different circumstances, or for different versions of the unzip software.

7-Zip can handle more appended data than this, but I haven’t tried to figure out what its limit is, if it has one.

It would be nice if you could also prepend data to the beginning of a ZIP file, and have it still work. Since the important data is at the end of a ZIP file, that seems doable. It would allow you to, for example, easily convert to and from self-extracting ZIP files, and other kinds of hybrid ZIP files.

The ZIP format could easily have been designed to allow for this. It would work almost automatically if all the pointers were relative to the start of the end of central directory structure, or to some other relative location.

But no. Instead, nearly all the pointers in a ZIP file are relative to the beginning of the ZIP file. I.e., they’re absolute, not relative. So, if you prepend some data to a ZIP file, you’ll make the pointers point to the wrong place. You’ll get a file that still looks like a ZIP file, but is corrupt. I’ve seen a few self-extracting ZIP files that appear to have this problem, so it’s not just of theoretical interest.
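
The effect is easy to see in a few lines of Python. This is only a sketch: make_zip and stored_cd_offset are names I made up, and the archive is built in memory with Python's zipfile module rather than with Info-ZIP. It reads the central directory offset stored in the end of central directory structure, then checks whether the central directory signature is actually at that position.

```python
import io
import struct
import zipfile

EOCD_SIG = b"PK\x05\x06"  # end of central directory record
CD_SIG = b"PK\x01\x02"    # central directory file header

def make_zip() -> bytes:
    """Build a two-member archive in memory, like test.zip above."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("file1.txt", "file1\n")
        zf.writestr("file2.txt", "file2\n")
    return buf.getvalue()

def stored_cd_offset(data: bytes) -> int:
    """Read the 'offset of start of central directory' field from the EOCD."""
    eocd_pos = data.rfind(EOCD_SIG)
    # The field is a 4-byte little-endian integer, 16 bytes into the record.
    return struct.unpack_from("<I", data, eocd_pos + 16)[0]

good = make_zip()
bad = b"junk" * 6 + b"\n" + good  # the same 25 bytes of junk as junk.txt

for label, data in (("good", good), ("bad", bad)):
    offset = stored_cd_offset(data)
    found = data[offset:offset + 4] == CD_SIG
    print(label, offset, found)
```

Both archives store the same offset, because prepending data does not rewrite the field. Only in the unmodified archive does that offset land on the central directory signature.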

Let’s construct such a file:

$ cat junk.txt test.zip > testbad2.zip

Not too surprisingly, 7-Zip cannot open the testbad2.zip file. Info-ZIP complains, but it does handle it, somehow:

$ unzip -l testbad2.zip
Archive: testbad2.zip
warning [testbad2.zip]: 25 extra bytes at beginning or within zipfile
  (attempting to process anyway)
  Length      Date    Time    Name
---------  ---------- -----   ----
        6  06-02-2018 11:27   file1.txt
        6  06-02-2018 11:27   file2.txt
---------                     -------
       12                     2 files

How does it do it? I don’t know exactly. I briefly looked at the source code, and it appears to have several hacks to try to deal with various types of corruption. (Note: ZIP format has enough redundancy that there is nothing remarkable about being able to handle some corrupt ZIP files. Even the original MS-DOS PKZIP packages included a utility named PKZIPFIX.EXE.)

I do know a trick that works fairly robustly in this particular case. The end of central directory structure has two fields that we’re interested in:

  • size of the central directory
  • offset of start of central directory (with respect to the starting disk number)

The size of the central directory field is not important in most ZIP files. We don’t need it to read the archive, and in most cases it simply equals {offset of the end of central directory structure} minus the value of the offset of start of central directory field, because the end of central directory structure usually follows the central directory immediately. So, if it doesn’t equal that value, we can suspect that all the stored offsets are wrong by the difference between those two values.
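
As a rough illustration of that consistency check, the sketch below reads both fields and returns how far apart the two values are. The function name cd_discrepancy is hypothetical, and the code assumes a plain ZIP file whose end of central directory structure directly follows the central directory (no ZIP64 records, no spanning).

```python
import io
import struct
import zipfile

EOCD_SIG = b"PK\x05\x06"  # end of central directory record

def cd_discrepancy(data: bytes) -> int:
    """Return (EOCD offset - stored CD offset) - stored CD size."""
    eocd_pos = data.rfind(EOCD_SIG)
    if eocd_pos < 0:
        raise ValueError("no end of central directory record found")
    # Size and offset are consecutive 4-byte little-endian fields,
    # starting 12 bytes into the record.
    cd_size, cd_offset = struct.unpack_from("<II", data, eocd_pos + 12)
    return (eocd_pos - cd_offset) - cd_size

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("file1.txt", "file1\n")
    zf.writestr("file2.txt", "file2\n")
good = buf.getvalue()

print(cd_discrepancy(good))              # 0 for an intact archive
print(cd_discrepancy(b"x" * 25 + good))  # 25, the number of prepended bytes
```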

If we don’t find the central directory signature where the offset field says it should be, we can also look for it at {offset of the end of central directory structure} minus the size of the central directory field. If we find it there, we keep going, and make a note of the difference between where we found it and where the offset field pointed. If the member file signatures aren’t found where they should be, assume they are offset by the same amount. Sometimes this works.
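
Here is a sketch of that fallback in Python. It is my own illustration of the idea, not what Info-ZIP does, and guess_shift and list_members are made-up names. It assumes a plain ZIP file and does only the signature checks described above, nothing more.

```python
import io
import struct
import zipfile

EOCD_SIG = b"PK\x05\x06"   # end of central directory record
CD_SIG = b"PK\x01\x02"     # central directory file header
LOCAL_SIG = b"PK\x03\x04"  # local file header

def guess_shift(data: bytes, eocd_pos: int) -> int:
    """Return how far the stored offsets appear to be off (0 if they are fine)."""
    cd_size, cd_offset = struct.unpack_from("<II", data, eocd_pos + 12)
    if data[cd_offset:cd_offset + 4] == CD_SIG:
        return 0
    # Fallback: assume the central directory ends right where the EOCD starts.
    candidate = eocd_pos - cd_size
    if candidate >= 0 and data[candidate:candidate + 4] == CD_SIG:
        return candidate - cd_offset
    raise ValueError("central directory not found")

def list_members(data: bytes):
    """Return (shift, names), applying the same shift to every stored offset."""
    eocd_pos = data.rfind(EOCD_SIG)
    if eocd_pos < 0:
        raise ValueError("no end of central directory record found")
    total = struct.unpack_from("<H", data, eocd_pos + 10)[0]
    cd_offset = struct.unpack_from("<I", data, eocd_pos + 16)[0]
    shift = guess_shift(data, eocd_pos)
    pos = cd_offset + shift
    names = []
    for _ in range(total):
        if data[pos:pos + 4] != CD_SIG:
            raise ValueError("bad central directory entry")
        name_len, extra_len, comment_len = struct.unpack_from("<HHH", data, pos + 28)
        local_pos = struct.unpack_from("<I", data, pos + 42)[0] + shift
        if data[local_pos:local_pos + 4] != LOCAL_SIG:
            raise ValueError("local file header not found")
        names.append(data[pos + 46:pos + 46 + name_len].decode("cp437"))
        pos += 46 + name_len + extra_len + comment_len
    return shift, names

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("file1.txt", "file1\n")
    zf.writestr("file2.txt", "file2\n")
bad = b"junk" * 6 + b"\n" + buf.getvalue()  # 25 bytes prepended

print(list_members(bad))
```

For an archive with exactly 25 bytes prepended, the computed shift is 25, and both member names are recovered. The approach breaks down if the size field itself is wrong, or if there is extra data between the central directory and the end of central directory structure.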
