LZ77 compression prehistory

LZ77 is a widely used class of data compression algorithms. I’ll start with a quick overview of it.

Assuming you’re compressing a stream of bytes (a “file”), your LZ77 compressed data, at a high level, would contain two possible kinds of instructions for the decompressor:

  1. Emit literal: {byte value=A}
  2. Copy from history: {match-offset=B, match-length=C}

The match-offset may instead be called the “distance”, among other things. In this post, I’ll use “offset” and “length” to mean the match-offset and match-length. The encoding of literal bytes is not important here, and I’ll say no more about it.

The “history” is the stream of bytes that have already been decompressed. In this post, I’ll say that offset “1” means the most recently decompressed byte, offset “2” is the second-most recent, and so on. In this system, an offset less than 1 would not be allowed, and presumably there would not even be a way to encode it.
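
To make that concrete, here is a minimal sketch (my own illustration in Python, not taken from any particular format) of the “copy from history” instruction, for the simple case where the whole decompressed output is kept in memory:

    def copy_from_history(output: bytearray, offset: int, length: int) -> None:
        # Offset 1 refers to the most recently written byte, offset 2 to the
        # one before that, and so on. Bytes are copied one at a time, oldest
        # to newest, and each copied byte is appended to the output.
        for _ in range(length):
            output.append(output[-offset])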

Each LZ77 implementation defines some limit on how large an “offset” can be, and this dictates how many recently-decompressed bytes the decompressor must remember. The decompressor has to allocate some memory for this. Many different names have been used for this memory, including: history buffer, ring buffer, circular buffer, slide buffer, sliding window, window, dictionary.

(Side note: If you’re decompressing the file to memory, you can do without this buffer, and just copy from the bytes you already wrote.)
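
Continuing the sketch above: a streaming decompressor that can’t keep the whole output in memory would typically use a fixed-size ring buffer, sized to the format’s maximum offset. The buffer size and fill value below are made-up example values, not taken from any real format:

    WINDOW_SIZE = 8192  # example only; each format defines its own limit

    class RingHistory:
        def __init__(self, size: int = WINDOW_SIZE, fill: int = 0x00):
            self.buf = bytearray([fill] * size)
            self.pos = 0  # next write position

        def put(self, b: int) -> None:
            self.buf[self.pos] = b
            self.pos = (self.pos + 1) % len(self.buf)

        def get(self, offset: int) -> int:
            # Offset 1 is the most recently written byte.
            return self.buf[(self.pos - offset) % len(self.buf)]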

LZ77 allows the length to be larger than the offset, but this does not cause undefined “future bytes” to be copied, and it doesn’t wrap around from recent bytes to the oldest bytes in memory. The algorithm requires the bytes to be copied one at a time, from oldest to newest. By the time you need to copy from the “future” bytes, they will be in the past. So that’s not an issue.
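
For example, using the copy_from_history() sketch from earlier, a copy with offset=2 and length=6 turns two bytes of history into a longer repeating pattern:

    buf = bytearray(b"ab")
    copy_from_history(buf, offset=2, length=6)
    print(bytes(buf))  # b'abababab'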

What can be an issue is that the offset can refer to “prehistory” byte positions before the beginning of the file. There’s usually no good way to prevent such offsets from being possible to encode. So, every LZ77 format designer has to decide what should be done in this situation. If the format designer neglects to cover this case, then it falls to the programmer of the decompressor to decide how to handle it.

For example, consider Deflate compression, which involves LZ77. Deflate is the main algorithm used by ZIP, gzip, and anything that uses “zlib” compression. The Deflate specification in RFC 1951 says “[…] a distance cannot refer past the beginning of the output stream.”

So, Deflate declares such offsets to be illegal. The decompressor is required to reject them. Of course, that doesn’t mean that all decompressors actually do.
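
In code terms, the Deflate policy amounts to a single bounds check before each copy. A sketch, built on the earlier function (the check only needs to happen once per copy instruction, since the copy only moves forward from there):

    def copy_from_history_strict(output: bytearray, offset: int, length: int) -> None:
        # Reject any offset that reaches back before the start of the output.
        if offset > len(output):
            raise ValueError("LZ77 offset refers to prehistory")
        copy_from_history(output, offset, length)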

In this post, I’ll investigate how several other formats deal with (or fail to deal with) such offsets.

Predefined prehistory

A slight digression. If you’re designing an LZ77 algorithm, you might think you could improve the compression by pre-seeding the history buffer with some strategically chosen byte values, allowing the compressor to copy from them. You’d be right, and some algorithms do exactly that. But (without having tested it) I suspect the savings will almost always be very small, at least for general-purpose data compression. LZ77 already compresses runs of identical bytes and repetitive byte sequences efficiently. If that’s what you’re seeding your history buffer with, it’s not going to help much.

If you seed the history buffer with a single byte value, such as 0, or an ASCII space, it will only save a few bits at best. But as with any seeding strategy, it has the benefit that the decompressor no longer has to check for invalid offsets. It can just fill the history buffer with that value, before starting the decompression. If that’s deemed to be too slow, it can generate the required bytes on demand, when it encounters a prehistory offset.
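
As a sketch of that last idea, here’s a variant of the earlier function that generates the seed bytes on demand whenever the offset reaches into prehistory. The fill value is a parameter; 0x20 (ASCII space) is just one possibility:

    def copy_from_history_seeded(output: bytearray, offset: int, length: int,
                                 fill: int = 0x20) -> None:
        for _ in range(length):
            if offset > len(output):
                # The referenced position is before the start of the file;
                # behave as if the history buffer had been pre-filled.
                output.append(fill)
            else:
                output.append(output[-offset])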

Experiments

I decided to experiment with three compressed archive formats that use LZ77: ARJ, LHA, and Zoo. These are old formats, but software that can decompress them is still being maintained.

I hand-crafted some byte sequences, reproduced below, that encode an LZ77 instruction to copy the most recent 256 bytes from the history buffer. 256 is the maximum match length in each of these particular formats. I could copy more, using multiple instructions, but this is just a proof of concept.

This idea is not original to me. For instance, the lhasa software includes at least one test file (initial.lzs) that does the same type of thing.

Byte sequences

ARJ type 4 compression is nice, because it’s LZ77 only. No Huffman coding.

ARJ/type 4: 
  ff fc ff

LHA’s “lh5” and “lh6” compression use LZ77 plus a complicated encoding system involving Huffman coding. The byte sequence I came up with has some unused artifacts left over from the compressed file I started with. But it works, and it’s too much trouble to clean it up.

LHA/lh5:    
 00 01 20 04 3f e1 0a c9 42 3f c0

LHA’s “lh6” is very similar to lh5. I just had to insert two 0-valued bits into the definition of the “offsets tree”. This one was not used in the experiments that I’m discussing, but it’s handy because it’s also compatible with ARJ compression types 1 through 3.

LHA/lh6, ARJ/type 1-3:  
 00 01 20 04 3f e1 0a c9 40 8f f0

Zoo’s LZH compression is the same as LHA-lh5, except that to do it right, we should append 16 0-valued bits, as an end-of-data marker.

Zoo/LZH: 
 00 01 20 04 3f e1 0a c9 42 3f c0 00 00

My test files

I’m paranoid, and want to reduce the risk of getting my website blacklisted for hosting malicious files. I’m providing only a password-protected 7-Zip file [here] containing a text file that contains hex dumps of the most useful of my test files. The password is “a d 8 6 b 8 a 6 b 7 e 9 1 5 c f”. So you’ll have to go to some trouble to recreate these files. Sorry.

CRC integrity checks

A complicating and mitigating factor is that most formats of this type have some built-in integrity checking, usually in the form of a “CRC” field in the header of each compressed file. The decompressor is expected to evaluate a certain mathematical function on the decompressed data, and if the result doesn’t match the reported CRC, it means something’s wrong. The decompressor will probably issue an error message, and may or may not refuse to decompress the data.

If I knew in advance what the decompressor was going to do, I could set the CRC field to the correct value. But the whole point of this experiment is that I don’t know what the decompressor is going to do. It might not even be deterministic.

LHA and Zoo formats use only 16-bit CRCs, so in many cases it may be practical to try random CRCs, or all the possible CRCs, until you find one that works. A 32-bit CRC like ARJ’s is not impossible to brute-force, but is impractical in most situations.
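
For reference, here’s a sketch of the 16-bit CRC calculation involved. To the best of my knowledge, LHA and Zoo both use the common “CRC-16/ARC” variant (reflected polynomial 0xA001, initial value 0), but treat the exact parameters here as my assumption rather than gospel:

    def crc16_arc(data: bytes) -> int:
        crc = 0
        for b in data:
            crc ^= b
            for _ in range(8):
                if crc & 1:
                    crc = (crc >> 1) ^ 0xA001  # reflected CRC-16 polynomial
                else:
                    crc >>= 1
        return crc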

A lesser issue is that all three formats I investigated also have a CRC or checksum for the header information. This field has to be patched up whenever we change the CRC field, or any other field. All this patching is not too difficult to do with the help of my Deark utility (note its “-m crc” feature), and a hex editor.

A “seed” file

Meanwhile, I’m going to need something else to decompress. Ideally a file that’s compressible and distinctive, but without a repetitive pattern. English text is fine. I chose a random text file about relativity that was a little over 64KB in size: rel_ftl.faq. That should be big enough to exceed any of the history buffer sizes I’ll be dealing with.

ARJ format

I did most of my ARJ experimenting with ARJ v2.50a (dating from 1995) on DOSBox. No particular reason for that version, but it’s one of the later ones that I’m sure doesn’t have a hard-coded expiration date. DOSBox is a nice safe environment, and ARJ is mostly closed-source, so if I want to know how it behaves, I can’t cheat and look at the source code.

I constructed three test files having compression type 4 (when using ARJ, that’s the -m4 option): one with the text file (SEED.ARJ), one with the tricky byte sequence (GETBUF.ARJ), and one with the SEED file followed by the GETBUF file (BOTH.ARJ).

By default, ARJ refuses to decompress files with a bad CRC. But it has a “recovery” option, “-jr”, to do it anyway. I used that option.

In a new DOSBox session, I decompressed the GETBUF.ARJ file. It decompressed to a file with 256 0-valued bytes.

00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
...

Then I decompressed the SEED.ARJ file.

Then I decompressed the GETBUF.ARJ file again. This time I got a different 256-byte file, which begins:

0000000 65 6d 2c 20 6e 6f 20 6d 61 74 74 65 72 20 68 6f  >em, no matter ho<
0000016 77 20 79 6f 75 20 67 65 74 20 74 68 65 20 0d 0a  >w you get the ..<
0000032 6d 65 73 73 61 67 65 20 74 68 65 72 65 2e 0d 0a  >message there...<
0000048 20 20 20 20 20 20 20 4f 4b 2c 20 77 68 61 74 20  >       OK, what <
0000064 61 62 6f 75 74 20 73 70 65 65 64 73 20 67 72 61  >about speeds gra<
0000080 74 65 72 20 74 68 61 6e 20 63 20 62 75 74 20 4e  >ter than c but N<
0000096 4f 54 20 69 6e 73 74 61 6e 74 61 6e 65 6f 75 73  >OT instantaneous<
...

It produced data that definitely does not exist in any form in the GETBUF.ARJ file. Remember, it was just 3 bytes of compressed data that got “decompressed” into this.

If I start a new DOSBox session, and extract just the BOTH.ARJ file, I get exactly the same 256-byte file as the latter result.

There’s an open-source version of ARJ available with many Linux distributions. I tested the Cygwin version (“ARJ32 v 3.10 [28 Jun 2015]”).

For the first test, I again got all zeroes. But this time, the second GETBUF test was also all zeroes. That’s not surprising. This is a modern computing platform, so it isn’t going to leak memory between processes the way DOS does.

But when I extract BOTH.ARJ, the 256-byte file I get begins like this:

0000000 20 0d 0a 77 61 76 65 20 73 75 63 68 20 61 73 20  > ..wave such as <
0000016 6c 69 67 68 74 20 6d 75 73 74 20 62 65 20 74 68  >light must be th<
0000032 65 20 73 61 6d 65 20 66 6f 72 20 61 6c 6c 20 6f  >e same for all o<
0000048 62 73 65 72 76 65 72 73 2e 20 20 54 68 69 73 20  >bservers.  This <
0000064 69 6e 20 74 75 72 6e 20 0d 0a 6d 61 64 65 20 69  >in turn ..made i<
0000080 74 20 6e 65 63 65 73 73 61 72 79 20 66 6f 72 20  >t necessary for <
0000096 64 69 66 66 65 72 65 6e 74 20 6f 62 73 65 72 76  >different observ<
...

This proves that it doesn’t initialize the history buffer between compressed files that are in the same ARJ archive.

Notice that this leaky result is different from DOS ARJ’s. That’s probably because they use different history buffer sizes. Generally speaking, it’s not an error for an LZ77 decompressor to use a history buffer that’s larger than necessary.

I tried decompressing GETBUF.ARJ with 7-Zip, and it failed completely with a “Data error”. I’m pretty sure it’s because of the invalid offset, not because of a CRC mismatch.


I should probably acknowledge that compressed archive formats do exist in which the member files are, by design, not compressed independently of each other (so-called “solid” compression). But the formats I’m looking at here are not of that type. If they were, then whether extracting the second file of an archive succeeds (and what it decompresses to) wouldn’t depend on whether we also extract the first file, because the decompressor would have to process the first file’s data either way.

LHA format

I had the idea that for LHA format, the history buffer was initialized to all ASCII spaces (0x20). Turns out I was right, at least for some of the LHA software I’ve tested. I don’t think it’s common for actual LHA files to rely on this.

I tested LHA 2.55b for DOS. When decompressing GETBUF.LHA or BOTH.LHA, the 256-byte file extracted to all spaces.

20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20  >                <
20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20  >                <
...

The modern lhasa software works the same way.

7-Zip seems to give up and report an error if an LHA file uses a prehistory offset. I’m not aware of any formal specification for the “lh5” format, so I don’t know if my GETBUF.LHA file is legal.

The software named “LHa for UNIX” has a compatibility option (--extract-broken-archive) that initializes the history buffer to zeroes instead of spaces. Apparently, there were some buggy versions of it that compressed files using such a buffer.

Zoo format

I decompressed BOTH.ZOO with the Cygwin distribution of Zoo. The 256-byte file started like this:

0000000 20 0d 0a 77 61 76 65 20 73 75 63 68 20 61 73 20  > ..wave such as <
0000016 6c 69 67 68 74 20 6d 75 73 74 20 62 65 20 74 68  >light must be th<
0000032 65 20 73 61 6d 65 20 66 6f 72 20 61 6c 6c 20 6f  >e same for all o<
0000048 62 73 65 72 76 65 72 73 2e 20 20 54 68 69 73 20  >bservers.  This <
0000064 69 6e 20 74 75 72 6e 20 0d 0a 6d 61 64 65 20 69  >in turn ..made i<
0000080 74 20 6e 65 63 65 73 73 61 72 79 20 66 6f 72 20  >t necessary for <
0000096 64 69 66 66 65 72 65 6e 74 20 6f 62 73 65 72 76  >different observ<
...

So, it leaks data between two files in the same Zoo archive.

I decided that was enough experimenting for now.

Security implications

If we can trick a trusted computer program (e.g. a web application) into giving us a copy of some of its uninitialized memory, it quite possibly represents an information-disclosure type of security flaw. Remember Heartbleed?

I tried a few web searches for LZ77 security issues of this type, and didn’t find anything. But it can’t be a new idea. Anyone who’s programmed an LZ77 decompressor in the modern security-paranoid computer era would have had to think about how to initialize the history buffer.

One threat model would be something like this: An attacker uploads a compressed file to a web server, or uploads some data using a compressed “transfer encoding”. The web server decompresses the data, disregarding invalid LZ77 offsets and CRC failures, and lets the attacker see it. If the decompression is done in a process that also handles data that the attacker isn’t supposed to have access to, such data could be leaked to the attacker via an uninitialized LZ77 history buffer.

That’s a lot of “if”s, so I don’t really suspect that such security flaws are widespread. But considering that flawed ARJ and Zoo utilities are still being distributed today, maybe nobody’s ever done a wide sweep for this type of issue. Maybe such a sweep should be done.
