This post is about data compression algorithms that involve LZ77, or a similar kind of compression. It’s mainly about old-school compression algorithms and software.
There is some information about LZ77 in my post about LZ77 prehistory. I won’t explain it in detail here, but here are some things to know about it. Both the compressor and decompressor maintain a memory buffer called the “history buffer”, or “sliding window”, or whatever. It conceptually “wraps around”, so that the first byte follows the last byte.
The buffer has a fixed size; call it WS (window size). It’s usually a power of 2. It contains a copy of the most recent WS bytes from the decompressed file. It has a “next position” pointer associated with it, so we know where the next decompressed byte will be written.
The compressed file contains special codes that tell the decompressor to go to some “match position” in the window, and copy some number (the “match length”, ML) of bytes starting at that position. We’ll measure the match position in a relative way, as the number of bytes backward from the “next position”. Call this the “distance back”, or D. The minimum useful value of D is 1, and the maximum is the window size (WS).
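To make the copy operation concrete, here is a minimal sketch (in Python, with a made-up token format — not any real file format) of how a decompressor resolves a (D, ML) match. Note that a match is allowed to overlap the current output position, which is how repeated runs get encoded:

```python
def lz77_decode(tokens, window_size=4096):
    """Decode a toy token stream: a token is either a literal byte value,
    or a (distance_back, match_length) tuple. For simplicity this uses a
    flat output buffer rather than a true ring buffer."""
    out = bytearray()
    for tok in tokens:
        if isinstance(tok, tuple):
            d, ml = tok
            assert 1 <= d <= window_size and d <= len(out)
            for _ in range(ml):
                out.append(out[-d])  # copy byte-by-byte; overlap is allowed
        else:
            out.append(tok)
    return bytes(out)

# A match with D=1 repeats the previous byte, like run-length encoding:
print(lz77_decode([65, (1, 4)]))  # b'AAAAA'
```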
I noticed that some compression programs never use the maximum value of D that seems to be allowed by the format they use, and decided to do some testing. When possible, I also kept track of the apparent maximum value of ML that they use (but I’m not going to discuss how you might do this).
The “dual blob” test
The situation is that we have a compression utility (ZIP, or whatever) that we can use to compress whatever file we want. We suspect it uses LZ77, and we want to figure out the maximum value it uses for D.
(In this section, I’m using a Unix-like environment, though most of the software tested is for DOS or Windows.)
Create a file, let’s call it “BUCKET”, consisting of a bunch of random bytes. The exact size doesn’t matter; a megabyte or two will suffice for now.
$ head -c 1200000 /dev/urandom > BUCKET
Choose a number N, for example 5000 (we’ll do this for lots of different numbers). Create a file named 5000.1 containing the first 5000 bytes from BUCKET:
$ head -c 5000 BUCKET > 5000.1
Now create a file named 5000.2 that is just two concatenated copies of 5000.1:
$ cat 5000.1 5000.1 > 5000.2
Now feed 5000.2 into the data compression program. What do we expect to happen? Should it be able to compress it? By how much?
Ideally, what should happen is this: If N is less than or equal to the compression program’s WS, it should compress it by almost 50%. But if N is even one byte larger than WS, it should hit a brick wall, and be completely unable to compress the file.
So, we should be able to use a binary search to zero in on the maximum value of N that the utility can compress. If we don’t know what algorithm it uses, knowing this maximum N might give us a clue. If the maximum N is not a power of 2, that’s a hint that the compressor might not be behaving optimally. But to be sure, we’d have to know more about the compression format.
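The binary search itself can be sketched as follows (Python; `compresses` is a stand-in for running the real utility on an N-byte dual blob and checking whether the output actually shrank):

```python
def max_compressible_n(compresses, lo=1, hi=1 << 20):
    """Binary search for the largest N such that compresses(N) is True,
    assuming the ideal behavior described above: True up to some
    threshold (the window size), then False forever after."""
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if compresses(mid):
            lo = mid  # mid still compresses; the threshold is at or above mid
        else:
            hi = mid - 1  # mid failed; the threshold is below mid
    return lo

# Simulate an ideal Deflate-like compressor with WS = 32768:
print(max_compressible_n(lambda n: n <= 32768))  # 32768
```

In practice you would implement `compresses` by generating the N.1 and N.2 files as shown above, invoking the compressor, and comparing file sizes.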
As an example, let’s test 7-Zip’s ZIP compressor, using the usual Deflate compression method, and “Maximum” compression. Deflate has a formal specification, so we know that WS, and the maximum possible D value, are exactly 32768 (2^15).
I created files 32768.2 and 32769.2, compressed them with 7-Zip, and got these results:
Name     Method  Orig.Size  Cmpr.Size
-------  ------  ---------  ---------
32768.2  Defl:N      65536      33111
32769.2  Stored      65538      65538
It compressed 32768.2 to 50.5% of its original size. It couldn’t compress 32769.2 at all, so it gave up and stored it uncompressed. Good — that’s exactly what should happen.
With a larger set of files, the results form a graph in which the compression ratio jumps to 100% at the point where the maximum value of D is exceeded.
Disclaimer — This is only a test!
This information was compiled in October of 2021. I’m not planning to keep it updated as the world changes.
These numbers are for entertainment purposes only. They could be wrong or misleading in many ways. Please do not use them for anything important. Conduct your own tests, if you want.
The sub-optimal compression behavior pointed out here usually makes little difference in practice. And in many or most cases, it was done by design, not by accident. LZ77 compression is not simple to do well, and trade-offs have to be made.
Some test results
Software      Method      Max D (used/max)  Max ML (used/max)
------------  ----------  ----------------  -----------------
LHa for Unix  lh5         8191/8192         256/256
LHa for Unix  lh6         32767/32768       256/256
LHa for Unix  lh7         65535/65536       256/256
ARJ 2.50A     -m1 (best)  26622/26624       256/256
ARJ 2.50A     -m4 (fast)  15800/15872       256/256
Discussion of selected tests
LZSS, LArc, MS Compress
In the late 1980s, Japanese programmer Haruhiko Okumura released, in source code form, a compression program we’ll call LZSS.C. It is an implementation of a sub-family of LZ77 compression known as LZSS. It’s relatively obscure, but it’s sort of a grand ancestor of a lot of later compression software.
It has a 4096-byte buffer (WS=4096), and a maximum match length of 18. For convenience, its compressor uses 18 bytes of the buffer as a “look ahead” buffer, and consequently the maximum D it ever uses is 4078 (4096-18).
A compression program called LArc was derived from LZSS.C, and it inherits the same limits. LArc has a (complicated) relationship to LHarc.
From the same era, there is a Microsoft compressed file format called “MS Compress”, or “EXPAND”, that uses a nearly identical compression scheme. But it only seems to use ML values up to 16, when 18 ought to be possible. Correspondingly, the maximum D is 4080 (4096-16).
I haven’t tested whether the EXPAND utility works with ML=18. By the way, if you use Windows, you might still find a copy of EXPAND on your computer, maybe at C:\Windows\System32\expand.exe.
LZARI and “lh1”
LZARI is another Okumura utility. It also has WS=4096, but ML is increased to 60. The compressor is artificially limited to D=4036 (4096-60).
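The arithmetic in these Okumura-family formats follows one pattern: the look-ahead buffer is carved out of the window, so the largest distance actually used is WS minus the (effective) maximum match length. A quick check of the figures given above:

```python
# Window size and effective maximum match length for the formats
# discussed above (values as observed in my tests/reading).
formats = {
    "LZSS.C":      (4096, 18),
    "MS Compress": (4096, 16),
    "LZARI":       (4096, 60),
}
for name, (ws, ml) in formats.items():
    print(name, ws - ml)  # the maximum D each compressor actually emits
```

This prints 4078, 4080, and 4036, matching the observed limits.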
LZARI is fairly obscure, but it evolved (via a program called LZHUF) into the “lh1” compression method used by LHarc, which is much less obscure. LHarc’s implementation of lh1 has the same parameters and limitations.
Ar002 and LHA “lh5”/“lh6”/“lh7”
“Ar002” is yet another Okumura utility. Its window size is 8KB (8192), and it is able to use almost all of the window during compression, but for whatever reason falls 1 byte short. The maximum D value is 8191, when it could (as far as I know) be 8192.
LHA (LHarc v2) and a number of other compression utilities were derived in part from ar002. Those that I’ve tested all share the quirk of not using the largest allowed distance D.
Info-Zip, and Zopfli
The Info-Zip Zip utility’s limit of D=32506 is not completely arbitrary. It’s right there in the source code, as MAX_DIST:
#define MIN_MATCH 3
#define MAX_MATCH 258
#define WSIZE (0x8000)
#define MIN_LOOKAHEAD (MAX_MATCH+MIN_MATCH+1)
#define MAX_DIST (WSIZE-MIN_LOOKAHEAD)
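Evaluating those constants (translated here to Python) confirms the 32506 figure:

```python
MIN_MATCH = 3
MAX_MATCH = 258
WSIZE = 0x8000                             # 32768, the Deflate window size
MIN_LOOKAHEAD = MAX_MATCH + MIN_MATCH + 1  # 262
MAX_DIST = WSIZE - MIN_LOOKAHEAD
print(MAX_DIST)  # 32506
```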
Zopfli is a Google side project that, while it can be very slow, is among the best Deflate compressors when it comes to file size. I was surprised to find that it could not compress my 32768.2 file.
PKWARE DCL Implode
Though little-known, PKWARE DCL Implode was a widely used compression method, especially by software installation utilities. I tested a few different compression utilities that use the format, and all of them behaved the same. The maximum D was one byte less than it could be, and the maximum ML was also less than it could be. But this format has, as far as I know, never been documented, so there’s no way for me to be sure what its limits are intended to be.
ARJ
There is an open source version of ARJ that confirms that its usual compression window size is 26KB(!), even though it basically uses the “lh6” format that should allow up to 32KB. I don’t know if it can decompress distances larger than 26KB — the source code suggests that this could even be a compile-time option.
The -m4 method is quite different, and needs more investigation, but the constant 15800 does appear in the source code.
Issues to look for
Gradually worsening compression
For some compression utilities, what you may find is that, instead of hitting a wall at the maximum D, the compression ratio just gradually worsens until no compression is possible. PKZIP 2.04g behaves this way.
This could happen for a number of reasons. The obvious one is that maybe the format doesn’t use LZ77. If it does, maybe the compression software just isn’t very good at finding matching byte sequences, when they are too far back in the history buffer.
Note that this issue doesn’t seem to exist in PKZIP 4.00, which behaves about the same as 7-Zip.
If the compression ratio does hit a wall, but not in a place that makes any sense, or not always in the same place, one possibility is that LZ77 is not the first layer of compression used. For an example of such an algorithm, see my discussion of ARC’s “Trimmed” compression method, which uses an RLE pre-pass.
The “dual blob” test, while crude, is something that could be useful to help characterize or test a compression program.
I was surprised by how many compression programs do not use the full range of their format’s parameters. For some old software, it was a sensible trade-off, that made the compression algorithm a little faster or simpler. For modern software, make of it what you will. I can understand the reluctance to mess with complex critical software that’s working well enough, but it would be interesting to see it changed to make full use of the format.