What is LZSS compression?

I’m not asking how to implement LZSS. I’m asking how to distinguish things-that-are-LZSS from things-that-are-not-LZSS.

It’s generally understood that LZSS is a kind of data compression. It’s supposedly a derivative of LZ77. But you may struggle to find out anything definitive or verifiable about LZSS.

The name LZSS is more than likely derived from Lempel–Ziv–Storer–Szymanski, though I could not find any definitive proof of this.

Sources of information

A quest to research LZSS will quickly bottom out at two things:

  1. Storer and Szymanski’s paper “Data Compression via Textual Substitution”, published in the Journal of the ACM in 1982, hereinafter referred to as “the S&S paper”.
  2. A computer program named “LZSS”, by Haruhiko Okumura. It was written in 1987 or early 1988, and made public in 1988 in the form of a source code file named LZSS.C.

Other sources of information I used:

  • “A universal algorithm for sequential data compression”, the original LZ77 paper by Ziv and Lempel.
  • COMPRESS.TXT, a text document by Okumura, sometimes included with LZSS.C along with other software.
  • “History of Data Compression in Japan”, a later document by Okumura.

Other possibly noteworthy things:

  • Early versions of what became the S&S paper date back to at least 1977, and are named “The macro model for data compression”.
  • There is a 1988 book by Storer: Data Compression: Methods and Theory.

See the References section below for links to a few of these things.

The S&S paper

The back issues of the Journal of the ACM are available to read for free on the internet. I think this has only been the case for a few weeks.

So, I read the S&S paper. Well, the relevant parts of it. Which isn’t all that much. The bulk of it is about calculating bounds on compression time, and things of that nature, which isn’t going to do much to help me figure out what LZSS means.

The paper is meant to have practical value. It is a computer science paper, not an abstract math paper. But, I must say, it sure does seem to lean in the direction of abstract theory, over practical implementation. Part of that may be because, to some degree, it is a study of some compression concepts that were already in use at the time.

It does not define any specific compressed data formats. It does not use the term “LZSS”.

It covers several variations of a certain kind of compression. Only one such variation is now (more or less) known as LZSS. In the paper’s terminology, that’s the internal macro model with original pointers (OPM), “left” pointers only, topological recursion allowed.

Terminology notes:

  • Adopting the paper’s terminology, I’ll use “pointer” to mean an entity consisting of both a match position and a match length.
  • What the paper calls a “character”, I’ll call a “byte” or a “literal”.
  • What the paper calls a “string”, I’ll call a “byte sequence”, or maybe a “file”.

Okumura’s LZSS

I’m not aware of any use of the term “LZSS” prior to Okumura’s software. I’m not saying that’s it’s origin; I’m saying that I don’t know.

In the COMPRESS.TXT document, Okumura calls LZSS “a very simple data compression program”.

I’ve decided that the best way to approach this is to go over the attributes of Okumura’s LZSS, and explore the origin of each.

The compression algorithm

By the compression algorithm, I mean the specific technique for searching for the longest matching byte sequence, and that sort of thing. I think that’s what Okumura is referring to in COMPRESS.TXT in the “LZSS coding” section. He says it came from LZ77, slightly modified by the S&S paper, with an implementation technique described by T. C. Bell in 1986.

I haven’t really investigated LZSS’s algorithm in detail. I acknowledge the possibility that the name “LZSS” might have been intended to refer more to the compression algorithm, than to the compressed data format that it produces. But I also think that, especially for a relatively simple format like this, it’s hopeless to enforce such a distinction. In the long run, the format is what matters. The algorithm can be improved or replaced at any time, so long as it produces compressed data in the same format.

A “sliding window” history buffer

The use of a sliding window is in the LZ77 paper.

I cannot find any direct mention of it in the S&S paper, which is a little surprising to me. The paper generally assumes that pointers have a fixed size, and says that match positions can be relative to the position of the pointer. A sliding window would be one easy way to make that possible. But it doesn’t seem to actually suggest doing that.

Overlapping copies

Also known as “a match length is allowed to be larger than the ‘distance back'”.

This is a clever trick that allows better compression of a byte sequence that appears more than twice in a row. For example, “ABCABCABCABC” could be encoded as {“A”}{“B”}{“C”}{pointer: distance_back=3, length=9}.

Wikipedia says that this is a fundamental part of LZ77, and I’m assuming that it is. I’ve looked for it in the LZ77 paper, and I suspect that it is in there somewhere, but I’m just not sure. The paper is very difficult to read.

I do not think that this technique is considered in the S&S paper. If it is, it’s well hidden.

Unrestricted literals and pointers

The original LZ77, in effect, requires pointers and literals to alternate. They cannot appear in an arbitrary order.

The S&S paper allows literals and pointers to appear in any order, as does Okumura’s LZSS.

But I do not get the impression that S&S thought this was an innovation. The implication is that they are discussing compression ideas that were, for the most part, already in use.

The S&S paper does not seem to say anything about how a literal is distinguished from a pointer, though it’s implied that there must be some way to do it.

Use of 1-bit flags

Okumura’s LZSS uses 1-bit flags to distinguish literals from pointers. This is one of the two things that, it is often asserted without evidence, sets LZSS apart from LZ77.

As indicated above, original LZ77 does not do this.

Also as indicated above, this does not seem to be in the S&S paper. It would, however, be the simplest (though not necessarily the best) way to implement the compression schemes they cover.

I’d be very surprised if Okumura’s LZSS was the first time this was done, but at the moment I can’t say for sure that it wasn’t.

Packed flag bits

Okumura’s LZSS stores the flag bits 8 at a time, packed into a single byte. In combination with other design decisions, this means the payload part of a literal or pointer is always byte-aligned. This has some performance benefits, and perhaps other benefits.

I don’t know if this was Okumura’s invention. Maybe it was. I wouldn’t call it groundbreaking, but it’s not an entirely obvious thing to do.

Biased match-lengths

This is the second of the two things that, it is often asserted without evidence, sets LZSS apart from LZ77.

In many implementations, a pointer whose match length is less that some minimum value, say 3, would be larger than simply encoding that byte sequence as literals. By disallowing such nonproductive match lengths, you free up some codes that could be used to (for example) increase the maximum match length, thus improving the compression ratio.

In my humble opinion, this is a really obvious thing to do. Optimizations like this are a standard part of tuning most any new compression format.

But, some descriptions of LZSS give the impression that this was some sort of technological breakthrough made by S&S.

Now: LZ77 mandates the use of zero-length matches, so it’s fair to say that LZ77 does not do this. Not because Lempel and Ziv didn’t think of it, but because it’s not really possible in this case.

So, does the S&S paper actually do this? Well, sort of. In the definitions section, it says “Without loss of generality it will always be assumed that m > p.” Here, m is the size of the string, and p is the size of the pointer referring to that string. So, I guess the S&S paper does have this trait. But only as a formality, to simplify their calculations. I don’t get the impression that S&S thought there was anything innovative about it.

Okumura’s LZSS uses a minimum match length of 3. Since match lengths are 4 bits in size, it therefore supports 16 different match lengths, from 3 to 18.

As a curiosity, in Okumura’s LZSS, a minimum match length of 2 would also meet S&S’s criterion. A literal consumes 9 bits (1 flag bit plus 8 data bits), and a pointer 17 bits. Two literals will use 18 bits, so you would sometimes save a bit if you could use a match length of 2. That doesn’t mean its compression would be better if the minimum match length were 2. It would be slightly better for some files, but slightly worse for others.

Where to draw the line

The problem, it seems to me, is that there is no good name for the general class of compression schemes being discussed here. That is, S&S’s “OPM” model, “left” pointers only, topological recursion, sliding window history buffer, LZ77-style overlapping copies allowed. People fumble around for a name for this, and come up with either “LZ77” or “LZSS”. Both names are close, but neither is just right.

An example of a format that pretty definitely uses LZSS is that of Microsoft’s old COMPRESS/EXPAND utility. Its original “SZDD” compression scheme is almost identical to Okumura’s LZSS.

More advanced schemes add some form of Huffman-like coding to the mix, and usually no longer have explicit flag bits. An example is ZIP format’s Deflate scheme. It is sometimes considered to be derived from LZSS, which may be a stretch, but I don’t necessarily have a better idea. If you ask me, I’d prefer to call everything “LZ77” or “generalized LZ77”, but that’s just an opinion.

Concluding remarks

The S&S paper does not explicitly define LZSS compression as the term has come to be used. The class of data compression that it explores is too broad (it includes schemes that not known as LZSS) and too abstract. Some of the innovations often attributed to the S&S paper do not explicitly appear in it, or were not original to it.

I hypothesize that the term “LZSS” originated as the name of Okumura’s software. The name then organically came to be used for a class of data compression formats to which Okumura’s software belongs. But there is no proper authority to tell us what does and does not fall under the umbrella of LZSS.

There is an important family of data compression formats that does not really seem to have a good name, and “LZSS” has sometimes been used, often stretched to the breaking point, to fill that gap.

References

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s