The blocksize field in LHA compression format

This post is about the data compression format I’ll call “lh5”. It is actually a family of formats that includes the compression methods often named lh{4, 5, 6, 7, 8}. It was most notably used by version 2.x of the old LHA/LZH/LHArc compressed archive format.

It was used, often in modified form, in a number of different file formats back in the day. If you know what zlib is, you can think of lh5 as a predecessor of zlib. In the 1990–1995 timeframe, if a file format designer wanted high-quality data compression, lh5 was often the format they reached for.

The grand progenitor of the format was the “ar002” software by Haruhiko Okumura. It comes with source code in the C language, but not with any real documentation about the format. As far as I know, no formal specification has ever been written, at least not in English.

The best English language documentation that I’m aware of is in the comments in the “unzoo.c” program by Martin Schoenert. But even that is incomplete, and the format used by Zoo is not quite the same as that used by LHA.

lh5 format consists of a sequence of “blocks”, each of which has the following structure:

“blocksize” field
Huffman tree definitions
Sequence of “blocksize” Huffman codes (i.e., the number of codes is given by the blocksize field)

The blocksize field is an unsigned 16-bit integer. The tree definitions section is self-terminating, and does not make use of the blocksize field.
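In rough terms, a decoder’s outer loop looks something like the following sketch. (This is my own illustration, not code from any particular decoder; getbits and the other function names are placeholders for the real routines, which vary from program to program.)

/* Stand-ins for the decoder's real routines (names are placeholders): */
extern unsigned int getbits(int nbits);
extern void read_tree_definitions(void);
extern void decode_one_code(void);
extern int end_of_input(void);

void decode_stream(void)
{
    while (!end_of_input()) {
        unsigned int blocksize = getbits(16); /* number of codes in this block */
        read_tree_definitions();   /* self-terminating; doesn't use blocksize */
        while (blocksize > 0) {
            decode_one_code();     /* one literal byte, or one match */
            blocksize--;
        }
    }
}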

Since I was writing a program to decode this format, it occurred to me to wonder what happens if the blocksize is 0. Is that an invalid value? Does it have a special purpose? Does it introduce a degenerate block, with meaningless tree definitions, and no Huffman codes?

ar002

Fortunately, we can just look at the ar002 source code, and see what it does. Unfortunately, there’s a problem.

typedef unsigned int uint; /* 16 bits or more */
/* ... */
static uint blocksize;
/* ... */

    if (blocksize == 0) {
        blocksize = getbits(16);
        read_pt_len(NT, TBIT, 3);
        read_c_len();
        read_pt_len(NP, PBIT, -1);
    }
    blocksize--;

blocksize is a global variable, but I assure you that it’s not used in the read_pt_len or read_c_len functions. And it starts out safely initialized to 0, which I haven’t shown.

What happens is that, every time it reaches 0, a new blocksize is read from the file using the getbits function. Then, the program unconditionally subtracts 1 from it, before it ever looks at it. It is possible for getbits to return 0, so what happens if it does? The program subtracts 1 from 0, giving −1. But blocksize is an unsigned integer, so it can’t have the value −1. In C, unsigned integer underflow is legal, and it wraps around to the largest value the variable can hold. At the time this code was written, an “unsigned int” was often 16 bits, though it could be 32 bits, depending on the compiler. The source code comment indicates that the author was aware that it isn’t guaranteed to be a particular number of bits.
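Here is a minimal standalone demonstration of that decrement (my own example, not ar002 code):

#include <stdio.h>
#include <limits.h>

int main(void)
{
    unsigned int blocksize = 0;

    /* Unsigned wraparound is well-defined in C: the value wraps
       to the largest value the type can hold. */
    blocksize--;

    /* With a 16-bit unsigned int this prints 65535; with a 32-bit
       unsigned int it prints 4294967295. The decoder would then try
       to read that many Huffman codes from the block. */
    printf("blocksize = %u (UINT_MAX = %u)\n", blocksize, UINT_MAX);
    return 0;
}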

What it boils down to is that the ar002 code is nonportable. It doesn’t consistently handle the case where blocksize=0. I decided to investigate what some other LHA software, and other related software, does in this case.

LHa for Unix

There is an open source all-C program called LHa for Unix, so let’s see what it does.

static unsigned short blocksize;
/* ... */

    if (blocksize == 0) {
        blocksize = getbits(16);
        read_pt_len(NT, TBIT, 3);
        read_c_len();
        read_pt_len(np, pbit, -1);
    }
    blocksize--;

It changed the type of blocksize to “unsigned short”, which for all practical purposes is always going to be a 16-bit integer. It doesn’t check for a value of 0 before subtracting 1 from it, so it will wrap around to 65535. In other words, it treats “0” as if it meant 65536.


With that in mind, I constructed two test files in LHA format: one that would be well-formed if blocksize 0 meant 0 (0is0.lha), and one that would be well-formed if it meant 65536 (0is65536.lha). You can download them here: lha5blocksize.zip

LHa for Unix successfully decompresses my 0is65536.lha file, and fails to decompress 0is0.lha.

LHA v2.x

It is slightly ironic that the original LHA software that made the ar002 source code famous didn’t actually use it. Instead, it used an assembly language translation of it. The source code for at least one version of LHA 2.x was made public, but I’d rather not try to decipher assembly language if I can avoid it.

LHA for DOS, through v2.55b (the last official version, as far as I know), successfully decompresses 0is65536.lha. It fails to decompress 0is0.lha, reporting a “CRC error”. So it thinks that “0” means 65536.

But there is a newer beta test version, v2.66, and it fails to decompress either file. I don’t know what the error message is, because it’s in Japanese.

There is also an LHA32 v2.67 test version for Windows console. It also fails to decompress either file, reporting an error that is partly in English: “Bad table (5)”. I’m not sure exactly what’s going on, but I suspect it specifically tests for blocksize=0, and reports a somewhat generic error in that case.

lhasa

Another open source LHA decompressor is lhasa. (I’ve edited the following source code to make it more direct, and more similar to the other code snippets. lhasa actually has a better coding style than this.)

unsigned int block_remaining;
/* ... */

    while (block_remaining == 0) {
        block_remaining = read_bits(16);
        read_temp_table();
        read_code_table();
        read_offset_table();
    }
    --block_remaining;

Changing the “if” to “while” makes a difference: if read_bits returns 0, the loop reads the tree definitions and then immediately goes back and reads another blocksize field, so the decrement never operates on 0. In other words, lhasa thinks that 0 actually means 0. It treats a block with blocksize=0 as one that has Huffman tree definitions, but no codes that use those definitions. The next block, if there is one, immediately follows the tree definitions.

lhasa (as of this writing in 2020-11) successfully decompresses my 0is0.lha file, and fails to decompress 0is65536.lha.

7-Zip

I think this is the relevant code in 7-Zip:

    blockSize = _inBitStream.ReadBits(16);
    if (blockSize == 0)
        return S_FALSE;

It checks for blocksize=0, and considers it to be a fatal error. Decompressing either of my test files with 7-Zip results in a “Data error”.

Ancient

There’s a new open source lh5 decompressor on the block: a C++ program named Ancient. It’s not derived from ar002. Here’s the relevant source code:

    blockRemaining=readBits(16);
    if (!blockRemaining) blockRemaining=0x10000;

Ancient deliberately interprets a value of 0 to mean 65536 (0x10000).

Other file formats

Zoo

Zoo’s LZH compression method defines blocksize=0 to be an end-of-data marker. Every compressed file is supposed to end with such a field, and there are no tree definitions or any other data following it.

Gzip Compress-LZH

The Gzip software is well known, but what might not be as well known is that it can decompress a few formats that aren’t Gzip format. One of them is the “Compress-LZH” format that was used on SCO Unix systems. It is derived from the ar002 source code. Here’s where it handles the blocksize field:

    if (blocksize == 0) {
        blocksize = getbits(16);
        if (blocksize == 0) {
            return NC; /* end of file */
        }
        read_pt_len(NT, TBIT, 3);
        read_c_len();
        read_pt_len(NP, PBIT, -1);
    }
    blocksize--;

So, like Zoo format, it uses blocksize=0 as an end-of-compressed-data marker.

Open-source ARJ

The ARJ archive format usually uses a compression method based on lh6, which is just a slight modification of lh5. I haven’t evaluated the official ARJ software. But there is an “Open-source ARJ” project that supports the format, so let’s take a quick look at its source code:

#define CODE_BIT 16
/* ... */
short blocksize;
/* ... */

    if(blocksize==0)
    {
        blocksize=getbits(CODE_BIT);
        read_pt_len(NT, TBIT, 3);
        read_c_len();
        read_pt_len(NP, PBIT, -1);
    }
    blocksize--;

This is a little concerning. The getbits function returns an “int” whose value is between 0 and 65535, but the result is stored in a “short” (blocksize), which can only hold values from −32768 to 32767. On a typical two’s-complement implementation, this conversion maps numbers in {+32768 to +65535} to {−32768 to −1}, though strictly speaking the result is implementation-defined.

If I’m interpreting this correctly, there’s no issue specific to 0 that doesn’t also exist for all of the negative values. It seems the code is hoping that when you subtract 1 from −32768, it will wrap around to +32767. But in modern C programming, there is no such guarantee. In principle, the program could crash, or do literally anything. (Signed integer underflow is very different from unsigned integer underflow.) In practice, I think it will probably work as intended. Since blocksize is read from a file at runtime, it will be hard for the C optimizer to do anything clever with it.
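Here’s a small demonstration of the two hazards (my own example; it assumes a typical two’s-complement platform with a 32-bit int):

#include <stdio.h>

int main(void)
{
    /* Hazard 1: converting an out-of-range value to a signed type is
       implementation-defined. On typical machines, 65535 becomes -1. */
    short blocksize = (short)65535;
    printf("(short)65535 = %d\n", blocksize);      /* commonly -1 */

    /* Hazard 2: the decrement. blocksize is promoted to int first, so
       with a 32-bit int the subtraction itself can't overflow; but
       converting the result back to short is again implementation-
       defined. (With a 16-bit int, -32768 - 1 would be genuine signed
       overflow: undefined behavior.) */
    blocksize = (short)-32768;
    blocksize--;
    printf("after decrement: %d\n", blocksize);    /* commonly 32767 */
    return 0;
}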

So, without having tested it, Open-source ARJ appears to treat blocksize=0 as if it were 65536.

Summary

Since some versions of LHA do not allow blocksize=0, it seems that the relevant authorities decided that 0 is not a valid value. Treating it as a fatal error might be the most correct thing to do. But if you care more about robustness than strictness, treating it as meaning 65536 is a sensible thing to do.
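In code, a robust decoder might handle the field like this (a sketch; getbits stands in for whatever bit reader the decoder uses):

#include <stdint.h>

extern unsigned int getbits(int nbits);  /* the decoder's bit reader (assumed) */

/* Read the 16-bit blocksize field, using a type wide enough to
   represent 65536. */
static uint32_t read_blocksize(void)
{
    uint32_t blocksize = getbits(16);
    if (blocksize == 0) {
        /* Strict alternative: report a fatal error here, as 7-Zip does. */
        blocksize = 0x10000;  /* robust option: treat 0 as 65536 */
    }
    return blocksize;
}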

That advice applies only to LHA format. Other file formats that use lh5 compression often have their own rules.
