ARJ compression method 4

ARJ is a compressed archiver utility and format, with features similar to ZIP. It seems at first to have five compression methods, with ID numbers 0 through 4.

Method 0 is for no compression, so it’s easily dealt with.

Methods 1 through 3 are really all the same. They record how hard the compressor tried to reduce the file size, but as far as I know, for decompression purposes, it makes no difference. They are essentially the same as LHA format’s “-lh6-” compression method. While this compression scheme is not documented as well as it ought to be, it’s at least sort of a standard-ish format.

That leaves Method 4, the topic of this post. ARJ calls it the “compressed fastest” method. As far as I know, it is unique to ARJ. I’ve never seen it documented, other than as source code.

Method 4 is an implementation of the general compression technique usually known as either “LZSS” or “LZ77”. This post assumes you are well acquainted with this type of compression. There’s an overview of it in my post on LZ77 compression prehistory.

Methods 1-3 involve Huffman coding, with the codebooks stored as part of the compressed data. Method 4 does not do this. You could interpret Method 4 as using a kind of Huffman coding, but if so, the codebooks are pre-defined, and very simple.

For the purposes of this format description, I’m assuming you are writing a program to decompress such a file. If for some reason you want to create a new one, you should be more careful, and test your edge cases on actual ARJ software.

Preliminary information

The maximum LZ77 “offset”, or “distance back”, value that I’ve seen ARJ use for Method 4 is 15800 bytes. The maximum that the format seemingly supports is 15872 bytes. There is nothing special about the format that requires using a particular history buffer size; it just has to be big enough. Powers of 2 can be convenient for this, so you could use 16384 bytes (16 KiB).

Method 4 is bit-oriented. Interpret the compressed data as a stream of bits, not bytes.

The bit order is most-significant bit first. That means that, for example, if you have some pending bits and need another bit, you read the high bit from the next byte, and insert it as the low bit of the pending bits.

There is no special “stop” marker at the end of the compressed data, so you have to either know the expected size in bytes of the decompressed data, or the size in bytes of the compressed data. The latter is sufficient, because each instruction uses at least 9 bits, so there won’t be any ambiguity about how many bits in the last byte are used. I did a quick test with a random version of ARJ, and it seemed to require that both sizes be correct.

Old format

The very earliest beta versions of ARJ, 0.13 through 0.14, use a slightly different version of Method 4. Such files can be identified by the “archiver version” field having value 1.

I figured out the old format just enough that I seem to be able to decompress the files that ARJ creates. There are some bit patterns that I haven’t seen ARJ use, and I don’t know what they do.

Realistically, you won’t encounter such files, unless you create one. And that’s not so easy to do. These old versions of ARJ expire, and will refuse to create new files if they think they are being run after early 1991 or so. If you really want to, you could run them in DOSBox-X, and change the emulated system time, and you may also have to fake the timestamps of the files you’re compressing.

Decompression algorithm

The compressed bitstream is a sequence of variable-length LZ77 instructions. Read an instruction, do what it says, and repeat until some stopping condition has been met.

To read an instruction, first read a single “opcode” bit.

If opcode=0, the instruction is a literal. Read the next 8 bits, and treat them as a literal byte in the usual LZ77 way (emit it to the output stream, and store it in the history buffer).

If opcode=1, the instruction is a match. It will be followed by a length code, then an offset code. Read both (see below), then use them to perform a standard LZ77 “copy from history” operation. It is legal for the length to be larger than the offset. It is not legal for the offset to be larger than the number of bytes decompressed so far.

Length codes

Read up to 6 bits (“code”), stopping early if you get a 0 bit. Interpret as follows, to get a value we’re naming “length_0”.

code       "length_0" =
--------   ------------------------
0          2   + {read 1 more bit}
10         4   + {read 2 more bits}
110        8   + {read 3 more bits}
1110       16  + {read 4 more bits}
11110      32  + {read 5 more bits}
111110     64  + {read 6 more bits}
111111     128 + {read 7 more bits}

For example, if the code is 1110, read 4 more bits, interpret them as an integer from 0 to 15, and add that to 16, to get an integer in the range 16 to 31.

Now you have length_0, in the range 2 through 255. Add 1 to get the true match length, in the range 3 to 256.

For old format, you need to read up to (at least?) 7 bits:

code       "length_0" =
--------   ------------------------
...        ...
111110     64  + {read 6 more bits}
1111110    128 + {read 7 more bits}
1111111    ???

Offset codes

Read up to 4 bits (“code”), stopping early if you get a 0 bit. Interpret as follows, to get a value we’re naming “offset_0”.

code       "offset_0" =
-----      ------------------------
0          512 +  {read 9 more bits}
10         1024 + {read 10 more bits}
110        2048 + {read 11 more bits}
1110       4096 + {read 12 more bits}
1111       8192 + {read 13 more bits}

Now you have offset_0 in the range of 512 to 16383. Assuming your measuring system considers the most recently decompressed byte to be at offset 1, subtract 511 to get the true offset in the range 1 to 15872.

For old format, you need to read up to (at least?) 5 bits:

code       "offset_0" =
-----      ------------------------
...        ...
1110       4096 + {read 12 more bits} = 4096-8191
11110      8192 + {read 13 more bits} = 8192-16383
11111      ???

Concluding remarks

I think that’s all you need to know to decompress ARJ method 4.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s