ARC is a file compression and archiving utility that was in use from the mid-1980s to the early 1990s, mainly on DOS computers. It was developed by a small company named System Enhancement Associates.
One interesting thing about ARCv7 is that it introduced a new compression scheme named “Trimmed”. Trimmed is ARCv7’s preferred scheme, and it uses it by default for most files.
Despite ARC being a historically significant format, Trimmed compression seems to be pretty much forgotten. Or maybe it was never noticed in the first place? Most descriptions of ARC format don’t mention it.
ARC+Plus was, I think, mainly intended for commercial use, and you’re not likely to directly encounter many files created by it. But just to prove they do exist, here’s one I found: GUS_140.ARC.
I don’t know if any documentation of Trimmed compression was ever made public. If so, I couldn’t find it. I think I’ve figured it out, though, and I document it here.
Standard warning: I only did “black-box” reverse engineering, which means there is no way to be sure that I’ve figured everything out.
I’m mainly trying to cover how to decode (decompress) the format, not how to encode it. If for some reason you want to write your own encoder, I advise against using this post as your only source of information about the format.
Trimmed is a combination of two separate algorithms: a run-length encoding (RLE) algorithm, and an algorithm that uses LZ77 with adaptive Huffman coding (I’ll call it “LZAH”).
It is possible, though not necessarily advisable, to do the decompression in two separate passes: first LZAH decompression, then RLE decompression. You will not know how many bytes the LZAH part should decompress to by itself, but the format is self-terminating, so you’ll know when it’s finished.
Each compression scheme in ARC has an ID number, and the ID for Trimmed is 10.
Unfortunately, ID 10 is also used by another program in the ARC universe: PAK, developed by NoGate Consulting. The PAK compression scheme 10 is named “Crushed”. I don’t know exactly how Crushed compression works, but it’s related to LZW, and it’s definitely quite different from Trimmed.
ARC files created by ARCv7 usually start with a special “archive information” item, and files created by PAK usually have a “.PAK” filename extension. So if a file uses compression scheme #10, it’s usually possible to make a good guess as to which one it is.
The RLE layer
The RLE scheme used in Trimmed compression is a common one sometimes called “RLE90”. It is also used by several of ARC’s other compression schemes.
To decompress it, follow the instructions in the first line below whose pattern matches the next byte(s) to be processed.
- 0x90 0x00 : Emit one byte, value 0x90
- 0x90 N : Emit the previous output byte N−1 more times
- N : Emit one byte, value N
That should suffice to decode the data produced by ARC. I didn’t test how its decoder handles bad RLE90 data, such as a file that starts with a “repeat” instruction.
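Putting those rules into code, here’s a minimal RLE90 decoder sketch. The function name and the handling of malformed input (a truncated escape, or a repeat instruction with no previous byte) are my own choices, not ARC’s; as noted, I didn’t test how ARC’s decoder reacts to bad data.

```python
def rle90_decode(data: bytes) -> bytes:
    """Decode RLE90-compressed data (0x90 is the escape byte)."""
    out = bytearray()
    i = 0
    while i < len(data):
        b = data[i]
        if b == 0x90:
            if i + 1 >= len(data):
                raise ValueError("truncated RLE90 escape")
            n = data[i + 1]
            if n == 0x00:
                out.append(0x90)  # 0x90 0x00 means a literal 0x90 byte
            else:
                if not out:
                    raise ValueError("repeat with no previous byte")
                out += bytes([out[-1]]) * (n - 1)  # repeat previous byte N-1 more times
            i += 2
        else:
            out.append(b)  # ordinary literal byte
            i += 1
    return bytes(out)
```

For example, the sequence `41 90 04` decodes to four `A` bytes: one literal, then three repeats.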
LZ77 intrinsically supports a kind of RLE, so it’s reasonable to ask why one would design an algorithm with a seemingly superfluous extra RLE layer. Well, it could be that Trimmed is just badly designed. But there is a potential justification. With a separate RLE layer, long runs of bytes don’t take up much space in the LZ77 history buffer, leaving more room for other data. That will improve the compression ratio in some cases, especially if the history buffer is relatively small.
The LZAH layer
The LZAH format is very similar to LHarc’s “lh1” compression scheme. The canonical source code to compress and decompress lh1 is a widely-distributed file named LZHUF.C, written primarily by Haruyasu Yoshizaki. There are lots of different versions of it floating around; one can be found here: OKUMURA2.ZIP. (Caution: Most copies of LZHUF.C do not work correctly when compiled by a modern C compiler, because of integer size assumptions and other issues.)
First, let’s review lh1.
Standard lh1 compression
lh1 uses a 4K (4096-byte) LZ77 history buffer.
Each unit of compressed data starts with (and may consist entirely of) something I’ll call a “character code”. Character codes are encoded using a particular adaptive Huffman coding scheme, which I won’t go into (and have not studied in detail). There are 314 character codes, numbered from 0 to 313. Codes 0 through 255 each represent a literal output byte, while codes 256 through 313 represent “match lengths”, from 3 to 60.
If a code is a match length, then it is followed by a “distance” (also called a “position”), encoded with a particular variable-length integer code. In the measuring system that I’m using in this article, a distance is a number from 1 to 4096. It identifies a relative position in the LZ77 history buffer.
Trimmed/LZAH differences from lh1
Trimmed/LZAH uses a 4K history buffer, the same as lh1.
It has 314 character codes, the same as lh1, but code 256 is a special “stop” code that marks the end of the compressed data. Codes 257 through 313 are for match lengths 3 through 59. So, to calculate the match length, you would subtract 254 instead of 253.
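Based on that description, a Trimmed/LZAH decoder’s dispatch on character codes might look like the sketch below. The function and tuple names are mine; only the numeric ranges come from the format itself.

```python
def interpret_code(code: int):
    """Classify a Trimmed/LZAH character code (0..313)."""
    if 0 <= code <= 255:
        return ("literal", code)      # a literal output byte
    if code == 256:
        return ("stop", None)         # end of compressed data
    if 257 <= code <= 313:
        return ("match", code - 254)  # match lengths 3..59
    raise ValueError("invalid character code")
```

(In lh1, by contrast, codes 256 through 313 would all be match lengths, computed as `code - 253`.)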
Trimmed/LZAH edge cases
I have never seen ARC v7.12’s compressor use match length 59, which corresponds to character code 313. Code 313 still has to exist in the Huffman tree, or else the adaptive Huffman algorithm will go haywire. But ARC doesn’t seem to use it — the longest match it encodes is 58 bytes.
Similarly, the format seems to support 4096 different “distances”. But I have never seen ARC v7.12’s compressor use distance 4096. The largest distance it encodes is 4095.
I artificially created some files that use these values (59 and 4096), and confirmed that ARC and XARC’s decompressors do support them.
Perhaps they were reserved for potential future expansion.
There’s nothing about the format that prevents a distance code from referring to a position before the beginning of the file. I discussed this general issue in a previous post: LZ77 prehistory.
It’s a safe bet that ARCv7 never writes such codes, so it probably doesn’t matter how you handle them, as long as you do so safely.
I tested the DOS version of ARC v7.12, and found that it does not initialize the history buffer. It doesn’t reject “prehistory” distance codes, but it does unpredictable things with them, so we must consider them to be invalid.
Comparison to CRLZH
As it happens, I know of another LZHUF-based format that is modified to use 256 as a special “stop” code. It’s a compressed file format named CRLZH, that was used on CP/M computers, and possibly DOS computers. There are two variants of it, which I’ll call version 1 and version 2.
It’s not the same as Trimmed. Here are the differences:
- CRLZH doesn’t have an RLE layer.
- The history buffer size is 2048 instead of 4096.
- Instead of repurposing one of the 58 “length” codes to be a “stop” code as Trimmed does, CRLZH adds a new code. There are still 58 length codes, now numbered 257 through 314. The LZHUF source code seems to be able to handle this change, though I don’t blame the Trimmed designers for not wanting to risk it.
- In CRLZH v2, the way that distances are encoded is changed, so that it can only encode the 2048 different distances that it needs. That makes it a little more efficient.