Notes on LHARK compression format

LHARK (with a K) is an old compression/archiver utility for DOS. It is related to the popular utility named LHA (formerly LHarc), but should not be confused with it. You should be able to find a copy of LHARK by searching the web for “LHARK04D”.

LHARK was developed by Kerwin F. Medina around 1995/1996. It uses the LHA/LHarc container format, but the point of it is its new improved(?) compression scheme that uses the identifier “lh7”.

If there is any reason to be interested in such an old and obscure format as LHARK, it’s that there are still a number of pieces of software being maintained that support LHA format. Some of them support some pretty obscure compression schemes, but as far as I know, none of them supports LHARK’s. Their maintainers might be willing to add LHARK support if they knew how, but as far as I know it’s never been documented — until now.

I think I’ve figured out the format, and in this article I’ll try to document it enough that it can be decompressed. I’ll assume you already know how to decode the standard “lh5” scheme, or a similar scheme like “lh6” (I’ll call the family “lh5x”).

I figured it out using “black box” reverse engineering. That is, I fed the program a bunch of specially-crafted files, to see what it compressed them into. I did not disassemble the software, or anything like that. With this type of reverse engineering, there’s no way to be sure that I’ve got it all right, so everything in this post should be considered uncertain.

Preliminary things

ID conflict

Beyond the undocumented nature of LHARK format, there is a major annoyance with it: There is another, more mainstream, LHA compression scheme that uses the same “lh7” identifier.

I can say with some confidence that there is no easy-and-reliable way to tell which of the two “lh7” schemes a given compressed file uses. So, supporting LHARK isn’t just a matter of adding another LHA compression scheme — you also need some way to decide which “lh7” scheme your software will use. If you want it to happen automatically, you might just have to try both schemes, to see which one works.
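A minimal sketch of the "try both" approach: decompress with each candidate decoder and accept the first one whose output matches the checksum stored in the archive header. It assumes the header carries a CRC-16 of the uncompressed data (reflected polynomial 0xA001, initial value 0), as is usual for LHA archives; that is an assumption here, not something established above for LHARK output. The callback type lh7_decoder and the names lh7_variant and detect_lh7_variant are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* CRC-16 with reflected polynomial 0xA001 and initial value 0
   (assumed to be the checksum stored in the LHA header). */
uint16_t crc16_arc(const uint8_t *data, size_t len)
{
    uint16_t crc = 0;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            crc = (crc & 1) ? (uint16_t)((crc >> 1) ^ 0xA001)
                            : (uint16_t)(crc >> 1);
        }
    }
    return crc;
}

/* Hypothetical decoder signature: returns 0 on success, nonzero on a
   malformed stream. On success it has written out_len bytes to out. */
typedef int (*lh7_decoder)(const uint8_t *in, size_t in_len,
                           uint8_t *out, size_t out_len);

enum lh7_variant { LH7_UNKNOWN = 0, LH7_STANDARD, LH7_LHARK };

/* Try each decoder in turn; accept the first whose output matches the CRC. */
enum lh7_variant detect_lh7_variant(
    lh7_decoder decode_standard, lh7_decoder decode_lhark,
    const uint8_t *in, size_t in_len,
    uint8_t *out, size_t out_len, uint16_t expected_crc)
{
    if (decode_standard(in, in_len, out, out_len) == 0 &&
        crc16_arc(out, out_len) == expected_crc) {
        return LH7_STANDARD;
    }
    if (decode_lhark(in, in_len, out, out_len) == 0 &&
        crc16_arc(out, out_len) == expected_crc) {
        return LH7_LHARK;
    }
    return LH7_UNKNOWN;
}
```

A 16-bit checksum can match by accident, so for very small files this test is suggestive rather than conclusive.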

LHARK versions

The only versions of LHARK I could find are v0.4a and v0.4d. They seem to use the exact same compressed file format, though I can’t be sure of that. I did most of my work with v0.4d.

LHARK modes

The LHARK software has two different “modes”, which use different command-line parameter styles. Instead of a separate configuration file, the mode is configured by a few bytes stored at the end of the LHARK.EXE file (!). To tell which mode it’s using, run “LHARK.EXE” with no parameters, and it will say either “LHARK-A” or “LHARK-B”. This post assumes the default mode, LHARK-A.

LHARK compression strategies

LHARK has five different compression settings for its lh7 format, selected by option -tt, -ta1, -ta2, -toa, or -tob. There is some information about them in the LHARK.TXT documentation file. As far as I can tell, the choice of setting does not really affect the compressed data format or the decompression algorithm. I did most of my testing with the -tob option.

LHARK compression format

Here’s a summary of the differences between lh5x and LHARK. For LHARK:

  • The history buffer is 64K (instead of, e.g., 8K for lh5).
  • The second Huffman tree (“literals and lengths”) can only have up to 289 codes (numbered 0 to 288), instead of the usual 510 (numbered 0 to 509).
  • The third Huffman tree (which I call “offset codes”) uses 6 bits for its “number of codes” field and, if that field is 0, for the field that follows it.
  • Additional processing is needed to decode the length codes. See the “Decoding the match length” section.
  • The codes from the “offsets” tree are processed differently. See the “Decoding the offset” section.
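The first three differences in the list above are just parameter changes, so a decoder that is already parameterized for lh5x might capture them in a small table like the sketch below. The struct and its field names are hypothetical. The lh5 value for the offsets tree's field width (4 bits) comes from common LHA decoders rather than from this article, so treat it as an assumption. The lh_wrap helper only illustrates one typical use of the history size, a circular buffer index.

```c
#include <assert.h>
#include <stddef.h>

/* Parameters that differ between lh5 and LHARK's "lh7" (hypothetical layout). */
struct lh_params {
    size_t   history_size;      /* size of the history buffer, in bytes */
    unsigned max_ll_codes;      /* max codes in the "literals and lengths" tree */
    unsigned offset_count_bits; /* width of the offsets tree's "number of codes" field */
};

/* lh5: the 4-bit field width is an assumption, not stated in this article. */
const struct lh_params LH5_PARAMS   = { 8u * 1024u,  510, 4 };
const struct lh_params LHARK_PARAMS = { 64u * 1024u, 289, 6 };

/* Wrap a position into a circular history buffer.
   Requires the history size to be a power of two. */
size_t lh_wrap(const struct lh_params *p, size_t pos)
{
    assert((p->history_size & (p->history_size - 1)) == 0);
    return pos & (p->history_size - 1);
}
```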

Decoding the match length

The code snippets below use the C language. All variables are unsigned integers.

The compressed data is a sequence of compression codes, each of which represents either a literal byte, or an instruction to repeat some previously-decompressed bytes. The first thing to do is to read a value, “llcode”, using the “literals and lengths” tree.

If llcode is less than 256, it is a literal. This is the same in all formats, so I’ll say no more about it.

For reference, with lh5x format, to get the “match length”, you simply subtract 253 from llcode:

// lh5x match length

if (llcode>=256 && llcode<=509) {
  match_length = llcode-253;
}

With LHARK, it is more complex. The llcode value generally only implies the three most significant bits of the match length (shaped like “1xx”). The remaining bits, if any, are read directly from the compressed data stream, immediately after llcode.

// LHARK match length

if (llcode>=256 && llcode<=263) {
  match_length = llcode-253;
}
else if (llcode>=264 && llcode<=287) {
  num_lowbits = (llcode-260)/4;
  lowbits = read_bits(num_lowbits);
  match_length = ((4+(llcode%4)) << num_lowbits) + lowbits + 3;
}
else if (llcode==288) {
  match_length = 514;
}
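To make the snippet above easier to try out, here is a self-contained sketch that pairs it with a toy bit reader. The bitreader struct and this read_bits implementation are my own stand-ins, not anything taken from LHARK, and they assume bits are consumed most significant bit first, as in lh5x decoders. Reading past the end of the data yields zero bits, which a real decoder would want to treat as an error instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Minimal bit reader, most significant bit first (assumed bit order). */
struct bitreader {
    const uint8_t *data;
    size_t len;     /* length of data in bytes */
    size_t bitpos;  /* number of bits consumed so far */
};

/* Read n bits (0 <= n <= 16). Bits past the end of the data read as 0. */
unsigned read_bits(struct bitreader *br, unsigned n)
{
    unsigned value = 0;
    assert(n <= 16);
    while (n-- > 0) {
        size_t byte = br->bitpos / 8;
        unsigned bit = 0;
        if (byte < br->len) {
            bit = (br->data[byte] >> (7 - br->bitpos % 8)) & 1u;
        }
        value = (value << 1) | bit;
        br->bitpos++;
    }
    return value;
}

/* Turn an llcode in 256..288 into a match length,
   reading extra bits from the stream when the code calls for them. */
unsigned lhark_match_length(struct bitreader *br, unsigned llcode)
{
    assert(llcode >= 256 && llcode <= 288);
    if (llcode <= 263) {
        return llcode - 253;                        /* 3..10, no extra bits */
    }
    if (llcode <= 287) {
        unsigned num_lowbits = (llcode - 260) / 4;  /* 1..6 */
        unsigned lowbits = read_bits(br, num_lowbits);
        return ((4 + (llcode % 4)) << num_lowbits) + lowbits + 3;
    }
    return 514;                                     /* llcode == 288 */
}
```

As a worked example, llcode 270 gives num_lowbits = 2 and high bits 4+2 = 6, so with extra bits “01” the match length is (6 << 2) + 1 + 3 = 28.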

Decoding the offset

After the match length, you read/calculate the offset. Start by reading a value, “ocode”, using the offsets tree.

For reference, in lh5x, the offset is then calculated like this:

// lh5x offset

if (ocode<=1) {
  offset = ocode;
}
else {
  num_lowbits = ocode-1;
  lowbits = read_bits(num_lowbits);
  offset = (1 << num_lowbits) + lowbits;
}

LHARK is similar, but an additional high bit is implied by ocode. That is, the high bits are shaped like “1x” instead of “1”.

// LHARK offset

if (ocode<=3) {
  offset = ocode;
}
else { // 4 <= ocode <= 31
  num_lowbits = (ocode-2)/2;
  lowbits = read_bits(num_lowbits);
  offset = ((2+(ocode%2)) << num_lowbits) + lowbits;
}
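The same logic can be split into two small pure functions, which makes it easy to check by hand. This is only a restatement of the snippet above under the same uncertainty; the function names are made up, and the caller is assumed to have already read the extra bits from the stream.

```c
#include <assert.h>

/* Number of extra bits that follow an LHARK offset code (0..31). */
unsigned lhark_offset_extra_bits(unsigned ocode)
{
    assert(ocode <= 31);
    return (ocode <= 3) ? 0 : (ocode - 2) / 2;
}

/* Combine an offset code with its already-read extra bits. */
unsigned lhark_offset(unsigned ocode, unsigned lowbits)
{
    unsigned num_lowbits = lhark_offset_extra_bits(ocode);
    assert(lowbits < (1u << num_lowbits));
    if (ocode <= 3) {
        return ocode;
    }
    /* High bits are "1x": 2 for even codes, 3 for odd codes. */
    return ((2 + (ocode % 2)) << num_lowbits) + lowbits;
}
```

As a sanity check, the largest code, 31, takes 14 extra bits, and with all of them set the offset is (3 << 14) + 16383 = 65535, which fits the 64K history buffer.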

That should be all you need.
