LHA is a compressed archive file format and compression utility that was, for a long time, a competitor of ZIP. It’s also known as LZH format, or LHarc format, but I’ll call it LHA.
In the course of researching it, I came across an obscure lookalike format created by a program named CAR. CAR is a DOS program, released in 1996 by MylesHi! Software, and distributed as shareware. Search for “car150.zip” if you want a copy.
CAR files look superficially like LHA, but they are not compatible. Since I’m writing a program that reads LHA files, it’d be nice to have an algorithm for telling them apart.
I could guess, based on the filename. A CAR file usually has a name that ends in “.CAR”. But I’d rather not guess. I want to be able to tell just by looking at the bytes that make up the file.
Unlike most one-off utilities like this, CAR actually documents its file format. It says:
Header for CAR files (version 1.50)
-----------------------------------

Structure of archive block (low order byte first):
-----preheader
 1  basic header size = 25 + strlen(filename) (= 0 if end of archive)
 1  basic header algebraic sum (mod 256)
-----basic header
 5  method ("-lh0-" = stored, "-lh5-" = compressed)
 4  compressed size (including extended headers)
 4  original size
 1  filename length (x)
 x  filename
 2  original file's CRC
 2  original file's attribute
 2  original file's time
 2  original file's date
 1  0x20 (placeholder, not used for now)
 2  first extended header size (0 if none) -- 0 for CAR files
-----first extended header, etc. (none for CAR files)
-----compressed file
There’s an influential piece of open source software often called “ar002”, due to its lack of a distinct name. It was written by Haruhiko Okumura in 1990, and it seems to be the grand progenitor of LHA v2’s default “lh5” compression method, as well as a lot of other software that uses “LZH” compression. In a comment in its source code, it gives an overview of the file format it uses:
Structure of archive block (low order byte first):
-----preheader
 1  basic header size = 25 + strlen(filename) (= 0 if end of archive)
 1  basic header algebraic sum (mod 256)
-----basic header
 5  method ("-lh0-" = stored, "-lh5-" = compressed)
 4  compressed size (including extended headers)
 4  original size
 4  not used
 1  0x20
 1  0x01
 1  filename length (x)
 x  filename
 2  original file's CRC
 1  0x20
 2  first extended header size (0 if none)
-----first extended header, etc.
-----compressed file
To be clear, ar002 did not invent this format. It’s just a particular implementation of the older LHA format. But I think it’s safe to say that CAR’s format description is derived from this, or a descendant of it.
So, we see that CAR has mostly the same fields as LHA, but some of them are in a different order.
There are actually four different versions (or “header levels”) of LHA format, with the header level (0 to 3) given by the byte at offset 20. Originally, this byte was the high byte of the file attributes field. But it was always 0, and was later repurposed as the version number. In the ar002 format, this is the “0x01” byte. However, there is no corresponding byte in CAR. It sets the high byte of its attributes field to 0, though its format seems to be derived from LHA header level 1.
In CAR format, the byte at offset 20 is the fifth character of the filename, provided the filename is at least 5 characters long. Filenames in DOS generally cannot have characters below value 0x21. So, by looking at the byte at offset 20, you can distinguish between LHA format, and any CAR file for which the first archived file has a name that’s at least 5 characters long.
But, of course, filenames of length 1 to 4 are perfectly normal. And the fields that come after the filename in CAR format can contain bytes with values 0 to 3, which could then be mistaken for an LHA header level.
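The offset-20 test, including the case where it cannot decide, can be sketched roughly as follows. This is only an illustration: `guess_format` is a hypothetical helper, and it assumes the input is already known to be one of the two formats.

```python
def guess_format(header: bytes) -> str:
    """First-pass guess from the byte at offset 20 (hypothetical helper).

    Assumes the file is known to be either LHA or CAR.
    """
    if len(header) < 21:
        raise ValueError("need at least 21 bytes of header")
    if header[20] > 3:
        # Not a valid LHA header level (0 to 3), so it is not LHA.
        return "car"
    if header[15] >= 5:
        # In CAR, offset 15 is the filename length. With 5 or more characters,
        # offset 20 would be a filename character (0x21 or above), not 0 to 3.
        return "lha"
    # A short CAR filename puts the CRC, attribute, or time fields at
    # offset 20, and those can legitimately hold 0 to 3.
    return "ambiguous"
```

A result of "ambiguous" is exactly the situation the rest of this post explores.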
Though I didn’t really want to spend much more time on this silly CAR format, I started to wonder whether I could prove the task impossible, by constructing a file that is simultaneously a valid CAR file and a valid LHA file, yet has a different meaning depending on which format you think it is.
Files that are valid for two or more formats are sometimes called “polyglots”. If you want to see some better examples, check out Ange Albertini’s work.
I tried a few things, but it seemed too difficult to find a good solution in which the internal file’s contents are different. I did find an alignment that seemed promising for LHA header level 0, but only the file names could be different, not the file contents.
I decided to pursue the “header level 0” idea. Here’s the final 32-byte polyglot I came up with:
The middle column is the bytes that make up the file. On either side are the field names, and interpretation of the values.
The header size is measured from the start of the “compression method” field. For LHA header level 0, and CAR, the file data must start at the end of the header. Combined with the fact that the file size fields are shared, this means there’s no way to make the file contents be different.
The biggest defect with my file is that the LHA timestamp is not a valid time. But LHA is a large and diverse ecosystem; there are probably lots of weird LHA files out there. CAR, on the other hand, is a monoculture. So we really want the file to be something that the CAR software could actually produce.
The LHA filename ends with a space: “BBB “. That would be bad for CAR format, which as far as I know is DOS only, but it’s okay for LHA.
The CRC-16 fields are a hash of the internal file’s data after decompression. Since the “lh0” method I used just means “not compressed”, we can ignore the decompression step. We do need the file to pass this CRC check, so we need a file whose CRC-16 is 0x0000 (0).
A brute force search would work fine, but I recently learned of a much easier way to make files whose CRC-16 is 0 (at least for the CRC-16 algorithm used here). You just need to be able to change the last two bytes.
To make it nontrivial, I decided to use three bytes. The first byte has the arbitrary value 0x5a. Calculate the CRC-16 of all bytes except the last two. We get 0x3b80. Now split that value into two bytes, least-significant byte first: 0x80 0x3b. Use that as last two bytes of the file, and the file as a whole is now guaranteed to have a CRC-16 of 0.
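Here is a minimal sketch of that trick. It assumes the CRC-16 in question is the common reflected variant with polynomial 0xA001 and an initial value of 0 (often called CRC-16/ARC); that assumption reproduces the 0x3b80 value above. The function names are my own.

```python
def crc16_arc(data: bytes, crc: int = 0) -> int:
    """Bitwise CRC-16 with reflected polynomial 0xA001 and initial value 0."""
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ 0xA001 if crc & 1 else crc >> 1
    return crc


def zero_crc_suffix(prefix: bytes) -> bytes:
    """Two bytes that, appended to prefix, bring the overall CRC to 0."""
    return crc16_arc(prefix).to_bytes(2, "little")


prefix = bytes([0x5A])
payload = prefix + zero_crc_suffix(prefix)
print(payload.hex(" "))          # 5a 80 3b
print(hex(crc16_arc(payload)))   # 0x0
```

This works because the algorithm has no final XOR step: feeding the current CRC back in, low byte first, cancels the register to zero.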
Finally, most LHA files (at least for these header levels) end with a 0x00 byte to mark the end of the sequence of compressed files.
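As a cross-check, the whole file can be assembled from the CAR field layout quoted earlier. This is a sketch under stated assumptions, not CAR's actual code: the attribute value 0x20 (the DOS archive bit) is my assumption, the timestamp is the 2013-02-02 08:16:08 value that CAR is given in the test below, and the helper names are hypothetical.

```python
import struct


def dos_time(hour: int, minute: int, second: int) -> int:
    """Pack a time into the 16-bit DOS format (2-second resolution)."""
    return (hour << 11) | (minute << 5) | (second // 2)


def dos_date(year: int, month: int, day: int) -> int:
    """Pack a date into the 16-bit DOS format (years counted from 1980)."""
    return ((year - 1980) << 9) | (month << 5) | day


def build_car_stored(name: bytes, data: bytes, crc: int,
                     time: int, date: int, attr: int = 0x20) -> bytes:
    """Build one uncompressed ("-lh0-") CAR member, per the documented layout.

    attr=0x20 (archive bit) is an assumption, not something CAR documents.
    """
    basic = (
        b"-lh0-"
        + struct.pack("<IIB", len(data), len(data), len(name))
        + name
        + struct.pack("<HHHHBH", crc, attr, time, date, 0x20, 0)
    )
    preheader = bytes([len(basic), sum(basic) & 0xFF])
    return preheader + basic + data


poly = build_car_stored(
    b"A", bytes([0x5A, 0x80, 0x3B]), 0x0000,
    dos_time(8, 16, 8), dos_date(2013, 2, 2),
) + b"\x00"  # trailing end-of-archive marker

print(len(poly))  # 32
print(poly.hex(" "))
```

With these inputs, the time and date fields pack to 0x4204 and 0x4242. Read through the LHA level 0 layout, the low byte 0x04 lands on the filename length, and the three 0x42 bytes plus the 0x20 placeholder become the name “BBB “.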
Let’s test it out, by trying to have CAR create the polyglot file. I prepared a DOSBox directory with the CAR.EXE program, and the English version of LHA 2.55b for DOS (LHA_E.EXE).
I created the three-byte file, and named it “A”. With a bit of trial and error, I figured out the right timezone fudge factor to use, to set its timestamp to the time that CAR will interpret as 2013-02-02 08:16:08.
C:\CARPOLY>CAR A POLY A
(Now I realize I shouldn’t have named the file “A”. The first “A” is the “add” command used to create a new archive.)
This file is so small that CAR won’t actually compress it (it will use the “lh0” method instead of “lh5”), which is what I planned on. But even if it wanted to compress it, (1) CAR has an option to disable compression, and (2) it shouldn’t really matter anyway, because the parts of the file related to compression are the same for CAR and LHA.
CAR created a 32-byte file named POLY.CAR. It’s byte for byte identical to the one I designed. So far so good.
Now we pretend it’s an LHA file, and ask the LHA software what it thinks about it:
C:\CARPOLY>LHA_E V POLY.CAR

Listing of archive : POLY.CAR

  Name         Original   Packed  Ratio   Date     Time   Attr Type   CRC
-------------- -------- -------- ------ -------- -------- ---- ----- ----
BBB                   3        3 100.0% 23-04-30 01:39:46 a--w -lh0- 0000
-------------- -------- -------- ------ -------- --------
     1 files          3        3 100.0% 20-08-25 15:43:52
Some weirdness with the date, not surprisingly.
And test it:
C:\CARPOLY>LHA_E T POLY.CAR

Testing archive: POLY.CAR

Test OK BBB c
And extract it:
C:\CARPOLY>LHA_E X POLY.CAR

Extracting from archive : POLY.CAR

Melted BBB_ c
It extracted a 3-byte file named “BBB_”, identical to our “A” file.
Since this is a DOS environment, LHA had to change the space in the filename to something else. But that’s not a problem. The point is that its name doesn’t resemble “A”.
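To make the double reading concrete, here is a rough sketch that parses the same 32 bytes with both layouts. The byte string is my reconstruction from the field descriptions in this post (the attribute byte 0x20 is an assumption), and both parser functions are hypothetical and handle only this simple case.

```python
import struct

# Reconstructed polyglot; the attribute byte (0x20) is an assumption.
POLY = bytes.fromhex(
    "1a b0 2d 6c 68 30 2d 03 00 00 00 03 00 00 00 01 "
    "41 00 00 20 00 04 42 42 42 20 00 00 5a 80 3b 00"
)


def parse_car(buf: bytes) -> dict:
    """Read the first member using the CAR 1.50 layout."""
    method, packed, original, name_len = struct.unpack_from("<5sIIB", buf, 2)
    name = buf[16:16 + name_len]
    (crc,) = struct.unpack_from("<H", buf, 16 + name_len)
    start = 2 + buf[0]  # header size is counted from the method field
    return {"name": name, "crc": crc, "data": buf[start:start + packed]}


def parse_lha_level0(buf: bytes) -> dict:
    """Read the first member using the LHA header level 0 layout."""
    (method, packed, original, time, date,
     attr, level, name_len) = struct.unpack_from("<5sIIHHBBB", buf, 2)
    name = buf[22:22 + name_len]
    (crc,) = struct.unpack_from("<H", buf, 22 + name_len)
    start = 2 + buf[0]
    return {"name": name, "crc": crc, "data": buf[start:start + packed]}


car, lha = parse_car(POLY), parse_lha_level0(POLY)
print(car["name"], lha["name"])    # b'A' b'BBB '
print(car["data"] == lha["data"])  # True
```

The names differ while the data and CRC agree, which matches what CAR and LHA each report.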
Distinguishing CAR and LHA with perfect accuracy is difficult-to-impossible. At the very least, I’ve proven that you must take the validity of the LHA timestamp field into account. One pathological case is a CAR-compressed file with a 1-character filename and a CRC-16 of 0x0000.
But keep in mind that my specimen is not necessarily the only way to make a file that resembles both formats.
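For what it’s worth, a timestamp validity check could look something like the sketch below. It assumes the four bytes at offset 15 of an LHA level 0 header are a DOS-format time followed by a DOS-format date; the function name is my own, and this is one possible heuristic rather than a complete detector.

```python
from datetime import datetime


def dos_timestamp_is_valid(time: int, date: int) -> bool:
    """Return True if 16-bit DOS time and date fields form a real timestamp."""
    second = (time & 0x1F) * 2
    minute = (time >> 5) & 0x3F
    hour = time >> 11
    day = date & 0x1F
    month = (date >> 5) & 0x0F
    year = 1980 + (date >> 9)
    try:
        datetime(year, month, day, hour, minute, second)
    except ValueError:
        return False
    return True


# In my polyglot, the LHA reading sees time 0x4101 and date 0x0000.
print(dos_timestamp_is_valid(0x4101, 0x0000))  # False (month 0, day 0)
# The CAR reading sees 2013-02-02 08:16:08.
print(dos_timestamp_is_valid(0x4204, 0x4242))  # True
```

A failed check only suggests the file is not LHA; as noted above, odd LHA files with bad timestamps may well exist.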