I wanted to write a program to extract the text from WinHelp .HLP files. HLP format was the standard Microsoft Windows help/documentation file format from around 1990 (the start of the Windows 3.x era) through the early 2000s. There are countless old Windows applications that come with an HLP file, but starting with Vista, Windows no longer even ships with an HLP file reader.
This is the first in a series of posts about some of the things I’ve learned about HLP format — before I forget it all.
HLP is a difficult format to decode. Even just getting to the text contained in a file is difficult, never mind all the hypertext rendering and macro processing needed to fully support it.
The best reference that I know of is the helpfile.txt file included with the software named helpdeco. The document is densely packed with information, but it can sometimes be difficult to make sense of. It’s not written as carefully and unambiguously as it could be. I will mostly stick to the terminology used in that document, without knowing how standard it is.
I’m not going to come close to fully covering HLP format. What I want to do is explain the structure of certain parts of it, how the various types of compression work, and how to read the text. I’ll assume you are only trying to read HLP format, not edit it, or create a new file.
At the lowest level, an HLP file starts with a 16-byte “HLP header”. The rest of the file is a data area. The data area contains “internal files” (which I’ll just call “files”) in arbitrary places. There can be unused data between the files.
Each (internal) file starts with a 9-byte header that gives the size of the file.
The first four bytes of an HLP file are always 0x3f 0x5f 0x03 0x00, which helps to identify the format. The only other useful thing in the 16-byte HLP file header is a field giving the offset of one of the files: the internal directory file.
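Here's a rough sketch in C of what reading these two headers might look like. The function names are my own inventions. The field positions are assumptions based on my reading of helpfile.txt, not something stated above: I'm assuming all integers are little-endian, that the directory offset is the 32-bit field at bytes 4 to 7 of the HLP header, and that the 9-byte internal file header is a 4-byte reserved size, a 4-byte used size, and a 1-byte flags field.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read a little-endian unsigned 32-bit integer from p. */
static uint32_t get_u32le(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Check the 4-byte signature and fetch the offset of the internal
 * directory file. Returns 1 on success, 0 if this is not an HLP file.
 * Assumption: the offset is stored at bytes 4..7 of the 16-byte header.
 */
int hlp_directory_offset(const unsigned char *buf, size_t len,
                         uint32_t *offset)
{
    static const unsigned char sig[4] = { 0x3f, 0x5f, 0x03, 0x00 };

    if (len < 16 || memcmp(buf, sig, 4) != 0)
        return 0;
    *offset = get_u32le(buf + 4);
    return 1;
}

/*
 * Fetch the size of the internal file whose 9-byte header starts at
 * file_offset. Returns 1 on success, 0 if the header does not fit.
 * Assumption: the header is reserved size (4 bytes), used size
 * (4 bytes), flags (1 byte), and "used size" is the one we want.
 */
int hlp_internal_file_size(const unsigned char *buf, size_t len,
                           uint32_t file_offset, uint32_t *size)
{
    if (file_offset > len || len - file_offset < 9)
        return 0;
    *size = get_u32le(buf + file_offset + 4);
    return 1;
}
```

With the whole HLP file loaded into memory, you would call hlp_directory_offset first, then pass the resulting offset to hlp_internal_file_size. The file's contents would then start 9 bytes after that offset.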
The internal directory file
The internal directory file contains the names and absolute offsets of all the other files in the HLP file.
I’m going to skip over a really big thing here. The internal directory file is structured as a B+ tree, and is fairly complex. But you’ll have to learn about it somewhere else. If you manage to decode it, you’ll have a table of filenames and offsets, something like this:
Name        Offset
|CONTEXT    15908
|CTXOMAP    12665
|FONT       12300
|KWBTREE    12742
|KWDATA     12676
|KWMAP      12725
|PhrImage   16
|PhrIndex   2277
|SYSTEM     2486
|TOPIC      2787
|TTLBTREE   13813
|bm0        18003
|bm1        20240
|bm2        20408
Many of the filenames have special meanings. Such special names usually start with a “|” (vertical line) character. I don’t know if the names are supposed to be case sensitive. In my experience, it seems safe to assume they are.
The only files I intend to cover are the unnamed internal directory file, “|SYSTEM”, “|TOPIC”, and, if present, “|Phrases”, “|PhrIndex”, and “|PhrImage”.
The SYSTEM file
The next thing to do is to read the file named “|SYSTEM”. It will tell you things like:
- The HLP format version number
- The block size used by the TOPIC file
- Whether the TOPIC file uses HLP-LZ77 compression
- The HLP file title, creation date, and various other attributes
The only format version numbers I’ve seen are 1.15, 1.21, 1.27 (rare), and 1.33. I’ll assume those are the only ones that exist.
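To make that a little more concrete, here's a tentative sketch in C of pulling those settings out of the start of the SYSTEM file. Everything about the layout here is an assumption taken from my reading of helpfile.txt, so check it against that document: a 12-byte little-endian header made of a 0x036c signature, a minor version, a major version, a 4-byte creation date, and a flags field, with the block size and compression derived from the minor version and flags as commented below. The struct and function names are hypothetical.

```c
#include <stddef.h>

/* Settings gathered from the SYSTEM file header (names are my own). */
struct hlp_system_info {
    unsigned minor;            /* 15, 21, 27, or 33 */
    unsigned major;            /* expected to be 1 */
    unsigned flags;
    unsigned topic_block_size; /* in bytes */
    int topic_uses_lz77;       /* 1 if TOPIC uses HLP-LZ77, else 0 */
};

/* Read a little-endian unsigned 16-bit integer from p. */
static unsigned get_u16le(const unsigned char *p)
{
    return (unsigned)p[0] | ((unsigned)p[1] << 8);
}

/*
 * Parse the start of the SYSTEM file. data points at the file's
 * contents, just past its 9-byte internal file header.
 * Returns 1 on success, 0 if the data does not look like SYSTEM.
 *
 * Assumed layout: signature (2), minor (2), major (2),
 * creation date (4), flags (2).
 */
int hlp_parse_system_header(const unsigned char *data, size_t len,
                            struct hlp_system_info *info)
{
    if (len < 12 || get_u16le(data) != 0x036c)
        return 0;

    info->minor = get_u16le(data + 2);
    info->major = get_u16le(data + 4);
    info->flags = get_u16le(data + 10);

    /* Assumed rules: versions up to 1.16 use 2048-byte blocks and no
     * compression; later versions depend on the flags field. */
    if (info->minor <= 16) {
        info->topic_block_size = 2048;
        info->topic_uses_lz77 = 0;
    } else if (info->flags == 4) {
        info->topic_block_size = 4096;
        info->topic_uses_lz77 = 1;
    } else if (info->flags == 8) {
        info->topic_block_size = 2048;
        info->topic_uses_lz77 = 1;
    } else {
        info->topic_block_size = 4096;
        info->topic_uses_lz77 = 0;
    }
    return 1;
}
```

A format version of 1.21 corresponds to major = 1 and minor = 21 here.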
Several different portions of HLP format are compressed using a scheme I’ll call HLP-LZ77. For future reference, here’s how to decompress it.
The format is byte-oriented; you can read it one byte at a time. As far as I know, compressed data bytes will always occur contiguously in the HLP file. It is never necessary to save the decompressor state for later.
Allocate a 4096-byte ring buffer, with a “current position” (current_pos). Initialize it to contain 4096 spaces (0x20). For each post-decompression byte emitted, also store it in the ring buffer at current_pos, then increment current_pos. The ring buffer wraps around in both directions; adding 1 to 4095 results in 0, and subtracting 1 from 0 results in 4095.
It doesn’t matter what current_pos starts at, so you may as well make it 0.
The algorithm (with some expressions in C-like pseudocode):
Read a byte (B) of compressed data.
For each of the 8 bits in B, low to high:
    If the bit is 0:
        Read the next byte, and emit it without translation.
    If the bit is 1:
        Read the next 2 bytes: M0 and M1
        matchlen = (M1>>4) + 3
        matchpos_rel = ((M1 & 0x0f)<<8) | M0
        matchpos_abs = (current_pos - 1 - matchpos_rel) & 0xfff
        From the ring buffer, read and emit matchlen bytes starting at matchpos_abs.
Repeat (read the next B).
Note that in C/C++, if you use unsigned integers for the matchpos/current_pos variables, they will wrap around correctly. Just AND them with 4095 (0xfff) to put them in the right range before using them. Do not use signed integers.
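As a small worked example of that wraparound, here is the matchpos_abs calculation on its own in C. The function name is just something I made up for illustration.

```c
/*
 * Compute the ring buffer index where a match starts.
 * Both arguments are unsigned, so a "negative" intermediate result
 * wraps around modulo a power of two, and masking with 0xfff then
 * gives the right index in the range 0..4095.
 */
unsigned hlp_match_start(unsigned current_pos, unsigned matchpos_rel)
{
    return (current_pos - 1u - matchpos_rel) & 0xfffu;
}
```

For instance, with current_pos = 0 and matchpos_rel = 5, the subtraction wraps below zero, and the masked result is 4090: six positions back from 0 in the ring buffer.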
In some situations, it is critical that you stop immediately if the instructions tell you to read a byte that does not exist. E.g., if the instructions tell you to read two bytes, but only one byte of compressed data remains, stop immediately without emitting anything.
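Putting the pieces above together, here's a minimal sketch of a decompressor in C. The function name, its signature, and the choice to decompress from one memory buffer into another are my own assumptions for the sake of the example. Matched bytes are copied one at a time, because a match is allowed to overlap the bytes it is currently producing.

```c
#include <stddef.h>
#include <string.h>

/*
 * Decompress HLP-LZ77 data from src into dst.
 * Returns the number of bytes written to dst. Stops at the end of
 * the input, at a truncated item, or when dst is full.
 */
size_t hlp_lz77_decompress(const unsigned char *src, size_t src_len,
                           unsigned char *dst, size_t dst_cap)
{
    unsigned char ring[4096];
    unsigned current_pos = 0;
    size_t in = 0, out = 0;

    memset(ring, 0x20, sizeof ring); /* ring buffer starts as spaces */

    while (in < src_len) {
        unsigned flags = src[in++];
        int bit;

        for (bit = 0; bit < 8; bit++) {
            unsigned matchpos = 0, count = 1, k;
            int is_match = (int)((flags >> bit) & 1u);

            if (is_match) {
                unsigned m0, m1;
                if (src_len - in < 2)
                    return out; /* truncated match: emit nothing */
                m0 = src[in++];
                m1 = src[in++];
                count = (m1 >> 4) + 3u;
                matchpos = (current_pos - 1u -
                            (((m1 & 0x0fu) << 8) | m0)) & 0xfffu;
            } else if (in >= src_len) {
                return out;     /* no literal byte left to read */
            }

            for (k = 0; k < count; k++) {
                unsigned char c;
                if (out >= dst_cap)
                    return out; /* output buffer is full */
                c = is_match ? ring[(matchpos + k) & 0xfffu]
                             : src[in++];
                dst[out++] = c;
                ring[current_pos] = c; /* remember every emitted byte */
                current_pos = (current_pos + 1u) & 0xfffu;
            }
        }
    }
    return out;
}
```

As a quick check, the five compressed bytes 0x04 'a' 'b' 0x01 0x10 decode to "ababab": two literals, then a 4-byte match starting two bytes back, which overlaps its own output.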
In the next part, I’ll peel back the first couple of layers of the TOPIC file, which is where the text (mostly) is.