Notes on WinHelp format, part 3

This post is part of a series about the WinHelp file format. Please read the other parts first.

With what we learned previously, we can decompress the TOPIC blocks, locate the TOPICLINKs, and stitch each TOPICLINK’s fragments together to make each TOPICLINK a contiguous blob of bytes.

A defragmented TOPICLINK is composed of three parts:

  • A header
  • A “LinkData1” segment
  • A “LinkData2” segment

There are a few different kinds (record types) of TOPICLINKs, identified by one of the bytes in the header:

Record type   Description
1             Displayable text (old version)
2             Topic header
32 (0x20)     Displayable text (new version)
35 (0x23)     Table

For record type 2 (topic header), LinkData1 contains some simple navigational information. For the other record types, it contains a complex set of data that describes the formatting of the TOPICLINK, and of each string in the LinkData2 segment.

The LinkData2 segment always contains a list of strings. For record type 2, the first string is the topic title. If there are any other strings, they are macros that should not be displayed to the user. For the other record types, all strings are displayable.

You can pretty much extract the raw text without using LinkData1, but there will be no formatting at all. You can’t tell where sentences and paragraphs begin and end. And it seems that there is some text-like data in LinkData1 that I’d consider to be displayable, such as tabs, and non-breaking spaces. But I’m not going to go into any more detail about LinkData1.

LinkData2 may be uncompressed, but more often than not, it is compressed using one of two tokenization schemes which together I’ll call phrase compression.

From the TOPICLINK header, you can get a pre-decompression size and a post-decompression size of LinkData2. If the post-decompression size is larger, and the HLP file supports phrase compression, then that TOPICLINK’s LinkData2 segment uses phrase compression.

Phrase compression makes use of other internal files, outside of the TOPIC file. So, you ought to read those files before reading the TOPIC file.

The original phrase compression scheme is simply known as “Phrase compression”, but I’ll call it “OldPhrase compression”, to distinguish it from phrase compression in general. The other kind is known as “Hall compression”.

Note that the LinkData2 segment is phrase-compressed as a whole. You don’t decompress the strings individually (well, I think you can do that with OldPhrase compression, but not with Hall compression). After decompression, the strings are separated by NUL bytes.
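Once a segment has been decompressed, splitting it into strings is a one-liner. Here is a minimal Python sketch. The function name is made up for illustration, and it assumes that a trailing NUL terminates the last string rather than introducing an extra empty one.

```python
from typing import List


def split_linkdata2(decompressed: bytes) -> List[bytes]:
    """Split a decompressed LinkData2 segment into its NUL-separated strings."""
    parts = decompressed.split(b"\x00")
    # Assumption: a final NUL is a terminator, so drop the empty tail it leaves.
    if parts and parts[-1] == b"":
        parts.pop()
    return parts
```

For a topic header (record type 2), the first element would be the title and any remaining elements would be macros.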

OldPhrase compression

If the HLP file contains a file named “|Phrases”, then compressed LinkData2 segments use OldPhrase compression.

The “|Phrases” file contains, among other things, a table of phrase offsets, and a blob of phrase data. Oddly, the phrase offsets are measured from the beginning of the offsets table, instead of the beginning of the phrase data.

For HLP version 1.21 and later, the phrase data is compressed with LZ77. The offsets are based on the decompressed phrase data.

Using the offset table, you can split the phrase data into individual phrases. (Usually, each “phrase” is just a single word.) Given a number N, you will have to be able to look up phrase number N. The first phrase is number 0.
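As a rough illustration of the splitting step, here is a Python sketch. It assumes you have already parsed the offsets into a list of integers and (if needed) decompressed the phrase data. It also assumes the table holds one more entry than there are phrases, with the last entry marking the end of the final phrase, and that the phrase data starts right after the table, so the first offset doubles as the table's size. Those assumptions and the function name are mine, so check them against helpfile.txt.

```python
from typing import List, Sequence


def split_old_phrases(offsets: Sequence[int], phrase_data: bytes) -> List[bytes]:
    """Cut the phrase blob into phrases using table-relative offsets."""
    # Offsets are measured from the start of the offsets table, so rebase
    # them onto the start of the phrase data (assumed to be offsets[0]).
    base = offsets[0]
    return [
        phrase_data[start - base:end - base]
        for start, end in zip(offsets, offsets[1:])
    ]
```

With the resulting list, phrase number N is simply `phrases[N]`.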

Use the following algorithm to decompress a LinkData2 segment that uses OldPhrase compression:

Read a byte (B).
  If B is 0 or >=16, emit it untranslated.
  Otherwise:
    Read the next byte (B2).
    Emit phrase #{((B-1)<<7) + (B2>>1)}.
    If B2 is odd, emit a space.
Repeat.
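The same loop, written as a Python sketch. It assumes `phrases` is an indexable sequence of byte strings where element N is phrase number N. The function name and the error handling for a truncated token are my own additions.

```python
from typing import Sequence


def decompress_oldphrase(data: bytes, phrases: Sequence[bytes]) -> bytes:
    """Expand an OldPhrase-compressed LinkData2 segment."""
    out = bytearray()
    i = 0
    while i < len(data):
        b = data[i]
        i += 1
        if b == 0 or b >= 16:
            out.append(b)  # literal byte, emitted untranslated
            continue
        if i >= len(data):
            raise ValueError("phrase token is missing its second byte")
        b2 = data[i]
        i += 1
        out += phrases[((b - 1) << 7) + (b2 >> 1)]
        if b2 & 1:
            out.append(0x20)  # odd second byte: phrase is followed by a space
    return bytes(out)
```

For example, with phrases `[b"Hello", b"world"]`, the bytes `01 01 01 02 21 00` expand to `Hello world!` followed by a NUL.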

Hall compression

If the HLP file contains files named “|PhrIndex” and “|PhrImage”, then compressed LinkData2 segments use Hall compression.
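Putting the detection rules together, a decoder might choose a scheme for each TOPICLINK roughly like this. This is only a sketch: the function name and return values are invented, and checking for the Hall files before “|Phrases” is an assumption about what to do if a file somehow contained both.

```python
from typing import Collection


def pick_phrase_scheme(internal_files: Collection[str],
                       stored_size: int, expanded_size: int) -> str:
    """Return 'hall', 'oldphrase', or 'none' for one LinkData2 segment."""
    # A segment is only phrase-compressed if it grows when decompressed.
    if expanded_size <= stored_size:
        return "none"
    if "|PhrIndex" in internal_files and "|PhrImage" in internal_files:
        return "hall"
    if "|Phrases" in internal_files:
        return "oldphrase"
    return "none"  # the HLP file does not support phrase compression
```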

The PhrIndex file starts with a header that contains some information that you’ll need:

  • NumPhrases
  • PhrImage compressed size
  • PhrImage uncompressed size
  • BitCount (valid values are from 1 to 5(?))

The “phrase offset table” starts at offset 28 in the PhrIndex file. To decode it:

For each phrase (0...NumPhrases-1):
  Read bits one by one, least-significant bit first,
   counting the number of 1 bits you get. Stop reading
   after you get a 0 bit.
  Read the next {BitCount} bits, interpreted as an
   unsigned int, and assign the value to N.
  Let Length[phrase] = N + (num_1_bits << BitCount) + 1
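Here is one way to write that in Python. It is a sketch under a couple of assumptions: `table` is the PhrIndex data starting at offset 28 (for example `phrindex[28:]`), bits are consumed from each byte starting with the least-significant bit, and the {BitCount}-bit field is also assembled least-significant bit first. The names are mine.

```python
from typing import List


def decode_phrase_lengths(table: bytes, num_phrases: int, bit_count: int) -> List[int]:
    """Decode the Hall phrase offset table into a list of phrase lengths."""
    pos = 0  # current bit position within the table

    def read_bit() -> int:
        nonlocal pos
        bit = (table[pos >> 3] >> (pos & 7)) & 1  # LSB of each byte comes first
        pos += 1
        return bit

    lengths = []
    for _ in range(num_phrases):
        ones = 0
        while read_bit():  # count 1 bits; the terminating 0 bit is consumed
            ones += 1
        n = 0
        for k in range(bit_count):  # assumed: first bit read is the low bit of N
            n |= read_bit() << k
        lengths.append(n + (ones << bit_count) + 1)
    return lengths
```

As a small worked case, with BitCount = 2 the single byte 0x2C holds the bit sequence 0,0,1,1,0,1,0, which decodes to two phrases of lengths 3 and 6.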

Now you have a table of phrase lengths. The phrases are stored contiguously in the PhrImage file, with phrase #0 starting at position 0, so you can use the lengths to compute the position of each phrase.
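Turning lengths into phrases is a running sum. A small Python sketch, assuming `image` is the already-decompressed PhrImage data and `lengths` is the list of phrase lengths (the function name is hypothetical):

```python
from itertools import accumulate
from typing import List, Sequence


def split_phrase_image(image: bytes, lengths: Sequence[int]) -> List[bytes]:
    """Slice the PhrImage data into phrases, with phrase #0 at position 0."""
    starts = [0, *accumulate(lengths)]  # starts[i] is where phrase #i begins
    if starts[-1] > len(image):
        raise ValueError("phrase lengths run past the end of PhrImage")
    return [image[s:s + n] for s, n in zip(starts, lengths)]
```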

The whole PhrImage file is usually LZ77 compressed. Decompress it first if its compressed size is not equal to its uncompressed size.

Use the following algorithm to decompress a LinkData2 that uses Hall compression:

Read a byte (B) of compressed data.
  Follow the instructions for the bit pattern below that B matches:
   xxxxxxx0: Emit phrase #(B>>1).
   xxxxxx01: Read the next byte (B2);
             emit phrase #{B2 + 128 + 256*(B>>2)}.
   xxxxx011: Copy the next {(B>>3) + 1} bytes literally.
   xxxx0111: Emit {(B>>4)+1} spaces.
   xxxx1111: Emit {(B>>4)+1} NUL (0x00) bytes.
Repeat.

Note that every possible byte value matches exactly one of the bit patterns.
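The bit-pattern dispatch translates fairly directly into code. The following Python sketch assumes `phrases` is an indexable sequence of byte strings where element N is phrase number N. The function name and the checks for truncated input are my own.

```python
from typing import Sequence


def decompress_hall(data: bytes, phrases: Sequence[bytes]) -> bytes:
    """Expand a Hall-compressed LinkData2 segment."""
    out = bytearray()
    i = 0
    while i < len(data):
        b = data[i]
        i += 1
        if (b & 0x01) == 0x00:      # xxxxxxx0: one-byte phrase reference
            out += phrases[b >> 1]
        elif (b & 0x03) == 0x01:    # xxxxxx01: two-byte phrase reference
            if i >= len(data):
                raise ValueError("phrase token is missing its second byte")
            out += phrases[data[i] + 128 + 256 * (b >> 2)]
            i += 1
        elif (b & 0x07) == 0x03:    # xxxxx011: run of literal bytes
            count = (b >> 3) + 1
            if i + count > len(data):
                raise ValueError("literal run extends past end of data")
            out += data[i:i + count]
            i += count
        elif (b & 0x0F) == 0x07:    # xxxx0111: run of spaces
            out += b" " * ((b >> 4) + 1)
        else:                       # xxxx1111: run of NUL bytes
            out += b"\x00" * ((b >> 4) + 1)
    return bytes(out)
```

The branches are tested from the shortest suffix to the longest, so the final `else` can only be reached by bytes ending in 1111.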

Character encoding

So now you have a bunch of strings, but they’re really just strings of byte values. How do you interpret them as letters and stuff?

The strings use one of the Windows “ANSI” code pages, such as Windows-1252, but unfortunately, I don’t think there’s any reliable way to tell which one.

There’s a setting in the “|SYSTEM” file named “CHARSET”, but (1) it’s only present in newer HLP files, and (2) nobody seems to know how to interpret it. It’s not clear to what extent it even contains information about the character encoding.
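Since the code page can't be detected reliably, a practical approach is to let the caller pick one and fall back to a guess. The sketch below defaults to Windows-1252, which is only my assumption of a reasonable default for Western-language files. The function name is invented.

```python
def decode_help_text(raw: bytes, codepage: str = "cp1252") -> str:
    """Decode a LinkData2 string using a caller-chosen Windows code page."""
    # errors="replace" substitutes U+FFFD for bytes the code page leaves
    # undefined, instead of raising an exception.
    return raw.decode(codepage, errors="replace")
```

A file written in Russian, for example, would more likely need `codepage="cp1251"`.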

Summary

This has been an overview of what I know about getting text out of HLP files. Again, for the gory details, refer to the file named helpfile.txt in the software named “helpdeco”.

There are, in fact, even more types of compression used in HLP files. There’s a run-length encoding scheme used in some graphics. And there are some variable-length integer formats that the documentation names “compressed unsigned long”, etc. — though it might be a stretch to classify that as data compression. There might well be more types of compression that I don’t know about. But, for now at least, that’s all I’m going to write about.
