Notes on WinHelp format, part 2

This post is part of a series about the WinHelp file format.

The internal TOPIC file (named “|TOPIC”) is the business part of the HLP file. It contains the text, and other information.

To read the TOPIC file, you need to know the TOPIC “block size”, which will be either 2048 or 4096 bytes. You can work out which one from information in the SYSTEM file: the version number and some flags.

The TOPIC file contents are in the form of a series of blocks of that block size. The first 12 bytes of each block make up the TOPIC block header; the remaining bytes are what I’ll call the block “contents” or “data”. Every block is the same size, except that the last one can be smaller.
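As a minimal sketch of that layout, the Python below splits the raw TOPIC file bytes into (header, contents) pairs. The name `split_topic_blocks` is made up for illustration, and it assumes you have already extracted the TOPIC file and worked out the block size.

```python
HEADER_SIZE = 12  # the TOPIC block header size described above


def split_topic_blocks(topic, block_size):
    """Split raw TOPIC file bytes into a list of (header, contents) pairs.

    Every block is block_size bytes long, except that the last one
    may be shorter.
    """
    if block_size not in (2048, 4096):
        raise ValueError("expected a block size of 2048 or 4096")
    blocks = []
    for start in range(0, len(topic), block_size):
        block = topic[start:start + block_size]
        blocks.append((block[:HEADER_SIZE], block[HEADER_SIZE:]))
    return blocks
```

With a 2048-byte block size, a full block carries 2036 bytes of contents.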

The blocks’ contents contain some number of structures called TOPICLINKs. (The name “TOPICLINK” seems a little misleading to me. A TOPICLINK is just a section of the TOPIC data. It’s actual data, not a “link” to data.)

Version 1.15 TOPIC blocks

First, let’s look at the HLP version 1.15 TOPIC file. This version never uses HLP-LZ77 compression, and the block size is always 2048. Using the previous diagram as a reference, the TOPIC contents could look something like this:

The white+hatched regions are the 12-byte block headers. A TOPICLINK can span multiple TOPIC blocks, in which case there will be 12 bytes of “dead space” for each block header that gets in the way.

Each TOPICLINK contains a field giving its size, as well as a field that points to the next TOPICLINK. Unfortunately, these fields use different measuring systems.

If you were to concatenate all the block data into a single buffer, deleting the block headers, then the TOPICLINK size fields would be correct. They do not include the size of any TOPIC block headers that might appear in the middle of that TOPICLINK.

But now the next-TOPICLINK fields would be wrong, because they do respect the TOPIC block headers. In version 1.15 (but not later versions!), a next-TOPICLINK field points directly to the actual byte position in the TOPIC file contents where the next TOPICLINK begins.
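To make the two measuring systems concrete, here is a rough sketch of converting between them for version 1.15 (block size 2048, no compression). The function names are my own, invented for illustration. A “file position” is a byte offset in the TOPIC file contents, which is what the next-TOPICLINK fields use; a “data position” is an offset into the concatenated, header-free buffer, which is what the size fields measure.

```python
BLOCK_SIZE = 2048                      # version 1.15 always uses 2048
HEADER_SIZE = 12                       # TOPIC block header
DATA_SIZE = BLOCK_SIZE - HEADER_SIZE   # 2036 bytes of contents per full block


def file_pos_to_data_pos(file_pos):
    """Convert a TOPIC file position to a position in the header-free data."""
    block, offset = divmod(file_pos, BLOCK_SIZE)
    if offset < HEADER_SIZE:
        raise ValueError("position points into a block header")
    return block * DATA_SIZE + (offset - HEADER_SIZE)


def data_pos_to_file_pos(data_pos):
    """Convert a header-free data position back to a TOPIC file position."""
    block, offset = divmod(data_pos, DATA_SIZE)
    return block * BLOCK_SIZE + HEADER_SIZE + offset


def end_of_topiclink(start_file_pos, size):
    """File position just past a TOPICLINK, given its start and size field."""
    return data_pos_to_file_pos(file_pos_to_data_pos(start_file_pos) + size)
```

For example, a 100-byte TOPICLINK that starts at file position 2000 crosses one block header, so it ends at file position 2112 rather than 2100. Because of the possible unused bytes between TOPICLINKs, that end position is not necessarily where the next TOPICLINK starts.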

Even if you just want to read all the TOPICLINKs sequentially, you can’t get away with only using the TOPICLINK size fields, because there can be unused bytes between the TOPICLINKs. This is somewhat uncommon, but it definitely happens.

I do not know whether it’s safe to assume that the very first TOPICLINK always starts at the earliest possible position (i.e., address 12). I haven’t seen any exceptions. There are other ways to locate TOPICLINKs, though; one is to use a field in the TOPIC block header.

Version 1.21+ TOPIC blocks

In HLP version 1.21 and later, the TOPIC block contents can optionally be compressed using the HLP-LZ77 algorithm (“LZ77” for short) described in Part 1. Fields in the SYSTEM file tell you whether LZ77 is used.

If LZ77 is used, then each block’s contents are compressed independently. Do not concatenate the compressed data together and then try to decompress it all at once; that won’t work.

In effect (or in reality, if you want) the TOPIC contents must first be transformed so that they have a larger block size. This is the case whether or not LZ77 compression is used.

Specifically, the contents need to be transformed, block by block, to have a block size of 16384 bytes. (At least, I think the magic number is always 16384. I’m not 100% sure about that.)

If a block is LZ77-compressed, decompress it as part of this transformation. Otherwise, just copy its contents. Keep the block headers the same.

This means that a block can decompress to a maximum of 16372 (16384 minus 12) bytes. (At least, I think 16372 is the maximum. The documentation isn’t clear.) If a block decompresses to more than that, discard any extra decompressed bytes.

The only new thing to watch out for is that each block has its own contents length, and most blocks’ contents are followed by unused addresses, up to the end of the 16384-byte block. This is additional “dead space”: the gray-hatched regions in the diagram.
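Here is a sketch of that transformation in Python, under the assumptions stated in this post (16384-byte target blocks, at most 16372 bytes of contents each). The function name `expand_topic` is hypothetical, and the `decompress` argument stands in for an HLP-LZ77 decoder that you supply yourself; it is not implemented here. Filling the dead space with zero bytes is my own choice, since nothing should ever read it.

```python
HEADER_SIZE = 12
EXPANDED_BLOCK_SIZE = 16384                       # assumed constant, see above
MAX_CONTENTS = EXPANDED_BLOCK_SIZE - HEADER_SIZE  # 16372


def expand_topic(topic, block_size, decompress=None):
    """Re-block TOPIC data into 16384-byte blocks.

    decompress: caller-supplied function (bytes -> bytes) implementing
    HLP-LZ77, or None if the file does not use LZ77.
    Returns (expanded_bytes, contents_lengths), one length per block.
    """
    expanded = bytearray()
    lengths = []
    for start in range(0, len(topic), block_size):
        block = topic[start:start + block_size]
        # Keep the header as-is (padded only if the file is truncated).
        header = block[:HEADER_SIZE].ljust(HEADER_SIZE, b"\0")
        contents = block[HEADER_SIZE:]
        if decompress is not None:
            contents = decompress(contents)   # each block independently
        contents = contents[:MAX_CONTENTS]    # discard any extra bytes
        lengths.append(len(contents))
        padding = bytes(MAX_CONTENTS - len(contents))  # zero-filled dead space
        expanded += header + contents + padding
    return bytes(expanded), lengths
```

The returned list of lengths records where each block's real contents end, so a reader can tell data from dead space.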

The reason for doing this transformation is that now the next-TOPICLINK pointers make sense. They can be interpreted the same way as in version 1.15.
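In the transformed layout, interpreting a next-TOPICLINK pointer comes down to a division, roughly like the sketch below. This follows the model described in this post, with the same caveat that 16384 is an assumed constant; `split_address` is a name I made up.

```python
HEADER_SIZE = 12
EXPANDED_BLOCK_SIZE = 16384  # assumed constant


def split_address(address):
    """Split a TOPIC address into (block index, offset within block contents)."""
    block_index, offset = divmod(address, EXPANDED_BLOCK_SIZE)
    if offset < HEADER_SIZE:
        raise ValueError("address points into a block header")
    return block_index, offset - HEADER_SIZE
```

The offset then has to be checked against that block's own contents length, since an address past the end of the contents lands in dead space.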

But how…?

It took me a while to realize that there’s something peculiar about the TOPIC file format when LZ77 compression is used. Specifically, it’s “impossible”.

Oh, you can read it just fine. But it’s impossible to create.

A given TOPICLINK’s pointer to the next TOPICLINK is part of the compressed data. Its value could affect the size of the compressed data, which affects the address of the next TOPICLINK. In other words, you have to compress and write the next-TOPICLINK pointer before you know what it is.

I can’t speak for every HLP file out there, but the ones I looked at seem to break this causality loop by not really compressing the first 20 or so bytes of each TOPICLINK. They set the LZ77 “control bits” to 0 for these bytes, so that they are never part of a compressed sequence. That way, the HLP file compiler can go back and patch up the next-TOPICLINK fields (and maybe other fields), once it figures out what they should be. This solution is even messier than it sounds, because you also have to keep track of which bytes are “control bytes”. But that is evidently how it works.

In the next part, I’ll go over the internal structure of the TOPICLINKs, most of which have an additional layer of compression.
