A quick overview of DOS EXE file format

This is a brief introduction to, and a diagram of, the basic structure of the EXE executable file format used by MS-DOS and some other operating systems.

The DOS EXE format is the foundation of the PE “Portable Executable” format still used by Microsoft Windows today (every PE file begins with a DOS EXE header), so this is also useful background to have before studying the PE format.

The EXE format is simple, but a little weird, because of the segmented memory architecture of DOS, and other historical reasons.

I’m not really an expert on the format. I’m just someone who wanted to classify and analyze some DOS EXE files. The various EXE format specifications I found on the internet seemed to make this more difficult than necessary. I also wanted to make a post like this, so that hypothetical future posts of mine can refer back to it.

I’m not going to cover the details of how DOS loads a program into memory, initializes, and executes it.

The terminology used here is not necessarily standard. I may have invented it, or borrowed it from other non-authoritative sources.

Diagram

How to calculate the file positions

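Most header fields are two-byte integers, stored with the least-significant byte first. The field names used below (offReloc, numReloc, and so on) are the ones labeled in the diagram.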

  • Start of custom-data-1 = end of DOS header = 28
  • Start of relocation table = offReloc
  • Start of custom-data-2 = end of relocation table = offReloc + 4×numReloc
  • Start of DOS code image = 16×pgHeader
  • Entry point = 16×pgHeader + 16×regCS + regIP
  • If lenFinal=0, end of DOS code image = 512×numBlocks
  • If lenFinal≠0, end of DOS code image = 512×(numBlocks−1) + lenFinal
  • Start of custom-data-3 = end of DOS code image
  • custom-data-3 ends at the end of the file.
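
The formulas above can be sketched in Python as follows. This is only a minimal sketch: the field names are the ones used in this post, while the byte offsets in the unpacking step (lenFinal at 2, numBlocks at 4, numReloc at 6, pgHeader at 8, regIP at 20, regCS at 22, offReloc at 24) and the function name exe_layout are assumptions based on the commonly documented 28-byte header layout.

```python
import struct

def exe_layout(data: bytes) -> dict:
    """Return the file positions listed above, computed from the DOS header."""
    if len(data) < 28 or data[0:2] != b"MZ":
        raise ValueError("not a DOS EXE file")
    # Fourteen little-endian 16-bit fields; regCS is read as signed ("h").
    (_sig, lenFinal, numBlocks, numReloc, pgHeader, _minAlloc, _maxAlloc,
     _regSS, _regSP, _checksum, regIP, regCS, offReloc,
     _overlay) = struct.unpack_from("<11HhHH", data, 0)
    if lenFinal == 0:
        code_end = 512 * numBlocks
    else:
        code_end = 512 * (numBlocks - 1) + lenFinal
    return {
        "custom1_start": 28,
        "reloc_start": offReloc,
        "custom2_start": offReloc + 4 * numReloc,
        "code_start": 16 * pgHeader,
        "entry_point": 16 * pgHeader + 16 * regCS + regIP,
        "code_end": code_end,  # custom-data-3 runs from here to end of file
    }
```

Calling exe_layout(open(path, "rb").read()) on a file gives a dictionary of offsets that can be compared against the diagram. The sketch does no sanity checking beyond the signature, so a damaged header can yield positions past the end of the file.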

Descriptions of the segments and file positions

DOS header

This is the standard 28-byte DOS header.

Custom-data-1

This segment may contain signatures or other data specific to the linker, or to whatever program generated the EXE file.

Relocation table

Also called the “fixups” table, this is a list of places in the “code image” that need to be adjusted based on the memory location at which the program is loaded.

You don’t really need to know what the relocation table does; just know that it exists, and is a fundamental part of the EXE format.

The relocation table may be empty, in which case its location seemingly shouldn’t matter. But there are situations where it may matter, and it affects the format layout as I’m presenting it here. So, an empty relocation table may have to be dealt with as a special case.
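
For readers who do want to look inside the table, here is a rough Python sketch that lists the file position of each fixup. It assumes the commonly documented entry layout (a 2-byte offset followed by a 2-byte segment, both relative to the start of the code image) and the usual header offsets for numReloc (6), pgHeader (8), and offReloc (24). The function name relocation_targets is made up for this example.

```python
import struct

def relocation_targets(data: bytes) -> list:
    """Return the file positions of the words the relocation table points at."""
    numReloc, pgHeader = struct.unpack_from("<HH", data, 6)
    (offReloc,) = struct.unpack_from("<H", data, 24)
    targets = []
    for i in range(numReloc):
        # Each 4-byte entry: offset first, then segment (assumed layout).
        off, seg = struct.unpack_from("<HH", data, offReloc + 4 * i)
        targets.append(16 * pgHeader + 16 * seg + off)
    return targets
```

An empty table simply produces an empty list, whatever value offReloc has.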

Custom-data-2

This segment is usually just padding, but you can put data here if you want to.

DOS code image

The “image” segment is where the bulk of the executable code is found.

The format forces the segment to start at a multiple of 16 bytes, though it can end anywhere.

In practice, it often starts at a multiple of 512 bytes. I assume that’s to align it with the start of a floppy disk sector, though I don’t know the benefit of doing that.

Entry point

What I’m calling the “entry point” is the place in the file containing the first machine code instructions to be executed. Interesting things might be found shortly after, or before, this position. Sometimes it’s right at the beginning of the DOS code image, and sometimes it’s not.

Take note that regCS might be a negative number, so it has to be read as a signed 16-bit integer. There’s a technical reason that you might want to make it negative, but that’s beyond the scope of this post.
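
As an illustration of the negative-regCS case, the sketch below reads regCS with a signed format code before applying the entry point formula. The header offsets (pgHeader at 8, regIP at 20, regCS at 22) are assumptions taken from the commonly documented layout, and entry_point is a hypothetical helper name.

```python
import struct

def entry_point(data: bytes) -> int:
    """File position of the first instruction: 16*pgHeader + 16*regCS + regIP."""
    (pgHeader,) = struct.unpack_from("<H", data, 8)
    # "H" reads regIP as unsigned, "h" reads regCS as signed 16-bit.
    regIP, regCS = struct.unpack_from("<Hh", data, 20)
    return 16 * pgHeader + 16 * regCS + regIP
```

For example, with pgHeader = 2, regCS stored as 0xFFF0 (that is, −16), and regIP = 0x100, the formula gives 32 − 256 + 256 = 32, which is the very start of the code image. Reading regCS as unsigned would instead give a position far beyond the end of most files.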

Custom-data-3

Often called “overlay” data, this segment is not loaded into memory by DOS, though the program might load some or all of it on demand at runtime.

For extended EXE formats, the entire program may be in this segment.

A note about extended EXE formats

For extended EXE formats (Windows “Portable Executable”, “New Executable”, and others), there is a 4-byte field at offset 60 that tells where most of the “real” file is. Look for a signature at the offset pointed to by the field at offset 60.

Most of the DOS-specific fields are then irrelevant, and are just used for a DOS “stub” program that does nothing but print an error message.

For extended formats, the relocation table offset is usually exactly 64, even if the table is empty, which it usually is. That makes it easier to identify extended formats as such. If for some reason you want to create a new DOS EXE file, be nice and make the relocation table offset an even number from 28 to 62.

I admit I don’t know exactly what algorithm operating systems use to decide whether a file is a DOS EXE file, or an extended EXE file.
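
Still, the check described above (follow the field at offset 60 and look for a signature there) is easy to sketch. The following Python is a heuristic only, and not a claim about what any operating system actually does. The signature list is an assumption: “PE\0\0” and “NE” correspond to the formats named above, and “LE” and “LX” are two of the “others”.

```python
import struct

# Signatures expected at the position named by the 4-byte field at offset 60.
EXTENDED_SIGNATURES = {
    b"PE\0\0": "Portable Executable",
    b"NE": "New Executable",
    b"LE": "Linear Executable (LE)",
    b"LX": "Linear Executable (LX)",
}

def extended_format(data: bytes):
    """Return the name of the extended format, or None if none is recognized."""
    if len(data) < 64 or data[0:2] != b"MZ":
        return None
    (pos,) = struct.unpack_from("<I", data, 60)
    for sig, name in EXTENDED_SIGNATURES.items():
        if data.startswith(sig, pos):
            return name
    return None
```

A plain DOS EXE may have arbitrary bytes at offset 60, so a None result is the normal outcome for such files. A stricter version could also require the relocation table offset to be 64 or more.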

A look at some EXE files

One small thing we could do with this information is to use it to calculate where custom-data-1 ends, which should make it easier to characterize a potential signature in that segment.

Let’s take a quick look at the custom data in some random DOS EXE files. The files I’ll look at are all from the Simtel MSDOS 1997-09 collection, CD #1.

I’ll do a hex dump of the first few bytes of the custom-data-1 segment. Exception: If the relocation table is empty, and its offset is 28 (or less), then I’ll use the custom-data-2 segment instead.
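
A dump in roughly this style can be produced with the Python sketch below. It is not the tool used to make the listings in this post; the 14-byte limit, the column widths, and the function names are guesses based on the listings themselves, and the header offsets (numReloc at 6, pgHeader at 8, offReloc at 24) are assumed from the commonly documented layout. In the exception case it applies the custom-data-2 formula literally, so the segment starts at offReloc even when that is below 28.

```python
import struct

def signature_bytes(data: bytes, limit: int = 14) -> bytes:
    """Pick custom-data-1, or custom-data-2 in the empty-table exception case."""
    numReloc, pgHeader = struct.unpack_from("<HH", data, 6)
    (offReloc,) = struct.unpack_from("<H", data, 24)
    if numReloc == 0 and offReloc <= 28:
        start, end = offReloc, 16 * pgHeader   # custom-data-2
    else:
        start, end = 28, offReloc              # custom-data-1
    return data[start:min(end, start + limit)]

def dump_line(seg: bytes, name: str) -> str:
    """Format one line: hex bytes, then printable ASCII between > and <."""
    hexpart = " ".join(f"{b:02x}" for b in seg)
    text = "".join(chr(b) if 0x20 <= b <= 0x7E else "." for b in seg)
    quoted = ">" + text + "<"
    return f"{hexpart:<41} {quoted:<16} {name}"
```

Typical use would be print(dump_line(signature_bytes(data), filename)) for each file in a directory.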

Of course, many files have no custom-data-1, or just zeroes:

                                          ><               ACP.EXE
                                          ><               AM.EXE
                                          ><               ARCHIVE.EXE
00 00                                     >..<             1ST.EXE
00 00 00 00                               >....<           BSM.EXE
00 00 00 00                               >....<           CDC.EXE
00 00 00 00 00 00 00 00 00 00 00 00 00 00 >..............< CHECKDRV.EXE
00 00 00 00 00 00 00 00 00 00 00 00 00 00 >..............< CHOOPT.EXE

A “jr” or “rj” signature is apparently from one or more popular linkers (Borland?):

01 00 fb 20 72 6a                         >... rj<         23.EXE
01 00 fb 50 6a 72 00 00 00 00 00 00 00 00 >...Pjr........< C.EXE
01 00 fb 30 6a 72 00 00 00 00 00 00 00 00 >...0jr........< C2L.EXE
01 00 fb 30 6a 72 00 00 00 00 00 00 00 00 >...0jr........< CHOP.EXE
01 00 fb 71 6a 72 00 00 00 00 00 00 00 00 >...qjr........< CREATE.EXE
01 00 fb 20 72 6a                         >... rj<         CUSTOMC.EXE

Here’s a commonly seen bit of text in DOS programs from this era:

75 62 2e 68 20 67 65 6e 65 72 61 74 65 64 >ub.h generated< CARD.EXE
75 62 2e 68 20 67 65 6e 65 72 61 74 65 64 >ub.h generated< DMGRAPH.EXE
75 62 2e 68 20 67 65 6e 65 72 61 74 65 64 >ub.h generated< FED.EXE

It’s from a DOS extender, part of a longer message that starts “stub.h generated from stub.asm by djasm”. You may ask why the initial “st” is missing from the hex dump. It’s because some EXE files are constructed as if the DOS header were only 26 bytes in size, instead of 28. This is possible in some cases, because the last two bytes of the 28-byte DOS header aren’t used by DOS for anything important.

Some compressed executable formats are identifiable:

57 57 50 20                               >WWP <           4DIZ.EXE
64 69 65 74                               >diet<           4EDIT.EXE
4c 5a 39 31                               >LZ91<           ACE.EXE
74 7a c3 00                               >tz..<           AK.EXE
78 70 61 63                               >xpac<           ATAINF.EXE
4c 5a 30 39                               >LZ09<           CCMREG.EXE
55 43 32 58                               >UC2X<           UPDATEDB.EXE
03 21 50 4b 4c 49 54 45 20 43 6f 70 72 2e >.!PKLITE Copr.< AAPLAY.EXE
01 22 50 4b 6c 69 74 65 28 52 29 20 43 6f >."PKlite(R) Co< ATPLITE.EXE
0c 31 50 4b 4c 49 54 45 20 43 6f 70 72 2e >.1PKLITE Copr.< GDS.EXE
0f 01 50 4b 4c 49 54 45 20 43 6f 70 72 2e >..PKLITE Copr.< HB.EXE

“WWP” is for WWPACK, “diet” is from DIET, “LZ09” and “LZ91” are from LZEXE, “tz” is from TinyProg, “xpac” is from XPACK, and “UC2X” is from UCEXE. Examples from several versions of PKLITE are also shown.
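
The signatures listed above can be turned into a small lookup table. The Python sketch below matches only the byte patterns visible in these dumps; the position of the PKLITE text (2 bytes into the segment) is read off the examples rather than taken from any specification, and guess_packer is a hypothetical name.

```python
# (position within the custom-data segment, bytes to match, tool name)
PACKER_SIGNATURES = [
    (0, b"WWP ", "WWPACK"),
    (0, b"diet", "DIET"),
    (0, b"LZ09", "LZEXE"),
    (0, b"LZ91", "LZEXE"),
    (0, b"tz", "TinyProg"),
    (0, b"xpac", "XPACK"),
    (0, b"UC2X", "UCEXE"),
    (2, b"PKLITE", "PKLITE"),
    (2, b"PKlite", "PKLITE"),
]

def guess_packer(custom: bytes):
    """Return a tool name if the segment starts with a known signature."""
    for pos, sig, name in PACKER_SIGNATURES:
        if custom.startswith(sig, pos):
            return name
    return None
```

Short signatures such as “tz” are weak evidence on their own, so a match here is best treated as a hint rather than an identification.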

Something to be aware of is that a lot of developers didn’t want to make it obvious how their files were compressed, so they tampered with the signature. Just because a file does not have an LZEXE signature, for example, does not mean it wasn’t compressed with LZEXE.

There are also plenty of special EXE formats that, even if they haven’t been tampered with, can’t be identified by a characteristic signature at any fixed file position — you must examine the file at some non-fixed landmark position. For example, probably the most common compressed DOS EXE format was EXEPACK, and its most distinctive signature, “RB”, appears two bytes before the entry point. It’s cases like this where it’s most helpful to know how an EXE file is organized.
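
As a sketch of that EXEPACK check, the Python below computes the entry point and looks at the two bytes just before it. The header offsets (pgHeader at 8, regIP at 20, regCS at 22) are assumptions from the commonly documented layout, the function name is made up, and a match is only suggestive, since two bytes can easily occur by chance.

```python
import struct

def looks_like_exepack(data: bytes) -> bool:
    """True if the bytes "RB" appear immediately before the entry point."""
    if len(data) < 28:
        return False
    (pgHeader,) = struct.unpack_from("<H", data, 8)
    regIP, regCS = struct.unpack_from("<Hh", data, 20)  # regCS is signed
    entry = 16 * pgHeader + 16 * regCS + regIP
    return entry >= 2 and data[entry - 2:entry] == b"RB"
```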

2 thoughts on “A quick overview of DOS EXE file format”

  1. I was surprised to learn that DOS .exe files can also start with “ZM” instead of “MZ”. (The Wikipedia article mentions this). Since you’ve been sampling exe files, did you ever come across a file with this alternative signature? I wonder if there’s an interesting reason why this happened.
