This is a brief introduction to, and a diagram of, the basic structure of the EXE executable file format used by MS-DOS and some other operating systems.
DOS EXE is a subset of the PE “Portable Executable” format still used by Microsoft Windows today, so this also constitutes preliminary information that could be helpful to know before studying PE format.
The EXE format is simple, but a little weird, because of the segmented memory architecture of DOS, and other historical reasons.
I’m not really an expert on the format. I’m just someone who wanted to classify and analyze some DOS EXE files. The various EXE format specifications I found on the internet seemed to make this more difficult than necessary. I also wanted to make a post like this, so that hypothetical future posts of mine can refer back to it.
I’m not going to cover the details of how DOS loads a program into memory, initializes it, and executes it.
The terminology used here is not necessarily standard. I may have invented it, or borrowed it from other non-authoritative sources.
How to calculate the file positions
Most fields are two-byte integers, with the least-significant byte first.
- Start of custom-data-1 = end of DOS header = 28
- Start of relocation table = offReloc
- Start of custom-data-2 = end of relocation table = offReloc + 4×numReloc
- Start of DOS code image = 16×pgHeader
- entry point = 16×pgHeader + 16×regCS + regIP
- If lenFinal=0, end of DOS code image = 512×numBlocks
- If lenFinal≠0, end of DOS code image = 512×(numBlocks−1) + lenFinal
- Start of custom-data-3 = end of DOS code image
- custom-data-3 ends at the end of the file.
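The formulas above can be sketched in Python roughly as follows. This is a minimal sketch, not a definitive parser: the mapping of the field names (lenFinal, numBlocks, numReloc, pgHeader, regIP, regCS, offReloc) to byte offsets is my assumption based on the usual 28-byte header layout, and the function name exe_layout and the dictionary keys are invented for illustration.

```python
import struct

def exe_layout(data: bytes) -> dict:
    """Compute the file positions listed above for a DOS EXE held in `data`."""
    if len(data) < 28 or data[:2] != b"MZ":
        raise ValueError("no 28-byte DOS header with the usual MZ signature")
    # Assumed layout: a 2-byte signature followed by thirteen little-endian
    # two-byte fields. regCS is unpacked as signed ("h") because it may be negative.
    (_sig, lenFinal, numBlocks, numReloc, pgHeader, _minalloc, _maxalloc,
     _regSS, _regSP, _checksum, regIP, regCS, offReloc, _overlay) = \
        struct.unpack_from("<2s10HhHH", data, 0)

    code_start = 16 * pgHeader
    if lenFinal == 0:
        code_end = 512 * numBlocks
    else:
        code_end = 512 * (numBlocks - 1) + lenFinal

    return {
        "custom1_start": 28,
        "reloc_start": offReloc,
        "custom2_start": offReloc + 4 * numReloc,
        "code_start": code_start,
        "entry_point": code_start + 16 * regCS + regIP,
        "code_end": code_end,
        "custom3_start": code_end,
        "file_end": len(data),
    }
```

The sketch does no sanity checking beyond the signature, so a damaged or unusual header can produce positions that lie outside the file or overlap each other.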
Descriptions of the segments and file positions
DOS header
This is the standard 28-byte DOS header.
custom-data-1
This segment may contain signatures or other data specific to the linker, or to whatever program generated the EXE file.
Relocation table
Also called the “fixups” table, this is a list of places in the “code image” that need to be adjusted based on the memory location at which the program is loaded.
You don’t really need to know what the relocation table does; just know that it exists, and is a fundamental part of the EXE format.
The relocation table may be empty, in which case its location seemingly shouldn’t matter. But there are situations where it may matter, and it affects the format layout as I’m presenting it here. So, an empty relocation table may have to be dealt with as a special case.
custom-data-2
This segment is usually just padding, but you can put data here if you want to.
DOS code image
The “image” segment is where the bulk of the executable code is found.
The format forces the segment to start at a multiple of 16 bytes, though it can end anywhere.
In practice, it often starts at a multiple of 512 bytes. I assume that’s to align it with the start of a floppy disk sector, though I don’t know the benefit of doing that.
Entry point
What I’m calling the “entry point” is the place in the file containing the first machine code instructions to be executed. Interesting things might be found shortly after, or before, this position. Sometimes it’s right at the beginning of the DOS code image, and sometimes it’s not.
Take note that regCS might be a negative number. There’s a technical reason that you might want to make it negative, but that’s beyond the scope of this post.
custom-data-3
Often called “overlay” data, this segment is not loaded into memory by DOS, though the program might load some or all of it on demand at runtime.
For extended EXE formats, the entire program may be in this segment.
A note about extended EXE formats
For extended EXE formats (Windows “Portable Executable”, “New Executable”, and others), there is a 4-byte field at offset 60 that tells where most of the “real” file is. Look for a signature at the offset pointed to by the field at offset 60.
Most of the DOS-specific fields are then irrelevant, and are just used for a DOS “stub” program that does nothing but print an error message.
For extended formats, the relocation table offset is usually exactly 64, even if the table is empty, which it usually is. That makes it easier to identify extended formats as such. If for some reason you want to create a new DOS EXE file, be nice and make the relocation table offset an even number from 28 to 62.
I admit I don’t know exactly what algorithm operating systems use to decide whether a file is a DOS EXE file, or an extended EXE file.
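The check at offset 60 described above could be sketched like this in Python. It is only a heuristic, not the algorithm any operating system actually uses. The signature table is my own short list (not exhaustive), and the name find_extended_format is invented for illustration.

```python
import struct

# A few extended-format signatures; an illustrative, non-exhaustive list.
EXTENDED_SIGNATURES = {
    b"PE\0\0": "Portable Executable",
    b"NE": "New Executable",
    b"LE": "Linear Executable",
    b"LX": "Linear Executable (OS/2)",
}

def find_extended_format(data: bytes):
    """Return (format name, offset) if a known signature is found, else None."""
    if len(data) < 64 or data[:2] != b"MZ":
        return None
    # The 4-byte little-endian field at offset 60 points at the extended header.
    (ext_offset,) = struct.unpack_from("<I", data, 60)
    for signature, name in EXTENDED_SIGNATURES.items():
        if data[ext_offset:ext_offset + len(signature)] == signature:
            return name, ext_offset
    return None
```

A plain DOS EXE can have arbitrary bytes at offset 60, so a None result is the common case there, and a match is a hint rather than proof.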
A look at some EXE files
One small thing we could do with this information is to use it to calculate where custom-data-1 ends, which should make it easier to characterize a potential signature in that segment.
Let’s take a quick look at the custom data in some random DOS EXE files. The files I’ll look at are all from the Simtel MSDOS 1997-09 collection, CD #1.
I’ll do a hex dump of the first few bytes of the custom-data-1 segment. Exception: If the relocation table is empty, and its offset is 28 (or less), then I’ll use the custom-data-2 segment instead.
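For reference, here is a rough Python sketch of how such a dump could be produced. It is not the tool actually used for the listings in this post. The header offsets for numReloc, pgHeader and offReloc are assumed from the usual 28-byte layout, the 14-byte limit and the function name custom_data_preview are made up, and clamping the start to 28 when the offset is less than 28 is my own guess at sensible behavior.

```python
import struct

def custom_data_preview(data: bytes, max_bytes: int = 14) -> str:
    """Format the first bytes of custom-data-1 (or custom-data-2) as 'hex >text<'."""
    numReloc, pgHeader = struct.unpack_from("<HH", data, 6)
    (offReloc,) = struct.unpack_from("<H", data, 24)
    if numReloc == 0 and offReloc <= 28:
        # Exception case: empty relocation table at (or before) the end of the
        # header, so look at custom-data-2. Assumption: never start before 28.
        start, end = max(offReloc, 28), 16 * pgHeader
    else:
        # Normal case: custom-data-1 runs from the end of the header to the table.
        start, end = 28, offReloc
    end = min(end, start + max_bytes, len(data))
    chunk = data[start:end] if end > start else b""
    hex_part = " ".join(f"{b:02x}" for b in chunk)
    text_part = "".join(chr(b) if 0x20 <= b <= 0x7E else "." for b in chunk)
    return f"{hex_part} >{text_part}<".lstrip()
```

An empty segment comes out as just “><”.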
Of course, many files have no custom-data-1, or just zeroes:
>< ACP.EXE
>< AM.EXE
>< ARCHIVE.EXE
00 00 >..< 1ST.EXE
00 00 00 00 >....< BSM.EXE
00 00 00 00 >....< CDC.EXE
00 00 00 00 00 00 00 00 00 00 00 00 00 00 >..............< CHECKDRV.EXE
00 00 00 00 00 00 00 00 00 00 00 00 00 00 >..............< CHOOPT.EXE
A “jr” or “rj” signature is apparently from one or more popular linkers (Borland?):
01 00 fb 20 72 6a >... rj< 23.EXE
01 00 fb 50 6a 72 00 00 00 00 00 00 00 00 >...Pjr........< C.EXE
01 00 fb 30 6a 72 00 00 00 00 00 00 00 00 >...0jr........< C2L.EXE
01 00 fb 30 6a 72 00 00 00 00 00 00 00 00 >...0jr........< CHOP.EXE
01 00 fb 71 6a 72 00 00 00 00 00 00 00 00 >...qjr........< CREATE.EXE
01 00 fb 20 72 6a >... rj< CUSTOMC.EXE
Here’s a commonly seen bit of text in DOS programs from this era:
75 62 2e 68 20 67 65 6e 65 72 61 74 65 64 >ub.h generated< CARD.EXE
75 62 2e 68 20 67 65 6e 65 72 61 74 65 64 >ub.h generated< DMGRAPH.EXE
75 62 2e 68 20 67 65 6e 65 72 61 74 65 64 >ub.h generated< FED.EXE
It’s from a DOS extender, part of a longer message that starts “stub.h generated from stub.asm by djasm”. You may ask why the initial “st” is missing from the hex dump. It’s because some EXE files are constructed as if the DOS header were only 26 bytes in size, instead of 28. This is possible in some cases, because the last two bytes of the 28-byte DOS header aren’t used by DOS for anything important.
Some compressed executable formats are identifiable:
57 57 50 20 >WWP < 4DIZ.EXE
64 69 65 74 >diet< 4EDIT.EXE
4c 5a 39 31 >LZ91< ACE.EXE
74 7a c3 00 >tz..< AK.EXE
78 70 61 63 >xpac< ATAINF.EXE
4c 5a 30 39 >LZ09< CCMREG.EXE
55 43 32 58 >UC2X< UPDATEDB.EXE
03 21 50 4b 4c 49 54 45 20 43 6f 70 72 2e >.!PKLITE Copr.< AAPLAY.EXE
01 22 50 4b 6c 69 74 65 28 52 29 20 43 6f >."PKlite(R) Co< ATPLITE.EXE
0c 31 50 4b 4c 49 54 45 20 43 6f 70 72 2e >.1PKLITE Copr.< GDS.EXE
0f 01 50 4b 4c 49 54 45 20 43 6f 70 72 2e >..PKLITE Copr.< HB.EXE
“WWP” is for WWPACK, “diet” is from DIET, “LZ09” and “LZ91” are from LZEXE, “tz” is from TinyProg, “xpac” is from XPACK, and “UC2X” is from UCEXE. Examples from several versions of PKLITE are also shown.
Something to be aware of is that a lot of developers didn’t want to make it obvious how their files were compressed, so they tampered with the signature. Just because a file does not have an LZEXE signature, for example, does not mean it wasn’t compressed with LZEXE.
There are also plenty of special EXE formats that, even if they haven’t been tampered with, can’t be identified by a characteristic signature at any fixed file position — you must examine the file at some non-fixed landmark position. For example, probably the most common compressed DOS EXE format was EXEPACK, and its most distinctive signature, “RB”, appears two bytes before the entry point. It’s cases like this where it’s most helpful to know how an EXE file is organized.
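As a small illustration of the EXEPACK case, the following Python sketch computes the entry point from the header and looks for “RB” in the two bytes just before it. The header offsets for pgHeader, regIP and regCS are assumed from the usual 28-byte layout, the function name looks_like_exepack is invented, and a match should be treated as a hint rather than a reliable identification.

```python
import struct

def looks_like_exepack(data: bytes) -> bool:
    """Heuristic: True if the bytes 'RB' sit two bytes before the entry point."""
    if len(data) < 28 or data[:2] != b"MZ":
        return False
    (pgHeader,) = struct.unpack_from("<H", data, 8)
    # regCS is read as signed, since it might be negative.
    regIP, regCS = struct.unpack_from("<Hh", data, 20)
    entry_point = 16 * pgHeader + 16 * regCS + regIP
    if entry_point < 2:
        return False
    return data[entry_point - 2:entry_point] == b"RB"
```

Since signatures are sometimes tampered with, a False result does not rule out EXEPACK.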