This post concerns the TIFF image file format, and some related formats: Exif, DNG, and JPEG XR.
A TIFF file contains a set of “fields”, each of which contains some sort of data element. Each field has a “tag”: a code number that tells, logically, what is stored in the field. For example, tag #256 is the width of the image. Each field also has a “type”, which tells the low level details of how its data is stored. For example, type #3 is a 16-bit unsigned integer. This post is about type #2, “ASCII”.
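To make the tag/type idea concrete, here is a minimal Python sketch that unpacks one directory entry from a classic (non-BigTIFF) file. It assumes the usual 12-byte layout: a 2-byte tag, a 2-byte type, a 4-byte count, and 4 bytes holding either the data itself or an offset to it. The names `IfdEntry` and `parse_ifd_entry` are made up for this illustration, and the byte order is assumed to be already known from the file header.

```python
import struct
from typing import NamedTuple

TYPE_ASCII = 2  # the type code this post is about


class IfdEntry(NamedTuple):
    """Hypothetical container for one field's directory entry."""
    tag: int                # what the field means, e.g. 256 = image width
    type: int               # how the data is stored, e.g. 2 = ASCII
    count: int              # number of values; for ASCII, the number of bytes
    value_or_offset: bytes  # 4 raw bytes: the data if it fits, else an offset


def parse_ifd_entry(raw: bytes, little_endian: bool = True) -> IfdEntry:
    """Unpack one 12-byte directory entry (classic TIFF layout assumed)."""
    if len(raw) != 12:
        raise ValueError("expected exactly 12 bytes")
    prefix = "<" if little_endian else ">"
    tag, type_code, count = struct.unpack(prefix + "HHI", raw[:8])
    # The last 4 bytes are left uninterpreted on purpose: whether they are
    # data or an offset depends on the type and the count.
    return IfdEntry(tag, type_code, count, raw[8:12])
```

For an ASCII field, `count` includes the NUL bytes, which is why the interpretation questions below matter.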
ASCII fields are used for text strings. Each one contains an array of bytes, which will have to be interpreted somehow. This interpretation is a bit more complex than it seems.
The TIFF v6 specification defines type “ASCII” as follows. For technical reasons, it’s defined mainly in terms of an individual byte; but the format of the field as a whole also needs to be explained.
2=ASCII: 8-bit byte that contains a 7-bit ASCII code; the last byte must be NUL (binary zero).
(I’ll use the term “NUL” to mean a byte whose value is zero. Some specifications use “NULL”, or “zero”, or “0-valued”, etc.)
A little later, it adds a remark that I suspect is often overlooked:
Any ASCII field can contain multiple strings, each terminated with a NUL.
Without the multi-string remark, we might get some different ideas. For example, we might think that a decoded string can contain internal NUL bytes.
It is not clear to me what “any ASCII field” is supposed to mean. Does it mean literally any ASCII field regardless of tag number, or does it mean only if that particular tag number is specified as allowing multiple strings? If the latter, then single-string fields could still contain internal NUL bytes.
I think the only standard TIFF v6 field for which multiple strings have a defined meaning is InkNames (tag #333).
In practice, I have seen ASCII fields with a length of (say) 20 bytes, in which only the first five bytes are not NUL. I have even seen fields in which the sixth byte is NUL, and the bytes after it are random garbage, or belong to some other field.
If you’re writing a TIFF decoder, the safest thing to do is to stop at the first NUL, unless the tag is explicitly defined to contain multiple strings. And, of course, you must be defensive, and account for the possibility of a field having no NUL bytes at all.
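That advice could be sketched in Python roughly as follows. The function name and the `multi_string` flag are my own invention for illustration; a real decoder would decide per tag whether multiple strings are allowed. Replacing non-ASCII bytes with U+FFFD is just one possible policy, chosen here so that the function never fails on bad input.

```python
def decode_ascii_field(data: bytes, multi_string: bool = False) -> list[str]:
    """Decode the raw bytes of a TIFF ASCII field into a list of strings.

    multi_string=False: return one string, ending at the first NUL.
    multi_string=True:  split on NUL, for tags defined to hold several strings.
    """
    if not multi_string:
        # Stop at the first NUL. If the field has no NUL at all,
        # fall back to using the whole field.
        end = data.find(b"\x00")
        if end == -1:
            end = len(data)
        chunks = [data[:end]]
    else:
        # Drop the terminator (and any NUL padding after it), then split
        # on the NULs that separate the individual strings.
        body = data.rstrip(b"\x00")
        chunks = body.split(b"\x00") if body else []
    # Bytes >= 0x80 are not valid ASCII; replace them instead of raising.
    return [chunk.decode("ascii", errors="replace") for chunk in chunks]
```

With this sketch, a 20-byte field that holds `hello`, a NUL, and then garbage decodes to just `hello`, and a field with no NUL at all is still decoded rather than read past its end.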
How would you encode a string with characters that don’t exist in ASCII? TIFF doesn’t seem to have a good solution. It would be nice if TIFF strings were encoded in UTF-8 instead of ASCII, but TIFF predates the invention of UTF-8 by a number of years, so this would have required a time machine.
It’s easy to think of some possible solutions, having varying degrees of backward-compatibility. But I’m not aware of any addendum to the TIFF standard that proposes a standard solution.
Some ways to handle non-ASCII strings will be mentioned in the following discussion of related formats.
Exif is a TIFF-structured metadata format commonly used in TIFF, JPEG, and many other formats.
Exif (v2.31) says of the ASCII data type:
2=ASCII: An 8-bit byte containing one 7-bit ASCII code. The final byte is terminated with NULL.
This is clumsily worded. It might be a bad translation from the original Japanese. I think we can assume the format is the same as TIFF, in that the final byte must be NUL.
The Exif specification does not have TIFF’s note about allowing multiple strings. However, the Exif “Copyright” field (tag #33432) does allow TIFF-style multiple strings. Copyright tag #33432 is also a standard TIFF v6 tag, but the TIFF specification does not say that it can contain multiple strings. For what it’s worth, I have never seen an Exif Copyright field containing more than one string.
There are some Exif fields that support Unicode in some way, notably UserComment (tag #37510), but they do not use the ASCII data type.
Microsoft Windows (Explorer) can add Exif metadata to JPEG and some other types of files. In some versions of Windows, you can do this by selecting a .jpg file on your computer, choosing Properties (e.g. from the right-click menu), then the Details tab. Microsoft invented some non-ASCII fields such as XPComment (tag #40092), which support Unicode. Windows will also sometimes put UTF-8 in Exif ASCII fields such as ImageDescription (tag #270), technically violating the Exif specification.
I have seen a few Exif ASCII fields containing non-ASCII characters that were probably not put there by Windows, so Windows is probably not the only software that does this. Some such fields use UTF-8 (which I’d say is defensible), but others use some other encoding (which I’d say is not).
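Given that, a tolerant reader might try UTF-8 first and only then fall back to something else. Here is a minimal sketch of that idea. The function name is hypothetical, and the Latin-1 fallback is purely a guess on my part: when a field is not valid UTF-8, there is no reliable way to know what encoding the writer actually used.

```python
def decode_exif_text(raw: bytes, fallback: str = "latin-1") -> str:
    """Decode an Exif ASCII field that may contain non-ASCII bytes."""
    # Keep only the bytes before the first NUL (or everything, if none).
    text = raw.split(b"\x00", 1)[0]
    try:
        # Pure ASCII is also valid UTF-8, so this covers the normal case.
        return text.decode("utf-8")
    except UnicodeDecodeError:
        # Not UTF-8: fall back to a guessed legacy encoding.
        return text.decode(fallback, errors="replace")
```

Trying UTF-8 first is reasonably safe, because text in most legacy 8-bit encodings rarely happens to be valid UTF-8 unless it is plain ASCII.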
DNG is an extension of TIFF, used for minimally-processed “raw” images from digital cameras.
The public DNG specification does not define what types like “ASCII” mean. I guess it inherits its type definitions from TIFF, or maybe Exif.
There are about 10 DNG string fields whose type is defined as “ASCII or BYTE [type #1]”, and which are described as being “Null terminated UTF-8 encoded Unicode string”. Specifically:
Reading between the lines, I’m guessing that if a string has any non-ASCII characters, its type must be set to BYTE (because it would not be valid ASCII). And if it doesn’t, its type can be either ASCII or BYTE (preferably ASCII?).
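If that guess is right, a DNG writer might choose the type along these lines. This is a sketch of my reading only, not something the DNG specification spells out, and `encode_dng_string` is a made-up name.

```python
TYPE_BYTE = 1
TYPE_ASCII = 2


def encode_dng_string(text: str) -> tuple[int, bytes]:
    """Return (type code, field bytes) for one of the ASCII-or-BYTE fields."""
    if "\x00" in text:
        raise ValueError("string must not contain NUL")
    # The field is described as NUL-terminated UTF-8 either way.
    data = text.encode("utf-8") + b"\x00"
    # Assumption: prefer ASCII when the text is pure ASCII, and use BYTE
    # otherwise, since the bytes would not be valid 7-bit ASCII.
    field_type = TYPE_ASCII if text.isascii() else TYPE_BYTE
    return field_type, data
```

Note that for pure ASCII text the bytes are identical under both types; only the type code differs.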
I’ve been looking for DNG files that take advantage of this feature, but I have yet to find a single one.
JPEG XR, also known as HD Photo, is a Microsoft image file format. It is clearly based on TIFF, though not compatible with it. (Yes, TIFF. Not JPEG. Don’t ask.) So, its designers were free to make whatever TIFF-incompatible changes they wanted. And they did, changing type 2 from “ASCII” to “UTF8”:
ELEMENT_TYPE 2 = “UTF8”: […] Each data element is interpreted as a UTF-8 character set code […], and the value of the last data element […] shall be equal to 0 (null). Any such field may contain multiple strings of UTF-8 characters, each terminated with a 0-valued character. […] There shall not be any two consecutive bytes equal to 0.
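The rules in that excerpt are simple enough to check mechanically. Here is a minimal Python sketch that enforces only what is quoted above (final 0, no two consecutive 0 bytes, valid UTF-8) and ignores whatever is in the elided parts. The function name is hypothetical.

```python
def decode_jxr_utf8_field(data: bytes) -> list[str]:
    """Validate and split a JPEG XR type-2 ("UTF8") field."""
    if not data or data[-1] != 0:
        raise ValueError("last data element must be 0")
    if b"\x00\x00" in data:
        raise ValueError("two consecutive 0 bytes are not allowed")
    # Drop the final terminator, then split on the remaining separators.
    # Strict decoding raises UnicodeDecodeError (a ValueError subclass)
    # if any string is not valid UTF-8.
    return [part.decode("utf-8") for part in data[:-1].split(b"\x00")]
```

Unlike the lenient approach that makes sense for TIFF, this version rejects malformed fields outright. A real decoder would probably want to be more forgiving.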
JPEG XR, like TIFF and Exif, defines tag #33432 to be a copyright notice, though it changes the name slightly to “COPYRIGHT_NOTICE”. This field allows Exif-style multiple strings, and is the only JPEG XR field I could find that is explicitly defined to do so.