Summary of some Win32 I/O character encoding behavior

This post is about programming a Windows Win32 application, mainly one that uses the console (command line). It summarizes the results of some tests I performed.

Maybe you ported a Unix utility to Windows, but you find that it doesn’t work with filenames that contain Japanese characters. This information may help, though specific recommendations will have to wait for a followup post.

Notes:

  • “Win32” means both 32-bit and 64-bit. (There’s no API named “Win64”.)
  • I used the C language, and a Microsoft C library, to do my testing. It should be the same with C++, but not necessarily other languages.
  • I expect this information to apply to all versions of Windows that support Unicode and Win32, at least XP and newer. I haven’t extensively tested it, though.
  • As I write this, Microsoft says they are soon going to update the Windows 10 console, giving it better support for colors, emoji, and other things. I don’t think that will affect the correctness of this post, but it means you may have more options.
  • Even without this update, there may well be relevant console features that I don’t know about.
  • There are lots of variables involved that might affect the results, including your development system, the user’s version of Windows, and the user’s language settings. I probably don’t know all the dependencies.

The input methods I will consider:

  • Command-line arguments, typed directly by the user into a console.
  • Command-line arguments, when the command is run from a batch (.BAT) file.

The output methods I will consider:

  • Printing to standard output, where standard output is a console.
  • Printing to standard output, where standard output is redirected to a file.

Explanations of things in the tables below:

  • This may depend on your compiler, but you generally have the option of implementing either a main() function, or a wmain() function. With wmain(), you get Unicode versions of the command-line arguments. I don’t know any way to do that if you use main(). It would be nice if you could get the Unicode arguments from the special __wargv global variable, but __wargv doesn’t work unless you implement wmain(), defeating what seems like its only possible purpose.
  • A Windows user has two default legacy “code page” settings: An “OEM” encoding (from the MS-DOS era), and an “ANSI” encoding (from the pre-Unicode Windows era). For English-language users, the OEM encoding is usually Code Page 437, and the ANSI encoding is usually Windows-1252. You can view and change a console window’s OEM encoding using the “chcp” command.
  • By “full Unicode support”, I don’t really mean all of Unicode. I just mean you’re not limited to the characters present in legacy code pages. These functions might still not work fully for characters over U+FFFF, characters not present in the user’s font, combining characters, or any other characters with unusual properties.
  • Unlike a Unix terminal, a Windows console is a native Unicode device. You write Unicode characters to it, not raw bytes. Any API function like printf() that seems to write bytes to it must be converting those bytes to Unicode characters behind the scenes.
  • “printf” stands for a whole family of output functions, including puts() and others. Similarly, “wprintf” stands for the whole family of “wide char” output functions.
  • The _setmode() function can change the behavior of subsequent calls to the wprintf() family of functions. An example of its use is “_setmode(_fileno(stdout), _O_U8TEXT)”.
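
Since the console is a native Unicode device, you can also bypass the C library entirely and hand it UTF-16 directly via the Win32 WriteConsoleW() function. Here is a minimal sketch (Windows-only; the helper name write_console_w is my own):

```c
#include <windows.h>
#include <wchar.h>

/* Write a UTF-16 string straight to the console, bypassing the C
   library's byte-oriented conversions. This only works when the
   handle really is a console; for redirected output you would have
   to fall back to WriteFile() with some byte encoding. */
static void write_console_w(const wchar_t *s)
{
    HANDLE h = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD mode, written;

    /* GetConsoleMode() fails if h is not a console handle, so it
       doubles as an "is this a console?" test. */
    if (h != INVALID_HANDLE_VALUE && GetConsoleMode(h, &mode)) {
        WriteConsoleW(h, s, (DWORD)wcslen(s), &written, NULL);
    }
}
```

Mixing WriteConsoleW() with stdio output is possible but risky, since the two paths buffer independently.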

Input:

  Source                Function  Result
  --------------------  --------  ------
  argv from console     main()    Args are converted (from Unicode) to ANSI.
  argv from console     wmain()   Args are encoded in UTF-16; full Unicode support.
  argv from batch file  main()    Args are interpreted as OEM (in the batch file), and converted to ANSI. Only characters common to both encodings will work.
  argv from batch file  wmain()   Args are interpreted as OEM (in the batch file), and converted to UTF-16.
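
The wmain() rows can be exercised with a minimal sketch like the following (Windows-only, MSVC-style CRT assumed). It receives the arguments as UTF-16 and echoes them back:

```c
#include <fcntl.h>
#include <io.h>
#include <stdio.h>

/* wmain() receives command-line arguments as UTF-16 (wchar_t),
   giving full Unicode support, unlike main()'s ANSI-converted argv. */
int wmain(int argc, wchar_t **argv)
{
    /* Put stdout into a Unicode mode so wprintf() can emit
       characters outside the legacy code pages. */
    _setmode(_fileno(stdout), _O_U16TEXT);

    for (int i = 0; i < argc; i++) {
        wprintf(L"arg %d: %ls\n", i, argv[i]);
    }
    return 0;
}
```

With MSVC this links as-is; some other toolchains (e.g. MinGW) need an extra flag or a stub to use wmain() as the entry point.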

Output:

  Destination                 Mode                   Result
  --------------------------  ---------------------  ------
  printf to console           —                      Caller must encode text in OEM.
  wprintf to console          no _setmode            Unicode codepoints up to 0xff are reinterpreted (misinterpreted) as OEM. Output for each wprintf stops when a codepoint over 0xff is encountered.
  wprintf to console          _setmode(_O_BINARY)    Encodes in UTF-16LE, then reinterprets (misinterprets) each codepoint as two bytes of OEM text. Useless.
  wprintf to console          _setmode(_O_U16TEXT)   Caller supplies text in UTF-16; full Unicode support.
  wprintf to console          _setmode(_O_U8TEXT)    Caller supplies text in UTF-16; full Unicode support.
  printf redirected to file   —                      Encoding-agnostic; raw bytes are written to the file, except for possible EOL conversion.
  wprintf redirected to file  no _setmode            Codepoints up to 0xff are written to the file, one byte per codepoint, with no conversion. Output for each wprintf stops when a codepoint over 0xff is encountered.
  wprintf redirected to file  _setmode(_O_BINARY)    Caller supplies text in UTF-16, written to the file as UTF-16LE.
  wprintf redirected to file  _setmode(_O_U16TEXT)   Caller supplies text in UTF-16, written to the file as UTF-16LE.
  wprintf redirected to file  _setmode(_O_U8TEXT)    Caller supplies text in UTF-16, which is converted and written to the file as UTF-8.
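
The last row is often the most useful one for porting Unix-style utilities: the program works in UTF-16 internally, and redirected output comes out as UTF-8. A minimal sketch (Windows-only, MSVC-style CRT assumed):

```c
#include <fcntl.h>
#include <io.h>
#include <stdio.h>

int main(void)
{
    /* In _O_U8TEXT mode the caller supplies UTF-16, and the C
       library converts it to UTF-8 on the way out. With stdout
       redirected to a file, the file receives UTF-8 bytes. */
    _setmode(_fileno(stdout), _O_U8TEXT);

    wprintf(L"Japanese: \x65e5\x672c\x8a9e\n");  /* 日本語 */
    return 0;
}
```

Note that once stdout is in a Unicode mode, subsequent calls to the narrow printf() family on that stream are invalid (the debug CRT asserts), so pick one family per stream.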
