Summary of some Win32 I/O character encoding behavior

This is the first of a series of posts.

This post is about programming a Windows Win32 application, mainly one that uses the console (command line). It summarizes the results of some tests I performed.

Maybe you ported a Unix utility to Windows, but you find that it doesn’t work with filenames that contain Japanese characters. This information may help, though specific recommendations will have to wait for a follow-up post.

Notes:

  • “Win32” means both 32-bit and 64-bit. (There’s no API named “Win64”.)
  • I used the C language, and a Microsoft C library, to do my testing. It should be the same with C++, but not necessarily with other languages.
  • I expect this information to apply to all versions of Windows that support Unicode and Win32, at least XP and newer. I haven’t extensively tested it, though.
  • As I write this, Microsoft says they are soon going to update the Windows 10 console, giving it better support for colors, emoji, and other things. I don’t think that will affect the correctness of this post, but it means you may have more options.
  • Even without this update, there may well be relevant console features that I don’t know about.
  • There are lots of variables involved that might affect the results, including your development system, the user’s version of Windows, and the user’s language settings. I probably don’t know all the dependencies.

The input methods I will consider:

  • Command-line arguments, typed directly by the user into a console.
  • Command-line arguments, when the command is run from a batch (.BAT) file.

The output methods I will consider:

  • Printing to standard output, where standard output is a console.
  • Printing to standard output, where standard output is redirected to a file.

Explanations of things in the tables below:

  • This may depend on your compiler, but you generally have the option of implementing either a main() function, or a wmain() function. With wmain(), you get Unicode versions of the command-line arguments. I don’t know any way to do that if you use main(). It would be nice if you could get the Unicode arguments from the special __wargv global variable, but __wargv doesn’t work unless you implement wmain(), defeating what seems like its only possible purpose.
  • A Windows user has two default legacy “code page” settings: An “OEM” encoding (from the MS-DOS era), and an “ANSI” encoding (from the pre-Unicode Windows era). For English-language users, the OEM encoding is usually Code Page 437, and the ANSI encoding is usually Windows-1252. You can view and change a console window’s OEM encoding using the “chcp” command.
  • By “full Unicode support”, I don’t really mean all of Unicode. I just mean you’re not limited to the characters present in legacy code pages. These functions might still not work fully for characters over U+FFFF, characters not present in the user’s font, combining characters, or any other characters with unusual properties.
  • Unlike a Unix terminal, a Windows console is a native Unicode device. You write Unicode characters to it, not raw bytes. Any API function like printf() that seems to be writing bytes to it must be converting those bytes to Unicode characters behind the scenes.
  • “printf” stands for a whole family of output functions, including puts() and others. Similarly, “wprintf” stands for the whole family of “wide char” output functions.
  • The _setmode() function can change the behavior of subsequent calls to the wprintf() family of functions. An example of its use is “_setmode(_fileno(stdout), _O_U8TEXT)”.
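
As a rough illustration of the wmain() and _setmode() points above, here is a minimal sketch. The helper names use_utf16_stdout() and print_args() are made up for this example, and the #ifdef guards are only there so the file still compiles with non-Windows compilers. I am assuming a Microsoft C library; with MinGW you may need the -municode option to get wmain().

```c
#include <assert.h>
#include <stdio.h>
#include <wchar.h>
#ifdef _WIN32
#include <fcntl.h>   /* _O_U16TEXT */
#include <io.h>      /* _setmode, _fileno */
#endif

/* Hypothetical helper: switch stdout to UTF-16 text mode.
   Returns 1 if the mode was changed, 0 otherwise (always 0 off Windows). */
int use_utf16_stdout(void)
{
#ifdef _WIN32
    return _setmode(_fileno(stdout), _O_U16TEXT) != -1;
#else
    return 0;
#endif
}

/* Hypothetical helper: print every argument after argv[0] on its own line,
   using the wide-char output family. Returns how many were printed. */
int print_args(int argc, wchar_t **argv)
{
    int printed = 0;
    assert(argv != NULL);
    for (int i = 1; i < argc; i++) {
        if (wprintf(L"%ls\n", argv[i]) >= 0)
            printed++;
    }
    return printed;
}

#ifdef _WIN32
/* With wmain(), the arguments arrive as UTF-16 wide strings. */
int wmain(int argc, wchar_t **argv)
{
    use_utf16_stdout();
    print_args(argc, argv);
    return 0;
}
#endif
```

Once the stream has been switched with _setmode(), it is probably safest to use only the wprintf family on it. As far as I know, calling printf() on the same stream afterwards can fail.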

Input:

Input | Source | Entry point | Result
argv | from console | main() | Args are converted (from Unicode) to ANSI.
argv | from console | wmain() | Args are encoded in UTF-16; full Unicode support.
argv | from batch file | main() | Args are interpreted as OEM (in the batch file), and converted to ANSI. Only characters common to both encodings will work.
argv | from batch file | wmain() | Args are interpreted as OEM (in the batch file), and converted to UTF-16.
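
To make the "only characters common to both encodings" case more concrete, here is a small model of the batch file plus main() path, assuming the common English-language defaults (Code Page 437 as OEM, Windows-1252 as ANSI). The function names are hypothetical, the tables cover only a handful of bytes, and this is not how Windows implements the conversion. In particular, a real conversion may substitute a look-alike character instead of "?".

```c
#include <assert.h>
#include <stddef.h>

/* Decode one CP437 (OEM) byte to a Unicode code point.
   Only a few high bytes are listed; anything else returns -1. */
long cp437_to_unicode(unsigned char b)
{
    if (b < 0x80) return b;        /* the ASCII range is shared */
    switch (b) {
    case 0x82: return 0x00E9;      /* e with acute accent */
    case 0x84: return 0x00E4;      /* a with diaeresis */
    case 0xC4: return 0x2500;      /* box-drawing horizontal line */
    case 0xE9: return 0x0398;      /* Greek capital theta */
    default:   return -1;          /* not covered by this sketch */
    }
}

/* Encode a code point as a Windows-1252 (ANSI) byte, or return -1.
   Only ASCII and U+00A0..U+00FF are handled here. */
int unicode_to_cp1252(long cp)
{
    if (cp >= 0 && cp < 0x80) return (int)cp;
    if (cp >= 0xA0 && cp <= 0xFF) return (int)cp;  /* same values as Latin-1 */
    return -1;
}

/* Model: OEM bytes -> Unicode -> ANSI bytes. Bytes that have no exact
   ANSI equivalent become '?'. Returns the number of bytes replaced. */
size_t oem_to_ansi(const unsigned char *in, size_t n, unsigned char *out)
{
    size_t lost = 0;
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        long cp = cp437_to_unicode(in[i]);
        int ansi = cp < 0 ? -1 : unicode_to_cp1252(cp);
        if (ansi < 0) {
            out[i] = '?';
            lost++;
        } else {
            out[i] = (unsigned char)ansi;
        }
    }
    return lost;
}
```

In this model, the OEM byte 0x82 survives (it becomes the ANSI byte 0xE9, the same accented letter), while 0xC4 and 0xE9 do not, because the box-drawing line and the Greek theta have no exact place in Windows-1252.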

Output:

Function | Destination | _setmode | Result
printf | to console | | Caller must encode text in OEM.
wprintf | to console | no _setmode | Unicode codepoints up to 0xff are reinterpreted (misinterpreted) as OEM. Output for each wprintf stops when a codepoint over 0xff is encountered.
wprintf | to console | _setmode(_O_BINARY) | Encodes in UTF-16LE, then reinterprets (misinterprets) each codepoint as two bytes of OEM text. Useless.
wprintf | to console | _setmode(_O_U16TEXT) | Caller supplies text in UTF-16; full Unicode support.
wprintf | to console | _setmode(_O_U8TEXT) | Caller supplies text in UTF-16; full Unicode support.
printf | redirected to file | | Encoding-agnostic, raw bytes are written to the file, except for possible EOL conversion.
wprintf | redirected to file | no _setmode | Codepoints up to 0xff are written to the file, with no conversion, one byte per codepoint. Output for each wprintf stops when a codepoint over 0xff is encountered.
wprintf | redirected to file | _setmode(_O_BINARY) | Caller supplies text in UTF-16, written to the file as UTF-16LE.
wprintf | redirected to file | _setmode(_O_U16TEXT) | Caller supplies text in UTF-16, written to the file as UTF-16LE.
wprintf | redirected to file | _setmode(_O_U8TEXT) | Caller supplies text in UTF-16, which is converted and written to the file as UTF-8.
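
For the last row, it may help to see what "converted to UTF-8" means at the byte level. The sketch below is the standard UTF-16 to UTF-8 transformation written out by hand. The function name utf16_to_utf8() is my own, and this is not the C library's actual code, just a portable model of the kind of conversion the _O_U8TEXT mode appears to perform.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Convert n UTF-16 code units to UTF-8 bytes in out (capacity cap).
   Returns the number of bytes written, or 0 if out is too small
   (or the input is empty). Unpaired surrogates become U+FFFD. */
size_t utf16_to_utf8(const uint16_t *in, size_t n,
                     unsigned char *out, size_t cap)
{
    size_t o = 0;
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        uint32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < n &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            /* Surrogate pair: combine into one code point over U+FFFF. */
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00u);
            i++;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD;  /* lone surrogate */
        }

        size_t need = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        if (o + need > cap)
            return 0;

        if (need == 1) {
            out[o++] = (unsigned char)cp;
        } else if (need == 2) {
            out[o++] = (unsigned char)(0xC0 | (cp >> 6));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else if (need == 3) {
            out[o++] = (unsigned char)(0xE0 | (cp >> 12));
            out[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else {
            out[o++] = (unsigned char)(0xF0 | (cp >> 18));
            out[o++] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            out[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        }
    }
    return o;
}
```

For example, the single UTF-16 code unit 0x00E9 comes out as the two bytes C3 A9, and the surrogate pair D83D DE00 comes out as the four bytes F0 9F 98 80. The UTF-16LE rows are simpler: each code unit is written as two bytes, low byte first.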
