This is the first of a series of posts. Here are the others:
This post is about programming a Windows Win32 application, mainly one that uses the console (command line). It summarizes the results of some tests I performed.
Maybe you ported a Unix utility to Windows, but you find that it doesn’t work with filenames that contain Japanese characters. This information may help, though specific recommendations will have to wait for a follow-up post.
Notes:
- “Win32” means both 32-bit and 64-bit. (There’s no API named “Win64”.)
- I used the C language, and a Microsoft C library, to do my testing. It should be the same with C++, but not necessarily with other languages.
- I expect this information to apply to all versions of Windows that support Unicode and Win32, at least XP and newer. I haven’t extensively tested it, though.
- As I write this, Microsoft says they are soon going to update the Windows 10 console, giving it better support for colors, emoji, and other things. I don’t think that will affect the correctness of this post, but it means you may have more options.
- Even without this update, there may well be relevant console features that I don’t know about.
- There are lots of variables involved that might affect the results, including your development system, the user’s version of Windows, and the user’s language settings. I probably don’t know all the dependencies.
The input methods I will consider:
- Command-line arguments, typed directly by the user into a console.
- Command-line arguments, when the command is run from a batch (.BAT) file.
The output methods I will consider:
- Printing to standard output, where standard output is a console.
- Printing to standard output, where standard output is redirected to a file.
Explanations of things in the tables below:
- This may depend on your compiler, but you generally have the option of implementing either a main() function or a wmain() function. With wmain(), you get Unicode (UTF-16) versions of the command-line arguments. I don’t know of any way to do that if you use main(). It would be nice if you could get the Unicode arguments from the special __wargv global variable, but __wargv doesn’t work unless you implement wmain(), which defeats what seems like its only possible purpose.
- A Windows user has two default legacy “code page” settings: An “OEM” encoding (from the MS-DOS era), and an “ANSI” encoding (from the pre-Unicode Windows era). For English-language users, the OEM encoding is usually Code Page 437, and the ANSI encoding is usually Windows-1252. You can view and change a console window’s OEM encoding using the “chcp” command.
- By “full Unicode support”, I don’t really mean all of Unicode. I just mean you’re not limited to the characters present in legacy code pages. These functions might still not work fully for characters over U+FFFF, characters not present in the user’s font, combining characters, or any other characters with unusual properties.
- Unlike a Unix terminal, a Windows console is a native Unicode device. You write Unicode characters to it, not raw bytes. Any API function like printf() that seems to be writing bytes to it must be converting those bytes to Unicode characters behind the scenes.
- “printf” stands for a whole family of output functions, including puts() and others. Similarly, “wprintf” stands for the whole family of “wide char” output functions.
- The _setmode() function can change the behavior of subsequent calls to the wprintf() family of functions. An example of its use is “_setmode(_fileno(stdout), _O_U8TEXT)”.
Input:
| Input | Source | Entry point | Behavior |
|---|---|---|---|
| argv | from console | main() | Args are converted (from Unicode) to ANSI. |
| argv | from console | wmain() | Args are encoded in UTF-16; full Unicode support. |
| argv | from batch file | main() | Args are interpreted as OEM (in the batch file), and converted to ANSI. Only characters common to both encodings will work. |
| argv | from batch file | wmain() | Args are interpreted as OEM (in the batch file), and converted to UTF-16. |
Output:
| Function | Destination | Mode | Behavior |
|---|---|---|---|
| printf | to console | (any) | Caller must encode text in OEM. |
| wprintf | to console | no _setmode | Unicode codepoints up to 0xff are reinterpreted (misinterpreted) as OEM. Output for each wprintf stops when a codepoint over 0xff is encountered. |
| wprintf | to console | _setmode(_O_BINARY) | Encodes in UTF-16LE, then reinterprets (misinterprets) each codepoint as two bytes of OEM text. Useless. |
| wprintf | to console | _setmode(_O_U16TEXT) | Caller supplies text in UTF-16; full Unicode support. |
| wprintf | to console | _setmode(_O_U8TEXT) | Caller supplies text in UTF-16; full Unicode support. |
| printf | redirected to file | (any) | Encoding-agnostic; raw bytes are written to the file, except for possible EOL conversion. |
| wprintf | redirected to file | no _setmode | Codepoints up to 0xff are written to the file, with no conversion, one byte per codepoint. Output for each wprintf stops when a codepoint over 0xff is encountered. |
| wprintf | redirected to file | _setmode(_O_BINARY) | Caller supplies text in UTF-16, written to the file as UTF-16LE. |
| wprintf | redirected to file | _setmode(_O_U16TEXT) | Caller supplies text in UTF-16, written to the file as UTF-16LE. |
| wprintf | redirected to file | _setmode(_O_U8TEXT) | Caller supplies text in UTF-16, which is converted and written to the file as UTF-8. |