Win32 I/O character encoding part 3

This is the third, and possibly final, post in my series on Microsoft Windows console mode character encodings. It describes how to use Unicode reasonably cleanly, without the “chcp 65001” hack discussed in Part 2. See Part 1 for a list of the posts in this series.

Again, the idea is that you are writing a command-line computer program in C or C++, and you want it to work reasonably well on Microsoft Windows, as well as on Unix-like platforms.

The first decision to make is what your program’s internal “native” string format will be. You have three main options:

  1. UTF-8
  2. UTF-16
  3. UTF-16 on Windows, and UTF-8 on other platforms, selected at compile time. This would probably be done using a set of macros like Windows’ old-fashioned “T” macros: TCHAR, TEXT(), _stprintf(), etc.

Option 1 is probably best in most cases. Option 3 may sound good at first, but I don’t think I would recommend it unless you really need optimal performance everywhere. It’s usually preferable to isolate the platform-specific code in one place, instead of having it sprinkled throughout the project.
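For comparison, here is a rough sketch of what Option 3 tends to look like. The names my_char, MY_TEXT, my_fputs, my_strlen, and my_greeting_len are made up for this illustration (they are not part of any Windows header), and a real project would need many more of them.

```c
#include <stdio.h>

#ifdef _WIN32
#include <wchar.h>
typedef wchar_t my_char;           /* UTF-16 code units on Windows */
#define MY_TEXT(s) L##s            /* turns "abc" into L"abc" */
#define my_fputs   fputws
#define my_strlen  wcslen
#else
#include <string.h>
typedef char my_char;              /* UTF-8 bytes elsewhere */
#define MY_TEXT(s) s
#define my_fputs   fputs
#define my_strlen  strlen
#endif

/* Hypothetical example function: every literal and every string call
 * has to go through the macros above. */
size_t my_greeting_len(void)
{
    const my_char *msg = MY_TEXT("hello");
    return my_strlen(msg);
}
```

Every string literal and string function call in the program has to be written this way, which is the “sprinkled throughout the project” cost.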

Assuming Option 1, there are several things that your Windows version will have to do. [Edit: There is another way to do some of this; see this post about setlocale.]

(1) Implement wmain() instead of main().

(2) At the beginning of your program, call _setmode(), to make wprintf() and related functions work properly. Probably like this:

_setmode(_fileno(stdout), _O_U8TEXT);
_setmode(_fileno(stderr), _O_U8TEXT);

Another option is to use _O_U16TEXT mode instead of _O_U8TEXT, but that’s probably either not what you want, or it makes no difference. The choice only affects the bytes that are generated when the output is redirected to a file or a pipe. Note that your program still has to pass UTF-16 text (wchar_t strings) to the wide output functions, even if you use _O_U8TEXT.
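Here is a minimal sketch of that setup step with the headers it needs. The function names my_init_output() and my_print_sample() are hypothetical. On non-Windows platforms the sketch assumes the terminal already accepts UTF-8, so the setup is a no-op.

```c
#include <stdio.h>

#ifdef _WIN32
#include <fcntl.h>   /* _O_U8TEXT */
#include <io.h>      /* _setmode */
#include <wchar.h>   /* fputws */
#endif

/* Hypothetical helper: call once at startup. Returns 0 on success, -1 on failure. */
int my_init_output(void)
{
#ifdef _WIN32
    if (_setmode(_fileno(stdout), _O_U8TEXT) == -1) return -1;
    if (_setmode(_fileno(stderr), _O_U8TEXT) == -1) return -1;
#endif
    return 0;
}

/* Hypothetical demo: prints "café" followed by a newline.
 * On Windows the text must be passed as a wide (UTF-16) string,
 * even though the stream is in _O_U8TEXT mode. */
int my_print_sample(void)
{
#ifdef _WIN32
    return fputws(L"caf\u00e9\n", stdout) >= 0 ? 0 : -1;
#else
    return fputs("caf\xC3\xA9\n", stdout) >= 0 ? 0 : -1;  /* UTF-8 bytes */
#endif
}
```

Once a stream is in one of these Unicode modes, stick to the wide functions for it.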

You could also choose to detect redirected output, and handle it as a special case. One way to detect it is to call GetConsoleMode(), which fails when the handle does not refer to a console. If you don’t handle it as a special case, then your program will likely convert text from UTF-8 to UTF-16, only to have it immediately converted back to UTF-8. That’s okay, just a little inefficient.
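The GetConsoleMode() check might look something like this. The function name my_fd_is_console() is hypothetical, and the isatty() branch is only there so the same call works on Unix-like platforms.

```c
#ifdef _WIN32
#include <windows.h>
#include <io.h>      /* _get_osfhandle */
#else
#include <unistd.h>  /* isatty */
#endif

/* Hypothetical helper: returns 1 if the file descriptor is attached to a
 * console (or terminal), 0 if it is redirected to a file or pipe, or invalid. */
int my_fd_is_console(int fd)
{
    if (fd < 0) return 0;
#ifdef _WIN32
    {
        HANDLE h = (HANDLE)_get_osfhandle(fd);
        DWORD mode;
        if (h == INVALID_HANDLE_VALUE) return 0;
        /* GetConsoleMode() succeeds only for real console handles. */
        return GetConsoleMode(h, &mode) ? 1 : 0;
    }
#else
    return isatty(fd) ? 1 : 0;
#endif
}
```

You would call it as my_fd_is_console(1) for stdout and my_fd_is_console(2) for stderr. What to do with the answer is up to you. One possibility is to skip the wide-character path for a redirected stream and write the UTF-8 bytes directly.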

(3) Make a UTF-8 copy of the UTF-16 argv command-line arguments, maybe using WideCharToMultiByte().
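Steps (1) and (3) together could be sketched as follows. The names my_main() and my_utf8_from_utf16() are hypothetical, and the body of my_main() is just a stand-in for the real program. On other platforms, an ordinary main() would simply call my_main() with its arguments unchanged.

```c
#include <stdlib.h>

#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical portable entry point: argv is UTF-8 on every platform.
 * Stand-in body: require at least one argument. */
int my_main(int argc, char **argv)
{
    (void)argv;
    return argc > 1 ? 0 : 2;
}

#ifdef _WIN32
/* Hypothetical helper: UTF-16 -> newly allocated UTF-8 string, or NULL. */
char *my_utf8_from_utf16(const wchar_t *ws)
{
    /* First call measures the size in bytes, including the terminator. */
    int n = WideCharToMultiByte(CP_UTF8, 0, ws, -1, NULL, 0, NULL, NULL);
    char *s = (n > 0) ? malloc((size_t)n) : NULL;
    if (s && WideCharToMultiByte(CP_UTF8, 0, ws, -1, s, n, NULL, NULL) <= 0) {
        free(s);
        s = NULL;
    }
    return s;
}

int wmain(int argc, wchar_t **wargv)
{
    int i, rc = 1;
    char **argv = calloc((size_t)argc + 1, sizeof *argv);  /* NULL-terminated */
    if (argv) {
        for (i = 0; i < argc; i++) {
            argv[i] = my_utf8_from_utf16(wargv[i]);
            if (!argv[i]) break;
        }
        if (i == argc) rc = my_main(argc, argv);
        for (i = 0; i < argc; i++) free(argv[i]);  /* free(NULL) is harmless */
        free(argv);
    }
    return rc;
}
#endif
```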

(4) For every relevant I/O function, write a wrapper function to translate from UTF-8 to UTF-16, or vice versa. For example, you won’t call printf() directly. Instead, you’ll use a custom my_printf() function, which on Windows will most likely do the formatting (maybe using _vsnprintf_s()), then convert from UTF-8 to UTF-16 (maybe using MultiByteToWideChar()), then call a Windows-specific Unicode print function (maybe fputws()). For another example, my_fopen() will be a wrapper that calls a Unicode function, such as _wfopen_s().
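Here is one way the two wrappers mentioned above might be sketched. The names my_fprintf(), my_fopen(), and my_utf16_from_utf8() are hypothetical. I’ve used the standard vsnprintf() in place of _vsnprintf_s() so that the same formatting code compiles everywhere; that assumes a C99-conforming vsnprintf() (Visual Studio 2015 or later). On Windows it also assumes the stream has already been put into a Unicode mode with _setmode().

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

#ifdef _WIN32
#include <windows.h>
#include <wchar.h>

/* Hypothetical helper: UTF-8 -> newly allocated UTF-16 string, or NULL. */
wchar_t *my_utf16_from_utf8(const char *s)
{
    int n = MultiByteToWideChar(CP_UTF8, 0, s, -1, NULL, 0);
    wchar_t *w = (n > 0) ? malloc((size_t)n * sizeof *w) : NULL;
    if (w && MultiByteToWideChar(CP_UTF8, 0, s, -1, w, n) <= 0) {
        free(w);
        w = NULL;
    }
    return w;
}
#endif

/* Formats UTF-8 text, then prints it. Returns the number of UTF-8 bytes
 * formatted, or -1 on failure. */
int my_fprintf(FILE *fp, const char *fmt, ...)
{
    va_list ap, ap2;
    int len, rc = -1;
    char *utf8;

    va_start(ap, fmt);
    va_copy(ap2, ap);
    len = vsnprintf(NULL, 0, fmt, ap);          /* measure only */
    va_end(ap);
    utf8 = (len >= 0) ? malloc((size_t)len + 1) : NULL;
    if (utf8) vsnprintf(utf8, (size_t)len + 1, fmt, ap2);
    va_end(ap2);
    if (!utf8) return -1;

#ifdef _WIN32
    {
        wchar_t *w = my_utf16_from_utf8(utf8);  /* UTF-8 -> UTF-16 */
        if (w && fputws(w, fp) >= 0) rc = len;
        free(w);
    }
#else
    if (fputs(utf8, fp) >= 0) rc = len;         /* already UTF-8 */
#endif
    free(utf8);
    return rc;
}

/* Opens a file whose name is given in UTF-8. Returns NULL on failure. */
FILE *my_fopen(const char *utf8_path, const char *mode)
{
#ifdef _WIN32
    FILE *fp = NULL;
    wchar_t *wpath = my_utf16_from_utf8(utf8_path);
    wchar_t *wmode = my_utf16_from_utf8(mode);
    if (!wpath || !wmode || _wfopen_s(&fp, wpath, wmode) != 0)
        fp = NULL;
    free(wpath);
    free(wmode);
    return fp;
#else
    return fopen(utf8_path, mode);
#endif
}
```

The rest of the program then calls my_fprintf(stdout, ...) and my_fopen() with UTF-8 strings, and never needs to know which platform it is on.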

Admittedly, doing all this can take quite a bit of effort. But it’s pretty much what you have to do if you want your program to behave well on Windows, and you’re programming to the bare metal Win32 API. It is to be hoped that most higher-level programming systems do this stuff automatically behind the scenes, so the application programmer doesn’t have to.
