Win32 I/O character encoding supplement 2 – setlocale enhancement

This is part of a series of posts on using Unicode in Windows command-line applications. Here’s the first post.


Sometime in 2018, some functions in the Windows 10 C runtime system, and related development SDKs, were enhanced to support UTF-8. This feature is enabled by calling the setlocale function.

For reference, Microsoft’s current documentation of setlocale is here.

I wasn’t aware of this feature when I made my first series of posts, though it apparently slightly predates them. I did look at the setlocale function, but maybe not at a new enough version of it.

To enable the feature in your C/C++ program, call:

setlocale(LC_ALL, ".UTF8");

(Note: The setlocale function has several different uses. In this post, “setlocale” means this specific use of setlocale, not the setlocale function in general.)

After that, certain functions, such as printf and fopen_s, will treat relevant parameters as being encoded in UTF-8. So, you can write a program something like this:

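#include <locale.h>
#include <stdio.h>

int main(int argc, char **argv) {
  /* Ask the C runtime to treat char strings as UTF-8. */
  setlocale(LC_ALL, ".UTF8");
  printf("triangle \xe2\x96\xb2\n");  /* U+25B2, encoded as UTF-8 */
  printf("7/8 \xe2\x85\x9e\n");       /* U+215E, encoded as UTF-8 */
  return 0;
}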

Compiling it with a modern edition of Visual C++, and running it, I get:

triangle ▲
7/8 ⅞

Notice that the entire console output path supports Unicode here. It’s not that printf converts the data from UTF-8 to the user’s OEM encoding (the default legacy encoding used by the console). We know that, because “⅞” is not in my OEM encoding, yet it works.

Note that this works regardless of what the console’s legacy (OEM) code page is set to, and it doesn’t change it.

So it seems that setlocale significantly changes how printf works, in a similar manner to how _setmode(_fileno(stdout), _O_U8TEXT) changes how wprintf works.

Details

Alternatives to the setlocale ".UTF8" parameter that also worked for me include ".UTF-8", ".utf8", and ".65001".

Assuming the program’s output is going to a console, invalid UTF-8 causes the output from a given printf statement to be truncated at that point.

Splitting the UTF-8 bytes of a single code point between two or more consecutive printf statements seems to work correctly, though I didn’t extensively test it.

If the output is redirected to a file, setlocale apparently doesn’t change how printf works. It still just writes the bytes to the file, in an encoding-agnostic way.

Limitations

Command line arguments

A serious limitation is that this doesn’t affect the command line arguments (argv) supplied to the main function. Those arguments will still be encoded in the user’s “ANSI” code page.

While your program could convert them from ANSI to UTF-8, that isn’t likely to help much. The biggest problem we’re trying to fix here is that the command line could contain a filename whose characters do not all exist in the user’s ANSI code page. Such characters will already be lost by the time your program receives the command line. The setlocale function is not able to travel back in time and fix this.

The setlocale documentation says that accessing the command line via the __argv global variable also doesn’t work, though I wonder if that might be somewhat less impossible for Microsoft to fix.

Implementing wmain, instead of main, and explicitly converting the arguments from UTF-16 if required, remains the only way I know of to make this work.

Side effects?

It’s possible that calling setlocale could have side effects that you weren’t expecting. I don’t know what the likely issues might be, but we should recognize that adding it to an existing program is not without risk.

Compatibility

The combination of your development system and linker settings, and the user’s version of Windows, must support this setlocale feature. Just because your program works for you doesn’t mean it will work for someone else who runs or compiles it. When it doesn’t work, it probably won’t cause anything too bad to happen, but I don’t know.
