This is part of a series of posts on using Unicode in Windows command-line applications. Here’s the first post.
Sometime in 2018, some functions in the Windows 10 C runtime system, and related development SDKs, were enhanced to support UTF-8. This feature is enabled by calling the setlocale function.
For reference, Microsoft’s current documentation of setlocale is here.
I wasn’t aware of this feature when I made my first series of posts, though it apparently slightly predates them. I did look at the setlocale function, but maybe not at a new enough version of it.
To enable the feature in your C/C++ program, call:
setlocale(LC_ALL, ".UTF8");
(Note: The setlocale function has several different uses. In this post, “setlocale” means this specific use of setlocale, not the setlocale function in general.)
After that, certain functions, such as printf and fopen_s, will treat relevant parameters as being encoded in UTF-8. So, you can write a program something like this:
#include <locale.h>
#include <stdio.h>
int main(int argc, char **argv) {
    setlocale(LC_ALL, ".UTF8");
    printf("triangle \xe2\x96\xb2\n");
    printf("7/8 \xe2\x85\x9e\n");
}
Compiling it with a modern edition of Visual C++, and running it, I get:
triangle ▲
7/8 ⅞
Notice that the entire console output path supports Unicode here. It’s not that printf converts the data from UTF-8 to the user’s OEM encoding (the default legacy encoding used by the console). We know that, because “⅞” is not in my OEM encoding, yet it works.
Note that this works regardless of what the console’s legacy (OEM) code page is set to, and it doesn’t change it.
So it seems that setlocale significantly changes how printf works, in a similar manner to how _setmode(_fileno(stdout), _O_U8TEXT) changes how wprintf works.
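For comparison, here is a rough sketch of that wide-character approach, using the Microsoft-specific <io.h> and <fcntl.h> headers. I’m including it only to show the shape of the technique, not as code from the experiment above.

#include <fcntl.h>
#include <io.h>
#include <stdio.h>
int main(int argc, char **argv) {
    // Put stdout into UTF-8 translation mode for wide-character output.
    _setmode(_fileno(stdout), _O_U8TEXT);
    // In this mode, wide strings written with wprintf are converted to UTF-8.
    wprintf(L"triangle \u25b2\n");
    wprintf(L"7/8 \u215e\n");
    return 0;
}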
Details
Alternatives to the setlocale ".UTF8" parameter that also worked for me include ".UTF-8", ".utf8", and ".65001".
Assuming the program’s output is going to a console, invalid UTF-8 causes the output from a given printf statement to be truncated at that point.
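As a hypothetical illustration, a statement like this (0xFF is never a valid byte in UTF-8) should have its output cut off right before the bad byte when it goes to a console:

printf("before \xff after\n");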
Splitting a single UTF-8 codepoint between two or more consecutive printf statements seems to work correctly, though I didn’t extensively test it.
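As an example of the kind of split I mean (shown here only as a sketch), the three UTF-8 bytes of the triangle character from the earlier program can be divided between two calls:

printf("triangle \xe2\x96"); // first two bytes of U+25B2
printf("\xb2\n");            // final byte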
If the output is redirected to a file, setlocale apparently doesn’t change how printf works. It still just writes the bytes to the file, in an encoding-agnostic way.
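If your program needs to care about that difference, one possible approach (not part of the setlocale feature; just a sketch using the Microsoft-specific _isatty function) is to test whether stdout is really a console:

#include <io.h>
#include <stdio.h>
int main(void) {
    if (_isatty(_fileno(stdout))) {
        printf("stdout is a console\n");
    } else {
        printf("stdout is redirected\n");
    }
    return 0;
}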
Limitations
Command line arguments
A serious limitation is that this doesn’t affect the command line arguments (argv) supplied to the main function. Those arguments will still be encoded in the user’s “ANSI” code page.
While your program could convert them from ANSI to UTF-8, that isn’t likely to help much. The biggest problem we’re trying to fix here is that the command line could contain a filename whose characters do not all exist in the user’s ANSI code page. Such characters will already be lost by the time your program receives the command line. The setlocale function is not able to travel back in time and fix this.
The setlocale documentation says that accessing the command line via the __argv global variable also doesn’t work, though I wonder if that might be somewhat less impossible for Microsoft to fix.
Implementing wmain, instead of main, and explicitly converting the arguments from UTF-16 if required, remains the only way I know of to make this work.
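Here is a rough sketch of that approach. It assumes a Visual C++ build in which wmain is supported, and it uses the Win32 WideCharToMultiByte function to convert each UTF-16 argument to UTF-8 (error handling is minimal):

#include <windows.h>
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
int wmain(int argc, wchar_t **argv) {
    setlocale(LC_ALL, ".UTF8");
    for (int i = 0; i < argc; i++) {
        // First call reports how many bytes the UTF-8 copy needs, including the terminator.
        int len = WideCharToMultiByte(CP_UTF8, 0, argv[i], -1, NULL, 0, NULL, NULL);
        if (len <= 0) return 1;
        char *utf8 = malloc((size_t)len);
        if (utf8 == NULL) return 1;
        // Second call does the actual UTF-16 to UTF-8 conversion.
        WideCharToMultiByte(CP_UTF8, 0, argv[i], -1, utf8, len, NULL, NULL);
        printf("argument %d: %s\n", i, utf8);
        free(utf8);
    }
    return 0;
}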
Side effects?
It’s possible that calling setlocale could have side effects that you weren’t expecting. I don’t know what the likely issues might be, but we should recognize that adding it to an existing program is not without risk.
Compatibility
The combination of your development system and linker settings, and the user’s version of Windows, must support this setlocale feature. Just because your program works for you doesn’t mean it will work for someone else who runs and/or compiles it. When it doesn’t work, it probably won’t cause anything too bad to happen, but I don’t know.
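One cheap precaution is to check setlocale’s return value, which is NULL when the requested locale isn’t available, so the program can at least notice that it’s running without UTF-8 support. A sketch of what that check might look like:

#include <locale.h>
#include <stdio.h>
int main(void) {
    if (setlocale(LC_ALL, ".UTF8") == NULL) {
        // The runtime rejected the UTF-8 locale; output falls back to legacy behavior.
        fprintf(stderr, "warning: UTF-8 locale not available\n");
    }
    printf("triangle \xe2\x96\xb2\n");
    return 0;
}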