For a list of other posts in this series, refer to the first post.
A relatively recent Windows software development feature, affecting character encoding, is the ability to request a specific “ANSI” character encoding (or “code page”), presumably UTF-8, using a manifest. I decided to investigate what this really does.
This “manifest method” is independent of the “setlocale method” I wrote about last time. They do different things, though there is considerable overlap. You can use one, or the other, or both. Both of them only work on sufficiently new versions of Windows.
I am not recommending that you should, or should not, use this method. That’s up to you to decide. Using the Unicode API instead is certainly still a good option — refer to the early posts in this series for help with that.
Update about the setlocale method
First, I need to admit that I may have made a significant mistake in my previous post about the _setlocale(…,".UTF8") method. Either that, or (as seems quite possible) Microsoft has changed something since then.
I said that printing UTF-8 text to the console works even for characters that are not in your OEM code page. But when I try it now, it doesn’t work. Only characters that are defined in my OEM code page work. Others are cleanly replaced with something else, usually a question mark.
The setlocale method still does the other things it’s supposed to do; it’s just that console output doesn’t work as well by default. If you set the console output code page (the “OEM” code page) to UTF-8 (e.g. by running “chcp 65001” from the command prompt), then it will work, since all characters exist in that code page.
This correction isn’t necessarily relevant to the manifest method, but it does mean you can’t get everything to work just by using both methods at once.
Back to the manifest method…
What’s a manifest?
A manifest is a special piece of XML data associated with a Windows program. It is used by the operating system to help run the program with the proper settings. Here’s a minimal manifest to request UTF-8 encoding, for a traditional Win32 desktop application:
<?xml version='1.0' encoding='UTF-8' standalone='yes'?>
<assembly xmlns='urn:schemas-microsoft-com:asm.v1' manifestVersion='1.0'>
<application>
<windowsSettings>
<activeCodePage xmlns="http://schemas.microsoft.com/SMI/2019/WindowsSettings">UTF-8</activeCodePage>
</windowsSettings>
</application>
</assembly>
I don’t want to get too far into how to edit your program’s manifest, in part because I don’t think I know enough about it. I use Visual Studio, and most of its manifest options just don’t seem to work the way I think they ought to.
What does work, at least in simplest form, is to put the manifest in a text file whose filename ends in “.manifest”, and add that file to your project like any other source file. When you build your project, Visual Studio will gather the information in your manifest file, and include it in the final manifest that it embeds in your EXE file.
What does the manifest method do?
The manifest method simply changes your program’s “ANSI” code page to UTF-8. This affects pretty much every function that uses your ANSI code page in some way, including:
- The command-line parameters passed to your main() function (argv).
- Win32 ANSI functions like CreateFileA() or MessageBoxA().
- C library functions like fopen_s(). Note that Visual Studio treats fopen() as insecure, so I’ll test the replacement function fopen_s() instead.
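To make the last item a bit more concrete, here’s a minimal sketch of passing a UTF-8 filename to fopen_s(). The helper name open_for_write_utf8() and the filename are hypothetical, made up for illustration. The sketch assumes the ANSI code page really is UTF-8 (i.e. the manifest took effect); if it isn’t, the same bytes will be interpreted in the legacy code page and you’ll get a differently named file. The fallback to fopen() is only there so the code also compiles with toolchains that don’t provide fopen_s().

```c
#include <stdio.h>

/* Hypothetical helper: open a file for writing, given a filename that is
   encoded as UTF-8. On Windows this only names the intended file if the
   process ANSI code page is UTF-8 (for example, via the manifest). */
FILE *open_for_write_utf8(const char *utf8_name)
{
#ifdef _MSC_VER
    /* Microsoft's C library: use the "secure" replacement for fopen(). */
    FILE *f = NULL;
    if (fopen_s(&f, utf8_name, "wb") != 0) {
        return NULL;
    }
    return f;
#else
    /* Other toolchains may not have fopen_s(), so fall back to fopen(). */
    return fopen(utf8_name, "wb");
#endif
}
```

A call might look like open_for_write_utf8("caf\xC3\xA9.txt"), where \xC3\xA9 is the UTF-8 encoding of “é”. Writing the bytes as escapes avoids depending on how the compiler interprets the source file’s encoding.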
I don’t think there is any normal way for a program to change its ANSI code page at runtime. But it can query it, using the GetACP() function. Normally, the ANSI code page will be a legacy code page like 1252 (for Windows-1252 encoding). With the manifest method, it will be 65001, which is a code number for UTF-8. So, a program can easily detect whether the manifest method worked.
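That detection could be sketched roughly as follows. GetACP() is the real Win32 function; the helper names is_utf8_code_page() and ansi_code_page_is_utf8() are my own, invented for this example. The #ifdef guard is an assumption on my part, so that the code still compiles on non-Windows systems, where it simply reports false.

```c
#include <stdbool.h>

#ifdef _WIN32
#include <windows.h>
#endif

/* 65001 is the Windows code page number for UTF-8. */
#define UTF8_CODE_PAGE 65001u

/* Hypothetical helper: does this code page number mean UTF-8? */
bool is_utf8_code_page(unsigned int code_page)
{
    return code_page == UTF8_CODE_PAGE;
}

/* Hypothetical helper: true if the process ANSI code page is UTF-8,
   which is what the manifest method is supposed to arrange.
   On non-Windows builds there is no ANSI code page, so return false. */
bool ansi_code_page_is_utf8(void)
{
#ifdef _WIN32
    return is_utf8_code_page(GetACP());
#else
    return false;
#endif
}
```

A program might call ansi_code_page_is_utf8() once at startup, and fall back to the Unicode API (or print a warning) if it returns false, for example on a version of Windows too old to honor the manifest setting.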
Unfortunately, the manifest method doesn’t do anything about console output. Anything printed via a function like printf() still needs to be in the console’s character set, which by default will be some legacy encoding like code page 437. If you want it to work right, the user’s console encoding (“OEM code page”) really needs to be set to 65001.
And this is a problem because, as discussed in previous posts, the console encoding is a property of the console, not of the program running in that console.
Your program can set its console’s encoding to UTF-8 very easily, by calling SetConsoleCP(65001) and SetConsoleOutputCP(65001). But these settings will persist, and will affect other programs that run in the same console after yours. Your program can and should try to set everything back before it ends, but abnormal program termination is still a potential issue.
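Here’s one way the save-and-restore idea might look. This is only a sketch: the SavedConsoleCP struct and the functions console_use_utf8() and console_restore() are hypothetical names of my own, while GetConsoleCP(), GetConsoleOutputCP(), SetConsoleCP(), and SetConsoleOutputCP() are the real Win32 calls. On non-Windows builds the functions do nothing.

```c
#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical record of the console code pages found at startup. */
typedef struct {
    unsigned int input_cp;
    unsigned int output_cp;
    int valid; /* nonzero only if both code pages were read successfully */
} SavedConsoleCP;

/* Remember the current console code pages, then switch both to UTF-8. */
SavedConsoleCP console_use_utf8(void)
{
    SavedConsoleCP saved = {0, 0, 0};
#ifdef _WIN32
    saved.input_cp = GetConsoleCP();
    saved.output_cp = GetConsoleOutputCP();
    /* Both calls return 0 on failure, e.g. when no console is attached. */
    if (saved.input_cp != 0 && saved.output_cp != 0) {
        saved.valid = 1;
        SetConsoleCP(65001);
        SetConsoleOutputCP(65001);
    }
#endif
    return saved;
}

/* Put the console back the way it was found. */
void console_restore(const SavedConsoleCP *saved)
{
#ifdef _WIN32
    if (saved->valid) {
        SetConsoleCP(saved->input_cp);
        SetConsoleOutputCP(saved->output_cp);
    }
#else
    (void)saved; /* nothing to restore on other platforms */
#endif
}
```

The idea would be to call console_use_utf8() near the top of main() and console_restore() just before returning. That covers normal exits only; if the program crashes or is killed, the console stays at 65001.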
Still, to my annoyance, I’m getting more and more convinced that rudely setting the console code page to 65001 is just something your program may have to do. It can even be useful in programs that only use the Unicode API. Another issue to be aware of is that it can also cause problems on very old versions of Windows.
Note that if you do set the console’s code page to 65001, there isn’t very much reason left to also use the setlocale method. It might be best to do it anyway, but I don’t know.
Summary of some combinations of settings
Here’s a summary of what encoding you need to use for the parameters of some representative functions, under different combinations of settings. The “argv” column is for the command-line parameters passed to your main() function.
The first three columns are the settings, and the remainder are the consequences of those settings.
For convenience, this table assumes the default ANSI code page is cp1252 (Windows-1252), and the default OEM code page is cp437. It still applies if that’s not the case; you just have to make the appropriate edits.
OEMcp | Manifest | setlocale | argv | fopen_s | CreateFileA | printf |
---|---|---|---|---|---|---|
437 | — | — | cp1252 | cp1252 | cp1252 | cp437 |
437 | — | .UTF8 | cp1252 | UTF-8 | cp1252 | UTF-8, but only cp437-compatible characters work |
437 | UTF-8 | — | UTF-8 | UTF-8 | UTF-8 | cp437 |
437 | UTF-8 | .UTF8 | UTF-8 | UTF-8 | UTF-8 | UTF-8, but only cp437-compatible characters work |
65001 | — | — | cp1252 | cp1252 | cp1252 | UTF-8 |
65001 | — | .UTF8 | cp1252 | UTF-8 | cp1252 | UTF-8 |
65001 | UTF-8 | — | UTF-8 | UTF-8 | UTF-8 | UTF-8 |
65001 | UTF-8 | .UTF8 | UTF-8 | UTF-8 | UTF-8 | UTF-8 |