Win32 I/O character encoding part 2: chcp 65001

In a previous post, I summarized the character encodings used by Windows console mode programs. This is a short post about a not-very-good mitigation technique for some of the resulting problems. In a future post, I’ll go over some better solutions.

[Edit 2020-05: Unfortunately, I’ve had to walk back the advice in this post a little bit. See this post.]

Sometimes on an internet forum, someone will complain about a third-party Windows console program that doesn’t work right with some non-ASCII characters. Often, the advice offered is that to make Windows support Unicode, you should type “chcp 65001”, which sets your OEM code page to UTF-8.

So, maybe a program named “enctest.exe” contains this line of source code, which tries to print a smiley face (U+263A, written out as its three UTF-8 bytes):

printf("test 1: '\xe2\x98\xba'\n");

And the chcp trick makes it work.

Unfortunately, though it might be the only workaround available to the user, it only sometimes works. It mainly helps in the situation where the program prints UTF-8 text to the console. It does not help, for example, in the case of a filename with weird characters passed on the command line. That’s because command line arguments use your ANSI code page, while only your OEM code page is changed by chcp. So, it fixes printf(), but not argv or fopen().
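One way to see this for yourself is to dump the raw bytes your program receives in argv. The helper below is a minimal sketch, not code from enctest.exe; the name bytes_to_hex is made up for illustration.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes the bytes of s into out as space-separated
 * two-digit hex values, stopping early if out is too small.
 * Returns the number of bytes of s that were dumped. */
size_t bytes_to_hex(const char *s, char *out, size_t out_size)
{
    size_t n = 0;     /* bytes of s consumed so far */
    size_t used = 0;  /* characters written to out so far */

    if (out_size == 0)
        return 0;
    out[0] = '\0';

    for (; s[n] != '\0'; n++) {
        /* Worst case per byte: separator + two hex digits + terminator. */
        if (used + 4 > out_size)
            break;
        used += (size_t)snprintf(out + used, out_size - used,
                                 n ? " %02x" : "%02x",
                                 (unsigned char)s[n]);
    }
    return n;
}
```

Calling this on argv[1] from main() and printing the result shows which encoding the argument actually arrived in. A smiley face that arrived as UTF-8 would show up as "e2 98 ba". If the argument was converted to an ANSI code page that lacks that character, you would see something else, whatever chcp was set to.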

Another issue is that the older the user’s version of Windows, the worse “chcp 65001” works. I don’t think it works at all on Windows XP. It does work on Windows 7, but at least some versions have problems that were fixed in later versions of Windows. For example, it might fail if a multi-byte UTF-8 character is split between two separate calls to printf().
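If you control the program and suspect the split-character problem, one possible mitigation is to never hand the console a partial character: buffer your output, and only write the part that ends on a character boundary. This is just a sketch of that idea, not something I've tested against the affected Windows versions. The function name utf8_complete_prefix is made up, and it assumes the buffer is valid UTF-8 apart from a possibly cut-off last character.

```c
#include <stddef.h>

/* Hypothetical helper: returns how many leading bytes of buf[0..len) can be
 * written now without cutting a multi-byte UTF-8 character in half.
 * Assumes valid UTF-8, except that the final character may be truncated.
 * Malformed input is passed through unchanged (returns len). */
size_t utf8_complete_prefix(const unsigned char *buf, size_t len)
{
    size_t i = len;
    size_t back = 0;   /* continuation bytes seen at the end of the buffer */
    size_t need;       /* total length of the last character, from its lead byte */
    unsigned char lead;

    /* Step back over at most 3 continuation bytes (bit pattern 10xxxxxx). */
    while (i > 0 && back < 3 && (buf[i - 1] & 0xC0) == 0x80) {
        i--;
        back++;
    }
    if (i == 0)
        return len;    /* empty buffer, or no lead byte to judge by */

    lead = buf[i - 1];
    if (lead < 0x80)
        need = 1;
    else if ((lead & 0xE0) == 0xC0)
        need = 2;
    else if ((lead & 0xF0) == 0xE0)
        need = 3;
    else if ((lead & 0xF8) == 0xF0)
        need = 4;
    else
        return len;    /* not a valid lead byte */

    /* If the last character is complete, everything can be written.
     * Otherwise hold back the lead byte and what follows it. */
    return (back + 1 >= need) ? len : i - 1;
}
```

The caller would write that many bytes, keep the remainder in its buffer, and append the next chunk of output to it before trying again.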

If you’re the application programmer, you don’t have to rely on the user typing “chcp 65001”. Your program can call SetConsoleOutputCP() to do the same thing. But be aware that this setting is associated with the user’s console window, not with your program. It persists even after your program ends, and it will affect the next program that runs in that window. You should at least try to set it back to its original value when your program exits (preferably even if it exits abnormally).
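Here is a minimal sketch of that save-and-restore pattern. GetConsoleOutputCP(), SetConsoleOutputCP() and CP_UTF8 are real Win32 names; the function names use_utf8_console_output and restore_output_cp are made up for this example. The #ifdef is only there so the file still compiles on other systems, where it does nothing.

```c
#include <stdlib.h>
#ifdef _WIN32
#include <windows.h>

/* Code page that was active before we changed it; 0 means "nothing saved". */
static UINT saved_output_cp = 0;

/* Registered with atexit() to put the console back the way we found it. */
static void restore_output_cp(void)
{
    if (saved_output_cp != 0)
        SetConsoleOutputCP(saved_output_cp);
}
#endif

/* Hypothetical helper: switches the console output code page to UTF-8 and
 * arranges for the original value to be restored at normal program exit.
 * Returns 1 on success, 0 on failure (or when not built for Windows). */
int use_utf8_console_output(void)
{
#ifdef _WIN32
    UINT original;

    if (saved_output_cp != 0)
        return 1;                       /* already switched earlier */

    original = GetConsoleOutputCP();    /* returns 0 on failure */
    if (original == 0)
        return 0;
    if (!SetConsoleOutputCP(CP_UTF8))   /* CP_UTF8 is 65001 */
        return 0;

    saved_output_cp = original;
    atexit(restore_output_cp);
    return 1;
#else
    return 0;
#endif
}
```

You would call use_utf8_console_output() once near the top of main(), before printing anything. Note that atexit() handlers only run on a normal exit. Covering abnormal exits (Ctrl+C, a crash) takes more work, for example a console control handler, and this sketch doesn't attempt it.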

[Edit: Each console actually has two OEM code page settings: one for input, and one for output. The chcp command changes both, but (when run without parameters) only reports the input code page. The SetConsoleOutputCP() function only changes the output code page. Use SetConsoleCP() if you want to change the input code page.]
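Since chcp only shows you one of the two, a program can query both settings itself with GetConsoleCP() and GetConsoleOutputCP(). A small sketch follows; the wrapper name get_console_code_pages is made up, and the #ifdef just keeps it compilable on non-Windows systems.

```c
#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical helper: fetches the console's input and output code pages.
 * Returns 1 and fills in both values on success. Returns 0 (leaving the
 * outputs untouched) if there is no console or this isn't Windows. */
int get_console_code_pages(unsigned *input_cp, unsigned *output_cp)
{
#ifdef _WIN32
    UINT in = GetConsoleCP();          /* input code page, 0 on failure */
    UINT out = GetConsoleOutputCP();   /* output code page, 0 on failure */

    if (in == 0 || out == 0)
        return 0;
    *input_cp = in;
    *output_cp = out;
    return 1;
#else
    (void)input_cp;
    (void)output_cp;
    return 0;
#endif
}
```

After a call to SetConsoleOutputCP(65001) alone, I would expect this to report 65001 for output and the old value for input.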

But instead of “chcp 65001” or SetConsoleOutputCP(), it would be better to change your program to use Win32’s Unicode API, if reasonably possible.
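As a rough idea of what that can look like for console output, here is a sketch that writes UTF-16 text with WriteConsoleW(), which doesn't depend on the console's code page. It is only one possible approach, and the function name write_console_wide is made up. It also only handles the case where standard output really is a console; redirected output needs separate handling that isn't shown here.

```c
#include <wchar.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical helper: writes a wide (UTF-16 on Windows) string directly to
 * the console. Returns 1 on success, 0 if stdout is not a console, the write
 * fails, or this isn't Windows. */
int write_console_wide(const wchar_t *text)
{
#ifdef _WIN32
    HANDLE out = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD mode;
    DWORD written;

    if (out == NULL || out == INVALID_HANDLE_VALUE)
        return 0;
    /* GetConsoleMode() fails when output is redirected to a file or pipe. */
    if (!GetConsoleMode(out, &mode))
        return 0;
    return WriteConsoleW(out, text, (DWORD)wcslen(text), &written, NULL) != 0;
#else
    (void)text;
    return 0;
#endif
}
```

With this, the earlier test line would become something like write_console_wide(L"test 1: '\x263a'\n"), with no chcp needed. Whether the smiley actually displays still depends on the console font.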
