Win32 I/O character encoding supplement 1 – A Cygwin issue

A while back, I wrote a series of posts about using Unicode in Windows console mode programs.

In Part 2, I said that programmers should probably not be changing the console code page to UTF-8 (65001). And that if they must, they should change it back when they’re done.

But now I’ve encountered an issue where changing the code page to UTF-8 might be the least bad thing to do. Worse, you can’t really set it back when you’re done.

To make sense of this post, you’ll need some understanding of Unix-style command lines, and of Cygwin. It would take too long to fully explain what Cygwin is, but basically, it’s a suite of software that makes it easy to compile and run many Unix programs on Windows. Cygwin-compiled programs work best when run from Mintty, or another terminal program designed for that purpose. In this post, I’ll assume Mintty is being used.

A workflow that’s important to me, though maybe not many other people, is running a native Windows program from a Mintty terminal, and piping its output through a Cygwin pager utility such as “less”.

Windows and Unix each have their own set of characteristic problems that you’re likely to run into when dealing with Unicode and non-ASCII character encodings. So, if you mix Windows and Unix programs together, you shouldn’t be surprised if it doesn’t work perfectly.

Until recently, if you ran a native Windows program from Cygwin, the “standard output” stream of that program would never be an actual Windows console. Instead, it would be some sort of pipe. So, it would be impossible to use the Unicode console API in such a situation. If you want Unicode output, one thing you could do is write UTF-8, cross your fingers, and hope for the best.

That’s what happens if you call _setmode(…, _O_U8TEXT). Your program will use the Unicode API when writing to a console, and UTF-8 otherwise.

And that worked well on Cygwin (though admittedly maybe not so well in certain other situations). Assuming Mintty’s character set was set to UTF-8, as it ought to be, everything would work fine. Your shell environment (e.g. the LANG environment variable) should also be set to UTF-8, but that is usually derived automatically from your terminal’s setting.

It was a case of “UTF-8 everywhere”, so it all worked.

But then, around 2018, Microsoft added some “Windows Pseudo Console” (“ConPTY”) features to Windows 10, which made it easier for Windows programs to run console-mode Windows programs in virtual consoles.

Sounds like something Cygwin could use. No need for the hack of using a pipe, when it could use an actual [virtual] Windows console instead. And Cygwin started doing just that, as of version 3.1.0 (December 2019).

This post is based on the behavior of Cygwin 3.1.4. Later versions, and earlier 3.1.x versions, might very well behave differently.

Let’s write a test C/C++ program (u8cwtest.cpp) that reports some information about its environment, and tries to print some non-ASCII characters.

#include <windows.h>
#include <stdio.h>
#include <fcntl.h>
#include <io.h>

int wmain()
{
  DWORD cmode;
  BOOL is_console = GetConsoleMode(
    GetStdHandle(STD_OUTPUT_HANDLE), &cmode);
  UINT ocp = GetConsoleOutputCP();
  _setmode(_fileno(stdout), _O_U8TEXT);
  wprintf(L"console=%c, cp=%u\n", is_console?'y':'n', ocp);
  wprintf(L"[\x2561] [\x0117] [\x042f] [\x25b2]\n");
  return 0;
}

The characters I’ve chosen for testing are:

  • A box-drawing character, which is in CP437.
  • An “e with dot above”, which is not in CP437.
  • A Cyrillic letter, which is not in CP437.
  • A triangle graphic character, which is in CP437, but is in the problematic region below codepoint 32.
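
To show what the program writes when its output is not a console, here is a minimal sketch of a UTF-8 encoder. The helper name encode_utf8 is made up for this example and is not part of u8cwtest.cpp; in _O_U8TEXT mode the C runtime does the equivalent conversion for you.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not from u8cwtest.cpp): encode one Unicode
// code point as a UTF-8 byte sequence.
std::string encode_utf8(char32_t cp)
{
  // Only valid scalar values: at most U+10FFFF, and not a surrogate.
  assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));

  std::string out;
  if (cp < 0x80) {
    // 1 byte: plain ASCII.
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {
    // 2 bytes: 110xxxxx 10xxxxxx
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp < 0x10000) {
    // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {
    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
  return out;
}
```

For the four test characters, this gives the byte sequences E2 95 A1, C4 97, D0 AF, and E2 96 B2.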

Compile it for Windows, e.g. with Visual C++, or Mingw-w64:

$ x86_64-w64-mingw32-g++ -Wall -municode -o u8cwtest.exe u8cwtest.cpp

On the old Cygwin, it does this:

$ ./u8cwtest.exe
console=n, cp=437
[╡] [ė] [Я] [▲]

On Cygwin 3.1.4, it does this:

$ ./u8cwtest.exe
console=y, cp=437
[╡] [ė] [Я] [▲]

So it works in both cases, only now it thinks it’s running on a console. What’s going on under the hood is actually very different. The 3.1.4 output is also what happens on an actual Windows console (command prompt).

With the new Cygwin, what happens if the output is sent to a file?

$ ./u8cwtest.exe > temp.txt
$ file temp.txt
temp.txt: UTF-8 Unicode text, with CRLF line terminators
$ cat temp.txt
console=n, cp=437
[╡] [ė] [Я] [▲]

It works. The file gets encoded in UTF-8.

With the new Cygwin, I’ll now pipe the output through a Cygwin program: the trivial “cat” utility, which (ideally) should not affect the output. With the old Cygwin, this worked fine. But now…

$ ./u8cwtest.exe | cat
console=n, cp=437
[╡] [e] [?] []

… it doesn’t. The box-drawing character worked, but the dot above the ė disappeared, the Cyrillic character turned into a question mark, and the triangle seems to have been deleted entirely.

One might guess that the output from u8cwtest is being changed before it gets to “cat”, but it’s not. “cat” sees the correct output. Something happens to it after it passes through “cat”, and before it shows up on the terminal.

Let’s try the “less” pager. “less” has a lot of options, so you may get different results. When I use it, there are problems:

$ ./u8cwtest.exe | less
console=n, cp=437
[╡] [e] [?] []
(END)

But if I press the left arrow key before exiting “less”, it switches to a full-screen mode, where everything works:

console=n, cp=437
[╡] [ė] [Я] [▲]
~
~
~
~

(Press “q” to exit “less”.)

“less” also uses full-screen mode when displaying large amounts of text. That’s why it took me a while to notice that something was wrong.

My best guess is that something like the following is going on. Cygwin sees that the last program in the pipeline is a Cygwin program, so it assumes the output it produces uses the Unix-like encoding setting (maybe from the LANG environment variable), which is UTF-8. But the output is going to a virtual Windows console, and the console’s code page setting is CP437. So, somewhere, somehow, something is converting the presumed UTF-8 text printed by “cat” to CP437 as it is printed to the console.

The ė’s dot is missing because there is no ė character in CP437, and this is a “best fit” translation. Possibly, Windows’s WideCharToMultiByte function is doing the translation. If so, there must be another layer of translation happening first, to convert from UTF-8 to UTF-16, before converting from UTF-16 to CP437.
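
To make that guess concrete, here is a toy model of the final step, from Unicode code points to CP437 bytes. The table covers only ASCII and the four test characters, and its entries are chosen to match the output seen above. It is not the real Windows conversion table, and the names to_cp437_toy and convert_line are hypothetical.

```cpp
#include <cassert>
#include <string>

// Toy "best fit" conversion of one code point to a CP437 byte.
// ASSUMPTION: this tiny table is inferred from the observed output;
// it is not the table Windows actually uses.
unsigned char to_cp437_toy(char32_t cp)
{
  assert(cp <= 0x10FFFF);
  if (cp < 0x80) return static_cast<unsigned char>(cp); // ASCII maps to itself
  switch (cp) {
    case 0x2561: return 0xB5; // box-drawing character: exact CP437 match
    case 0x0117: return 'e';  // e with dot above: best fit drops the dot
    case 0x25B2: return 0x1E; // triangle: CP437 position 0x1E, below 32
    default:     return '?';  // nothing suitable (e.g. Cyrillic U+042F)
  }
}

// Convert a whole line, one code point at a time.
std::string convert_line(const std::u32string& text)
{
  std::string out;
  for (char32_t cp : text)
    out += static_cast<char>(to_cp437_toy(cp));
  return out;
}
```

Under these assumptions, the test line becomes the bytes for "[", 0xB5, "] [e] [?] [", 0x1E, "]". A Unix-style terminal would likely treat 0x1E as a control character rather than a printable one, which would be consistent with the triangle seeming to vanish.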

At this point I wonder if Cygwin could, and maybe will in some future version, do better than this. Why couldn’t it convert to UTF-16 instead of to the OEM code page, and then use the Unicode API to print it to the virtual console with full Unicode support?
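
The conversion half of that idea is not complicated. Here is a minimal sketch of going from UTF-8 to UTF-16, assuming well-formed input. The function name utf8_to_utf16 is made up for this example, and I have no idea how Cygwin would actually structure it. On Windows, the result is the kind of data you would hand to a Unicode console function such as WriteConsoleW.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: decode UTF-8 into UTF-16 code units.
// ASSUMPTION: the input is well-formed UTF-8; only truncation is checked.
std::u16string utf8_to_utf16(const std::string& in)
{
  std::u16string out;
  std::size_t i = 0;
  while (i < in.size()) {
    unsigned char b = static_cast<unsigned char>(in[i]);
    char32_t cp;
    std::size_t extra; // number of continuation bytes that follow
    if (b < 0x80)      { cp = b;        extra = 0; }
    else if (b < 0xE0) { cp = b & 0x1F; extra = 1; }
    else if (b < 0xF0) { cp = b & 0x0F; extra = 2; }
    else               { cp = b & 0x07; extra = 3; }

    assert(i + extra < in.size()); // sequence must not be cut off

    // Each continuation byte contributes its low 6 bits.
    for (std::size_t k = 1; k <= extra; ++k)
      cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
    i += extra + 1;

    if (cp < 0x10000) {
      out += static_cast<char16_t>(cp);
    } else {
      // Outside the BMP: split into a surrogate pair.
      cp -= 0x10000;
      out += static_cast<char16_t>(0xD800 | (cp >> 10));
      out += static_cast<char16_t>(0xDC00 | (cp & 0x3FF));
    }
  }
  return out;
}
```

All four test characters are in the Basic Multilingual Plane, so each one becomes a single UTF-16 code unit, with nothing lost along the way.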

Until that happens, though, the only fix I can figure out is to change the console’s code page to UTF-8. Running “chcp” manually works…

$ /cygdrive/c/Windows/System32/chcp.com 65001
Active code page: 65001
$ ./u8cwtest.exe | cat
console=n, cp=65001
[╡] [ė] [Я] [▲]
$ /cygdrive/c/Windows/System32/chcp.com 437
Active code page: 437

But I don’t want to have to do that. We can have our program do it, and restore it when it’s done. To reduce possible confusion, we can set both the input and the output code page settings.

int wmain()
{
  UINT oldicp=0;
  UINT oldocp=0;
  DWORD cmode;
  BOOL is_console = GetConsoleMode(
    GetStdHandle(STD_OUTPUT_HANDLE), &cmode);

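  // Output is redirected (not a console): remember the current code
  // pages, then switch the attached console to UTF-8.
  if(!is_console) {
    oldicp = GetConsoleCP();
    oldocp = GetConsoleOutputCP();
    SetConsoleCP(65001);
    SetConsoleOutputCP(65001);
  }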

  UINT ocp = GetConsoleOutputCP();
  _setmode(_fileno(stdout), _O_U8TEXT);
  wprintf(L"console=%c, cp=%u\n", is_console?'y':'n', ocp);
  wprintf(L"[\x2561] [\x0117] [\x042f] [\x25b2]\n");

  if(oldicp!=0) SetConsoleCP(oldicp);
  if(oldocp!=0) SetConsoleOutputCP(oldocp);
  return 0;
}

And now it should work:

$ ./u8cwtest.exe | cat
console=n, cp=65001
[╡] [e] [?] []
$ /cygdrive/c/Windows/System32/chcp.com
Active code page: 437

But it doesn’t! Apparently, there’s a race condition. Our program may end before its output gets all the way through the pipe, and only then does Cygwin look at the console’s code page setting, after we’ve already set it back. In other words, SetConsoleOutputCP() can retroactively affect text that our program has already printed.

If we don’t set the code page back when we’re done,

  // if(oldicp!=0) SetConsoleCP(oldicp);
  // if(oldocp!=0) SetConsoleOutputCP(oldocp);

then it works:

$ ./u8cwtest.exe | cat
console=n, cp=65001
[╡] [ė] [Я] [▲]
$ /cygdrive/c/Windows/System32/chcp.com
Active code page: 65001

But with the drawback that we’ve messed up something that doesn’t belong to us. The code page change is persistent, and could affect other programs that run later in the same console. For now, though, I don’t know a better solution.
