Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2023-03-22 Thread Alvin Wong via Development

Hi,


I’ve looked into that one when we did the work for Qt 6. The console has its 
own code page that can be set independently from the app, and I believe also 
independently from the system code page. qDebug() should be mostly fine, as 
we’re using OutputDebugStringW() internally and let Windows handle this mess.

What it does affect is writing to stdout/err and OutputDebugStringA().


It is unfortunately a bit more messy. OutputDebugString communicates 
with the debugger via a debug event which contains an address, then the 
debugger reads the debug message from the memory space of the debuggee 
process.


The documentation of OutputDebugStringW [1] states:

   "In the past, the operating system did not return Unicode strings
   through OutputDebugStringW (ASCII strings were returned instead). To
   force OutputDebugStringW to return Unicode strings, debuggers are
   required to call the WaitForDebugEventEx function to opt into the
   new behavior. In this way, the operating system knows that the
   debugger supports Unicode and is specifically opting into receiving
   Unicode strings."

   "OutputDebugStringW converts the specified string based on the
   current system locale information and passes it to
   OutputDebugStringA to be displayed. As a result, some Unicode
   characters may not be displayed correctly."

What happens with a debugger that does not call `WaitForDebugEventEx` 
(e.g. gdb) is this: The debuggee calls OutputDebugStringW, which 
converts the debug string to ACP (UTF-8 in this case) to be passed to 
OutputDebugStringA. Then the debugger receives the event and tries to 
read the debug string from the debuggee as ACP, but the debugger thinks 
ACP is the system ACP (Windows-1252, CP950 or whatever) so it ends up 
displaying mojibake. The same also happens with Sysinternals DebugView.
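
To make the mismatch concrete, here is a minimal, portable sketch of what the reader side effectively does. It is not the debugger's actual code: `misreadAsLatin1` is a hypothetical helper, and ISO-8859-1 stands in for Windows-1252 (the two agree for bytes 0xA0 to 0xFF, which covers the example byte values used here).

```cpp
#include <string>

// Hypothetical helper: treat every byte of `bytes` as one ISO-8859-1
// (Latin-1) code point and return the result encoded as UTF-8. This mimics
// a reader that assumes a single-byte code page while the writer produced
// UTF-8. Latin-1 is used as a stand-in for Windows-1252 (assumption).
std::string misreadAsLatin1(const std::string &bytes)
{
    std::string out;
    for (unsigned char b : bytes) {
        if (b < 0x80) {
            // ASCII is identical in both encodings, so it survives intact.
            out += static_cast<char>(b);
        } else {
            // Code points U+0080..U+00FF need two bytes in UTF-8.
            out += static_cast<char>(0xC0 | (b >> 6));
            out += static_cast<char>(0x80 | (b & 0x3F));
        }
    }
    return out;
}
```

Feeding it the UTF-8 bytes of "é" (0xC3 0xA9) yields the UTF-8 encoding of the two characters "Ã©", while pure ASCII input comes back unchanged, which is why the problem rarely shows up in practice.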


In reality, most of the debug messages are ASCII, so this issue rarely 
affects anything and I consider it just "a mild annoyance".


[1]: https://learn.microsoft.com/en-us/windows/win32/api/debugapi/nf-debugapi-outputdebugstringw


Cheers,
Alvin


On 22/3/2023 17:58, Lars Knoll wrote:

Hi,


On 21 Mar 2023, at 17:46, Alvin Wong via Development wrote:

Hi,

Yes, embedding the manifest with activeCodePage set to UTF-8 is the only thing 
needed to enable UTF-8 as the ANSI code page (ACP) for the process.

Qt itself should work fine now that the bug in QStringConverter was fixed [1] 
a while back. (You can also refer to the linked mail thread. [2]) However, as 
this bug has shown, any code that uses `MultiByteToWideChar` incorrectly, or 
wrongly assumes that `CP_ACP` always refers to a charset in which each 
character is formed by no more than two bytes, will break. Therefore, before 
switching to UTF-8 as the ACP, application developers have to check their own 
code and any libraries they use to make sure everything will still work 
properly after the switch.

[1]: https://codereview.qt-project.org/c/qt/qtbase/+/412208
[2]: https://lists.qt-project.org/pipermail/interest/2022-May/038241.html

About the CRT, it is true that only UCRT fully supports UTF-8 locale. When 
compiling with MSVC, you are almost always using UCRT so it should be fine.

MinGW-w64 is a bit more complicated -- when one gets a mingw-w64 toolchain, the 
whole toolchain is already configured for a specific CRT. Usually it will be 
the system MSVCRT. (If it's configured for UCRT, the toolchain author will 
usually make it clear, because compiled programs will not run out-of-the-box on 
Windows 8.1 or earlier.) I did not run tests myself, but I would not trust 
MSVCRT to support UTF-8 ACP fully. mingw-builds [3] and llvm-mingw [4] are some 
examples of mingw-w64 toolchains that ship UCRT versions.

[3]: https://github.com/niXman/mingw-builds-binaries/releases
[4]: https://github.com/mstorsjo/llvm-mingw

There are two more problems with enabling UTF-8 ACP using the manifest that I 
have encountered so far. When a process is running with UTF-8 ACP, there seems 
to be no API available to get the native system ACP. This can be an issue if, 
for example, some external tools write files using the system ACP and your 
program wants to read those files. The other problem (a mild annoyance) is 
that some debuggers that do not use the updated APIs (gdb, for example) do not 
capture `OutputDebugString` messages in the correct encoding, which affects 
QDebug output.


I’ve looked into that one when we did the work for Qt 6. The console has its 
own code page that can be set independently from the app, and I believe also 
independently from the system code page. qDebug() should be mostly fine, as 
we’re using OutputDebugStringW() internally and let Windows handle this mess.

What it does affect is writing to stdout/err and OutputDebugStringA().


(Console output encoding is separate from the ACP, so one might also need to 
call `SetConsoleOutputCP(CP_UTF8)`, but the detail is a bit fuzzy to me.)

Setting the code page for console o

Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2023-03-21 Thread Alvin Wong via Development

Hi,

Yes, embedding the manifest with activeCodePage set to UTF-8 is the only 
thing needed to enable UTF-8 as the ANSI code page (ACP) for the process.


Qt itself should work fine now that the bug in QStringConverter was 
fixed [1] a while back. (You can also refer to the linked mail thread. 
[2]) However, as this bug has shown, any code that uses 
`MultiByteToWideChar` incorrectly, or wrongly assumes that `CP_ACP` 
always refers to a charset in which each character is formed by no 
more than two bytes, will break. Therefore, before switching to UTF-8 as 
the ACP, application developers have to check their own code and any 
libraries they use to make sure everything will still work properly 
after the switch.


[1]: https://codereview.qt-project.org/c/qt/qtbase/+/412208
[2]: https://lists.qt-project.org/pipermail/interest/2022-May/038241.html
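
The two-byte assumption mentioned above can be checked without any Windows API. The following is a small sketch (the helper names are hypothetical, and it classifies lead bytes by bit pattern only, without validating the sequences) showing that UTF-8 text can need up to four bytes per character, so buffers or loops sized for a double-byte code page are too small once the ACP is UTF-8.

```cpp
#include <cstddef>
#include <string>

// Length of the UTF-8 sequence introduced by `lead`, judged from the lead
// byte's bit pattern alone. Returns 0 for continuation bytes and for other
// bytes that do not match a lead-byte pattern. No validation of overlong or
// out-of-range sequences is attempted.
std::size_t utf8SequenceLength(unsigned char lead)
{
    if (lead < 0x80) return 1;            // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;                             // continuation byte or unmatched
}

// Longest per-character byte count found in `text` (0 for an empty string).
std::size_t longestSequence(const std::string &text)
{
    std::size_t longest = 0;
    for (unsigned char c : text) {
        const std::size_t n = utf8SequenceLength(c);
        if (n > longest)
            longest = n;
    }
    return longest;
}
```

Code that steps through `CP_ACP` text with lead-byte logic written for double-byte code pages would mis-split any character for which this returns 3 or 4.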

About the CRT, it is true that only UCRT fully supports UTF-8 locale. 
When compiling with MSVC, you are almost always using UCRT so it should 
be fine.


MinGW-w64 is a bit more complicated -- when one gets a mingw-w64 
toolchain, the whole toolchain is already configured for a specific CRT. 
Usually it will be the system MSVCRT. (If it's configured for UCRT, the 
toolchain author will usually make it clear, because compiled programs 
will not run out-of-the-box on Windows 8.1 or earlier.) I did not run 
tests myself, but I would not trust MSVCRT to support UTF-8 ACP fully. 
mingw-builds [3] and llvm-mingw [4] are some examples of mingw-w64 
toolchains that ship UCRT versions.


[3]: https://github.com/niXman/mingw-builds-binaries/releases
[4]: https://github.com/mstorsjo/llvm-mingw

There are two more problems with enabling UTF-8 ACP using the manifest 
that I have encountered so far. When a process is running with UTF-8 
ACP, there seems to be no API available to get the native system ACP. 
This can be an issue if, for example, some external tools write files 
using the system ACP and your program wants to read those files. The 
other problem (a mild annoyance) is that some debuggers that do not 
use the updated APIs (gdb, for example) do not capture 
`OutputDebugString` messages in the correct encoding, which affects 
QDebug output.


(Console output encoding is separate from the ACP, so one might also 
need to call `SetConsoleOutputCP(CP_UTF8)`, but the detail is a bit 
fuzzy to me.)
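
As a rough sketch of that idea, and not something tested as part of this thread, a program could align the console output code page with a UTF-8 ACP at startup. `alignConsoleOutputWithUtf8Acp` is a hypothetical helper name; `GetACP`, `GetConsoleOutputCP` and `SetConsoleOutputCP` are the Win32 calls, and the non-Windows branch exists only so the snippet compiles elsewhere.

```cpp
#ifdef _WIN32
#  include <windows.h>
#endif

// Hypothetical helper: if the process ANSI code page is UTF-8, switch the
// console output code page to UTF-8 as well. Returns false only when a
// switch was attempted and failed (for example, no console is attached).
bool alignConsoleOutputWithUtf8Acp()
{
#ifdef _WIN32
    if (GetACP() != CP_UTF8)
        return true;   // ACP is not UTF-8: leave the console untouched
    if (GetConsoleOutputCP() == CP_UTF8)
        return true;   // console output is already UTF-8
    return SetConsoleOutputCP(CP_UTF8) != 0;
#else
    return true;       // no Windows code pages on this platform
#endif
}
```

Whether this is sufficient for every console host is exactly the fuzzy part; the sketch only shows that the two settings are queried and changed through separate calls.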


Cheers,
Alvin


On 20/3/2023 21:44, Edward Welbourne wrote:

Thiago Macieira (31 October 2019 22:11) wrote [0]:

This RFC (...) is meant to discuss how we'll deal with locales on Unix
systems on Qt 6. This does not apply to Windows because on Windows we
cannot reasonably be expected to use UTF-8 for the 8-bit encoding.

[0] https://lists.qt-project.org/pipermail/development/2019-October/037791.html

The GNU make mailing list currently has a thread (starts at [1]) about
handling of encodings on Windows.

[1] https://lists.gnu.org/archive/html/bug-make/2023-03/msg00066.html

The discussion there seems to indicate that setting the system code-page
to UTF-8 can be done in a way that interoperates gracefully with other
processes and the file system, presumably thanks to the system being
substantially UTF-16-based, so all 8-bit encodings go via that anyway.

The means to achieve this appear [2] to hinge on setting the active
codepage for the application in a manifest file, that it gets combined
with after it is linked.

[2] https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page

There do appear to be some vagaries still, it may depend on UCRT and I'm
not sure I've really understood it all, but it looks like we may, in
time, be able to consistently use UTF-8 as 8-bit encoding on Windows.

Eddy.


--
Development mailing list
Development@qt-project.org
https://lists.qt-project.org/listinfo/development


Re: [Development] first (was: Re: C++20 @ Qt)

2023-01-25 Thread Alvin Wong via Development


On 25/1/2023 0:03, Thiago Macieira wrote:

On Tuesday, 24 January 2023 00:44:37 PST Marc Mutz via Development wrote:

On 23.01.23 23:57, Thiago Macieira wrote:

static_assert(sizeof(std::chrono::milliseconds::rep) == 8);

Why == and not >=?

I think we'd want to know if that happened, because a lot of our code will be
depending on the ± 292 million years limit around the epoch.

Not familiar with what Qt already does with std::chrono, but on a first 
glance it seems like `std::chrono::duration` will be 
better for the purpose, no?
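
For reference, the "± 292 million years" figure quoted above can be reproduced with a short sketch. It assumes an average Gregorian year of 365.2425 days (31,556,952 seconds); the constant and function names are made up for illustration, and the `static_assert` is the one quoted from the thread with a message added.

```cpp
#include <chrono>
#include <cstdint>
#include <limits>

// The check quoted in the thread: the millisecond representation is
// expected to be exactly 64 bits wide.
static_assert(sizeof(std::chrono::milliseconds::rep) == 8,
              "milliseconds::rep is expected to be 64-bit");

// Milliseconds in an average Gregorian year (365.2425 days). Assumption
// made for this sketch.
constexpr std::int64_t kMsPerAverageYear = 31556952000LL;

// Whole years that fit into the positive range of milliseconds::rep.
constexpr std::int64_t maxYearsInMilliseconds()
{
    return std::numeric_limits<std::chrono::milliseconds::rep>::max()
           / kMsPerAverageYear;
}
```

With a signed 64-bit representation this comes out at roughly 292 million years on either side of the epoch, which is the limit the quoted code relies on.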

