Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2023-03-22 Thread Thiago Macieira
On Wednesday, 22 March 2023 09:48:05 HST Volker Hilsheimer via Development 
wrote:
> Even if one Qt 5 application and one Qt 6 application exchange data over a
> local socket, unwisely using to/fromLocal8Bit for the purpose - if the Qt 5
> application continues to run with the system code page, then the Qt 6
> application starting to send UTF-8 encoded data will break this.

QLocalSocket is very rare on Windows. And any decent socket code that is 
prepared to work over networks has either used proper 8-bit tagging to 
indicate the encoding (since 2001) or plain UTF-8 (since 2003).

The console is already a mess on Windows because it's not just the ACP for 
Win32 "A" API, but also the legacy DOS encoding (the mess that renders my 
middle name JosÚ or JosΘ). Since that is already a mess, I don't particularly 
find it problematic to see José now... wouldn't be the first time. Most Windows
applications aren't console applications so this is a limited issue. It's also
time-limited: those issues should smooth out easily with proper terminal 
applications, which is how we solved it in the Unix world too.
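
To make the middle-name example concrete: in Windows-1252 the byte 0xE9 is é, while the DOS code pages 850 and 437 (taken here as examples of the legacy DOS encoding) assign that same byte to Ú and Θ. A minimal C++ sketch of that, where the names CodePage, showByte and show are made up for illustration and the lookup covers only ASCII plus this one byte:

```cpp
#include <cassert>
#include <string>

// Three code pages that disagree about the byte 0xE9.
enum class CodePage { Windows1252, Cp850, Cp437 };

// UTF-8 spelling of what one byte displays as under `cp`.
// Illustration only: ASCII plus the single byte 0xE9 are supported.
std::string showByte(unsigned char byte, CodePage cp)
{
    if (byte < 0x80)
        return std::string(1, static_cast<char>(byte)); // ASCII is the same in all three
    assert(byte == 0xE9 && "this sketch only knows one non-ASCII byte");
    switch (cp) {
    case CodePage::Windows1252: return "\xC3\xA9"; // U+00E9, e with acute
    case CodePage::Cp850:       return "\xC3\x9A"; // U+00DA, capital U with acute
    case CodePage::Cp437:       return "\xCE\x98"; // U+0398, Greek capital theta
    }
    return "?"; // not reached for valid enum values
}

// Renders a byte string under the given code page, one byte per character.
std::string show(const std::string &bytes, CodePage cp)
{
    std::string out;
    for (unsigned char b : bytes)
        out += showByte(b, cp);
    return out;
}
```

Feeding the four bytes 4A 6F 73 E9 through show() gives José, JosÚ or JosΘ depending only on which code page the reader assumes.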

No, the far more likely scenario is interchange via files and via pipes to 
child processes. So yes, finding out what the legacy ACP is might be a useful 
piece of information. It shouldn't be the toLocal8Bit encoding, but it should 
be available should the need arise.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Cloud Software Architect - Intel DCAI Cloud Engineering


-- 
Development mailing list
Development@qt-project.org
https://lists.qt-project.org/listinfo/development


Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2023-03-22 Thread Christian Ehrlicher

Am 22.03.2023 um 20:48 schrieb Volker Hilsheimer:

Indeed, the many hits in the sql code are mostly from warning output, thanks 
for checking.

But that Postgres supports UTF-8 doesn’t mean that an existing server is also 
configured to use it. If a server is configured to work with e.g. ISO_8859_5 
encoding, because all Qt clients (which are likely middleware servers, so fully 
controlled) run on Windows machines with a corresponding code page, then Qt 
deciding to encode in UTF-8 instead will break things, won’t it? And SQL is 
just one example.


No, the client encoding is completely unrelated to the encoding on the
server and the database. All three can differ. Even Informix already
supported this 15 years ago, IIRC. The conversion happens between the
client and the server.


Christian



Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2023-03-22 Thread Volker Hilsheimer via Development


> On 22 Mar 2023, at 18:58, Christian Ehrlicher  wrote:
> 
> Am 22.03.2023 um 17:35 schrieb Volker Hilsheimer via Development:
>>  But we use toLocal8Bit in plenty of cases as well. For instance in our Qt 
>> SQL APIs.
> 
> The only plugin which really uses toLocal8Bit() is the IBase - Plugin.
> Postgres is using it as fallback but according to the docs the utf-8
> encoding is supported since at least PostgreSQL 7.3 so the non utf-8 part
> should be removed.
> 
> The other usages are for qWarning() output.
> 
> 
> Will take a look at the IBase stuff to see if we can replace it.

Indeed, the many hits in the sql code are mostly from warning output, thanks 
for checking.

But that Postgres supports UTF-8 doesn’t mean that an existing server is also 
configured to use it. If a server is configured to work with e.g. ISO_8859_5 
encoding, because all Qt clients (which are likely middleware servers, so fully 
controlled) run on Windows machines with a corresponding code page, then Qt 
deciding to encode in UTF-8 instead will break things, won’t it? And SQL is 
just one example.

Even if one Qt 5 application and one Qt 6 application exchange data over a 
local socket, unwisely using to/fromLocal8Bit for the purpose - if the Qt 5 
application continues to run with the system code page, then the Qt 6 
application starting to send UTF-8 encoded data will break this.


Volker



Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2023-03-22 Thread Christian Ehrlicher


Am 22.03.2023 um 18:58 schrieb Christian Ehrlicher:

Am 22.03.2023 um 17:35 schrieb Volker Hilsheimer via Development:

  But we use toLocal8Bit in plenty of cases as well. For instance in
our Qt SQL APIs.


The only plugin which really uses toLocal8Bit() is the IBase - Plugin.


Correction: it's only used during open() and for the event notification.


Cheers,

Christian



Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2023-03-22 Thread Christian Ehrlicher

Am 22.03.2023 um 17:35 schrieb Volker Hilsheimer via Development:

  But we use toLocal8Bit in plenty of cases as well. For instance in our Qt SQL 
APIs.


The only plugin which really uses toLocal8Bit() is the IBase - Plugin.
Postgres is using it as fallback but according to the docs the utf-8
encoding is supported since at least PostgreSQL 7.3 so the non utf-8 part
should be removed.

The other usages are for qWarning() output.


Will take a look at the IBase stuff to see if we can replace it.


Cheers,

Christian



Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2023-03-22 Thread Thiago Macieira
On Wednesday, 22 March 2023 01:07:12 HST Alvin Wong via Development wrote:
> In reality, most of the debug messages are ASCII, so this issue rarely
> affects anything and I consider it just "a mild annoyance".

And also a Not Our Bug issue.

First, the debuggers should opt in to UTF-16 support, if they can. If they 
can't, they should be updated to understand CP_UTF8 manifest executables, if 
they are real debuggers.

That leaves debugview.exe which is not a debugger and therefore doesn't know 
where the messages are coming from. This should reduce the annoyance level.

Question: which category does Qt Creator fall into?

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Cloud Software Architect - Intel DCAI Cloud Engineering




Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2023-03-22 Thread Volker Hilsheimer via Development
> On 22 Mar 2023, at 12:07, Alvin Wong via Development 
>  wrote:
> On 22/3/2023 17:58, Lars Knoll wrote:
>> Hi,
>> 
>> 
>>> On 21 Mar 2023, at 17:46, Alvin Wong via Development 
>>>  wrote:
>>> 
>>> Hi,
>>> 
>>> Yes, embedding the manifest with activeCodePage set to UTF-8 is the only
>>> thing needed to enable UTF-8 as the ANSI code page (ACP) for the process.
>>> 
>>> Qt itself should work fine after the bug in QStringConverter was fixed
>>> [1] a while back. (You can also refer to the linked mail thread. [2])
>>> However, as this bug has shown, any code that uses `MultiByteToWideChar`
>>> incorrectly or wrongly assumes that `CP_ACP` always refers to a charset in
>>> which each character is formed by no more than two bytes will break.
>>> Therefore, before switching to UTF-8 as the ACP, application developers
>>> have to check their code and other libraries to make sure everything will
>>> still work properly after the switch.
>>> 
>>> [1]: https://codereview.qt-project.org/c/qt/qtbase/+/412208
>>> [2]: https://lists.qt-project.org/pipermail/interest/2022-May/038241.html
>>> 
>>> About the CRT, it is true that only UCRT fully supports UTF-8 locale. When 
>>> compiling with MSVC, you are almost always using UCRT so it should be fine.
>>> 
>>> MinGW-w64 is a bit more complicated -- when one gets a mingw-w64 toolchain, 
>>> the whole toolchain is already configured for a specific CRT. Usually it 
>>> will be the system MSVCRT. (If it's configured for UCRT, the toolchain 
>>> author will usually make it clear, because compiled programs will not run 
>>> out-of-the-box on Windows 8.1 or earlier.) I did not run tests myself, but 
>>> I would not trust MSVCRT to support UTF-8 ACP fully. mingw-builds [3] and 
>>> llvm-mingw [4] are some examples of mingw-w64 toolchains that ship UCRT
>>> versions.
>>> 
>>> [3]: https://github.com/niXman/mingw-builds-binaries/releases
>>> [4]: https://github.com/mstorsjo/llvm-mingw
>>> 
>>> There are two more problems with enabling UTF-8 ACP using the manifest that 
>>> I have encountered so far. When a process is running with UTF-8 ACP, there 
>>> seems to be no API available to get the native system ACP. This can be an 
>>> issue if, for example some external tools write files using the system ACP 
>>> and your program wants to read those files. The other problem (a mild 
>>> annoyance) is that some debuggers which aren't using updated APIs (gdb for
>>> example) do not capture `OutputDebugString` messages in the correct
>>> encoding, which affects QDebug output.
>>> 
>>> 
>> I’ve looked into that one when we did the work for Qt 6. The console has its 
>> own code page that can be set independently from the app, and I believe also 
>> independently from the system code page. qDebug() should be mostly fine, as 
>> we’re using OutputDebugStringW() internally and let Windows handle this mess.
>> 
>> What it does affect is writing to stdout/err and OutputDebugStringA(). 
>> 
> It is unfortunately a bit more messy. OutputDebugString communicates with the 
> debugger via a debug event which contains an address, then the debugger reads 
> the debug message from the memory space of the debuggee process.
> The documentation of OutputDebugStringW [1] states:
> "In the past, the operating system did not return Unicode strings through 
> OutputDebugStringW (ASCII strings were returned instead). To force 
> OutputDebugStringW to return Unicode strings, debuggers are required to call 
> the WaitForDebugEventEx function to opt into the new behavior. In this way, 
> the operating system knows that the debugger supports Unicode and is 
> specifically opting into receiving Unicode strings."
> "OutputDebugStringW converts the specified string based on the current system 
> locale information and passes it to OutputDebugStringA to be displayed. As a 
> result, some Unicode characters may not be displayed correctly."
> What happens with a debugger that does not call `WaitForDebugEventEx` (e.g. 
> gdb) is this: The debuggee calls OutputDebugStringW, which converts the debug 
> string to ACP (UTF-8 in this case) to be passed to OutputDebugStringA. Then 
> the debugger receives the event and tries to read the debug string from the 
> debuggee as ACP, but the debugger thinks ACP is the system ACP (Windows-1252, 
> CP950 or whatever) so it ends up displaying mojibake. The same also happens 
> with Sysinternals DebugView.
> In reality, most of the debug messages are ASCII, so this issue rarely 
> affects anything and I consider it just "a mild annoyance".
> [1]: 
> https://learn.microsoft.com/en-us/windows/win32/api/debugapi/nf-debugapi-outputdebugstringw
>> 
>>> (Console output encoding is separate from the ACP, so one might also need 
>>> to call `SetConsoleOutputCP(CP_UTF8)`, but the detail is a bit fuzzy to me.)
>>> 
>> Setting the code page for console output should help when writing to 
>> stdout/err. It’ll require a bit of testing again (it’s been a while since I 
>> looked into 

Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2023-03-22 Thread Alvin Wong via Development

Hi,


I’ve looked into that one when we did the work for Qt 6. The console has its 
own code page that can be set independently from the app, and I believe also 
independently from the system code page. qDebug() should be mostly fine, as 
we’re using OutputDebugStringW() internally and let Windows handle this mess.

What it does affect is writing to stdout/err and OutputDebugStringA().


It is unfortunately a bit more messy. OutputDebugString communicates 
with the debugger via a debug event which contains an address, then the 
debugger reads the debug message from the memory space of the debuggee 
process.


The documentation of OutputDebugStringW [1] states:

   "In the past, the operating system did not return Unicode strings
   through OutputDebugStringW (ASCII strings were returned instead). To
   force OutputDebugStringW to return Unicode strings, debuggers are
   required to call the WaitForDebugEventEx function to opt into the
   new behavior. In this way, the operating system knows that the
   debugger supports Unicode and is specifically opting into receiving
   Unicode strings."

   "OutputDebugStringW converts the specified string based on the
   current system locale information and passes it to
   OutputDebugStringA to be displayed. As a result, some Unicode
   characters may not be displayed correctly."

What happens with a debugger that does not call `WaitForDebugEventEx` 
(e.g. gdb) is this: The debuggee calls OutputDebugStringW, which 
converts the debug string to ACP (UTF-8 in this case) to be passed to 
OutputDebugStringA. Then the debugger receives the event and tries to 
read the debug string from the debuggee as ACP, but the debugger thinks 
ACP is the system ACP (Windows-1252, CP950 or whatever) so it ends up 
displaying mojibake. The same also happens with Sysinternals DebugView.
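
A small sketch of the misreading described above, assuming the debugger's system ACP is Windows-1252. It treats every byte of the debug string as one character and re-encodes the result as UTF-8 for display. One simplification, stated loudly: bytes are mapped as Latin-1, which matches Windows-1252 only for 0xA0 to 0xFF; the 0x80 to 0x9F range differs and is not handled faithfully here. The function names are hypothetical.

```cpp
#include <cassert>
#include <string>

// Appends code point `cp` to `out` as UTF-8.
// Latin-1 input never needs more than two output bytes.
void appendUtf8(std::string &out, unsigned cp)
{
    assert(cp < 0x800);
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// What a reader displays if it assumes one Latin-1 character per byte
// (assumption: stands in for a Windows-1252 system ACP).
std::string misreadAsLatin1(const std::string &bytes)
{
    std::string out;
    for (unsigned char b : bytes)
        appendUtf8(out, b);
    return out;
}
```

Given the UTF-8 bytes of "José" (4A 6F 73 C3 A9), this produces "JosÃ©", while a pure-ASCII message comes out unchanged, which matches the observation that the problem rarely shows up in practice.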


In reality, most of the debug messages are ASCII, so this issue rarely 
affects anything and I consider it just "a mild annoyance".


[1]: 
https://learn.microsoft.com/en-us/windows/win32/api/debugapi/nf-debugapi-outputdebugstringw


Cheers,
Alvin


On 22/3/2023 17:58, Lars Knoll wrote:

Hi,


On 21 Mar 2023, at 17:46, Alvin Wong via 
Development  wrote:

Hi,

Yes, embedding the manifest with activeCodePage set to UTF-8 is the only thing
needed to enable UTF-8 as the ANSI code page (ACP) for the process.

Qt itself should work fine after the bug in QStringConverter was fixed [1]
a while back. (You can also refer to the linked mail thread. [2]) However, as
this bug has shown, any code that uses `MultiByteToWideChar` incorrectly or
wrongly assumes that `CP_ACP` always refers to a charset in which each
character is formed by no more than two bytes will break. Therefore, before
switching to UTF-8 as the ACP, application developers have to check their code
and other libraries to make sure everything will still work properly after the
switch.

[1]: https://codereview.qt-project.org/c/qt/qtbase/+/412208
[2]: https://lists.qt-project.org/pipermail/interest/2022-May/038241.html
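
To illustrate the two-byte assumption: with a double-byte code page, a chunk of input can end with at most one byte of an unfinished character, whereas with UTF-8 it can end with up to three. The following is a generic C++ sketch of handling that at a chunk boundary. It is not the actual QStringConverter fix, and all function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Length of a UTF-8 sequence judged from its lead byte.
// Returns 0 for a continuation byte or an invalid lead byte.
std::size_t utf8SequenceLength(unsigned char lead)
{
    if (lead < 0x80) return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 0;
}

// Number of bytes at the end of `chunk` that start a character whose
// remaining bytes have not arrived yet. For UTF-8 this is 0..3.
std::size_t incompleteTailLength(const std::string &chunk)
{
    const std::size_t n = chunk.size();
    for (std::size_t back = 1; back <= 3 && back <= n; ++back) {
        const unsigned char b = static_cast<unsigned char>(chunk[n - back]);
        if ((b & 0xC0) == 0x80)
            continue; // continuation byte: keep looking for the lead byte
        const std::size_t len = utf8SequenceLength(b);
        return len > back ? back : 0; // unfinished only if more bytes are expected
    }
    return 0; // no lead byte in range: complete sequence or malformed input
}

// Splits a chunk into the part that can be converted now and the bytes
// that must be carried over to the next chunk.
std::pair<std::string, std::string> splitAtCharacterBoundary(const std::string &chunk)
{
    const std::size_t tail = incompleteTailLength(chunk);
    assert(tail <= 3 && tail <= chunk.size());
    return { chunk.substr(0, chunk.size() - tail), chunk.substr(chunk.size() - tail) };
}
```

A converter that keeps only a single byte of state between calls would lose data for the three-byte and four-byte cases, which is the kind of breakage to look for before switching the ACP.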

About the CRT, it is true that only UCRT fully supports UTF-8 locale. When 
compiling with MSVC, you are almost always using UCRT so it should be fine.

MinGW-w64 is a bit more complicated -- when one gets a mingw-w64 toolchain, the 
whole toolchain is already configured for a specific CRT. Usually it will be 
the system MSVCRT. (If it's configured for UCRT, the toolchain author will 
usually make it clear, because compiled programs will not run out-of-the-box on 
Windows 8.1 or earlier.) I did not run tests myself, but I would not trust 
MSVCRT to support UTF-8 ACP fully. mingw-builds [3] and llvm-mingw [4] are some
examples of mingw-w64 toolchains that ship UCRT versions.

[3]: https://github.com/niXman/mingw-builds-binaries/releases
[4]: https://github.com/mstorsjo/llvm-mingw

There are two more problems with enabling UTF-8 ACP using the manifest that I 
have encountered so far. When a process is running with UTF-8 ACP, there seems 
to be no API available to get the native system ACP. This can be an issue if, 
for example some external tools write files using the system ACP and your 
program wants to read those files. The other problem (a mild annoyance) is
that some debuggers which aren't using updated APIs (gdb for example) do not
capture `OutputDebugString` messages in the correct encoding, which affects
QDebug output.


I’ve looked into that one when we did the work for Qt 6. The console has its 
own code page that can be set independently from the app, and I believe also 
independently from the system code page. qDebug() should be mostly fine, as 
we’re using OutputDebugStringW() internally and let Windows handle this mess.

What it does affect is writing to stdout/err and OutputDebugStringA().


(Console output encoding is separate from the ACP, so one might also need to 
call `SetConsoleOutputCP(CP_UTF8)`, but the detail is a bit fuzzy to me.)

Setting the code page for console output should help when writing to 

Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2023-03-22 Thread Lars Knoll
Hi,

> On 21 Mar 2023, at 17:46, Alvin Wong via Development 
>  wrote:
> 
> Hi,
> 
> Yes, embedding the manifest with activeCodePage set to UTF-8 is the only
> thing needed to enable UTF-8 as the ANSI code page (ACP) for the process.
> 
> Qt itself should work fine after the bug in QStringConverter was fixed
> [1] a while back. (You can also refer to the linked mail thread. [2])
> However, as this bug has shown, any code that uses `MultiByteToWideChar`
> incorrectly or wrongly assumes that `CP_ACP` always refers to a charset in
> which each character is formed by no more than two bytes will break.
> Therefore, before switching to UTF-8 as the ACP, application developers have
> to check their code and other libraries to make sure everything will still
> work properly after the switch.
> 
> [1]: https://codereview.qt-project.org/c/qt/qtbase/+/412208
> [2]: https://lists.qt-project.org/pipermail/interest/2022-May/038241.html
> 
> About the CRT, it is true that only UCRT fully supports UTF-8 locale. When 
> compiling with MSVC, you are almost always using UCRT so it should be fine.
> 
> MinGW-w64 is a bit more complicated -- when one gets a mingw-w64 toolchain, 
> the whole toolchain is already configured for a specific CRT. Usually it will 
> be the system MSVCRT. (If it's configured for UCRT, the toolchain author will 
> usually make it clear, because compiled programs will not run out-of-the-box 
> on Windows 8.1 or earlier.) I did not run tests myself, but I would not trust 
> MSVCRT to support UTF-8 ACP fully. mingw-builds [3] and llvm-mingw [4] are 
> some examples of mingw-w64 toolchains that ship UCRT versions.
> 
> [3]: https://github.com/niXman/mingw-builds-binaries/releases
> [4]: https://github.com/mstorsjo/llvm-mingw
> 
> There are two more problems with enabling UTF-8 ACP using the manifest that I 
> have encountered so far. When a process is running with UTF-8 ACP, there 
> seems to be no API available to get the native system ACP. This can be an 
> issue if, for example some external tools write files using the system ACP 
> and your program wants to read those files. The other problem (a mild 
> annoyance) is that some debuggers which aren't using updated APIs (gdb for
> example) do not capture `OutputDebugString` messages in the correct
> encoding, which affects QDebug output.
> 
I’ve looked into that one when we did the work for Qt 6. The console has its 
own code page that can be set independently from the app, and I believe also 
independently from the system code page. qDebug() should be mostly fine, as 
we’re using OutputDebugStringW() internally and let Windows handle this mess.

What it does affect is writing to stdout/err and OutputDebugStringA(). 

> (Console output encoding is separate from the ACP, so one might also need to 
> call `SetConsoleOutputCP(CP_UTF8)`, but the detail is a bit fuzzy to me.)

Setting the code page for console output should help when writing to 
stdout/err. It’ll require a bit of testing again (it’s been a while since I 
looked into it), but I believe console was mostly handling this fine 
independent of the codepage being used by it internally (ie. Windows would 
recode the string).
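
A hedged sketch of what setting the console output code page could look like for plain stdout writes. SetConsoleOutputCP and CP_UTF8 are the Win32 names mentioned in this thread; the wrapper names useUtf8ConsoleOutput and printJose are invented here, and the non-Windows branch simply assumes the terminal already expects UTF-8. Whether the console then renders every character correctly still needs the testing mentioned above.

```cpp
#include <cstddef>
#include <cstdio>
#ifdef _WIN32
#  include <windows.h>
#endif

// Asks the attached console to interpret our stdout/stderr bytes as UTF-8.
// Returns false if the call fails (for example, when no console is attached).
bool useUtf8ConsoleOutput()
{
#ifdef _WIN32
    return SetConsoleOutputCP(CP_UTF8) != 0;
#else
    return true; // assumption: Unix terminals already expect UTF-8
#endif
}

// Writes the raw UTF-8 bytes of "José" plus a newline to stdout.
// Returns the number of bytes written.
std::size_t printJose()
{
    static const char text[] = "Jos\xC3\xA9\n";
    return std::fwrite(text, 1, sizeof text - 1, stdout);
}
```

Calling useUtf8ConsoleOutput() once at startup and then printJose() writes the same six bytes on every platform; only the console's interpretation of them changes.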

Cheers,
Lars

> 
> Cheers,
> Alvin
> 
> 
> On 20/3/2023 21:44, Edward Welbourne wrote:
>> Thiago Macieira (31 October 2019 22:11) wrote [0]:
>>> This RFC (...) is meant to discuss how we'll deal with locales on Unix
>>> systems on Qt 6. This does not apply to Windows because on Windows we
>>> cannot reasonably be expected to use UTF-8 for the 8-bit encoding.
>> [0] 
>> https://lists.qt-project.org/pipermail/development/2019-October/037791.html
>> 
>> The GNU make mailing list currently has a thread (starts at [1]) about
>> handling of encodings on Windows.
>> 
>> [1] https://lists.gnu.org/archive/html/bug-make/2023-03/msg00066.html
>> 
>> The discussion there seems to indicate that setting the system code-page
>> to UTF-8 can be done in a way that interoperates gracefully with other
>> processes and the file system, presumably thanks to the system being
>> substantially UTF-16-based, so all 8-bit encodings go via that anyway.
>> 
>> The means to achieve this appear [2] to hinge on setting the active
>> codepage for the application in a manifest file, that it gets combined
>> with after it is linked.
>> 
>> [2] 
>> https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page
>> 
>> There do appear to be some vagaries still, it may depend on UCRT and I'm
>> not sure I've really understood it all, but it looks like we may, in
>> time, be able to consistently use UTF-8 as 8-bit encoding on Windows.
>> 
>>  Eddy.
>> 