Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2019-11-16 Thread Thiago Macieira
Hi

Sorry, it looks like this thread is not progressing in a calm and reasoned 
manner, the way it was meant to be. And I'm very much to blame. So I apologise 
for the strong language and passionate opinions. I'm deleting most of what I 
had written as a reply so we can start over.

Let's start with your questions:

On Saturday, 16 November 2019 10:50:13 PST André Pönitz wrote:
> You have not yet answered
> 
>   - why this decision was made

You know, I don't know. To be frank, I don't know that a decision *was* made. 
It all started with a change (see OP) about removing QTextCodec from the API 
and from QtCore. It seemed reasonable enough but it turned up quite a few 
kinks that hadn't been predicted. One of them, which may still be a 
showstopper, is QXmlStreamReader's inability to handle XML data encoded in 
anything except UTF-8, though a thorough search of all XML files in my system 
turned up exactly zero such files.

I don't know why QTextCodec is being removed. I don't remember any decisions 
in prior QtCS or this mailing list about removing it. We definitely discussed 
removing the CJK codecs and their big tables and that can still be done, with 
no effect on the API, since QTextCodec is backed by ICU's ucnv. We may have 
discussed removing it, but I don't remember a firm decision. And even if it is 
firm, after looking at the consequences of doing so, we may want to reverse 
our decision.

Related to that is the discussion of whether UTF-8 is the only acceptable 
locale on Unix systems. If we don't have QTextCodec, then we have to have 
something fixed for QString::fromLocal8Bit and it would necessarily be UTF-8. 
But even if we do have QTextCodec, that's still a reasonable question: should 
we assume it is UTF-8? And should we enforce it? Those were the questions in my 
OP.

>   - who did it

Considering I don't know a decision *was* made, I don't think we can say who 
made it.

>   - what the actual problem to solve was

Three things being tackled, all related:

1) QTextCodec in the API
I think we cannot do without it, it'll have to stay in one way or another. So 
the question reduces to whether it should stay in QtCore or be moved to 
another library. Given the QXmlStreamReader problem above, it's probably best 
to keep it in QtCore, actually.

QTextCodec has some API limitations but they can be fixed. It's not necessary 
for us to remove it: it's not *that* broken.

2) QtCore size
As I said above, removing the legacy codecs we have code for is not a problem. 
They are already disabled in Qt builds where ICU is present, so we'd 
additionally remove them from all other builds. Where ICU is present, there's 
no loss of functionality for user applications, since ICU provides far more 
codecs than we do. For those without ICU, it stands to reason that the user 
chose size so they are aware of the limitations. Plus, one can always 
instantiate their own QTextCodec and add to the list (at least, with today's 
implementation).

If QTextCodec is not in QtCore, then most likely you can't affect how QtCore 
and almost all other Qt classes decode 8-bit data into QString, including 
QTextStream.

and 3) misconfigured locale systems and filename handling
This is probably the biggest problem. As it is right now, when the locale 
isn't set on a Unix system or if it is explicitly set to C, we *cannot* decode 
any file names with the 8th bit set. Those file names are considered 
filesystem corruption. And yet they are quite commonly created by the user 
outside of English-speaking jurisdictions.

Your example of setting LC_ALL (or another environment variable) to force the 
locale to print something that either can be parsed or shared is one such 
problematic scenario. On one hand, you may need it to get some older tools to 
parse output; on the other, it makes Qt applications unable to even see some 
files exist.

>   - why LC_*ALL* comes into play

Because it's the override. If we decide to override and LC_ALL is set, then we 
have no choice but to override it. If it is unset, then we can leave it unset 
too, but may need to override LC_CTYPE.
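
As a minimal sketch of that override order (illustration only, not Qt code; 
effective_locale is a made-up name): POSIX consults LC_ALL first, then the 
category's own variable such as LC_CTYPE, then LANG. The final fallback to "C" 
is an assumption; POSIX leaves that default implementation-defined.

```c
#include <stddef.h>

/* Returns the locale name that would be picked for one category, given the
 * values of LC_ALL, the category's own variable (e.g. LC_CTYPE) and LANG.
 * NULL or "" means "unset". The "C" fallback is an assumption. */
const char *effective_locale(const char *lc_all,
                             const char *lc_category,
                             const char *lang)
{
    if (lc_all && *lc_all)
        return lc_all;          /* LC_ALL overrides everything else */
    if (lc_category && *lc_category)
        return lc_category;     /* then the per-category variable */
    if (lang && *lang)
        return lang;            /* then LANG */
    return "C";
}
```

So with LC_ALL set, changing LC_CTYPE alone has no effect; with LC_ALL unset, 
LC_CTYPE is enough to change how bytes are decoded.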

> I get the impression that this thread was not started as an RFC for an
> open-ended discussion, but as a staged attempt to provide a figleaf for
> a pre-determined decision.

That was not the intention. That's why I am re-starting it so we can come back 
to a reasoned approach.

Anyway, the two independent (but related) decisions we need to make are:
1) do we keep QTextCodec in QtCore?
2) do we want to change how we handle legacy (non-UTF-8) locales?

For #2, the sub-questions of the OP apply:
 a) What should Qt 6 assume the locale to be, if no locale is set?
 b) In case a non-UTF-8 locale is set, what should we do?
 c) Should we propagate our decision to child processes?

My preferences were:
 a) C.UTF-8
 b) override it to force UTF-8 on the same locale
 c) yes
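
A minimal sketch of what (a) and (b) could look like at the string level, 
assuming locale names of the usual language_TERRITORY.codeset@modifier form. 
This is not Qt code: force_utf8_codeset is a hypothetical helper, and it does 
not check whether the resulting locale is actually installed.

```c
#include <stdio.h>
#include <string.h>

/* Writes into `out` the locale name `name` with its codeset replaced by
 * UTF-8. An unset/empty name, "C" and "POSIX" all become "C.UTF-8".
 * Returns 0 on success, -1 if `out` is too small. */
int force_utf8_codeset(const char *name, char *out, size_t outlen)
{
    if (!name || !*name || strcmp(name, "C") == 0 || strcmp(name, "POSIX") == 0)
        name = "C";

    /* Optional "@modifier" suffix, kept unchanged. */
    const char *modifier = strchr(name, '@');
    size_t base_len = modifier ? (size_t)(modifier - name) : strlen(name);

    /* Optional ".codeset" part before the modifier, dropped. */
    const char *dot = memchr(name, '.', base_len);
    if (dot)
        base_len = (size_t)(dot - name);

    int n = snprintf(out, outlen, "%.*s.UTF-8%s",
                     (int)base_len, name, modifier ? modifier : "");
    return (n >= 0 && (size_t)n < outlen) ? 0 : -1;
}
```

With this, "pt_BR.ISO-8859-1" becomes "pt_BR.UTF-8", "sr_RS@latin" becomes 
"sr_RS.UTF-8@latin", and an unset locale or plain "C" becomes "C.UTF-8".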

The reason for my preference in propagating to child processes is so that we 
have a consistent protocol 

Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2019-11-16 Thread André Pönitz
On Fri, Nov 15, 2019 at 05:47:04PM -0800, Thiago Macieira wrote:
> On Friday, 15 November 2019 16:23:24 PST André Pönitz wrote:
> > > The questions are:
> > > 1) do we want to prevent another library from accidentally unsetting it?
> > > 2) do we want child processes to use the same?
> > > 
> > > Note the answers for both questions must be the same, for the solution is
> > > the same. So either both yeses or both nos.
> > 
> > This "answers for both questions must be the same" requirement is arbitrary.
> > 
> > The fact that one known solution results in same answers to both is in
> > no way proof that no other solutions exist.
> 
> I don't see how to prevent another library doing setlocale(LC_ALL, "") from 
> overriding Qt's default other than to make setlocale(LC_ALL, "") do what 
> we want. Since what it does is read the environment, the only solution is to 
> change the environment.

You haven't even explained why this prevention would be needed or what exactly
would go wrong without it, and you cannot prevent the other library
from setting an explicit locale anyway.

With modifying the environment, you just catch the "" case, one out of many,
and I'll continue to argue that it's not Qt's business to try even that.

> > > Qt 6 will not have support for non-UTF-8 codecs, outside of Windows. You
> > > can either deal with binary data or with UTF-8 text, there's no middle
> > > ground.
> > Now that's an interesting twist.
> > 
> > The latest memo I did (not...) get was that codecs are to be moved into a
> > separate module. Which is actually ok, as it allows user code using codecs
> > to live on with minimal changes, and makes QtCore slimmer, kind of "no-loss
> > + win".
> 
> Sure. But that's no different than using ICU or writing your own code to 
> convert from binary to text. QString will not support it on its own.

> 
> > "Qt 6 will not have support for non-UTF-8 codecs, outside of Windows" is
> > definitely news to me. I've not seen this being discussed, neither here nor
> > within the part of the company that I usually talk to.
> 
> You just said it yourself, above.

I did not say that.

> If QTextCodec moves to another library, we have no codecs in QtCore.

Not having codecs in QtCore does not mean QtCore cannot use codecs.

One could have a setup where Qt Core just has the bare minimum, with stubs
for other codecs that are used when that QtCodecs lib is linked.

Actually, I had expected something like that to be the targeted
solution once I heard that text codecs would move out of QtCore.

> > So when and where was this decision made, by whom, and why?
> > 
> > Did that person bother to check e.g. whether Qt Creator uses non-UTF-8
> > codecs in some cases and did that person come to the conclusion that any
> > such use is bad and deserves to die?
> 
> Probably not. Why does Qt Creator need other codecs?

My guess would be to handle code bases that are not (a subset of) UTF-8.
 
> > > you're arguing that there are broken applications that won't handle
> > > C.UTF-8 correctly, without giving a single example.
> > 
> > ... is of course not true:
> > 
> > 1. I did not claim there were "broken" applications that won't handle
> >    C.UTF-8 "correctly", I claimed that there are applications that react
> >    differently to C.UTF-8.
> 
> Different behaviour is *exactly* what we want. We want this:

Who is 'we'?

> $ LC_ALL=C.UTF-8 ls á
> ls: cannot access 'á': No such file or directory
> 
> not this:
> 
> $ LC_ALL=C ls á
> ls: cannot access ''$'\303\241': No such file or directory

If you do not touch the environment, the user gets what he asked for.

He will most likely not want to see ''$'\303\241', but if he explicitly asks
for it in the environment he sets up, it's not Qt's job to override this.

> I thought the argument would be that despite being what we wanted,

Who is 'we'?

> it would break certain scenarios. But I haven't seen any examples of breakage.
> 
> >  gcc produces different output under C and C.UTF-8:
> > 
> >  echo x | LC_CTYPE=C gcc -xc -
> >   <stdin>:1:1: error: expected '=', ',', ';', 'asm' or '__attribute__'
> > at end of input
> > 
> >  echo x | LC_CTYPE=C.UTF-8 gcc -xc -
> >   <stdin>:1:1: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’
> > at end of input
> > 
> >  As an additional twist, this different behaviour does not require fancy
> > input, input is plain ASCII in both cases.
> > 
> >  Output parsers expecting "'" e.g. to produce a set of recommendations on
> > how to quick-fix such problems in an IDE will break.
> 
> Any application that is parsing GCC output is already setting LC_ALL in the 
> child process's environment.

Not necessarily, and if so, it's rather 'C', not 'C.UTF-8'.

> Otherwise, they'd be getting possibly translated 
> messages and we all know that the order of the messages could be different. 
> Not to mention that instead of "" or even “” we could see «» or „“.
 
Also, the point here is not this particular case. Each 

Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2019-11-16 Thread Thiago Macieira
On Friday, 15 November 2019 00:52:55 PST Eike Ziller wrote:
> - You state that as if that were a fact imposed on us from some external
> entity, and as if that patch were already in.

No, but that's the direction that started this conversation. If we're not 
going to do that, then the entire discussion is moot.

> - I thought QTextCodec will
> still be available, even if from a separate module. If that plan has
> changed, provide a patch for Qt Creator as well.

It will, but we'll probably need a session next week to discuss in what form. 
If we remove the codecs we kept and only use ICU, then QTextCodec will have 
negligible cost and could stay in QtCore.

If it stays in QtCore, we still have a question whether QString::fromLocal8Bit 
shall assume it's UTF-8 on Unix systems.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel System Software Products





Re: [Development] RFC: Defaulting to or enforcing UTF-8 locales on Unix systems

2019-11-16 Thread Thiago Macieira
On Friday, 15 November 2019 16:23:24 PST André Pönitz wrote:
> > The questions are:
> > 1) do we want to prevent another library from accidentally unsetting it?
> > 2) do we want child processes to use the same?
> > 
> > Note the answers for both questions must be the same, for the solution is
> > the same. So either both yeses or both nos.
> 
> This "answers for both questions must be the same" requirement is arbitrary.
> 
> The fact that one known solution results in same answers to both is in
> no way proof that no other solutions exist.

I don't see how to prevent another library doing setlocale(LC_ALL, "") from 
overriding Qt's default other than to make setlocale(LC_ALL, "") do what 
we want. Since what it does is read the environment, the only solution is to 
change the environment.
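
A minimal sketch of that approach, assuming a POSIX system with setenv(). 
This is not Qt code: export_utf8_locale is a hypothetical name, and C.UTF-8 is 
simply the default proposed in this thread. It may not be installed everywhere, 
in which case setlocale() returns NULL and the function reports 0.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L  /* for setenv() */
#endif
#include <locale.h>
#include <stdlib.h>

/* Exports LC_ALL=C.UTF-8 and re-reads the environment.
 * Returns -1 if the environment could not be changed, 0 if the variable
 * was exported but the C library rejected the locale, 1 on success. */
int export_utf8_locale(void)
{
    if (setenv("LC_ALL", "C.UTF-8", 1) != 0)
        return -1;
    /* "" tells setlocale() to take the locale from the environment,
     * which now carries the override. */
    return setlocale(LC_ALL, "") ? 1 : 0;
}
```

Because the environment itself changes, a later setlocale(LC_ALL, "") call in 
any other library, and any child process that inherits the environment, sees 
the same value. That is why the two questions get the same answer with this 
solution.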

> > Qt 6 will not have support for non-UTF-8 codecs, outside of Windows. You
> > can either deal with binary data or with UTF-8 text, there's no middle
> > ground.
> Now that's an interesting twist.
> 
> The latest memo I did (not...) get was that codecs are to be moved into a
> separate module. Which is actually ok, as it allows user code using codecs
> to live on with minimal changes, and makes QtCore slimmer, kind of "no-loss
> + win".

Sure. But that's no different than using ICU or writing your own code to 
convert from binary to text. QString will not support it on its own.

> "Qt 6 will not have support for non-UTF-8 codecs, outside of Windows" is
> definitely news to me. I've not seen this being discussed, neither here nor
> within the part of the company that I usually talk to.

You just said it yourself, above. If QTextCodec moves to another library, we have 
no codecs in QtCore. That means the rest of Qt will not support other codecs.

> So when and where was this decision made, by whom, and why?
> 
> Did that person bother to check e.g. whether Qt Creator uses non-UTF-8
> codecs in some cases and did that person come to the conclusion that any
> such use is bad and deserves to die?

Probably not. Why does Qt Creator need other codecs?

> > you're arguing that there are broken applications that won't handle
> > C.UTF-8 correctly, without giving a single example.
> 
> ... is of course not true:
> 
> 1. I did not claim there were "broken" applications that won't handle
>    C.UTF-8 "correctly", I claimed that there are applications that react
>    differently to C.UTF-8.

Different behaviour is *exactly* what we want. We want this:

$ LC_ALL=C.UTF-8 ls á
ls: cannot access 'á': No such file or directory

not this:

$ LC_ALL=C ls á
ls: cannot access ''$'\303\241': No such file or directory

I thought the argument would be that despite being what we wanted, it would 
break certain scenarios. But I haven't seen any examples of breakage.

>  gcc produces different output under C and C.UTF-8:
> 
>  echo x | LC_CTYPE=C gcc -xc -
>   <stdin>:1:1: error: expected '=', ',', ';', 'asm' or '__attribute__'
> at end of input
> 
>  echo x | LC_CTYPE=C.UTF-8 gcc -xc -
>   <stdin>:1:1: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’
> at end of input
> 
>  As an additional twist, this different behaviour does not require fancy
> input, input is plain ASCII in both cases.
> 
>  Output parsers expecting "'" e.g. to produce a set of recommendations on
> how to quick-fix such problems in an IDE will break.

Any application that is parsing GCC output is already setting LC_ALL in the 
child process's environment. Otherwise, they'd be getting possibly translated 
messages and we all know that the order of the messages could be different. 
Not to mention that instead of "" or even “” we could see «» or „“.

Changing the environment of a child process is not going to go away.

If you're telling me that you're setting the environment before the Qt 
application to cope with its brokenness, I will ask why that application 
hasn't been fixed in the 16 years since UTF-8 environments became a thing. And 
we can provide a way to force Qt not to set the environment, for those weird 
cases where you must deal with broken, proprietary cr#p that won't be fixed 
until the heat death of the Universe. And I will ask why everyone else must 
pay a performance price for the sake of those old, broken applications that 
even the maintainer isn't fixing anymore?

>  #include <locale.h>
>  #include <stdlib.h>
>  #include <string.h>
> 
>  int main()
>  {
>      if (strcmp((setlocale(LC_COLLATE, "")), "C") != 0)
>          abort();
>  }
> 
>  runs successfully under LC_ALL="C" and aborts under LC_ALL="C.UTF-8".

Strawman example, this doesn't happen in reality. See my exhaustive search for 
all such checks in an entire Linux distribution. I'm asking for *real* 
situations.

>  While contrived in this form, there _is_ code even in Creator checking
> for "C" literally, raising the suspicion that this might happen in other
> applications, too.

Oh, checking for "C" literally does exist, there were several in my search. 
About half