Re: [Development] QStringLiteral is broken(ish) on MSVC (compiler bug?)

2019-03-18 Thread Giuseppe D'Angelo via Development

On 18/03/2019 17:02, Matthew Woehlke wrote:
> On 16/03/2019 12.13, Giuseppe D'Angelo via Development wrote:
>> What I meant is this: during phase 5 and 6, are string literals
>> simply sequences of symbols from a set, or are they already encoded
>> in some encoding? From my reading, it's the former (the execution
>> character set is just this -- a set of symbols), and it's only after
>> phase 6 that those symbols are encoded in sequences of
>> char/char16_t/... values (depending on the string literal prefix).
>
> I would certainly read 5 as *implying* that at the conclusion of that
> phase, string literals have a definite encoding. *Not* applying that
> assumption seems to be how we get the broken MSVC behavior of
> "reinterpreting" a UTF-8 string as CP-1252.


Is there a "not" or a double negation somewhere? The way I read it, what 
_should_ happen is: at the end of phase 5 / 6, string literals are just 
sequences of symbols from a set. Encoding is still not a thing. After 
phase 7 (\0 appended at the end), then the string gets actually encoded 
as an array of char/char16_t/etc.


The fact that MSVC is not seeing sequences of symbols but encoded 
sequences at the end of phase 5, and then messing things up in phase 6, 
is IMHO the bug.
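
To make that reading concrete, here is a minimal sketch of the "symbols first, encoding last" model. It only models the interpretation argued for above, not how any compiler is implemented. `Utf16Units` and `encodeUtf16` are names made up for illustration, and the capacity of 8 code units is an arbitrary assumption.

```cpp
#include <cstddef>

// Hypothetical container for the encoded result (fixed capacity, sketch only).
struct Utf16Units {
    char16_t units[8] = {};
    std::size_t count = 0;
};

// Encode a sequence of code points ("symbols") as UTF-16.
// No validation and no bounds checking: callers must stay within 8 units.
constexpr Utf16Units encodeUtf16(const char32_t *symbols, std::size_t n)
{
    Utf16Units out;
    for (std::size_t i = 0; i < n; ++i) {
        const char32_t cp = symbols[i];
        if (cp < 0x10000) {
            out.units[out.count++] = static_cast<char16_t>(cp);
        } else {
            // Split into a surrogate pair.
            const char32_t v = cp - 0x10000;
            out.units[out.count++] = static_cast<char16_t>(0xD800 + (v >> 10));
            out.units[out.count++] = static_cast<char16_t>(0xDC00 + (v & 0x3FF));
        }
    }
    return out;
}

// u"" "\u0102": the two pieces contribute the symbols {} and {U+0102}.
// In this model concatenation happens on symbols, and encoding comes last.
constexpr char32_t concatenated[] = { 0x0102 };
constexpr Utf16Units encoded = encodeUtf16(concatenated, 1);
```

Under this model the result is the single code unit 0x0102, regardless of what the narrow execution character set happens to be.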


At the end of this... did anyone submit a bugreport against MSVC? Is it 
worth proposing a clarification against the Standard?


Cheers,
--
Giuseppe D'Angelo | giuseppe.dang...@kdab.com | Senior Software Engineer
KDAB (France) S.A.S., a KDAB Group company
Tel. France +33 (0)4 90 84 08 53, http://www.kdab.com
KDAB - The Qt, C++ and OpenGL Experts



___
Development mailing list
Development@qt-project.org
https://lists.qt-project.org/listinfo/development


Re: [Development] QStringLiteral is broken(ish) on MSVC (compiler bug?)

2019-03-18 Thread Matthew Woehlke
On 16/03/2019 12.13, Giuseppe D'Angelo via Development wrote:
> What I meant is this: during phase 5 and 6, are string literals 
> simply sequences of symbols from a set, or are they already encoded 
> in some encoding? From my reading, it's the former (the execution
> character set is just this -- a set of symbols), and it's only after
> phase 6 that those symbols are encoded in sequences of
> char/char16_t/... values (depending on the string literal prefix).
I would certainly read 5 as *implying* that at the conclusion of that
phase, string literals have a definite encoding. *Not* applying that
assumption seems to be how we get the broken MSVC behavior of
"reinterpreting" a UTF-8 string as CP-1252.

-- 
Matthew


Re: [Development] QStringLiteral is broken(ish) on MSVC (compiler bug?)

2019-03-16 Thread Giuseppe D'Angelo via Development

Hi,

On 15/03/19 19:09, Matthew Woehlke wrote:
>> The mapping of \u escape sequences to the execution character set
>> happens before string literal concatenation (translation phases 5/6).
>> But AFAIU the mapping is purely symbolic, and has nothing to do with any
>> actual encoding, so MSVC is at fault here?
>
> Why do you think it's "symbolic"? The standard clearly says "if there is
> no corresponding member [of the target character set], [the character]
> is converted to an implementation-defined member". That's obviously the
> case for the characters in question, so they get mapped to '?'.
>
> AFAICT, in my example (execution character set == CP-1252), MSVC is
> doing what the standard requires it to do. It's unfortunate that this
> isn't what the user wanted, but I don't see a "solution" except to swap
> phases 5 and 6. (But again, this does *not* apply to the ECS == UTF-8 case.)


What I meant is this: during phase 5 and 6, are string literals simply 
sequences of symbols from a set, or are they already encoded in some 
encoding? From my reading, it's the former (the execution character set 
is just this -- a set of symbols), and it's only after phase 6 that those 
symbols are encoded in sequences of char/char16_t/... values (depending 
on the string literal prefix).


My 2 c,

--
Giuseppe D'Angelo | giuseppe.dang...@kdab.com | Senior Software Engineer
KDAB (France) S.A.S., a KDAB Group company
Tel. France +33 (0)4 90 84 08 53, http://www.kdab.com
KDAB - The Qt, C++ and OpenGL Experts





Re: [Development] QStringLiteral is broken(ish) on MSVC (compiler bug?)

2019-03-15 Thread Thiago Macieira
On Friday, 15 March 2019 11:12:30 PDT Thiago Macieira wrote:
> On Friday, 15 March 2019 11:09:20 PDT Matthew Woehlke wrote:
> > (Note: Another solution is to redefine QT_UNICODE_LITERAL_II to `u ##
> > str`, but that's SIC.)
> 
> And doesn't help, because you *can* write
> 
>   QStringLiteral("a" "b" "c")

Extra note: you couldn't write this before MSVC 2013 was dropped from the 
Qt support matrix, as that compiler failed to concatenate string literals 
properly, as translation phase 6 requires.

It was a different problem with supporting the same part of the spec as the 
problem you've found.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel System Software Products





Re: [Development] QStringLiteral is broken(ish) on MSVC (compiler bug?)

2019-03-15 Thread Thiago Macieira
On Friday, 15 March 2019 11:09:20 PDT Matthew Woehlke wrote:
> (Note: Another solution is to redefine QT_UNICODE_LITERAL_II to `u ##
> str`, but that's SIC.)

And doesn't help, because you *can* write

QStringLiteral("a" "b" "c")
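
A minimal sketch of that point, assuming two hypothetical macros (`SKETCH_U16_CONCAT` and `SKETCH_U16_PASTE`, not Qt's actual names) that stand in for the two shapes under discussion. With a multi-piece argument, `##` attaches `u` only to the first piece. The remaining pieces stay unprefixed and still get their prefix through phase 6 concatenation, which is the step in question.

```cpp
#include <cstddef>

// Hypothetical stand-ins for the two macro shapes; not Qt's definitions.
#define SKETCH_U16_CONCAT(str) u"" str   // prefix supplied via concatenation
#define SKETCH_U16_PASTE(str)  u ## str  // prefix pasted onto the first token only

// Number of char16_t code units in a UTF-16 array, excluding the terminator.
template <std::size_t N>
constexpr std::size_t unitCount(const char16_t (&)[N]) { return N - 1; }

// Expands to: u"" "a" "b" "c"
constexpr std::size_t viaConcat = unitCount(SKETCH_U16_CONCAT("a" "b" "c"));

// Expands to: u"a" "b" "c"   ("b" and "c" are still unprefixed)
constexpr std::size_t viaPaste = unitCount(SKETCH_U16_PASTE("a" "b" "c"));
```

On a conforming compiler both forms yield the same three code units, so the paste variant only moves the prefix onto the first piece.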

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel System Software Products





Re: [Development] QStringLiteral is broken(ish) on MSVC (compiler bug?)

2019-03-15 Thread Matthew Woehlke
On 15/03/2019 08.27, Giuseppe D'Angelo via Development wrote:
> On 14/03/19 22:48, Thiago Macieira wrote:
>> For
>>
>>    char16_t text1[] = u"" "\u0102";
>>
>> It produces, without /utf-8 (see https://msvc.godbolt.org/z/EvtKzq):
>>
>> ?text1@@3PA_SA DB '?', 00H, 00H, 00H    ; text1
>>
>> And with /utf-8:
>>
>> ?text1@@3PA_SA DB 0c4H, 00H, 01aH, ' ', 00H, 00H    ; text1
>>
>> Those two values make no sense. U+0102 is neither 0x003f (question 
>> mark) nor 0x00c4 0x201a ("Ä‚"). This is a clear compiler bug. An
>> interpretation of the C++11 standard could say that the translation
>> is correct for the no-/utf-8 build, 

In fact, I now believe that to be the case (if unfortunate); note
[lex.phases]¶1.5 and also
https://groups.google.com/a/isocpp.org/d/msg/std-discussion/qYf6treuLmY/EeLI6bqTCwAJ.

>> but with /utf-8 or /execution-charset:utf-8 it should have produced
>> the correct result.
> 
> Actually, those values do have some connection with the input. Looks
> like MSVC is double-encoding it:
> 
> * "\u0102" under UTF-8 execution charset produces a string containing
> 0xC4 0x82;
> 
> * that string literal is a generic narrow string literal (non prefixed).
> When concatenating to a u-prefixed string literal, somehow MSVC thinks
> it's in its native codepage instead of UTF-8...

*That* smells buggy. I think I'll stick to /we4566 and adding the extra
'u' if my QStringLiteral is non-ASCII so that I'm not hitting this case.
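
A rough sketch of that workaround, assuming the macro expands to something of the shape `u"" str` (which is what the `u"" "\u0102"` test case mirrors). `SKETCH_U16` is a hypothetical stand-in, not Qt's macro.

```cpp
// Hypothetical stand-in for a macro of the shape u"" str.
#define SKETCH_U16(str) u"" str

// Unprefixed argument: the narrow piece has to inherit the u prefix during
// concatenation, which is the step reported to go wrong on MSVC.
constexpr char16_t inherited[] = SKETCH_U16("\u0102");

// Explicitly prefixed argument: both pieces are already char16_t literals,
// so no narrow literal is involved at any point.
constexpr char16_t explicitPrefix[] = SKETCH_U16(u"\u0102");
```

On a conforming compiler the two arrays are identical. The explicit prefix only matters on a compiler that mishandles the unprefixed piece.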

> The mapping of \u escape sequences to the execution character set
> happens before string literal concatenation (translation phases 5/6).
> But AFAIU the mapping is purely symbolic, and has nothing to do with any
> actual encoding, so MSVC is at fault here?

Why do you think it's "symbolic"? The standard clearly says "if there is
no corresponding member [of the target character set], [the character]
is converted to an implementation-defined member". That's obviously the
case for the characters in question, so they get mapped to '?'.

AFAICT, in my example (execution character set == CP-1252), MSVC is
doing what the standard requires it to do. It's unfortunate that this
isn't what the user wanted, but I don't see a "solution" except to swap
phases 5 and 6. (But again, this does *not* apply to the ECS == UTF-8 case.)

(Note: Another solution is to redefine QT_UNICODE_LITERAL_II to `u ##
str`, but that's SIC.)

-- 
Matthew


Re: [Development] QStringLiteral is broken(ish) on MSVC (compiler bug?)

2019-03-15 Thread Thiago Macieira
On Friday, 15 March 2019 05:27:09 PDT Giuseppe D'Angelo via Development wrote:
> The mapping of \u escape sequences to the execution character set
> happens before string literal concatenation (translation phases 5/6).
> But AFAIU the mapping is purely symbolic, and has nothing to do with any
> actual encoding, so MSVC is at fault here?

The people from the SG16 in the committee think it is and are preparing a 
paper to clarify. They came to the same conclusion regarding the steps the 
compiler performed as you did, but those steps still lead to an absurd result. 
Why in the world would anyone want the UTF-16 representation of the UTF-8 
encoding of something?

The point is that the compiler had 0xC4 0x82, knew it was UTF-8 and was being 
asked to provide a UTF-16 representation. It should have performed a 
UTF-8-to-UTF-16 transformation, not a CP1252-to-UTF-16 one.
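
The two transformations can be put side by side in a small sketch. Both helpers are hypothetical and deliberately partial: the UTF-8 decoder handles only two-byte sequences, and the CP1252 mapping special-cases only the one byte from this thread that differs from Latin-1.

```cpp
#include <cstdint>

// Decode a two-byte UTF-8 sequence (110xxxxx 10xxxxxx) to a code point.
// Sketch only: no validation, no other sequence lengths.
constexpr char32_t decodeUtf8TwoBytes(std::uint8_t lead, std::uint8_t trail)
{
    return static_cast<char32_t>(((lead & 0x1F) << 6) | (trail & 0x3F));
}

// Map a single CP1252 byte to Unicode. Only 0x82 is special-cased here; a
// real table has more entries in 0x80..0x9F, so the fallback is NOT correct
// for the rest of that range.
constexpr char16_t cp1252ToUnicode(std::uint8_t byte)
{
    return byte == 0x82 ? char16_t(0x201A)  // SINGLE LOW-9 QUOTATION MARK
                        : char16_t(byte);   // e.g. 0xC4 -> U+00C4
}

// What should have happened: 0xC4 0x82 read as UTF-8.
constexpr char32_t expectedCodePoint = decodeUtf8TwoBytes(0xC4, 0x82);

// What the output suggests happened: each byte read as CP1252.
constexpr char16_t observedFirst  = cp1252ToUnicode(0xC4);
constexpr char16_t observedSecond = cp1252ToUnicode(0x82);
```

The first path gives U+0102, the second gives U+00C4 U+201A, which matches the /utf-8 output quoted earlier in the thread.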

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel System Software Products





Re: [Development] QStringLiteral is broken(ish) on MSVC (compiler bug?)

2019-03-15 Thread Giuseppe D'Angelo via Development

On 14/03/19 22:48, Thiago Macieira wrote:
> For
>
>    char16_t text1[] = u"" "\u0102";
>
> It produces, without /utf-8 (see https://msvc.godbolt.org/z/EvtKzq):
>
> ?text1@@3PA_SA DB '?', 00H, 00H, 00H; text1
>
> And with /utf-8:
>
> ?text1@@3PA_SA DB 0c4H, 00H, 01aH, ' ', 00H, 00H; text1
>
> Those two values make no sense. U+0102 is neither 0x003f (question mark) nor
> 0x00c4 0x201a ("Ä‚"). This is a clear compiler bug. An interpretation of the
> C++11 standard could say that the translation is correct for the no-/utf-8
> build, but with /utf-8 or /execution-charset:utf-8 it should have produced the
> correct result.



Actually, those values do have some connection with the input. Looks 
like MSVC is double-encoding it:


* "\u0102" under UTF-8 execution charset produces a string containing 
0xC4 0x82;


* that string literal is a generic narrow string literal (non prefixed). 
When concatenating to a u-prefixed string literal, somehow MSVC thinks 
it's in its native codepage instead of UTF-8...


* so it now reencodes 0xC4 0x82 from CP1252 to UTF-16, yielding
0x00 0xC4 0x20 0x1A, which is what ends up in text1 (modulo endianness)
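
The endianness remark can be checked with a small sketch that lays out the suspected code units as little-endian bytes, the way the assembly listing shows them. `Bytes` and `toLittleEndian` are hypothetical helpers, and the capacity of 8 bytes is an arbitrary assumption.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical fixed-capacity byte buffer (sketch only, no bounds checking).
struct Bytes {
    std::uint8_t data[8] = {};
    std::size_t size = 0;
};

// Serialize UTF-16 code units as little-endian bytes, low byte first.
constexpr Bytes toLittleEndian(const char16_t *units, std::size_t n)
{
    Bytes out;
    for (std::size_t i = 0; i < n; ++i) {
        out.data[out.size++] = static_cast<std::uint8_t>(units[i] & 0xFF);
        out.data[out.size++] = static_cast<std::uint8_t>(units[i] >> 8);
    }
    return out;
}

// U+00C4, U+201A and the terminating NUL.
constexpr char16_t observed[] = { 0x00C4, 0x201A, 0x0000 };
constexpr Bytes listing = toLittleEndian(observed, 3);
```

The resulting bytes are C4 00 1A 20 00 00. Since 0x20 is the ASCII space, this lines up with `DB 0c4H, 00H, 01aH, ' ', 00H, 00H`.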

The mapping of \u escape sequences to the execution character set 
happens before string literal concatenation (translation phases 5/6). 
But AFAIU the mapping is purely symbolic, and has nothing to do with any 
actual encoding, so MSVC is at fault here?


My 2 c,

--
Giuseppe D'Angelo | giuseppe.dang...@kdab.com | Senior Software Engineer
KDAB (France) S.A.S., a KDAB Group company
Tel. France +33 (0)4 90 84 08 53, http://www.kdab.com
KDAB - The Qt, C++ and OpenGL Experts





Re: [Development] QStringLiteral is broken(ish) on MSVC (compiler bug?)

2019-03-14 Thread Thiago Macieira
On Thursday, 14 March 2019 13:54:29 PDT NIkolai Marchenko wrote:
> I've posted about this issue (I think) on slack a bit earlier, see
> https://cpplang.slack.com/archives/C29936TQC/p154989901601

For those who can't read it, the suggestion was to use the /utf-8 option to 
the compiler (with qmake, CONFIG += utf8_source). But a quick round of testing 
does not show correct results. For 

  char16_t text1[] = u"" "\u0102";

It produces, without /utf-8 (see https://msvc.godbolt.org/z/EvtKzq):

?text1@@3PA_SA DB '?', 00H, 00H, 00H; text1

And with /utf-8:

?text1@@3PA_SA DB 0c4H, 00H, 01aH, ' ', 00H, 00H; text1

Those two values make no sense. U+0102 is neither 0x003f (question mark) nor 
0x00c4 0x201a ("Ä‚"). This is a clear compiler bug. An interpretation of the 
C++11 standard could say that the translation is correct for the no-/utf-8 
build, but with /utf-8 or /execution-charset:utf-8 it should have produced the 
correct result.

C++11 2.14.5 [lex.string]/13 (now 5.13.5/12 [1]) says:

"If one string-literal has no encoding-prefix, it is treated as a string-
literal of the same encoding-prefix as the other operand."

In table 9:
    u"a" "b"    is the same as    u"ab"
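
That rule can be written down as a compile-time check. This is a sketch for a conforming compiler; `sameUnits` is a hypothetical helper introduced only for the comparison.

```cpp
#include <cstddef>
#include <type_traits>

// Compare two UTF-16 arrays of equal length, terminator included.
template <std::size_t N>
constexpr bool sameUnits(const char16_t (&a)[N], const char16_t (&b)[N])
{
    for (std::size_t i = 0; i < N; ++i)
        if (a[i] != b[i])
            return false;
    return true;
}

// The unprefixed "b" takes on the u prefix of its neighbour, so the
// concatenated literal has the same type and contents as u"ab".
static_assert(std::is_same<decltype(u"a" "b"), decltype(u"ab")>::value,
              "both literals have the same array type");
static_assert(sameUnits(u"a" "b", u"ab"), "both literals have the same code units");
```

By the same rule, `u"" "\u0102"` ought to be indistinguishable from `u"\u0102"`.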

[1] http://eel.is/c++draft/lex.string#12
-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel System Software Products





Re: [Development] QStringLiteral is broken(ish) on MSVC (compiler bug?)

2019-03-14 Thread NIkolai Marchenko
I've posted about this issue (I think) on slack a bit earlier, see
https://cpplang.slack.com/archives/C29936TQC/p154989901601

On Thu, Mar 14, 2019 at 11:51 PM Matthew Woehlke wrote:

> While working on some modernization of my application — in particular,
> converting some UTF-8 literals to use QStringLiteral — I noticed a
> concerning compiler warning:
>
>   warning C4566: character represented by universal-character-name
>   '\u' cannot be represented in the current code page (1252)
>
> After doing some testing, it turns out that, given code like
> QStringLiteral("\u269E \U0001f387 \u269F"), MSVC is indeed butchering
> the string.
>
> Further investigation shows that the problem seems to be with the
> implementation of QStringLiteral. In particular, it appears that the
> preprocessor initially sees just the raw string literal without the 'u'
> prefix, butchers it, then later "promotes" it to a UTF-16 literal, but
> by then the damage has been done.
>
> While this absolutely feels like a compiler bug, it's an *awful* big
> gotcha that probably should be documented. Also, is there anything that
> Qt can do to work around it? (I know these sorts of macro expansions can
> be tricksy...)
>
> Note: and the *local* work-around is apparently to include the 'u'
> prefix on my own literal; apparently doubling it (`uu"stuff"`) is okay.
>
> --
> Matthew
> ___
> Development mailing list
> Development@qt-project.org
> https://lists.qt-project.org/listinfo/development
>


Re: [Development] QStringLiteral is broken(ish) on MSVC (compiler bug?)

2019-03-14 Thread Matthew Woehlke
On 14/03/2019 16.50, Matthew Woehlke wrote:
> While working on some modernization of my application — in particular,
> converting some UTF-8 literals to use QStringLiteral — I noticed a
> concerning compiler warning:
> 
>   warning C4566: character represented by universal-character-name
>   '\u' cannot be represented in the current code page (1252)
> 
> After doing some testing, it turns out that, given code like
> QStringLiteral("\u269E \U0001f387 \u269F"), MSVC is indeed butchering
> the string.
> 
> Further investigation shows that the problem seems to be with the
> implementation of QStringLiteral. In particular, it appears that the
> preprocessor initially sees just the raw string literal without the 'u'
> prefix, butchers it, then later "promotes" it to a UTF-16 literal, but
> by then the damage has been done.
> 
> While this absolutely feels like a compiler bug, it's an *awful* big
> gotcha that probably should be documented. Also, is there anything that
> Qt can do to work around it? (I know these sorts of macro expansions can
> be tricksy...)
> 
> Note: and the *local* work-around is apparently to include the 'u'
> prefix on my own literal; apparently doubling it (`uu"stuff"`) is okay.

...forgot to mention; previous mail attempted to have a complete test
case attached, except I accidentally applied the work-around and saved
it before sending the message. So, to see the problem, build the
attached source (with VC++), but remove the 'u' prefix from the
QStringLiteral.
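
The attachment is not reproduced here. As a rough Qt-free stand-in (an assumption, not the attached file), the following sketch uses a hypothetical `SKETCH_U16` macro of the shape `u"" str` and compares the literal from the original report against its expected UTF-16 code units.

```cpp
#include <cstddef>

// Hypothetical stand-in for a macro of the shape u"" str; not Qt's macro.
#define SKETCH_U16(str) u"" str

// The literal from the report, passed without a 'u' prefix of its own.
constexpr char16_t text[] = SKETCH_U16("\u269E \U0001f387 \u269F");

// Expected UTF-16: U+1F387 becomes the surrogate pair D83C DF87.
constexpr char16_t expected[] = {
    0x269E, 0x0020, 0xD83C, 0xDF87, 0x0020, 0x269F, 0x0000
};

// True if the compiler produced exactly the expected code units.
constexpr bool matchesExpected()
{
    constexpr std::size_t n = sizeof(expected) / sizeof(expected[0]);
    if (sizeof(text) / sizeof(text[0]) != n)
        return false;
    for (std::size_t i = 0; i < n; ++i)
        if (text[i] != expected[i])
            return false;
    return true;
}
```

A `static_assert(matchesExpected(), "...")` passes on a conforming compiler. It would presumably fire on an affected MSVC configuration, though that is untested here.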

-- 
Matthew