Re: [Development] QStringLiteral is broken(ish) on MSVC (compiler bug?)
On 18/03/2019 17:02, Matthew Woehlke wrote:
> On 16/03/2019 12.13, Giuseppe D'Angelo via Development wrote:
>> What I meant is this: during phase 5 and 6, are string literals
>> simply sequences of symbols from a set, or are they already encoded
>> in some encoding? From my reading, it's the former (the execution
>> character set is just this -- a set of symbols), and it's only after
>> phase 6 that those symbols are encoded in sequences of
>> char/char16_t/... values (depending on the string literal prefix).
>
> I would certainly read 5 as *implying* that at the conclusion of that
> phase, string literals have a definite encoding. *Not* applying that
> assumption seems to be how we get the broken MSVC behavior of
> "reinterpreting" a UTF-8 string as CP-1252.

Is there a "not" or a double negation somewhere? The way I read it, what _should_ happen is: at the end of phases 5 and 6, string literals are just sequences of symbols from a set. Encoding is still not a thing. After phase 7 (\0 appended at the end), the string actually gets encoded as an array of char/char16_t/etc.

The fact that MSVC sees encoded sequences rather than sequences of symbols at the end of phase 5, and then messes things up in phase 6, is IMHO the bug.

At the end of all this... did anyone submit a bug report against MSVC? Is it worth proposing a clarification to the Standard?

Cheers,
--
Giuseppe D'Angelo | giuseppe.dang...@kdab.com | Senior Software Engineer
KDAB (France) S.A.S., a KDAB Group company
Tel. France +33 (0)4 90 84 08 53, http://www.kdab.com
KDAB - The Qt, C++ and OpenGL Experts
___
Development mailing list
Development@qt-project.org
https://lists.qt-project.org/listinfo/development
Re: [Development] QStringLiteral is broken(ish) on MSVC (compiler bug?)
On 16/03/2019 12.13, Giuseppe D'Angelo via Development wrote:
> What I meant is this: during phase 5 and 6, are string literals
> simply sequences of symbols from a set, or are they already encoded
> in some encoding? From my reading, it's the former (the execution
> character set is just this -- a set of symbols), and it's only after
> phase 6 that those symbols are encoded in sequences of
> char/char16_t/... values (depending on the string literal prefix).

I would certainly read 5 as *implying* that at the conclusion of that phase, string literals have a definite encoding. *Not* applying that assumption seems to be how we get the broken MSVC behavior of "reinterpreting" a UTF-8 string as CP-1252.

--
Matthew
Re: [Development] QStringLiteral is broken(ish) on MSVC (compiler bug?)
Hi,

On 15/03/19 19:09, Matthew Woehlke wrote:
>> The mapping of \u escape sequences to the execution character set
>> happens before string literal concatenation (translation phases 5/6).
>> But AFAIU the mapping is purely symbolic, and has nothing to do with any
>> actual encoding, so MSVC is at fault here?
>
> Why do you think it's "symbolic"? The standard clearly says "if there is
> no corresponding member [of the target character set], [the character]
> is converted to an implementation-defined member". That's obviously the
> case for the characters in question, so they get mapped to '?'.
>
> AFAICT, in my example (execution character set == CP-1252), MSVC is
> doing what the standard requires it to do. It's unfortunate that this
> isn't what the user wanted, but I don't see a "solution" except to swap
> phases 5 and 6. (But again, this does *not* apply to the ECS == UTF-8
> case.)

What I meant is this: during phase 5 and 6, are string literals simply sequences of symbols from a set, or are they already encoded in some encoding? From my reading, it's the former (the execution character set is just this -- a set of symbols), and it's only after phase 6 that those symbols are encoded in sequences of char/char16_t/... values (depending on the string literal prefix).

My 2 c,
--
Giuseppe D'Angelo | giuseppe.dang...@kdab.com | Senior Software Engineer
KDAB (France) S.A.S., a KDAB Group company
Tel. France +33 (0)4 90 84 08 53, http://www.kdab.com
KDAB - The Qt, C++ and OpenGL Experts
Re: [Development] QStringLiteral is broken(ish) on MSVC (compiler bug?)
On Friday, 15 March 2019 11:12:30 PDT Thiago Macieira wrote:
> On Friday, 15 March 2019 11:09:20 PDT Matthew Woehlke wrote:
> > (Note: Another solution is to redefine QT_UNICODE_LITERAL_II to `u ##
> > str`, but that's SIC.)
>
> And doesn't help, because you *can* write
>
>     QStringLiteral("a" "b" "c")

Extra note: you couldn't write this before MSVC 2013 was dropped from the Qt support matrix, as that compiler failed to concatenate strings properly, as Lexer Phase 6 requires. It was a different problem from the one you've found, but in the same part of the spec.

--
Thiago Macieira - thiago.macieira (AT) intel.com
Software Architect - Intel System Software Products
Re: [Development] QStringLiteral is broken(ish) on MSVC (compiler bug?)
On Friday, 15 March 2019 11:09:20 PDT Matthew Woehlke wrote:
> (Note: Another solution is to redefine QT_UNICODE_LITERAL_II to `u ##
> str`, but that's SIC.)

And doesn't help, because you *can* write

    QStringLiteral("a" "b" "c")

--
Thiago Macieira - thiago.macieira (AT) intel.com
Software Architect - Intel System Software Products
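To make the point about `u ## str` concrete, here is a minimal sketch comparing the two macro shapes. LIT_CONCAT and LIT_PASTE are hypothetical names invented for this illustration, not Qt's macros; LIT_CONCAT is only assumed to resemble the `u"" str` shape that the thread's test case implies.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical macros for illustration only; these are not Qt's definitions.
// LIT_CONCAT relies on phase-6 concatenation with an empty u"" literal.
#define LIT_CONCAT(str) u"" str
// LIT_PASTE glues the u prefix onto the first token of the argument only.
#define LIT_PASTE(str) u ## str

// LIT_PASTE("a" "b" "c") expands to u"a" "b" "c": only "a" carries the
// prefix, so "b" and "c" remain unprefixed literals that still depend on
// the compiler concatenating them correctly.
constexpr char16_t viaConcat[] = LIT_CONCAT("a" "b" "c");
constexpr char16_t viaPaste[]  = LIT_PASTE("a" "b" "c");

// Compare two UTF-16 arrays of equal length, code unit by code unit.
template <std::size_t N>
bool sameUnits(const char16_t (&a)[N], const char16_t (&b)[N])
{
    for (std::size_t i = 0; i < N; ++i)
        if (a[i] != b[i])
            return false;
    return true;
}

// On a conforming compiler both spellings yield the same array, u"abc".
inline void checkMacros()
{
    assert(sameUnits(viaConcat, viaPaste));
    assert(viaConcat[2] == u'c');
}
```

With ASCII-only content the two forms are indistinguishable, which is why token pasting alone would not protect a multi-literal argument containing universal-character-names.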
Re: [Development] QStringLiteral is broken(ish) on MSVC (compiler bug?)
On 15/03/2019 08.27, Giuseppe D'Angelo via Development wrote:
> On 14/03/19 22:48, Thiago Macieira wrote:
>> For
>>
>>     char16_t text1[] = u"" "\u0102";
>>
>> It produces, without /utf-8 (see https://msvc.godbolt.org/z/EvtKzq):
>>
>>     ?text1@@3PA_SA DB '?', 00H, 00H, 00H ; text1
>>
>> And with /utf-8:
>>
>>     ?text1@@3PA_SA DB 0c4H, 00H, 01aH, ' ', 00H, 00H ; text1
>>
>> Those two values make no sense. U+0102 is neither 0x003f (question
>> mark) nor 0x00c4 0x201a ("Ä‚"). This is a clear compiler bug. An
>> interpretation of the C++11 standard could say that the translation
>> is correct for the no-/utf-8 build,

In fact, I now believe that to be the case (if unfortunate); note [lex.phases]¶1.5 and also https://groups.google.com/a/isocpp.org/d/msg/std-discussion/qYf6treuLmY/EeLI6bqTCwAJ.

>> but with /utf-8 or /execution-charset:utf-8 it should have produced
>> the correct result.
>
> Actually, those values do have some connection with the input. Looks
> like MSVC is double-encoding it:
>
> * "\u0102" under UTF-8 execution charset produces a string containing
>   0xC4 0x82;
>
> * that string literal is a generic narrow string literal (non prefixed).
>   When concatenating to a u-prefixed string literal, somehow MSVC thinks
>   it's in its native codepage instead of UTF-8...

*That* smells buggy. I think I'll stick to /we4566 and adding the extra 'u' if my QStringLiteral is non-ASCII so that I'm not hitting this case.

> The mapping of \u escape sequences to the execution character set
> happens before string literal concatenation (translation phases 5/6).
> But AFAIU the mapping is purely symbolic, and has nothing to do with any
> actual encoding, so MSVC is at fault here?

Why do you think it's "symbolic"? The standard clearly says "if there is no corresponding member [of the target character set], [the character] is converted to an implementation-defined member". That's obviously the case for the characters in question, so they get mapped to '?'.

AFAICT, in my example (execution character set == CP-1252), MSVC is doing what the standard requires it to do. It's unfortunate that this isn't what the user wanted, but I don't see a "solution" except to swap phases 5 and 6. (But again, this does *not* apply to the ECS == UTF-8 case.)

(Note: Another solution is to redefine QT_UNICODE_LITERAL_II to `u ## str`, but that's SIC.)

--
Matthew
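As a rough sketch of the local workaround (C4566 as an error plus an explicit 'u' on non-ASCII literals): the pragma below is the source-level way to promote an MSVC warning to an error and is guarded so other compilers ignore it; whether it behaves identically to /we4566 in every MSVC version is an assumption I have not verified. MY_LITERAL is a hypothetical stand-in for a QStringLiteral-style macro, not Qt's definition.

```cpp
#include <cassert>

#ifdef _MSC_VER
// Source-level counterpart of /we4566: treat "character cannot be
// represented in the current code page" as a hard error.
#pragma warning(error : 4566)
#endif

// Hypothetical stand-in for a QStringLiteral-style macro; assumed shape only.
#define MY_LITERAL(str) u"" str

// The argument already carries the u prefix, so the expansion is
// u"" u"\u0102": both operands are UTF-16 literals and no unprefixed
// literal takes part in the concatenation.
constexpr char16_t prefixed[] = MY_LITERAL(u"\u0102");

static_assert(sizeof(prefixed) == 2 * sizeof(char16_t),
              "one code unit plus the terminating NUL");

// Runtime check that the single code unit is U+0102.
inline void checkPrefixed()
{
    assert(prefixed[0] == 0x0102);
    assert(prefixed[1] == 0);
}
```

The idea is that any literal that still lacks the prefix and contains an unrepresentable character stops the build instead of silently becoming '?'.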
Re: [Development] QStringLiteral is broken(ish) on MSVC (compiler bug?)
On Friday, 15 March 2019 05:27:09 PDT Giuseppe D'Angelo via Development wrote:
> The mapping of \u escape sequences to the execution character set
> happens before string literal concatenation (translation phases 5/6).
> But AFAIU the mapping is purely symbolic, and has nothing to do with any
> actual encoding, so MSVC is at fault here?

The people from SG16 in the committee think it is and are preparing a paper to clarify. They came to the same conclusion as you did regarding the steps the compiler performed, but those steps still lead to an absurd result. Why in the world would anyone want the UTF-16 representation of the UTF-8 encoding of something?

The point is that the compiler had 0xC4 0x82, knew it was UTF-8 and was being asked to provide a UTF-16 representation. It should have performed a UTF-8-to-UTF-16 transformation, not a CP1252-to-UTF-16 one.

--
Thiago Macieira - thiago.macieira (AT) intel.com
Software Architect - Intel System Software Products
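For reference, the UTF-8-to-UTF-16 transformation being asked for can be sketched for this one case. decodeUtf8TwoByte is a hypothetical helper written for this illustration; it handles only two-byte sequences and is not how any compiler is known to implement the step.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: decode one two-byte UTF-8 sequence (110xxxxx 10xxxxxx)
// into the single UTF-16 code unit it denotes. Returns U+FFFD (replacement
// character) if the bytes are not a well-formed, non-overlong two-byte
// sequence.
constexpr char16_t decodeUtf8TwoByte(std::uint8_t lead, std::uint8_t trail)
{
    return ((lead & 0xE0) == 0xC0 && lead >= 0xC2 && (trail & 0xC0) == 0x80)
        ? static_cast<char16_t>(((lead & 0x1F) << 6) | (trail & 0x3F))
        : static_cast<char16_t>(0xFFFD);
}

// 0xC4 0x82 is the UTF-8 encoding of U+0102, so decoding it as UTF-8
// gives back the single code unit 0x0102.
static_assert(decodeUtf8TwoByte(0xC4, 0x82) == 0x0102,
              "UTF-8 C4 82 decodes to U+0102");

// Runtime version of the same check, plus a malformed input.
inline void checkDecode()
{
    assert(decodeUtf8TwoByte(0xC4, 0x82) == 0x0102);
    assert(decodeUtf8TwoByte(0xC4, 0x41) == 0xFFFD);
}
```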
Re: [Development] QStringLiteral is broken(ish) on MSVC (compiler bug?)
On 14/03/19 22:48, Thiago Macieira wrote:
> For
>
>     char16_t text1[] = u"" "\u0102";
>
> It produces, without /utf-8 (see https://msvc.godbolt.org/z/EvtKzq):
>
>     ?text1@@3PA_SA DB '?', 00H, 00H, 00H ; text1
>
> And with /utf-8:
>
>     ?text1@@3PA_SA DB 0c4H, 00H, 01aH, ' ', 00H, 00H ; text1
>
> Those two values make no sense. U+0102 is neither 0x003f (question
> mark) nor 0x00c4 0x201a ("Ä‚"). This is a clear compiler bug. An
> interpretation of the C++11 standard could say that the translation
> is correct for the no-/utf-8 build, but with /utf-8 or
> /execution-charset:utf-8 it should have produced the correct result.

Actually, those values do have some connection with the input. Looks like MSVC is double-encoding it:

* "\u0102" under UTF-8 execution charset produces a string containing 0xC4 0x82;

* that string literal is a generic narrow string literal (non prefixed). When concatenating to a u-prefixed string literal, somehow MSVC thinks it's in its native codepage instead of UTF-8...

* so it now reencodes 0xC4 0x82 from CP1252 to UTF-16, yielding 0x00 0xC4 0x20 0x1a, which is what ends up in text1 (once endianness is accounted for).

The mapping of \u escape sequences to the execution character set happens before string literal concatenation (translation phases 5/6). But AFAIU the mapping is purely symbolic, and has nothing to do with any actual encoding, so MSVC is at fault here?

My 2 c,
--
Giuseppe D'Angelo | giuseppe.dang...@kdab.com | Senior Software Engineer
KDAB (France) S.A.S., a KDAB Group company
Tel. France +33 (0)4 90 84 08 53, http://www.kdab.com
KDAB - The Qt, C++ and OpenGL Experts
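The double-encoding steps listed above can be reproduced by hand. This is only a sketch of the suspected behaviour, not of MSVC's actual code: cp1252ToUtf16 is a hypothetical helper, and it special-cases just the one byte this example needs.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: map a single CP-1252 byte to a UTF-16 code unit.
// Only 0x82 gets special handling; every other byte is passed through
// unchanged, which is right for 0x00-0x7F and 0xA0-0xFF but NOT for the
// remaining bytes in 0x80-0x9F (deliberately left out of this sketch).
constexpr char16_t cp1252ToUtf16(std::uint8_t byte)
{
    return byte == 0x82 ? char16_t(0x201A)   // SINGLE LOW-9 QUOTATION MARK
                        : char16_t(byte);
}

// Low and high byte of a code unit, as they appear in little-endian memory.
constexpr std::uint8_t lowByte(char16_t u)  { return static_cast<std::uint8_t>(u & 0xFF); }
constexpr std::uint8_t highByte(char16_t u) { return static_cast<std::uint8_t>(u >> 8); }

// Reading the UTF-8 bytes C4 82 as CP-1252 gives U+00C4 and U+201A.
static_assert(cp1252ToUtf16(0xC4) == 0x00C4, "0xC4 is A with diaeresis");
static_assert(cp1252ToUtf16(0x82) == 0x201A, "0x82 is the low-9 quote");

// Check that the little-endian bytes match the listing:
// DB 0c4H, 00H, 01aH, ' ' (a space is 0x20).
inline void checkListingBytes()
{
    assert(lowByte(cp1252ToUtf16(0xC4)) == 0xC4);
    assert(highByte(cp1252ToUtf16(0xC4)) == 0x00);
    assert(lowByte(cp1252ToUtf16(0x82)) == 0x1A);
    assert(highByte(cp1252ToUtf16(0x82)) == ' ');
}
```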
Re: [Development] QStringLiteral is broken(ish) on MSVC (compiler bug?)
On Thursday, 14 March 2019 13:54:29 PDT NIkolai Marchenko wrote:
> I've posted about this issue (I think) on slack a bit earlier, see
> https://cpplang.slack.com/archives/C29936TQC/p154989901601

For those who can't read it, the suggestion was to use the /utf-8 option to the compiler (with qmake, CONFIG += utf8_source). But a quick round of testing does not show correct results. For

    char16_t text1[] = u"" "\u0102";

it produces, without /utf-8 (see https://msvc.godbolt.org/z/EvtKzq):

    ?text1@@3PA_SA DB '?', 00H, 00H, 00H ; text1

And with /utf-8:

    ?text1@@3PA_SA DB 0c4H, 00H, 01aH, ' ', 00H, 00H ; text1

Those two values make no sense. U+0102 is neither 0x003f (question mark) nor 0x00c4 0x201a ("Ä‚"). This is a clear compiler bug. An interpretation of the C++11 standard could say that the translation is correct for the no-/utf-8 build, but with /utf-8 or /execution-charset:utf-8 it should have produced the correct result.

C++11 2.14.5 [lex.string]/13 (now 5.13.5/12 [1]) says:
"If one string-literal has no encoding-prefix, it is treated as a string-literal of the same encoding-prefix as the other operand."

In table 9:
    u"a" "b" is the same as u"ab"

[1] http://eel.is/c++draft/lex.string#12

--
Thiago Macieira - thiago.macieira (AT) intel.com
Software Architect - Intel System Software Products
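The expectation drawn from [lex.string] can be written down as compile-time checks. This is a minimal sketch: on a compiler that behaves as described for MSVC in this message, the second static_assert would be expected to fail, which is the point of the test case.

```cpp
#include <cassert>

// The literal from the message above: an empty UTF-16 literal followed by an
// unprefixed literal holding one universal-character-name.
constexpr char16_t text1[] = u"" "\u0102";

// What the quoted rule says it should be equivalent to.
constexpr char16_t expected[] = u"\u0102";

static_assert(sizeof(text1) == sizeof(expected),
              "one code unit plus the terminating NUL");
static_assert(text1[0] == 0x0102,
              "the unprefixed operand is treated as a u literal");

// Runtime comparison of the two arrays, code unit by code unit.
inline void checkText1()
{
    assert(text1[0] == expected[0]);
    assert(text1[1] == expected[1]);
}
```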
Re: [Development] QStringLiteral is broken(ish) on MSVC (compiler bug?)
I've posted about this issue (I think) on slack a bit earlier, see
https://cpplang.slack.com/archives/C29936TQC/p154989901601

On Thu, Mar 14, 2019 at 11:51 PM Matthew Woehlke wrote:
> While working on some modernization of my application -- in particular,
> converting some UTF-8 literals to use QStringLiteral -- I noticed a
> concerning compiler warning:
>
>   warning C4566: character represented by universal-character-name
>   '\u' cannot be represented in the current code page (1252)
>
> After doing some testing, it turns out that, given code like
> QStringLiteral("\u269E \U0001f387 \u269F"), MSVC is indeed butchering
> the string.
>
> Further investigation shows that the problem seems to be with the
> implementation of QStringLiteral. In particular, it appears that the
> preprocessor initially sees just the raw string literal without the 'u'
> prefix, butchers it, then later "promotes" it to a UTF-16 literal, but
> by then the damage has been done.
>
> While this absolutely feels like a compiler bug, it's an *awful* big
> gotcha that probably should be documented. Also, is there anything that
> Qt can do to work around it? (I know these sorts of macro expansions can
> be tricksy...)
>
> Note: and the *local* work-around is apparently to include the 'u'
> prefix on my own literal; apparently doubling it (`uu"stuff"`) is okay.
>
> --
> Matthew
Re: [Development] QStringLiteral is broken(ish) on MSVC (compiler bug?)
On 14/03/2019 16.50, Matthew Woehlke wrote:
> While working on some modernization of my application -- in particular,
> converting some UTF-8 literals to use QStringLiteral -- I noticed a
> concerning compiler warning:
>
>   warning C4566: character represented by universal-character-name
>   '\u' cannot be represented in the current code page (1252)
>
> After doing some testing, it turns out that, given code like
> QStringLiteral("\u269E \U0001f387 \u269F"), MSVC is indeed butchering
> the string.
>
> Further investigation shows that the problem seems to be with the
> implementation of QStringLiteral. In particular, it appears that the
> preprocessor initially sees just the raw string literal without the 'u'
> prefix, butchers it, then later "promotes" it to a UTF-16 literal, but
> by then the damage has been done.
>
> While this absolutely feels like a compiler bug, it's an *awful* big
> gotcha that probably should be documented. Also, is there anything that
> Qt can do to work around it? (I know these sorts of macro expansions can
> be tricksy...)
>
> Note: and the *local* work-around is apparently to include the 'u'
> prefix on my own literal; apparently doubling it (`uu"stuff"`) is okay.

...forgot to mention: the previous mail was supposed to have a complete test case attached, except I accidentally applied the work-around and saved it before sending the message. So, to see the problem, build the attached source (with VC++), but remove the 'u' prefix from the QStringLiteral.

--
Matthew
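Since the attachment is not reproduced here, the following Qt-free sketch shows which UTF-16 code units the sample string should contain. It uses an explicit u prefix, so it shows the intended contents and does not reproduce the problem; highSurrogate and lowSurrogate are hypothetical helpers for this illustration.

```cpp
#include <cassert>

// Same characters as the QStringLiteral argument above, written with an
// explicit u prefix so no unprefixed literal is involved.
constexpr char16_t sample[] = u"\u269E \U0001f387 \u269F";

// U+1F387 lies outside the BMP, so UTF-16 stores it as a surrogate pair.
constexpr char16_t highSurrogate(char32_t cp)
{
    return static_cast<char16_t>(0xD800 + ((cp - 0x10000) >> 10));
}
constexpr char16_t lowSurrogate(char32_t cp)
{
    return static_cast<char16_t>(0xDC00 + ((cp - 0x10000) & 0x3FF));
}

// Six code units (two BMP characters, two spaces, one surrogate pair)
// plus the terminating NUL.
static_assert(sizeof(sample) / sizeof(sample[0]) == 7,
              "unexpected number of UTF-16 code units");

// Check every code unit of the sample against its expected value.
inline void checkSample()
{
    assert(sample[0] == 0x269E);
    assert(sample[1] == u' ');
    assert(sample[2] == highSurrogate(0x1F387));
    assert(sample[3] == lowSurrogate(0x1F387));
    assert(sample[4] == u' ');
    assert(sample[5] == 0x269F);
}
```

Comparing these values against the data the compiler actually emits for the unprefixed version is one way to confirm whether a given build is affected.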