[Bug c/67224] UTF-8 support for identifier names in GCC

2021-04-21 Thread egallager at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67224

Eric Gallager  changed:

   What|Removed |Added

 CC||branning at gmail dot com,
   ||development at jordi dot vilar.cat,
   ||dwolf at dannad dot de,
   ||egallager at gcc dot gnu.org,
   ||spoa at eircom dot net

--- Comment #37 from Eric Gallager  ---
Redoing a few CCs that got removed without being marked as removed in the bug
history, presumably because of the server migration.

[Bug c/67224] UTF-8 support for identifier names in GCC

2020-05-01 Thread lopezibanez at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67224

--- Comment #36 from Manuel López-Ibáñez  ---
If the patch is in, it should be ok. Or ask in gcc-patches for someone to
commit on your behalf. Gerald is very helpful. Just make sure the subject
of the email is very clear.

On Fri, 1 May 2020, 16:12 lhyatt at gmail dot com, 
wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67224
>
> --- Comment #35 from Lewis Hyatt  ---
> (In reply to Lewis Hyatt from comment #34)
> > (In reply to Eric Gallager from comment #33)
> > > This is a big enough feature that it should probably get an entry in
> > > gcc-10/changes.html
> >
> > I emailed a suggested patch to that effect here:
> > https://gcc.gnu.org/ml/gcc-patches/2020-01/msg01667.html. I can commit if it
> > looks OK. Thanks!
>
> With GCC 10 release imminent, would anyone be able to confirm it's OK to push
> this to changes.html please? Thanks so much.
> https://gcc.gnu.org/pipermail/gcc-patches/2020-April/544343.html

[Bug c/67224] UTF-8 support for identifier names in GCC

2020-05-01 Thread lhyatt at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67224

--- Comment #35 from Lewis Hyatt  ---
(In reply to Lewis Hyatt from comment #34)
> (In reply to Eric Gallager from comment #33)
> > This is a big enough feature that it should probably get an entry in
> > gcc-10/changes.html
> 
> I emailed a suggested patch to that effect here:
> https://gcc.gnu.org/ml/gcc-patches/2020-01/msg01667.html. I can commit if it
> looks OK. Thanks!

With GCC 10 release imminent, would anyone be able to confirm it's OK to push
this to changes.html please? Thanks so much.
https://gcc.gnu.org/pipermail/gcc-patches/2020-April/544343.html

[Bug c/67224] UTF-8 support for identifier names in GCC

2020-02-09 Thread lhyatt at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67224

--- Comment #34 from Lewis Hyatt  ---
(In reply to Eric Gallager from comment #33)
> This is a big enough feature that it should probably get an entry in
> gcc-10/changes.html

I emailed a suggested patch to that effect here:
https://gcc.gnu.org/ml/gcc-patches/2020-01/msg01667.html. I can commit if it
looks OK. Thanks!

[Bug c/67224] UTF-8 support for identifier names in GCC

2019-09-19 Thread egallager at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67224

Eric Gallager  changed:

   What|Removed |Added

 CC||egallager at gcc dot gnu.org

--- Comment #33 from Eric Gallager  ---
This is a big enough feature that it should probably get an entry in
gcc-10/changes.html

[Bug c/67224] UTF-8 support for identifier names in GCC

2019-09-19 Thread jsm28 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67224

Joseph S. Myers  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED
   Target Milestone|--- |10.0

--- Comment #32 from Joseph S. Myers  ---
Implemented for GCC 10.

[Bug c/67224] UTF-8 support for identifier names in GCC

2019-09-19 Thread jsm28 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67224

--- Comment #31 from Joseph S. Myers  ---
Author: jsm28
Date: Thu Sep 19 19:56:11 2019
New Revision: 275979

URL: https://gcc.gnu.org/viewcvs?rev=275979&root=gcc&view=rev
Log:
Support extended characters in C/C++ identifiers (PR c/67224)

libcpp/ChangeLog
2019-09-19  Lewis Hyatt  

PR c/67224
* charset.c (_cpp_valid_utf8): New function to help lex UTF-8 tokens.
* internal.h (_cpp_valid_utf8): Declare.
* lex.c (forms_identifier_p): Use it to recognize UTF-8 identifiers.
(_cpp_lex_direct): Handle UTF-8 in identifiers and CPP_OTHER tokens.
Do all work in "default" case to avoid slowing down typical code paths.
Also handle $ and UCN in the default case for consistency.

gcc/ChangeLog
2019-09-19  Lewis Hyatt  

PR c/67224
* doc/cpp.texi: Document support for extended characters in
identifiers.
* doc/cppopts.texi: Likewise.

gcc/testsuite/ChangeLog
2019-09-19  Lewis Hyatt  

PR c/67224
* c-c++-common/cpp/ucnid-2011-1-utf8.c: New test.
* g++.dg/cpp/ucnid-1-utf8.C: New test.
* g++.dg/cpp/ucnid-2-utf8.C: New test.
* g++.dg/cpp/ucnid-3-utf8.C: New test.
* g++.dg/cpp/ucnid-4-utf8.C: New test.
* g++.dg/other/ucnid-1-utf8.C: New test.
* gcc.dg/cpp/ucnid-1-utf8.c: New test.
* gcc.dg/cpp/ucnid-10-utf8.c: New test.
* gcc.dg/cpp/ucnid-11-utf8.c: New test.
* gcc.dg/cpp/ucnid-12-utf8.c: New test.
* gcc.dg/cpp/ucnid-13-utf8.c: New test.
* gcc.dg/cpp/ucnid-14-utf8.c: New test.
* gcc.dg/cpp/ucnid-15-utf8.c: New test.
* gcc.dg/cpp/ucnid-2-utf8.c: New test.
* gcc.dg/cpp/ucnid-3-utf8.c: New test.
* gcc.dg/cpp/ucnid-4-utf8.c: New test.
* gcc.dg/cpp/ucnid-6-utf8.c: New test.
* gcc.dg/cpp/ucnid-7-utf8.c: New test.
* gcc.dg/cpp/ucnid-9-utf8.c: New test.
* gcc.dg/ucnid-1-utf8.c: New test.
* gcc.dg/ucnid-10-utf8.c: New test.
* gcc.dg/ucnid-11-utf8.c: New test.
* gcc.dg/ucnid-12-utf8.c: New test.
* gcc.dg/ucnid-13-utf8.c: New test.
* gcc.dg/ucnid-14-utf8.c: New test.
* gcc.dg/ucnid-15-utf8.c: New test.
* gcc.dg/ucnid-16-utf8.c: New test.
* gcc.dg/ucnid-2-utf8.c: New test.
* gcc.dg/ucnid-3-utf8.c: New test.
* gcc.dg/ucnid-4-utf8.c: New test.
* gcc.dg/ucnid-5-utf8.c: New test.
* gcc.dg/ucnid-6-utf8.c: New test.
* gcc.dg/ucnid-7-utf8.c: New test.
* gcc.dg/ucnid-8-utf8.c: New test.
* gcc.dg/ucnid-9-utf8.c: New test.

Added:
trunk/gcc/testsuite/c-c++-common/cpp/ucnid-2011-1-utf8.c
trunk/gcc/testsuite/g++.dg/cpp/ucnid-1-utf8.C
trunk/gcc/testsuite/g++.dg/cpp/ucnid-2-utf8.C
trunk/gcc/testsuite/g++.dg/cpp/ucnid-3-utf8.C
trunk/gcc/testsuite/g++.dg/cpp/ucnid-4-utf8.C
trunk/gcc/testsuite/g++.dg/other/ucnid-1-utf8.C
trunk/gcc/testsuite/gcc.dg/cpp/ucnid-1-utf8.c
trunk/gcc/testsuite/gcc.dg/cpp/ucnid-10-utf8.c
trunk/gcc/testsuite/gcc.dg/cpp/ucnid-11-utf8.c
trunk/gcc/testsuite/gcc.dg/cpp/ucnid-12-utf8.c
trunk/gcc/testsuite/gcc.dg/cpp/ucnid-13-utf8.c
trunk/gcc/testsuite/gcc.dg/cpp/ucnid-14-utf8.c
trunk/gcc/testsuite/gcc.dg/cpp/ucnid-15-utf8.c
trunk/gcc/testsuite/gcc.dg/cpp/ucnid-2-utf8.c
trunk/gcc/testsuite/gcc.dg/cpp/ucnid-3-utf8.c
trunk/gcc/testsuite/gcc.dg/cpp/ucnid-4-utf8.c
trunk/gcc/testsuite/gcc.dg/cpp/ucnid-6-utf8.c
trunk/gcc/testsuite/gcc.dg/cpp/ucnid-7-utf8.c
trunk/gcc/testsuite/gcc.dg/cpp/ucnid-9-utf8.c
trunk/gcc/testsuite/gcc.dg/ucnid-1-utf8.c
trunk/gcc/testsuite/gcc.dg/ucnid-10-utf8.c
trunk/gcc/testsuite/gcc.dg/ucnid-11-utf8.c
trunk/gcc/testsuite/gcc.dg/ucnid-12-utf8.c
trunk/gcc/testsuite/gcc.dg/ucnid-13-utf8.c
trunk/gcc/testsuite/gcc.dg/ucnid-14-utf8.c
trunk/gcc/testsuite/gcc.dg/ucnid-15-utf8.c
trunk/gcc/testsuite/gcc.dg/ucnid-16-utf8.c
trunk/gcc/testsuite/gcc.dg/ucnid-2-utf8.c
trunk/gcc/testsuite/gcc.dg/ucnid-3-utf8.c
trunk/gcc/testsuite/gcc.dg/ucnid-4-utf8.c
trunk/gcc/testsuite/gcc.dg/ucnid-5-utf8.c
trunk/gcc/testsuite/gcc.dg/ucnid-6-utf8.c
trunk/gcc/testsuite/gcc.dg/ucnid-7-utf8.c
trunk/gcc/testsuite/gcc.dg/ucnid-8-utf8.c
trunk/gcc/testsuite/gcc.dg/ucnid-9-utf8.c
Modified:
trunk/gcc/ChangeLog
trunk/gcc/doc/cpp.texi
trunk/gcc/doc/cppopts.texi
trunk/gcc/testsuite/ChangeLog
trunk/libcpp/ChangeLog
trunk/libcpp/charset.c
trunk/libcpp/internal.h
trunk/libcpp/lex.c

[Bug c/67224] UTF-8 support for identifier names in GCC

2019-08-07 Thread joseph at codesourcery dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67224

--- Comment #30 from joseph at codesourcery dot com  ---
https://git.savannah.gnu.org/cgit/gnulib.git/plain/doc/Copyright/request-assign.future

is the form to complete and send to ass...@gnu.org (to do an assignment 
covering past and future changes to GCC, which is usually the best one to 
use).

[Bug c/67224] UTF-8 support for identifier names in GCC

2019-08-06 Thread lhyatt at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67224

--- Comment #29 from Lewis Hyatt  ---
(In reply to jos...@codesourcery.com from comment #28)

> Thanks for working on this!  I encourage sending this to gcc-patches once 
> a few fixes have been made and you've done the legal paperwork, see 
> .
> 


Thank you very much for taking a look and for the feedback. I will incorporate
all this and send to gcc-patches. Regarding the copyright assignment, I
couldn't quite discern from this link what I need to do next... it seems like I
need someone to email me the necessary form, is that correct? 

I will also file additional bug reports for the diagnostics-related stuff; I 
believe I can construct test cases that do not depend on this patch for those.

[Bug c/67224] UTF-8 support for identifier names in GCC

2019-08-06 Thread joseph at codesourcery dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67224

--- Comment #28 from joseph at codesourcery dot com  ---
On Mon, 22 Jul 2019, lhyatt at gmail dot com wrote:

> I am interested in helping out with this if there is still interest to support
> this feature. (Full disclosure, I don't have any experience with the gcc
> codebase, but I do have a lot of experience developing with gcc.)

Thanks for working on this!  I encourage sending this to gcc-patches once 
a few fixes have been made and you've done the legal paperwork, see 
.

I'm wary of the MIN use in _cpp_lex_direct, as this is 
performance-critical code so it's not clear an extra operation should be 
added for every token.  I'd rather put the check for UTF-8 in the default 
case (a case that should in practice be rare), with a goto from there to 
the case of identifiers.
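
The structure suggested here can be illustrated with a toy classifier. This is only a sketch of the idea, not libcpp code: the names toy_kind, toy_classify and toy_maybe_utf8_ident are hypothetical, and the UTF-8 check is reduced to a high-bit test. ASCII bytes are dispatched by their own case labels, and only bytes that match no explicit case reach the default label, where the extended-character check runs and jumps back to the identifier case.

```c
#include <assert.h>
#include <stddef.h>

enum toy_kind { TOY_IDENT, TOY_NUMBER, TOY_OTHER };

/* Hypothetical placeholder: a real lexer would decode and validate the
   UTF-8 sequence here.  This toy only looks at the high bit.  */
static int
toy_maybe_utf8_ident (const unsigned char *p)
{
  return *p >= 0x80;
}

/* Classify the token that starts at P.  ASCII bytes hit an explicit case
   label; only bytes with no case of their own reach the default label
   and pay for the extra check.  */
enum toy_kind
toy_classify (const unsigned char *p)
{
  assert (p != NULL);
  switch (*p)
    {
    case '0': case '1': case '2': case '3': case '4':
    case '5': case '6': case '7': case '8': case '9':
      return TOY_NUMBER;

    case '_':
    case 'a': case 'b': case 'c': case 'd': case 'e': case 'f': case 'g':
    case 'h': case 'i': case 'j': case 'k': case 'l': case 'm': case 'n':
    case 'o': case 'p': case 'q': case 'r': case 's': case 't': case 'u':
    case 'v': case 'w': case 'x': case 'y': case 'z':
    case 'A': case 'B': case 'C': case 'D': case 'E': case 'F': case 'G':
    case 'H': case 'I': case 'J': case 'K': case 'L': case 'M': case 'N':
    case 'O': case 'P': case 'Q': case 'R': case 'S': case 'T': case 'U':
    case 'V': case 'W': case 'X': case 'Y': case 'Z':
    identifier:
      return TOY_IDENT;

    default:
      /* Rare path: the only place the extended-character check runs.  */
      if (toy_maybe_utf8_ident (p))
        goto identifier;
      return TOY_OTHER;
    }
}
```

With this layout, a token that starts with an ASCII letter, digit or underscore never executes the extended-character test.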

As a coding style matter, note that in various places sentences in 
comments should start with a capital letter.

> 5. There is a problem with diagnostics output when the source contains UTF-8
> characters. The locator caret ends up in the wrong place, I assume just because
> this code is not aware of the multibyte encoding. That much is not specific to
> my patch, it exists already now e.g. with:

This seems like it should have a separate bug filed for it (I don't see 
any currently open bugs for this issue).

> The bigger problem though is in layout::print_source_line() which colorizes the
> source lines. It seems to end up attempting to colorize just the first byte,
> even for UTF-8, which makes the output no longer valid. I tried to look into it
> but I wasn't sure what are the implications, e.g. would it require some much
> larger overhaul of diagnostics infrastructure anyway to get this right, and
> would it perhaps be better just to disable colorization in the presence of
> UTF-8 input or something like this, for the meantime.

And this should probably also have a separate bug filed (whether or not it 
can occur without this patch applied).

> This is also not specific to this patch and occurs the same if UCN
> is used:

This also seems like a matter for filing a separate bug.  Or maybe two 
separate bugs, one for C and one for C++, since the fixes might be 
different.  For C, the suggestion of \xcf\x80 looks like a missing call to 
identifier_to_locale when printing an identifier using %qs - but the C++ 
code is using %qE, which should use identifier_to_locale automatically, so 
I'm not sure what's wrong there.

> 7. What is the expected output from gcc -E of this code?
> 
> ---
> int π;
> 
> 
> Currently it outputs:
> int \U03c0;
> 
> So curiously, it's as if C++ required translation of extended chars to UCNs is
> happening, so I think this output is actually potentially correct in C++ mode?
> But it is also this way in C mode which I think is probably not expected. It
> seems to come from cpp_output_token() which does not make use of the "original
> spelling" data structures. I am not sure about this one but probably the right
> solution is not much work, if someone knows what that might be?

I don't think the -E output matters much here; it's not specified by the 
standard.

The results of stringizing *are* more precisely defined (the relevant 
tests stringize twice to verify those results).  Strictly, for C++ 
stringizing twice (for extended characters including $ @ `) should make 
the conversion of such characters to UCNs visible (in strings, not just in 
identifiers), because, unlike C, C++ does not have the special rule making 
it implementation-defined whether the \ of a UCN in a string literal is 
doubled when stringizing.  I don't think that's something you need to fix, 
however, since there's no attempt to implement that conversion for C++ at 
present, but it does make a couple of the C++ tests in your patch strictly 
invalid.

> This is also the reason that one of the new testcases
> (gcc/testsuite/gcc.dg/cpp/ucnid-13-utf8.c) fails, this:
> 
> #define Á 1
> 
> also preprocesses (in -E -dD) to include UCNs. I am not sure what is expected
> here.

There is definitely no need to preserve spelling there (it's not even 
possible in general, since the same macro name can be spelt differently in 
otherwise identical definitions of the same macro; it's only a constraint 
violation if either macro argument names or the RHS are different, not if 
the name of the macro itself is spelt differently).  So the right thing is 
to test that the output in that case uses a UCN.

> 8. There are tests (e.g. gcc/testsuite/gcc.dg/ucnid-10.c) which verify that
> when the locale is not utf8, diagnostics use UCNs instead of raw UTF8. I am not
> sure if this still makes sense when the files themselves contain UTF8, but that
> was the behavior that came out so I maintained these tests as well.

Yes, I think that's correct.

[Bug c/67224] UTF-8 support for identifier names in GCC

2019-07-22 Thread lhyatt at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67224

--- Comment #27 from Lewis Hyatt  ---
Created attachment 46620
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=46620&action=edit
second attempt at posting the patch

Sorry, the previous patch I sent doesn't seem to show correctly in Bugzilla. I
think probably because the test cases include invalid UTF-8 and also
latin1-encoded characters, it may have caused the whole file to be
misinterpreted as the wrong encoding? It only affects the test cases but I am
sending it here as binary and hoping that it is easier to read then. Thanks!

[Bug c/67224] UTF-8 support for identifier names in GCC

2019-07-22 Thread lhyatt at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67224

Lewis Hyatt  changed:

   What|Removed |Added

 CC||lhyatt at gmail dot com

--- Comment #26 from Lewis Hyatt  ---
Created attachment 46618
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=46618&action=edit
Patch with test cases that implements extended chars in identifiers

Hi All-

I am interested in helping out with this if there is still interest to support
this feature. (Full disclosure, I don't have any experience with the gcc
codebase, but I do have a lot of experience developing with gcc.)

I took a crack at implementing it based on Joseph's outline in Comment #21 and
the rest of the discussion in this thread. The patch is attached, including
test cases. (I more or less took all the existing ucnid* test cases and adapted
them for this, plus added a couple extra ones.) It seems to work fine, as far
as interpreting the identifiers and bootstrapping clean, and test cases also
pass except for one that I'll mention below, but I have many comments +
questions as well:

1. The number of changes to libcpp is actually pretty small. All the work to
recognize UTF-8 happens in forms_identifier_p(), so that the existing fast
paths for regular characters are not affected, and that extended chars end up
getting treated just like UCNs for the most part. forms_identifier_p() makes
use of a new utility _cpp_valid_utf8_in_identifier() in charset.c that is
similar to the existing _cpp_valid_ucn() and handles the UTF8 details.

2. Otherwise _cpp_interpret_identifier() and lex_identifier() didn't need any
changes. The former could be optimized a bit: it always allocates a temporary
buffer, even though the buffer is only needed if UCNs appear. (This is already
the case for dollar signs, which end up in this code path too.)
Probably it's not a big deal though.

3. Invalid UTF-8 is left alone and parsed as stray tokens, the same as now.

4. Regarding the case of codepoints not allowed to appear in an identifier. In
C, I did as Joseph suggested, or at least I tried to. The grammar specifies
that an identifier ends once an illegal character is encountered, so this is
how it works now, and then the disallowed UTF-8 forms a stray token next. It
was not clear to me though whether this stray token should consist of just the
next 1 byte of the input, or the entire disallowed UTF-8 character. Currently
it's just the next byte because that's how things worked out of the box.
Changing it wouldn't be too hard, just means the default case of
_cpp_lex_direct()'s main switch statement would need to try to read a UTF char
rather than a byte.

  In C++, I think UCNs or UTF-8 in identifiers should be treated identically in
all respects, unless I misunderstand things (because technically the UTF-8 was
supposed to be converted to UCNs in translation phase 1), so in that case a
disallowed codepoint does not end the token but rather triggers an invalid
character error.

5. There is a problem with diagnostics output when the source contains UTF-8
characters. The locator caret ends up in the wrong place, I assume just because
this code is not aware of the multibyte encoding. That much is not specific to
my patch, it exists already now e.g. with:

$ cat t.cpp
const char* x = "ππ"; int y = z;

$ g++ -c t.cpp
t.cpp:1:57: error: ‘z’ was not declared in this scope
 const char* x = "ππ"; int y = z;
 ^

The bigger problem though is in layout::print_source_line() which colorizes the
source lines. It seems to end up attempting to colorize just the first byte,
even for UTF-8, which makes the output no longer valid. I tried to look into it
but I wasn't sure what are the implications, e.g. would it require some much
larger overhaul of diagnostics infrastructure anyway to get this right, and
would it perhaps be better just to disable colorization in the presence of
UTF-8 input or something like this, for the meantime.

As an example of what I mean, from preprocessing this (in c99 mode):


int ٩;


I get:
t3.c:1:5: error: universal character ٩ is not valid at the start of an
identifier
1 | int ٩;

But if color is enabled, the output gets corrupted because ANSI escapes are
inserted between the two bytes of the multibyte character on the 2nd line of
diagnostics.

6. There is also a problem with formatting the output of some diagnostics, e.g.
when I compile this:

---
int π = 3;
int x = π2;
---

I get:
t2.cpp:2:9: error: ‘π2’ was not declared in this scope; did you mean
‘\xcf\x80’?

This is also not specific to this patch and occurs the same if UCN is used:

$ cat t.cpp
int \u03C0 = 3;
int x = \u03C02;

$ g++ -c t.cpp
t.cpp:2:9: error: ‘π2’ was not declared in this scope
 int x = \u03C02;
 ^~~
t.cpp:2:9: note: suggested 
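
The caret and colorization problems in items 5 and 6 both come from treating a byte offset as a character position. The sketch below shows the distinction with two hypothetical helpers, toy_char_column and toy_char_end; it is not GCC's diagnostics code. It assumes the line is valid UTF-8, and it counts code points, which is still only an approximation of display width (wide and combining characters would need more work).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return the 1-based column of the byte at BYTE_OFFSET in LINE, counting
   UTF-8 code points rather than bytes.  Assumes LINE is valid UTF-8 and
   BYTE_OFFSET points at the first byte of a character.  */
size_t
toy_char_column (const char *line, size_t byte_offset)
{
  size_t col = 1;

  assert (line != NULL && byte_offset <= strlen (line));
  for (size_t i = 0; i < byte_offset; i++)
    if (((unsigned char) line[i] & 0xC0) != 0x80)  /* Skip continuation bytes.  */
      col++;
  return col;
}

/* Return the offset just past the character that starts at START, so that
   a highlighted range [START, end) never splits a multibyte sequence.  */
size_t
toy_char_end (const char *line, size_t start)
{
  size_t end;

  assert (line != NULL && start < strlen (line));
  end = start + 1;
  while (((unsigned char) line[end] & 0xC0) == 0x80)
    end++;
  return end;
}
```

Escape sequences for color would then be emitted only at offsets returned by toy_char_end, never between the bytes of one character.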

[Bug c/67224] UTF-8 support for identifier names in GCC

2017-11-01 Thread spoa at eircom dot net
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67224

--- Comment #25 from spoa at eircom dot net  ---
Many thanks Manu.  The to_UCN.sh script works well.  The only trouble was that
the names of my include files also contain unusual characters with diacritic
marks, and the script changes these file names to UCNs as well, so the compiler
can't find them.  I had to re-edit the .cpp file manually after conversion to
UCN to change the include file names back.  But in spite of that, it is useful
and enables coding with a much greater choice of words for identifiers.  Much
easier for me to read my code. Thanks again.
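
One way to avoid the include-name problem is to leave #include lines untouched during conversion. The following is a minimal sketch of that idea only. It is not the to_UCN.sh script from the wiki, the names toy_is_include and toy_line_to_ucn are hypothetical, and it assumes each input line is valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Return nonzero if LINE is a preprocessor #include directive.  */
static int
toy_is_include (const char *line)
{
  line += strspn (line, " \t");
  if (*line != '#')
    return 0;
  line++;
  line += strspn (line, " \t");
  return strncmp (line, "include", 7) == 0;
}

/* Rewrite LINE into OUT, replacing every multibyte UTF-8 sequence with a
   \uXXXX or \UXXXXXXXX escape.  #include lines are copied unchanged so
   file names keep their spelling.  Assumes LINE is valid UTF-8.
   Returns 0 on success, -1 if OUT is too small.  */
int
toy_line_to_ucn (const char *line, char *out, size_t outsz)
{
  const unsigned char *s = (const unsigned char *) line;
  size_t o = 0;

  assert (line != NULL && out != NULL);
  if (toy_is_include (line))
    {
      if (strlen (line) >= outsz)
        return -1;
      strcpy (out, line);
      return 0;
    }
  while (*s)
    {
      if (o + 11 > outsz)   /* Room for "\UXXXXXXXX" plus the NUL.  */
        return -1;
      if (*s < 0x80)
        {
          out[o++] = (char) *s++;
          continue;
        }
      /* Sequence length from the lead byte, then gather payload bits.  */
      int n = (*s & 0xE0) == 0xC0 ? 2 : (*s & 0xF0) == 0xE0 ? 3 : 4;
      unsigned long cp = *s & (0xFF >> (n + 1));
      s++;
      for (int i = 1; i < n && *s; i++, s++)
        cp = (cp << 6) | (*s & 0x3F);
      if (cp <= 0xFFFF)
        o += (size_t) sprintf (out + o, "\\u%04lX", cp);
      else
        o += (size_t) sprintf (out + o, "\\U%08lX", cp);
    }
  if (o >= outsz)
    return -1;
  out[o] = '\0';
  return 0;
}
```

A driver would read the source file line by line, pass each line through toy_line_to_ucn, and write the result to the file handed to the compiler.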

[Bug c/67224] UTF-8 support for identifier names in GCC

2017-10-30 Thread manu at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67224

Manuel López-Ibáñez  changed:

   What|Removed |Added

   Last reconfirmed|2015-08-17 00:00:00 |2017-10-30

--- Comment #24 from Manuel López-Ibáñez  ---
(In reply to s...@eircom.net from comment #23)
> An important patch. Is there a similar patch for versions later than 5.2.0
> of gcc?  I'm looking for gcc-7.2.1-2 patch for unicode identifiers.

The patch above is not recommended due to the problems mentioned above. 

The recommended work-around is given here:

https://gcc.gnu.org/wiki/FAQ#utf8_identifiers

Guidelines for a proper implementation are given in comment #21.

[Bug c/67224] UTF-8 support for identifier names in GCC

2017-10-26 Thread spoa at eircom dot net
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67224

spoa at eircom dot net  changed:

   What|Removed |Added

 CC||spoa at eircom dot net

--- Comment #23 from spoa at eircom dot net  ---
An important patch. Is there a similar patch for versions later than 5.2.0 of
gcc?  I'm looking for gcc-7.2.1-2 patch for unicode identifiers.

[Bug c/67224] UTF-8 support for identifier names in GCC

2017-02-10 Thread dwolf at dannad dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67224

Daniel Wolf  changed:

   What|Removed |Added

 CC||dwolf at dannad dot de

--- Comment #22 from Daniel Wolf  ---
Has there been any progress on this since 2015? I'm maintaining a project that
uses the International Phonetic Alphabet (IPA) internally. My life would be
much easier if I could use identifiers like aʊ or dʒ. Both are valid C++
identifiers supported by Clang, Xcode and Visual Studio, but not supported by
GCC.

My knowledge of compilers is very limited, so I'm afraid I can't be of
practical help. But I'd like to point out that there is indeed demand for this
feature -- see for example this StackOverflow question:


[Bug c/67224] UTF-8 support for identifier names in GCC

2015-08-20 Thread ejolson at unr dot edu
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67224

--- Comment #20 from Eric ejolson at unr dot edu ---
I've been looking at the code in lex_identifier as well as what goes on in
forms_identifier_p and so forth.  At some point each identifier needs to be
stored in the symbol table using ht_lookup_with_hash.  Proper functioning
requires that UTF-8 and UCN representations of the same unicode characters are
treated as the same symbol.  Thus, there needs to be some point at which the
identifiers are regularized to be either all UTF-8 or all UCN escaped ASCII. 
As gcc is working with UCNs right now, the obvious implementation allocates
temporary memory to hold the UCN escaped ASCII version of an UTF-8 identifier
and then frees it again after calling ht_lookup.  Any comments would be
appreciated.


[Bug c/67224] UTF-8 support for identifier names in GCC

2015-08-20 Thread joseph at codesourcery dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67224

--- Comment #21 from joseph at codesourcery dot com ---
_cpp_interpret_identifier converts UCNs to UTF-8 which is the canonical 
internal form for identifiers - for UTF-8 in identifiers, you just need to 
pass it straight through unmodified there.  (cpplib takes care to store 
the original spelling of the identifier as well for purposes for which 
that matters, but that's simply a matter of lex_identifier calling 
cpp_lookup on the original spelling as well as using 
_cpp_interpret_identifier to get the canonical version.)
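
To illustrate the canonical form described above, here is a small standalone sketch. It is not the libcpp implementation, and toy_encode_utf8 and toy_canonical_ident are hypothetical names. It rewrites UCN escapes in an identifier spelling as UTF-8 and copies bytes that are already UTF-8 unchanged, so both spellings of the same identifier produce the same byte string and could share one symbol-table entry.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Encode code point CP as UTF-8 into OUT (room for 4 bytes); return the
   number of bytes written.  */
static size_t
toy_encode_utf8 (unsigned long cp, char *out)
{
  if (cp < 0x80)
    {
      out[0] = (char) cp;
      return 1;
    }
  if (cp < 0x800)
    {
      out[0] = (char) (0xC0 | (cp >> 6));
      out[1] = (char) (0x80 | (cp & 0x3F));
      return 2;
    }
  if (cp < 0x10000)
    {
      out[0] = (char) (0xE0 | (cp >> 12));
      out[1] = (char) (0x80 | ((cp >> 6) & 0x3F));
      out[2] = (char) (0x80 | (cp & 0x3F));
      return 3;
    }
  out[0] = (char) (0xF0 | (cp >> 18));
  out[1] = (char) (0x80 | ((cp >> 12) & 0x3F));
  out[2] = (char) (0x80 | ((cp >> 6) & 0x3F));
  out[3] = (char) (0x80 | (cp & 0x3F));
  return 4;
}

/* Copy the identifier spelling IN to OUT, replacing each \uXXXX or
   \UXXXXXXXX with its UTF-8 encoding.  Bytes that are already UTF-8 are
   passed through unmodified.  Returns 0 on success, -1 on a malformed
   UCN or if OUT is too small.  */
int
toy_canonical_ident (const char *in, char *out, size_t outsz)
{
  size_t o = 0;

  assert (in != NULL && out != NULL);
  if (outsz == 0)
    return -1;
  while (*in)
    {
      if (in[0] == '\\' && (in[1] == 'u' || in[1] == 'U'))
        {
          int ndigits = in[1] == 'u' ? 4 : 8;
          unsigned long cp = 0;

          in += 2;
          for (int i = 0; i < ndigits; i++, in++)
            {
              unsigned char c = (unsigned char) *in;
              if (!isxdigit (c))
                return -1;
              cp = cp * 16 + (isdigit (c) ? c - '0' : tolower (c) - 'a' + 10);
            }
          if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return -1;
          if (o + 4 >= outsz)
            return -1;
          o += toy_encode_utf8 (cp, out + o);
        }
      else
        {
          if (o + 1 >= outsz)
            return -1;
          out[o++] = *in++;
        }
    }
  out[o] = '\0';
  return 0;
}
```

Under these assumptions the spellings \u03C0r2 and πr2 both yield the bytes CF 80 72 32, so comparing the canonical forms with strcmp treats them as the same identifier.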

So you never need to convert UTF-8 to UCNs in order to handle UTF-8 in 
identifiers (cpplib has logic to do so when needed for output, but you 
don't need to add anything new in that regard).  You do need to decode 
UTF-8 into character values for the code that checks normalization, which 
characters are allowed at the start of identifiers, etc., just as the 
existing code decodes UCNs into such values.  (But as I noted, a UCN not 
allowed in identifiers is lexed as part of an identifier, which is then 
considered invalid, whereas a UTF-8 character not allowed in identifiers 
should be lexed as a separate pp-token.  However, UTF-8 for a character 
allowed in identifiers but not at the start of an identifier should, I 
think, be lexed as an identifier character even at the start of an 
identifier, and then give an error for an invalid identifier if it appears 
at the start of an identifier.  That's my reading of the syntax 
productions in the C standard.)
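
The decoding step mentioned above can be sketched as follows. This is a generic UTF-8 decoder written for illustration, with the hypothetical name toy_decode_utf8; it is not taken from libcpp. It yields the code point value that checks such as "allowed in identifiers" or "allowed at the start of an identifier" would then consult, and it reports invalid input by returning 0 so the caller can lex those bytes as something other than an identifier.

```c
#include <assert.h>
#include <stddef.h>

/* Decode one UTF-8 sequence from the LEN bytes at S.  On success store
   the code point in *CP and return the number of bytes consumed (1-4).
   Return 0 for invalid input: truncated sequences, bad continuation
   bytes, overlong forms, surrogates, or values above 0x10FFFF.  */
size_t
toy_decode_utf8 (const unsigned char *s, size_t len, unsigned long *cp)
{
  /* Smallest code point that legitimately needs N bytes, indexed by N.  */
  static const unsigned long min_for_len[] = { 0, 0, 0x80, 0x800, 0x10000 };
  size_t n;
  unsigned long c;

  assert (s != NULL && cp != NULL);
  if (len == 0)
    return 0;
  if (s[0] < 0x80)
    {
      *cp = s[0];
      return 1;
    }
  else if ((s[0] & 0xE0) == 0xC0)
    {
      n = 2;
      c = s[0] & 0x1F;
    }
  else if ((s[0] & 0xF0) == 0xE0)
    {
      n = 3;
      c = s[0] & 0x0F;
    }
  else if ((s[0] & 0xF8) == 0xF0)
    {
      n = 4;
      c = s[0] & 0x07;
    }
  else
    return 0;   /* Stray continuation byte or invalid lead byte.  */

  if (len < n)
    return 0;
  for (size_t i = 1; i < n; i++)
    {
      if ((s[i] & 0xC0) != 0x80)
        return 0;
      c = (c << 6) | (s[i] & 0x3F);
    }
  if (c < min_for_len[n] || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
    return 0;
  *cp = c;
  return n;
}
```

A lexer built on this would decode first and apply the identifier-character tables second, in the same way the existing code handles the value of a UCN.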

You can ignore anything claiming to handle UTF-EBCDIC.


[Bug c/67224] UTF-8 support for identifier names in GCC

2015-08-18 Thread manu at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67224

--- Comment #13 from Manuel López-Ibáñez manu at gcc dot gnu.org ---
(In reply to Eric from comment #12)
> I'm glad to know people like Joseph are working on UTF-8 in gcc.

I think at the moment, neither Joseph nor anyone else is planning to work on
this. There doesn't seem to be sufficient demand for this feature so that
companies fund it or volunteers step up to implement it (you are the first one
to make an attempt that I am aware of).

> I spent a week adding UTF-8 input support to pcc.  At that time Microsoft
> Studio and clang already supported UTF-8 input files and I expected that gcc
> would do so in the next release.

Unfortunately, GCC has very few developers compared to Microsoft or Clang. Many
things in GCC will never get done if new people do not contribute to its
development. This is why if you want to see this feature, you are the best and
perhaps the only person to make it happen.

The problem is that this cannot be fixed by a one-line patch, otherwise it would
have been fixed a long time ago.

* GCC cannot rely on libiconv being always present. It has to work with glibc's
iconv, which is what is used in GNU/Linux.

* Even if glibc's iconv supported the C99 conversion, this would break other things.

* You need to add tests explicitly for various things (see Joseph's comments).
The tests will be added to the GCC testsuite to prove that your patch works as
it should and to make sure future changes do not break the tests.

* At a minimum, look at all the gcc.dg/cpp/ucnid-*.c and g++.dg/cpp/ucnid-*.c
tests and see what happens if you replace the \uNNNN with actual extended
characters.

* Joseph thinks that the best approach is to do the conversion from UTF-8 to
UCNs manually within cpplib, such that you can handle all the corner cases of
C/C++ (quoted strings, \µ, macro names,...)

[Bug c/67224] UTF-8 support for identifier names in GCC

2015-08-18 Thread ejolson at unr dot edu
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67224

--- Comment #15 from Eric ejolson at unr dot edu ---
Created attachment 36206
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=36206&action=edit
Improved UTF-8 identifier patch

Improved patch to support UTF-8 identifiers.  This version by default does no
translation unless -finput-charset=XXX is specified where XXX is something
other than C99 and should not affect EBCDIC hosts.


[Bug c/67224] UTF-8 support for identifier names in GCC

2015-08-18 Thread ejolson at unr dot edu
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67224

--- Comment #14 from Eric ejolson at unr dot edu ---
While there may not be current demand for gcc to accept UTF-8 identifiers, the
fact that clang and Visual Studio support this C99 feature means source code
using Greek and accented characters in variable names is likely to become more
prevalent over time.

I have done a little testing to check by default whether string literals can
contain arbitrary 8-bit data.  This is used, for example, in legacy code which
directly includes graphics characters from CP437.  The original preprocessor
specifies UTF-8 as the default input character set and UTF-8 as the
internal character set.  Then, if the internal and working character sets are
identical no translation is done and arbitrary 8-bit data is passed through
cleanly.  A slight modification to my patch needs to be made to retain the same
behavior.  In particular, the patch now specifies both the internal and default
input character sets to be C99 so no translation is done by default.  The
improved patch also includes consideration of EBCDIC hosts.

As iconv was installed on every GNU/Linux system I've tried, I'm not sure what
is wrong with using the C99 mode present in newer releases.  This achieves
exactly the suggested result of converting all UTF-8 input to UCNs in the
preprocessor while directly allowing other potentially useful conversions. 
Perhaps the configure script should be modified to check for a compatible
version of iconv and if one is not found resort to a manual conversion.

Testing is still underway.  After the standard regression tests are finished I
will create new tests utf8id-.* which will be versions of the ucnid-.* tests
for native utf-8 files.  I will also include a new test for arbitrary 8-bit
string literals, to verify further compatibility.


[Bug c/67224] UTF-8 support for identifier names in GCC

2015-08-18 Thread joseph at codesourcery dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67224

--- Comment #17 from joseph at codesourcery dot com ---
On Tue, 18 Aug 2015, ejolson at unr dot edu wrote:

> As iconv was installed on every GNU/Linux system I've tried, I'm not sure what
> is wrong with using the C99 mode present in newer releases.  This achieves

The iconv that is installed is glibc iconv.  It has *nothing to do with* 
libiconv, a completely independent package.  iconv --version will report a 
glibc version and iconv --list will produce a list not mentioning C99, 
e.g.:

$ iconv --version
iconv (Ubuntu EGLIBC 2.19-0ubuntu6.6) 2.19
$ iconv --list
The following list contains all the coded character sets known.  This does
not necessarily mean that all combinations of these names can be used for
the FROM and TO command line parameters.  One coded character set can be
listed with several different names (aliases).

  437, 500, 500V1, 850, 851, 852, 855, 856, 857, 860, 861, 862, 863, 864, 865,
  866, 866NAV, 869, 874, 904, 1026, 1046, 1047, 8859_1, 8859_2, 8859_3, 8859_4,
  8859_5, 8859_6, 8859_7, 8859_8, 8859_9, 10646-1:1993, 10646-1:1993/UCS4,
  ANSI_X3.4-1968, ANSI_X3.4-1986, ANSI_X3.4, ANSI_X3.110-1983, ANSI_X3.110,
  ARABIC, ARABIC7, ARMSCII-8, ASCII, ASMO-708, ASMO_449, BALTIC, BIG-5,
  BIG-FIVE, BIG5-HKSCS, BIG5, BIG5HKSCS, BIGFIVE, BRF, BS_4730, CA, CN-BIG5,
  CN-GB, CN, CP-AR, CP-GR, CP-HU, CP037, CP038, CP273, CP274, CP275, CP278,
  CP280, CP281, CP282, CP284, CP285, CP290, CP297, CP367, CP420, CP423, CP424,
  CP437, CP500, CP737, CP770, CP771, CP772, CP773, CP774, CP775, CP803, CP813,
  CP819, CP850, CP851, CP852, CP855, CP856, CP857, CP860, CP861, CP862, CP863,
  CP864, CP865, CP866, CP866NAV, CP868, CP869, CP870, CP871, CP874, CP875,
  CP880, CP891, CP901, CP902, CP903, CP904, CP905, CP912, CP915, CP916, CP918,
  CP920, CP921, CP922, CP930, CP932, CP933, CP935, CP936, CP937, CP939, CP949,
  CP950, CP1004, CP1008, CP1025, CP1026, CP1046, CP1047, CP1070, CP1079,
  CP1081, CP1084, CP1089, CP1097, CP1112, CP1122, CP1123, CP1124, CP1125,
  CP1129, CP1130, CP1132, CP1133, CP1137, CP1140, CP1141, CP1142, CP1143,
  CP1144, CP1145, CP1146, CP1147, CP1148, CP1149, CP1153, CP1154, CP1155,
  CP1156, CP1157, CP1158, CP1160, CP1161, CP1162, CP1163, CP1164, CP1166,
  CP1167, CP1250, CP1251, CP1252, CP1253, CP1254, CP1255, CP1256, CP1257,
  CP1258, CP1282, CP1361, CP1364, CP1371, CP1388, CP1390, CP1399, CP4517,
  CP4899, CP4909, CP4971, CP5347, CP9030, CP9066, CP9448, CP10007, CP12712,
  CP16804, CPIBM861, CSA7-1, CSA7-2, CSASCII, CSA_T500-1983, CSA_T500,
  CSA_Z243.4-1985-1, CSA_Z243.4-1985-2, CSA_Z243.419851, CSA_Z243.419852,
  CSDECMCS, CSEBCDICATDE, CSEBCDICATDEA, CSEBCDICCAFR, CSEBCDICDKNO,
  CSEBCDICDKNOA, CSEBCDICES, CSEBCDICESA, CSEBCDICESS, CSEBCDICFISE,
  CSEBCDICFISEA, CSEBCDICFR, CSEBCDICIT, CSEBCDICPT, CSEBCDICUK, CSEBCDICUS,
  CSEUCKR, CSEUCPKDFMTJAPANESE, CSGB2312, CSHPROMAN8, CSIBM037, CSIBM038,
  CSIBM273, CSIBM274, CSIBM275, CSIBM277, CSIBM278, CSIBM280, CSIBM281,
  CSIBM284, CSIBM285, CSIBM290, CSIBM297, CSIBM420, CSIBM423, CSIBM424,
  CSIBM500, CSIBM803, CSIBM851, CSIBM855, CSIBM856, CSIBM857, CSIBM860,
  CSIBM863, CSIBM864, CSIBM865, CSIBM866, CSIBM868, CSIBM869, CSIBM870,
  CSIBM871, CSIBM880, CSIBM891, CSIBM901, CSIBM902, CSIBM903, CSIBM904,
  CSIBM905, CSIBM918, CSIBM921, CSIBM922, CSIBM930, CSIBM932, CSIBM933,
  CSIBM935, CSIBM937, CSIBM939, CSIBM943, CSIBM1008, CSIBM1025, CSIBM1026,
  CSIBM1097, CSIBM1112, CSIBM1122, CSIBM1123, CSIBM1124, CSIBM1129, CSIBM1130,
  CSIBM1132, CSIBM1133, CSIBM1137, CSIBM1140, CSIBM1141, CSIBM1142, CSIBM1143,
  CSIBM1144, CSIBM1145, CSIBM1146, CSIBM1147, CSIBM1148, CSIBM1149, CSIBM1153,
  CSIBM1154, CSIBM1155, CSIBM1156, CSIBM1157, CSIBM1158, CSIBM1160, CSIBM1161,
  CSIBM1163, CSIBM1164, CSIBM1166, CSIBM1167, CSIBM1364, CSIBM1371, CSIBM1388,
  CSIBM1390, CSIBM1399, CSIBM4517, CSIBM4899, CSIBM4909, CSIBM4971, CSIBM5347,
  CSIBM9030, CSIBM9066, CSIBM9448, CSIBM12712, CSIBM16804, CSIBM11621162,
  CSISO4UNITEDKINGDOM, CSISO10SWEDISH, CSISO11SWEDISHFORNAMES,
  CSISO14JISC6220RO, CSISO15ITALIAN, CSISO16PORTUGESE, CSISO17SPANISH,
  CSISO18GREEK7OLD, CSISO19LATINGREEK, CSISO21GERMAN, CSISO25FRENCH,
  CSISO27LATINGREEK1, CSISO49INIS, CSISO50INIS8, CSISO51INISCYRILLIC,
  CSISO58GB1988, CSISO60DANISHNORWEGIAN, CSISO60NORWEGIAN1, CSISO61NORWEGIAN2,
  CSISO69FRENCH, CSISO84PORTUGUESE2, CSISO85SPANISH2, CSISO86HUNGARIAN,
  CSISO88GREEK7, CSISO89ASMO449, CSISO90, CSISO92JISC62991984B, CSISO99NAPLPS,
  CSISO103T618BIT, CSISO111ECMACYRILLIC, CSISO121CANADIAN1, CSISO122CANADIAN2,
  CSISO139CSN369103, CSISO141JUSIB1002, CSISO143IECP271, CSISO150,
  CSISO150GREEKCCITT, CSISO151CUBA, CSISO153GOST1976874, CSISO646DANISH,
  CSISO2022CN, CSISO2022JP, CSISO2022JP2, CSISO2022KR, CSISO2033,
  CSISO5427CYRILLIC, CSISO5427CYRILLIC1981, CSISO5428GREEK, CSISO10367BOX,
  CSISOLATIN1, CSISOLATIN2, CSISOLATIN3, CSISOLATIN4, CSISOLATIN5, CSISOLATIN6,
  CSISOLATINARABIC, CSISOLATINCYRILLIC, 

[Bug c/67224] UTF-8 support for identifier names in GCC

2015-08-18 Thread ejolson at unr dot edu
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67224

--- Comment #16 from Eric ejolson at unr dot edu ---
With my second patch the command line must now include the options

-finput-charset=UTF-8 -fextended-identifiers -fexec-charset=UTF-8

or otherwise C99 will also be used for the default execution character set.  A
better way to keep string literals nearly 8-bit clean by default might be to
leave the default input and execution character sets as UTF-8 and to set the
internal character set to C99 only when -fextended-identifiers is selected.
Sorry for too many comments.  I'll post a new patch when
everything is ready and has been tested.


[Bug c/67224] UTF-8 support for identifier names in GCC

2015-08-18 Thread joseph at codesourcery dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67224

--- Comment #19 from joseph at codesourcery dot com joseph at codesourcery dot com ---
On Tue, 18 Aug 2015, ejolson at unr dot edu wrote:

> which illustrates that g++ does not process trigraphs inside raw string
> literals.  Admittedly I'm looking at the draft standard, but I don't think
> this

As stated in [lex.pptoken] in both C++11 and C++14: "Between the initial
and final double quote characters of the raw string, any transformations
performed in phases 1 and 2 (trigraphs, universal-character-names, and
line splicing) are reverted; this reversion shall apply before any d-char,
r-char, or delimiting parenthesis is identified."  Yes, the positioning
of this in the standard may be confusing.

That is, the effect is more or less as if trigraphs weren't processed 
inside raw strings (but the implementation involves undoing trigraph 
substitutions, as described in the standard).

I think the right way to implement UTF-8 in identifiers involves making 
lex_identifier handle UTF-8 (when extended identifiers are enabled), and 
making _cpp_lex_direct handle bytes with the high bit set as 
potentially[*] starting identifiers (requiring the same handling of 
normalization state as for the other cases of characters starting 
identifiers, of course).  If you do that, then raw strings and all the 
corner cases of spelling preservation fall out naturally (though they 
still need testcases added to the testsuite).

[*] I think the right rule for C is that UTF-8 for a character not allowed 
in identifiers should produce a preprocessing token on its own rather than 
an error for an invalid character in an identifier (and similarly, such a 
character after the start of the identifier should terminate the 
identifier and produce such a preprocessing token).  Unless and until 
someone implements the C++ phase 1 conversion to UCNs, it would seem 
reasonable to follow this rule for C++ as well.
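
The rule described above can be sketched in C roughly as follows.  This is
only an illustration and not cpplib code: decode_utf8, is_identifier_char and
classify_start are hypothetical names, the range table is a small subset of
the C11 Annex D ranges chosen for brevity, and treating malformed UTF-8 as a
stray token is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence from s (n bytes available).  Returns the number
   of bytes consumed and stores the code point in *cp, or returns 0 if the
   sequence is malformed, overlong, a surrogate, or truncated.  */
size_t
decode_utf8 (const unsigned char *s, size_t n, uint32_t *cp)
{
  size_t len, i;
  uint32_t c, min;

  assert (s != NULL && cp != NULL);
  if (n == 0)
    return 0;
  if (s[0] < 0x80)
    {
      *cp = s[0];
      return 1;
    }
  else if ((s[0] & 0xE0) == 0xC0) { len = 2; c = s[0] & 0x1F; min = 0x80; }
  else if ((s[0] & 0xF0) == 0xE0) { len = 3; c = s[0] & 0x0F; min = 0x800; }
  else if ((s[0] & 0xF8) == 0xF0) { len = 4; c = s[0] & 0x07; min = 0x10000; }
  else
    return 0;
  if (n < len)
    return 0;
  for (i = 1; i < len; i++)
    {
      if ((s[i] & 0xC0) != 0x80)
        return 0;
      c = (c << 6) | (s[i] & 0x3F);
    }
  if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
    return 0;
  *cp = c;
  return len;
}

/* Illustrative subset of the C11 Annex D identifier ranges (assumption:
   a real implementation would consult the full table).  */
int
is_identifier_char (uint32_t cp)
{
  return (cp >= 0x00C0 && cp <= 0x00D6)
         || (cp >= 0x00D8 && cp <= 0x00F6)
         || (cp >= 0x00F8 && cp <= 0x00FF)
         || (cp >= 0x0100 && cp <= 0x167F);
}

enum start_kind { START_OTHER, START_IDENTIFIER, START_STRAY_TOKEN };

/* Classify what begins at s: a byte with the high bit set potentially
   starts an identifier; a character not allowed in identifiers becomes a
   preprocessing token on its own instead of an error.  */
enum start_kind
classify_start (const unsigned char *s, size_t n)
{
  uint32_t cp;

  if (n == 0 || s[0] < 0x80)
    return START_OTHER;          /* ASCII is handled by the existing lexer.  */
  if (decode_utf8 (s, n, &cp) == 0)
    return START_STRAY_TOKEN;    /* Assumption: malformed input is stray.  */
  return is_identifier_char (cp) ? START_IDENTIFIER : START_STRAY_TOKEN;
}
```

With this table, the bytes C3 A4 (U+00E4) are classified as starting an
identifier, while C3 97 (U+00D7, the multiplication sign) are classified as a
stray token.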


[Bug c/67224] UTF-8 support for identifier names in GCC

2015-08-18 Thread ejolson at unr dot edu
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67224

--- Comment #18 from Eric ejolson at unr dot edu ---
Thanks Joseph for the clarification about the two different versions of iconv. 
I was admittedly confused about this until moments ago.  Anyway, I just
discovered that libiconv doesn't support conversions to and from the IBM1047
EBCDIC character set and this causes some of the regression tests to fail. 
Coupled with the fact that C99 isn't supported in the glibc version of iconv
this creates a little problem with my patch.

You mention a bigger problem which I had not thought about:  the C++ semantics
of raw strings.  Processing UCNs in C++ code apparently requires surprisingly
deep syntactic analysis.  Raw literals seem to appear in the gnu99 and gnu11
extensions to C as well.

Amusingly, if I understand the C++ specifications

www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3337.pdf

trigraphs are supposed to be interpreted before any other processing takes
place.  However, the simple code

#include <stdio.h>
int main(){
char p1[]="??/u00E4";
char p2[]=R"(??/u00E4)";
char p3[]=R"(\u00E4)";
printf("%s or %s or %s\n",p1,p2,p3);
return 0;
}

compiled with

$ g++ -std=c++11 pp.c

produces output

ä or ??/u00E4 or \u00E4

which illustrates that g++ does not process trigraphs inside raw string
literals.  Admittedly I'm looking at the draft standard, but I don't think this
is something which changed suddenly in the final draft.  Clearly, my patch
makes a further mess of raw string literals in gcc.  My first reaction is that
raw string literals were not well thought out, but I suppose bad standards are
sometimes better than no standards.  At any rate, there appears to be no easy
way of supporting both UTF-8 identifiers and raw string literals.

My plan for now is to take a break and keep my UTF-8 identifier support as a
one-line patch reliant on libiconv which breaks EBCDIC encodings and raw string
literals.

[Bug c/67224] UTF-8 support for identifier names in GCC

2015-08-17 Thread ejolson at unr dot edu
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67224

--- Comment #12 from Eric ejolson at unr dot edu ---
I'm glad to know people like Joseph are working on UTF-8 in gcc.  Last year I
spent a week adding UTF-8 input support to pcc.  At that time Microsoft Studio
and clang already supported UTF-8 input files and I expected that gcc would do
so in the next release.  As this didn't happen, a few months ago I looked into
it and developed a one-line patch to add this support to gcc.

It appears the C preprocessor falls back to libiconv when it encounters a
conversion not supported internally.  From what I can tell this is enabled by
default, though it is surely possible to disable it.

I'm aware that C strings are often used to store 8-bit data, for example, to
display various graphics characters from legacy code pages.  I will run the
regression tests as soon as possible to see what, if anything, has been broken
by my one-line patch.  UCN quoting of UTF-8 input should happen only if the
-finput-charset=UTF-8 flag is set and this is worth checking.


[Bug c/67224] UTF-8 support for identifier names in GCC

2015-08-17 Thread ejolson at unr dot edu
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67224

--- Comment #6 from Eric ejolson at unr dot edu ---
From the webpage (current as of Aug 17, 2015)

http://www.gnu.org/software/libiconv/

under *Details* it is described that the library provides support for the
following encodings:

Full Unicode
UTF-8
UCS-2, UCS-2BE, UCS-2LE
UCS-4, UCS-4BE, UCS-4LE
UTF-16, UTF-16BE, UTF-16LE
UTF-32, UTF-32BE, UTF-32LE
UTF-7
C99, JAVA 

Therefore, I don't understand the statement that libiconv doesn't support C99
or that it isn't, somehow, a character set.


[Bug c/67224] UTF-8 support for identifier names in GCC

2015-08-17 Thread ejolson at unr dot edu
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67224

--- Comment #7 from Eric ejolson at unr dot edu ---
Please look at the Raspberry Pi forum post linked in the original report for
more information about testing this patch.  As the text describes there, the
command line options 

-finput-charset=UTF-8 -fextended-identifiers

are both needed in order to compile a UTF-8 input file containing Unicode
identifiers.  I have included a small test program as another attachment.
Searching on "UTF-8 Identifiers in GCC" will turn up a number of people asking
for this feature and additional example code that uses UTF-8 identifiers.  The
document "Unicode for the PCC C99 Compiler" available at

http://pcc.ludd.ltu.se/documentation/

also contains example UTF-8 C99 input files which can be used to test the
compiler.  The one-line patch submitted above has also been tested in the sense
that the compiler still bootstraps and has no trouble compiling thousands of
lines of standard ASCII C input.

The patch inserts C99 in only one place as the uses of SOURCE_CHARSET are
conflicted and changing them all to C99 doesn't yield a working solution.  In
particular, the C99 in _cpp_convert_input should not be considered the source
character set appearing in the input files but rather an internal character set
suitable for later parsing.  As iconv is already a well debugged library, it
would appear the risks of this patch are minor.

Note, however, the following problem:  C99 is probably not the correct choice for
EBCDIC hosts.  In that case it might be possible to write UCNs using trigraphs
of the form ??/u and ??/U, however, as the number of people wanting
to compile C source files with identifiers encoded using UTF-EBCDIC is likely
zero, the easiest solution going forward is to modify the patch so it only
applies to non-EBCDIC hosts.  As there are already #ifdef's in the code to
check for this, this does not add any new complexity to the code base.


[Bug c/67224] UTF-8 support for identifier names in GCC

2015-08-17 Thread ejolson at unr dot edu
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67224

--- Comment #8 from Eric ejolson at unr dot edu ---
Created attachment 36196
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=36196&action=edit
Test program with UTF-8 identifiers...

Compile this test program using 

gcc \
-finput-charset=UTF-8 -fextended-identifiers \
-o circle circle.c

to check whether gcc can handle UTF-8 identifiers.


[Bug c/67224] UTF-8 support for identifier names in GCC

2015-08-17 Thread manu at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67224

--- Comment #9 from Manuel López-Ibáñez manu at gcc dot gnu.org ---
(In reply to Eric from comment #7)
> also contains example UTF-8 C99 input files which can be used to test the
> compiler.  The one-line patch submitted above has also been tested in the
> sense that the compiler still bootstraps and has no trouble compiling
> thousands of lines of standard ASCII C input.

I think what Joseph is saying is that your approach may work for the small
examples that you have tested, but it would break things that are working fine
right now (in particular raw string literals). Many of those things are not
tested by a gcc bootstrap (but some of them should be tested by the regression
testsuite, did you run that? Point 4 here:
https://gcc.gnu.org/wiki/GettingStarted#Basics:_Contributing_to_GCC_in_10_easy_steps
)

I hope Joseph can give you more details so you may try to implement this in the
proper way.

The only reason why GCC does not have UTF-8 support in identifiers is that no
one had time to implement it yet, so your help is appreciated.

[Bug c/67224] UTF-8 support for identifier names in GCC

2015-08-17 Thread manu at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67224

Manuel López-Ibáñez manu at gcc dot gnu.org changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
   Last reconfirmed||2015-08-17
 CC||manu at gcc dot gnu.org
 Ever confirmed|0   |1

--- Comment #10 from Manuel López-Ibáñez manu at gcc dot gnu.org ---
(In reply to Eric from comment #7)
> command line options
>
> -finput-charset=UTF-8 -fextended-identifiers
>
> are both needed in order to compile a UTF-8 input file containing unicode

Note also that since GCC 5.1: "The option -fextended-identifiers is now enabled
by default for C++, and for C99 and later C versions"
(https://gcc.gnu.org/gcc-5/changes.html) and the default C version is C11, thus
it is enabled by default.

[Bug c/67224] UTF-8 support for identifier names in GCC

2015-08-17 Thread joseph at codesourcery dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67224

--- Comment #11 from joseph at codesourcery dot com joseph at codesourcery dot com ---
Sorry, glibc iconv (not libiconv) doesn't handle C99.  So your patch 
would not work on any GNU host in normal configurations of GCC (libiconv 
is a completely separate package and is only likely to be used on non-GNU 
hosts such as Windows; on GNU hosts iconv from glibc is normally used,
although it's possible to use libiconv there).

You need to test cases such as that if a macro is defined twice, once with 
a UCN in its expansion and once with the equivalent character written in 
UTF-8, the difference in the expansion is diagnosed (whichever of all the 
valid UCNs for that character is the one used).  And that the original 
spelling appears on the right hand side of a definition output with -dD.  
And that if (in C but not, properly, C++) a string contains a backslash 
followed by an extended character, this is properly diagnosed as an 
invalid escape sequence rather than being treated as \\usomething or 
\\Usomething.  See the tests in my spelling preservation patch 
https://gcc.gnu.org/ml/gcc-patches/2014-11/msg00548.html.  (Stringizing 
isn't necessarily an issue here because of the special C rules about 
stringizing UCNs together with the C++ rule about converting to UCNs in 
phase 1 - the effect is that for C it's always OK to stringize as the 
extended character, though you can't stringize as a UCN if the extended 
character was originally written, while for C++ you have to stringize as a 
UCN.)  And then you need tests of C++ programs with extended characters 
inside raw strings (like c-c++-common/raw-string-*.c, but none of those 
cover extended characters at present).  And the patch needs to add all 
these tests to the testsuite.


[Bug c/67224] UTF-8 support for identifier names in GCC

2015-08-15 Thread pinskia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67224

--- Comment #3 from Andrew Pinski pinskia at gcc dot gnu.org ---
Have you tried -fextended-identifiers ?


[Bug c/67224] UTF-8 support for identifier names in GCC

2015-08-15 Thread pinskia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67224

--- Comment #2 from Andrew Pinski pinskia at gcc dot gnu.org ---
Related to bug 41374.


[Bug c/67224] UTF-8 support for identifier names in GCC

2015-08-15 Thread joseph at codesourcery dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67224

--- Comment #5 from joseph at codesourcery dot com joseph at codesourcery dot com ---
There is no C99 character set in glibc libiconv (after all, it's not a 
character set at all).  Converting extended characters to UCNs like that 
would in any case be correct for C++ (provided you also convert $ ` @ and 
control characters other than those in the basic source character set) but 
not for C - but for C++, it would be necessary to keep track of the 
conversions to revert them in raw string literals.  This requirement to 
revert such conversions in raw string literals (in C++14, see 2.5 
[lex.pptoken] paragraph 3: "Between the initial and final double quote
characters of the raw string, any transformations performed in phases 1
and 2 (trigraphs, universal-character-names, and line splicing) are
reverted; this reversion shall apply before any d-char, r-char, or
delimiting parenthesis is identified.") renders such an approach
non-viable (it would break things that currently work); the conversions to 
UCNs have to take place within cpplib, not through an external iconv 
conversion.

Note that cpplib identifier spelling preservation is now implemented
(https://gcc.gnu.org/ml/gcc-patches/2014-11/msg00548.html), which adds
other ways in which it should be visible whether an identifier was 
represented with UTF-8 or UCNs.
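
For readers unfamiliar with the conversion under discussion, the sketch below
shows what rewriting extended characters as UCNs amounts to.  It is an
illustration only: utf8_to_ucn is a hypothetical helper, not part of GCC,
glibc or libiconv, and the choice of uppercase hex digits is an assumption.
It leaves ASCII untouched and does not handle the additional characters
($ ` @ and control characters) mentioned above for C++.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Rewrite NUL-terminated UTF-8 text so that every non-ASCII character
   becomes a UCN: \uXXXX up to U+FFFF, \UXXXXXXXX above that.  Returns the
   number of bytes written (excluding the terminating NUL), or -1 if the
   input is malformed or the output buffer is too small.  */
int
utf8_to_ucn (const char *in, char *out, size_t outsize)
{
  const unsigned char *s = (const unsigned char *) in;
  size_t used = 0;

  assert (in != NULL && out != NULL);
  while (*s)
    {
      uint32_t cp;
      int extra, i, n;

      if (*s < 0x80)                { cp = *s;        extra = 0; }
      else if ((*s & 0xE0) == 0xC0) { cp = *s & 0x1F; extra = 1; }
      else if ((*s & 0xF0) == 0xE0) { cp = *s & 0x0F; extra = 2; }
      else if ((*s & 0xF8) == 0xF0) { cp = *s & 0x07; extra = 3; }
      else
        return -1;
      s++;
      for (i = 0; i < extra; i++, s++)
        {
          /* A NUL here also fails this test, so truncation is caught.  */
          if ((*s & 0xC0) != 0x80)
            return -1;
          cp = (cp << 6) | (*s & 0x3F);
        }

      if (cp < 0x80)
        n = snprintf (out + used, outsize - used, "%c", (int) cp);
      else if (cp <= 0xFFFF)
        n = snprintf (out + used, outsize - used, "\\u%04X", (unsigned) cp);
      else
        n = snprintf (out + used, outsize - used, "\\U%08X", (unsigned) cp);
      if (n < 0 || (size_t) n >= outsize - used)
        return -1;
      used += (size_t) n;
    }
  if (used >= outsize)
    return -1;
  out[used] = '\0';
  return (int) used;
}
```

A blanket rewrite of this kind discards the original spelling, which is why
it cannot be reverted inside raw string literals unless the conversions are
tracked within cpplib, as argued above.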


[Bug c/67224] UTF-8 support for identifier names in GCC

2015-08-15 Thread manu at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67224

--- Comment #4 from Manuel López-Ibáñez manu at gcc dot gnu.org ---
I cannot say anything about the correctness of the patch, but I would expect
such a patch to contain many testcases (at least similar to those that test for
UCNs, see https://gcc.gnu.org/ml/gcc-patches/2014-11/msg00337.html).  Patches
need to be bootstrapped and regression tested and submitted to gcc-patches
with a ChangeLog
(https://gcc.gnu.org/wiki/GettingStarted#Basics:_Contributing_to_GCC_in_10_easy_steps).
Please CC Joseph Myers when you submit.

[Bug c/67224] UTF-8 support for identifier names in GCC

2015-08-14 Thread ejolson at unr dot edu
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67224

--- Comment #1 from Eric ejolson at unr dot edu ---
To check whether the installed version of iconv has C99 support, type

  $ iconv --list | grep C99
  C99
  $

which means that iconv is recent enough.
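
The same check can also be made from C against whichever iconv implementation
a program is actually linked with, which matters because other comments in
this thread point out that glibc's iconv and GNU libiconv differ on C99.  The
following is a minimal sketch; iconv_supports is a hypothetical helper name,
and only the standard iconv_open and iconv_close calls are used.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/* Return 1 if the iconv implementation linked into this program can open a
   conversion from `from` to `to`, and 0 otherwise.  */
int
iconv_supports (const char *to, const char *from)
{
  iconv_t cd;

  assert (to != NULL && from != NULL);
  cd = iconv_open (to, from);
  if (cd == (iconv_t) -1)
    return 0;
  iconv_close (cd);
  return 1;
}
```

Calling iconv_supports ("C99", "UTF-8") then reports whether the conversion
relied on by the patch is available.  On systems where iconv lives in a
separate library, the program may need to be linked with -liconv.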