Ping: [PATCH v2] libcpp: Handle extended characters in user-defined literal suffix [PR103902]

2023-07-11 Thread Lewis Hyatt via Gcc-patches
May I please ping this patch again? I think it would be worthwhile to
close this gap in the support for UTF-8 sources. Thanks!
https://gcc.gnu.org/pipermail/gcc-patches/2023-March/613247.html

-Lewis

On Fri, Jun 2, 2023 at 9:45 AM Lewis Hyatt  wrote:
>
> Hello-
>
> Ping please? Thanks.
> https://gcc.gnu.org/pipermail/gcc-patches/2023-March/613247.html
>
> -Lewis
>
> On Tue, May 2, 2023 at 9:27 AM Lewis Hyatt  wrote:
> >
> > May I please ping this one? Thanks...
> > https://gcc.gnu.org/pipermail/gcc-patches/2023-March/613247.html
> >
> > On Thu, Mar 2, 2023 at 6:21 PM Lewis Hyatt  wrote:
> > >
> > > > The PR complains that we do not handle UTF-8 in the suffix for a
> > > > user-defined literal, such as:
> > > >
> > > > bool operator ""_π (unsigned long long);
> > > >
> > > > In fact we don't handle any extended identifier characters there, whether
> > > > UTF-8, UCNs, or the $ sign. We do handle it fine if the optional space
> > > > after the "" tokens is included, since then the identifier is lexed in the
> > > > "normal" way as its own token. But when it is lexed as part of the string
> > > > token, this is handled in lex_string() with a one-off loop that is not
> > > > aware of extended characters.
> > > >
> > > > This patch fixes it by adding a new function scan_cur_identifier() that
> > > > can be used to lex an identifier while in the middle of lexing another
> > > > token.
> > >
> > > BTW, the other place that has been mis-lexing identifiers is
> > > lex_identifier_intern(), which is used to implement #pragma push_macro
> > > and #pragma pop_macro. This does not support extended characters either.
> > > > I will add that in a subsequent patch, because it can't directly reuse the
> > > > new function, but rather needs to lex from a string instead of a
> > > > cpp_buffer.
> > >
> > > With scan_cur_identifier(), we do also correctly warn about bidi and
> > > normalization issues in the extended identifiers comprising the suffix.
> > >
> > > libcpp/ChangeLog:
> > >
> > > PR preprocessor/103902
> > > * lex.cc (identifier_diagnostics_on_lex): New function refactoring
> > > some common code.
> > > (lex_identifier_intern): Use the new function.
> > > > (lex_identifier): Don't run identifier diagnostics here, rather let
> > > > the call site do it when needed.
> > > > (_cpp_lex_direct): Adjust the call sites of lex_identifier ()
> > > > accordingly.
> > > (struct scan_id_result): New struct.
> > > (scan_cur_identifier): New function.
> > > (create_literal2): New function.
> > > (lit_accum::create_literal2): New function.
> > > (is_macro): Folded into new function...
> > > (maybe_ignore_udl_macro_suffix): ...here.
> > > (is_macro_not_literal_suffix): Folded likewise.
> > > (lex_raw_string): Handle UTF-8 in UDL suffix via 
> > > scan_cur_identifier ().
> > > (lex_string): Likewise.
> > >
> > > gcc/testsuite/ChangeLog:
> > >
> > > PR preprocessor/103902
> > > * g++.dg/cpp0x/udlit-extended-id-1.C: New test.
> > > * g++.dg/cpp0x/udlit-extended-id-2.C: New test.
> > > * g++.dg/cpp0x/udlit-extended-id-3.C: New test.
> > > * g++.dg/cpp0x/udlit-extended-id-4.C: New test.
> > > ---
> > >
> > > Notes:
> > > Hello-
> > >
> > > > This is the updated version of the patch, incorporating feedback from
> > > > Jakub and Jason, most recently discussed here:
> > > >
> > > > https://gcc.gnu.org/pipermail/gcc-patches/2023-February/612073.html
> > > >
> > > > Please let me know how it looks? It is simpler than before with the new
> > > > approach. Thanks!
> > >
> > > One thing to note. As Jason clarified for me, a usage like this:
> > >
> > >  #pragma GCC poison _x
> > > const char * operator "" _x (const char *, unsigned long);
> > >
> > > > The space between the "" and the _x is currently allowed but will be
> > > > deprecated in C++23. GCC currently will complain about the poisoned use
> > > > of _x in this case, and this patch, which is just focused on handling
> > > > UTF-8 properly, does not change this. But it seems that it would be
> > > > correct not to apply poison in this case. I can try to follow up with a
> > > > patch to do so, if it seems worthwhile? Given the syntax is deprecated,
> > > > maybe it's not worth it...
> > > >
> > > > For the time being, this patch does add a testcase for the above and
> > > > xfails it. For the case where no space is present, which is the part
> > > > touched by the present patch, existing behavior is preserved correctly and
> > > > no diagnostics such as poison are issued for the UDL suffix. (Contrary to
> > > > v1 of this patch.)
> > >
> > > Thanks! bootstrap + regtested all languages on x86-64 Linux with
> > > no regressions.
> > >
> > > -Lewis
> > >
> > >  

Ping: [PATCH v2] libcpp: Handle extended characters in user-defined literal suffix [PR103902]

2023-06-02 Thread Lewis Hyatt via Gcc-patches
Hello-

Ping please? Thanks.
https://gcc.gnu.org/pipermail/gcc-patches/2023-March/613247.html

-Lewis


Ping: [PATCH v2] libcpp: Handle extended characters in user-defined literal suffix [PR103902]

2023-05-02 Thread Lewis Hyatt via Gcc-patches
May I please ping this one? Thanks...
https://gcc.gnu.org/pipermail/gcc-patches/2023-March/613247.html


Ping: [PATCH v2] libcpp: Handle extended characters in user-defined literal suffix [PR103902]

2023-04-04 Thread Lewis Hyatt via Gcc-patches
May I please ping this one?
https://gcc.gnu.org/pipermail/gcc-patches/2023-March/613247.html

Thanks!

-Lewis

On Thu, Mar 2, 2023 at 6:21 PM Lewis Hyatt  wrote:
>
> The PR complains that we do not handle UTF-8 in the suffix for a user-defined
> literal, such as:
>
> bool operator ""_π (unsigned long long);
>
> In fact we don't handle any extended identifier characters there, whether
> UTF-8, UCNs, or the $ sign. We do handle it fine if the optional space after
> the "" tokens is included, since then the identifier is lexed in the "normal"
> way as its own token. But when it is lexed as part of the string token, this
> is handled in lex_string() with a one-off loop that is not aware of extended
> characters.
>
> This patch fixes it by adding a new function scan_cur_identifier() that can be
> used to lex an identifier while in the middle of lexing another token.
>
> BTW, the other place that has been mis-lexing identifiers is
> lex_identifier_intern(), which is used to implement #pragma push_macro
> and #pragma pop_macro. This does not support extended characters either.
> I will add that in a subsequent patch, because it can't directly reuse the
> new function, but rather needs to lex from a string instead of a cpp_buffer.
>
> With scan_cur_identifier(), we do also correctly warn about bidi and
> normalization issues in the extended identifiers comprising the suffix.
>
> libcpp/ChangeLog:
>
> PR preprocessor/103902
> * lex.cc (identifier_diagnostics_on_lex): New function refactoring
> some common code.
> (lex_identifier_intern): Use the new function.
> (lex_identifier): Don't run identifier diagnostics here, rather let
> the call site do it when needed.
> (_cpp_lex_direct): Adjust the call sites of lex_identifier ()
> accordingly.
> (struct scan_id_result): New struct.
> (scan_cur_identifier): New function.
> (create_literal2): New function.
> (lit_accum::create_literal2): New function.
> (is_macro): Folded into new function...
> (maybe_ignore_udl_macro_suffix): ...here.
> (is_macro_not_literal_suffix): Folded likewise.
> (lex_raw_string): Handle UTF-8 in UDL suffix via
> scan_cur_identifier ().
> (lex_string): Likewise.
>
> gcc/testsuite/ChangeLog:
>
> PR preprocessor/103902
> * g++.dg/cpp0x/udlit-extended-id-1.C: New test.
> * g++.dg/cpp0x/udlit-extended-id-2.C: New test.
> * g++.dg/cpp0x/udlit-extended-id-3.C: New test.
> * g++.dg/cpp0x/udlit-extended-id-4.C: New test.
> ---
>
> Notes:
> Hello-
>
> This is the updated version of the patch, incorporating feedback from
> Jakub and Jason, most recently discussed here:
>
> https://gcc.gnu.org/pipermail/gcc-patches/2023-February/612073.html
>
> Please let me know how it looks? It is simpler than before with the new
> approach. Thanks!
>
> One thing to note. As Jason clarified for me, a usage like this:
>
>  #pragma GCC poison _x
> const char * operator "" _x (const char *, unsigned long);
>
> The space between the "" and the _x is currently allowed but will be
> deprecated in C++23. GCC currently will complain about the poisoned use of
> _x in this case, and this patch, which is just focused on handling UTF-8
> properly, does not change this. But it seems that it would be correct
> not to apply poison in this case. I can try to follow up with a patch to
> do so, if it seems worthwhile? Given the syntax is deprecated, maybe it's
> not worth it...
>
> For the time being, this patch does add a testcase for the above and
> xfails it. For the case where no space is present, which is the part touched
> by the present patch, existing behavior is preserved correctly and no
> diagnostics such as poison are issued for the UDL suffix. (Contrary to v1 of
> this patch.)
>
> Thanks! bootstrap + regtested all languages on x86-64 Linux with
> no regressions.
>
> -Lewis
>
>  .../g++.dg/cpp0x/udlit-extended-id-1.C|  68 
>  .../g++.dg/cpp0x/udlit-extended-id-2.C|   6 +
>  .../g++.dg/cpp0x/udlit-extended-id-3.C|  15 +
>  .../g++.dg/cpp0x/udlit-extended-id-4.C|  14 +
>  libcpp/lex.cc | 382 ++
>  5 files changed, 317 insertions(+), 168 deletions(-)
>  create mode 100644 gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-1.C
>  create mode 100644 gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-2.C
>  create mode 100644 gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-3.C
>  create mode 100644 gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-4.C
>
> diff --git a/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-1.C b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-1.C
> new file mode 100644
> index 000..411d4fdd0ba
> --- /dev/null
> +++ b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-1.C
> @@ -0,0 +1,68 @@
> +// { dg-do run { target