Ping: [PATCH v2] libcpp: Handle extended characters in user-defined literal suffix [PR103902]
May I please ping this patch again? I think it would be worthwhile to close
this gap in the support for UTF-8 sources. Thanks!

https://gcc.gnu.org/pipermail/gcc-patches/2023-March/613247.html

-Lewis

On Fri, Jun 2, 2023 at 9:45 AM Lewis Hyatt wrote:
>
> Hello-
>
> Ping please? Thanks.
> https://gcc.gnu.org/pipermail/gcc-patches/2023-March/613247.html
>
> -Lewis
>
> On Tue, May 2, 2023 at 9:27 AM Lewis Hyatt wrote:
> >
> > May I please ping this one? Thanks...
> > https://gcc.gnu.org/pipermail/gcc-patches/2023-March/613247.html
> >
> > On Thu, Mar 2, 2023 at 6:21 PM Lewis Hyatt wrote:
> > >
> > > The PR complains that we do not handle UTF-8 in the suffix for a
> > > user-defined literal, such as:
> > >
> > > bool operator ""_π (unsigned long long);
> > >
> > > In fact we don't handle any extended identifier characters there,
> > > whether UTF-8, UCNs, or the $ sign. We do handle it fine if the
> > > optional space after the "" tokens is included, since then the
> > > identifier is lexed in the "normal" way as its own token. But when it
> > > is lexed as part of the string token, this is handled in lex_string()
> > > with a one-off loop that is not aware of extended characters.
> > >
> > > This patch fixes it by adding a new function scan_cur_identifier()
> > > that can be used to lex an identifier while in the middle of lexing
> > > another token.
> > >
> > > BTW, the other place that has been mis-lexing identifiers is
> > > lex_identifier_intern(), which is used to implement #pragma push_macro
> > > and #pragma pop_macro. This does not support extended characters
> > > either. I will add that in a subsequent patch, because it can't
> > > directly reuse the new function, but rather needs to lex from a string
> > > instead of a cpp_buffer.
> > >
> > > With scan_cur_identifier(), we do also correctly warn about bidi and
> > > normalization issues in the extended identifiers comprising the suffix.
> > >
> > > libcpp/ChangeLog:
> > >
> > >     PR preprocessor/103902
> > >     * lex.cc (identifier_diagnostics_on_lex): New function refactoring
> > >     some common code.
> > >     (lex_identifier_intern): Use the new function.
> > >     (lex_identifier): Don't run identifier diagnostics here, rather let
> > >     the call site do it when needed.
> > >     (_cpp_lex_direct): Adjust the call sites of lex_identifier ()
> > >     accordingly.
> > >     (struct scan_id_result): New struct.
> > >     (scan_cur_identifier): New function.
> > >     (create_literal2): New function.
> > >     (lit_accum::create_literal2): New function.
> > >     (is_macro): Folded into new function...
> > >     (maybe_ignore_udl_macro_suffix): ...here.
> > >     (is_macro_not_literal_suffix): Folded likewise.
> > >     (lex_raw_string): Handle UTF-8 in UDL suffix via
> > >     scan_cur_identifier ().
> > >     (lex_string): Likewise.
> > >
> > > gcc/testsuite/ChangeLog:
> > >
> > >     PR preprocessor/103902
> > >     * g++.dg/cpp0x/udlit-extended-id-1.C: New test.
> > >     * g++.dg/cpp0x/udlit-extended-id-2.C: New test.
> > >     * g++.dg/cpp0x/udlit-extended-id-3.C: New test.
> > >     * g++.dg/cpp0x/udlit-extended-id-4.C: New test.
> > > ---
> > >
> > > Notes:
> > >     Hello-
> > >
> > >     This is the updated version of the patch, incorporating feedback
> > >     from Jakub and Jason, most recently discussed here:
> > >
> > >     https://gcc.gnu.org/pipermail/gcc-patches/2023-February/612073.html
> > >
> > >     Please let me know how it looks? It is simpler than before with the
> > >     new approach. Thanks!
> > >
> > >     One thing to note. As Jason clarified for me, in a usage like this:
> > >
> > >      #pragma GCC poison _x
> > >      const char * operator "" _x (const char *, unsigned long);
> > >
> > >     the space between the "" and the _x is currently allowed but will be
> > >     deprecated in C++23. GCC currently will complain about the poisoned
> > >     use of _x in this case, and this patch, which is just focused on
> > >     handling UTF-8 properly, does not change this. But it seems that it
> > >     would be correct not to apply poison in this case. I can try to
> > >     follow up with a patch to do so, if it seems worthwhile? Given the
> > >     syntax is deprecated, maybe it's not worth it...
> > >
> > >     For the time being, this patch does add a testcase for the above and
> > >     xfails it. For the case where no space is present, which is the part
> > >     touched by the present patch, existing behavior is preserved
> > >     correctly and no diagnostics such as poison are issued for the UDL
> > >     suffix. (Contrary to v1 of this patch.)
> > >
> > >     Thanks! bootstrap + regtested all languages on x86-64 Linux with
> > >     no regressions.
> > >
> > >     -Lewis
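To illustrate the distinction the patch description draws between a one-off ASCII loop and an identifier scan that understands extended characters, here is a minimal standalone sketch. It is not libcpp code: the names scan_suffix_ascii_only, scan_suffix_extended, and utf8_seq_len are hypothetical, and the UTF-8 check is a deliberate simplification that accepts any structurally valid multibyte sequence, whereas a real lexer would also consult the Unicode identifier tables and handle UCNs and the $ sign.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: ASCII-only identifier character test, standing in for
// the kind of one-off loop that is unaware of extended characters.
static bool is_ascii_id_char(unsigned char c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
        || (c >= '0' && c <= '9') || c == '_';
}

// Returns the number of bytes of suffix recognized by an ASCII-only loop
// starting at byte offset pos.
std::size_t scan_suffix_ascii_only(const std::string& src, std::size_t pos) {
    const std::size_t start = pos;
    while (pos < src.size() && is_ascii_id_char(static_cast<unsigned char>(src[pos])))
        ++pos;
    return pos - start;
}

// Returns the byte length of a structurally valid UTF-8 multibyte sequence at
// pos, or 0 if there is none. Assumption: any such sequence counts as an
// identifier character here; real code would check the code point itself.
std::size_t utf8_seq_len(const std::string& src, std::size_t pos) {
    const unsigned char lead = static_cast<unsigned char>(src[pos]);
    std::size_t len = 0;
    if ((lead & 0xE0) == 0xC0) len = 2;
    else if ((lead & 0xF0) == 0xE0) len = 3;
    else if ((lead & 0xF8) == 0xF0) len = 4;
    else return 0;
    if (pos + len > src.size()) return 0;
    for (std::size_t i = 1; i < len; ++i) {
        // Every continuation byte must look like 10xxxxxx.
        if ((static_cast<unsigned char>(src[pos + i]) & 0xC0) != 0x80) return 0;
    }
    return len;
}

// Returns the number of bytes of suffix recognized when multibyte UTF-8
// sequences are accepted alongside ASCII identifier characters.
std::size_t scan_suffix_extended(const std::string& src, std::size_t pos) {
    const std::size_t start = pos;
    while (pos < src.size()) {
        if (is_ascii_id_char(static_cast<unsigned char>(src[pos]))) {
            ++pos;
            continue;
        }
        const std::size_t n = utf8_seq_len(src, pos);
        if (n == 0) break;
        pos += n;
    }
    return pos - start;
}
```

For the source text ""_π (unsigned long long), with the scan starting just after the closing quote, the ASCII-only loop stops after the underscore (1 byte), while the extended scan covers _π (3 bytes in UTF-8). That truncated suffix is the sort of mis-lex the PR reports.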
Ping: [PATCH v2] libcpp: Handle extended characters in user-defined literal suffix [PR103902]
Hello-

Ping please? Thanks.
https://gcc.gnu.org/pipermail/gcc-patches/2023-March/613247.html

-Lewis

On Tue, May 2, 2023 at 9:27 AM Lewis Hyatt wrote:
>
> May I please ping this one? Thanks...
> https://gcc.gnu.org/pipermail/gcc-patches/2023-March/613247.html
>
> On Thu, Mar 2, 2023 at 6:21 PM Lewis Hyatt wrote:
> >
> > The PR complains that we do not handle UTF-8 in the suffix for a
> > user-defined literal, such as:
> >
> > bool operator ""_π (unsigned long long);
> >
> > In fact we don't handle any extended identifier characters there, whether
> > UTF-8, UCNs, or the $ sign. We do handle it fine if the optional space
> > after the "" tokens is included, since then the identifier is lexed in
> > the "normal" way as its own token. But when it is lexed as part of the
> > string token, this is handled in lex_string() with a one-off loop that is
> > not aware of extended characters.
> >
> > This patch fixes it by adding a new function scan_cur_identifier() that
> > can be used to lex an identifier while in the middle of lexing another
> > token.
> >
> > BTW, the other place that has been mis-lexing identifiers is
> > lex_identifier_intern(), which is used to implement #pragma push_macro
> > and #pragma pop_macro. This does not support extended characters either.
> > I will add that in a subsequent patch, because it can't directly reuse
> > the new function, but rather needs to lex from a string instead of a
> > cpp_buffer.
> >
> > With scan_cur_identifier(), we do also correctly warn about bidi and
> > normalization issues in the extended identifiers comprising the suffix.
> >
> > libcpp/ChangeLog:
> >
> >     PR preprocessor/103902
> >     * lex.cc (identifier_diagnostics_on_lex): New function refactoring
> >     some common code.
> >     (lex_identifier_intern): Use the new function.
> >     (lex_identifier): Don't run identifier diagnostics here, rather let
> >     the call site do it when needed.
> >     (_cpp_lex_direct): Adjust the call sites of lex_identifier ()
> >     accordingly.
> >     (struct scan_id_result): New struct.
> >     (scan_cur_identifier): New function.
> >     (create_literal2): New function.
> >     (lit_accum::create_literal2): New function.
> >     (is_macro): Folded into new function...
> >     (maybe_ignore_udl_macro_suffix): ...here.
> >     (is_macro_not_literal_suffix): Folded likewise.
> >     (lex_raw_string): Handle UTF-8 in UDL suffix via
> >     scan_cur_identifier ().
> >     (lex_string): Likewise.
> >
> > gcc/testsuite/ChangeLog:
> >
> >     PR preprocessor/103902
> >     * g++.dg/cpp0x/udlit-extended-id-1.C: New test.
> >     * g++.dg/cpp0x/udlit-extended-id-2.C: New test.
> >     * g++.dg/cpp0x/udlit-extended-id-3.C: New test.
> >     * g++.dg/cpp0x/udlit-extended-id-4.C: New test.
> > ---
> >
> > Notes:
> >     Hello-
> >
> >     This is the updated version of the patch, incorporating feedback from
> >     Jakub and Jason, most recently discussed here:
> >
> >     https://gcc.gnu.org/pipermail/gcc-patches/2023-February/612073.html
> >
> >     Please let me know how it looks? It is simpler than before with the
> >     new approach. Thanks!
> >
> >     One thing to note. As Jason clarified for me, in a usage like this:
> >
> >      #pragma GCC poison _x
> >      const char * operator "" _x (const char *, unsigned long);
> >
> >     the space between the "" and the _x is currently allowed but will be
> >     deprecated in C++23. GCC currently will complain about the poisoned
> >     use of _x in this case, and this patch, which is just focused on
> >     handling UTF-8 properly, does not change this. But it seems that it
> >     would be correct not to apply poison in this case. I can try to
> >     follow up with a patch to do so, if it seems worthwhile? Given the
> >     syntax is deprecated, maybe it's not worth it...
> >
> >     For the time being, this patch does add a testcase for the above and
> >     xfails it. For the case where no space is present, which is the part
> >     touched by the present patch, existing behavior is preserved
> >     correctly and no diagnostics such as poison are issued for the UDL
> >     suffix. (Contrary to v1 of this patch.)
> >
> >     Thanks! bootstrap + regtested all languages on x86-64 Linux with
> >     no regressions.
> >
> >     -Lewis
> >
> >  .../g++.dg/cpp0x/udlit-extended-id-1.C | 68
> >  .../g++.dg/cpp0x/udlit-extended-id-2.C | 6 +
> >  .../g++.dg/cpp0x/udlit-extended-id-3.C | 15 +
> >  .../g++.dg/cpp0x/udlit-extended-id-4.C | 14 +
> >  libcpp/lex.cc | 382 ++
> >  5 files changed, 317 insertions(+), 168 deletions(-)
> >  create mode 100644 gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-1.C
> >  create mode 100644 gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-2.C
> >  create mode 100644
Ping: [PATCH v2] libcpp: Handle extended characters in user-defined literal suffix [PR103902]
May I please ping this one? Thanks...
https://gcc.gnu.org/pipermail/gcc-patches/2023-March/613247.html

On Thu, Mar 2, 2023 at 6:21 PM Lewis Hyatt wrote:
>
> The PR complains that we do not handle UTF-8 in the suffix for a user-defined
> literal, such as:
>
> bool operator ""_π (unsigned long long);
>
> In fact we don't handle any extended identifier characters there, whether
> UTF-8, UCNs, or the $ sign. We do handle it fine if the optional space after
> the "" tokens is included, since then the identifier is lexed in the "normal"
> way as its own token. But when it is lexed as part of the string token, this
> is handled in lex_string() with a one-off loop that is not aware of extended
> characters.
>
> This patch fixes it by adding a new function scan_cur_identifier() that can be
> used to lex an identifier while in the middle of lexing another token.
>
> BTW, the other place that has been mis-lexing identifiers is
> lex_identifier_intern(), which is used to implement #pragma push_macro
> and #pragma pop_macro. This does not support extended characters either.
> I will add that in a subsequent patch, because it can't directly reuse the
> new function, but rather needs to lex from a string instead of a cpp_buffer.
>
> With scan_cur_identifier(), we do also correctly warn about bidi and
> normalization issues in the extended identifiers comprising the suffix.
>
> libcpp/ChangeLog:
>
>     PR preprocessor/103902
>     * lex.cc (identifier_diagnostics_on_lex): New function refactoring
>     some common code.
>     (lex_identifier_intern): Use the new function.
>     (lex_identifier): Don't run identifier diagnostics here, rather let
>     the call site do it when needed.
>     (_cpp_lex_direct): Adjust the call sites of lex_identifier ()
>     accordingly.
>     (struct scan_id_result): New struct.
>     (scan_cur_identifier): New function.
>     (create_literal2): New function.
>     (lit_accum::create_literal2): New function.
>     (is_macro): Folded into new function...
>     (maybe_ignore_udl_macro_suffix): ...here.
>     (is_macro_not_literal_suffix): Folded likewise.
>     (lex_raw_string): Handle UTF-8 in UDL suffix via scan_cur_identifier ().
>     (lex_string): Likewise.
>
> gcc/testsuite/ChangeLog:
>
>     PR preprocessor/103902
>     * g++.dg/cpp0x/udlit-extended-id-1.C: New test.
>     * g++.dg/cpp0x/udlit-extended-id-2.C: New test.
>     * g++.dg/cpp0x/udlit-extended-id-3.C: New test.
>     * g++.dg/cpp0x/udlit-extended-id-4.C: New test.
> ---
>
> Notes:
>     Hello-
>
>     This is the updated version of the patch, incorporating feedback from
>     Jakub and Jason, most recently discussed here:
>
>     https://gcc.gnu.org/pipermail/gcc-patches/2023-February/612073.html
>
>     Please let me know how it looks? It is simpler than before with the new
>     approach. Thanks!
>
>     One thing to note. As Jason clarified for me, in a usage like this:
>
>      #pragma GCC poison _x
>      const char * operator "" _x (const char *, unsigned long);
>
>     the space between the "" and the _x is currently allowed but will be
>     deprecated in C++23. GCC currently will complain about the poisoned use
>     of _x in this case, and this patch, which is just focused on handling
>     UTF-8 properly, does not change this. But it seems that it would be
>     correct not to apply poison in this case. I can try to follow up with a
>     patch to do so, if it seems worthwhile? Given the syntax is deprecated,
>     maybe it's not worth it...
>
>     For the time being, this patch does add a testcase for the above and
>     xfails it. For the case where no space is present, which is the part
>     touched by the present patch, existing behavior is preserved correctly
>     and no diagnostics such as poison are issued for the UDL suffix.
>     (Contrary to v1 of this patch.)
>
>     Thanks! bootstrap + regtested all languages on x86-64 Linux with
>     no regressions.
>
>     -Lewis
>
>  .../g++.dg/cpp0x/udlit-extended-id-1.C | 68
>  .../g++.dg/cpp0x/udlit-extended-id-2.C | 6 +
>  .../g++.dg/cpp0x/udlit-extended-id-3.C | 15 +
>  .../g++.dg/cpp0x/udlit-extended-id-4.C | 14 +
>  libcpp/lex.cc | 382 ++
>  5 files changed, 317 insertions(+), 168 deletions(-)
>  create mode 100644 gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-1.C
>  create mode 100644 gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-2.C
>  create mode 100644 gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-3.C
>  create mode 100644 gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-4.C
>
> diff --git a/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-1.C b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-1.C
> new file mode 100644
> index 000..411d4fdd0ba
> --- /dev/null
> +++ b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-1.C
> @@ -0,0 +1,68 @@
> +// { dg-do run { target c++11
Ping: [PATCH v2] libcpp: Handle extended characters in user-defined literal suffix [PR103902]
May I please ping this one?
https://gcc.gnu.org/pipermail/gcc-patches/2023-March/613247.html
Thanks!

-Lewis

On Thu, Mar 2, 2023 at 6:21 PM Lewis Hyatt wrote:
>
> The PR complains that we do not handle UTF-8 in the suffix for a user-defined
> literal, such as:
>
> bool operator ""_π (unsigned long long);
>
> In fact we don't handle any extended identifier characters there, whether
> UTF-8, UCNs, or the $ sign. We do handle it fine if the optional space after
> the "" tokens is included, since then the identifier is lexed in the "normal"
> way as its own token. But when it is lexed as part of the string token, this
> is handled in lex_string() with a one-off loop that is not aware of extended
> characters.
>
> This patch fixes it by adding a new function scan_cur_identifier() that can be
> used to lex an identifier while in the middle of lexing another token.
>
> BTW, the other place that has been mis-lexing identifiers is
> lex_identifier_intern(), which is used to implement #pragma push_macro
> and #pragma pop_macro. This does not support extended characters either.
> I will add that in a subsequent patch, because it can't directly reuse the
> new function, but rather needs to lex from a string instead of a cpp_buffer.
>
> With scan_cur_identifier(), we do also correctly warn about bidi and
> normalization issues in the extended identifiers comprising the suffix.
>
> libcpp/ChangeLog:
>
>     PR preprocessor/103902
>     * lex.cc (identifier_diagnostics_on_lex): New function refactoring
>     some common code.
>     (lex_identifier_intern): Use the new function.
>     (lex_identifier): Don't run identifier diagnostics here, rather let
>     the call site do it when needed.
>     (_cpp_lex_direct): Adjust the call sites of lex_identifier ()
>     accordingly.
>     (struct scan_id_result): New struct.
>     (scan_cur_identifier): New function.
>     (create_literal2): New function.
>     (lit_accum::create_literal2): New function.
>     (is_macro): Folded into new function...
>     (maybe_ignore_udl_macro_suffix): ...here.
>     (is_macro_not_literal_suffix): Folded likewise.
>     (lex_raw_string): Handle UTF-8 in UDL suffix via scan_cur_identifier ().
>     (lex_string): Likewise.
>
> gcc/testsuite/ChangeLog:
>
>     PR preprocessor/103902
>     * g++.dg/cpp0x/udlit-extended-id-1.C: New test.
>     * g++.dg/cpp0x/udlit-extended-id-2.C: New test.
>     * g++.dg/cpp0x/udlit-extended-id-3.C: New test.
>     * g++.dg/cpp0x/udlit-extended-id-4.C: New test.
> ---
>
> Notes:
>     Hello-
>
>     This is the updated version of the patch, incorporating feedback from
>     Jakub and Jason, most recently discussed here:
>
>     https://gcc.gnu.org/pipermail/gcc-patches/2023-February/612073.html
>
>     Please let me know how it looks? It is simpler than before with the new
>     approach. Thanks!
>
>     One thing to note. As Jason clarified for me, in a usage like this:
>
>      #pragma GCC poison _x
>      const char * operator "" _x (const char *, unsigned long);
>
>     the space between the "" and the _x is currently allowed but will be
>     deprecated in C++23. GCC currently will complain about the poisoned use
>     of _x in this case, and this patch, which is just focused on handling
>     UTF-8 properly, does not change this. But it seems that it would be
>     correct not to apply poison in this case. I can try to follow up with a
>     patch to do so, if it seems worthwhile? Given the syntax is deprecated,
>     maybe it's not worth it...
>
>     For the time being, this patch does add a testcase for the above and
>     xfails it. For the case where no space is present, which is the part
>     touched by the present patch, existing behavior is preserved correctly
>     and no diagnostics such as poison are issued for the UDL suffix.
>     (Contrary to v1 of this patch.)
>
>     Thanks! bootstrap + regtested all languages on x86-64 Linux with
>     no regressions.
>
>     -Lewis
>
>  .../g++.dg/cpp0x/udlit-extended-id-1.C | 68
>  .../g++.dg/cpp0x/udlit-extended-id-2.C | 6 +
>  .../g++.dg/cpp0x/udlit-extended-id-3.C | 15 +
>  .../g++.dg/cpp0x/udlit-extended-id-4.C | 14 +
>  libcpp/lex.cc | 382 ++
>  5 files changed, 317 insertions(+), 168 deletions(-)
>  create mode 100644 gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-1.C
>  create mode 100644 gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-2.C
>  create mode 100644 gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-3.C
>  create mode 100644 gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-4.C
>
> diff --git a/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-1.C b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-1.C
> new file mode 100644
> index 000..411d4fdd0ba
> --- /dev/null
> +++ b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-1.C
> @@ -0,0 +1,68 @@
> +// { dg-do run { target