[PATCH] D106577: [clang] Define __STDC_ISO_10646__

2021-09-04 Thread ThePhD via Phabricator via cfe-commits
ThePhD added a comment.

Hi, my name is JeanHeyd Meneide. I'm the Project Editor for C, but more 
importantly I'm the author of this paper: 
http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2728.htm

This paper was accepted yesterday (September 3rd, 2021) into the C Standard, 
and (after I merge it and the like ~25 other papers + Annex X I need to merge), 
will appear in the next Draft of the C Standard.

As the paper's introduction and motivation note, the interpretation above, 
which ties the macro to the locale-dependent encoding of `wchar_t` strings and 
`char` (MBS) strings used by runtime functions like `mbstowcs` and `wcstombs`, 
was not only a little bit silly but also impossible to enforce properly on most 
systems without severe undue burden.

The wording of the paper explicitly removes the tie-in of the encoding of 
string literals and wide string literals to the library functions and instead 
makes them implementation-defined. This has no behavior change on any platform 
(it is, in a very strict sense, an expansion of the current definition and a 
standardization of existing practice amongst all implementations). What it does 
mean, however, is that Clang and every other compiler, so long as they pick an 
ISO 10646 code point capable encoding for their `wchar_t` literals, can define 
this preprocessor macro unconditionally. My understanding is that on most 
systems where things have not been patched / tweaked, this applies since Clang 
vastly prefers UTF-32 in most of its setups.

It is my strong recommendation this patch be accepted and made unconditional, 
both in anticipation of the upcoming standard and the widespread existing 
practice.
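
As a rough illustration of what user code could do once the macro is defined 
unconditionally, here is a minimal C++ sketch. The helper name 
`wide_literals_are_iso10646` is made up for this example; only 
`__STDC_ISO_10646__` itself comes from the standard. The sketch assumes that, 
when the macro is defined, a wide character literal written as a 
universal-character-name holds the code point value.

```cpp
// Hypothetical helper: reports whether the implementation promises that
// wchar_t values are ISO 10646 code points.
constexpr bool wide_literals_are_iso10646() {
#if defined(__STDC_ISO_10646__)
    return true;
#else
    return false;
#endif
}

#if defined(__STDC_ISO_10646__)
// U+00E9 is spelled as a universal-character-name so that the check does not
// depend on the encoding of this source file.
static_assert(L'\u00e9' == 0xE9,
              "wide literal is expected to hold the code point value");
#endif
```

On a platform that does not define the macro, the helper simply returns 
`false` and the `static_assert` is skipped.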


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D106577/new/

https://reviews.llvm.org/D106577

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D93031: Enable fexec-charset option

2021-04-20 Thread ThePhD via Phabricator via cfe-commits
ThePhD added a comment.

Just a tiny comment: could you please make sure the name of the resolved 
encoding is also propagated to InitPreprocessor.cpp that sets the 
`__clang_literal_encoding__` macro? 
(https://github.com/llvm/llvm-project/blob/main/clang/lib/Frontend/InitPreprocessor.cpp#L784)


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D93031/new/

https://reviews.llvm.org/D93031



[PATCH] D100346: [Clang] String Literal and Wide String Literal Encoding from the Preprocessor

2021-04-13 Thread ThePhD via Phabricator via cfe-commits
ThePhD added a comment.

In D100346#2686342, @aaron.ballman wrote:

> Okay, I'm sold. Thank you for the detailed explanation! The changes LGTM. Do 
> you have commit privileges or would you like me to commit on your behalf? (If 
> you'd like me to commit, what email address and name would you like me to use 
> for commit attribution?)

I don't have commit privileges! You should use `ThePhD 
`.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D100346/new/

https://reviews.llvm.org/D100346



[PATCH] D100346: [Clang] String Literal and Wide String Literal Encoding from the Preprocessor

2021-04-13 Thread ThePhD via Phabricator via cfe-commits
ThePhD updated this revision to Diff 337178.
ThePhD set the repository for this revision to rG LLVM Github Monorepo.
ThePhD added a comment.

Oops, almost forgot the doc fixes for the *wide__* macro!


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D100346/new/

https://reviews.llvm.org/D100346

Files:
  clang/docs/LanguageExtensions.rst
  clang/lib/Frontend/InitPreprocessor.cpp
  clang/test/Preprocessor/init-x86.c
  clang/test/Preprocessor/init.c


Index: clang/test/Preprocessor/init.c
===
--- clang/test/Preprocessor/init.c
+++ clang/test/Preprocessor/init.c
@@ -115,10 +115,12 @@
 // COMMON:#define __STDC__ 1
 // COMMON:#define __VERSION__ {{.*}}
 // COMMON:#define __clang__ 1
+// COMMON:#define __clang_literal_encoding__ {{.*}}
 // COMMON:#define __clang_major__ {{[0-9]+}}
 // COMMON:#define __clang_minor__ {{[0-9]+}}
 // COMMON:#define __clang_patchlevel__ {{[0-9]+}}
 // COMMON:#define __clang_version__ {{.*}}
+// COMMON:#define __clang_wide_literal_encoding__ {{.*}}
 // COMMON:#define __llvm__ 1
 //
 // RUN: %clang_cc1 -E -dM -triple=x86_64-pc-win32 < /dev/null | FileCheck -match-full-lines -check-prefix C-DEFAULT %s
@@ -1844,10 +1846,12 @@
 // WEBASSEMBLY-NOT:#define __WINT_UNSIGNED__
 // WEBASSEMBLY-NEXT:#define __WINT_WIDTH__ 32
 // WEBASSEMBLY-NEXT:#define __clang__ 1
+// WEBASSEMBLY-NEXT:#define __clang_literal_encoding__ {{.*}}
 // WEBASSEMBLY-NEXT:#define __clang_major__ {{.*}}
 // WEBASSEMBLY-NEXT:#define __clang_minor__ {{.*}}
 // WEBASSEMBLY-NEXT:#define __clang_patchlevel__ {{.*}}
 // WEBASSEMBLY-NEXT:#define __clang_version__ "{{.*}}"
+// WEBASSEMBLY-NEXT:#define __clang_wide_literal_encoding__ {{.*}}
 // WEBASSEMBLY-NEXT:#define __llvm__ 1
 // WEBASSEMBLY-NOT:#define __unix
 // WEBASSEMBLY-NOT:#define __unix__
Index: clang/test/Preprocessor/init-x86.c
===
--- clang/test/Preprocessor/init-x86.c
+++ clang/test/Preprocessor/init-x86.c
@@ -1306,10 +1306,12 @@
 // X86_64-CLOUDABI:#define __amd64 1
 // X86_64-CLOUDABI:#define __amd64__ 1
 // X86_64-CLOUDABI:#define __clang__ 1
+// X86_64-CLOUDABI:#define __clang_literal_encoding__ {{.*}}
 // X86_64-CLOUDABI:#define __clang_major__ {{.*}}
 // X86_64-CLOUDABI:#define __clang_minor__ {{.*}}
 // X86_64-CLOUDABI:#define __clang_patchlevel__ {{.*}}
 // X86_64-CLOUDABI:#define __clang_version__ {{.*}}
+// X86_64-CLOUDABI:#define __clang_wide_literal_encoding__ {{.*}}
 // X86_64-CLOUDABI:#define __llvm__ 1
 // X86_64-CLOUDABI:#define __x86_64 1
 // X86_64-CLOUDABI:#define __x86_64__ 1
Index: clang/lib/Frontend/InitPreprocessor.cpp
===
--- clang/lib/Frontend/InitPreprocessor.cpp
+++ clang/lib/Frontend/InitPreprocessor.cpp
@@ -778,6 +778,21 @@
     }
   }
 
+  // Macros to help identify the narrow and wide character sets
+  // FIXME: clang currently ignores -fexec-charset=. If this changes,
+  // then this may need to be updated.
+  Builder.defineMacro("__clang_literal_encoding__", "\"UTF-8\"");
+  if (TI.getTypeWidth(TI.getWCharType()) >= 32) {
+    // FIXME: 32-bit wchar_t signals UTF-32. This may change
+    // if -fwide-exec-charset= is ever supported.
+    Builder.defineMacro("__clang_wide_literal_encoding__", "\"UTF-32\"");
+  } else {
+    // FIXME: Less-than 32-bit wchar_t generally means UTF-16
+    // (e.g., Windows, 32-bit IBM). This may need to be
+    // updated if -fwide-exec-charset= is ever supported.
+    Builder.defineMacro("__clang_wide_literal_encoding__", "\"UTF-16\"");
+  }
+
   if (LangOpts.Optimize)
     Builder.defineMacro("__OPTIMIZE__");
   if (LangOpts.OptimizeSize)
Index: clang/docs/LanguageExtensions.rst
===
--- clang/docs/LanguageExtensions.rst
+++ clang/docs/LanguageExtensions.rst
@@ -383,6 +383,18 @@
   Defined to a string that captures the Clang marketing version, including the
   Subversion tag or revision number, e.g., "``1.5 (trunk 102332)``".
 
+``__clang_literal_encoding__``
+  Defined to a narrow string literal that represents the current encoding of
+  narrow string literals, e.g., ``"hello"``. This macro typically expands to
+  "UTF-8" (but may change in the future if the
+  ``-fexec-charset="Encoding-Name"`` option is implemented.)
+
+``__clang_wide_literal_encoding__``
+  Defined to a narrow string literal that represents the current encoding of
+  wide string literals, e.g., ``L"hello"``. This macro typically expands to
+  "UTF-16" or "UTF-32" (but may change in the future if the
+  ``-fwide-exec-charset="Encoding-Name"`` option is implemented.)
+
 .. _langext-vectors:
 
 Vectors and Extended Vectors



[PATCH] D100346: [Clang] String Literal and Wide String Literal Encoding from the Preprocessor

2021-04-13 Thread ThePhD via Phabricator via cfe-commits
ThePhD marked 4 inline comments as done.
ThePhD added a comment.

In D100346#2685530, @aaron.ballman wrote:

> ...
>
> What about for folks using this from C where there isn't `constexpr` 
> functionality to help them?

The unfortunate bit is that C won't be able to guarantee compile-time culling 
of branches, even if all values present are `constexpr`. Implementations like 
Clang and GCC have optimizers powerful enough that uses of `strcmp` and 
`memcmp` can be recognized, turned into builtins, and then const-folded down, 
which can provide dead code elimination. Otherwise, C can't **guarantee** 
elision of the code branches without an integer constant expression provided 
by a macro. This is more or less a deficiency in how weak Constant Expressions 
are in C, leaving most people to rely on implementation-defined behavior for 
constant folding to do anything worthwhile with their toolset.
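
To make the point concrete, here is a small sketch of the `strcmp`-style check 
described above. It is written so that it compiles as C++, but the same 
`strcmp` call is what C code would use. The names `LITERAL_ENCODING_NAME`, 
`literals_are_utf8`, and `pick_conversion_path` are invented for this example, 
and `"unknown"` is just a placeholder for compilers that do not provide the 
macro. Whether the dead branch is removed is up to the optimizer; nothing in 
the language promises it.

```cpp
#include <cassert>
#include <cstring>

// Wrap the compiler-specific macro so the sketch also builds elsewhere.
#if defined(__clang_literal_encoding__)
#define LITERAL_ENCODING_NAME __clang_literal_encoding__
#else
#define LITERAL_ENCODING_NAME "unknown"  // placeholder chosen for this sketch
#endif

// Looks like a runtime comparison. Optimizers commonly fold it to a constant,
// but that is a quality-of-implementation matter, not a guarantee.
inline bool literals_are_utf8() {
    return std::strcmp(LITERAL_ENCODING_NAME, "UTF-8") == 0;
}

// Both branches must stay valid code, because neither is guaranteed to be
// culled at compile time.
inline const char* pick_conversion_path() {
    if (literals_are_utf8())
        return "no transcoding needed";
    return "transcode through a runtime converter";
}

// Small runtime sanity check for the sketch.
inline void check_conversion_path() {
    assert(pick_conversion_path() != nullptr);
}
```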

I think in the future, if we are really invested in this path, we should come 
up with a canonical mapping and a specific way of saying "if this is a 
recognized encoding name it has an Integer Constant Expression of value `X` as 
defined by table `Y` in the documentation". We could provide macros 
`__clang_literal_encoding_id__` and `__clang_wide_literal_encoding_id__` that 
have the `X` integer constant expression value. But I think that should be a 
follow-on patch that evaluates the totality of encodings, and also maybe 
contacts some IBM folks who did the `-fexec-charset` patches so they can also 
give over any additional encoding mappings they want.

Because Clang is open, anyone could add to it and that way people could have 
that kind of ability in C. iconv has a very full list of encodings, and you'd 
also need to define a resistant equality function similar to what's implemented 
in soasis/text 
(https://github.com/soasis/text/blob/main/include/ztd/text/detail/encoding_name.hpp#L120)
or in P1885 (http://wg21.link/p1885) so you 
can compare names in a consistent manner across platforms. After that equality 
you then yield the integer value, and then you'd go from there. It doesn't have 
to be standard, just compiler-specific.
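
A rough sketch of the "resistant equality, then yield an integer" idea follows. 
It is not the soasis/text or P1885 implementation: the function names, the set 
of ignored separator characters, and the numeric ID values are all assumptions 
made up for this example, and it needs C++14 for loops in `constexpr` 
functions.

```cpp
// ASCII-only lowercase, enough for encoding names.
constexpr char ascii_lower(char c) {
    return (c >= 'A' && c <= 'Z') ? static_cast<char>(c - 'A' + 'a') : c;
}

// Separators ignored during comparison (an assumption of this sketch).
constexpr bool is_skippable(char c) {
    return c == '-' || c == '_' || c == ' ' || c == '.';
}

// Loose comparison: "UTF-8", "utf8", and "Utf_8" all compare equal.
constexpr bool same_encoding_name(const char* a, const char* b) {
    for (;;) {
        while (is_skippable(*a)) ++a;
        while (is_skippable(*b)) ++b;
        if (*a == '\0' || *b == '\0') return *a == *b;
        if (ascii_lower(*a) != ascii_lower(*b)) return false;
        ++a;
        ++b;
    }
}

// Hypothetical IDs; a real table would be defined by the documentation.
enum encoding_id {
    encoding_unknown = 0,
    encoding_utf8 = 1,
    encoding_utf16 = 2,
    encoding_utf32 = 3
};

// Map a name to its integer constant, falling back to "unknown".
constexpr encoding_id encoding_id_of(const char* name) {
    return same_encoding_name(name, "UTF-8")    ? encoding_utf8
           : same_encoding_name(name, "UTF-16") ? encoding_utf16
           : same_encoding_name(name, "UTF-32") ? encoding_utf32
                                                : encoding_unknown;
}

static_assert(encoding_id_of("utf_8") == encoding_utf8, "loose match on UTF-8");
```

Because the result is an integer constant, it could be compared in contexts 
where a string cannot be.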

I'm not sure all of that belongs in this patch, though, and I think I'd wait 
for the other patches about `iconv` literal converters to drop before having 
the fullness of that conversation.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D100346/new/

https://reviews.llvm.org/D100346



[PATCH] D100346: [Clang] String Literal and Wide String Literal Encoding from the Preprocessor

2021-04-13 Thread ThePhD via Phabricator via cfe-commits
ThePhD updated this revision to Diff 337158.
ThePhD added a comment.

Change `NOTE:` to `FIXME:`.

Update the text for the documentation to be more explicit.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D100346/new/

https://reviews.llvm.org/D100346

Files:
  clang/docs/LanguageExtensions.rst
  clang/lib/Frontend/InitPreprocessor.cpp
  clang/test/Preprocessor/init-x86.c
  clang/test/Preprocessor/init.c


Index: clang/test/Preprocessor/init.c
===
--- clang/test/Preprocessor/init.c
+++ clang/test/Preprocessor/init.c
@@ -115,10 +115,12 @@
 // COMMON:#define __STDC__ 1
 // COMMON:#define __VERSION__ {{.*}}
 // COMMON:#define __clang__ 1
+// COMMON:#define __clang_literal_encoding__ {{.*}}
 // COMMON:#define __clang_major__ {{[0-9]+}}
 // COMMON:#define __clang_minor__ {{[0-9]+}}
 // COMMON:#define __clang_patchlevel__ {{[0-9]+}}
 // COMMON:#define __clang_version__ {{.*}}
+// COMMON:#define __clang_wide_literal_encoding__ {{.*}}
 // COMMON:#define __llvm__ 1
 //
 // RUN: %clang_cc1 -E -dM -triple=x86_64-pc-win32 < /dev/null | FileCheck -match-full-lines -check-prefix C-DEFAULT %s
@@ -1844,10 +1846,12 @@
 // WEBASSEMBLY-NOT:#define __WINT_UNSIGNED__
 // WEBASSEMBLY-NEXT:#define __WINT_WIDTH__ 32
 // WEBASSEMBLY-NEXT:#define __clang__ 1
+// WEBASSEMBLY-NEXT:#define __clang_literal_encoding__ {{.*}}
 // WEBASSEMBLY-NEXT:#define __clang_major__ {{.*}}
 // WEBASSEMBLY-NEXT:#define __clang_minor__ {{.*}}
 // WEBASSEMBLY-NEXT:#define __clang_patchlevel__ {{.*}}
 // WEBASSEMBLY-NEXT:#define __clang_version__ "{{.*}}"
+// WEBASSEMBLY-NEXT:#define __clang_wide_literal_encoding__ {{.*}}
 // WEBASSEMBLY-NEXT:#define __llvm__ 1
 // WEBASSEMBLY-NOT:#define __unix
 // WEBASSEMBLY-NOT:#define __unix__
Index: clang/test/Preprocessor/init-x86.c
===
--- clang/test/Preprocessor/init-x86.c
+++ clang/test/Preprocessor/init-x86.c
@@ -1306,10 +1306,12 @@
 // X86_64-CLOUDABI:#define __amd64 1
 // X86_64-CLOUDABI:#define __amd64__ 1
 // X86_64-CLOUDABI:#define __clang__ 1
+// X86_64-CLOUDABI:#define __clang_literal_encoding__ {{.*}}
 // X86_64-CLOUDABI:#define __clang_major__ {{.*}}
 // X86_64-CLOUDABI:#define __clang_minor__ {{.*}}
 // X86_64-CLOUDABI:#define __clang_patchlevel__ {{.*}}
 // X86_64-CLOUDABI:#define __clang_version__ {{.*}}
+// X86_64-CLOUDABI:#define __clang_wide_literal_encoding__ {{.*}}
 // X86_64-CLOUDABI:#define __llvm__ 1
 // X86_64-CLOUDABI:#define __x86_64 1
 // X86_64-CLOUDABI:#define __x86_64__ 1
Index: clang/lib/Frontend/InitPreprocessor.cpp
===
--- clang/lib/Frontend/InitPreprocessor.cpp
+++ clang/lib/Frontend/InitPreprocessor.cpp
@@ -778,6 +778,21 @@
     }
   }
 
+  // Macros to help identify the narrow and wide character sets
+  // FIXME: clang currently ignores -fexec-charset=. If this changes,
+  // then this may need to be updated.
+  Builder.defineMacro("__clang_literal_encoding__", "\"UTF-8\"");
+  if (TI.getTypeWidth(TI.getWCharType()) >= 32) {
+    // FIXME: 32-bit wchar_t signals UTF-32. This may change
+    // if -fwide-exec-charset= is ever supported.
+    Builder.defineMacro("__clang_wide_literal_encoding__", "\"UTF-32\"");
+  } else {
+    // FIXME: Less-than 32-bit wchar_t generally means UTF-16
+    // (e.g., Windows, 32-bit IBM). This may need to be
+    // updated if -fwide-exec-charset= is ever supported.
+    Builder.defineMacro("__clang_wide_literal_encoding__", "\"UTF-16\"");
+  }
+
   if (LangOpts.Optimize)
     Builder.defineMacro("__OPTIMIZE__");
   if (LangOpts.OptimizeSize)
Index: clang/docs/LanguageExtensions.rst
===
--- clang/docs/LanguageExtensions.rst
+++ clang/docs/LanguageExtensions.rst
@@ -383,6 +383,18 @@
   Defined to a string that captures the Clang marketing version, including the
   Subversion tag or revision number, e.g., "``1.5 (trunk 102332)``".
 
+``__clang_literal_encoding__``
+  Defined to a narrow string literal that represents the current encoding of
+  narrow string literals, e.g., ``"hello"``. This macro typically expands to
+  "UTF-8" (but may change in the future if the
+  ``-fexec-charset="Encoding-Name"`` option is implemented.)
+
+``__clang_wide_literal_encoding__``
+  Defined to a narrow string literal that represents the current encoding of
+  wide string literals, e.g., ``L"hello"``. This is typically "UTF-16" or
+  "UTF-32" (but may change in the future if the
+  ``-fwide-exec-charset="Encoding-Name"`` option is implemented.)
+
 .. _langext-vectors:
 
 Vectors and Extended Vectors



[PATCH] D100346: [Clang] String Literal and Wide String Literal Encoding from the Preprocessor

2021-04-12 Thread ThePhD via Phabricator via cfe-commits
ThePhD updated this revision to Diff 337035.
ThePhD marked an inline comment as done.
Herald added subscribers: aheejin, dschuff.

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D100346/new/

https://reviews.llvm.org/D100346

Files:
  clang/docs/LanguageExtensions.rst
  clang/lib/Frontend/InitPreprocessor.cpp
  clang/test/Preprocessor/init-x86.c
  clang/test/Preprocessor/init.c


Index: clang/test/Preprocessor/init.c
===
--- clang/test/Preprocessor/init.c
+++ clang/test/Preprocessor/init.c
@@ -115,10 +115,12 @@
 // COMMON:#define __STDC__ 1
 // COMMON:#define __VERSION__ {{.*}}
 // COMMON:#define __clang__ 1
+// COMMON:#define __clang_literal_encoding__ {{.*}}
 // COMMON:#define __clang_major__ {{[0-9]+}}
 // COMMON:#define __clang_minor__ {{[0-9]+}}
 // COMMON:#define __clang_patchlevel__ {{[0-9]+}}
 // COMMON:#define __clang_version__ {{.*}}
+// COMMON:#define __clang_wide_literal_encoding__ {{.*}}
 // COMMON:#define __llvm__ 1
 //
 // RUN: %clang_cc1 -E -dM -triple=x86_64-pc-win32 < /dev/null | FileCheck -match-full-lines -check-prefix C-DEFAULT %s
@@ -1844,10 +1846,12 @@
 // WEBASSEMBLY-NOT:#define __WINT_UNSIGNED__
 // WEBASSEMBLY-NEXT:#define __WINT_WIDTH__ 32
 // WEBASSEMBLY-NEXT:#define __clang__ 1
+// WEBASSEMBLY-NEXT:#define __clang_literal_encoding__ {{.*}}
 // WEBASSEMBLY-NEXT:#define __clang_major__ {{.*}}
 // WEBASSEMBLY-NEXT:#define __clang_minor__ {{.*}}
 // WEBASSEMBLY-NEXT:#define __clang_patchlevel__ {{.*}}
 // WEBASSEMBLY-NEXT:#define __clang_version__ "{{.*}}"
+// WEBASSEMBLY-NEXT:#define __clang_wide_literal_encoding__ {{.*}}
 // WEBASSEMBLY-NEXT:#define __llvm__ 1
 // WEBASSEMBLY-NOT:#define __unix
 // WEBASSEMBLY-NOT:#define __unix__
Index: clang/test/Preprocessor/init-x86.c
===
--- clang/test/Preprocessor/init-x86.c
+++ clang/test/Preprocessor/init-x86.c
@@ -1306,10 +1306,12 @@
 // X86_64-CLOUDABI:#define __amd64 1
 // X86_64-CLOUDABI:#define __amd64__ 1
 // X86_64-CLOUDABI:#define __clang__ 1
+// X86_64-CLOUDABI:#define __clang_literal_encoding__ {{.*}}
 // X86_64-CLOUDABI:#define __clang_major__ {{.*}}
 // X86_64-CLOUDABI:#define __clang_minor__ {{.*}}
 // X86_64-CLOUDABI:#define __clang_patchlevel__ {{.*}}
 // X86_64-CLOUDABI:#define __clang_version__ {{.*}}
+// X86_64-CLOUDABI:#define __clang_wide_literal_encoding__ {{.*}}
 // X86_64-CLOUDABI:#define __llvm__ 1
 // X86_64-CLOUDABI:#define __x86_64 1
 // X86_64-CLOUDABI:#define __x86_64__ 1
Index: clang/lib/Frontend/InitPreprocessor.cpp
===
--- clang/lib/Frontend/InitPreprocessor.cpp
+++ clang/lib/Frontend/InitPreprocessor.cpp
@@ -778,6 +778,21 @@
     }
   }
 
+  // macros to help identify the narrow and wide character sets
+  // NOTE: clang currently ignores -fexec-charset=. If this changes,
+  // then this may need to be updated.
+  Builder.defineMacro("__clang_literal_encoding__", "\"UTF-8\"");
+  if (TI.getTypeWidth(TI.getWCharType()) >= 32) {
+    // NOTE: 32-bit wchar_t signals UTF-32. This may change
+    // if -fwide-exec-charset= is ever supported.
+    Builder.defineMacro("__clang_wide_literal_encoding__", "\"UTF-32\"");
+  } else {
+    // NOTE: Less-than 32-bit wchar_t generally means UTF-16
+    // (e.g., Windows, 32-bit IBM). This may need to be
+    // updated if -fwide-exec-charset= is ever supported.
+    Builder.defineMacro("__clang_wide_literal_encoding__", "\"UTF-16\"");
+  }
+
   if (LangOpts.Optimize)
     Builder.defineMacro("__OPTIMIZE__");
   if (LangOpts.OptimizeSize)
Index: clang/docs/LanguageExtensions.rst
===
--- clang/docs/LanguageExtensions.rst
+++ clang/docs/LanguageExtensions.rst
@@ -383,6 +383,17 @@
   Defined to a string that captures the Clang marketing version, including the
   Subversion tag or revision number, e.g., "``1.5 (trunk 102332)``".
 
+``__clang_literal_encoding__``
+  Defined to a string that represents the current encoding of string literals,
+  e.g., ``"hello"``. This is typically "UTF-8" (but may change in the future
+  if the ``-fexec-charset="Encoding-Name"`` option is implemented.)
+
+``__clang_wide_literal_encoding__``
+  Defined to a string that represents the current encoding of wide string
+  literals, e.g., ``L"hello"``. This is typically "UTF-16" or "UTF-32"
+  (but may change in the future if the
+  ``-fwide-exec-charset="Encoding-Name"`` option is implemented.)
+
 .. _langext-vectors:
 
 Vectors and Extended Vectors



[PATCH] D100346: [Clang] String Literal and Wide String Literal Encoding from the Preprocessor

2021-04-12 Thread ThePhD via Phabricator via cfe-commits
ThePhD marked an inline comment as done.
ThePhD added a comment.

In D100346#2684705, @rsmith wrote:

> Exposing this information seems fine to me. I think it'd be more useful to 
> expose it in a way the preprocessor can inspect ...

The reason it's done like this is because the linked patches are going to use 
iconv, and iconv (while maintaining a fixed set of encodings it can support 
with canonical names that can be passed to its tools) can have encodings of 
arbitrary name added to it. If the eventual `-f(wide-?)exec-charset` options 
just pass-through to iconv, it's fair to say any translation we do will become 
out of date or unsuitable on a given platform. That's why I just want to expose 
the name! I did the same thing for GCC's functionality.

Microsoft is a bit luckier in that they have a closed set of encodings and a 
stable mapping between Name <-> Code Page Identifier, but they don't even cover 
the same list of encodings as iconv covers in its default distribution. 
Corentin Jabot's https://wg21.link/p1885 gives a canonical mapping, but again 
it's a closed-set of mappings and iconv is inherently extensible (at the "I 
rebuilt the library and installed it on my OS" level)!

The string literals work for people because, despite not being 
preprocessor-comparable, they can be manipulated at compile-time and switched 
on in the usual ways at `constexpr` time. See usages in:

https://github.com/soasis/text/blob/main/include/ztd/text/detail/encoding_name.hpp#L198
https://github.com/soasis/text/blob/main/include/ztd/text/literal.hpp#L54

, which can be used at compile-time like:

https://github.com/soasis/text/blob/main/tests/basic_compile_time/source/validate_code_points.cpp#L45
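
For readers who don't want to follow the links, here is a much smaller sketch 
of the same idea. It is not the soasis/text code: `WIDE_LITERAL_ENCODING`, 
`str_equal`, and the other names are invented for this example, `"unknown"` is 
a placeholder for compilers without the macro, and C++14 is assumed for the 
loop inside a `constexpr` function.

```cpp
#include <cstddef>

// Wrap the compiler-specific macro so the sketch also builds elsewhere.
#if defined(__clang_wide_literal_encoding__)
#define WIDE_LITERAL_ENCODING __clang_wide_literal_encoding__
#else
#define WIDE_LITERAL_ENCODING "unknown"  // placeholder chosen for this sketch
#endif

// Exact string comparison usable in constant expressions.
constexpr bool str_equal(const char* a, const char* b) {
    while (*a != '\0' && *a == *b) {
        ++a;
        ++b;
    }
    return *a == *b;
}

// The name cannot be tested with #if, but it can be tested at constexpr time.
constexpr bool wide_is_utf32 = str_equal(WIDE_LITERAL_ENCODING, "UTF-32");
constexpr bool wide_is_utf16 = str_equal(WIDE_LITERAL_ENCODING, "UTF-16");

// U+1F600 takes two code units in UTF-16 and one in UTF-32.
constexpr std::size_t expected_units = wide_is_utf16 ? 2 : 1;

// Only checked when the encoding name is one this sketch recognizes.
static_assert(!(wide_is_utf32 || wide_is_utf16) ||
                  sizeof(L"\U0001F600") / sizeof(wchar_t) - 1 == expected_units,
              "wide literal length should match the reported encoding");
```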




Comment at: clang/test/Preprocessor/init.c:122-123
 // COMMON:#define __clang_version__ {{.*}}
+// COMMON:#define __clang_literal_encoding__ {{.*}}
+// COMMON:#define __clang_wide_literal_encoding__ {{.*}}
 // COMMON:#define __llvm__ 1

rsmith wrote:
> Please add documentation to `docs/LanguageExtensions.rst` for these.
Done and dusted!


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D100346/new/

https://reviews.llvm.org/D100346



[PATCH] D100346: [Clang] String Literal and Wide String Literal Encoding from the Preprocessor

2021-04-12 Thread ThePhD via Phabricator via cfe-commits
ThePhD updated this revision to Diff 337024.
ThePhD added a comment.

The tests check for macros in strict byte-value "alphabetical" order. We also 
need documentation, as was suggested!!
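
For anyone wondering where the new lines have to go, the ordering can be 
sketched as a plain byte-by-byte comparison. `byte_less` is a made-up helper 
for this note, not something from the test suite, and it assumes C++14 for the 
loop in a `constexpr` function.

```cpp
// Compare two macro names by raw byte value, the way the test file is sorted.
constexpr bool byte_less(const char* a, const char* b) {
    while (*a != '\0' && *a == *b) {
        ++a;
        ++b;
    }
    return static_cast<unsigned char>(*a) < static_cast<unsigned char>(*b);
}

// Uppercase letters sort before lowercase ones.
static_assert(byte_less("__STDC__", "__clang__"), "'S' (0x53) < 'c' (0x63)");
// '_' (0x5F) sorts before every lowercase letter.
static_assert(byte_less("__clang__", "__clang_literal_encoding__"), "'_' < 'l'");
// The narrow macro lands just before __clang_major__ ...
static_assert(byte_less("__clang_literal_encoding__", "__clang_major__"), "'l' < 'm'");
// ... and the wide macro lands just after __clang_version__.
static_assert(byte_less("__clang_version__", "__clang_wide_literal_encoding__"), "'v' < 'w'");
```

This is why `__clang_literal_encoding__` cannot sit after 
`__clang_version__` in the expected output.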


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D100346/new/

https://reviews.llvm.org/D100346

Files:
  clang/docs/LanguageExtensions.rst
  clang/lib/Frontend/InitPreprocessor.cpp
  clang/test/Preprocessor/init.c


Index: clang/test/Preprocessor/init.c
===
--- clang/test/Preprocessor/init.c
+++ clang/test/Preprocessor/init.c
@@ -115,10 +115,12 @@
 // COMMON:#define __STDC__ 1
 // COMMON:#define __VERSION__ {{.*}}
 // COMMON:#define __clang__ 1
+// COMMON:#define __clang_literal_encoding__ {{.*}}
 // COMMON:#define __clang_major__ {{[0-9]+}}
 // COMMON:#define __clang_minor__ {{[0-9]+}}
 // COMMON:#define __clang_patchlevel__ {{[0-9]+}}
 // COMMON:#define __clang_version__ {{.*}}
+// COMMON:#define __clang_wide_literal_encoding__ {{.*}}
 // COMMON:#define __llvm__ 1
 //
 // RUN: %clang_cc1 -E -dM -triple=x86_64-pc-win32 < /dev/null | FileCheck -match-full-lines -check-prefix C-DEFAULT %s
Index: clang/lib/Frontend/InitPreprocessor.cpp
===
--- clang/lib/Frontend/InitPreprocessor.cpp
+++ clang/lib/Frontend/InitPreprocessor.cpp
@@ -778,6 +778,21 @@
     }
   }
 
+  // macros to help identify the narrow and wide character sets
+  // NOTE: clang currently ignores -fexec-charset=. If this changes,
+  // then this may need to be updated.
+  Builder.defineMacro("__clang_literal_encoding__", "\"UTF-8\"");
+  if (TI.getTypeWidth(TI.getWCharType()) >= 32) {
+    // NOTE: 32-bit wchar_t signals UTF-32. This may change
+    // if -fwide-exec-charset= is ever supported.
+    Builder.defineMacro("__clang_wide_literal_encoding__", "\"UTF-32\"");
+  } else {
+    // NOTE: Less-than 32-bit wchar_t generally means UTF-16
+    // (e.g., Windows, 32-bit IBM). This may need to be
+    // updated if -fwide-exec-charset= is ever supported.
+    Builder.defineMacro("__clang_wide_literal_encoding__", "\"UTF-16\"");
+  }
+
   if (LangOpts.Optimize)
     Builder.defineMacro("__OPTIMIZE__");
   if (LangOpts.OptimizeSize)
Index: clang/docs/LanguageExtensions.rst
===
--- clang/docs/LanguageExtensions.rst
+++ clang/docs/LanguageExtensions.rst
@@ -383,6 +383,17 @@
   Defined to a string that captures the Clang marketing version, including the
   Subversion tag or revision number, e.g., "``1.5 (trunk 102332)``".
 
+``__clang_literal_encoding__``
+  Defined to a string that represents the current encoding of string literals,
+  e.g., ``"hello"``. This is typically "UTF-8" (but may change in the future
+  if the ``-fexec-charset="Encoding-Name"`` option is implemented.)
+
+``__clang_wide_literal_encoding__``
+  Defined to a string that represents the current encoding of wide string
+  literals, e.g., ``L"hello"``. This is typically "UTF-16" or "UTF-32"
+  (but may change in the future if the
+  ``-fwide-exec-charset="Encoding-Name"`` option is implemented.)
+
 .. _langext-vectors:
 
 Vectors and Extended Vectors



[PATCH] D100346: [Clang] String Literal and Wide String Literal Encoding from the Preprocessor

2021-04-12 Thread ThePhD via Phabricator via cfe-commits
ThePhD updated this revision to Diff 337008.
ThePhD added a comment.

Fixes formatting derps in the pre-linter check.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D100346/new/

https://reviews.llvm.org/D100346

Files:
  clang/lib/Frontend/InitPreprocessor.cpp
  clang/test/Preprocessor/init.c


Index: clang/test/Preprocessor/init.c
===
--- clang/test/Preprocessor/init.c
+++ clang/test/Preprocessor/init.c
@@ -119,6 +119,8 @@
 // COMMON:#define __clang_minor__ {{[0-9]+}}
 // COMMON:#define __clang_patchlevel__ {{[0-9]+}}
 // COMMON:#define __clang_version__ {{.*}}
+// COMMON:#define __clang_literal_encoding__ {{.*}}
+// COMMON:#define __clang_wide_literal_encoding__ {{.*}}
 // COMMON:#define __llvm__ 1
 //
 // RUN: %clang_cc1 -E -dM -triple=x86_64-pc-win32 < /dev/null | FileCheck -match-full-lines -check-prefix C-DEFAULT %s
Index: clang/lib/Frontend/InitPreprocessor.cpp
===
--- clang/lib/Frontend/InitPreprocessor.cpp
+++ clang/lib/Frontend/InitPreprocessor.cpp
@@ -778,6 +778,21 @@
     }
   }
 
+  // macros to help identify the narrow and wide character sets
+  // NOTE: clang currently ignores -fexec-charset=. If this changes,
+  // then this code may need to change.
+  Builder.defineMacro("__clang_literal_encoding__", "\"UTF-8\"");
+  if (TI.getTypeWidth(TI.getWCharType()) >= 32) {
+    // NOTE: 32-bit wchar_t signals UTF-32. This may change
+    // if -fwide-exec-charset= is ever supported.
+    Builder.defineMacro("__clang_wide_literal_encoding__", "\"UTF-32\"");
+  } else {
+    // NOTE: Less-than 32-bit wchar_t generally means UTF-16
+    // (e.g., Windows, 32-bit IBM). This may change if
+    // -fwide-exec-charset= is ever supported.
+    Builder.defineMacro("__clang_wide_literal_encoding__", "\"UTF-16\"");
+  }
+
   if (LangOpts.Optimize)
     Builder.defineMacro("__OPTIMIZE__");
   if (LangOpts.OptimizeSize)




[PATCH] D100346: [Clang] String Literal and Wide String Literal Encoding from the Preprocessor

2021-04-12 Thread ThePhD via Phabricator via cfe-commits
ThePhD created this revision.
ThePhD added a reviewer: aaron.ballman.
ThePhD added a project: clang.
ThePhD requested review of this revision.
Herald added a subscriber: cfe-commits.

**//Short version://**

Please let us know the encoding that the compiler chooses for its 
implementation-defined, non-Unicode string literals so we can support users 
properly in cross-platform code.

**//Prior Art://**

A similar feature has already been patch-reviewed and merged into GCC trunk (I 
implemented it), ready to go out the door with GCC 11. It is compiler-specific, 
and that is intentional. It solved a user's bug report there.

I also put in a Feature Request for MSVC. It is also recommended to be 
compiler-specific. They are currently suffering the very interesting 
consequences of not handling it sooner: standard library developers are having 
to come up with library-based workarounds (and praying) to determine the 
charset of their string literals for their std::fmt implementation, rather than 
having a compiler macro for it: https://github.com/microsoft/STL/pull/1824 | 
https://github.com/microsoft/STL/issues/1576 | 
https://developercommunity.visualstudio.com/content/idea/1160821/-compiler-feature-macro-for-narrow-literal-foo-enc.html

The C++ Standards Committee's Study Group 16 (Unicode) approved a paper, 
currently under LEWG review, for determining the string literal and wide string 
literal encoding at both compile time and run time. This patch prepares for the 
compile-time portion of that detection, for which Corentin Jabot has already 
created a proof-of-concept for Clang, GCC and MSVC: https://wg21.link/p1885

I missed the 12 release, so I hope this makes it for 13.

**//Long version://**

C and C++'s string literals, both "narrow"/"multibyte" string literals (e.g. 
"foo") and "wide" string literals (e.g. L"foo"), have an associated encoding 
defined by the implementation. Recently, a review has kicked off for adding new 
"execution encodings" (e.g., string literal encodings) to Clang's preprocessor 
and, subsequently, to its C and C++ frontends.

I left a comment for it to be taken care of but I'm certain the comment was 
drowned out by other contributions in both the -fexec-charset addition and the 
"Add support for iconv encodings and other things" patch review at:

iconv Literal Converter: https://reviews.llvm.org/D88741
-fexec-charset Enabling Patch: https://reviews.llvm.org/D93031

Whether or not this gets updated in the related (but not required) patches, 
this is necessary to successfully inform the end user on a Clang machine what 
the wide string literal and the narrow string literal encodings are.
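
As a rough illustration of how user code might consume these macros, here is a minimal sketch. Only `__clang_literal_encoding__` and `__clang_wide_literal_encoding__` come from this patch. The `__GNUC_EXECUTION_CHARSET_NAME` / `__GNUC_WIDE_EXECUTION_CHARSET_NAME` fallbacks are the GCC 11 counterparts mentioned under Prior Art. The helper names and the `"unknown"` fallback are hypothetical, chosen purely for the example.

```cpp
#include <cstring>

// Pick whichever compiler-specific macro is available. The wrapper macro
// names below are hypothetical and exist only for this sketch.
#if defined(__clang_literal_encoding__)
#  define NARROW_LITERAL_ENCODING __clang_literal_encoding__
#elif defined(__GNUC_EXECUTION_CHARSET_NAME)
#  define NARROW_LITERAL_ENCODING __GNUC_EXECUTION_CHARSET_NAME
#else
#  define NARROW_LITERAL_ENCODING "unknown"  // assumption: no macro available
#endif

#if defined(__clang_wide_literal_encoding__)
#  define WIDE_LITERAL_ENCODING __clang_wide_literal_encoding__
#elif defined(__GNUC_WIDE_EXECUTION_CHARSET_NAME)
#  define WIDE_LITERAL_ENCODING __GNUC_WIDE_EXECUTION_CHARSET_NAME
#else
#  define WIDE_LITERAL_ENCODING "unknown"    // assumption: no macro available
#endif

// Name of the encoding used for "foo"-style literals, as reported by the compiler.
inline const char *narrow_literal_encoding() { return NARROW_LITERAL_ENCODING; }

// Name of the encoding used for L"foo"-style literals, as reported by the compiler.
inline const char *wide_literal_encoding() { return WIDE_LITERAL_ENCODING; }

// True only when the compiler reports exactly "UTF-8" for narrow literals.
inline bool narrow_literals_are_utf8() {
  return std::strcmp(narrow_literal_encoding(), "UTF-8") == 0;
}
```

A library could branch on `narrow_literals_are_utf8()` to decide whether its string literals need transcoding before being passed to a UTF-8 API. The exact spelling of the encoding names is up to each compiler, so a real implementation would likely need to normalize them.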

We use the size of the wide character type (`wchar_t`) to inform our decision, 
as Windows and other old-style 32-bit IBM machines use UTF-16, while most Linux 
distributions use UTF-32. (This is not the case for IBM and other machines of 
specific make in China, Japan, and Korea, but I suspect Clang has not been 
ported to work on such machines.)

Knowing the literal and execution encodings is also of great importance to the 
C++ Standard Committee in general, as they have work coming down the pipeline 
that has been generally approved by SG16 and favorably reviewed by LEWG that 
will make use of such functionality soon, as mentioned in the Prior Art section 
above: https://wg21.link/p1885

Please consider making life easier for everyone who cares about portable 
encodings, and please consider making the work on `-fexec-charset` and 
`-fwide-exec-charset`.


Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D100346

Files:
  clang/lib/Frontend/InitPreprocessor.cpp
  clang/test/Preprocessor/init.c


Index: clang/test/Preprocessor/init.c
===
--- clang/test/Preprocessor/init.c
+++ clang/test/Preprocessor/init.c
@@ -119,6 +119,8 @@
 // COMMON:#define __clang_minor__ {{[0-9]+}}
 // COMMON:#define __clang_patchlevel__ {{[0-9]+}}
 // COMMON:#define __clang_version__ {{.*}}
+// COMMON:#define __clang_literal_encoding__ {{.*}}
+// COMMON:#define __clang_wide_literal_encoding__ {{.*}}
 // COMMON:#define __llvm__ 1
 //
 // RUN: %clang_cc1 -E -dM -triple=x86_64-pc-win32 < /dev/null | FileCheck -match-full-lines -check-prefix C-DEFAULT %s
Index: clang/lib/Frontend/InitPreprocessor.cpp
===
--- clang/lib/Frontend/InitPreprocessor.cpp
+++ clang/lib/Frontend/InitPreprocessor.cpp
@@ -778,6 +778,20 @@
 }
   }
 
+  // macros to help identify the narrow and wide character sets
+  // NOTE: clang currently ignores -fexec-charset=. If this changes,
+  // then this code may need to change.
+  Builder.defineMacro("__clang_literal_encoding__", "\"UTF-8\"");
+  if (TI.getTypeWidth(TI.getWCharType()) >= 32) {
+    // NOTE: 32-bit wchar_t signals UTF-32. This may change if
+    // -fwide-exec-charset= is ever supported.
+    Builder.defineMacro("__clang_wide_literal_encoding__", "\"UTF-32\"");
+