[PATCH] D103426: Clang: Extend format string checking to wprintf/wscanf

2021-07-26 Thread Tom Honermann via Phabricator via cfe-commits
tahonermann added inline comments.



Comment at: clang/include/clang/AST/Expr.h:1846-1871
+  std::string getStringAsChar() const {
+assert(getCharByteWidth() == 1 &&
+   "This function is used in places that assume strings use char");
+return std::string(getTrailingObjects(), getTrailingObjects() 
+ getByteLength());
+  }
+  
+  std::u16string getStringAsChar16() const {

MarcusJohnson91 wrote:
> cor3ntin wrote:
> > aaron.ballman wrote:
> > > One potential issue to this is that in C++, these types are defined to be 
> > > UTF-16 and UTF-32, whereas in C, that isn't always the case. Currently, 
> > > Clang defines `__STDC_UTF_16__` and `__STDC_UTF_32__` on all targets, but 
> > > we may need to add some sanity checks to catch if a target overrides that 
> > > behavior. Currently, all the targets in Clang are keeping that promise, 
> > > but I have no idea what shenanigans downstream users get up to and 
> > > whether their targets remove the macro definition or define it to `0` 
> > > instead of `1`.
> > Is it possible that the host and target wchar_t have a different size here?
> I've honestly been wondering how Clang handled that, in the codebase vs at 
> runtime myself for a while.
WG14 N2728 (http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2728.htm) proposes 
specifying that `char16_t` and `char32_t` string literals are UTF-16 and UTF-32 
encoded respectively.  SG16 previously tried to identify any implementations 
that use a different encoding for these and was unable to find one.  I think it 
is safe to only support these encodings for `u` and `U` prefixed string 
literals.



Comment at: clang/lib/AST/ScanfFormatString.cpp:344-347
+case LengthModifier::AsUTF16:
+  return ArgType(ArgType::Char16Ty);
+case LengthModifier::AsUTF32:
+  return ArgType(ArgType::Char32Ty);

Should these additions not include a `PtrTo` wrapper as is done for other types?



Comment at: clang/lib/AST/Type.cpp:1980-2002
+bool Type::isChar16Type(const LangOptions ) const {
   if (const auto *BT = dyn_cast(CanonicalType))
-return BT->getKind() == BuiltinType::Char16;
-  return false;
-}
-
-bool Type::isChar32Type() const {
-  if (const auto *BT = dyn_cast(CanonicalType))
-return BT->getKind() == BuiltinType::Char32;
-  return false;
-}
-
-/// Determine whether this type is any of the built-in character
-/// types.
-bool Type::isAnyCharacterType() const {
-  const auto *BT = dyn_cast(CanonicalType);
-  if (!BT) return false;
-  switch (BT->getKind()) {
-  default: return false;
-  case BuiltinType::Char_U:
-  case BuiltinType::UChar:
-  case BuiltinType::WChar_U:
-  case BuiltinType::Char8:
-  case BuiltinType::Char16:
-  case BuiltinType::Char32:
-  case BuiltinType::Char_S:
-  case BuiltinType::SChar:
-  case BuiltinType::WChar_S:
-return true;
+if (BT->getKind() == BuiltinType::Char16)
+  return true;
+  if (!LangOpts.CPlusPlus) {
+return isType("char16_t");
   }

MarcusJohnson91 wrote:
> aaron.ballman wrote:
> > MarcusJohnson91 wrote:
> > > aaron.ballman wrote:
> > > > If I understand properly, one issue is that `char16_t` is defined by C 
> > > > to be *the same type* as `uint_least16_t`, which itself is a typedef to 
> > > > some other integer type. So we can't make `char16_t` be a distinct type 
> > > > in C, it has to be a typedef to a typedef to an integer type.
> > > > 
> > > > One concern I have about the approach in this patch is that it makes 
> > > > the type system a bit more convoluted. This type is *not* really a 
> > > > character type in C, it's a typedef type that playacts as a character 
> > > > type. I think this is a pretty fundamental change to the type system in 
> > > > the compiler in some ways because this query is no longer about the 
> > > > canonical type but about the semantics of how the type is expected to 
> > > > be used.
> > > > 
> > > > I'd definitely like to hear what @rsmith thinks of this approach.
> > > I see your point, but I'm not sure how else we could get it through the 
> > > type checking system without doing it this way?
> > > 
> > > I tried to be careful to only allow the actual character typedefs through 
> > > by making sure char16_t or char32_t is in the typedef chain, and only in 
> > > C mode.
> > > I see your point, but I'm not sure how else we could get it through the 
> > > type checking system without doing it this way?
> > 
> > I am thinking that rather than doing this work as part of the `Type` 
> > object, the extra work can be done from the format string checker. This 
> > keeps the confusion about types localized to the part of the code that 
> > needs to care about it, rather than pushing it onto all consumers of the 
> > type system.
> H, that's a good idea, but I'm not sure how to do that, let me think on 
> it for a bit.
What does the format string checker do 

[PATCH] D103426: Clang: Extend format string checking to wprintf/wscanf

2021-07-25 Thread Tom Honermann via Phabricator via cfe-commits
tahonermann added inline comments.



Comment at: clang/lib/AST/Type.cpp:1962
 
+bool Type::isType(const std::string TypeName) const {
+  QualType Desugar = this->getLocallyUnqualifiedSingleStepDesugaredType();

MarcusJohnson91 wrote:
> efriedma wrote:
> > MarcusJohnson91 wrote:
> > > aaron.ballman wrote:
> > > > Oh, I see now that this is doing a name comparison against the type -- 
> > > > that's not a good API in general because it's *really* hard to guess at 
> > > > what the type will come out as textually. e.g., `class` and `struct` 
> > > > keywords are interchangeable in C++, C sometimes gets confused with 
> > > > `bool` vs `_Bool`, template arguments sometimes matter, nested name 
> > > > specifiers, etc. Callers of this API will have to guess at these 
> > > > details and the printing of the type may change over time (e.g., C may 
> > > > switch from `_Bool` to `bool` and then code calling `isType("_Bool")` 
> > > > may react poorly to the change).
> > > > 
> > > > I think we need to avoid this sort of API on `Type`.
> > > I see your point, I reverted the behavior back to doing the desugaring in 
> > > just isChar16Type and isChar32Type
> > I'm not convinced we should be looking at sugar even in 
> > isChar16Type/isChar32Type/isAnyCharacterType.  That seems like a great way 
> > to end up with subtle bugs that only manifest when someone uses the wrong 
> > typedef.
> > 
> > Where is the distinction between the value `(uint32_t)1` vs. `(char32_t)1` 
> > relevant for C, anyway?
> char32_t is a typedef, not a builtin type in C.
> 
> the underlying type is uint_least32_t, which is usually another typedef to 
> int.
> 
> in order for char32_t to be accepted in C mode, we have to know that it is a 
> string type and not just some random array, so I'm checking the sugar to see 
> if char32_t appears in the typedef chain.
> 
> 
You can't, in general, count on typedefs being present.  For example, `U"text"` 
won't have `char32_t` present in its type; the `char32_t` typedef definition 
won't even be present if `uchar.h` is not included.  All you can be 
(reasonably) assured of is that UTF-16 and UTF-32 literals will have a type 
that matches the underlying type of the `char16_t` and `char32_t` typedef 
definition (this can be violated for at least some compilers if you try hard, 
but that reflects an invalid configuration).


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D103426/new/

https://reviews.llvm.org/D103426

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D103426: Clang: Extend format string checking to wprintf/wscanf

2021-07-25 Thread Marcus Johnson via Phabricator via cfe-commits
MarcusJohnson91 marked 5 inline comments as done.
MarcusJohnson91 added inline comments.



Comment at: clang/lib/AST/Expr.cpp:1091
+  if (llvm::convertUTF32ToUTF8String(AR, Output)) {
+CString = new char[Output.size() + 1];
+memcpy(CString, Output.c_str(), Output.size());

efriedma wrote:
> This leaks memory.
> 
> This function should return either a StringRef to memory that's part of AST 
> object, or an std::string.
I've switched over to using getStringAsChar and doing the conversion there 
instead of in getStrDataAsChar, on the other patch. (just going through this 
review to see if I missed any feedback)



Comment at: clang/lib/AST/Type.cpp:1962
 
+bool Type::isType(const std::string TypeName) const {
+  QualType Desugar = this->getLocallyUnqualifiedSingleStepDesugaredType();

efriedma wrote:
> MarcusJohnson91 wrote:
> > aaron.ballman wrote:
> > > Oh, I see now that this is doing a name comparison against the type -- 
> > > that's not a good API in general because it's *really* hard to guess at 
> > > what the type will come out as textually. e.g., `class` and `struct` 
> > > keywords are interchangeable in C++, C sometimes gets confused with 
> > > `bool` vs `_Bool`, template arguments sometimes matter, nested name 
> > > specifiers, etc. Callers of this API will have to guess at these details 
> > > and the printing of the type may change over time (e.g., C may switch 
> > > from `_Bool` to `bool` and then code calling `isType("_Bool")` may react 
> > > poorly to the change).
> > > 
> > > I think we need to avoid this sort of API on `Type`.
> > I see your point, I reverted the behavior back to doing the desugaring in 
> > just isChar16Type and isChar32Type
> I'm not convinced we should be looking at sugar even in 
> isChar16Type/isChar32Type/isAnyCharacterType.  That seems like a great way to 
> end up with subtle bugs that only manifest when someone uses the wrong 
> typedef.
> 
> Where is the distinction between the value `(uint32_t)1` vs. `(char32_t)1` 
> relevant for C, anyway?
char32_t is a typedef, not a builtin type in C.

the underlying type is uint_least32_t, which is usually another typedef to int.

in order for char32_t to be accepted in C mode, we have to know that it is a 
string type and not just some random array, so I'm checking the sugar to see if 
char32_t appears in the typedef chain.




CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D103426/new/

https://reviews.llvm.org/D103426

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D103426: Clang: Extend format string checking to wprintf/wscanf

2021-07-24 Thread Eli Friedman via Phabricator via cfe-commits
efriedma added a comment.

This should split into three patches:

1. The ConvertUTFWrapper patch.
2. The patch to add format warnings for wprintf/wscanf/etc.
3. The patch to add the %l16/%l32 format modifiers.




Comment at: clang/lib/AST/Expr.cpp:1091
+  if (llvm::convertUTF32ToUTF8String(AR, Output)) {
+CString = new char[Output.size() + 1];
+memcpy(CString, Output.c_str(), Output.size());

This leaks memory.

This function should return either a StringRef to memory that's part of AST 
object, or an std::string.



Comment at: clang/lib/AST/OSLog.cpp:206
   const StringLiteral *Lit = 
cast(StringArg->IgnoreParenCasts());
-  assert(Lit && (Lit->isAscii() || Lit->isUTF8()));
-  StringRef Data = Lit->getString();
+  assert(Lit);
+  std::string String(Lit->getStrDataAsChar());

Why are you changing this code?



Comment at: clang/lib/AST/Type.cpp:1962
 
+bool Type::isType(const std::string TypeName) const {
+  QualType Desugar = this->getLocallyUnqualifiedSingleStepDesugaredType();

MarcusJohnson91 wrote:
> aaron.ballman wrote:
> > Oh, I see now that this is doing a name comparison against the type -- 
> > that's not a good API in general because it's *really* hard to guess at 
> > what the type will come out as textually. e.g., `class` and `struct` 
> > keywords are interchangeable in C++, C sometimes gets confused with `bool` 
> > vs `_Bool`, template arguments sometimes matter, nested name specifiers, 
> > etc. Callers of this API will have to guess at these details and the 
> > printing of the type may change over time (e.g., C may switch from `_Bool` 
> > to `bool` and then code calling `isType("_Bool")` may react poorly to the 
> > change).
> > 
> > I think we need to avoid this sort of API on `Type`.
> I see your point, I reverted the behavior back to doing the desugaring in 
> just isChar16Type and isChar32Type
I'm not convinced we should be looking at sugar even in 
isChar16Type/isChar32Type/isAnyCharacterType.  That seems like a great way to 
end up with subtle bugs that only manifest when someone uses the wrong typedef.

Where is the distinction between the value `(uint32_t)1` vs. `(char32_t)1` 
relevant for C, anyway?



Comment at: clang/lib/Sema/SemaChecking.cpp:103
+#include 
+#include 
 

Stray includes?



Comment at: clang/test/Sema/format-strings-non-iso.c:6
+
+int wprintf(const char *restrict, ...);
+int wscanf(const char  *restrict, ...);

This declaration is wrong?  Does that not cause a warning somewhere?



Comment at: llvm/lib/Support/ConvertUTFWrapper.cpp:148
+  // Error out on an uneven byte count.
+  if (SrcBytes.size() % 2)
+return false;

sizeof(UTF32)?


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D103426/new/

https://reviews.llvm.org/D103426

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D103426: Clang: Extend format string checking to wprintf/wscanf

2021-07-24 Thread Marcus Johnson via Phabricator via cfe-commits
MarcusJohnson91 updated this revision to Diff 361466.
MarcusJohnson91 added a comment.

Full context diff after squashing all the commits together


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D103426/new/

https://reviews.llvm.org/D103426

Files:
  clang-tools-extra/clang-tidy/boost/UseToStringCheck.cpp
  clang-tools-extra/clang-tidy/misc/ThrowByValueCatchByReferenceCheck.cpp
  clang/include/clang/AST/ASTContext.h
  clang/include/clang/AST/Expr.h
  clang/include/clang/AST/FormatString.h
  clang/include/clang/AST/Type.h
  clang/include/clang/ASTMatchers/ASTMatchers.h
  clang/include/clang/Basic/Builtins.def
  clang/include/clang/Basic/DiagnosticSemaKinds.td
  clang/lib/AST/Expr.cpp
  clang/lib/AST/ExprCXX.cpp
  clang/lib/AST/FormatString.cpp
  clang/lib/AST/OSLog.cpp
  clang/lib/AST/PrintfFormatString.cpp
  clang/lib/AST/ScanfFormatString.cpp
  clang/lib/AST/TemplateBase.cpp
  clang/lib/AST/Type.cpp
  clang/lib/Lex/LiteralSupport.cpp
  clang/lib/Sema/SemaCast.cpp
  clang/lib/Sema/SemaChecking.cpp
  clang/lib/Sema/SemaDeclAttr.cpp
  clang/lib/Sema/SemaExpr.cpp
  clang/lib/Sema/SemaExprObjC.cpp
  clang/lib/Sema/SemaFixItUtils.cpp
  clang/lib/Sema/SemaOverload.cpp
  clang/lib/Sema/SemaTemplate.cpp
  clang/lib/StaticAnalyzer/Checkers/CheckSecuritySyntaxOnly.cpp
  clang/test/Sema/format-strings-c90.c
  clang/test/Sema/format-strings-darwin.c
  clang/test/Sema/format-strings-int-typedefs.c
  clang/test/Sema/format-strings-ms.c
  clang/test/Sema/format-strings-non-iso.c
  clang/test/Sema/format-strings-pedantic.c
  clang/test/Sema/format-strings-scanf.c
  clang/test/Sema/string-plus-char.c
  clang/test/SemaCXX/format-strings-0x.cpp
  clang/test/SemaCXX/format-strings.cpp
  llvm/include/llvm/Support/ConvertUTF.h
  llvm/lib/Support/ConvertUTFWrapper.cpp

Index: llvm/lib/Support/ConvertUTFWrapper.cpp
===
--- llvm/lib/Support/ConvertUTFWrapper.cpp
+++ llvm/lib/Support/ConvertUTFWrapper.cpp
@@ -140,6 +140,64 @@
   llvm::ArrayRef(reinterpret_cast(Src.data()),
   Src.size() * sizeof(UTF16)), Out);
 }
+  
+bool convertUTF32ToUTF8String(ArrayRef SrcBytes, std::string ) {
+  assert(Out.empty());
+
+  // Error out on an uneven byte count.
+  if (SrcBytes.size() % 2)
+return false;
+
+  // Avoid OOB by returning early on empty input.
+  if (SrcBytes.empty())
+return true;
+
+  const UTF32 *Src = reinterpret_cast(SrcBytes.begin());
+  const UTF32 *SrcEnd = reinterpret_cast(SrcBytes.end());
+
+  assert((uintptr_t)Src % sizeof(UTF32) == 0);
+
+  // Byteswap if necessary.
+  std::vector ByteSwapped;
+  if (Src[0] == UNI_UTF16_BYTE_ORDER_MARK_SWAPPED) {
+ByteSwapped.insert(ByteSwapped.end(), Src, SrcEnd);
+for (unsigned I = 0, E = ByteSwapped.size(); I != E; ++I)
+  ByteSwapped[I] = llvm::ByteSwap_32(ByteSwapped[I]);
+Src = [0];
+SrcEnd = [ByteSwapped.size() - 1] + 1;
+  }
+
+  // Skip the BOM for conversion.
+  if (Src[0] == UNI_UTF32_BYTE_ORDER_MARK_NATIVE)
+Src++;
+
+  // Just allocate enough space up front.  We'll shrink it later.  Allocate
+  // enough that we can fit a null terminator without reallocating.
+  Out.resize(SrcBytes.size() * UNI_MAX_UTF8_BYTES_PER_CODE_POINT + 1);
+  UTF8 *Dst = reinterpret_cast([0]);
+  UTF8 *DstEnd = Dst + Out.size();
+
+  ConversionResult CR =
+  ConvertUTF32toUTF8(, SrcEnd, , DstEnd, strictConversion);
+  assert(CR != targetExhausted);
+
+  if (CR != conversionOK) {
+Out.clear();
+return false;
+  }
+
+  Out.resize(reinterpret_cast(Dst) - [0]);
+  Out.push_back(0);
+  Out.pop_back();
+  return true;
+}
+  
+bool convertUTF32ToUTF8String(ArrayRef Src, std::string )
+{
+  return convertUTF16ToUTF8String(
+  llvm::ArrayRef(reinterpret_cast(Src.data()),
+  Src.size() * sizeof(UTF32)), Out);
+}
 
 bool convertUTF8ToUTF16String(StringRef SrcUTF8,
   SmallVectorImpl ) {
Index: llvm/include/llvm/Support/ConvertUTF.h
===
--- llvm/include/llvm/Support/ConvertUTF.h
+++ llvm/include/llvm/Support/ConvertUTF.h
@@ -122,6 +122,9 @@
 
 #define UNI_UTF16_BYTE_ORDER_MARK_NATIVE  0xFEFF
 #define UNI_UTF16_BYTE_ORDER_MARK_SWAPPED 0xFFFE
+  
+#define UNI_UTF32_BYTE_ORDER_MARK_NATIVE  0xFEFF
+#define UNI_UTF32_BYTE_ORDER_MARK_SWAPPED 0xFFFE
 
 typedef enum {
   conversionOK,   /* conversion successful */
@@ -277,6 +280,24 @@
 * \returns true on success
 */
 bool convertUTF16ToUTF8String(ArrayRef Src, std::string );
+  
+/**
+ * Converts a stream of raw bytes assumed to be UTF32 into a UTF8 std::string.
+ *
+ * \param [in] SrcBytes A buffer of what is assumed to be UTF-32 encoded text.
+ * \param [out] Out Converted UTF-8 is stored here on success.
+ * \returns true on success
+ */
+bool convertUTF32ToUTF8String(ArrayRef SrcBytes, std::string );
+
+/**
+* Converts a UTF32 string into a UTF8 std::string.
+*
+* \param [in] Src A buffer of UTF-32 encoded text.
+* \param [out] Out 

[PATCH] D103426: Clang: Extend format string checking to wprintf/wscanf

2021-07-23 Thread Marcus Johnson via Phabricator via cfe-commits
MarcusJohnson91 updated this revision to Diff 361414.
MarcusJohnson91 added a comment.

Getting weird crashes all over the place in code I didn't touch, no idea what's 
going on


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D103426/new/

https://reviews.llvm.org/D103426

Files:
  clang/lib/AST/Expr.cpp
  clang/lib/StaticAnalyzer/Checkers/CheckSecuritySyntaxOnly.cpp


Index: clang/lib/StaticAnalyzer/Checkers/CheckSecuritySyntaxOnly.cpp
===
--- clang/lib/StaticAnalyzer/Checkers/CheckSecuritySyntaxOnly.cpp
+++ clang/lib/StaticAnalyzer/Checkers/CheckSecuritySyntaxOnly.cpp
@@ -782,8 +782,8 @@
   bool BoundsProvided = ArgIndex == DEPR_ONLY;
 
   if (!BoundsProvided) {
-// Currently we only handle (not wide) string literals. It is possible to 
do
-// better, either by looking at references to const variables, or by doing
+// Currently we only handle string literals. It is possible to do better
+// either by looking at references to const variables, or by doing
 // real flow analysis.
 auto FormatString =
 dyn_cast(CE->getArg(ArgIndex)->IgnoreParenImpCasts());
Index: clang/lib/AST/Expr.cpp
===
--- clang/lib/AST/Expr.cpp
+++ clang/lib/AST/Expr.cpp
@@ -1068,7 +1068,7 @@
 }
 
 char *StringLiteral::getStrDataAsChar() {
-  std::string Output = "";
+  std::string Output;
   char *CString = nullptr;
   
   switch (getKind()) {
@@ -1078,21 +1078,18 @@
   return getTrailingObjects();
   break;
 case StringKind::UTF16: {
-  std::string Trail16 = "";
-  Trail16 = getTrailingObjects();
-  ArrayRef ArrayRef16(Trail16.c_str(), Trail16.length());
-  if (llvm::convertUTF16ToUTF8String(ArrayRef16, Output)) {
+  ArrayRef AR(getTrailingObjects(), getByteLength());
+  if (llvm::convertUTF16ToUTF8String(AR, Output)) {
 CString = new char[Output.size() + 1]; // +1 for terminating NULL
 return CString;
   }
   break;
 }
 case StringKind::UTF32: {
-  std::string Trail32 = "";
-  Trail32 = getTrailingObjects();
-  ArrayRef ArrayRef32(Trail32.c_str(), Trail32.length());
-  if (llvm::convertUTF32ToUTF8String(ArrayRef32, Output)) {
+  ArrayRef AR(getTrailingObjects(), getByteLength());
+  if (llvm::convertUTF32ToUTF8String(AR, Output)) {
 CString = new char[Output.size() + 1];
+memcpy(CString, Output.c_str(), Output.size());
 return CString;
   }
   break;
@@ -1100,6 +1097,7 @@
 case StringKind::Wide: {
   if (llvm::convertWideToUTF8(getStringAsWChar(), Output)) {
 CString = new char[Output.size() + 1];
+memcpy(CString, Output.c_str(), Output.size());
 return CString;
   }
   break;
@@ -1108,8 +1106,7 @@
 }
 
 const char *StringLiteral::getStrDataAsChar() const {
-  const char *ConstString = StringLiteral::getStrDataAsChar();
-  return ConstString;
+  return const_cast(getStrDataAsChar());
 }
 
 StringLiteral::StringLiteral(const ASTContext , StringRef Str,


Index: clang/lib/StaticAnalyzer/Checkers/CheckSecuritySyntaxOnly.cpp
===
--- clang/lib/StaticAnalyzer/Checkers/CheckSecuritySyntaxOnly.cpp
+++ clang/lib/StaticAnalyzer/Checkers/CheckSecuritySyntaxOnly.cpp
@@ -782,8 +782,8 @@
   bool BoundsProvided = ArgIndex == DEPR_ONLY;
 
   if (!BoundsProvided) {
-// Currently we only handle (not wide) string literals. It is possible to do
-// better, either by looking at references to const variables, or by doing
+// Currently we only handle string literals. It is possible to do better
+// either by looking at references to const variables, or by doing
 // real flow analysis.
 auto FormatString =
 dyn_cast(CE->getArg(ArgIndex)->IgnoreParenImpCasts());
Index: clang/lib/AST/Expr.cpp
===
--- clang/lib/AST/Expr.cpp
+++ clang/lib/AST/Expr.cpp
@@ -1068,7 +1068,7 @@
 }
 
 char *StringLiteral::getStrDataAsChar() {
-  std::string Output = "";
+  std::string Output;
   char *CString = nullptr;
   
   switch (getKind()) {
@@ -1078,21 +1078,18 @@
   return getTrailingObjects();
   break;
 case StringKind::UTF16: {
-  std::string Trail16 = "";
-  Trail16 = getTrailingObjects();
-  ArrayRef ArrayRef16(Trail16.c_str(), Trail16.length());
-  if (llvm::convertUTF16ToUTF8String(ArrayRef16, Output)) {
+  ArrayRef AR(getTrailingObjects(), getByteLength());
+  if (llvm::convertUTF16ToUTF8String(AR, Output)) {
 CString = new char[Output.size() + 1]; // +1 for terminating NULL
 return CString;
   }
   break;
 }
 case StringKind::UTF32: {
-  std::string Trail32 = "";
-  Trail32 = getTrailingObjects();
-  ArrayRef ArrayRef32(Trail32.c_str(), Trail32.length());
-   

[PATCH] D103426: Clang: Extend format string checking to wprintf/wscanf

2021-07-23 Thread Marcus Johnson via Phabricator via cfe-commits
MarcusJohnson91 updated this revision to Diff 361410.
MarcusJohnson91 added a comment.

Rebased on main


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D103426/new/

https://reviews.llvm.org/D103426

Files:
  clang/lib/AST/OSLog.cpp
  clang/lib/Sema/SemaChecking.cpp
  clang/lib/StaticAnalyzer/Checkers/CheckSecuritySyntaxOnly.cpp

Index: clang/lib/StaticAnalyzer/Checkers/CheckSecuritySyntaxOnly.cpp
===
--- clang/lib/StaticAnalyzer/Checkers/CheckSecuritySyntaxOnly.cpp
+++ clang/lib/StaticAnalyzer/Checkers/CheckSecuritySyntaxOnly.cpp
@@ -785,28 +785,10 @@
 // Currently we only handle (not wide) string literals. It is possible to do
 // better, either by looking at references to const variables, or by doing
 // real flow analysis.
-std::string String;
 auto FormatString =
 dyn_cast(CE->getArg(ArgIndex)->IgnoreParenImpCasts());
 
-StringLiteral::StringKind Kind = FormatString->getKind();
-
-if (Kind == StringLiteral::Ascii || Kind == StringLiteral::UTF8) {
-  String = FormatString->getStringAsChar();
-} else if (Kind == StringLiteral::UTF16) {
-  std::wstring_convert, char16_t> Convert;
-  std::u16string U16 = FormatString->getStringAsChar16();
-  String = Convert.to_bytes(U16);
-} else if (Kind == StringLiteral::UTF32) {
-  std::wstring_convert, char32_t> Convert;
-  std::u32string U32 = FormatString->getStringAsChar32();
-  String = Convert.to_bytes(U32);
-} else if (Kind == StringLiteral::Wide) {
-  std::wstring_convert, wchar_t> Convert;
-  std::wstring WChar = FormatString->getStringAsWChar();
-  String = Convert.to_bytes(WChar);
-}
-
+std::string String(FormatString->getStrDataAsChar());
 StringRef StrRef = StringRef(String);
 
 if (FormatString &&
Index: clang/lib/Sema/SemaChecking.cpp
===
--- clang/lib/Sema/SemaChecking.cpp
+++ clang/lib/Sema/SemaChecking.cpp
@@ -622,25 +622,7 @@
 auto *FormatExpr = TheCall->getArg(FormatIndex)->IgnoreParenImpCasts();
 
 if (auto *Format = dyn_cast(FormatExpr)) {
-  StringLiteral::StringKind Kind = Format->getKind();
-  std::string String;
-  
-  if (Kind == StringLiteral::Ascii || Kind == StringLiteral::UTF8) {
-String = Format->getStringAsChar();
-  } else if (Kind == StringLiteral::UTF16) {
-std::wstring_convert, char16_t> Convert;
-std::u16string U16 = Format->getStringAsChar16();
-String = Convert.to_bytes(U16);
-  } else if (Kind == StringLiteral::UTF32) {
-std::wstring_convert, char32_t> Convert;
-std::u32string U32 = Format->getStringAsChar32();
-String = Convert.to_bytes(U32);
-  } else if (Kind == StringLiteral::Wide) {
-std::wstring_convert, wchar_t> Convert;
-std::wstring WChar = Format->getStringAsWChar();
-String = Convert.to_bytes(WChar);
-  }
-  
+  std::string String(Format->getStrDataAsChar());
   StringRef FormatStrRef(String);
   
   EstimateSizeFormatHandler H(FormatStrRef);
@@ -7492,6 +7474,10 @@
 return FExpr->getString().drop_front(Offset);
   }
   
+  const char *getStrDataAsChar() const {
+return FExpr->getStrDataAsChar();
+  }
+  
   std::string getStringAsChar() const {
 return FExpr->getStringAsChar();
   }
@@ -9549,24 +9535,7 @@
 /*IsStringLocation*/ true, OrigFormatExpr->getSourceRange());
 return;
   }
-  StringLiteral::StringKind Kind = FExpr->getKind();
-  std::string String;
-  
-  if (Kind == StringLiteral::Ascii || Kind == StringLiteral::UTF8) {
-String = FExpr->getStringAsChar();
-  } else if (Kind == StringLiteral::UTF16) {
-std::wstring_convert, char16_t> Convert;
-std::u16string U16 = FExpr->getStringAsChar16();
-String = Convert.to_bytes(U16);
-  } else if (Kind == StringLiteral::UTF32) {
-std::wstring_convert, char32_t> Convert;
-std::u32string U32 = FExpr->getStringAsChar32();
-String = Convert.to_bytes(U32);
-  } else if (Kind == StringLiteral::Wide) {
-std::wstring_convert, wchar_t> Convert;
-std::wstring WChar = FExpr->getStringAsWChar();
-String = Convert.to_bytes(WChar);
-  }
+  std::string String(FExpr->getStrDataAsChar());
   
   StringRef StrRef(String);
   const char *Str = StrRef.data();
@@ -9637,25 +9606,7 @@
   assert(T && "String literal not of constant array type!");
   size_t TypeSize = T->getSize().getZExtValue();
   
-  StringLiteral::StringKind Kind = FExpr->getKind();
-  std::string String;
-  
-  if (Kind == StringLiteral::Ascii || Kind == StringLiteral::UTF8) {
-String = FExpr->getStringAsChar();
-  } else if (Kind == StringLiteral::UTF16) {
-std::u16string  U16 = FExpr->getStringAsChar16();
-std::wstring_convert, char16_t> Convert;
-String = Convert.to_bytes(U16);
-  } else if (Kind == StringLiteral::UTF32) {
-std::u32string U32 = 

[PATCH] D103426: Clang: Extend format string checking to wprintf/wscanf

2021-07-23 Thread Marcus Johnson via Phabricator via cfe-commits
MarcusJohnson91 added inline comments.



Comment at: clang/lib/AST/TemplateBase.cpp:77-80
+  const Decl *D = nullptr;
+  if (T->isTypedefNameType()) {
+D = T->getAs()->getDecl();
+  }

aaron.ballman wrote:
> This approach seems fragile for getting access to an AST context. I think 
> `TemplateArgument::print()` should likely be given an `const ASTContext &` 
> parameter instead.
Just switched it out, D was only used for the LangOpts() call anyway, so I just 
did it directly where it's needed instead of creating a variable



Comment at: 
clang/lib/StaticAnalyzer/Checkers/CheckSecuritySyntaxOnly.cpp:791-808
+
+StringLiteral::StringKind Kind = FormatString->getKind();
+
+if (Kind == StringLiteral::Ascii || Kind == StringLiteral::UTF8) {
+  String = FormatString->getStringAsChar();
+} else if (Kind == StringLiteral::UTF16) {
+  std::wstring_convert, char16_t> 
Convert;

MarcusJohnson91 wrote:
> cor3ntin wrote:
> > I think this code is present in multiple places in your PR, could it be 
> > deduplicated?
> Yup, I've been working on that too; This was actually the first thing I 
> started with.
I've moved this code into StringLiteral::getStrDataAsChar(), so now it does the 
conversion if needed, and removed the big if/else chain in all the places it 
was used previously.

Oh, and I made StringLiteral::getStrDataAsChar() public in the StringLiteral 
and I forwarded it in FormatStringLiteral.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D103426/new/

https://reviews.llvm.org/D103426

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D103426: Clang: Extend format string checking to wprintf/wscanf

2021-07-23 Thread Marcus Johnson via Phabricator via cfe-commits
MarcusJohnson91 updated this revision to Diff 361409.
MarcusJohnson91 marked an inline comment as done.
MarcusJohnson91 added a comment.

#1: Moved a comment to the top, so it's brettier 

#2: Moved all the ugly StringLiteral conversion code to 
StringLiteral::getStrDataAsChar and forwarded that function to 
FormatStringLiteral::getStrDataAsChar


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D103426/new/

https://reviews.llvm.org/D103426

Files:
  clang/lib/AST/OSLog.cpp
  clang/lib/Sema/SemaChecking.cpp
  clang/lib/StaticAnalyzer/Checkers/CheckSecuritySyntaxOnly.cpp

Index: clang/lib/StaticAnalyzer/Checkers/CheckSecuritySyntaxOnly.cpp
===
--- clang/lib/StaticAnalyzer/Checkers/CheckSecuritySyntaxOnly.cpp
+++ clang/lib/StaticAnalyzer/Checkers/CheckSecuritySyntaxOnly.cpp
@@ -785,28 +785,10 @@
 // Currently we only handle (not wide) string literals. It is possible to do
 // better, either by looking at references to const variables, or by doing
 // real flow analysis.
-std::string String;
 auto FormatString =
 dyn_cast(CE->getArg(ArgIndex)->IgnoreParenImpCasts());
 
-StringLiteral::StringKind Kind = FormatString->getKind();
-
-if (Kind == StringLiteral::Ascii || Kind == StringLiteral::UTF8) {
-  String = FormatString->getStringAsChar();
-} else if (Kind == StringLiteral::UTF16) {
-  std::wstring_convert, char16_t> Convert;
-  std::u16string U16 = FormatString->getStringAsChar16();
-  String = Convert.to_bytes(U16);
-} else if (Kind == StringLiteral::UTF32) {
-  std::wstring_convert, char32_t> Convert;
-  std::u32string U32 = FormatString->getStringAsChar32();
-  String = Convert.to_bytes(U32);
-} else if (Kind == StringLiteral::Wide) {
-  std::wstring_convert, wchar_t> Convert;
-  std::wstring WChar = FormatString->getStringAsWChar();
-  String = Convert.to_bytes(WChar);
-}
-
+std::string String(FormatString->getStrDataAsChar());
 StringRef StrRef = StringRef(String);
 
 if (FormatString &&
Index: clang/lib/Sema/SemaChecking.cpp
===
--- clang/lib/Sema/SemaChecking.cpp
+++ clang/lib/Sema/SemaChecking.cpp
@@ -622,25 +622,7 @@
 auto *FormatExpr = TheCall->getArg(FormatIndex)->IgnoreParenImpCasts();
 
 if (auto *Format = dyn_cast(FormatExpr)) {
-  StringLiteral::StringKind Kind = Format->getKind();
-  std::string String;
-  
-  if (Kind == StringLiteral::Ascii || Kind == StringLiteral::UTF8) {
-String = Format->getStringAsChar();
-  } else if (Kind == StringLiteral::UTF16) {
-std::wstring_convert, char16_t> Convert;
-std::u16string U16 = Format->getStringAsChar16();
-String = Convert.to_bytes(U16);
-  } else if (Kind == StringLiteral::UTF32) {
-std::wstring_convert, char32_t> Convert;
-std::u32string U32 = Format->getStringAsChar32();
-String = Convert.to_bytes(U32);
-  } else if (Kind == StringLiteral::Wide) {
-std::wstring_convert, wchar_t> Convert;
-std::wstring WChar = Format->getStringAsWChar();
-String = Convert.to_bytes(WChar);
-  }
-  
+  std::string String(Format->getStrDataAsChar());
   StringRef FormatStrRef(String);
   
   EstimateSizeFormatHandler H(FormatStrRef);
@@ -7481,6 +7463,10 @@
 return FExpr->getString().drop_front(Offset);
   }
   
+  const char *getStrDataAsChar() const {
+return FExpr->getStrDataAsChar();
+  }
+  
   std::string getStringAsChar() const {
 return FExpr->getStringAsChar();
   }
@@ -9538,24 +9524,7 @@
 /*IsStringLocation*/ true, OrigFormatExpr->getSourceRange());
 return;
   }
-  StringLiteral::StringKind Kind = FExpr->getKind();
-  std::string String;
-  
-  if (Kind == StringLiteral::Ascii || Kind == StringLiteral::UTF8) {
-String = FExpr->getStringAsChar();
-  } else if (Kind == StringLiteral::UTF16) {
-std::wstring_convert, char16_t> Convert;
-std::u16string U16 = FExpr->getStringAsChar16();
-String = Convert.to_bytes(U16);
-  } else if (Kind == StringLiteral::UTF32) {
-std::wstring_convert, char32_t> Convert;
-std::u32string U32 = FExpr->getStringAsChar32();
-String = Convert.to_bytes(U32);
-  } else if (Kind == StringLiteral::Wide) {
-std::wstring_convert, wchar_t> Convert;
-std::wstring WChar = FExpr->getStringAsWChar();
-String = Convert.to_bytes(WChar);
-  }
+  std::string String(FExpr->getStrDataAsChar());
   
   StringRef StrRef(String);
   const char *Str = StrRef.data();
@@ -9626,25 +9595,7 @@
   assert(T && "String literal not of constant array type!");
   size_t TypeSize = T->getSize().getZExtValue();
   
-  StringLiteral::StringKind Kind = FExpr->getKind();
-  std::string String;
-  
-  if (Kind == StringLiteral::Ascii || Kind == StringLiteral::UTF8) {
-String = FExpr->getStringAsChar();
-  } else if 

[PATCH] D103426: Clang: Extend format string checking to wprintf/wscanf

2021-07-23 Thread Marcus Johnson via Phabricator via cfe-commits
MarcusJohnson91 updated this revision to Diff 361405.
MarcusJohnson91 marked an inline comment as done.
MarcusJohnson91 added a comment.

Just implemented the change Aaron requested in TemplateBase.cpp


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D103426/new/

https://reviews.llvm.org/D103426

Files:
  clang/lib/AST/TemplateBase.cpp


Index: clang/lib/AST/TemplateBase.cpp
===
--- clang/lib/AST/TemplateBase.cpp
+++ clang/lib/AST/TemplateBase.cpp
@@ -74,14 +74,7 @@
 
   if (Policy.MSVCFormatting)
 IncludeType = false;
-  
-  const Decl *D = nullptr;
-  if (T->isTypedefNameType()) {
-D = T->getAs()->getDecl();
-  }
 
-  
-
   if (T->isBooleanType()) {
 if (!Policy.MSVCFormatting)
   Out << (Val.getBoolValue() ? "true" : "false");
@@ -95,15 +88,15 @@
 Out << "(unsigned char)";
 }
 CharacterLiteral::print(Val.getZExtValue(), CharacterLiteral::Ascii, Out);
-  } else if (D != nullptr && T->isAnyCharacterType(D->getLangOpts()) && 
!Policy.MSVCFormatting) {
+  } else if 
(T->isAnyCharacterType(T->getAs()->getDecl()->getLangOpts()) && 
!Policy.MSVCFormatting) {
 CharacterLiteral::CharacterKind Kind;
 if (T->isWideCharType())
   Kind = CharacterLiteral::Wide;
 else if (T->isChar8Type())
   Kind = CharacterLiteral::UTF8;
-else if (T->isChar16Type(D->getLangOpts()))
+else if 
(T->isChar16Type(T->getAs()->getDecl()->getLangOpts()))
   Kind = CharacterLiteral::UTF16;
-else if (T->isChar32Type(D->getLangOpts()))
+else if 
(T->isChar32Type(T->getAs()->getDecl()->getLangOpts()))
   Kind = CharacterLiteral::UTF32;
 else
   Kind = CharacterLiteral::Ascii;


Index: clang/lib/AST/TemplateBase.cpp
===
--- clang/lib/AST/TemplateBase.cpp
+++ clang/lib/AST/TemplateBase.cpp
@@ -74,14 +74,7 @@
 
   if (Policy.MSVCFormatting)
 IncludeType = false;
-  
-  const Decl *D = nullptr;
-  if (T->isTypedefNameType()) {
-D = T->getAs()->getDecl();
-  }
 
-  
-
   if (T->isBooleanType()) {
 if (!Policy.MSVCFormatting)
   Out << (Val.getBoolValue() ? "true" : "false");
@@ -95,15 +88,15 @@
 Out << "(unsigned char)";
 }
 CharacterLiteral::print(Val.getZExtValue(), CharacterLiteral::Ascii, Out);
-  } else if (D != nullptr && T->isAnyCharacterType(D->getLangOpts()) && !Policy.MSVCFormatting) {
+  } else if (T->isAnyCharacterType(T->getAs()->getDecl()->getLangOpts()) && !Policy.MSVCFormatting) {
 CharacterLiteral::CharacterKind Kind;
 if (T->isWideCharType())
   Kind = CharacterLiteral::Wide;
 else if (T->isChar8Type())
   Kind = CharacterLiteral::UTF8;
-else if (T->isChar16Type(D->getLangOpts()))
+else if (T->isChar16Type(T->getAs()->getDecl()->getLangOpts()))
   Kind = CharacterLiteral::UTF16;
-else if (T->isChar32Type(D->getLangOpts()))
+else if (T->isChar32Type(T->getAs()->getDecl()->getLangOpts()))
   Kind = CharacterLiteral::UTF32;
 else
   Kind = CharacterLiteral::Ascii;
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D103426: Clang: Extend format string checking to wprintf/wscanf

2021-07-19 Thread Marcus Johnson via Phabricator via cfe-commits
MarcusJohnson91 updated this revision to Diff 359704.
MarcusJohnson91 added a comment.
Herald added subscribers: llvm-commits, dexonsmith, hiraditya.
Herald added a project: LLVM.

Few tweaks since last time, nothing big


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D103426/new/

https://reviews.llvm.org/D103426

Files:
  clang/include/clang/AST/Expr.h
  clang/include/clang/AST/FormatString.h
  clang/include/clang/AST/Type.h
  clang/lib/AST/Expr.cpp
  clang/lib/AST/PrintfFormatString.cpp
  clang/lib/AST/Type.cpp
  clang/lib/Sema/SemaChecking.cpp
  llvm/include/llvm/Support/ConvertUTF.h
  llvm/lib/Support/ConvertUTFWrapper.cpp

Index: llvm/lib/Support/ConvertUTFWrapper.cpp
===
--- llvm/lib/Support/ConvertUTFWrapper.cpp
+++ llvm/lib/Support/ConvertUTFWrapper.cpp
@@ -140,6 +140,64 @@
   llvm::ArrayRef(reinterpret_cast(Src.data()),
   Src.size() * sizeof(UTF16)), Out);
 }
+  
+bool convertUTF32ToUTF8String(ArrayRef SrcBytes, std::string ) {
+  assert(Out.empty());
+
+  // Error out on an uneven byte count.
+  if (SrcBytes.size() % 2)
+return false;
+
+  // Avoid OOB by returning early on empty input.
+  if (SrcBytes.empty())
+return true;
+
+  const UTF32 *Src = reinterpret_cast(SrcBytes.begin());
+  const UTF32 *SrcEnd = reinterpret_cast(SrcBytes.end());
+
+  assert((uintptr_t)Src % sizeof(UTF32) == 0);
+
+  // Byteswap if necessary.
+  std::vector ByteSwapped;
+  if (Src[0] == UNI_UTF16_BYTE_ORDER_MARK_SWAPPED) {
+ByteSwapped.insert(ByteSwapped.end(), Src, SrcEnd);
+for (unsigned I = 0, E = ByteSwapped.size(); I != E; ++I)
+  ByteSwapped[I] = llvm::ByteSwap_32(ByteSwapped[I]);
+Src = [0];
+SrcEnd = [ByteSwapped.size() - 1] + 1;
+  }
+
+  // Skip the BOM for conversion.
+  if (Src[0] == UNI_UTF32_BYTE_ORDER_MARK_NATIVE)
+Src++;
+
+  // Just allocate enough space up front.  We'll shrink it later.  Allocate
+  // enough that we can fit a null terminator without reallocating.
+  Out.resize(SrcBytes.size() * UNI_MAX_UTF8_BYTES_PER_CODE_POINT + 1);
+  UTF8 *Dst = reinterpret_cast([0]);
+  UTF8 *DstEnd = Dst + Out.size();
+
+  ConversionResult CR =
+  ConvertUTF32toUTF8(, SrcEnd, , DstEnd, strictConversion);
+  assert(CR != targetExhausted);
+
+  if (CR != conversionOK) {
+Out.clear();
+return false;
+  }
+
+  Out.resize(reinterpret_cast(Dst) - [0]);
+  Out.push_back(0);
+  Out.pop_back();
+  return true;
+}
+  
+bool convertUTF32ToUTF8String(ArrayRef Src, std::string )
+{
+  return convertUTF16ToUTF8String(
+  llvm::ArrayRef(reinterpret_cast(Src.data()),
+  Src.size() * sizeof(UTF32)), Out);
+}
 
 bool convertUTF8ToUTF16String(StringRef SrcUTF8,
   SmallVectorImpl ) {
Index: llvm/include/llvm/Support/ConvertUTF.h
===
--- llvm/include/llvm/Support/ConvertUTF.h
+++ llvm/include/llvm/Support/ConvertUTF.h
@@ -122,6 +122,9 @@
 
 #define UNI_UTF16_BYTE_ORDER_MARK_NATIVE  0xFEFF
 #define UNI_UTF16_BYTE_ORDER_MARK_SWAPPED 0xFFFE
+  
+#define UNI_UTF32_BYTE_ORDER_MARK_NATIVE  0xFEFF
+#define UNI_UTF32_BYTE_ORDER_MARK_SWAPPED 0xFFFE
 
 typedef enum {
   conversionOK,   /* conversion successful */
@@ -277,6 +280,24 @@
 * \returns true on success
 */
 bool convertUTF16ToUTF8String(ArrayRef Src, std::string );
+  
+/**
+ * Converts a stream of raw bytes assumed to be UTF32 into a UTF8 std::string.
+ *
+ * \param [in] SrcBytes A buffer of what is assumed to be UTF-32 encoded text.
+ * \param [out] Out Converted UTF-8 is stored here on success.
+ * \returns true on success
+ */
+bool convertUTF32ToUTF8String(ArrayRef SrcBytes, std::string );
+
+/**
+* Converts a UTF32 string into a UTF8 std::string.
+*
+* \param [in] Src A buffer of UTF-32 encoded text.
+* \param [out] Out Converted UTF-8 is stored here on success.
+* \returns true on success
+*/
+bool convertUTF32ToUTF8String(ArrayRef Src, std::string );
 
 /**
  * Converts a UTF-8 string into a UTF-16 string with native endianness.
Index: clang/lib/Sema/SemaChecking.cpp
===
--- clang/lib/Sema/SemaChecking.cpp
+++ clang/lib/Sema/SemaChecking.cpp
@@ -9575,15 +9575,15 @@
   
   // Emit a warning if the string literal is truncated and does not contain an
   // embedded null character.
-if (TypeSize < StrRef.size() &&
-StrRef.substr(0, TypeSize).find('\0') == StringRef::npos) {
-  CheckFormatHandler::EmitFormatDiagnostic(
-  S, inFunctionCall, Args[format_idx],
-  S.PDiag(diag::warn_printf_format_string_not_null_terminated),
-  FExpr->getBeginLoc(),
-  /*IsStringLocation=*/true, OrigFormatExpr->getSourceRange());
-  return;
-}
+  if (TypeSize < StrRef.size() &&
+  StrRef.substr(0, TypeSize).find('\0') == StringRef::npos) {
+CheckFormatHandler::EmitFormatDiagnostic(
+ S, 

[PATCH] D103426: Clang: Extend format string checking to wprintf/wscanf

2021-07-17 Thread Marcus Johnson via Phabricator via cfe-commits
MarcusJohnson91 updated this revision to Diff 359590.

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D103426/new/

https://reviews.llvm.org/D103426

Files:
  clang/include/clang/AST/Expr.h
  clang/include/clang/AST/FormatString.h
  clang/include/clang/AST/Type.h
  clang/lib/AST/Expr.cpp
  clang/lib/AST/PrintfFormatString.cpp
  clang/lib/AST/Type.cpp

Index: clang/lib/AST/Type.cpp
===
--- clang/lib/AST/Type.cpp
+++ clang/lib/AST/Type.cpp
@@ -1962,18 +1962,6 @@
   return false;
 }
 
-bool Type::isType(const std::string TypeName) const {
-  QualType Desugar = this->getLocallyUnqualifiedSingleStepDesugaredType();
-  
-  
-  while (!Desugar->isCanonicalUnqualified()) {
-if (Desugar.getAsString() == TypeName) {
-  return true;
-}
-Desugar = Desugar->getLocallyUnqualifiedSingleStepDesugaredType();
-  }
-}
-
 bool Type::isChar8Type() const {
   if (const BuiltinType *BT = dyn_cast(CanonicalType))
 return BT->getKind() == BuiltinType::Char8;
@@ -1985,7 +1973,15 @@
 if (BT->getKind() == BuiltinType::Char16)
   return true;
   if (!LangOpts.CPlusPlus) {
-return isType("char16_t");
+QualType Desugar = this->getLocallyUnqualifiedSingleStepDesugaredType();
+
+
+while (!Desugar->isCanonicalUnqualified()) {
+  if (Desugar.getAsString() == "char16_t") {
+return true;
+  }
+  Desugar = Desugar->getLocallyUnqualifiedSingleStepDesugaredType();
+}
   }
   return false;
 }
@@ -1995,7 +1991,14 @@
 if (BT->getKind() == BuiltinType::Char32)
   return true;
   if (!LangOpts.CPlusPlus) {
-return isType("char32_t");
+QualType Desugar = this->getLocallyUnqualifiedSingleStepDesugaredType();
+
+while (!Desugar->isCanonicalUnqualified()) {
+  if (Desugar.getAsString() == "char32_t") {
+return true;
+  }
+  Desugar = Desugar->getLocallyUnqualifiedSingleStepDesugaredType();
+}
   }
   return false;
 }
Index: clang/lib/AST/PrintfFormatString.cpp
===
--- clang/lib/AST/PrintfFormatString.cpp
+++ clang/lib/AST/PrintfFormatString.cpp
@@ -643,6 +643,9 @@
  "const unichar *");
 return ArgType(ArgType::WCStrTy, "wchar_t *");
   }
+  if (LM.getKind() == LengthModifier::AsWide) {
+return ArgType(ArgType::WCStrTy, "wchar_t *");
+  }
   if (LM.getKind() == LengthModifier::AsUTF16)
 return ArgType(ArgType::Char16Ty, "char16_t *");
   if (LM.getKind() == LengthModifier::AsUTF32)
@@ -860,6 +863,9 @@
 LM.setKind(LengthModifier::AsLongDouble);
 break;
   
+  case BuiltinType::Char8:
+LM.setKind(LengthModifier::AsUTF8);
+  
   case BuiltinType::Char16:
 LM.setKind(LengthModifier::AsUTF16);
 break;
Index: clang/lib/AST/Expr.cpp
===
--- clang/lib/AST/Expr.cpp
+++ clang/lib/AST/Expr.cpp
@@ -1331,6 +1331,8 @@
  const LangOptions ,
  const TargetInfo , unsigned *StartToken,
  unsigned *StartTokenByteOffset) const {
+  assert((getKind() == StringLiteral::Ascii || getKind() == StringLiteral::UTF8) &&
+ "Only narrow string literals are currently supported");
   // Loop over all of the tokens in this string until we find the one that
   // contains the byte we're looking for.
   unsigned TokNo = 0;
Index: clang/include/clang/AST/Type.h
===
--- clang/include/clang/AST/Type.h
+++ clang/include/clang/AST/Type.h
@@ -1972,7 +1972,6 @@
   /// Determine whether this type is a scoped enumeration type.
   bool isScopedEnumeralType() const;
   bool isBooleanType() const;
-  bool isType(const std::string TypeName) const;
   bool isCharType() const;
   bool isChar8Type() const;
   bool isWideCharType() const;
Index: clang/include/clang/AST/FormatString.h
===
--- clang/include/clang/AST/FormatString.h
+++ clang/include/clang/AST/FormatString.h
@@ -80,8 +80,8 @@
 AsLongDouble, // 'L'
 AsAllocate,   // for '%as', GNU extension to C90 scanf
 AsMAllocate,  // for '%ms', GNU extension to scanf
-AsUTF16,  // for '%l16(c|s)', soon to be standardized
-AsUTF32,  // for '%l32(c|s)', soon to be standardized
+AsUTF16,  // for '%l16(c|s)', Clang extension
+AsUTF32,  // for '%l32(c|s)', Clang extension
 AsWide,   // 'w' (MSVCRT, like l but only for c, C, s, S, or Z
 AsWideChar = AsLong // for '%ls', only makes sense for printf
   };
Index: clang/include/clang/AST/Expr.h
===
--- clang/include/clang/AST/Expr.h
+++ clang/include/clang/AST/Expr.h
@@ -1850,21 +1850,18 @@
   std::u16string getStringAsChar16() const {
 

[PATCH] D103426: Clang: Extend format string checking to wprintf/wscanf

2021-07-17 Thread Marcus Johnson via Phabricator via cfe-commits
MarcusJohnson91 added inline comments.



Comment at: clang/include/clang/AST/Expr.h:1846-1871
+  std::string getStringAsChar() const {
+assert(getCharByteWidth() == 1 &&
+   "This function is used in places that assume strings use char");
+return std::string(getTrailingObjects(), getTrailingObjects() 
+ getByteLength());
+  }
+  
+  std::u16string getStringAsChar16() const {

cor3ntin wrote:
> aaron.ballman wrote:
> > One potential issue to this is that in C++, these types are defined to be 
> > UTF-16 and UTF-32, whereas in C, that isn't always the case. Currently, 
> > Clang defines `__STDC_UTF_16__` and `__STDC_UTF_32__` on all targets, but 
> > we may need to add some sanity checks to catch if a target overrides that 
> > behavior. Currently, all the targets in Clang are keeping that promise, but 
> > I have no idea what shenanigans downstream users get up to and whether 
> > their targets remove the macro definition or define it to `0` instead of 
> > `1`.
> Is it possible that the host and target wchar_t have a different size here?
I've honestly been wondering how Clang handled that, in the codebase vs at 
runtime myself for a while.



Comment at: clang/include/clang/AST/FormatString.h:83
 AsMAllocate,  // for '%ms', GNU extension to scanf
+AsUTF16,  // for '%l16(c|s)', soon to be standardized
+AsUTF32,  // for '%l32(c|s)', soon to be standardized

aaron.ballman wrote:
> May want to drop the "soon to be standardized" given that the proposal hasn't 
> been seen by WG14 yet. I think it's fine to say "Clang extension", though. 
> More on the format specifier itself, below.
Done,

and here is the link to the document, I haven't heard any feedback?

http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2761.pdf



Comment at: clang/include/clang/AST/Type.h:1975
   bool isBooleanType() const;
+  bool isType(const std::string TypeName) const;
   bool isCharType() const;

aaron.ballman wrote:
> It's not clear from the header file what this function actually *does*. 
> Presumably, everything that can be represented by a `Type` is a type, so I'd 
> assume this returns `true` always. Or should this be a static function?
Yeah, could choose a better name just not sure what.

It takes a type name as the argument, and then it desugars the type one step at 
a time, and if it finds a match it returns true.

so, let's say we're in C++, and someone typedef'd char8_t to String.

This function will say yes, String is compatible with char8_t for example.

it's mostly for C mode's typedef's of char16_t and char32_t



Comment at: clang/lib/AST/Expr.cpp:1197
  unsigned *StartTokenByteOffset) const {
-  assert((getKind() == StringLiteral::Ascii ||
-  getKind() == StringLiteral::UTF8) &&

aaron.ballman wrote:
> I don't see changes that make this assertion valid to remove -- have I missed 
> something?
Nope you didn't miss anything, I did.

this is a remnant from when I was trying to templatize all the format checking 
code instead of converting the format strings.

Restored the assert.



Comment at: clang/lib/AST/FormatString.cpp:235-240
+  if (LO.C2x && I + 1 != E && I[0] == '1' && I[1] == '6') {
+++I;
+++I;
+lmKind = LengthModifier::AsUTF16;
+break;
+  } else if (LO.C2x && I + 1 != E && I[0] == '3' && I[1] == '2') {

aaron.ballman wrote:
> I don't think this is a conforming extension to C -- lowercase length 
> modifiers and conversion specifiers are reserved for the standard, and 
> uppercase length modifiers and conversion specifiers are reserved for the 
> implementation. `l16` starts with a lowercase letter, so it's reserved for 
> the standard.
> 
> Note: WG14 has been considering extensions in a closely-related space 
> (integer types rather than string or character types) that you may be 
> interested in, if you're not already aware of it: 
> http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2511.pdf. This proposal was 
> not adopted, but did have strong sentiment for something along these lines.
Yeah, honestly not really sure what to do about the reservation, L is also used 
for wide characters and some integers too, and (U|u)(16|32) wasn't taken well 
by the community.



Comment at: clang/lib/AST/OSLog.cpp:212
+  } else if (Lit->isUTF16()) {
+std::wstring_convert, char16_t> Convert;
+std::u16string U16 = Lit->getStringAsChar16();

aaron.ballman wrote:
> MarcusJohnson91 wrote:
> > aaron.ballman wrote:
> > > cor3ntin wrote:
> > > > I'm not sure I have a better suggestion but `codecvt_utf8_utf16` is 
> > > > deprecated in C++17
> > > Good point -- this likely should be lifted into a common interface (as it 
> > > appears several times in the patch) and make use of the existing 

[PATCH] D103426: Clang: Extend format string checking to wprintf/wscanf

2021-07-01 Thread Aaron Ballman via Phabricator via cfe-commits
aaron.ballman added inline comments.



Comment at: clang/lib/AST/OSLog.cpp:212
+  } else if (Lit->isUTF16()) {
+std::wstring_convert, char16_t> Convert;
+std::u16string U16 = Lit->getStringAsChar16();

MarcusJohnson91 wrote:
> aaron.ballman wrote:
> > cor3ntin wrote:
> > > I'm not sure I have a better suggestion but `codecvt_utf8_utf16` is 
> > > deprecated in C++17
> > Good point -- this likely should be lifted into a common interface (as it 
> > appears several times in the patch) and make use of the existing LLVM UTF 
> > conversion functionality: 
> > https://github.com/intel/llvm/blob/sycl/llvm/include/llvm/Support/ConvertUTF.h
> The ConvertUTF version you linked contains `convertUTF32toUTF8String` but the 
> one in LLVM does not, what's the process for updating this?
Oh, oops, I managed to grab a link from downstream and not LLVM, sorry about 
that! But the downstream and LLVM both have a method to perform the conversion 
https://github.com/llvm/llvm-project/blob/main/llvm/include/llvm/Support/ConvertUTF.h#L162,
 so you could add convenience functions to that file that wraps the API, as 
part of this patch.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D103426/new/

https://reviews.llvm.org/D103426

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D103426: Clang: Extend format string checking to wprintf/wscanf

2021-06-29 Thread Marcus Johnson via Phabricator via cfe-commits
MarcusJohnson91 added inline comments.



Comment at: clang/lib/AST/OSLog.cpp:212
+  } else if (Lit->isUTF16()) {
+std::wstring_convert, char16_t> Convert;
+std::u16string U16 = Lit->getStringAsChar16();

aaron.ballman wrote:
> cor3ntin wrote:
> > I'm not sure I have a better suggestion but `codecvt_utf8_utf16` is 
> > deprecated in C++17
> Good point -- this likely should be lifted into a common interface (as it 
> appears several times in the patch) and make use of the existing LLVM UTF 
> conversion functionality: 
> https://github.com/intel/llvm/blob/sycl/llvm/include/llvm/Support/ConvertUTF.h
The ConvertUTF version you linked contains `convertUTF32toUTF8String` but the 
one in LLVM does not, what's the process for updating this?


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D103426/new/

https://reviews.llvm.org/D103426

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D103426: Clang: Extend format string checking to wprintf/wscanf

2021-06-21 Thread Aaron Ballman via Phabricator via cfe-commits
aaron.ballman added inline comments.



Comment at: clang/lib/AST/OSLog.cpp:212
+  } else if (Lit->isUTF16()) {
+std::wstring_convert, char16_t> Convert;
+std::u16string U16 = Lit->getStringAsChar16();

cor3ntin wrote:
> I'm not sure I have a better suggestion but `codecvt_utf8_utf16` is 
> deprecated in C++17
Good point -- this likely should be lifted into a common interface (as it 
appears several times in the patch) and make use of the existing LLVM UTF 
conversion functionality: 
https://github.com/intel/llvm/blob/sycl/llvm/include/llvm/Support/ConvertUTF.h


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D103426/new/

https://reviews.llvm.org/D103426

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D103426: Clang: Extend format string checking to wprintf/wscanf

2021-06-21 Thread Corentin Jabot via Phabricator via cfe-commits
cor3ntin added inline comments.



Comment at: clang/lib/AST/OSLog.cpp:214
+std::u16string U16 = Lit->getStringAsChar16();
+String = Convert.to_bytes(U16); // u"char16_t String"
+  } else if (Lit->isUTF32()) {

If `getStringAsChar16` contains the string after `\x` escape replacement, it 
might contain invalid code units or lone surrogate, in which case `to_bytes` 
might throw if `Convert` is not initialized with an error string


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D103426/new/

https://reviews.llvm.org/D103426

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D103426: Clang: Extend format string checking to wprintf/wscanf

2021-06-21 Thread Aaron Ballman via Phabricator via cfe-commits
aaron.ballman added subscribers: cor3ntin, tahonermann.
aaron.ballman added a comment.

Adding some subscribers from WG21 SG16 who know far more about character 
encodings and their APIs than I do, as they may have feedback on the design 
aspects of this.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D103426/new/

https://reviews.llvm.org/D103426

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D103426: Clang: Extend format string checking to wprintf/wscanf

2021-06-21 Thread Aaron Ballman via Phabricator via cfe-commits
aaron.ballman added a comment.

I had some time to look into this a bit more today. My review still isn't 
complete, but there's enough here to warrant some discussion.




Comment at: clang/include/clang/AST/Expr.h:1846-1871
+  std::string getStringAsChar() const {
+assert(getCharByteWidth() == 1 &&
+   "This function is used in places that assume strings use char");
+return std::string(getTrailingObjects(), getTrailingObjects() 
+ getByteLength());
+  }
+  
+  std::u16string getStringAsChar16() const {

One potential issue to this is that in C++, these types are defined to be 
UTF-16 and UTF-32, whereas in C, that isn't always the case. Currently, Clang 
defines `__STDC_UTF_16__` and `__STDC_UTF_32__` on all targets, but we may need 
to add some sanity checks to catch if a target overrides that behavior. 
Currently, all the targets in Clang are keeping that promise, but I have no 
idea what shenanigans downstream users get up to and whether their targets 
remove the macro definition or define it to `0` instead of `1`.



Comment at: clang/include/clang/AST/FormatString.h:83
 AsMAllocate,  // for '%ms', GNU extension to scanf
+AsUTF16,  // for '%l16(c|s)', soon to be standardized
+AsUTF32,  // for '%l32(c|s)', soon to be standardized

May want to drop the "soon to be standardized" given that the proposal hasn't 
been seen by WG14 yet. I think it's fine to say "Clang extension", though. More 
on the format specifier itself, below.



Comment at: clang/include/clang/AST/Type.h:1975
   bool isBooleanType() const;
+  bool isType(const std::string TypeName) const;
   bool isCharType() const;

It's not clear from the header file what this function actually *does*. 
Presumably, everything that can be represented by a `Type` is a type, so I'd 
assume this returns `true` always. Or should this be a static function?



Comment at: clang/include/clang/Basic/DiagnosticSemaKinds.td:6961
+def err_format_invalid_type : Error<
+  "format strings type is invalid">;
 def err_unexpected_interface : Error<

There are no tests for this diagnostic. Also, should the diagnostic specify 
which format string type is invalid (given that format strings can contain 
multiple type markings)?

Also, this replaces an existing diagnostic that's a warning, but is now 
upgraded to an error. Why is an error appropriate when it was warning before? 
Did you consider leaving this as a warning but marking it as defaulting to an 
error, so users can still disable the diagnostic and allow their code to 
compile?



Comment at: clang/lib/AST/Expr.cpp:1197
  unsigned *StartTokenByteOffset) const {
-  assert((getKind() == StringLiteral::Ascii ||
-  getKind() == StringLiteral::UTF8) &&

I don't see changes that make this assertion valid to remove -- have I missed 
something?



Comment at: clang/lib/AST/FormatString.cpp:235-240
+  if (LO.C2x && I + 1 != E && I[0] == '1' && I[1] == '6') {
+++I;
+++I;
+lmKind = LengthModifier::AsUTF16;
+break;
+  } else if (LO.C2x && I + 1 != E && I[0] == '3' && I[1] == '2') {

I don't think this is a conforming extension to C -- lowercase length modifiers 
and conversion specifiers are reserved for the standard, and uppercase length 
modifiers and conversion specifiers are reserved for the implementation. `l16` 
starts with a lowercase letter, so it's reserved for the standard.

Note: WG14 has been considering extensions in a closely-related space (integer 
types rather than string or character types) that you may be interested in, if 
you're not already aware of it: 
http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2511.pdf. This proposal was 
not adopted, but did have strong sentiment for something along these lines.



Comment at: clang/lib/AST/OSLog.cpp:225
+  
+  StringRef Data(String);
+  

Is this actually needed, or can you use a `std::string` iterator when calling 
`ParsePrintfString()`?



Comment at: clang/lib/AST/PrintfFormatString.cpp:254-257
+  if (ParseLengthModifier(FS, I, E, LO) && I == E) { // THIS IS WHERE IT 
FREAKS OUT
 // No more characters left?
 if (Warn)
+  printf("ParsePrintfSpecifier: Incomplete Specifier 8\n");

Surprising comment and what looks to be some debugging code?



Comment at: clang/lib/AST/PrintfFormatString.cpp:635
   }
-  if (LM.getKind() == LengthModifier::AsWide)
-return ArgType(ArgType::WCStrTy, "wchar_t *");

It looks like this coverage got lost?



Comment at: clang/lib/AST/PrintfFormatString.cpp:844
   case BuiltinType::SChar:
+  case BuiltinType::Char8:
 LM.setKind(LengthModifier::AsChar);

[PATCH] D103426: Clang: Extend format string checking to wprintf/wscanf

2021-06-18 Thread Marcus Johnson via Phabricator via cfe-commits
MarcusJohnson91 added a comment.

Ping


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D103426/new/

https://reviews.llvm.org/D103426

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D103426: Clang: Extend format string checking to wprintf/wscanf

2021-06-09 Thread Marcus Johnson via Phabricator via cfe-commits
MarcusJohnson91 added a comment.

In D103426#2807860 , @aaron.ballman 
wrote:

> In D103426#2807245 , 
> @MarcusJohnson91 wrote:
>
>> In D103426#2806391 , 
>> @aaron.ballman wrote:
>>
>>> Do you have a reference to the WG14 paper proposing these conversion 
>>> specifiers?
>>
>> I've previously written up the proposal, which you actually helped with 
>> (thank you for that), just submitted a request to Dan Plakosh to get a 
>> document number and submit it.
>
> Thanks! Once it's submitted and available on the committee webpage, can you 
> post a link to it in this review? That'd help assess whether the 
> implementation matches the proposal or not.

Absolutely :)

>> Maybe I was a bit optimistic, but I don't think l16 and l32 will be a hard 
>> sell, but maybe I'm naive haha
>
> There's never a shortage of surprises when it comes to standardization 
> efforts. :-D

lol, well I'm hopeful it'll be smooth sailing


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D103426/new/

https://reviews.llvm.org/D103426

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D103426: Clang: Extend format string checking to wprintf/wscanf

2021-06-09 Thread Marcus Johnson via Phabricator via cfe-commits
MarcusJohnson91 added inline comments.



Comment at: clang/lib/AST/Type.cpp:1980-2002
+bool Type::isChar16Type(const LangOptions ) const {
   if (const auto *BT = dyn_cast(CanonicalType))
-return BT->getKind() == BuiltinType::Char16;
-  return false;
-}
-
-bool Type::isChar32Type() const {
-  if (const auto *BT = dyn_cast(CanonicalType))
-return BT->getKind() == BuiltinType::Char32;
-  return false;
-}
-
-/// Determine whether this type is any of the built-in character
-/// types.
-bool Type::isAnyCharacterType() const {
-  const auto *BT = dyn_cast(CanonicalType);
-  if (!BT) return false;
-  switch (BT->getKind()) {
-  default: return false;
-  case BuiltinType::Char_U:
-  case BuiltinType::UChar:
-  case BuiltinType::WChar_U:
-  case BuiltinType::Char8:
-  case BuiltinType::Char16:
-  case BuiltinType::Char32:
-  case BuiltinType::Char_S:
-  case BuiltinType::SChar:
-  case BuiltinType::WChar_S:
-return true;
+if (BT->getKind() == BuiltinType::Char16)
+  return true;
+  if (!LangOpts.CPlusPlus) {
+return isType("char16_t");
   }

aaron.ballman wrote:
> If I understand properly, one issue is that `char16_t` is defined by C to be 
> *the same type* as `uint_least16_t`, which itself is a typedef to some other 
> integer type. So we can't make `char16_t` be a distinct type in C, it has to 
> be a typedef to a typedef to an integer type.
> 
> One concern I have about the approach in this patch is that it makes the type 
> system a bit more convoluted. This type is *not* really a character type in 
> C, it's a typedef type that playacts as a character type. I think this is a 
> pretty fundamental change to the type system in the compiler in some ways 
> because this query is no longer about the canonical type but about the 
> semantics of how the type is expected to be used.
> 
> I'd definitely like to hear what @rsmith thinks of this approach.
I see your point, but I'm not sure how else we could get it through the type 
checking system without doing it this way?

I tried to be careful to only allow the actual character typedefs through by 
making sure char16_t or char32_t is in the typedef chain, and only in C mode.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D103426/new/

https://reviews.llvm.org/D103426

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D103426: Clang: Extend format string checking to wprintf/wscanf

2021-06-09 Thread Aaron Ballman via Phabricator via cfe-commits
aaron.ballman added a comment.

In D103426#2807245 , @MarcusJohnson91 
wrote:

> In D103426#2806391 , @aaron.ballman 
> wrote:
>
>> Do you have a reference to the WG14 paper proposing these conversion 
>> specifiers?
>
> I've previously written up the proposal, which you actually helped with 
> (thank you for that), just submitted a request to Dan Plakosh to get a 
> document number and submit it.

Thanks! Once it's submitted and available on the committee webpage, can you 
post a link to it in this review? That'd help assess whether the implementation 
matches the proposal or not.

> Maybe I was a bit optimistic, but I don't think l16 and l32 will be a hard 
> sell, but maybe I'm naive haha

There's never a shortage of surprises when it comes to standardization efforts. 
:-D


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D103426/new/

https://reviews.llvm.org/D103426

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D103426: Clang: Extend format string checking to wprintf/wscanf

2021-06-09 Thread Marcus Johnson via Phabricator via cfe-commits
MarcusJohnson91 added a comment.

In D103426#2806391 , @aaron.ballman 
wrote:

> Do you have a reference to the WG14 paper proposing these conversion 
> specifiers?

I've previously written up the proposal, which you actually helped with (thank 
you for that), just submitted a request to Dan Plakosh to get a document number 
and submit it.

Maybe I was a bit optimistic, but I don't think l16 and l32 will be a hard 
sell, but maybe I'm naive haha


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D103426/new/

https://reviews.llvm.org/D103426

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D103426: Clang: Extend format string checking to wprintf/wscanf

2021-06-08 Thread Aaron Ballman via Phabricator via cfe-commits
aaron.ballman added a comment.

I haven't had the chance to give this a thorough review yet, but I do have some 
high level questions.

> Also added support for the %l16(c|s) and %l32(c|s) conversion specifier for 
> char16_t and char32_t types in C and C++, which should soon be accepted by 
> ISO WG14.

Do you have a reference to the WG14 paper proposing these conversion specifiers?




Comment at: clang/lib/AST/Type.cpp:1980-2002
+bool Type::isChar16Type(const LangOptions ) const {
   if (const auto *BT = dyn_cast(CanonicalType))
-return BT->getKind() == BuiltinType::Char16;
-  return false;
-}
-
-bool Type::isChar32Type() const {
-  if (const auto *BT = dyn_cast(CanonicalType))
-return BT->getKind() == BuiltinType::Char32;
-  return false;
-}
-
-/// Determine whether this type is any of the built-in character
-/// types.
-bool Type::isAnyCharacterType() const {
-  const auto *BT = dyn_cast(CanonicalType);
-  if (!BT) return false;
-  switch (BT->getKind()) {
-  default: return false;
-  case BuiltinType::Char_U:
-  case BuiltinType::UChar:
-  case BuiltinType::WChar_U:
-  case BuiltinType::Char8:
-  case BuiltinType::Char16:
-  case BuiltinType::Char32:
-  case BuiltinType::Char_S:
-  case BuiltinType::SChar:
-  case BuiltinType::WChar_S:
-return true;
+if (BT->getKind() == BuiltinType::Char16)
+  return true;
+  if (!LangOpts.CPlusPlus) {
+return isType("char16_t");
   }

If I understand properly, one issue is that `char16_t` is defined by C to be 
*the same type* as `uint_least16_t`, which itself is a typedef to some other 
integer type. So we can't make `char16_t` be a distinct type in C, it has to be 
a typedef to a typedef to an integer type.

One concern I have about the approach in this patch is that it makes the type 
system a bit more convoluted. This type is *not* really a character type in C, 
it's a typedef type that playacts as a character type. I think this is a pretty 
fundamental change to the type system in the compiler in some ways because this 
query is no longer about the canonical type but about the semantics of how the 
type is expected to be used.

I'd definitely like to hear what @rsmith thinks of this approach.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D103426/new/

https://reviews.llvm.org/D103426

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D103426: Clang: Extend format string checking to wprintf/wscanf

2021-05-31 Thread Marcus Johnson via Phabricator via cfe-commits
MarcusJohnson91 created this revision.
MarcusJohnson91 created this object with visibility "All Users".
MarcusJohnson91 added a project: clang.
Herald added a subscriber: martong.
Herald added a reviewer: aaron.ballman.
MarcusJohnson91 requested review of this revision.
Herald added a project: clang-tools-extra.

Also added support for the `%l16(c|s)` and `%l32(c|s)` conversion specifier for 
char16_t and char32_t types in C and C++, which should soon be accepted by ISO 
WG14.


Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D103426

Files:
  clang-tools-extra/clang-tidy/boost/UseToStringCheck.cpp
  clang-tools-extra/clang-tidy/misc/ThrowByValueCatchByReferenceCheck.cpp
  clang/include/clang/AST/ASTContext.h
  clang/include/clang/AST/Expr.h
  clang/include/clang/AST/FormatString.h
  clang/include/clang/AST/Type.h
  clang/include/clang/ASTMatchers/ASTMatchers.h
  clang/include/clang/Basic/Builtins.def
  clang/include/clang/Basic/DiagnosticSemaKinds.td
  clang/lib/AST/Expr.cpp
  clang/lib/AST/ExprCXX.cpp
  clang/lib/AST/FormatString.cpp
  clang/lib/AST/OSLog.cpp
  clang/lib/AST/PrintfFormatString.cpp
  clang/lib/AST/ScanfFormatString.cpp
  clang/lib/AST/TemplateBase.cpp
  clang/lib/AST/Type.cpp
  clang/lib/Lex/LiteralSupport.cpp
  clang/lib/Sema/SemaCast.cpp
  clang/lib/Sema/SemaChecking.cpp
  clang/lib/Sema/SemaDeclAttr.cpp
  clang/lib/Sema/SemaExpr.cpp
  clang/lib/Sema/SemaExprObjC.cpp
  clang/lib/Sema/SemaFixItUtils.cpp
  clang/lib/Sema/SemaOverload.cpp
  clang/lib/Sema/SemaTemplate.cpp
  clang/lib/StaticAnalyzer/Checkers/CheckSecuritySyntaxOnly.cpp
  clang/test/Sema/format-strings-c90.c
  clang/test/Sema/format-strings-darwin.c
  clang/test/Sema/format-strings-int-typedefs.c
  clang/test/Sema/format-strings-ms.c
  clang/test/Sema/format-strings-non-iso.c
  clang/test/Sema/format-strings-pedantic.c
  clang/test/Sema/format-strings-scanf.c
  clang/test/Sema/string-plus-char.c
  clang/test/SemaCXX/format-strings-0x.cpp
  clang/test/SemaCXX/format-strings.cpp

Index: clang/test/SemaCXX/format-strings.cpp
===
--- clang/test/SemaCXX/format-strings.cpp
+++ clang/test/SemaCXX/format-strings.cpp
@@ -7,7 +7,10 @@
 extern "C" {
 extern int scanf(const char *restrict, ...);
 extern int printf(const char *restrict, ...);
+extern int wscanf(const char *restrict, ...);
+extern int wprintf(const char *restrict, ...);
 extern int vprintf(const char *restrict, va_list);
+extern int vwprintf(const char *restrict, va_list);
 }
 
 void f(char **sp, float *fp) {
@@ -17,13 +20,24 @@
 #else
   // expected-warning@-4 {{format specifies type 'float *' but the argument has type 'char **'}}
 #endif
+  
+  scanf("%as", sp);
+#if __cplusplus <= 199711L
+  // expected-warning@-2 {{'a' length modifier is not supported by ISO C}}
+#else
+  // expected-warning@-4 {{format specifies type 'float *' but the argument has type 'wchar_t **'}}
+#endif
 
   printf("%a", 1.0);
   scanf("%afoobar", fp);
+  
+  wprintf("%a", 1.0);
+  wscanf("%afoobar", fp);
 }
 
 void g() {
   printf("%ls", "foo"); // expected-warning{{format specifies type 'wchar_t *' but the argument has type 'const char *'}}
+  wprintf("%ls", "foo"); // expected-warning{{format specifies type 'wchar_t *' but the argument has type 'const char *'}}
 }
 
 // Test that we properly handle format_idx on C++ members.
@@ -76,7 +90,7 @@
   va_start(ap,fmt);
   const char * const format = fmt;
   vprintf(format, ap); // no-warning
-
+  
   const char *format2 = fmt;
   vprintf(format2, ap); // expected-warning{{format string is not a string literal}}
 
Index: clang/test/SemaCXX/format-strings-0x.cpp
===
--- clang/test/SemaCXX/format-strings-0x.cpp
+++ clang/test/SemaCXX/format-strings-0x.cpp
@@ -3,33 +3,53 @@
 extern "C" {
 extern int scanf(const char *restrict, ...);
 extern int printf(const char *restrict, ...);
+extern int wscanf(const wchar_t *restrict, ...);
 }
 
 void f(char **sp, float *fp) {
   scanf("%as", sp); // expected-warning{{format specifies type 'float *' but the argument has type 'char **'}}
+  wscanf("%as", sp); // expected-warning{{format specifies type 'float *' but the argument has type 'wchar_t **'}}
 
   printf("%p", sp); // expected-warning{{format specifies type 'void *' but the argument has type 'char **'}}
+  wprintf("%p", sp); // expected-warning{{format specifies type 'void *' but the argument has type 'wchar_t **'}}
   scanf("%p", sp);  // expected-warning{{format specifies type 'void **' but the argument has type 'char **'}}
+  wscanf("%p", sp);  // expected-warning{{format specifies type 'void **' but the argument has type 'wchar_t **'}}
 
   printf("%a", 1.0);
   scanf("%afoobar", fp);
+  wprintf("%a", 1.0);
+  wscanf("%afoobar", fp);
   printf(nullptr);
   printf(*sp); // expected-warning {{not a string literal}}
   // expected-note@-1{{treat the string as an argument to avoid this}}
+  wprintf(*sp); //