[PATCH] D81530: [clangd] Log rather than assert on bad UTF-8.

2020-06-10 Thread Jeff Trull via Phabricator via cfe-commits
jaafar added a comment.

Thanks for the detailed analysis! I have filed 
https://github.com/boostorg/spirit/issues/612


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D81530/new/

https://reviews.llvm.org/D81530



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D81530: [clangd] Log rather than assert on bad UTF-8.

2020-06-10 Thread Sam McCall via Phabricator via cfe-commits
sammccall added a comment.

In D81530#2085725 , @jaafar wrote:

> I can confirm that this commit works equally well for the UTF-8 assertion 
> failure. Thank you!
>
> Is there some way to easily modify the source to produce a diagnostic 
> pointing to the issue's source location? I would like to file an appropriate 
> bug in Boost.


In 
https://www.boost.org/doc/libs/1_73_0/boost/spirit/home/support/char_encoding/iso8859_1.hpp
The line labeled `161  a1`, whose bytes are `"/*  \xa1  161  a1 */   
BOOST_CC_PUNCT,"` triggers it because of the `"\xa1"`.

It's not a cut and dried bug, but I do think removing the high bytes from these 
comments is a good idea, so it's probably worth filing the bug.

The C++ standard doesn't say anything about the encoding of characters on disk 
(the "input encoding" of bytes -> source character set) - it starts with the 
source character set, I think. So what do implementations do?

- clang: from reading the code, clang only supports UTF-8. It supports gcc's 
-finput-charset flag, but setting it to anything other than UTF-8 is an error!
- GCC: supports a variety of input encodings, configured with `-finput-charset` 
or locale. As such it's not really reasonable for a header designed to be 
included to require any particular value, as you can't pass a flag for just 
that header.

In practice this doesn't actually affect compilation because the bad UTF-8 
sequence in the comment is never parsed: clang just skips over it looking for 
`*/`. It probably mostly affects tools that use line/column coordinates (like 
clangd by virtue of LSP, and clang diagnostics), and tools that extract comment 
contents (doxygen et al).
But it still seems that it would be clearer to remove the literal characters 
from the source file, or write them in UTF-8.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D81530/new/

https://reviews.llvm.org/D81530



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D81530: [clangd] Log rather than assert on bad UTF-8.

2020-06-10 Thread Jeff Trull via Phabricator via cfe-commits
jaafar added a comment.

I can confirm that this commit works equally well for the UTF-8 assertion 
failure. Thank you!

Is there some way to easily modify the source to produce a diagnostic pointing 
to the issue's source location? I would like to file an appropriate bug in 
Boost.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D81530/new/

https://reviews.llvm.org/D81530



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D81530: [clangd] Log rather than assert on bad UTF-8.

2020-06-10 Thread Sam McCall via Phabricator via cfe-commits
This revision was automatically updated to reflect the committed changes.
Closed by commit rGf2c8f6e16d25: [clangd] Log rather than assert on bad UTF-8. 
(authored by sammccall).

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D81530/new/

https://reviews.llvm.org/D81530

Files:
  clang-tools-extra/clangd/SourceCode.cpp
  clang-tools-extra/clangd/unittests/SourceCodeTests.cpp
  clang-tools-extra/clangd/unittests/SymbolCollectorTests.cpp


Index: clang-tools-extra/clangd/unittests/SymbolCollectorTests.cpp
===
--- clang-tools-extra/clangd/unittests/SymbolCollectorTests.cpp
+++ clang-tools-extra/clangd/unittests/SymbolCollectorTests.cpp
@@ -1490,6 +1490,20 @@
   EXPECT_THAT(Symbols, Contains(QName("operator delete")));
 }
 
+TEST_F(SymbolCollectorTest, BadUTF8) {
+  // Extracted from boost/spirit/home/support/char_encoding/iso8859_1.hpp
+  // This looks like UTF-8 and fools clang, but has high-ISO-8859-1 comments.
+  const char *Header = "int PUNCT = 0;\n"
+   "int types[] = { /* \xa1 */PUNCT };";
+  CollectorOpts.RefFilter = RefKind::All;
+  CollectorOpts.RefsInHeaders = true;
+  runSymbolCollector(Header, "");
+  EXPECT_THAT(Symbols, Contains(QName("types")));
+  EXPECT_THAT(Symbols, Contains(QName("PUNCT")));
+  // Reference is stored, although offset within line is not reliable.
+  EXPECT_THAT(Refs, Contains(Pair(findSymbol(Symbols, "PUNCT").ID, _)));
+}
+
 } // namespace
 } // namespace clangd
 } // namespace clang
Index: clang-tools-extra/clangd/unittests/SourceCodeTests.cpp
===
--- clang-tools-extra/clangd/unittests/SourceCodeTests.cpp
+++ clang-tools-extra/clangd/unittests/SourceCodeTests.cpp
@@ -71,6 +71,21 @@
   EXPECT_EQ(lspLength(""), 1UL);
 }
 
+TEST(SourceCodeTests, lspLengthBadUTF8) {
+  // Results are not well-defined if source file isn't valid UTF-8.
+  // However we shouldn't crash or return something totally wild.
+  const char *BadUTF8[] = {"\xa0", "\xff\xff\xff\xff\xff"};
+
+  for (OffsetEncoding Encoding :
+   {OffsetEncoding::UTF8, OffsetEncoding::UTF16, OffsetEncoding::UTF32}) {
+WithContextValue UTF32(kCurrentOffsetEncoding, Encoding);
+for (const char *Bad : BadUTF8) {
+  EXPECT_GE(lspLength(Bad), 0u);
+  EXPECT_LE(lspLength(Bad), strlen(Bad));
+}
+  }
+}
+
 // The = →  below are ASCII (1 byte), BMP (3 bytes), and astral (4 bytes).
 const char File[] = R"(0:0 = 0
 1:0 → 8
Index: clang-tools-extra/clangd/SourceCode.cpp
===
--- clang-tools-extra/clangd/SourceCode.cpp
+++ clang-tools-extra/clangd/SourceCode.cpp
@@ -55,8 +55,14 @@
 // Iterates over unicode codepoints in the (UTF-8) string. For each,
 // invokes CB(UTF-8 length, UTF-16 length), and breaks if it returns true.
 // Returns true if CB returned true, false if we hit the end of string.
+//
+// If the string is not valid UTF-8, we log this error and "decode" the
+// text in some arbitrary way. This is pretty sad, but this tends to happen 
deep
+// within indexing of headers where clang misdetected the encoding, and
+// propagating the error all the way back up is (probably?) not be worth it.
 template 
 static bool iterateCodepoints(llvm::StringRef U8, const Callback ) {
+  bool LoggedInvalid = false;
   // A codepoint takes two UTF-16 code unit if it's astral (outside BMP).
   // Astral codepoints are encoded as 4 bytes in UTF-8, starting with 0xxx.
   for (size_t I = 0; I < U8.size();) {
@@ -70,9 +76,20 @@
 // This convenient property of UTF-8 holds for all non-ASCII characters.
 size_t UTF8Length = llvm::countLeadingOnes(C);
 // 0xxx is ASCII, handled above. 10xxx is a trailing byte, invalid here.
-// 1xxx is not valid UTF-8 at all. Assert because it's probably our 
bug.
-assert((UTF8Length >= 2 && UTF8Length <= 4) &&
-   "Invalid UTF-8, or transcoding bug?");
+// 1xxx is not valid UTF-8 at all, maybe some ISO-8859-*.
+if (LLVM_UNLIKELY(UTF8Length < 2 || UTF8Length > 4)) {
+  if (!LoggedInvalid) {
+elog("File has invalid UTF-8 near offset {0}: {1}", I, 
llvm::toHex(U8));
+LoggedInvalid = true;
+  }
+  // We can't give a correct result, but avoid returning something wild.
+  // Pretend this is a valid ASCII byte, for lack of better options.
+  // (Too late to get ISO-8859-* right, we've skipped some bytes already).
+  if (CB(1, 1))
+return true;
+  ++I;
+  continue;
+}
 I += UTF8Length; // Skip over all trailing bytes.
 // A codepoint takes two UTF-16 code unit if it's astral (outside BMP).
 // Astral codepoints are encoded as 4 bytes in UTF-8 (0xxx ...)


Index: clang-tools-extra/clangd/unittests/SymbolCollectorTests.cpp
===
--- 

[PATCH] D81530: [clangd] Log rather than assert on bad UTF-8.

2020-06-10 Thread Kadir Cetinkaya via Phabricator via cfe-commits
kadircet accepted this revision.
kadircet added a comment.
This revision is now accepted and ready to land.

LGTM, thanks! (i bet you rushed it already for the cherry-pick, but just wanted 
to remind again that we should :D)


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D81530/new/

https://reviews.llvm.org/D81530



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D81530: [clangd] Log rather than assert on bad UTF-8.

2020-06-09 Thread Sam McCall via Phabricator via cfe-commits
sammccall created this revision.
sammccall added a reviewer: kadircet.
Herald added subscribers: cfe-commits, usaxena95, arphaman, jkorous, MaskRay, 
ilya-biryukov.
Herald added a project: clang.
sammccall added a subscriber: jaafar.
sammccall added a comment.

@jaafar would you be able to verify this is equivalent to D74731 
 for your purposes?
If so I'd like to cherrypick this into tho 10.x branch after landing.


I don't love this change, but it prevents crashing when indexing boost headers,
and I can't think of a better practical alternative.

Based on a patch by AnakinZheng!


Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D81530

Files:
  clang-tools-extra/clangd/SourceCode.cpp
  clang-tools-extra/clangd/unittests/SourceCodeTests.cpp
  clang-tools-extra/clangd/unittests/SymbolCollectorTests.cpp


Index: clang-tools-extra/clangd/unittests/SymbolCollectorTests.cpp
===
--- clang-tools-extra/clangd/unittests/SymbolCollectorTests.cpp
+++ clang-tools-extra/clangd/unittests/SymbolCollectorTests.cpp
@@ -1490,6 +1490,20 @@
   EXPECT_THAT(Symbols, Contains(QName("operator delete")));
 }
 
+TEST_F(SymbolCollectorTest, BadUTF8) {
+  // Extracted from boost/spirit/home/support/char_encoding/iso8859_1.hpp
+  // This looks like UTF-8 and fools clang, but has high-ISO-8859-1 comments.
+  const char *Header = "int PUNCT = 0;\n"
+   "int types[] = { /* \xa1 */PUNCT };";
+  CollectorOpts.RefFilter = RefKind::All;
+  CollectorOpts.RefsInHeaders = true;
+  runSymbolCollector(Header, "");
+  EXPECT_THAT(Symbols, Contains(QName("types")));
+  EXPECT_THAT(Symbols, Contains(QName("PUNCT")));
+  // Reference is stored, although offset within line is not reliable.
+  EXPECT_THAT(Refs, Contains(Pair(findSymbol(Symbols, "PUNCT").ID, _)));
+}
+
 } // namespace
 } // namespace clangd
 } // namespace clang
Index: clang-tools-extra/clangd/unittests/SourceCodeTests.cpp
===
--- clang-tools-extra/clangd/unittests/SourceCodeTests.cpp
+++ clang-tools-extra/clangd/unittests/SourceCodeTests.cpp
@@ -71,6 +71,21 @@
   EXPECT_EQ(lspLength(""), 1UL);
 }
 
+TEST(SourceCodeTests, lspLengthBadUTF8) {
+  // Results are not well-defined if source file isn't valid UTF-8.
+  // However we shouldn't crash or return something totally wild.
+  const char *BadUTF8[] = {"\xa0", "\xff\xff\xff\xff\xff"};
+
+  for (OffsetEncoding Encoding :
+   {OffsetEncoding::UTF8, OffsetEncoding::UTF16, OffsetEncoding::UTF32}) {
+WithContextValue UTF32(kCurrentOffsetEncoding, Encoding);
+for (const char *Bad : BadUTF8) {
+  EXPECT_GE(lspLength(Bad), 0u);
+  EXPECT_LE(lspLength(Bad), strlen(Bad));
+}
+  }
+}
+
 // The = →  below are ASCII (1 byte), BMP (3 bytes), and astral (4 bytes).
 const char File[] = R"(0:0 = 0
 1:0 → 8
Index: clang-tools-extra/clangd/SourceCode.cpp
===
--- clang-tools-extra/clangd/SourceCode.cpp
+++ clang-tools-extra/clangd/SourceCode.cpp
@@ -55,8 +55,14 @@
 // Iterates over unicode codepoints in the (UTF-8) string. For each,
 // invokes CB(UTF-8 length, UTF-16 length), and breaks if it returns true.
 // Returns true if CB returned true, false if we hit the end of string.
+//
+// If the string is not valid UTF-8, we log this error and "decode" the
+// text in some arbitrary way. This is pretty sad, but this tends to happen 
deep
+// within indexing of headers where clang misdetected the encoding, and
+// propagating the error all the way back up is (probably?) not be worth it.
 template 
 static bool iterateCodepoints(llvm::StringRef U8, const Callback ) {
+  bool LoggedInvalid = false;
   // A codepoint takes two UTF-16 code unit if it's astral (outside BMP).
   // Astral codepoints are encoded as 4 bytes in UTF-8, starting with 0xxx.
   for (size_t I = 0; I < U8.size();) {
@@ -70,9 +76,20 @@
 // This convenient property of UTF-8 holds for all non-ASCII characters.
 size_t UTF8Length = llvm::countLeadingOnes(C);
 // 0xxx is ASCII, handled above. 10xxx is a trailing byte, invalid here.
-// 1xxx is not valid UTF-8 at all. Assert because it's probably our 
bug.
-assert((UTF8Length >= 2 && UTF8Length <= 4) &&
-   "Invalid UTF-8, or transcoding bug?");
+// 1xxx is not valid UTF-8 at all, maybe some ISO-8859-*.
+if (LLVM_UNLIKELY(UTF8Length < 2 || UTF8Length > 4)) {
+  if (!LoggedInvalid) {
+elog("File has invalid UTF-8 near offset {0}: {1}", I, 
llvm::toHex(U8));
+LoggedInvalid = true;
+  }
+  // We can't give a correct result, but avoid returning something wild.
+  // Pretend this is a valid ASCII byte, for lack of better options.
+  // (Too late to get ISO-8859-* right, we've skipped some bytes already).
+  if (CB(1, 1))
+return true;
+  ++I;
+  continue;
+}
 

[PATCH] D81530: [clangd] Log rather than assert on bad UTF-8.

2020-06-09 Thread Sam McCall via Phabricator via cfe-commits
sammccall added a subscriber: jaafar.
sammccall added a comment.

@jaafar would you be able to verify this is equivalent to D74731 
 for your purposes?
If so I'd like to cherrypick this into tho 10.x branch after landing.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D81530/new/

https://reviews.llvm.org/D81530



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits