[PATCH] D88567: [clangd] Fix invalid UTF8 when extracting doc comments.

2020-10-01 Thread Kadir Cetinkaya via Phabricator via cfe-commits
kadircet added inline comments.



Comment at: clang-tools-extra/clangd/CodeCompletionStrings.cpp:93
+  // Clang requires source to be UTF-8, but doesn't enforce this in comments.
+  if (!llvm::json::isUTF8(Doc))
+    Doc = llvm::json::fixUTF8(Doc);

sammccall wrote:
> kadircet wrote:
> > it is always surprising to have these helpers in json library :D (just 
> > talking out loud)
> Yeah. They're just wrappers around functions from `ConvertUTF.h`.
> Do you want a patch to move them there?
nah, let's wait for the next time we use those :D


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D88567/new/

https://reviews.llvm.org/D88567

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D88567: [clangd] Fix invalid UTF8 when extracting doc comments.

2020-09-30 Thread Sam McCall via Phabricator via cfe-commits
This revision was automatically updated to reflect the committed changes.
Closed by commit rG216af81c39d1: [clangd] Fix invalid UTF8 when extracting doc 
comments. (authored by sammccall).

Changed prior to commit:
  https://reviews.llvm.org/D88567?vs=295265&id=295266#toc

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D88567/new/

https://reviews.llvm.org/D88567

Files:
  clang-tools-extra/clangd/CodeCompletionStrings.cpp
  clang-tools-extra/clangd/unittests/CodeCompletionStringsTests.cpp
  clang-tools-extra/clangd/unittests/SymbolCollectorTests.cpp


Index: clang-tools-extra/clangd/unittests/SymbolCollectorTests.cpp
===
--- clang-tools-extra/clangd/unittests/SymbolCollectorTests.cpp
+++ clang-tools-extra/clangd/unittests/SymbolCollectorTests.cpp
@@ -1606,11 +1606,11 @@
   // Extracted from boost/spirit/home/support/char_encoding/iso8859_1.hpp
   // This looks like UTF-8 and fools clang, but has high-ISO-8859-1 comments.
   const char *Header = "int PUNCT = 0;\n"
-   "int types[] = { /* \xa1 */PUNCT };";
+   "/* \xa1 */ int types[] = { /* \xa1 */PUNCT };";
   CollectorOpts.RefFilter = RefKind::All;
   CollectorOpts.RefsInHeaders = true;
   runSymbolCollector(Header, "");
-  EXPECT_THAT(Symbols, Contains(QName("types")));
+  EXPECT_THAT(Symbols, Contains(AllOf(QName("types"), Doc("\xef\xbf\xbd "))));
   EXPECT_THAT(Symbols, Contains(QName("PUNCT")));
   // Reference is stored, although offset within line is not reliable.
   EXPECT_THAT(Refs, Contains(Pair(findSymbol(Symbols, "PUNCT").ID, _)));
Index: clang-tools-extra/clangd/unittests/CodeCompletionStringsTests.cpp
===
--- clang-tools-extra/clangd/unittests/CodeCompletionStringsTests.cpp
+++ clang-tools-extra/clangd/unittests/CodeCompletionStringsTests.cpp
@@ -7,6 +7,7 @@
 //===--===//
 
 #include "CodeCompletionStrings.h"
+#include "TestTU.h"
 #include "clang/Sema/CodeCompleteConsumer.h"
 #include "gmock/gmock.h"
 #include "gtest/gtest.h"
@@ -56,6 +57,14 @@
 "Annotation: Ano\n\nIs this brief?");
 }
 
+TEST_F(CompletionStringTest, GetDeclCommentBadUTF8) {
+  // <ff> is not a valid byte here, should be replaced by encoded <U+FFFD>.
+  auto TU = TestTU::withCode("/*x\xffy*/ struct X;");
+  auto AST = TU.build();
+  EXPECT_EQ("x\xef\xbf\xbdy",
+            getDeclComment(AST.getASTContext(), findDecl(AST, "X")));
+}
+
 TEST_F(CompletionStringTest, MultipleAnnotations) {
   Builder.AddAnnotation("Ano1");
   Builder.AddAnnotation("Ano2");
Index: clang-tools-extra/clangd/CodeCompletionStrings.cpp
===
--- clang-tools-extra/clangd/CodeCompletionStrings.cpp
+++ clang-tools-extra/clangd/CodeCompletionStrings.cpp
@@ -12,6 +12,7 @@
 #include "clang/AST/RawCommentList.h"
 #include "clang/Basic/SourceManager.h"
 #include "clang/Sema/CodeCompleteConsumer.h"
+#include "llvm/Support/JSON.h"
 #include 
 #include 
 
@@ -86,7 +87,12 @@
   assert(!Ctx.getSourceManager().isLoadedSourceLocation(RC->getBeginLoc()));
   std::string Doc =
       RC->getFormattedText(Ctx.getSourceManager(), Ctx.getDiagnostics());
-  return looksLikeDocComment(Doc) ? Doc : "";
+  if (!looksLikeDocComment(Doc))
+    return "";
+  // Clang requires source to be UTF-8, but doesn't enforce this in comments.
+  if (!llvm::json::isUTF8(Doc))
+    Doc = llvm::json::fixUTF8(Doc);
+  return Doc;
 }
 
 void getSignature(const CodeCompletionString &CCS, std::string *Signature,
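
To illustrate what the patch relies on, here is a minimal standalone sketch of the "replace invalid UTF-8 with U+FFFD" behaviour. It does not use LLVM and is not the implementation behind `llvm::json::fixUTF8`; the names `validSequenceLength` and `replaceInvalidUTF8` are hypothetical, and the policy of emitting one replacement character per invalid byte is a simplifying assumption that may differ from LLVM for longer malformed sequences.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: returns the length (1-4) of the well-formed UTF-8
// sequence starting at S[I], or 0 if the bytes there are not well-formed.
static std::size_t validSequenceLength(std::string_view S, std::size_t I) {
  auto Byte = [&](std::size_t N) { return static_cast<unsigned char>(S[N]); };
  unsigned char B0 = Byte(I);
  if (B0 < 0x80)
    return 1; // plain ASCII

  std::size_t Len = 0;
  unsigned CP = 0;  // code point being decoded
  unsigned Min = 0; // smallest code point allowed for this length
  if (B0 >= 0xC2 && B0 <= 0xDF) {
    Len = 2; CP = B0 & 0x1F; Min = 0x80;
  } else if ((B0 & 0xF0) == 0xE0) {
    Len = 3; CP = B0 & 0x0F; Min = 0x800;
  } else if (B0 >= 0xF0 && B0 <= 0xF4) {
    Len = 4; CP = B0 & 0x07; Min = 0x10000;
  } else {
    return 0; // stray continuation byte, overlong lead, or 0xF5..0xFF
  }

  if (I + Len > S.size())
    return 0; // truncated sequence
  for (std::size_t K = 1; K < Len; ++K) {
    unsigned char B = Byte(I + K);
    if ((B & 0xC0) != 0x80)
      return 0; // not a continuation byte
    CP = (CP << 6) | (B & 0x3F);
  }
  // Reject overlong encodings, surrogates, and values past U+10FFFF.
  if (CP < Min || CP > 0x10FFFF || (CP >= 0xD800 && CP <= 0xDFFF))
    return 0;
  return Len;
}

// Hypothetical stand-in for the isUTF8/fixUTF8 pair: copies well-formed
// sequences through and replaces each offending byte with U+FFFD.
std::string replaceInvalidUTF8(std::string_view S) {
  std::string Out;
  Out.reserve(S.size());
  for (std::size_t I = 0; I < S.size();) {
    if (std::size_t Len = validSequenceLength(S, I)) {
      Out.append(S.data() + I, Len);
      I += Len;
    } else {
      Out += "\xEF\xBF\xBD"; // UTF-8 encoding of U+FFFD
      ++I;
    }
  }
  return Out;
}
```

Under these assumptions the sketch maps the two inputs used in the tests above (`x\xffy` and ` \xa1 `) to the same strings the tests expect, and leaves already valid text unchanged.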


[PATCH] D88567: [clangd] Fix invalid UTF8 when extracting doc comments.

2020-09-30 Thread Sam McCall via Phabricator via cfe-commits
sammccall added a comment.

In D88567#2303332, @kadircet wrote:

> thanks, LGTM!
>
> Should we also have another test for SymbolCollector, to ensure we don't 
> regress this somehow in the future?

We had a SymbolCollector test for the boost case so I modified it to add a doc 
comment.




Comment at: clang-tools-extra/clangd/CodeCompletionStrings.cpp:93
+  // Clang requires source to be UTF-8, but doesn't enforce this in comments.
+  if (!llvm::json::isUTF8(Doc))
+    Doc = llvm::json::fixUTF8(Doc);

kadircet wrote:
> it is always surprising to have these helpers in json library :D (just 
> talking out loud)
Yeah. They're just wrappers around functions from `ConvertUTF.h`.
Do you want a patch to move them there?


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D88567/new/

https://reviews.llvm.org/D88567



[PATCH] D88567: [clangd] Fix invalid UTF8 when extracting doc comments.

2020-09-30 Thread Kadir Cetinkaya via Phabricator via cfe-commits
kadircet accepted this revision.
kadircet added a comment.
This revision is now accepted and ready to land.

thanks, LGTM!

Should we also have another test for SymbolCollector, to ensure we don't 
regress this somehow in the future?




Comment at: clang-tools-extra/clangd/CodeCompletionStrings.cpp:93
+  // Clang requires source to be UTF-8, but doesn't enforce this in comments.
+  if (!llvm::json::isUTF8(Doc))
+    Doc = llvm::json::fixUTF8(Doc);

it is always surprising to have these helpers in json library :D (just talking 
out loud)


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D88567/new/

https://reviews.llvm.org/D88567



[PATCH] D88567: [clangd] Fix invalid UTF8 when extracting doc comments.

2020-09-30 Thread Sam McCall via Phabricator via cfe-commits
sammccall created this revision.
sammccall added a reviewer: kadircet.
Herald added subscribers: cfe-commits, usaxena95, arphaman.
Herald added a project: clang.
sammccall requested review of this revision.
Herald added subscribers: MaskRay, ilya-biryukov.

Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D88567

Files:
  clang-tools-extra/clangd/CodeCompletionStrings.cpp
  clang-tools-extra/clangd/unittests/CodeCompletionStringsTests.cpp


Index: clang-tools-extra/clangd/unittests/CodeCompletionStringsTests.cpp
===
--- clang-tools-extra/clangd/unittests/CodeCompletionStringsTests.cpp
+++ clang-tools-extra/clangd/unittests/CodeCompletionStringsTests.cpp
@@ -7,6 +7,7 @@
 //===--===//
 
 #include "CodeCompletionStrings.h"
+#include "TestTU.h"
 #include "clang/Sema/CodeCompleteConsumer.h"
 #include "gmock/gmock.h"
 #include "gtest/gtest.h"
@@ -56,6 +57,14 @@
 "Annotation: Ano\n\nIs this brief?");
 }
 
+TEST_F(CompletionStringTest, GetDeclCommentBadUTF8) {
+  // <ff> is not a valid byte here, should be replaced by encoded <U+FFFD>.
+  auto TU = TestTU::withCode("/*x\xffy*/ struct X;");
+  auto AST = TU.build();
+  EXPECT_EQ("x\xef\xbf\xbdy",
+            getDeclComment(AST.getASTContext(), findDecl(AST, "X")));
+}
+
 TEST_F(CompletionStringTest, MultipleAnnotations) {
   Builder.AddAnnotation("Ano1");
   Builder.AddAnnotation("Ano2");
Index: clang-tools-extra/clangd/CodeCompletionStrings.cpp
===
--- clang-tools-extra/clangd/CodeCompletionStrings.cpp
+++ clang-tools-extra/clangd/CodeCompletionStrings.cpp
@@ -12,6 +12,7 @@
 #include "clang/AST/RawCommentList.h"
 #include "clang/Basic/SourceManager.h"
 #include "clang/Sema/CodeCompleteConsumer.h"
+#include "llvm/Support/JSON.h"
 #include 
 #include 
 
@@ -86,7 +87,12 @@
   assert(!Ctx.getSourceManager().isLoadedSourceLocation(RC->getBeginLoc()));
   std::string Doc =
       RC->getFormattedText(Ctx.getSourceManager(), Ctx.getDiagnostics());
-  return looksLikeDocComment(Doc) ? Doc : "";
+  if (!looksLikeDocComment(Doc))
+    return "";
+  // Clang requires source to be UTF-8, but doesn't enforce this in comments.
+  if (!llvm::json::isUTF8(Doc))
+    Doc = llvm::json::fixUTF8(Doc);
+  return Doc;
 }
 
 void getSignature(const CodeCompletionString &CCS, std::string *Signature,

