[PATCH] D105167: [analyzer] Fix HTML report deduplication.

2021-08-26 Thread Artem Dergachev via Phabricator via cfe-commits
This revision was landed with ongoing or failed builds.
This revision was automatically updated to reflect the committed changes.
Closed by commit rG73093599287c: [analyzer] Fix scan-build report 
deduplication. (authored by dergachev.a).
Herald added a project: clang.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D105167/new/

https://reviews.llvm.org/D105167

Files:
  clang/include/clang/Analysis/PathDiagnostic.h
  clang/include/clang/StaticAnalyzer/Core/AnalyzerOptions.def
  clang/include/clang/StaticAnalyzer/Core/AnalyzerOptions.h
  clang/lib/StaticAnalyzer/Core/HTMLDiagnostics.cpp
  clang/test/Analysis/analyzer-config.c
  clang/test/Analysis/scan-build/Inputs/deduplication/1.c
  clang/test/Analysis/scan-build/Inputs/deduplication/2.c
  clang/test/Analysis/scan-build/Inputs/deduplication/header.h
  clang/test/Analysis/scan-build/deduplication.test
  clang/test/Analysis/scan-build/rebuild_index/rebuild_index.test
  clang/test/Analysis/scan-build/rebuild_index/report-3.html
  clang/test/Analysis/scan-build/rebuild_index/subdirectory/report-3.html
  clang/test/Analysis/scan-build/rebuild_index/subdirectory/report-4.html
  clang/tools/scan-build/bin/scan-build

Index: clang/tools/scan-build/bin/scan-build
===
--- clang/tools/scan-build/bin/scan-build
+++ clang/tools/scan-build/bin/scan-build
@@ -14,7 +14,6 @@
 use strict;
 use warnings;
 use FindBin qw($RealBin);
-use Digest::MD5;
 use File::Basename;
 use File::Find;
 use File::Copy qw(copy);
@@ -268,27 +267,6 @@
   $ENV{'CCC_ANALYZER_HTML'} = $Dir;
 }
 
-####
-# ComputeDigest - Compute a digest of the specified file.
-####
-
-sub ComputeDigest {
-  my $FName = shift;
-  DieDiag("Cannot read $FName to compute Digest.\n") if (! -r $FName);
-
-  # Use Digest::MD5.  We don't have to be cryptographically secure.  We're
-  # just looking for duplicate files that come from a non-malicious source.
-  # We use Digest::MD5 because it is a standard Perl module that should
-  # come bundled on most systems.
-  open(FILE, $FName) or DieDiag("Cannot open $FName when computing Digest.\n");
-  binmode FILE;
-  my $Result = Digest::MD5->new->addfile(*FILE)->hexdigest;
-  close(FILE);
-
-  # Return the digest.
-  return $Result;
-}
-
 ####
 #  UpdatePrefix - Compute the common prefix of files.
 ####
@@ -374,8 +352,6 @@
 # Sometimes a source file is scanned more than once, and thus produces
 # multiple error reports.  We use a cache to solve this problem.
 
-my %AlreadyScanned;
-
 sub ScanFile {
 
   my $Index = shift;
@@ -383,19 +359,6 @@
   my $FName = shift;
   my $Stats = shift;
 
-  # Compute a digest for the report file.  Determine if we have already
-  # scanned a file that looks just like it.
-
-  my $digest = ComputeDigest("$Dir/$FName");
-
-  if (defined $AlreadyScanned{$digest}) {
-# Redundant file.  Remove it.
-unlink("$Dir/$FName");
-return;
-  }
-
-  $AlreadyScanned{$digest} = 1;
-
   # At this point the report file is not world readable.  Make it happen.
   chmod(0644, "$Dir/$FName");
 
Index: clang/test/Analysis/scan-build/rebuild_index/subdirectory/report-4.html
===
--- clang/test/Analysis/scan-build/rebuild_index/subdirectory/report-4.html
+++ /dev/null
@@ -1,8 +0,0 @@
-
-
-
-
-
-
-
-
Index: clang/test/Analysis/scan-build/rebuild_index/subdirectory/report-3.html
===
--- /dev/null
+++ clang/test/Analysis/scan-build/rebuild_index/subdirectory/report-3.html
@@ -0,0 +1,8 @@
+
+
+
+
+
+
+
+
Index: clang/test/Analysis/scan-build/rebuild_index/report-3.html
===
--- clang/test/Analysis/scan-build/rebuild_index/report-3.html
+++ /dev/null
@@ -1,8 +0,0 @@
-
-
-
-
-
-
-
-
Index: clang/test/Analysis/scan-build/rebuild_index/rebuild_index.test
===
--- clang/test/Analysis/scan-build/rebuild_index/rebuild_index.test
+++ clang/test/Analysis/scan-build/rebuild_index/rebuild_index.test
@@ -4,9 +4,8 @@
 RUN: rm -rf %t.output_dir && mkdir %t.output_dir
 RUN: cp %S/report-1.html %t.output_dir
 RUN: cp %S/report-2.html %t.output_dir
-RUN: cp %S/report-3.html %t.output_dir
 RUN: mkdir %t.output_dir/subdirectory
-RUN: cp %S/subdirectory/report-4.html %t.output_dir/subdirectory
+RUN: cp %S/subdirectory/report-3.html %t.output_dir/subdirectory
 
 RUN: %scan-build --generate-index-only %t.output_dir
 
@@ -15,16 +14,13 @@
 CHECK-FILES:  index.html
 CHECK-FILES-NEXT: report-1.html
 CHECK-FILES-NEXT: 

[PATCH] D105167: [analyzer] Fix HTML report deduplication.

2021-08-26 Thread Artem Dergachev via Phabricator via cfe-commits
NoQ added a comment.

In D105167#2949521 , @ASDenysPetrov 
wrote:

> Nice work! Unfortunately I'm not able to run tests on my Windows env, but 
> I've run you tests files manually. It works for me.
>
> P.S. BTW, is there any workarounds to make current tests supported on 
> Windows? I know there is //REQUIRES// instruction 
> (https://llvm.org/docs/TestingGuide.html#constraining-test-execution) but I 
> didn't find any sufficient description of it.

Yes, you can always remove the `REQUIRES:` directive. The problem with 
generally enabling these tests on Windows is that scan-build is written in Perl 
and the Perl interpreter isn't shipped with Windows by default. Nothing else in 
LLVM is written in Perl so requiring a Perl installation to run LLVM tests just 
for this single use case is counter-productive.




Comment at: clang/lib/StaticAnalyzer/Core/HTMLDiagnostics.cpp:286-287
+  // but the stable report filename is still more verbose.
+  // We should rename the option ("verbose" filename?) but it will break
+  // people's workflows.
+  if (DiagOpts.ShouldWriteStableReportFilename) {

ASDenysPetrov wrote:
> vsavchenko wrote:
> > Can we make a mirror for this option and mark the other one as deprecated?
> Nice idea. I'm in favor of a mirorring.
The new option is `-analyzer-config verbose-report-filename=true` and the old 
option is preserved and acts exactly the same; the help text says that it's 
deprecated.



Comment at: clang/lib/StaticAnalyzer/Core/HTMLDiagnostics.cpp:300
+
+  filename << StringRef(getIssueHash(D, PP)).substr(0, 6).str() << ".html";
+  llvm::sys::path::append(ResultPath, Directory, filename.str());

ASDenysPetrov wrote:
> Do you think 6 trimmed characters are enough to avoid collisions?
It's basically same as before. Previously these 6 digits were completely 
random, now they're chunks of MD5 hashes of the real thing which is probably as 
random.

That's 16.7 million variants. The probability of collision on 6 digits with 100 
warnings is around 0.0001%, with 1000 warnings it goes up to 3% ([[ 
https://en.wikipedia.org/wiki/Birthday_problem | Birthday Problem]]). Even 
though 1000 warnings are a real possibility on a huge project (eg., LLVM or 
WebKit), this is way above the point where you care about collisions.

I think there's nothing that prevents us from bumping it (at least in the 
non-verbose filename mode; in the verbose filename mode it may cause slightly 
more breakages in the case where it's close to hitting the file system's 
filename length limit). Bumping from 6 digits to 8 digits would reduce the 
collision probability on 1000 warnings to 0.0001%. I'll think about it a bit 
more :)


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D105167/new/

https://reviews.llvm.org/D105167

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D105167: [analyzer] Fix HTML report deduplication.

2021-08-26 Thread Artem Dergachev via Phabricator via cfe-commits
NoQ updated this revision to Diff 368917.
NoQ marked 4 inline comments as done.
NoQ added a comment.

Address review comments!


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D105167/new/

https://reviews.llvm.org/D105167

Files:
  clang/include/clang/Analysis/PathDiagnostic.h
  clang/include/clang/StaticAnalyzer/Core/AnalyzerOptions.def
  clang/include/clang/StaticAnalyzer/Core/AnalyzerOptions.h
  clang/lib/StaticAnalyzer/Core/HTMLDiagnostics.cpp
  clang/test/Analysis/analyzer-config.c
  clang/test/Analysis/scan-build/Inputs/deduplication/1.c
  clang/test/Analysis/scan-build/Inputs/deduplication/2.c
  clang/test/Analysis/scan-build/Inputs/deduplication/header.h
  clang/test/Analysis/scan-build/deduplication.test
  clang/test/Analysis/scan-build/rebuild_index/rebuild_index.test
  clang/test/Analysis/scan-build/rebuild_index/report-3.html
  clang/test/Analysis/scan-build/rebuild_index/subdirectory/report-3.html
  clang/test/Analysis/scan-build/rebuild_index/subdirectory/report-4.html
  clang/tools/scan-build/bin/scan-build

Index: clang/tools/scan-build/bin/scan-build
===
--- clang/tools/scan-build/bin/scan-build
+++ clang/tools/scan-build/bin/scan-build
@@ -14,7 +14,6 @@
 use strict;
 use warnings;
 use FindBin qw($RealBin);
-use Digest::MD5;
 use File::Basename;
 use File::Find;
 use File::Copy qw(copy);
@@ -268,27 +267,6 @@
   $ENV{'CCC_ANALYZER_HTML'} = $Dir;
 }
 
-####
-# ComputeDigest - Compute a digest of the specified file.
-####
-
-sub ComputeDigest {
-  my $FName = shift;
-  DieDiag("Cannot read $FName to compute Digest.\n") if (! -r $FName);
-
-  # Use Digest::MD5.  We don't have to be cryptographically secure.  We're
-  # just looking for duplicate files that come from a non-malicious source.
-  # We use Digest::MD5 because it is a standard Perl module that should
-  # come bundled on most systems.
-  open(FILE, $FName) or DieDiag("Cannot open $FName when computing Digest.\n");
-  binmode FILE;
-  my $Result = Digest::MD5->new->addfile(*FILE)->hexdigest;
-  close(FILE);
-
-  # Return the digest.
-  return $Result;
-}
-
 ####
 #  UpdatePrefix - Compute the common prefix of files.
 ####
@@ -374,8 +352,6 @@
 # Sometimes a source file is scanned more than once, and thus produces
 # multiple error reports.  We use a cache to solve this problem.
 
-my %AlreadyScanned;
-
 sub ScanFile {
 
   my $Index = shift;
@@ -383,19 +359,6 @@
   my $FName = shift;
   my $Stats = shift;
 
-  # Compute a digest for the report file.  Determine if we have already
-  # scanned a file that looks just like it.
-
-  my $digest = ComputeDigest("$Dir/$FName");
-
-  if (defined $AlreadyScanned{$digest}) {
-# Redundant file.  Remove it.
-unlink("$Dir/$FName");
-return;
-  }
-
-  $AlreadyScanned{$digest} = 1;
-
   # At this point the report file is not world readable.  Make it happen.
   chmod(0644, "$Dir/$FName");
 
Index: clang/test/Analysis/scan-build/rebuild_index/subdirectory/report-4.html
===
--- clang/test/Analysis/scan-build/rebuild_index/subdirectory/report-4.html
+++ /dev/null
@@ -1,8 +0,0 @@
-
-
-
-
-
-
-
-
Index: clang/test/Analysis/scan-build/rebuild_index/subdirectory/report-3.html
===
--- /dev/null
+++ clang/test/Analysis/scan-build/rebuild_index/subdirectory/report-3.html
@@ -0,0 +1,8 @@
+
+
+
+
+
+
+
+
Index: clang/test/Analysis/scan-build/rebuild_index/report-3.html
===
--- clang/test/Analysis/scan-build/rebuild_index/report-3.html
+++ /dev/null
@@ -1,8 +0,0 @@
-
-
-
-
-
-
-
-
Index: clang/test/Analysis/scan-build/rebuild_index/rebuild_index.test
===
--- clang/test/Analysis/scan-build/rebuild_index/rebuild_index.test
+++ clang/test/Analysis/scan-build/rebuild_index/rebuild_index.test
@@ -4,9 +4,8 @@
 RUN: rm -rf %t.output_dir && mkdir %t.output_dir
 RUN: cp %S/report-1.html %t.output_dir
 RUN: cp %S/report-2.html %t.output_dir
-RUN: cp %S/report-3.html %t.output_dir
 RUN: mkdir %t.output_dir/subdirectory
-RUN: cp %S/subdirectory/report-4.html %t.output_dir/subdirectory
+RUN: cp %S/subdirectory/report-3.html %t.output_dir/subdirectory
 
 RUN: %scan-build --generate-index-only %t.output_dir
 
@@ -15,16 +14,13 @@
 CHECK-FILES:  index.html
 CHECK-FILES-NEXT: report-1.html
 CHECK-FILES-NEXT: report-2.html
-
-// report-3.html is a duplicate of report-1.html so it's not present.
-CHECK-FILES-NOT:  report-3.html
 CHECK-FILES-NEXT: scanview.css
 CHECK-FILES-NEXT: sorttable.js
 

[PATCH] D105167: [analyzer] Fix HTML report deduplication.

2021-08-17 Thread Denys Petrov via Phabricator via cfe-commits
ASDenysPetrov accepted this revision.
ASDenysPetrov added a comment.

Nice work! Unfortunately I'm not able to run tests on my Windows env, but I've 
run you tests files manually. It works for me.

P.S. BTW, is there any workarounds to make current tests supported on Windows? 
I know there is //REQUIRES// instruction 
(https://llvm.org/docs/TestingGuide.html#constraining-test-execution) but I 
didn't a sufficient description of it.




Comment at: clang/lib/StaticAnalyzer/Core/HTMLDiagnostics.cpp:286-287
+  // but the stable report filename is still more verbose.
+  // We should rename the option ("verbose" filename?) but it will break
+  // people's workflows.
+  if (DiagOpts.ShouldWriteStableReportFilename) {

vsavchenko wrote:
> Can we make a mirror for this option and mark the other one as deprecated?
Nice idea. I'm in favor of a mirorring.



Comment at: clang/lib/StaticAnalyzer/Core/HTMLDiagnostics.cpp:300
+
+  filename << StringRef(getIssueHash(D, PP)).substr(0, 6).str() << ".html";
+  llvm::sys::path::append(ResultPath, Directory, filename.str());

Do you think 6 trimmed characters are enough to avoid collisions?


Repository:
  rC Clang

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D105167/new/

https://reviews.llvm.org/D105167

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D105167: [analyzer] Fix HTML report deduplication.

2021-06-30 Thread Valeriy Savchenko via Phabricator via cfe-commits
vsavchenko accepted this revision.
vsavchenko added a comment.
This revision is now accepted and ready to land.

This is incredible!  Thanks for addressing it!  I've encountered this many 
times.




Comment at: clang/lib/StaticAnalyzer/Core/HTMLDiagnostics.cpp:277
+
+  std::stringstream filename;
+  filename << "report-";

steakhal wrote:
> I don't frequently see `stringstream`s in the codebase.
> Why don't you use the llvm alternative?
> ```
> std::string s;
> llvm::raw_string_ostream os(s);
> ```
nit: whatever stream you choose, can you please start it with a capital letter? 




Comment at: clang/lib/StaticAnalyzer/Core/HTMLDiagnostics.cpp:286-287
+  // but the stable report filename is still more verbose.
+  // We should rename the option ("verbose" filename?) but it will break
+  // people's workflows.
+  if (DiagOpts.ShouldWriteStableReportFilename) {

Can we make a mirror for this option and mark the other one as deprecated?


Repository:
  rC Clang

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D105167/new/

https://reviews.llvm.org/D105167

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D105167: [analyzer] Fix HTML report deduplication.

2021-06-30 Thread Balázs Benics via Phabricator via cfe-commits
steakhal added a comment.

AFAIK from CodeChecker's side we are fine with this change. I think we are not 
using the `stable-report-filename` analyzer config either.
Aside from these, the less //Perl// the better, I guess.




Comment at: clang/lib/StaticAnalyzer/Core/HTMLDiagnostics.cpp:277
+
+  std::stringstream filename;
+  filename << "report-";

I don't frequently see `stringstream`s in the codebase.
Why don't you use the llvm alternative?
```
std::string s;
llvm::raw_string_ostream os(s);
```


Repository:
  rC Clang

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D105167/new/

https://reviews.llvm.org/D105167

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D105167: [analyzer] Fix HTML report deduplication.

2021-06-29 Thread Artem Dergachev via Phabricator via cfe-commits
NoQ created this revision.
NoQ added reviewers: vsavchenko, xazax.hun, Szelethus, ASDenysPetrov.
Herald added subscribers: manas, steakhal, martong, dkrupp, donat.nagy, 
arphaman, mikhail.ramalho, a.sidorin, rnkovacs, szepet, baloghadamsoftware.
NoQ requested review of this revision.

Folks, I deleted some code.

Some of it was... Perl code.

Not sure if anybody noticed but scan-build didn't deduplicate reports 
correctly. Its algorithm was extremely primitive (basically it hashed each html 
file and deleted files with duplicate hash). This obviously didn't work with 
cross-file reports (i.e., bug path starts in the main file but ends in a 
header) which we intended to deduplicate by the end-of-path location, because 
the main file name is included in the report and therefore affects md5. D42269 
 doesn't help either. Long story short, it 
didn't honor the modern advancements in `IssueHash` at all.

I could make scan-build read the issue hash but I think my solution is even 
more elegant so hear me out. I put the issue hash into the report filename 
instead, replacing the random section. When clang tries to emit a duplicate 
report, it'll simply fail because the file with that name is already present. 
Moreover, as per tradition, I reduce the issue hash to the first 6 characters, 
so bug report filenames look exactly like before (`report-123abc.html`), except 
now they're automatically stable and deterministic! Such truncation is, of 
course, entirely cosmetic and absolutely unnecessary but I think it's pretty 
cool.

The flag `-analyzer-config stable-report-filename=true` now becomes redundant 
because reports are stable anyway (in fact, they weren't actually stable before 
the patch even with the flag, because they depended on the race between clang 
invocations to emit the reports; I changed it to include a snippet of the issue 
hash as well instead of the race-y index). That said, I'm conflicted about 
removing the flag entirely because it also produces more verbose/readable 
filenames that people seem to enjoy. I think we should enable it by default 
instead, as soon as we make sure it doesn't produce extremely long filenames.

`scan-build --generate-index-only` no longer does deduplication, as seen on the 
updated test "`rebuild_index`". If deduplication is desired, it can typically 
be achieved with a simple

  find -name '*.html' -exec mv "{}" . \;


Repository:
  rC Clang

https://reviews.llvm.org/D105167

Files:
  clang/lib/StaticAnalyzer/Core/HTMLDiagnostics.cpp
  clang/test/Analysis/scan-build/Inputs/deduplication/1.c
  clang/test/Analysis/scan-build/Inputs/deduplication/2.c
  clang/test/Analysis/scan-build/Inputs/deduplication/header.h
  clang/test/Analysis/scan-build/deduplication.test
  clang/test/Analysis/scan-build/rebuild_index/rebuild_index.test
  clang/test/Analysis/scan-build/rebuild_index/report-3.html
  clang/test/Analysis/scan-build/rebuild_index/subdirectory/report-3.html
  clang/test/Analysis/scan-build/rebuild_index/subdirectory/report-4.html
  clang/tools/scan-build/bin/scan-build

Index: clang/tools/scan-build/bin/scan-build
===
--- clang/tools/scan-build/bin/scan-build
+++ clang/tools/scan-build/bin/scan-build
@@ -14,7 +14,6 @@
 use strict;
 use warnings;
 use FindBin qw($RealBin);
-use Digest::MD5;
 use File::Basename;
 use File::Find;
 use File::Copy qw(copy);
@@ -268,27 +267,6 @@
   $ENV{'CCC_ANALYZER_HTML'} = $Dir;
 }
 
-####
-# ComputeDigest - Compute a digest of the specified file.
-####
-
-sub ComputeDigest {
-  my $FName = shift;
-  DieDiag("Cannot read $FName to compute Digest.\n") if (! -r $FName);
-
-  # Use Digest::MD5.  We don't have to be cryptographically secure.  We're
-  # just looking for duplicate files that come from a non-malicious source.
-  # We use Digest::MD5 because it is a standard Perl module that should
-  # come bundled on most systems.
-  open(FILE, $FName) or DieDiag("Cannot open $FName when computing Digest.\n");
-  binmode FILE;
-  my $Result = Digest::MD5->new->addfile(*FILE)->hexdigest;
-  close(FILE);
-
-  # Return the digest.
-  return $Result;
-}
-
 ####
 #  UpdatePrefix - Compute the common prefix of files.
 ####
@@ -374,8 +352,6 @@
 # Sometimes a source file is scanned more than once, and thus produces
 # multiple error reports.  We use a cache to solve this problem.
 
-my %AlreadyScanned;
-
 sub ScanFile {
 
   my $Index = shift;
@@ -383,19 +359,6 @@
   my $FName = shift;
   my $Stats = shift;
 
-  # Compute a digest for the report file.  Determine if we have already
-  # scanned a file that looks just like it.
-
-  my $digest =