[MediaWiki-commits] [Gerrit] wikimedia...relevanceForge[master]: Update Analysis Analysis Tools

2017-12-11 Thread jenkins-bot (Code Review)
jenkins-bot has submitted this change and it was merged. ( 
https://gerrit.wikimedia.org/r/396487 )

Change subject: Update Analysis Analysis Tools
..


Update Analysis Analysis Tools

Move "return" anchor for histogram count examples in compare_counts.pl
from the top of table to relevant line of the table to avoid excess
scrolling.

Write analyze_counts.pl output to file rather than standard out.
Auto-generate output file name based on a command line "tag" with
default value "baseline". Optionally put output in a different
directory.
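
As a rough illustration of that naming scheme, here is a hypothetical sketch (not the code in this change; the `-t`/`-d` option letters come from the README update below, and the derived file-name pattern and helper details are assumptions):

    #!/usr/bin/perl
    # Hypothetical sketch only; not taken from analyze_counts.pl.
    use strict;
    use warnings;
    use File::Basename qw(fileparse);
    use Getopt::Std qw(getopts);

    my %opt;
    getopts('t:d:', \%opt);                              # -t <tag>, -d <dir>
    my $tag = defined $opt{t} ? $opt{t} : 'baseline';    # default tag
    my $infile = shift @ARGV
        or die "usage: $0 [-t <tag>] [-d <dir>] input_file.txt\n";

    # Derive <input_file>.counts.<tag>.txt, by default next to the input file.
    my ($base, $indir) = fileparse($infile, qr/\.[^.]+$/);
    my $outdir = defined $opt{d} ? $opt{d} : $indir;
    $outdir .= '/' unless $outdir =~ m{/$};
    my $outfile = "$outdir$base.counts.$tag.txt";

    open my $out, '>:encoding(UTF-8)', $outfile
        or die "cannot write $outfile: $!\n";
    print STDERR "writing counts to $outfile\n";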

Add flag to disable default language processing, and update README.

Update README with example of new analyze_counts.pl invocation, and
fix a typo in one example.

Add Serbian and Japanese Kana langdata/ configs and related highlights
to README.

Add additional to-do to README.

Change-Id: I6cb28424657a3a9a982112fa0044e703a1783290
---
M other_tools/analysis_analysis/README.md
M other_tools/analysis_analysis/analyze_counts.pl
M other_tools/analysis_analysis/compare_counts.pl
A other_tools/analysis_analysis/compare_counts/langdata/kana.txt
A other_tools/analysis_analysis/compare_counts/langdata/serbian_c2l.txt
A other_tools/analysis_analysis/compare_counts/langdata/serbian_dual1.txt
6 files changed, 217 insertions(+), 25 deletions(-)

Approvals:
  jenkins-bot: Verified
  DCausse: Looks good to me, approved



diff --git a/other_tools/analysis_analysis/README.md b/other_tools/analysis_analysis/README.md
index d5c0994..b5a3493 100755
--- a/other_tools/analysis_analysis/README.md
+++ b/other_tools/analysis_analysis/README.md
@@ -1,6 +1,6 @@
 # Trey's Language Analyzer Analysis Tools
 
-July 2017
+December 2017
 
 These are the tools I use to do analysis of Elasticsearch language analyzers 
and custom analysis chains. Most of [my analysis write 
ups](https://www.mediawiki.org/wiki/User:TJones_%28WMF%29/Notes#Elasticsearch_Analysis_Chain_Analysis)
 are available on MediaWiki.org. The older ones, naturally, used less complex 
versions of this code—I update it whenever something weird happens!
 
@@ -58,13 +58,15 @@
 
 This is a pretty straightforward program to run:
 
-./analyze_counts.pl input_file.txt > output_file.txt
+./analyze_counts.pl [-t <tag>] [-d <dir>] input_file.txt
 
 * The input file is just a UTF-8–encoded text file with the text to be 
analyzed.
   * It is not strictly necessary, but it seems helpful to remove markup unless 
you are testing your analyzer's ability to handle markup.
   * I like to deduplicate lines to decrease the apparent importance of 
domain-specific patterns when I'm looking for general language behavior. For 
example, in Wikipedia, exact paragraphs don't repeat often, but headings like 
"See Also" and "References" certainly do. Deduping helps keep the counts for 
"see", "also", and "references" at more typical levels for general text.
   * It is more efficient for `analyze_counts.pl` to batch up lines of text and 
send them to Elasticsearch together. Up to 100 lines can be grouped together, 
up to around 30,000 characters (over 50K seems to cause problems). If your 
input file has lots of individual lines with significantly more than 10K 
characters per line, you could have trouble.
 * The output file, which I call a "counts file", is pretty self-explanatory, 
but very long, so note that there are two sections: *original tokens mapped to 
final tokens* and *final tokens mapped to original tokens.*
+  * The output file name will be `<input_file>.counts.<tag>.txt`. If no `<tag>` is specified with `-t`, then `baseline` is used.
+  * By default, the counts file will be written to the same directory as the input file. If you'd like it written to a different directory, use `-d <dir>`.
   * The output is optimized for human readability, so there's *lots* of extra 
whitespace.
   * Obviously, if you had another source of pre- and post-analysis tokens, you 
could readily reformat them into the format output by `analyze_counts.pl` and 
then use `compare_counts.pl` to analyze them.
 * While the program is running, dots and numbers are output to STDERR as a 
progress indicator. Each dot represents 1000 lines of input and the numbers are 
running totals of lines of input processed. On the 1MB sample files, this isn't 
really necessary, but when processing bigger corpora, I like it.
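
A minimal sketch of the line-batching idea from the bullet above (illustrative only, not the actual analyze_counts.pl code; the 100-line and ~30,000-character limits come from the description, everything else is assumed):

    # Illustrative batching loop: group up to 100 lines, capped at roughly
    # 30,000 characters, then send one _analyze request per batch.
    use strict;
    use warnings;

    my @batch;
    my $batch_len = 0;

    sub flush_batch {
        return unless @batch;
        my $text = join ' ', @batch;   # real code would POST this to _analyze
        @batch = ();
        $batch_len = 0;
    }

    while (my $line = <STDIN>) {
        chomp $line;
        push @batch, $line;
        $batch_len += length $line;
        flush_batch() if @batch >= 100 || $batch_len >= 30_000;
    }
    flush_batch();
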
@@ -154,11 +156,11 @@
 
  Folding & Languages
 
-For a comparison analysis, enabling folding (`-f`) applies the available 
folding to tokens for computing Near Match Stats and detecting potential bad 
collisions. Default folding (`./compare_counts/langdata/default.txt`) is always 
available, and additional folding can be specified in a language-specific 
config (e.g., `./compare_counts/langdata/russian.txt`).
+For a comparison analysis, enabling folding (`-f`) applies the available 
folding to tokens for computing Near Match Stats and detecting potential bad 
collisions. Default folding (`./compare_counts/langdata/default.txt`) is 

[MediaWiki-commits] [Gerrit] wikimedia...relevanceForge[master]: Update Analysis Analysis Tools

2017-12-08 Thread Tjones (Code Review)
Tjones has uploaded a new change for review. ( 
https://gerrit.wikimedia.org/r/396487 )

Change subject: Update Analysis Analysis Tools
..

  git pull ssh://gerrit.wikimedia.org:29418/wikimedia/discovery/relevanceForge refs/changes/87/396487/1

[MediaWiki-commits] [Gerrit] wikimedia...relevanceForge[master]: Update Analysis Analysis Tools

2017-08-28 Thread jenkins-bot (Code Review)
jenkins-bot has submitted this change and it was merged. ( 
https://gerrit.wikimedia.org/r/370286 )

Change subject: Update Analysis Analysis Tools
..


Update Analysis Analysis Tools

In analyze_counts:
- catch _analyze errors and report them
- track line count ranges for error reporting
- properly escape tabs

In compare_counts:
- improve lost/found category definitions
- improve counting to distinguish net gains and losses
- prevent some uninitialized variable warnings
- restore missing close paren in output
- tidy up whitespace in the code
- update sample output files

Incidental to working on T170423.
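
The "net gains and losses" counting item above corresponds to the compare_counts.pl hunk further down; here is a self-contained sketch of that logic, with made-up `[<count> <token>]` mapping strings:

    # Sketch of the increased/decreased classification (see the diff below):
    # count the "[<count> <token>]" groups in the old and new mappings, and
    # use the summed token counts to break ties.
    use strict;
    use warnings;

    sub count_groups {
        my ($mapping) = @_;
        my ($groups, $tokens) = (0, 0);
        while ($mapping =~ /\[(\d+) (.*?)\]/g) {
            $groups++;
            $tokens += $1;
        }
        return ($groups, $tokens);
    }

    my ($ocnt, $o_token_cnt) = count_groups('[2 foo][1 bar]');   # old (made up)
    my ($ncnt, $n_token_cnt) = count_groups('[2 foo][2 bar]');   # new (made up)

    my $incrdecr = 'increased';
    $incrdecr = 'decreased'
        if $ocnt > $ncnt || ($ocnt == $ncnt && $o_token_cnt > $n_token_cnt);

    print "$ocnt vs $ncnt groups, $o_token_cnt vs $n_token_cnt tokens: $incrdecr\n";
    # prints: 2 vs 2 groups, 3 vs 4 tokens: increased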

Change-Id: Ia6f0a22c8053d1c55aa2c463a762421ebe0e7677
---
M other_tools/analysis_analysis/analyze_counts.pl
M other_tools/analysis_analysis/compare_counts.pl
M other_tools/analysis_analysis/samples/output/en.comp.folded_self.html
M other_tools/analysis_analysis/samples/output/en.comp.unfolded_vs_folded._f.txt
M other_tools/analysis_analysis/samples/output/en.comp.unfolded_vs_folded.txt
5 files changed, 91 insertions(+), 40 deletions(-)

Approvals:
  EBernhardson: Looks good to me, approved
  jenkins-bot: Verified



diff --git a/other_tools/analysis_analysis/analyze_counts.pl b/other_tools/analysis_analysis/analyze_counts.pl
index d4cea87..6cf8b9d 100755
--- a/other_tools/analysis_analysis/analyze_counts.pl
+++ b/other_tools/analysis_analysis/analyze_counts.pl
@@ -36,6 +36,8 @@
chomp $line;
$linecnt++;
 
+   my $start_linecnt = $linecnt;
+
# ~30x speed up to process 100 lines at a time.
my $linelen = length($line);
foreach my $i (1..99) {
@@ -69,6 +71,11 @@
 
my $json = `curl -s localhost:9200/wiki_content/_analyze?pretty -d '{"analyzer": "text", "text" : "$escline" }'`;
$json = decode_utf8($json);
+
+   if ($json =~ /"error" :\s*{\s*"root_cause" :/s) {
+   print STDERR "\n_analyze error (somewhere on lines $start_linecnt-$linecnt):\n$json\n";
+   exit;
+   }
 
my %tokens = ();
my $hs_offset = 0;  # offset to compensate for errors caused by high-surrogate characters
@@ -137,6 +144,7 @@
$rv =~ s/\x{0}/ /g;
$rv =~ s/(["\\])/\\$1/g;
$rv =~ s/'/'"'"'/g;
+   $rv =~ s/\t/\\t/g;
return $rv;
 }
 
diff --git a/other_tools/analysis_analysis/compare_counts.pl b/other_tools/analysis_analysis/compare_counts.pl
index 5c075d9..c49ebb6 100755
--- a/other_tools/analysis_analysis/compare_counts.pl
+++ b/other_tools/analysis_analysis/compare_counts.pl
@@ -276,8 +276,21 @@
$mapping{$final}{old} ne $mapping{$final}{new}) 
{
my $mapo = $mapping{$final}{old};
my $mapn = $mapping{$final}{new};
-   my $ocnt = () = ($mapo =~ /\[(.*?)\]/g);
-   my $ncnt = () = ($mapn =~ /\[(.*?)\]/g);
+   my $ocnt = 0;
+   my $ncnt = 0;
+   my $o_token_cnt = 0;
+   my $n_token_cnt = 0;
+
+   while ($mapo =~ /\[(\d+) (.*?)\]/g) {
+   $ocnt++;
+   $o_token_cnt += $1;
+   }
+
+   while ($mapn =~ /\[(\d+) (.*?)\]/g) {
+   $ncnt++;
+   $n_token_cnt += $1;
+   }
+
if ($config{terse} > 0) {
$mapo =~ s/\[\d+ /[/g;
$mapn =~ s/\[\d+ /[/g;
@@ -288,7 +301,17 @@
next if $mapo eq $mapn;
}
}
-   push @{$old_v_new_results{($ocnt > $ncnt)?'decreased':'increased'}}, $final;
+
+   my $incrdecr = 'increased';
+
+   if ($ocnt > $ncnt) {
+   $incrdecr = 'decreased';
+   }
+   elsif ($ocnt == $ncnt && $o_token_cnt > $n_token_cnt ) {
+   $incrdecr = 'decreased';
+   }
+
+   push @{$old_v_new_results{$incrdecr}}, $final;
}
}
 
@@ -306,7 +329,7 @@
}
 
$statistics{count_histogram}{scalar(@terms)}++;
-   
+
my $final_len = length($final);
push @{$statistics{token_length}{$final_len}}, $final;
 
@@ -671,7 +694,7 @@
my $to = $language_data{fold}{strings}{$from};
if (defined $to) {
$term =~ s/$from/$to/g;
-   # account for differences in string 
+   # 

[MediaWiki-commits] [Gerrit] wikimedia...relevanceForge[master]: Update Analysis Analysis Tools

2017-08-04 Thread Tjones (Code Review)
Tjones has uploaded a new change for review. ( 
https://gerrit.wikimedia.org/r/370286 )

Change subject: Update Analysis Analysis Tools
..

  git pull ssh://gerrit.wikimedia.org:29418/wikimedia/discovery/relevanceForge refs/changes/86/370286/1
