[MediaWiki-commits] [Gerrit] wikimedia...relevanceForge[master]: Update Analysis Analysis Tools
jenkins-bot has submitted this change and it was merged. ( https://gerrit.wikimedia.org/r/396487 )

Change subject: Update Analysis Analysis Tools
......................................................................

Update Analysis Analysis Tools

Move the "return" anchor for histogram count examples in
compare_counts.pl from the top of the table to the relevant line of the
table, to avoid excess scrolling.

Write analyze_counts.pl output to a file rather than standard out.
Auto-generate the output file name based on a command-line "tag", with
default value "baseline". Optionally put output in a different
directory.

Add a flag to disable default language processing, and update README.

Update README with an example of the new analyze_counts.pl invocation,
and fix a typo in one example.

Add Serbian and Japanese Kana langdata/ configs and related highlights
to README.

Add an additional to-do to README.

Change-Id: I6cb28424657a3a9a982112fa0044e703a1783290
---
M other_tools/analysis_analysis/README.md
M other_tools/analysis_analysis/analyze_counts.pl
M other_tools/analysis_analysis/compare_counts.pl
A other_tools/analysis_analysis/compare_counts/langdata/kana.txt
A other_tools/analysis_analysis/compare_counts/langdata/serbian_c2l.txt
A other_tools/analysis_analysis/compare_counts/langdata/serbian_dual1.txt
6 files changed, 217 insertions(+), 25 deletions(-)

Approvals:
  jenkins-bot: Verified
  DCausse: Looks good to me, approved

diff --git a/other_tools/analysis_analysis/README.md b/other_tools/analysis_analysis/README.md
index d5c0994..b5a3493 100755
--- a/other_tools/analysis_analysis/README.md
+++ b/other_tools/analysis_analysis/README.md
@@ -1,6 +1,6 @@
 # Trey's Language Analyzer Analysis Tools
 
-July 2017
+December 2017
 
 These are the tools I use to do analysis of Elasticsearch language analyzers and custom analysis chains. Most of [my analysis write ups](https://www.mediawiki.org/wiki/User:TJones_%28WMF%29/Notes#Elasticsearch_Analysis_Chain_Analysis) are available on MediaWiki.org. The older ones, naturally, used less complex versions of this code—I update it whenever something weird happens!
@@ -58,13 +58,15 @@
 This is a pretty straightforward program to run:
 
-	./analyze_counts.pl input_file.txt > output_file.txt
+	./analyze_counts.pl [-t <tag>] [-d <dir>] input_file.txt
 
 * The input file is just a UTF-8–encoded text file with the text to be analyzed.
   * It is not strictly necessary, but it seems helpful to remove markup unless you are testing your analyzer's ability to handle markup.
   * I like to deduplicate lines to decrease the apparent importance of domain-specific patterns when I'm looking for general language behavior. For example, in Wikipedia, exact paragraphs don't repeat often, but headings like "See Also" and "References" certainly do. Deduping helps keep the counts for "see", "also", and "references" at more typical levels for general text.
   * It is more efficient for `analyze_counts.pl` to batch up lines of text and send them to Elasticsearch together. Up to 100 lines can be grouped together, up to around 30,000 characters (over 50K seems to cause problems). If your input file has lots of individual lines with significantly more than 10K characters per line, you could have trouble.
 * The output file, which I call a "counts file", is pretty self-explanatory, but very long, so note that there are two sections: *original tokens mapped to final tokens* and *final tokens mapped to original tokens.*
+  * The output file name will be `<input_file>.counts.<tag>.txt`. If no `<tag>` is specified with `-t`, then `baseline` is used.
+  * By default, the counts file will be written to the same directory as the input file. If you'd like it to be written to a different directory, use `-d <dir>`.
 * The output is optimized for human readability, so there's *lots* of extra whitespace.
 * Obviously, if you had another source of pre- and post-analysis tokens, you could readily reformat them into the format output by `analyze_counts.pl` and then use `compare_counts.pl` to analyze them.
 * While the program is running, dots and numbers are output to STDERR as a progress indicator. Each dot represents 1000 lines of input and the numbers are running totals of lines of input processed. On the 1MB sample files, this isn't really necessary, but when processing bigger corpora, I like it.
@@ -154,11 +156,11 @@
 Folding & Languages
 
-For a comparison analysis, enabling folding (`-f`) applies the available folding to tokens for computing Near Match Stats and detecting potential bad collisions. Default folding (`./compare_counts/langdata/default.txt`) is always available, and additional folding can be specified in a language-specific config (e.g., `./compare_counts/langdata/russian.txt`).
+For a comparison analysis, enabling folding (`-f`) applies the available folding to tokens for computing Near Match Stats and detecting potential bad collisions. Default folding (`./compare_counts/langdata/default.txt`) is
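To make the new invocation concrete, here is a usage sketch in the README's own command-line style. Everything in it is illustrative: the `<tag>` and `<dir>` placeholders are reconstructed from the diff (the archive dropped the angle-bracketed text), and `enwiki-sample.txt` is an assumed input file, not one shipped with the patch.

	# no tag: defaults to "baseline", written next to the input file
	./analyze_counts.pl enwiki-sample.txt

	# hypothetical tag and target directory
	./analyze_counts.pl -t folded -d /tmp/counts enwiki-sample.txt

Per the `<input_file>.counts.<tag>.txt` pattern above, the first command should produce a counts file tagged `baseline` alongside the input, and the second a counts file tagged `folded` under `/tmp/counts`.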
[MediaWiki-commits] [Gerrit] wikimedia...relevanceForge[master]: Update Analysis Analysis Tools
jenkins-bot has submitted this change and it was merged. ( https://gerrit.wikimedia.org/r/370286 )

Change subject: Update Analysis Analysis Tools
......................................................................

Update Analysis Analysis Tools

In analyze_counts:
- catch _analyze errors and report them
- track line count ranges for error reporting
- properly escape tabs

In compare_counts:
- improve lost/found category definitions
- improve counting to distinguish net gains and losses
- prevent some uninitialized variable warnings
- restore missing close paren in output
- tidy up whitespace in the code
- update sample output files

Incidental to working on T170423.

Change-Id: Ia6f0a22c8053d1c55aa2c463a762421ebe0e7677
---
M other_tools/analysis_analysis/analyze_counts.pl
M other_tools/analysis_analysis/compare_counts.pl
M other_tools/analysis_analysis/samples/output/en.comp.folded_self.html
M other_tools/analysis_analysis/samples/output/en.comp.unfolded_vs_folded._f.txt
M other_tools/analysis_analysis/samples/output/en.comp.unfolded_vs_folded.txt
5 files changed, 91 insertions(+), 40 deletions(-)

Approvals:
  EBernhardson: Looks good to me, approved
  jenkins-bot: Verified

diff --git a/other_tools/analysis_analysis/analyze_counts.pl b/other_tools/analysis_analysis/analyze_counts.pl
index d4cea87..6cf8b9d 100755
--- a/other_tools/analysis_analysis/analyze_counts.pl
+++ b/other_tools/analysis_analysis/analyze_counts.pl
@@ -36,6 +36,8 @@
 	chomp $line;
 	$linecnt++;
 
+	my $start_linecnt = $linecnt;
+
 	# ~30x speed up to process 100 lines at a time.
 	my $linelen = length($line);
 	foreach my $i (1..99) {
@@ -69,6 +71,11 @@
 	my $json = `curl -s localhost:9200/wiki_content/_analyze?pretty -d '{"analyzer": "text", "text" : "$escline" }'`;
 	$json = decode_utf8($json);
+
+	if ($json =~ /"error" :\s*{\s*"root_cause" :/s) {
+		print STDERR "\n_analyze error (somewhere on lines $start_linecnt-$linecnt):\n$json\n";
+		exit;
+	}
 
 	my %tokens = ();
 	my $hs_offset = 0; # offset to compensate for errors caused by high-surrogate characters
@@ -137,6 +144,7 @@
 	$rv =~ s/\x{0}/ /g;
 	$rv =~ s/(["\\])/\\$1/g;
 	$rv =~ s/'/'"'"'/g;
+	$rv =~ s/\t/\\t/g;
 	return $rv;
 }

diff --git a/other_tools/analysis_analysis/compare_counts.pl b/other_tools/analysis_analysis/compare_counts.pl
index 5c075d9..c49ebb6 100755
--- a/other_tools/analysis_analysis/compare_counts.pl
+++ b/other_tools/analysis_analysis/compare_counts.pl
@@ -276,8 +276,21 @@
 		$mapping{$final}{old} ne $mapping{$final}{new}) {
 		my $mapo = $mapping{$final}{old};
 		my $mapn = $mapping{$final}{new};
-		my $ocnt = () = ($mapo =~ /\[(.*?)\]/g);
-		my $ncnt = () = ($mapn =~ /\[(.*?)\]/g);
+		my $ocnt = 0;
+		my $ncnt = 0;
+		my $o_token_cnt = 0;
+		my $n_token_cnt = 0;
+
+		while ($mapo =~ /\[(\d+) (.*?)\]/g) {
+			$ocnt++;
+			$o_token_cnt += $1;
+		}
+
+		while ($mapn =~ /\[(\d+) (.*?)\]/g) {
+			$ncnt++;
+			$n_token_cnt += $1;
+		}
+
 		if ($config{terse} > 0) {
 			$mapo =~ s/\[\d+ /[/g;
 			$mapn =~ s/\[\d+ /[/g;
@@ -288,7 +301,17 @@
 				next if $mapo eq $mapn;
 			}
 		}
-		push @{$old_v_new_results{($ocnt > $ncnt)?'decreased':'increased'}}, $final;
+
+		my $incrdecr = 'increased';
+
+		if ($ocnt > $ncnt) {
+			$incrdecr = 'decreased';
+		}
+		elsif ($ocnt == $ncnt && $o_token_cnt > $n_token_cnt ) {
+			$incrdecr = 'decreased';
+		}
+
+		push @{$old_v_new_results{$incrdecr}}, $final;
 	}
 }
@@ -306,7 +329,7 @@
 	}
 	$statistics{count_histogram}{scalar(@terms)}++;
-
+
 	my $final_len = length($final);
 	push @{$statistics{token_length}{$final_len}}, $final;
@@ -671,7 +694,7 @@
 	my $to = $language_data{fold}{strings}{$from};
 	if (defined $to) {
 		$term =~ s/$from/$to/g;
-		# account for differences in string
+		#
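The compare_counts.pl change above is the substantive one: previously only the number of `[count token]` groups decided whether a final token's membership "increased" or "decreased", so a mapping that kept the same number of groups but lost tokens defaulted to "increased". Below is a minimal, self-contained Perl sketch of the new tie-breaking logic; the sample mapping strings are invented for illustration, in the `[count token]` format the script uses.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Invented sample data in the script's "[count token]" mapping format:
    # two original groups totalling 12 tokens vs. two new groups totalling 9.
    my $mapo = '[10 Hopes] [2 hopes]';
    my $mapn = '[7 Hopes] [2 hopes]';

    my ($ocnt, $ncnt, $o_token_cnt, $n_token_cnt) = (0, 0, 0, 0);

    # Count groups and sum the per-group token counts, as in the patch.
    while ($mapo =~ /\[(\d+) (.*?)\]/g) {
        $ocnt++;
        $o_token_cnt += $1;
    }
    while ($mapn =~ /\[(\d+) (.*?)\]/g) {
        $ncnt++;
        $n_token_cnt += $1;
    }

    # Fewer groups, or the same number of groups with fewer total tokens,
    # both count as a net decrease; everything else is an increase.
    my $incrdecr = 'increased';
    if ($ocnt > $ncnt || ($ocnt == $ncnt && $o_token_cnt > $n_token_cnt)) {
        $incrdecr = 'decreased';
    }

    print "groups: $ocnt -> $ncnt; tokens: $o_token_cnt -> $n_token_cnt; $incrdecr\n";
    # prints: groups: 2 -> 2; tokens: 12 -> 9; decreased

Under the old group-count-only test, this example would have been classified as "increased" despite losing three tokens, which is exactly the "distinguish net gains and losses" fix the commit message describes.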