[MediaWiki-commits] [Gerrit] Remove dumpGrepper files - change (mediawiki...parsoid)

2014-12-18 Thread jenkins-bot (Code Review)
jenkins-bot has submitted this change and it was merged.

Change subject: Remove dumpGrepper files
..


Remove dumpGrepper files

 * Can now be installed with npm i -g dumpgrepper

 * Leaving it out of devDependencies because libxml fails to compile on
   jenkins and there's no optionalDevDependencies in npm, yet.

Change-Id: If21dfcf0575b15776e388e5220d1b6cb811be2f6
---
M tests/README.md
D tests/dumpGrepPatterns/martian-endtags.sh
D tests/dumpGrepPatterns/misc.txt
D tests/dumpGrepper.js
D tests/dumpReader.js
5 files changed, 13 insertions(+), 297 deletions(-)

Approvals:
  Cscott: Looks good to me, approved
  jenkins-bot: Verified
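
The README text added by this change says the grepper works on wikitext with the XML encoding removed, so patterns need not spell out entities. A minimal sketch of why that matters follows. `decodeXmlEntities` is a hypothetical helper written for illustration and is not taken from dumpgrepper, and the sample text is invented.

```javascript
// Hypothetical helper, NOT part of dumpgrepper: undo the five predefined
// XML entities in a single pass, so "&amp;lt;" becomes "&lt;", not "<".
function decodeXmlEntities(text) {
    const entities = { lt: '<', gt: '>', quot: '"', apos: "'", amp: '&' };
    return text.replace(/&(lt|gt|quot|apos|amp);/g, (match, name) => entities[name]);
}

// Invented sample of how wikitext might look inside an XML dump.
const encoded = 'Some text&lt;ref name=&quot;a&quot;&gt;cite&lt;/ref&gt; more';
const wikitext = decodeXmlEntities(encoded);

// Against decoded wikitext the pattern can be written naturally.
const refTag = /<ref[^>]*>/i;
// Against the raw dump the same search has to spell out the entities.
const refTagEncoded = /&lt;ref((?!&gt;).)*&gt;/i;

const plainMatch = refTag.exec(wikitext);
const encodedMatch = refTagEncoded.exec(encoded);
console.log(plainMatch[0]);   // <ref name="a">
console.log(encodedMatch[0]); // &lt;ref name=&quot;a&quot;&gt;
```

Both patterns find the same tag; the first is simply easier to read and to get right.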



diff --git a/tests/README.md b/tests/README.md
index 931f959..2be3662 100644
--- a/tests/README.md
+++ b/tests/README.md
@@ -70,3 +70,16 @@
$ node client
 
 Then take a look at [the statistics](http://localhost:8001/).
+
+== Running the dumpgrepper ==
+
+The dumpgrepper utility is useful to search XML dumps for specific regexp
+patterns. With a simple regexp, an enwiki dump can be grepped in ~20 minutes.
+
+The grepper operates on actual wikitext (with XML encoding removed), so there is
+no need to complicate regexps with entities. It supports JavaScript RegExps.
+
+   $ npm install -g dumpgrepper
+
+More information on [github][https://github.com/wikimedia/dumpgrepper] and the
+[mediawiki wiki][https://www.mediawiki.org/wiki/Parsoid/DumpGrepper].
diff --git a/tests/dumpGrepPatterns/martian-endtags.sh b/tests/dumpGrepPatterns/martian-endtags.sh
deleted file mode 100755
index b0395cf..000
--- a/tests/dumpGrepPatterns/martian-endtags.sh
+++ /dev/null
@@ -1,27 +0,0 @@
-#!/bin/sh
-
-# extension tag hooks enabled at en.wikipedia.org
-exts=categorytree|charinsert|gallery|hiero|imagemap|inputbox|math|nowiki|poem|pre|ref|references|source|syntaxhighlight|timeline
-
-wiki=nowiki|includeonly|noinclude|onlyinclude
-
-# just the html5 elements
-html5s=a|abbr|address|area|article|aside|audio|b|base|bdi|bdo|blockquote|body|br|button|canvas|caption|cite|code|col|colgroup|command|data|datalist|dd|del|details|dfn|div|dl|dt|em|embed|fieldset|figcaption|figure|footer|form|h1|h2|h3|h4|h5|h6|head|header|hgroup|hr|html|i|iframe|img|input|ins|kbd|keygen|label|legend|li|link|map|mark|menu|meta|meter|nav|noscript|object|ol|optgroup|option|output|p|param|pre|progress|q|rp|rt|rtc|ruby|s|samp|script|section|select|small|source|span|strong|style|sub|summary|sup|table|tbody|td|textarea|tfoot|th|thead|time|title|tr|track|u|ul|var|video|wbr
-
-htmlold=center|font|tt
-
-normaltags=$exts|$wiki|$html5s|$htmlold
-
-#regexp=(?!\/|$exts|$htmls)[^]*.*?!--([^]+|(\/|$exts|$htmls)[^]*)*\/(?!$exts|$htmls)[^]*
-#regexp=lt;(?!/|$normaltags)[^]+gt;[^]+lt;!--[^-]*lt;/(?!$normaltags)((?!gt;).)+gt;
-regexp=/(?=[a-z])(?!$normaltags)[^]+
-#regexp=(?!\/|$exts|$htmls)[^]*
-
-#echo $regexp
-
-if [ -z $1 ];then
-echo Usage: $0 xmldump.gz
-exit 1
-fi
-
-zcat $1 | node ../dumpGrepper.js -i $regexp
diff --git a/tests/dumpGrepPatterns/misc.txt b/tests/dumpGrepPatterns/misc.txt
deleted file mode 100644
index cbbcc7f..000
--- a/tests/dumpGrepPatterns/misc.txt
+++ /dev/null
@@ -1,18 +0,0 @@
-# A collection of misc interesting regexps
-
-# ISBN links with at least one line break (https://bugzilla.wikimedia.org/show_bug.cgi?id=29025)
-(?:(?:RFC|PMID)[ \t\n\r\f]*[\n\f\r]+[ \t\n\r\f]*([0-9]+)|ISBN[ \t\n\r\f]*[\n\f\r]+[ \t\n\r\f]*(\b(?:97[89][ -]?)?(?:[0-9][ -]?){9}[0-9Xx]\b))
-
-# ISBN links with at least two line breaks (https://bugzilla.wikimedia.org/show_bug.cgi?id=29025)
-(?:(?:RFC|PMID)[ \t\n\r\f]*(?:[\n\f\r][ \t\n\r\f]*){2,}([0-9]+)|ISBN[ \t\n\r\f]*(?:[\n\f\r][ \t\n\r\f]*){2,}(\b(?:97[89][ -]?)?(?:[0-9][ -]?){9}[0-9Xx]\b))
-
-# Template:Table_cell_templates in enwiki
-{{\s*(?:rh|rh2|yes|Ya|no|Na|coming soon|bad|eliminated|Site active|Site inactive|good|yes2|won|no2|nom|sho|TBA|partial|yes-No|okay|some|any|n/a|BLACK|dunno|Unknown|Depends|Included|dropped|terminated|beta|table-experimental|free|nonfree|proprietary|needs|incorrect|no result|pending|nightly|release-candidate|[?]|unofficial|usually|rarely|sometimes|draw)\s*(?:[|]|}})
-
-# cases which aren't the simple '| {{yes}}' case.
-[^ \t|]\s*{{\s*(?:rh|rh2|yes|Ya|no|Na|coming soon|bad|eliminated|Site active|Site inactive|good|yes2|won|no2|nom|sho|TBA|partial|yes-No|okay|some|any|n/a|BLACK|dunno|Unknown|Depends|Included|dropped|terminated|beta|table-experimental|free|nonfree|proprietary|needs|incorrect|no result|pending|nightly|release-candidate|[?]|unofficial|usually|rarely|sometimes|draw)\s*(?:[|]|}})
-
-# blank lines with more than one comment (bug 41756)
-^([ ]*<!--((?!-->).)*-->){2,}[ ]*$  (use with -m option)
-# more precise version, avoid those surrounded by newlines
-[^\n]\n([ ]*<!--((?!-->).)*-->){2,}[ ]*\n(?!\n)
diff --git a/tests/dumpGrepper.js b/tests/dumpGrepper.js
deleted file mode 100755
index 546087b..000
--- a/tests/dumpGrepper.js
+++ /dev/null
@@ -1,122 

[MediaWiki-commits] [Gerrit] Remove dumpGrepper files - change (mediawiki...parsoid)

2014-12-17 Thread Arlolra (Code Review)
Arlolra has uploaded a new change for review.

  https://gerrit.wikimedia.org/r/180642

Change subject: Remove dumpGrepper files
..

Remove dumpGrepper files

 * Adds a dev dependency on dumpgrepper.

 * And a script so you don't have to go fishing in node_modules/.bin to
   run it. More useful in npm v2.x where you can,
 npm run dumpgrepper -- --help

Change-Id: If21dfcf0575b15776e388e5220d1b6cb811be2f6
---
M package.json
D tests/dumpGrepPatterns/martian-endtags.sh
D tests/dumpGrepPatterns/misc.txt
D tests/dumpGrepper.js
D tests/dumpReader.js
5 files changed, 4 insertions(+), 299 deletions(-)


  git pull ssh://gerrit.wikimedia.org:29418/mediawiki/services/parsoid refs/changes/42/180642/1

diff --git a/package.json b/package.json
index 5cb9bc7..569ddfa 100644
--- a/package.json
+++ b/package.json
@@ -26,7 +26,8 @@
         "chai": "~1.9.1",
         "colors": "~0.6.2",
         "mocha": "~1.21.4",
-        "supertest": "0.14.0"
+        "supertest": "~0.14.0",
+        "dumpgrepper": "~0.1.0"
     },
     "main": "lib/index.js",
     "bin": {
@@ -36,7 +37,8 @@
         "start": "node api/server.js",
         "mocha": "mocha --opts tests/mocha/mocha.opts tests/mocha",
         "parserTests": "node tests/parserTests.js --wt2html --wt2wt --html2wt --html2html --selser --no-color --quiet --blacklist",
-        "test": "npm run parserTests && npm run mocha"
+        "test": "npm run parserTests && npm run mocha",
+        "dumpgrepper": "dumpgrepper"
     },
     "repository": {
         "type": "git",
diff --git a/tests/dumpGrepPatterns/martian-endtags.sh b/tests/dumpGrepPatterns/martian-endtags.sh
deleted file mode 100755
index b0395cf..000
--- a/tests/dumpGrepPatterns/martian-endtags.sh
+++ /dev/null
@@ -1,27 +0,0 @@
-#!/bin/sh
-
-# extension tag hooks enabled at en.wikipedia.org
-exts=categorytree|charinsert|gallery|hiero|imagemap|inputbox|math|nowiki|poem|pre|ref|references|source|syntaxhighlight|timeline
-
-wiki=nowiki|includeonly|noinclude|onlyinclude
-
-# just the html5 elements
-html5s=a|abbr|address|area|article|aside|audio|b|base|bdi|bdo|blockquote|body|br|button|canvas|caption|cite|code|col|colgroup|command|data|datalist|dd|del|details|dfn|div|dl|dt|em|embed|fieldset|figcaption|figure|footer|form|h1|h2|h3|h4|h5|h6|head|header|hgroup|hr|html|i|iframe|img|input|ins|kbd|keygen|label|legend|li|link|map|mark|menu|meta|meter|nav|noscript|object|ol|optgroup|option|output|p|param|pre|progress|q|rp|rt|rtc|ruby|s|samp|script|section|select|small|source|span|strong|style|sub|summary|sup|table|tbody|td|textarea|tfoot|th|thead|time|title|tr|track|u|ul|var|video|wbr
-
-htmlold=center|font|tt
-
-normaltags=$exts|$wiki|$html5s|$htmlold
-
-#regexp=(?!\/|$exts|$htmls)[^]*.*?!--([^]+|(\/|$exts|$htmls)[^]*)*\/(?!$exts|$htmls)[^]*
-#regexp=lt;(?!/|$normaltags)[^]+gt;[^]+lt;!--[^-]*lt;/(?!$normaltags)((?!gt;).)+gt;
-regexp=/(?=[a-z])(?!$normaltags)[^]+
-#regexp=(?!\/|$exts|$htmls)[^]*
-
-#echo $regexp
-
-if [ -z $1 ];then
-echo Usage: $0 xmldump.gz
-exit 1
-fi
-
-zcat $1 | node ../dumpGrepper.js -i $regexp
diff --git a/tests/dumpGrepPatterns/misc.txt b/tests/dumpGrepPatterns/misc.txt
deleted file mode 100644
index cbbcc7f..000
--- a/tests/dumpGrepPatterns/misc.txt
+++ /dev/null
@@ -1,18 +0,0 @@
-# A collection of misc interesting regexps
-
-# ISBN links with at least one line break (https://bugzilla.wikimedia.org/show_bug.cgi?id=29025)
-(?:(?:RFC|PMID)[ \t\n\r\f]*[\n\f\r]+[ \t\n\r\f]*([0-9]+)|ISBN[ \t\n\r\f]*[\n\f\r]+[ \t\n\r\f]*(\b(?:97[89][ -]?)?(?:[0-9][ -]?){9}[0-9Xx]\b))
-
-# ISBN links with at least two line breaks (https://bugzilla.wikimedia.org/show_bug.cgi?id=29025)
-(?:(?:RFC|PMID)[ \t\n\r\f]*(?:[\n\f\r][ \t\n\r\f]*){2,}([0-9]+)|ISBN[ \t\n\r\f]*(?:[\n\f\r][ \t\n\r\f]*){2,}(\b(?:97[89][ -]?)?(?:[0-9][ -]?){9}[0-9Xx]\b))
-
-# Template:Table_cell_templates in enwiki
-{{\s*(?:rh|rh2|yes|Ya|no|Na|coming soon|bad|eliminated|Site active|Site inactive|good|yes2|won|no2|nom|sho|TBA|partial|yes-No|okay|some|any|n/a|BLACK|dunno|Unknown|Depends|Included|dropped|terminated|beta|table-experimental|free|nonfree|proprietary|needs|incorrect|no result|pending|nightly|release-candidate|[?]|unofficial|usually|rarely|sometimes|draw)\s*(?:[|]|}})
-
-# cases which aren't the simple '| {{yes}}' case.
-[^ \t|]\s*{{\s*(?:rh|rh2|yes|Ya|no|Na|coming soon|bad|eliminated|Site active|Site inactive|good|yes2|won|no2|nom|sho|TBA|partial|yes-No|okay|some|any|n/a|BLACK|dunno|Unknown|Depends|Included|dropped|terminated|beta|table-experimental|free|nonfree|proprietary|needs|incorrect|no result|pending|nightly|release-candidate|[?]|unofficial|usually|rarely|sometimes|draw)\s*(?:[|]|}})
-
-# blank lines with more than one comment (bug 41756)
-^([ ]*<!--((?!-->).)*-->){2,}[ ]*$  (use with -m option)
-# more precise version, avoid those surrounded by newlines
-[^\n]\n([ ]*<!--((?!-->).)*-->){2,}[ ]*\n(?!\n)
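
The removed misc.txt notes that its pattern for blank lines holding more than one comment should be used with the -m option. The sketch below assumes the pattern targets HTML comments of the form `<!-- ... -->` and that -m corresponds to the JavaScript `m` (multiline) flag, under which `^` and `$` match at every line boundary. The sample strings and the names `multiCommentLine`, `samples` and `results` are invented for illustration.

```javascript
// Pattern from misc.txt, assuming it targets <!-- ... --> comments.
// The m flag lets ^ and $ anchor to individual lines of the article.
const multiCommentLine = /^([ ]*<!--((?!-->).)*-->){2,}[ ]*$/m;

// Invented sample articles.
const samples = {
    twoComments: 'Intro\n<!-- a --> <!-- b -->\nOutro',
    oneComment: 'Intro\n<!-- a -->\nOutro',
    commentAfterText: 'Intro <!-- a --><!-- b -->\nOutro',
};

const results = {};
for (const [name, text] of Object.entries(samples)) {
    results[name] = multiCommentLine.test(text);
}
console.log(results);
// { twoComments: true, oneComment: false, commentAfterText: false }
```

Without the `m` flag, `^` and `$` would only match at the start and end of the whole string, so none of these samples would match.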