[MediaWiki-commits] [Gerrit] Remove dumpGrepper files - change (mediawiki...parsoid)
jenkins-bot has submitted this change and it was merged.

Change subject: Remove dumpGrepper files
......................................................................


Remove dumpGrepper files

 * Can now be installed with npm i -g dumpgrepper
 * Leaving it out of devDependencies because libxml fails to compile on
   jenkins and there's no optionalDevDependencies in npm, yet.

Change-Id: If21dfcf0575b15776e388e5220d1b6cb811be2f6
---
M tests/README.md
D tests/dumpGrepPatterns/martian-endtags.sh
D tests/dumpGrepPatterns/misc.txt
D tests/dumpGrepper.js
D tests/dumpReader.js
5 files changed, 13 insertions(+), 297 deletions(-)

Approvals:
  Cscott: Looks good to me, approved
  jenkins-bot: Verified


diff --git a/tests/README.md b/tests/README.md
index 931f959..2be3662 100644
--- a/tests/README.md
+++ b/tests/README.md
@@ -70,3 +70,16 @@
     $ node client
 
 Then take a look at [the statistics](http://localhost:8001/).
+
+== Running the dumpgrepper ==
+
+The dumpgrepper utility is useful to search XML dumps for specific regexp
+patterns. With a simple regexp, an enwiki dump can be grepped in ~20 minutes.
+
+The grepper operates on actual wikitext (with XML encoding removed), so there is
+no need to complicate regexps with entities. It supports JavaScript RegExps.
+
+    $ npm install -g dumpgrepper
+
+More information on [github][https://github.com/wikimedia/dumpgrepper] and the
+[mediawiki wiki][https://www.mediawiki.org/wiki/Parsoid/DumpGrepper].
diff --git a/tests/dumpGrepPatterns/martian-endtags.sh b/tests/dumpGrepPatterns/martian-endtags.sh
deleted file mode 100755
index b0395cf..000
--- a/tests/dumpGrepPatterns/martian-endtags.sh
+++ /dev/null
@@ -1,27 +0,0 @@
-#!/bin/sh
-
-# extension tag hooks enabled at en.wikipedia.org
-exts=categorytree|charinsert|gallery|hiero|imagemap|inputbox|math|nowiki|poem|pre|ref|references|source|syntaxhighlight|timeline
-
-wiki=nowiki|includeonly|noinclude|onlyinclude
-
-# just the html5 elements
-html5s=a|abbr|address|area|article|aside|audio|b|base|bdi|bdo|blockquote|body|br|button|canvas|caption|cite|code|col|colgroup|command|data|datalist|dd|del|details|dfn|div|dl|dt|em|embed|fieldset|figcaption|figure|footer|form|h1|h2|h3|h4|h5|h6|head|header|hgroup|hr|html|i|iframe|img|input|ins|kbd|keygen|label|legend|li|link|map|mark|menu|meta|meter|nav|noscript|object|ol|optgroup|option|output|p|param|pre|progress|q|rp|rt|rtc|ruby|s|samp|script|section|select|small|source|span|strong|style|sub|summary|sup|table|tbody|td|textarea|tfoot|th|thead|time|title|tr|track|u|ul|var|video|wbr
-
-htmlold=center|font|tt
-
-normaltags=$exts|$wiki|$html5s|$htmlold
-
-#regexp=(?!\/|$exts|$htmls)[^]*.*?!--([^]+|(\/|$exts|$htmls)[^]*)*\/(?!$exts|$htmls)[^]*
-#regexp=lt;(?!/|$normaltags)[^]+gt;[^]+lt;!--[^-]*lt;/(?!$normaltags)((?!gt;).)+gt;
-regexp=/(?=[a-z])(?!$normaltags)[^]+
-#regexp=(?!\/|$exts|$htmls)[^]*
-
-#echo $regexp
-
-if [ -z $1 ];then
-echo Usage: $0 xmldump.gz
-exit 1
-fi
-
-zcat $1 | node ../dumpGrepper.js -i $regexp
diff --git a/tests/dumpGrepPatterns/misc.txt b/tests/dumpGrepPatterns/misc.txt
deleted file mode 100644
index cbbcc7f..000
--- a/tests/dumpGrepPatterns/misc.txt
+++ /dev/null
@@ -1,18 +0,0 @@
-# A collection of misc interesting regexps
-
-# ISBN links with at least one line break (https://bugzilla.wikimedia.org/show_bug.cgi?id=29025)
-(?:(?:RFC|PMID)[ \t\n\r\f]*[\n\f\r]+[ \t\n\r\f]*([0-9]+)|ISBN[ \t\n\r\f]*[\n\f\r]+[ \t\n\r\f]*(\b(?:97[89][ -]?)?(?:[0-9][ -]?){9}[0-9Xx]\b))
-
-# ISBN links with at least two line breaks (https://bugzilla.wikimedia.org/show_bug.cgi?id=29025)
-(?:(?:RFC|PMID)[ \t\n\r\f]*(?:[\n\f\r][ \t\n\r\f]*){2,}([0-9]+)|ISBN[ \t\n\r\f]*(?:[\n\f\r][ \t\n\r\f]*){2,}(\b(?:97[89][ -]?)?(?:[0-9][ -]?){9}[0-9Xx]\b))
-
-# Template:Table_cell_templates in enwiki
-{{\s*(?:rh|rh2|yes|Ya|no|Na|coming soon|bad|eliminated|Site active|Site inactive|good|yes2|won|no2|nom|sho|TBA|partial|yes-No|okay|some|any|n/a|BLACK|dunno|Unknown|Depends|Included|dropped|terminated|beta|table-experimental|free|nonfree|proprietary|needs|incorrect|no result|pending|nightly|release-candidate|[?]|unofficial|usually|rarely|sometimes|draw)\s*(?:[|]|}})
-
-# cases which aren't the simple '| {{yes}}' case.
-[^ \t|]\s*{{\s*(?:rh|rh2|yes|Ya|no|Na|coming soon|bad|eliminated|Site active|Site inactive|good|yes2|won|no2|nom|sho|TBA|partial|yes-No|okay|some|any|n/a|BLACK|dunno|Unknown|Depends|Included|dropped|terminated|beta|table-experimental|free|nonfree|proprietary|needs|incorrect|no result|pending|nightly|release-candidate|[?]|unofficial|usually|rarely|sometimes|draw)\s*(?:[|]|}})
-
-# blank lines with more than one comment (bug 41756)
-^([ ]*!--((?!--).)*--){2,}[ ]*$ (use with -m option)
-# more precise version, avoid those surrounded by newlines
-[^\n]\n([ ]*!--((?!--).)*--){2,}[ ]*\n(?!\n)
diff --git a/tests/dumpGrepper.js b/tests/dumpGrepper.js
deleted file mode 100755
index 546087b..000
--- a/tests/dumpGrepper.js
+++ /dev/null
@@ -1,122
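
The README text above says the grepper matches against decoded wikitext, so patterns can use literal characters instead of XML entities. The following is a minimal sketch of that idea only, not dumpgrepper's actual code: `decodeXmlEntities` and `grepWikitext` are hypothetical helper names, and the entity list is an assumption limited to the common XML escapes. The ISBN pattern is copied from the deleted misc.txt.

```javascript
// Hypothetical helper (not part of dumpgrepper): undo the XML escaping
// that a dump applies to page text.
function decodeXmlEntities(text) {
    return text
        .replace(/&lt;/g, '<')
        .replace(/&gt;/g, '>')
        .replace(/&quot;/g, '"')
        // Decode &amp; last, so that "&amp;lt;" becomes "&lt;" and not "<".
        .replace(/&amp;/g, '&');
}

// Hypothetical helper: return every match of `pattern` (a JavaScript
// RegExp source string) in the decoded wikitext.
function grepWikitext(encodedText, pattern) {
    const wikitext = decodeXmlEntities(encodedText);
    const re = new RegExp(pattern, 'g');
    const matches = [];
    let m;
    while ((m = re.exec(wikitext)) !== null) {
        matches.push(m[0]);
        // Avoid an infinite loop on zero-length matches.
        if (m[0].length === 0) {
            re.lastIndex++;
        }
    }
    return matches;
}

// First pattern from misc.txt: ISBN links with at least one line break.
const isbnWithLineBreak = String.raw`(?:(?:RFC|PMID)[ \t\n\r\f]*[\n\f\r]+[ \t\n\r\f]*([0-9]+)|ISBN[ \t\n\r\f]*[\n\f\r]+[ \t\n\r\f]*(\b(?:97[89][ -]?)?(?:[0-9][ -]?){9}[0-9Xx]\b))`;

// The pattern is written with literal angle brackets, not &lt; / &gt;.
console.log(grepWikitext('&lt;center&gt;x&lt;/center&gt;', '</?center>'));
console.log(grepWikitext('See ISBN\n0-306-40615-2 for details', isbnWithLineBreak));
```

Decoding once up front keeps the patterns readable, which is the benefit the README describes; a real dump reader would additionally have to stream the XML rather than hold it in one string.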
[MediaWiki-commits] [Gerrit] Remove dumpGrepper files - change (mediawiki...parsoid)
Arlolra has uploaded a new change for review.

  https://gerrit.wikimedia.org/r/180642

Change subject: Remove dumpGrepper files
......................................................................

Remove dumpGrepper files

 * Adds a dev dependency on dumpgrepper.
 * And a script so you don't have to go fishing in node_modules/.bin to
   run it. More useful in npm v2.x where you can,
   npm run dumpgrepper -- --help

Change-Id: If21dfcf0575b15776e388e5220d1b6cb811be2f6
---
M package.json
D tests/dumpGrepPatterns/martian-endtags.sh
D tests/dumpGrepPatterns/misc.txt
D tests/dumpGrepper.js
D tests/dumpReader.js
5 files changed, 4 insertions(+), 299 deletions(-)


  git pull ssh://gerrit.wikimedia.org:29418/mediawiki/services/parsoid refs/changes/42/180642/1

diff --git a/package.json b/package.json
index 5cb9bc7..569ddfa 100644
--- a/package.json
+++ b/package.json
@@ -26,7 +26,8 @@
     chai: ~1.9.1,
     colors: ~0.6.2,
     mocha: ~1.21.4,
-    supertest: 0.14.0
+    supertest: ~0.14.0,
+    dumpgrepper: ~0.1.0
   },
   main: lib/index.js,
   bin: {
@@ -36,7 +37,8 @@
     start: node api/server.js,
     mocha: mocha --opts tests/mocha/mocha.opts tests/mocha,
     parserTests: node tests/parserTests.js --wt2html --wt2wt --html2wt --html2html --selser --no-color --quiet --blacklist,
-    test: npm run parserTests npm run mocha
+    test: npm run parserTests npm run mocha,
+    dumpgrepper: dumpgrepper
   },
   repository: {
     type: git,
diff --git a/tests/dumpGrepPatterns/martian-endtags.sh b/tests/dumpGrepPatterns/martian-endtags.sh
deleted file mode 100755
index b0395cf..000
--- a/tests/dumpGrepPatterns/martian-endtags.sh
+++ /dev/null
@@ -1,27 +0,0 @@
-#!/bin/sh
-
-# extension tag hooks enabled at en.wikipedia.org
-exts=categorytree|charinsert|gallery|hiero|imagemap|inputbox|math|nowiki|poem|pre|ref|references|source|syntaxhighlight|timeline
-
-wiki=nowiki|includeonly|noinclude|onlyinclude
-
-# just the html5 elements
-html5s=a|abbr|address|area|article|aside|audio|b|base|bdi|bdo|blockquote|body|br|button|canvas|caption|cite|code|col|colgroup|command|data|datalist|dd|del|details|dfn|div|dl|dt|em|embed|fieldset|figcaption|figure|footer|form|h1|h2|h3|h4|h5|h6|head|header|hgroup|hr|html|i|iframe|img|input|ins|kbd|keygen|label|legend|li|link|map|mark|menu|meta|meter|nav|noscript|object|ol|optgroup|option|output|p|param|pre|progress|q|rp|rt|rtc|ruby|s|samp|script|section|select|small|source|span|strong|style|sub|summary|sup|table|tbody|td|textarea|tfoot|th|thead|time|title|tr|track|u|ul|var|video|wbr
-
-htmlold=center|font|tt
-
-normaltags=$exts|$wiki|$html5s|$htmlold
-
-#regexp=(?!\/|$exts|$htmls)[^]*.*?!--([^]+|(\/|$exts|$htmls)[^]*)*\/(?!$exts|$htmls)[^]*
-#regexp=lt;(?!/|$normaltags)[^]+gt;[^]+lt;!--[^-]*lt;/(?!$normaltags)((?!gt;).)+gt;
-regexp=/(?=[a-z])(?!$normaltags)[^]+
-#regexp=(?!\/|$exts|$htmls)[^]*
-
-#echo $regexp
-
-if [ -z $1 ];then
-echo Usage: $0 xmldump.gz
-exit 1
-fi
-
-zcat $1 | node ../dumpGrepper.js -i $regexp
diff --git a/tests/dumpGrepPatterns/misc.txt b/tests/dumpGrepPatterns/misc.txt
deleted file mode 100644
index cbbcc7f..000
--- a/tests/dumpGrepPatterns/misc.txt
+++ /dev/null
@@ -1,18 +0,0 @@
-# A collection of misc interesting regexps
-
-# ISBN links with at least one line break (https://bugzilla.wikimedia.org/show_bug.cgi?id=29025)
-(?:(?:RFC|PMID)[ \t\n\r\f]*[\n\f\r]+[ \t\n\r\f]*([0-9]+)|ISBN[ \t\n\r\f]*[\n\f\r]+[ \t\n\r\f]*(\b(?:97[89][ -]?)?(?:[0-9][ -]?){9}[0-9Xx]\b))
-
-# ISBN links with at least two line breaks (https://bugzilla.wikimedia.org/show_bug.cgi?id=29025)
-(?:(?:RFC|PMID)[ \t\n\r\f]*(?:[\n\f\r][ \t\n\r\f]*){2,}([0-9]+)|ISBN[ \t\n\r\f]*(?:[\n\f\r][ \t\n\r\f]*){2,}(\b(?:97[89][ -]?)?(?:[0-9][ -]?){9}[0-9Xx]\b))
-
-# Template:Table_cell_templates in enwiki
-{{\s*(?:rh|rh2|yes|Ya|no|Na|coming soon|bad|eliminated|Site active|Site inactive|good|yes2|won|no2|nom|sho|TBA|partial|yes-No|okay|some|any|n/a|BLACK|dunno|Unknown|Depends|Included|dropped|terminated|beta|table-experimental|free|nonfree|proprietary|needs|incorrect|no result|pending|nightly|release-candidate|[?]|unofficial|usually|rarely|sometimes|draw)\s*(?:[|]|}})
-
-# cases which aren't the simple '| {{yes}}' case.
-[^ \t|]\s*{{\s*(?:rh|rh2|yes|Ya|no|Na|coming soon|bad|eliminated|Site active|Site inactive|good|yes2|won|no2|nom|sho|TBA|partial|yes-No|okay|some|any|n/a|BLACK|dunno|Unknown|Depends|Included|dropped|terminated|beta|table-experimental|free|nonfree|proprietary|needs|incorrect|no result|pending|nightly|release-candidate|[?]|unofficial|usually|rarely|sometimes|draw)\s*(?:[|]|}})
-
-# blank lines with more than one comment (bug 41756)
-^([ ]*!--((?!--).)*--){2,}[ ]*$ (use with -m option)
-# more precise version, avoid those surrounded by newlines
-[^\n]\n([ ]*!--((?!--).)*--){2,}[ ]*\n(?!\n)
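
The deleted martian-endtags.sh builds one alternation of known tag names and uses a negative lookahead to find end tags that are not in it. Below is a rough JavaScript sketch of that technique under stated assumptions: the pattern appears in this message without angle brackets, so the `</` and `>` delimiters here are assumed; the tag lists are shortened for illustration; `buildUnknownEndTagRegExp` is a hypothetical name; and the `\s*>` inside the lookahead is an addition of this sketch (so that `</bogus>` is not excluded merely because it starts with the known tag `b`), not something taken from the script.

```javascript
// Abbreviated, illustrative tag lists; the deleted script uses far longer ones.
const exts = ['gallery', 'math', 'nowiki', 'pre', 'ref', 'references'];
const wiki = ['includeonly', 'noinclude', 'onlyinclude'];
const html = ['b', 'div', 'i', 'p', 'span', 'table', 'td', 'tr'];

// Hypothetical helper: build a RegExp matching end tags whose name is
// not one of `knownTags`.
function buildUnknownEndTagRegExp(knownTags) {
    const alternation = knownTags.join('|');
    // ASSUMPTION: the "</" and ">" delimiters are reconstructed, not
    // copied from the archived script.
    // The lookahead rejects a known name only when it is the whole tag
    // name, that is, when it is followed by optional whitespace and ">".
    return new RegExp(
        '</(?=[a-z])(?!(?:' + alternation + ')\\s*>)[^>]+>',
        'gi'
    );
}

const unknownEndTags = buildUnknownEndTagRegExp(exts.concat(wiki, html));
const sample = '<div>a</div><martian>b</martian></bogus>';
console.log(sample.match(unknownEndTags));
```

The `i` flag mirrors the `-i` option the script passes to dumpGrepper.js, on the assumption that the option requests case-insensitive matching.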