[Wikidata-bugs] [Maniphest] [Commented On] T178047: Investigate why wikidata abstracts dumps are so large

2018-04-30 Thread ArielGlenn
ArielGlenn added a comment. I'd like to wait for the first run. I'll retitle the task then too :-)

[Wikidata-bugs] [Maniphest] [Commented On] T178047: Investigate why wikidata abstracts dumps are so large

2018-04-30 Thread hoo
hoo added a comment. @ArielGlenn Do we want to close this, yet? Or wait for the first new dumps?

[Wikidata-bugs] [Maniphest] [Commented On] T178047: Investigate why wikidata abstracts dumps are so large

2018-04-28 Thread gerritbot
gerritbot added a comment. Change 416409 merged by jenkins-bot: [mediawiki/extensions/ActiveAbstract@master] don't try to abstract things that aren't text or wikitext https://gerrit.wikimedia.org/r/416409

[Wikidata-bugs] [Maniphest] [Commented On] T178047: Investigate why wikidata abstracts dumps are so large

2018-04-04 Thread ArielGlenn
ArielGlenn added a comment. Email sent to xmldatadumps-l and wikitech-l.

[Wikidata-bugs] [Maniphest] [Commented On] T178047: Investigate why wikidata abstracts dumps are so large

2018-04-04 Thread hoo
hoo added a comment. In T178047#4097208, @ArielGlenn wrote: "Well I don't mind a waiting period, let's agree on... one week? It will probably take longer than that for it to get merged and rolled out anyways. But we need an eta before I send the email :-)" This is probably somewhere in between a

[Wikidata-bugs] [Maniphest] [Commented On] T178047: Investigate why wikidata abstracts dumps are so large

2018-04-02 Thread ArielGlenn
ArielGlenn added a comment. Well I don't mind a waiting period, let's agree on... one week? It will probably take longer than that for it to get merged and rolled out anyways. But we need an eta before I send the email :-)

[Wikidata-bugs] [Maniphest] [Commented On] T178047: Investigate why wikidata abstracts dumps are so large

2018-04-02 Thread hoo
hoo added a comment. In T178047#4091341, @ArielGlenn wrote: "Looking at this list I think we are good to go." Definitely. I still think this should be announced, but given the very limited scope we might even get away without a waiting period before applying the change?

[Wikidata-bugs] [Maniphest] [Commented On] T178047: Investigate why wikidata abstracts dumps are so large

2018-03-29 Thread ArielGlenn
ArielGlenn added a comment. On ms1001 in the public dumps dir I did this: list=*wik*; for dirname in $list; do echo "doing $dirname"; zcat "${dirname}/20180320/${dirname}-20180320-stub-articles.xml.gz" | grep -A16 '<ns>0</ns>' | grep '<model>' | grep -v wikitext | grep -v wikibase-item | grep -v

[Wikidata-bugs] [Maniphest] [Commented On] T178047: Investigate why wikidata abstracts dumps are so large

2018-03-22 Thread brion
brion added a comment. In T178047#4073991, @ArielGlenn wrote: In T178047#4073899, @brion wrote: "Not sure offhand about the schema; Yahoo's old documentation seems to have vanished from the net. (Probably on the wayback machine but I can't find a URL reference)" We don't have a schema in our

[Wikidata-bugs] [Maniphest] [Commented On] T178047: Investigate why wikidata abstracts dumps are so large

2018-03-22 Thread ArielGlenn
ArielGlenn added a comment. In T178047#4073899, @brion wrote: "Not sure offhand about the schema; Yahoo's old documentation seems to have vanished from the net. (Probably on the wayback machine but I can't find a URL reference)" We don't have a schema in our repos anywhere that must be updated

[Wikidata-bugs] [Maniphest] [Commented On] T178047: Investigate why wikidata abstracts dumps are so large

2018-03-22 Thread brion
brion added a comment. Not sure offhand about the schema; Yahoo's old documentation seems to have vanished from the net. (Probably on the wayback machine but I can't find a URL reference) Ideally, I think we'd want a way for the content handler to provide a text extract that can be used here.

[Wikidata-bugs] [Maniphest] [Commented On] T178047: Investigate why wikidata abstracts dumps are so large

2018-03-19 Thread ArielGlenn
ArielGlenn added a comment. Does anyone know where the schema for these xml files lives? I've grepped around in mw core and in the abstract extension repos and found nothing.
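(For reference, each page entry in the abstracts files looks roughly like the following; this is reconstructed from the published dump output rather than from any schema file, so element details may be off:)

<feed>
  <doc>
    <title>Wikipedia: Example article</title>
    <url>https://en.wikipedia.org/wiki/Example_article</url>
    <abstract>First sentence or two of the article text.</abstract>
    <links>
      <sublink linktype="nav"><anchor>Section heading</anchor><link>https://en.wikipedia.org/wiki/Example_article#Section_heading</link></sublink>
    </links>
  </doc>
</feed>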

[Wikidata-bugs] [Maniphest] [Commented On] T178047: Investigate why wikidata abstracts dumps are so large

2018-03-13 Thread ArielGlenn
ArielGlenn added a comment. Actually, is this any different than having 'deleted="deleted"' as the attribute when a revision, contributor or comment is no longer available? AFAIK that's not a standard attribute or anything, it's just in our schema. Which reminds me, the change above needs to go
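(For illustration: the attribute mentioned above shows up in the page-content XML dumps when revision metadata has been suppressed, for example as <contributor deleted="deleted" /> or <comment deleted="deleted" />.)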

[Wikidata-bugs] [Maniphest] [Commented On] T178047: Investigate why wikidata abstracts dumps are so large

2018-03-13 Thread ArielGlenn
ArielGlenn added a comment. Well, on wikidatawiki in beta, the new code generates a whole lot of empty <abstract /> tags, as we expect; on other wikis it produces the usual output. So that looks good. Now trying to find out about standard xml libraries.

[Wikidata-bugs] [Maniphest] [Commented On] T178047: Investigate why wikidata abstracts dumps are so large

2018-03-12 Thread ArielGlenn
ArielGlenn added a comment. I've updated it according to your second suggestion (untested though). I prefer to have empty abstract tags in there rather than skip them completely. The file ought to compress down to something pretty tiny at least!
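(With the patch applied as described, a Wikidata entry in the abstracts file should presumably reduce to something like the following, using the <doc> structure sketched earlier, which is why the file should compress down so well:)

<doc>
  <title>Wikidata: Q42</title>
  <url>https://www.wikidata.org/wiki/Q42</url>
  <abstract></abstract>
  <links></links>
</doc>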

[Wikidata-bugs] [Maniphest] [Commented On] T178047: Investigate why wikidata abstracts dumps are so large

2018-03-05 Thread hoo
hoo added a comment. In T178047#4023267, @ArielGlenn wrote: "I'm tempted to just turn off abstracts for Wikidata altogether, since every item in there is a Qxxx with junk for the abstract." If this is just NS0 (or content namespaces… which are all Wikibase entity namespaces), this definitely

[Wikidata-bugs] [Maniphest] [Commented On] T178047: Investigate why wikidata abstracts dumps are so large

2018-03-05 Thread ArielGlenn
ArielGlenn added a comment. I'm tempted to just turn off abstracts for Wikidata altogether, since every item in there is a Qxxx with junk for the abstract. But your approach is better, in case similar content creeps into other projects. What do you think about

[Wikidata-bugs] [Maniphest] [Commented On] T178047: Investigate why wikidata abstracts dumps are so large

2018-03-05 Thread gerritbot
gerritbot added a comment. Change 416409 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn): [mediawiki/extensions/ActiveAbstract@master] don't try to abstract things that aren't text or wikitext https://gerrit.wikimedia.org/r/416409

[Wikidata-bugs] [Maniphest] [Commented On] T178047: Investigate why wikidata abstracts dumps are so large

2017-10-25 Thread hoo
hoo added a comment. Relevant code: https://github.com/wikimedia/mediawiki-extensions-ActiveAbstract/blob/master/AbstractFilter.php#L131 I'm not sure how Wikidata abstracts could be meaningful… I can make a (rather bold) suggestion to just drop an empty string in case we're dealing with non
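(A minimal sketch of the kind of check being suggested here, not the actual AbstractFilter code; the function name and parameters are made up for illustration:)

<?php
// Sketch only: skip abstract generation for anything that is not plain text
// or wikitext, and emit an empty string instead.
function abstractTextFor( $contentModel, $rawText ) {
    if ( $contentModel !== 'wikitext' && $contentModel !== 'text' ) {
        // e.g. 'wikibase-item' on wikidatawiki: don't try to summarize
        // serialized entity JSON, just return an empty abstract.
        return '';
    }
    // Real code would go on to trim $rawText down to a short abstract.
    return $rawText;
}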