[Wikidata-bugs] [Maniphest] [Commented On] T206535: wikidata weekly dumps take too long to complete

2018-12-23 Thread ArielGlenn
ArielGlenn added a comment. This is already an improvement; the weeklies finished late Saturday night instead of on Monday. This coming run should go faster, since lbzip2 will be used for all/nt and all/ttl as well. Task detail: https://phabricator.wikimedia.org/T206535

[Wikidata-bugs] [Maniphest] [Commented On] T206535: wikidata weekly dumps take too long to complete

2018-12-20 Thread ArielGlenn
ArielGlenn added a comment. As you can see, I merged the change to the rdf shell script. The one running now is all nt, so we won't see the lbzip2 use until the next part of the cron job, the truthy nt ones. I double-checked the output from the json files, and the md5sum of the gz and bz2 files are
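A check of this kind compares checksums of the decompressed contents rather than of the archives themselves, since the gz and bz2 files necessarily differ byte for byte. The following is a minimal sketch only, not the script used on the snapshot hosts: the scratch directory and sample file are hypothetical, and it falls back to plain bzip2 when lbzip2 is not installed.

```shell
#!/bin/sh
# Hypothetical scratch directory and sample data, for illustration only.
workdir=$(mktemp -d)
printf 'line one\nline two\n' > "$workdir/sample.txt"
gzip -c "$workdir/sample.txt" > "$workdir/sample.gz"

# Prefer lbzip2 when it is installed; otherwise fall back to plain bzip2.
if command -v lbzip2 >/dev/null 2>&1; then
    bz2cmd=lbzip2
else
    bz2cmd=bzip2
fi

# Recompress gz -> bz2 through a pipe, without writing the raw text to disk.
gzip -dc "$workdir/sample.gz" | "$bz2cmd" -c > "$workdir/sample.bz2"

# Checksum the decompressed contents of each archive.
gz_sum=$(gzip -dc "$workdir/sample.gz" | md5sum | cut -d' ' -f1)
bz2_sum=$(bzip2 -dc "$workdir/sample.bz2" | md5sum | cut -d' ' -f1)

if [ "$gz_sum" = "$bz2_sum" ]; then
    echo "contents match"
else
    echo "contents differ"
fi

rm -rf "$workdir"
```

Decompressing the bz2 file with the reference bzip2 tool, rather than with lbzip2 itself, also exercises the interoperability question raised earlier in the thread.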

[Wikidata-bugs] [Maniphest] [Commented On] T206535: wikidata weekly dumps take too long to complete

2018-12-20 Thread gerritbot
gerritbot added a comment. Change 480140 merged by ArielGlenn: [operations/puppet@production] use lbzip2 in wikidata rdf weeklies https://gerrit.wikimedia.org/r/480140

[Wikidata-bugs] [Maniphest] [Commented On] T206535: wikidata weekly dumps take too long to complete

2018-12-17 Thread gerritbot
gerritbot added a comment. Change 480140 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn): [operations/puppet@production] use lbzp2 in wikidata rdf weeklies https://gerrit.wikimedia.org/r/480140

[Wikidata-bugs] [Maniphest] [Commented On] T206535: wikidata weekly dumps take too long to complete

2018-12-16 Thread ArielGlenn
ArielGlenn added a comment. The first use of lbzip2 has been deployed, so that it can take effect for tomorrow's json dumps. If that goes well, I'd like to enable it for the rdf dumps during next week's run.

[Wikidata-bugs] [Maniphest] [Commented On] T206535: wikidata weekly dumps take too long to complete

2018-12-16 Thread gerritbot
gerritbot added a comment. Change 474159 merged by ArielGlenn: [operations/puppet@production] use lbzip2 for recompression of wikidata weeky json dumps https://gerrit.wikimedia.org/r/474159

[Wikidata-bugs] [Maniphest] [Commented On] T206535: wikidata weekly dumps take too long to complete

2018-12-12 Thread ArielGlenn
ArielGlenn added a comment. The lbzip2 code doesn't produce chunked bzip2 streams (such as the multistream xml pages-articles dumps). It's one stream only. I expect that is why php runs ok on it.
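To illustrate the distinction between the two layouts: a multistream file is several complete bzip2 streams concatenated into one file, whereas a single-stream file holds one stream. This is a minimal sketch assuming only the standard bzip2 command-line tool; the file names are hypothetical, and it shows neither lbzip2's actual output nor PHP's behaviour.

```shell
#!/bin/sh
# Hypothetical scratch files, for illustration only.
workdir=$(mktemp -d)

# Two independently compressed streams, concatenated into one file.
printf 'part one\n' | bzip2 -c > "$workdir/a.bz2"
printf 'part two\n' | bzip2 -c > "$workdir/b.bz2"
cat "$workdir/a.bz2" "$workdir/b.bz2" > "$workdir/multi.bz2"

# The same text compressed as one single stream.
printf 'part one\npart two\n' | bzip2 -c > "$workdir/single.bz2"

# The bzip2 CLI reads both layouts; other readers may stop after stream one.
multi_out=$(bzip2 -dc "$workdir/multi.bz2")
single_out=$(bzip2 -dc "$workdir/single.bz2")

if [ "$multi_out" = "$single_out" ]; then
    echo "same contents"
fi

rm -rf "$workdir"
```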

[Wikidata-bugs] [Maniphest] [Commented On] T206535: wikidata weekly dumps take too long to complete

2018-11-16 Thread ArielGlenn
ArielGlenn added a comment. What do folks think about this for a first step? When we're happy that these are ok, we can roll out to the rdf dumps. Right now the last dumps of the weekly (lexeme) finish on Sunday, so that's just not sustainable going forward.

[Wikidata-bugs] [Maniphest] [Commented On] T206535: wikidata weekly dumps take too long to complete

2018-11-16 Thread gerritbot
gerritbot added a comment. Change 474159 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn): [operations/puppet@production] use lbzip2 for recompression of wikidata weeky json dumps https://gerrit.wikimedia.org/r/474159

[Wikidata-bugs] [Maniphest] [Commented On] T206535: wikidata weekly dumps take too long to complete

2018-10-18 Thread ArielGlenn
ArielGlenn added a comment. In T206535#4675178, @hoo wrote: ... Have you tested importing that via php (and/or anything else that uses the libzip2 compat stuff)? script: [ariel@bigtrouble ~]$ more catbz2file.php md5sum of original (gz) file:

[Wikidata-bugs] [Maniphest] [Commented On] T206535: wikidata weekly dumps take too long to complete

2018-10-17 Thread hoo
hoo added a comment. In T206535#4675204, @Smalyshev wrote: "I just thought about this a bit and we might want to split the dumping process up into two steps:" I thought that's what is happening now? Or am I missing something? Well, currently we invoke one bash script (via cron) that does all the work

[Wikidata-bugs] [Maniphest] [Commented On] T206535: wikidata weekly dumps take too long to complete

2018-10-17 Thread Smalyshev
Smalyshev added a comment. "I just thought about this a bit and we might want to split the dumping process up into two steps:" I thought that's what is happening now? Or am I missing something?

[Wikidata-bugs] [Maniphest] [Commented On] T206535: wikidata weekly dumps take too long to complete

2018-10-17 Thread Smalyshev
Smalyshev added a comment. I think the bzip2 format is standardized, so all well-behaved tools should be interoperable. I'd try it with lbzip2 and check the recent dumps - if they work with standard tools then I think it should be fine.

[Wikidata-bugs] [Maniphest] [Commented On] T206535: wikidata weekly dumps take too long to complete

2018-10-17 Thread hoo
hoo added a comment. I just thought about this a bit and we might want to split the dumping process up into two steps: 1) Generating the dump via the maintenance script (sharded) and concatenating the shards; 2) Re-compress/format conversion (ttl <> nt)/… This way we could run 1) in serial/ limited

[Wikidata-bugs] [Maniphest] [Commented On] T206535: wikidata weekly dumps take too long to complete

2018-10-17 Thread hoo
hoo added a comment. In T206535#4673798, @ArielGlenn wrote: I've enabled the use of lbzip2 for the xml/sql dumps starting with the Oct 20th run; we could consider using this for the wikidata weeklies recompression into bz2 files, at, say, four threads (half the number of shards). As far as I can

[Wikidata-bugs] [Maniphest] [Commented On] T206535: wikidata weekly dumps take too long to complete

2018-10-17 Thread ArielGlenn
ArielGlenn added a comment.
ariel@snapshot1008:/mnt/dumpsdata/otherdumps/wikibase/wikidatawiki/20181015$ date; zcat wikidata-20181015-all-BETA.ttl.gz | lbzip2 -n 4 > /mnt/dumpsdata/temp/ariel/wikidata-20181015-all-BETA.ttl.bz2; date
Wed Oct 17 12:11:32 UTC 2018
Wed Oct 17 13:25:23 UTC 2018
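The two `date` outputs bracket the recompression of the all-BETA ttl file with four lbzip2 threads. As a small worked calculation (a sketch using plain POSIX sh and awk, with the two times copied from the output above), the elapsed time comes to just under an hour and a quarter:

```shell
#!/bin/sh
# Wall-clock times copied from the two `date` outputs (same day, UTC).
start="12:11:32"
end="13:25:23"

# Convert HH:MM:SS to seconds since midnight; awk reads "08" as decimal 8.
to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

elapsed=$(( $(to_seconds "$end") - $(to_seconds "$start") ))
printf '%dh %dm %ds\n' $((elapsed / 3600)) $((elapsed % 3600 / 60)) $((elapsed % 60))
# prints "1h 13m 51s"
```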

[Wikidata-bugs] [Maniphest] [Commented On] T206535: wikidata weekly dumps take too long to complete

2018-10-17 Thread ArielGlenn
ArielGlenn added a comment. I've enabled the use of lbzip2 for the xml/sql dumps starting with the Oct 20th run; we could consider using this for the wikidata weeklies recompression into bz2 files, at, say, four threads (half the number of shards). As far as I can tell it puts out binary-format

[Wikidata-bugs] [Maniphest] [Commented On] T206535: wikidata weekly dumps take too long to complete

2018-10-09 Thread Smalyshev
Smalyshev added a comment. Well, the dumps are big, so not sure whether it's possible to do much about it... Maybe we could reduce frequency to bi-weekly or something? Also, the longest operation right now seems to be re-zipping (gz -> bz2) of .nt dump. It takes over 1.5 days, judging by