ArielGlenn added a comment.
This is already an improvement; the weeklies finished late Saturday night instead of on Monday. This coming run should go faster, since lbzip2 will be used for all/nt and all/ttl as well.
TASK DETAIL: https://phabricator.wikimedia.org/T206535
ArielGlenn added a comment.
As you can see, I merged the change to the rdf shell script. The one running now is all/nt, so we won't see lbzip2 in use until the next part of the cron job, the truthy nt ones. I double-checked the output from the json files, and the md5sums of the gz and bz2 files are
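The double-check described above can be sketched as commands: the gz and bz2 copies of a dump should decompress to byte-identical content. File names here are hypothetical stand-ins for the real dump files; production uses lbzip2, but its output is standard bzip2, so plain bzip2 stands in so the sketch runs anywhere.

```shell
# Compress the same content two ways, then compare checksums of the
# decompressed streams (not of the compressed files, which differ).
printf 'wikidata dump content\n' > sample.json
gzip -c sample.json > sample.json.gz
bzip2 -c sample.json > sample.json.bz2        # production: lbzip2 -n <threads>
sum_gz=$(zcat sample.json.gz | md5sum | cut -d ' ' -f 1)
sum_bz2=$(bzcat sample.json.bz2 | md5sum | cut -d ' ' -f 1)
[ "$sum_gz" = "$sum_bz2" ] && echo "checksums match"
```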
gerritbot added a comment.
Change 480140 merged by ArielGlenn:
[operations/puppet@production] use lbzip2 in wikidata rdf weeklies
https://gerrit.wikimedia.org/r/480140
gerritbot added a comment.
Change 480140 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] use lbzip2 in wikidata rdf weeklies
https://gerrit.wikimedia.org/r/480140
ArielGlenn added a comment.
The first use of lbzip2 has been deployed, so that it can take effect for tomorrow's json dumps. If that goes well, I'd like to enable it for the rdf dumps during next week's run.
gerritbot added a comment.
Change 474159 merged by ArielGlenn:
[operations/puppet@production] use lbzip2 for recompression of wikidata weekly json dumps
https://gerrit.wikimedia.org/r/474159
ArielGlenn added a comment.
The lbzip2 code doesn't produce chunked bzip2 streams (like e.g. the multistream xml pages-articles dumps); it produces a single stream only. I expect that is why php runs ok on it.
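A tiny demonstration of the single-stream vs. multistream distinction, using stock bzip2 on made-up data: concatenating two independent bzip2 streams gives a multistream file of the kind described above, which the standard tools read straight through, while a reader that stopped at the first end-of-stream marker would see only the first member. lbzip2's single-stream output sidesteps that question entirely.

```shell
# Build a multistream bz2 file the way pages-articles-multistream is
# built: two independent streams simply concatenated.
printf 'one\n' | bzip2 > s1.bz2
printf 'two\n' | bzip2 > s2.bz2
cat s1.bz2 s2.bz2 > multi.bz2
# Stock bzcat decompresses all concatenated streams in sequence.
bzcat multi.bz2
```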
ArielGlenn added a comment.
What do folks think about this for a first step? When we're happy that these are ok, we can roll out to the rdf dumps. Right now the last dumps of the weekly run (lexeme) finish on Sunday, and that's just not sustainable going forwards.
gerritbot added a comment.
Change 474159 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] use lbzip2 for recompression of wikidata weekly json dumps
https://gerrit.wikimedia.org/r/474159
ArielGlenn added a comment.
In T206535#4675178, @hoo wrote:
...
Have you tested importing that via php (and/or anything else that uses the bzip2 compat stuff)?
script:
[ariel@bigtrouble ~]$ more catbz2file.php
md5sum of original (gz) file:
hoo added a comment.
In T206535#4675204, @Smalyshev wrote:
I just thought about this a bit and we might want to split the dumping process up into two steps:
I thought that's what is happening now? Or am I missing something?
Well, currently we invoke one bash script (via cron) that does all the work
Smalyshev added a comment.
I just thought about this a bit and we might want to split the dumping process up into two steps:
I thought that's what is happening now? Or am I missing something?
Smalyshev added a comment.
I think the bzip2 format is standardized, so all well-behaved tools should be interoperable. I'd try it with lbzip2 and check the recent dumps: if they work with the standard tools, then I think it should be fine.
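The interoperability check suggested above, written out as commands: compress a file, then verify and decompress it with the stock bzip2 tools. In the real test the producer would be lbzip2; plain bzip2 stands in here (with an invented input file) so the sketch runs anywhere, and the same two verification commands apply to either producer.

```shell
# Produce a bz2 file, then check it with the standard tools:
# "bzip2 -t" tests integrity without writing output, and bzcat
# confirms the content round-trips.
printf 'standard bzip2 format test\n' > in.txt
bzip2 -c in.txt > out.bz2            # swap in: lbzip2 -c in.txt > out.bz2
bzip2 -t out.bz2 && echo "integrity check passed"
bzcat out.bz2
```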
hoo added a comment.
I just thought about this a bit and we might want to split the dumping process up into two steps:
Generating the dump via the maintenance script (sharded) and concatenating the shards
Re-compress / format conversion (ttl <-> nt) / …
This way we could run 1) in serial / limited
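A hypothetical sketch of the two-step split proposed above, with all file names invented: step 1 stands in for the sharded maintenance script plus concatenation (which could run serially or with limited parallelism), and step 2 for the separate recompression/format-conversion pass, which could then be scheduled and parallelized independently.

```shell
set -e
# Step 1: generate shards, then concatenate into the primary (gz) dump.
# The printf is a stand-in for the real sharded maintenance script.
for shard in 0 1 2 3; do
  printf 'data from shard %s\n' "$shard" > "part-$shard.nt"
done
cat part-0.nt part-1.nt part-2.nt part-3.nt | gzip > all.nt.gz

# Step 2: recompress gz -> bz2 as a separate pass
# (production would use lbzip2 -n <threads> instead of bzip2).
zcat all.nt.gz | bzip2 > all.nt.bz2
```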
hoo added a comment.
In T206535#4673798, @ArielGlenn wrote:
I've enabled the use of lbzip2 for the xml/sql dumps starting with the Oct 20th run; we could consider using this for the wikidata weeklies recompression into bz2 files, at, say, four threads (half the number of shards). As far as I can
ArielGlenn added a comment.
ariel@snapshot1008:/mnt/dumpsdata/otherdumps/wikibase/wikidatawiki/20181015$ date; zcat wikidata-20181015-all-BETA.ttl.gz | lbzip2 -n 4 > /mnt/dumpsdata/temp/ariel/wikidata-20181015-all-BETA.ttl.bz2; date
Wed Oct 17 12:11:32 UTC 2018
Wed Oct 17 13:25:23 UTC 2018
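The two timestamps above put the gz -> bz2 recompression of the ttl dump at just under 74 minutes with four lbzip2 threads. The elapsed time can be computed with GNU date (its -d option is assumed, as on typical Linux hosts):

```shell
# Convert both timestamps to epoch seconds and take the difference.
start=$(date -u -d '2018-10-17 12:11:32' +%s)
end=$(date -u -d '2018-10-17 13:25:23' +%s)
echo "elapsed: $(( (end - start) / 60 )) minutes"   # prints: elapsed: 73 minutes
```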
ArielGlenn added a comment.
I've enabled the use of lbzip2 for the xml/sql dumps starting with the Oct 20th run; we could consider using this for the wikidata weeklies recompression into bz2 files, at, say, four threads (half the number of shards). As far as I can tell it puts out binary-format
Smalyshev added a comment.
Well, the dumps are big, so I'm not sure whether it's possible to do much about it... Maybe we could reduce the frequency to bi-weekly or something?
Also, the longest operation right now seems to be re-zipping (gz -> bz2) of the .nt dump. It takes over 1.5 days, judging by