Smalyshev added a comment.

Well, the dumps are big, so I'm not sure it's possible to do much about the runtime itself... Maybe we could reduce the frequency to bi-weekly or something?

Also, the longest operation right now seems to be re-zipping (gz -> bz2) of the .nt dump. It takes over 1.5 days, judging by the timestamps (unfortunately, the timestamps don't tell me how long ttl -> nt takes). I wonder if there's a way to generate the .gz and the .bz2 in parallel. A .bz2 file can be composed from concatenated chunks just like a .gz, so maybe there's a way to do it?
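A minimal sketch of the single-pass idea, assuming the serializer can write the uncompressed .nt stream to stdout (file names and chunk size are made up here, this is not the actual dump script):

```python
#!/usr/bin/env python3
"""Sketch: feed one uncompressed stream to .gz and .bz2 in a single pass."""
import bz2
import gzip
import sys

CHUNK = 1 << 20  # 1 MiB per read

def main(gz_path="dump.nt.gz", bz2_path="dump.nt.bz2"):
    with gzip.open(gz_path, "wb") as gz_out, bz2.open(bz2_path, "wb") as bz2_out:
        while True:
            chunk = sys.stdin.buffer.read(CHUNK)
            if not chunk:
                break
            # Both compressors consume the same chunk, so the data is
            # produced once instead of being gzipped first and then
            # re-read and re-compressed to bz2 afterwards.
            gz_out.write(chunk)
            bz2_out.write(chunk)

if __name__ == "__main__":
    main()
```

Whether that saves much of the 1.5 days depends on where the time actually goes: if bzip2 compression itself is the bottleneck rather than the extra decompress/re-compress pass, a parallel compressor like pbzip2 or lbzip2 fed from the same stream would probably be the bigger win.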

Other options:

  • Re-do the performance audit of the dump generator; we last did one 2+ years ago IIRC, and there may be some potential for improvement.
  • Remove (or reduce the frequency of) the .ttl dump - it duplicates the .nt one, and the latter is superior for processing, though larger. .ttl is much more readable, but I'm not sure how much readability matters in a 70G dump.
  • Play with parallelism/sharding/etc. - maybe there are some things we can tweak to make it run faster (see the chunked-compression sketch below).
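On the chunk-composition point: complete bzip2 streams can be concatenated and the result still decompresses as one file, just like concatenated gzip members, so per-chunk outputs from a sharded run could in principle be bzipped independently and stitched together. A hypothetical sketch (chunk paths and output name invented, not from the current scripts):

```python
import bz2
import concurrent.futures

def compress_chunk(path):
    # Each chunk file becomes a complete, standalone bzip2 stream.
    with open(path, "rb") as f:
        return bz2.compress(f.read())

def build_bz2(chunk_paths, out_path="dump.nt.bz2"):
    # pool.map preserves input order, so concatenating the results in
    # order yields a valid multi-stream .bz2 of the whole dump.
    with concurrent.futures.ProcessPoolExecutor() as pool, open(out_path, "wb") as out:
        for blob in pool.map(compress_chunk, chunk_paths):
            out.write(blob)
```

This trades a little compression ratio (each chunk is its own stream) for being able to use all cores, which is essentially what pbzip2/lbzip2 do internally.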
