Smalyshev added a comment.
Well, the dumps are big, so I'm not sure whether it's possible to do much about it... Maybe we could reduce the frequency to bi-weekly or something?
Also, the longest operation right now seems to be re-compressing (gz -> bz2) the .nt dump. Judging by the timestamps, it takes over 1.5 days (unfortunately, the timestamps don't show how long ttl->nt takes). I wonder if there's a way to generate the .gz and .bz2 in parallel. .bz2 can be composed from independently compressed chunks just like .gz, so maybe there's a way to do it?
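On generating both in one pass: instead of re-compressing the finished .gz afterwards, the writer could feed each chunk of output to both compressors as it goes, so the data is traversed only once. A minimal sketch of the idea, not the actual dump generator (the paths and the fixed-size read loop are hypothetical):

```python
import bz2
import gzip

def dual_compress(src_path: str, gz_path: str, bz2_path: str,
                  chunk_size: int = 1 << 20) -> None:
    """Stream the uncompressed dump once, writing .gz and .bz2 side by side.

    This replaces the decompress-then-recompress round trip: the .bz2
    output costs one extra write per chunk instead of a second full pass.
    """
    with open(src_path, "rb") as src, \
         gzip.open(gz_path, "wb") as gz_out, \
         bz2.open(bz2_path, "wb") as bz2_out:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            gz_out.write(chunk)   # both writers consume the same chunk,
            bz2_out.write(chunk)  # so the input is read exactly once
```

The same one-pass effect can be had at the shell level with tee and process substitution (`tee >(gzip -c > dump.nt.gz) >(bzip2 -c > dump.nt.bz2) > /dev/null`), though the slower bzip2 stage would then bound the wall-clock time.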
Other options:
- Re-do the performance audit of the dump generator; we last did one 2+ years ago IIRC, and there may be some potential for improvement
- Remove (or reduce the frequency of) the .ttl dump - it duplicates the .nt one, and the latter is superior for processing, though larger. .ttl is much more readable, but I'm not sure how much readability matters in a 70G dump.
- Play with parallelism/sharding/etc. - maybe there are some knobs we can tweak to make it run faster (see the chunked-compression sketch after this list).
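On the chunk-composition idea: since concatenated bzip2 streams form a valid .bz2 file, chunks can be compressed on separate worker processes and written out in order. Here's a minimal sketch under that assumption (fixed-size chunks, hypothetical paths); pbzip2 does essentially this at the tool level:

```python
import bz2
from concurrent.futures import ProcessPoolExecutor

def _compress(chunk: bytes) -> bytes:
    # bz2.compress emits a complete bzip2 stream; concatenating such
    # streams still yields a valid .bz2 file, which is what makes
    # chunk-level parallelism safe here
    return bz2.compress(chunk)

def parallel_bz2(src_path: str, dst_path: str,
                 chunk_size: int = 32 << 20, workers: int = 8) -> None:
    """Compress fixed-size chunks on worker processes, writing them in order."""
    with open(src_path, "rb") as src, \
         open(dst_path, "wb") as dst, \
         ProcessPoolExecutor(max_workers=workers) as pool:
        pending = []
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            pending.append(pool.submit(_compress, chunk))
            # cap the number of in-flight chunks so a 70G input
            # doesn't end up buffered in memory
            if len(pending) >= workers * 2:
                dst.write(pending.pop(0).result())
        for fut in pending:
            dst.write(fut.result())
```

With 8 workers this should cut the bzip2 stage roughly in proportion to the core count, at a small compression-ratio cost from resetting the stream at each chunk boundary.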
