hoo added a comment.
I ran the above-mentioned tool on a slow-ish VM over the latest truthy dump:
$ time ~/gz-sort/gz-sort -u -S 100M wikidata-20170927-truthy-BETA.nt.gz ~/wikidata-20170927-truthy-BETA.nt.sort.gz
line count: 1924967162
presort: 219.15 minutes
merge 396083: 186.55 minutes
merge 792167: 183.47 minutes
merge 1584335: 182.98 minutes
merge 3166064: 183.37 minutes
merge 6332128: 183.77 minutes
merge 12664257: 183.28 minutes
merge 25328515: 183.90 minutes
merge 50657030: 185.42 minutes
merge 101314061: 217.00 minutes
merge 192496716: 219.67 minutes
merge 384993432: 218.62 minutes
merge 641655720: 217.23 minutes
merge 962483581: 224.97 minutes
removed 303419 non-unique lines
real 2789m32.668s
user 2598m34.233s
sys 18m21.880s
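For a quick sanity check of the timings reported above, here is a small sketch that adds up the per-phase minutes and compares them with the wall-clock time. The numbers are copied from the output; the variable names are mine, and reading each reported phase as one full pass over the data is my assumption, not something the tool states.

```python
# Per-phase timings in minutes, copied from the gz-sort output above.
presort = 219.15
merges = [
    186.55, 183.47, 182.98, 183.37, 183.77, 183.28, 183.90,
    185.42, 217.00, 219.67, 218.62, 217.23, 224.97,
]

# Sum of all reported phases versus the wall-clock time from `time`.
total_min = presort + sum(merges)
real_min = 2789 + 32.668 / 60  # "real 2789m32.668s"

# Assumption: each reported phase is one full pass over the data.
passes = 1 + len(merges)

# Share of lines dropped by -u, relative to the reported line count.
line_count = 1924967162
removed = 303419
removed_fraction = removed / line_count

print(f"phases: {total_min:.2f} min ({total_min / 60:.1f} h) in {passes} passes")
print(f"real:   {real_min:.2f} min")
print(f"removed {removed_fraction:.4%} of lines as duplicates")
```

The phase times add up to almost exactly the wall-clock time, so nearly the whole run (roughly 46.5 hours) is spent in the presort and the 13 merge passes.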
The resulting gzipped file was about 4% larger than the original, but that is probably because it was not compressed with -9. Sadly I accidentally deleted the sorted dump, so I can't check how large it would be with gzip -9 or other compression formats… but I kind of doubt that's worth it.
To: hoo
Cc: daniel, Lydia_Pintscher, ArielGlenn, aude, Aklapper, hoo, GoranSMilovanovic, QZanden, Izno, Wikidata-bugs, Svick, Mbch331, jeremyb
_______________________________________________ Wikidata-bugs mailing list [email protected] https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs
