bennofs added a comment.
In T222985#7163999 <https://phabricator.wikimedia.org/T222985#7163999>,
@ArielGlenn wrote:
> lbzip2 decompresses in parallel as well. We use that for compression of the SQL/XML dumps.

Yes, the problem is that bzip2 is just really slow to decompress.
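The single-core vs. multi-core difference can be sketched with a small stand-in file (the sample data, filenames, and the lbzip2 availability check are assumptions for illustration; real timings only become meaningful on multi-gigabyte dumps like the one above):

```shell
# Compare single-threaded bzip2 with parallel lbzip2 on a small sample.
# sample.json is a stand-in for the real 35G dump.
head -c 1000000 /dev/urandom > sample.json
bzip2 -k sample.json                          # writes sample.json.bz2
time bzip2 -d -c sample.json.bz2 > out1.json  # always one core
# lbzip2 reads the same .bz2 format; -n sets the worker thread count
if command -v lbzip2 >/dev/null; then
  time lbzip2 -n2 -d -c sample.json.bz2 > out2.json
  cmp out1.json out2.json                     # outputs must be identical
fi
```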
bennofs added a comment.
This does not seem fully fixed yet:
https://www.wikidata.org/wiki/Wikidata_talk:SPARQL_query_service#Possible_bug.
Example from that post:
SELECT ?item ?itemLabel ?linkTo {
  ?item wdt:P780/wdt:P31*/wdt:P279* wd:Q737460, wd:Q86, wd:Q21120251.
  SERVICE
bennofs added a comment.
This query
https://query.wikidata.org/#SELECT%20%3Fprop%20%3Ftype%20WHERE%20%7B%20%3Fprop%20wikibase%3ApropertyType%20%3Ftype%20FILTER%20%28CONTAINS%28STR%28%3Fprop%29%2C%22Q%22%29%20%26%26%203%21%3D1%29%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20
bennofs added a comment.
$ time zstdcat -v -d wikidata-20190506-all.json.bz2 | zstd > /dev/null
real	4m5.341s
user	2m22.4
bennofs added a comment.
But I can do a zstd decompression -> zstd compression test.
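A streaming version of that test avoids the disk-space problem entirely, since decompression and recompression are connected by a pipe (a sketch with a small generated file standing in for the dump; the filenames and flags are illustrative):

```shell
# Decompress zstd on stdin and recompress on stdout in one pipeline,
# so no intermediate plain-text file ever touches the disk.
head -c 1000000 /dev/urandom > data.bin
zstd -q -k data.bin -o data.zst
zstd -q -d -c data.zst | zstd -q -T0 > data.recompressed.zst
# Verify the round trip reproduces the original bytes
zstd -q -d -c data.recompressed.zst | cmp -s - data.bin && echo roundtrip-ok
```

With the real dump the first `zstd -d` stage would read the existing compressed file and `-T0` would spread the recompression across all cores.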
TASK DETAIL
https://phabricator.wikimedia.org/T222985
EMAIL PREFERENCES
https://phabricator.wikimedia.org/settings/panel/emailpreferences/
To: bennofs
Cc: ArielGlenn, Liuxinyu970226, benn
bennofs added a comment.
I don't have enough disk space for a compression test, that's correct.
bennofs added a comment.
Now the same with zstd:
$ time zstdcat -v -d wikidata-20190506-all.json.bz2 | cat > /dev/null
real	3m48.657s
user	0m3.792s
sys	0m58.768s
Here are the sizes:
35G wikidata-20190506-all.json.bz2
39G
bennofs added a comment.
So I tried lbzip2, here's the result (on a VM server with 2 cores, 2.1 GHz; the decompression is CPU bound):
$ time lbzip2 -n2 -v -d -c wikidata-20190506-all.json.bz2 | cat > /dev/null
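One way to generalize the hard-coded `-n2` is to derive the thread count from the machine; the `decompress` helper and the plain-bzip2 fallback below are illustrative assumptions, not part of the original test:

```shell
# Decompress a .bz2 file with as many lbzip2 threads as there are cores,
# falling back to single-threaded bzip2 when lbzip2 is not installed.
decompress() {
  if command -v lbzip2 >/dev/null; then
    lbzip2 -n "$(nproc)" -d -c "$1"
  else
    bzip2 -d -c "$1"
  fi
}
printf 'hello' | bzip2 > t.bz2
decompress t.bz2   # writes "hello" to stdout
```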
bennofs created this task.
bennofs added projects: Wikidata, Dumps-Generation.
Restricted Application added a subscriber: Liuxinyu970226.
TASK DESCRIPTION
At this time, wikidata provides JSON dumps compressed with gzip or bzip2.
However, neither is optimal:
- the gzip dump is quite