Lucas_Werkmeister_WMDE created this task.
Lucas_Werkmeister_WMDE added a project: Wikidata.
Herald added a subscriber: Aklapper.

TASK DESCRIPTION

HDT is a compact binary format for RDF that can also support efficient querying. On the mailing list, people have requested that we offer an HDT dump in addition to the TTL dumps, allowing them to run queries on their own systems that would take too long to run on the Wikidata Query Service.

There is an rdf2hdt tool (link; LGPLv2.1+) that can convert TTL dumps to HDT files. Unfortunately, it doesn’t run in a streaming fashion (it doesn’t even open the output file until it’s done converting) and seems to require almost as much memory as the uncompressed TTL dump to run. I tried to run it on the latest Wikidata dump, but the program was OOM-killed after having consumed 2.32 GiB of the gzipped input dump (according to pv), which corresponds to 15.63 GiB of uncompressed input data; the last VmSize before it was killed was 13.04 GiB. As the full uncompressed TTL dump is 187 GiB (201 GB), it looks like we would need a machine with at least ~200 GB of memory to do the conversion. (Perhaps we could get away with using lots of swap space instead of actual RAM – I have no idea what kind of memory access patterns the tool has.)
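
As a rough sanity check on the memory figures above, here is a minimal Python sketch of the extrapolation. It assumes that memory use grows linearly with the amount of input read, which is not confirmed for rdf2hdt. The constant and function names are made up for illustration, and VmSize is virtual memory, so it may overstate what is actually resident.

```python
# Figures taken from the run described above; the linear model is an assumption.
INPUT_SEEN_GIB = 15.63    # uncompressed input consumed before the OOM kill
VMSIZE_GIB = 13.04        # last VmSize observed before the kill
TOTAL_INPUT_GIB = 187.0   # size of the full uncompressed TTL dump


def gib_to_gb(gib):
    """Convert gibibytes (2**30 bytes) to decimal gigabytes (10**9 bytes)."""
    return gib * 2**30 / 10**9


def extrapolate_memory_gib(seen_gib, vmsize_gib, total_gib):
    """Scale the observed memory use linearly to the full input size."""
    return vmsize_gib / seen_gib * total_gib


estimated_gib = extrapolate_memory_gib(INPUT_SEEN_GIB, VMSIZE_GIB, TOTAL_INPUT_GIB)
print(f"full dump: {gib_to_gb(TOTAL_INPUT_GIB):.0f} GB")
print(f"estimated memory: {estimated_gib:.0f} GiB ({gib_to_gb(estimated_gib):.0f} GB)")
```

Under the linear assumption the estimate comes out at roughly 156 GiB, somewhat below the ~200 GB figure. Since the growth may not be linear, the higher number is the safer one to plan with.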

As for the processing time, on my system 9% of the dump was processed in 23 minutes, so the full conversion would probably take a few hours rather than days. The CPU time reported by Bash’s time builtin was actually less than the wall-clock time, so the tool does not appear to be multi-threaded. Of course, there may be an additional processing phase after the tool has finished reading the file, and I have no idea how long that could take.
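
The time estimate can be sketched the same way. This assumes a constant read rate over the whole dump and ignores any processing phase after reading, so it is only a lower bound. The names are hypothetical.

```python
# Observed values from the paragraph above; constant throughput is an assumption.
FRACTION_DONE = 0.09     # share of the dump processed when the run stopped
MINUTES_ELAPSED = 23.0   # wall-clock minutes spent up to that point


def estimate_total_minutes(fraction_done, minutes_elapsed):
    """Extrapolate total read time, assuming the read rate stays constant."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    return minutes_elapsed / fraction_done


total_minutes = estimate_total_minutes(FRACTION_DONE, MINUTES_ELAPSED)
total_hours = total_minutes / 60
print(f"estimated time to read the full dump: {total_hours:.1f} hours")
```

That gives a bit over four hours for the read phase alone, consistent with "hours, but not days".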


TASK DETAIL
https://phabricator.wikimedia.org/T179681

To: Lucas_Werkmeister_WMDE
Cc: Lucas_Werkmeister_WMDE, Aklapper, Lahi, GoranSMilovanovic, QZanden, Wikidata-bugs, aude, Mbch331