On 16/02/2022 11:56, Neubert, Joachim wrote:
I've loaded the Wikidata "truthy" dataset with 6b triples. The summary stats are:

10:09:29 INFO  Load node table  = 35555 seconds
10:09:29 INFO  Load ingest data = 25165 seconds
10:09:29 INFO  Build index SPO  = 11241 seconds
10:09:29 INFO  Build index POS  = 14100 seconds
10:09:29 INFO  Build index OSP  = 12435 seconds
10:09:29 INFO  Overall          98496 seconds
10:09:29 INFO  Overall          27h 21m 36s
10:09:29 INFO  Triples loaded   = 6756025616
10:09:29 INFO  Quads loaded     = 0
10:09:29 INFO  Overall Rate     68591 tuples per second

This was done on a large machine with 2TB RAM and -threads=48, but in any case: it
looks like tdb2.xloader in apache-jena-4.5.0-SNAPSHOT brought HUGE improvements
over prior versions (unfortunately I cannot find a log, but the same load took
multiple days with 3.x on the same machine).

This is very helpful - faster than Lorenz reported on a 128G / 12-thread machine (31h). It does suggest there is effectively a soft upper bound on going faster with more RAM and more threads.

That seems likely - disk bandwidth also matters, and because xloader is phased between sort and index-writing steps, it is unlikely to be getting the best overlap of CPU crunching and I/O.

This all gets into RAID0, or allocating files across different disks.

There comes a point where it gets quite a task to set up the machine.

One other area I think might be easy to improve - more for smaller machines - is during data ingest. There, the node table index is being read randomly. On smaller-RAM machines, the ingest phase is proportionately longer, sometimes a lot.

An idea I had is calling the madvise system call on the mmap segments to tell the kernel the access is random (requires native code; Java 17 makes it possible to call madvise(2) directly without needing a C (etc) converter layer). A sketch follows below.
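
For illustration, a minimal sketch of that downcall, assuming Linux/64-bit and the
finalized Foreign Function & Memory API of Java 22 (java.lang.foreign; Java 17 has
it in incubating form as jdk.incubator.foreign, with somewhat different names).
The MADV_RANDOM value is the Linux one from <sys/mman.h>, and the class is
hypothetical, not anything in TDB2:

    import java.lang.foreign.*;
    import java.lang.invoke.MethodHandle;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    // Hypothetical sketch: advise the kernel that a memory-mapped index file
    // will be read randomly, so read-ahead is not wasted on it.
    public class MadviseRandom {
        private static final int MADV_RANDOM = 1; // Linux value from <sys/mman.h>

        public static void main(String[] args) throws Throwable {
            Linker linker = Linker.nativeLinker();
            // C signature: int madvise(void *addr, size_t length, int advice)
            MethodHandle madvise = linker.downcallHandle(
                    linker.defaultLookup().find("madvise").orElseThrow(),
                    FunctionDescriptor.of(ValueLayout.JAVA_INT,
                            ValueLayout.ADDRESS, ValueLayout.JAVA_LONG, ValueLayout.JAVA_INT));

            try (Arena arena = Arena.ofConfined();
                 FileChannel ch = FileChannel.open(Path.of(args[0]), StandardOpenOption.READ)) {
                // Map the whole file; the mapping is page-aligned, as madvise(2) requires.
                MemorySegment seg = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size(), arena);
                int rc = (int) madvise.invokeExact(seg, seg.byteSize(), MADV_RANDOM);
                if (rc != 0)
                    System.err.println("madvise failed, rc=" + rc);
            }
        }
    }

Whether the kernel honours the advice enough to shorten the ingest phase would
need measuring, of course.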

> If you think it useful, I am happy to share more details.

What was the storage?

    Andy

Two observations:

- As Andy (thanks again for all your help!) already mentioned, gzip files
apparently load significantly faster than bzip2 files. I experienced
200,000 vs. 100,000 triples/second in the parse nodes step (though colleagues
had jobs on the machine too, which might have influenced the results).

- During the extended SPO/POS/OSP sort periods, I saw only one or two
gzip instances (used in the background), which perhaps were a bottleneck. I
wonder if using pigz could extend the parallel processing (see the sketch
after this list).
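
One cheap way to try that, without touching the sort pipeline itself, would be to
wrap an external pigz process behind an OutputStream as a drop-in for
java.util.zip.GZIPOutputStream. This is only a hypothetical sketch - the class
and wiring are illustrative, not part of xloader - and it assumes pigz is on the
PATH:

    import java.io.File;
    import java.io.FilterOutputStream;
    import java.io.IOException;

    // Hypothetical sketch: an OutputStream that pipes into an external pigz
    // process, so the compression work spreads over several cores.
    public class PigzOutputStream extends FilterOutputStream {
        private final Process pigz;

        public PigzOutputStream(File target, int threads) throws IOException {
            this(new ProcessBuilder("pigz", "-p", Integer.toString(threads))
                    .redirectOutput(target)
                    .redirectError(ProcessBuilder.Redirect.INHERIT)
                    .start());
        }

        private PigzOutputStream(Process pigz) {
            super(pigz.getOutputStream()); // writes go to pigz's stdin
            this.pigz = pigz;
        }

        @Override
        public void write(byte[] b, int off, int len) throws IOException {
            out.write(b, off, len); // avoid FilterOutputStream's byte-at-a-time copy
        }

        @Override
        public void close() throws IOException {
            super.close();          // close stdin so pigz can finish and exit
            try {
                if (pigz.waitFor() != 0)
                    throw new IOException("pigz exited with code " + pigz.exitValue());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IOException("interrupted waiting for pigz", e);
            }
        }
    }

The trade-off is a runtime dependency on an external binary, so something like
this could only ever be opt-in.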

If you think it useful, I am happy to share more details. If I can help by
running some particular tests on a massively parallel machine, please let me know.

Cheers, Joachim

--
Joachim Neubert

ZBW - Leibniz Information Centre for Economics
Neuer Jungfernstieg 21
20354 Hamburg
Phone +49-40-42834-462

