On 16/02/2022 11:56, Neubert, Joachim wrote:
I've loaded the Wikidata "truthy" dataset with 6b triples. The summary stats are:
10:09:29 INFO Load node table = 35555 seconds
10:09:29 INFO Load ingest data = 25165 seconds
10:09:29 INFO Build index SPO = 11241 seconds
10:09:29 INFO Build index POS = 14100 seconds
10:09:29 INFO Build index OSP = 12435 seconds
10:09:29 INFO Overall 98496 seconds
10:09:29 INFO Overall 27h 21m 36s
10:09:29 INFO Triples loaded = 6756025616
10:09:29 INFO Quads loaded = 0
10:09:29 INFO Overall Rate 68591 tuples per second
This was done on a large machine with 2TB RAM and -threads=48, but anyway: It
looks like tdb2.xloader in apache-jena-4.5.0-SNAPSHOT brought HUGE improvements
over prior versions (unfortunately I cannot find a log, but it took multiple
days with 3.x on the same machine).
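
As a quick cross-check of the summary figures in the log above, here is a minimal sketch in Python. The numbers are copied from the log. The integer division used for the rate is an assumption about how the loader rounds, and the percentage breakdown is an addition, not something the loader prints.

```python
# Phase timings in seconds, copied from the xloader log above.
phases = {
    "Load node table": 35555,
    "Load ingest data": 25165,
    "Build index SPO": 11241,
    "Build index POS": 14100,
    "Build index OSP": 12435,
}
triples = 6756025616

# The phases should add up to the reported overall time.
overall = sum(phases.values())
hours, rem = divmod(overall, 3600)
minutes, seconds = divmod(rem, 60)

# Assumption: the reported rate is the truncated quotient.
rate = triples // overall

print(f"Overall {overall} seconds = {hours}h {minutes}m {seconds}s")
print(f"Overall rate {rate} tuples per second")

# Share of each phase, to see where the time goes.
for name, secs in phases.items():
    print(f"{name}: {100 * secs / overall:.1f}% of total")
```

The sum and the rate match the log, and the share of each phase shows how much of the run went to the node table and ingest steps versus the three index builds.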
This is very helpful - faster than Lorenz reported on a 128G / 12-thread
machine (31h). It does suggest there is effectively a soft upper bound
on how much faster it gets with more RAM and more threads.
That seems likely - disk bandwidth also matters and because xloader is
phased between sort and index writing steps, it is unlikely to be
getting the best overlap of CPU crunching and I/O.
This all gets into RAID0, or allocating files across different disks.
There comes a point where it gets quite a task to set up the machine.
One other area I think might be easy to improve - more for smaller
machines - is during data ingest. There, the node table index is being
randomly read. On smaller RAM machines, the ingest phase is proportionately
longer, sometimes a lot.
An idea I had is calling the madvise system call on the mmap segments to
tell the kernel the access is random (requires native code; Java17 makes
it possible to directly call madvise(2) without needing a C (etc)
converter layer).
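
To illustrate the madvise idea only, here is a minimal sketch in Python rather than Java. It is not Jena code and says nothing about how TDB2 maps its files. The helper name `map_for_random_access` is hypothetical. `mmap.madvise` and `mmap.MADV_RANDOM` are only available on some platforms (Python 3.8+, e.g. Linux), so the hint is skipped when they are missing.

```python
import mmap
import os
import tempfile


def map_for_random_access(path):
    """Memory-map a non-empty file read-only and hint random access.

    Hypothetical helper for illustration; the hint is advisory and the
    kernel is free to ignore it.
    """
    with open(path, "rb") as f:
        # Length 0 maps the whole file; the mapping stays valid after
        # the file object is closed.
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Only issue the hint where the platform exposes it.
    if hasattr(mm, "madvise") and hasattr(mmap, "MADV_RANDOM"):
        mm.madvise(mmap.MADV_RANDOM)
    return mm


def demo():
    """Map a small temporary file and read one byte from the middle."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"x" * 8192)
        os.close(fd)
        mm = map_for_random_access(path)
        try:
            return mm[4096:4097]
        finally:
            mm.close()
    finally:
        os.remove(path)
```

With MADV_RANDOM the kernel is told not to expect sequential access, which typically reduces read-ahead. Whether that helps the ingest phase on a small-RAM machine would need measuring.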
> If you think it useful, I am happy to share more details.
What was the storage?
Andy
Two observations:
- As Andy (thanks again for all your help!) already mentioned, gzip
files apparently load significantly faster than bzip2 files. I experienced
200,000 vs. 100,000 triples/second in the parse nodes step (though colleagues
had jobs on the machine too, which might have influenced the results).
- During the extended SPO/POS/OSP sort periods, I saw only one or two
gzip instances (used in the background), which perhaps were a bottleneck. I
wonder if using pigz could extend parallel processing.
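
To make the pigz question concrete, here is a minimal standalone sketch in Python that prefers pigz over gzip when it is installed. It is not part of xloader, and whether xloader's sort step can be pointed at a different compressor is not established here. The helper names and the default thread count are hypothetical.

```python
import gzip
import shutil
import subprocess


def pick_compressor(threads=4):
    """Return a gzip-compatible command, preferring pigz when installed.

    Hypothetical helper; returns None if neither tool is on the PATH.
    """
    if shutil.which("pigz"):
        # -p sets the number of compression threads, -c writes to stdout.
        return ["pigz", "-p", str(threads), "-c"]
    if shutil.which("gzip"):
        return ["gzip", "-c"]
    return None


def compress_bytes(data, threads=4):
    """Compress data with the chosen tool, falling back to Python's gzip."""
    cmd = pick_compressor(threads)
    if cmd is None:
        return gzip.compress(data)
    result = subprocess.run(cmd, input=data, capture_output=True, check=True)
    return result.stdout
```

pigz writes the same gzip format, so output from either branch can be read back with any gzip reader. Only a timing comparison on the real sort workload would show whether the compressor is actually the bottleneck.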
If you think it useful, I am happy to share more details. If I can help with
running some particular tests on a massive parallel machine, please let me know.
Cheers, Joachim
--
Joachim Neubert
ZBW - Leibniz Information Centre for Economics
Neuer Jungfernstieg 21
20354 Hamburg
Phone +49-40-42834-462