Re: Loading Wikidata

Marco Neumann Fri, 18 Feb 2022 01:01:27 -0800

Thank you for the effort, Joachim. What CPU and OS were used for the load
test?


Best,
Marco

On Fri, Feb 18, 2022 at 8:51 AM Neubert, Joachim <[email protected]> wrote:

> Storage of the machine is one 10TB raid6 SSD.
>
> Cheers, Joachim
>
> > -----Original Message-----
> > From: Andy Seaborne <[email protected]>
> > Sent: Wednesday, 16 February 2022 20:05
> > To: [email protected]
> > Subject: Re: Loading Wikidata
> >
> >
> >
> > On 16/02/2022 11:56, Neubert, Joachim wrote:
> > > I've loaded the Wikidata "truthy" dataset with 6b triples. The summary
> > > stats are:
> > >
> > > 10:09:29 INFO  Load node table  = 35555 seconds
> > > 10:09:29 INFO  Load ingest data = 25165 seconds
> > > 10:09:29 INFO  Build index SPO  = 11241 seconds
> > > 10:09:29 INFO  Build index POS  = 14100 seconds
> > > 10:09:29 INFO  Build index OSP  = 12435 seconds
> > > 10:09:29 INFO  Overall          98496 seconds
> > > 10:09:29 INFO  Overall          27h 21m 36s
> > > 10:09:29 INFO  Triples loaded   = 6756025616
> > > 10:09:29 INFO  Quads loaded     = 0
> > > 10:09:29 INFO  Overall Rate     68591 tuples per second
> > >
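
The summary figures above can be cross-checked with a little arithmetic. The following is a minimal sketch that only re-derives the totals from the phase times in the log; the function name `summarize` is made up for illustration and is not part of Jena.

```python
# Phase durations in seconds, copied from the xloader summary above.
phases = {
    "Load node table": 35555,
    "Load ingest data": 25165,
    "Build index SPO": 11241,
    "Build index POS": 14100,
    "Build index OSP": 12435,
}
triples = 6756025616  # "Triples loaded" from the same log


def summarize(phases, triples):
    """Return total seconds, (h, m, s), tuples per second, and phase shares in %."""
    total = sum(phases.values())
    hours, rem = divmod(total, 3600)
    minutes, seconds = divmod(rem, 60)
    rate = triples // total  # integer division, matching the logged figure
    shares = {name: round(100 * secs / total, 1) for name, secs in phases.items()}
    return total, (hours, minutes, seconds), rate, shares


total, hms, rate, shares = summarize(phases, triples)
print(total, hms, rate)
for name, pct in shares.items():
    print(f"{name:18s} {pct:5.1f}%")
```

The five phases sum to the logged 98496 seconds (27h 21m 36s), and the node table phase alone accounts for roughly 36% of the run.
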
> > > This was done on a large machine with 2TB RAM and -threads=48, but
> > > anyway: It looks like tdb2.xloader in apache-jena-4.5.0-SNAPSHOT brought
> > > HUGE improvements over prior versions (unfortunately I cannot find a
> > > log, but it took multiple days with 3.x on the same machine).
> >
> > This is very helpful - faster than Lorenz reported on a 128G / 12 threads
> > machine (31h). It does suggest there is effectively a soft upper bound on
> > going faster by adding more RAM and more threads.
> >
> > That seems likely - disk bandwidth also matters, and because xloader is
> > phased between sort and index writing steps, it is unlikely to be getting
> > the best overlap of CPU crunching and I/O.
> >
> > This all gets into RAID0, or allocating files across different disks.
> >
> > There comes a point where it gets quite a task to set up the machine.
> >
> > One other area I think might be easy to improve - more for smaller
> > machines - is during data ingest. There, the node table index is being
> > randomly read. On smaller RAM machines, the ingest phase is
> > proportionately longer, sometimes a lot.
> >
> > An idea I had is calling the madvise system call on the mmap segments to
> > tell the kernel the access is random (requires native code; Java 17 makes
> > it possible to directly call madvise(2) without needing a C (etc)
> > converter layer).
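
To illustrate the madvise idea, here is a minimal sketch in Python rather than Java. It assumes Python 3.8+, where `mmap.madvise` wraps the same system call, and a platform that defines `MADV_RANDOM`. The scratch file and the function name `random_reads_with_madvise` are hypothetical; this is not TDB2 code and it says nothing about how much the hint would help the loader.

```python
import mmap
import os
import random
import tempfile


def random_reads_with_madvise(size=1 << 20, reads=1000, seed=0):
    """Map a scratch file, hint random access, then read random offsets."""
    fd, path = tempfile.mkstemp()
    try:
        os.ftruncate(fd, size)  # scratch file filled with zero bytes
        with mmap.mmap(fd, size, access=mmap.ACCESS_READ) as m:
            # MADV_RANDOM is only defined where the OS supports it.
            advised = hasattr(mmap, "MADV_RANDOM")
            if advised:
                # Tell the kernel not to bother with read-ahead here.
                m.madvise(mmap.MADV_RANDOM)
            rng = random.Random(seed)
            total = sum(m[rng.randrange(size)] for _ in range(reads))
        return advised, total
    finally:
        os.close(fd)
        os.remove(path)


advised, total = random_reads_with_madvise()
print("madvise(MADV_RANDOM) applied:", advised)
```

The hint is advisory only: the kernel may ignore it, so any benefit for a random-read workload such as the node table index would have to be measured.
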
> >
> >  > If you think it useful, I am happy to share more details.
> >
> > What was the storage?
> >
> >      Andy
> > >
> > > Two observations:
> > >
> > >
> > > -        As Andy (thanks again for all your help!) already mentioned,
> > > gzip files apparently load significantly faster than bzip2 files. I
> > > experienced 200,000 vs. 100,000 triples/second in the parse nodes step
> > > (though colleagues had jobs on the machine too, which might have
> > > influenced the results).
> > >
> > > -        During the extended SPO/POS/OSP sort periods, I saw only one
> > > or two gzip instances (used in the background), which perhaps were a
> > > bottleneck. I wonder if using pigz could extend parallel processing.
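
The gzip versus bzip2 difference noted above can be explored in isolation with a small timing sketch. It uses only the Python standard library and synthetic N-Triples-like lines (not real Wikidata content); the function name `time_decompression` is made up for illustration. Absolute numbers depend on the machine and data, so it does not reproduce the loader figures.

```python
import bz2
import gzip
import time


def time_decompression(n_lines=20000):
    """Time single-threaded decompression of the same data with gzip and bzip2."""
    # Synthetic N-Triples-like lines, used only as a stand-in workload.
    data = "".join(
        f'<http://example.org/s{i}> <http://example.org/p> "{i}" .\n'
        for i in range(n_lines)
    ).encode("utf-8")
    timings = {}
    for name, codec in (("gzip", gzip), ("bzip2", bz2)):
        packed = codec.compress(data)
        start = time.perf_counter()
        unpacked = codec.decompress(packed)
        timings[name] = time.perf_counter() - start
        assert unpacked == data  # the round trip must be lossless
    return timings


for name, secs in time_decompression().items():
    print(f"{name:6s} {secs:.4f} s")
```

Both codecs run on a single thread here, which is the situation the pigz question is about.
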
> > >
> > > If you think it useful, I am happy to share more details. If I can
> > > help with running some particular tests on a massively parallel
> > > machine, please let me know.
> > >
> > > Cheers, Joachim
> > >
> > > --
> > > Joachim Neubert
> > >
> > > ZBW - Leibniz Information Centre for Economics Neuer Jungfernstieg 21
> > > 20354 Hamburg
> > > Phone +49-40-42834-462
> > >
> > >
>


-- 


---
Marco Neumann
KONA
