I used 'cat /proc/version'.
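For context: /proc/version is the kernel's build banner, which embeds the compiler that built the kernel - so a string like "Ubuntu 9.3.0" is the gcc package version, not the distribution release. A minimal sketch (Linux only):

```python
# Print the kernel build banner. On Ubuntu it typically looks like
# "Linux version 5.4.0-... (gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04)) ..."
# - which is where a version string like "Ubuntu 9.3.0" comes from.
with open("/proc/version") as f:
    print(f.read().strip())
```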
> -----Original Message-----
> From: Andrii Berezovskyi <[email protected]>
> Sent: Friday, 18 February 2022 10:35
> To: [email protected]
> Subject: Re: Loading Wikidata
>
> May I ask an unrelated question: how do you get the Ubuntu version in
> such a format? 'cat /etc/os-release' (or lsb_release, hostnamectl,
> neofetch) only gives me the '20.04.3' format or 'Focal'.
>
> On 2022-02-18, 10:17, "Neubert, Joachim" <[email protected]> wrote:
>
> OS is CentOS 7.9 in a Docker container running on Ubuntu 9.3.0.
>
> CPU is 4 x Intel(R) Xeon(R) Gold 6254 CPU @ 3.10GHz / 18 core (144 cores
> in total)
>
> Cheers, Joachim
>
> > -----Original Message-----
> > From: Marco Neumann <[email protected]>
> > Sent: Friday, 18 February 2022 10:00
> > To: [email protected]
> > Subject: Re: Loading Wikidata
> >
> > Thank you for the effort Joachim, what CPU and OS was used for the load
> > test?
> >
> > Best,
> > Marco
> >
> > On Fri, Feb 18, 2022 at 8:51 AM Neubert, Joachim <[email protected]>
> > wrote:
> >
> > > Storage of the machine is one 10 TB RAID6 SSD array.
> > >
> > > Cheers, Joachim
> > >
> > > > -----Original Message-----
> > > > From: Andy Seaborne <[email protected]>
> > > > Sent: Wednesday, 16 February 2022 20:05
> > > > To: [email protected]
> > > > Subject: Re: Loading Wikidata
> > > >
> > > >
> > > >
> > > > On 16/02/2022 11:56, Neubert, Joachim wrote:
> > > > > I've loaded the Wikidata "truthy" dataset with 6b triples. Summary stats:
> > > > >
> > > > > 10:09:29 INFO Load node table = 35555 seconds
> > > > > 10:09:29 INFO Load ingest data = 25165 seconds
> > > > > 10:09:29 INFO Build index SPO = 11241 seconds
> > > > > 10:09:29 INFO Build index POS = 14100 seconds
> > > > > 10:09:29 INFO Build index OSP = 12435 seconds
> > > > > 10:09:29 INFO Overall 98496 seconds
> > > > > 10:09:29 INFO Overall 27h 21m 36s
> > > > > 10:09:29 INFO Triples loaded = 6756025616
> > > > > 10:09:29 INFO Quads loaded = 0
> > > > > 10:09:29 INFO Overall Rate 68591 tuples per second
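As a quick arithmetic check, the totals in the log above are self-consistent (numbers copied verbatim from the summary):

```python
# Recompute the summary figures from the logged totals.
triples = 6_756_025_616          # "Triples loaded"
seconds = 98_496                 # "Overall 98496 seconds"

# "Overall 27h 21m 36s" expressed in seconds:
assert 27 * 3600 + 21 * 60 + 36 == seconds

# Integer rate, as reported ("Overall Rate 68591 tuples per second"):
print(triples // seconds)  # → 68591
```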
> > > > >
> > > > > This was done on a large machine with 2TB RAM and -threads=48, but
> > > > > anyway: It looks like tdb2.xloader in apache-jena-4.5.0-SNAPSHOT
> > > > > brought HUGE improvements over prior versions (unfortunately I
> > > > > cannot find a log, but it took multiple days with 3.x on the same
> > > > > machine).
> > > >
> > > > This is very helpful - faster than Lorenz reported on a 128G / 12
> > > > threads machine (31h). It does suggest there is effectively a soft
> > > > upper bound on going faster with more RAM and more threads.
> > > >
> > > > That seems likely - disk bandwidth also matters, and because xloader
> > > > is phased between the sort and index-writing steps, it is unlikely to
> > > > get the best overlap of CPU crunching and I/O.
> > > >
> > > > This all gets into RAID0, or allocating files across different disks.
> > > >
> > > > There comes a point where it becomes quite a task to set up the machine.
> > > >
> > > > One other area I think might be easy to improve - more for smaller
> > > > machines - is during data ingest. There, the node table index is
> > > > being randomly read. On smaller-RAM machines, the ingest phase is
> > > > proportionately longer, sometimes a lot longer.
> > > >
> > > > An idea I had is calling the madvise system call on the mmap segments
> > > > to tell the kernel the access is random (this requires native code;
> > > > Java 17 makes it possible to call madvise(2) directly without needing
> > > > a C (etc.) converter layer).
> > > >
> > > > > If you think it useful, I am happy to share more details.
> > > >
> > > > What was the storage?
> > > >
> > > > Andy
> > > > >
> > > > > Two observations:
> > > > >
> > > > >
> > > > > - As Andy (thanks again for all your help!) already mentioned,
> > > > > gzip files apparently load significantly faster than bzip2 files. I
> > > > > saw 200,000 vs. 100,000 triples/second in the parse-nodes step
> > > > > (though colleagues had jobs on the machine too, which might have
> > > > > influenced the results).
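The decode-speed gap is easy to reproduce in miniature with the standard library (synthetic data; real dumps are vastly larger, but the ratio tends in the same direction):

```python
import bz2
import gzip
import time

# Synthetic N-Triples-like payload, compressed both ways.
data = b'<http://example.org/s> <http://example.org/p> "o" .\n' * 200_000
codecs = {"gzip": (gzip.compress(data), gzip.decompress),
          "bzip2": (bz2.compress(data), bz2.decompress)}

for name, (blob, decompress) in codecs.items():
    t0 = time.perf_counter()
    assert decompress(blob) == data          # round-trip check
    print(f"{name}: decompressed in {time.perf_counter() - t0:.3f}s")
```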
> > > > >
> > > > > - During the extended SPO/POS/OSP sort periods, I saw only one
> > > > > or two gzip instances (used in the background), which perhaps were
> > > > > a bottleneck. I wonder if using pigz could add parallel processing.
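A minimal sketch of that idea, assuming the sort step is GNU sort: its --compress-program option names the program used to compress temporary spill files, and pointing it at pigz (parallel gzip) lets that compression use several cores instead of a single gzip process. File names here are hypothetical:

```python
from pathlib import Path
import shutil
import subprocess

# Sketch: GNU sort compresses its temporary spill files with the program
# named by --compress-program; substituting pigz for gzip parallelises
# that step. Falls back to gzip if pigz is not installed.
Path("triples.tmp").write_text("b\na\nc\n")          # toy stand-in input
prog = "pigz" if shutil.which("pigz") else "gzip"
subprocess.run(
    ["sort", "--compress-program", prog, "-o", "sorted.tmp", "triples.tmp"],
    check=True,
)
print(Path("sorted.tmp").read_text())  # sorted.tmp now contains "a\nb\nc\n"
```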
> > > > >
> > > > > If you think it useful, I am happy to share more details. If I can
> > > > > help with running some particular tests on a massively parallel
> > > > > machine, please let me know.
> > > > >
> > > > > Cheers, Joachim
> > > > >
> > > > > --
> > > > > Joachim Neubert
> > > > >
> > > > > ZBW - Leibniz Information Centre for Economics
> > > > > Neuer Jungfernstieg 21
> > > > > 20354 Hamburg
> > > > > Phone +49-40-42834-462
> > > > >
> > > > >
> > >
> >
> >
> > --
> > Marco Neumann
> > KONA