Yes, you're right. /etc/os-release reports "Ubuntu 20.04.2 LTS"

> -----Original Message-----
> From: Andrii Berezovskyi <andr...@kth.se>
> Sent: Friday, 18 February 2022 10:49
> To: users@jena.apache.org
> Subject: Re: Loading Wikidata
> 
> I see, thanks. Are you sure 9.3.0 is the Ubuntu version and not the GCC version?
> 
> > On 18 Feb 2022, at 10:46, Neubert, Joachim <j.neub...@zbw.eu> wrote:
> >
> > I used cat /proc/version
> >
> >> -----Original Message-----
> >> From: Andrii Berezovskyi <andr...@kth.se>
> >> Sent: Friday, 18 February 2022 10:35
> >> To: users@jena.apache.org
> >> Subject: Re: Loading Wikidata
> >>
> >> May I ask an unrelated question: how do you get the Ubuntu version in
> >> such a format? 'cat /etc/os-release' (or lsb_release, hostnamectl,
> >> neofetch) only gives me the '20.04.3' format or Focal.
> >>
> >> On 2022-02-18, 10:17, "Neubert, Joachim" <j.neub...@zbw.eu> wrote:
> >>
> >>    OS is CentOS 7.9 in a Docker container running on Ubuntu 9.3.0.
> >>
> >>    CPU is 4 x Intel(R) Xeon(R) Gold 6254 CPU @ 3.10GHz / 18 cores
> >>    (144 logical cores in total)
> >>
> >>    Cheers, Joachim
> >>
> >>> -----Original Message-----
> >>> From: Marco Neumann <marco.neum...@gmail.com>
> >>> Sent: Friday, 18 February 2022 10:00
> >>> To: users@jena.apache.org
> >>> Subject: Re: Loading Wikidata
> >>>
> >>> Thank you for the effort Joachim, what CPU and OS was used for the
> >>> load test?
> >>>
> >>> Best,
> >>> Marco
> >>>
> >>> On Fri, Feb 18, 2022 at 8:51 AM Neubert, Joachim <j.neub...@zbw.eu>
> >>> wrote:
> >>>
> >>>> Storage of the machine is one 10TB RAID6 SSD.
> >>>>
> >>>> Cheers, Joachim
> >>>>
> >>>>> -----Original Message-----
> >>>>> From: Andy Seaborne <a...@apache.org>
> >>>>> Sent: Wednesday, 16 February 2022 20:05
> >>>>> To: users@jena.apache.org
> >>>>> Subject: Re: Loading Wikidata
> >>>>>
> >>>>>
> >>>>>
> >>>>> On 16/02/2022 11:56, Neubert, Joachim wrote:
> >>>>>> I've loaded the Wikidata "truthy" dataset with 6b triples.
> >>>>>> Summary stats are:
> >>>>>>
> >>>>>> 10:09:29 INFO  Load node table  = 35555 seconds
> >>>>>> 10:09:29 INFO  Load ingest data = 25165 seconds
> >>>>>> 10:09:29 INFO  Build index SPO  = 11241 seconds
> >>>>>> 10:09:29 INFO  Build index POS  = 14100 seconds
> >>>>>> 10:09:29 INFO  Build index OSP  = 12435 seconds
> >>>>>> 10:09:29 INFO  Overall          98496 seconds
> >>>>>> 10:09:29 INFO  Overall          27h 21m 36s
> >>>>>> 10:09:29 INFO  Triples loaded   = 6756025616
> >>>>>> 10:09:29 INFO  Quads loaded     = 0
> >>>>>> 10:09:29 INFO  Overall Rate     68591 tuples per second
> >>>>>>
> >>>>>> This was done on a large machine with 2TB RAM and -threads=48, but
> >>>>>> anyway: It looks like tdb2.xloader in apache-jena-4.5.0-SNAPSHOT
> >>>>>> brought HUGE improvements over prior versions (unfortunately I
> >>>>>> cannot find a log, but it took multiple days with 3.x on the same
> >>>>>> machine).
> >>>>>
> >>>>> This is very helpful - faster than Lorenz reported on a 128G / 12
> >>>>> threads machine (31h). It does suggest there is effectively a soft
> >>>>> upper bound on going faster with more RAM and more threads.
> >>>>>
> >>>>> That seems likely - disk bandwidth also matters, and because xloader
> >>>>> is phased between sort and index-writing steps, it is unlikely to be
> >>>>> getting the best overlap of CPU crunching and I/O.
> >>>>>
> >>>>> This all gets into RAID0, or allocating files across different disks.
> >>>>>
> >>>>> There comes a point where it gets quite a task to set up the machine.
> >>>>>
> >>>>> One other area I think might be easy to improve - more for smaller
> >>>>> machines - is during data ingest. There, the node table index is
> >>>>> being read randomly. On smaller-RAM machines, the ingest phase is
> >>>>> proportionately longer, sometimes a lot.
> >>>>>
> >>>>> An idea I had is calling the madvise system call on the mmap
> >>>>> segments to tell the kernel the access is random (requires native
> >>>>> code; Java 17 makes it possible to directly call madvise(2) without
> >>>>> needing a C (etc) converter layer).
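
As a rough illustration of that idea, here is a minimal sketch using the
finalized java.lang.foreign API (Java 22); on Java 17 the same call has to go
through the incubating jdk.incubator.foreign module, whose class and method
names differ. The MadviseRandom class is made up for the example (it is not
Jena code), and MADV_RANDOM = 1 is the Linux value of the constant:

    // Hypothetical sketch: advise the kernel that a mapped region will be
    // accessed randomly. Run with --enable-native-access=ALL-UNNAMED to
    // silence the native-access warning on recent JDKs.
    import java.lang.foreign.FunctionDescriptor;
    import java.lang.foreign.Linker;
    import java.lang.foreign.MemorySegment;
    import java.lang.foreign.ValueLayout;
    import java.lang.invoke.MethodHandle;
    import java.nio.MappedByteBuffer;

    public final class MadviseRandom {
        private static final int MADV_RANDOM = 1;   // <sys/mman.h> on Linux

        // int madvise(void *addr, size_t length, int advice);
        private static final MethodHandle MADVISE =
            Linker.nativeLinker().downcallHandle(
                Linker.nativeLinker().defaultLookup().find("madvise").orElseThrow(),
                FunctionDescriptor.of(ValueLayout.JAVA_INT,
                    ValueLayout.ADDRESS, ValueLayout.JAVA_LONG, ValueLayout.JAVA_INT));

        /** Tell the kernel that accesses to this mapped region will be random. */
        public static void adviseRandom(MappedByteBuffer mapped) throws Throwable {
            MemorySegment region = MemorySegment.ofBuffer(mapped); // view of the mapping, no copy
            int rc = (int) MADVISE.invoke(region, region.byteSize(), MADV_RANDOM);
            if (rc != 0) {
                throw new IllegalStateException("madvise(MADV_RANDOM) failed, rc=" + rc);
            }
        }
    }

The segment only wraps the existing mapping, so nothing is copied; the call
just switches off readahead for that region, which is what you want when the
node table index is hit at random offsets.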
> >>>>>
> >>>>>> If you think it useful, I am happy to share more details.
> >>>>>
> >>>>> What was the storage?
> >>>>>
> >>>>>     Andy
> >>>>>>
> >>>>>> Two observations:
> >>>>>>
> >>>>>>
> >>>>>> - As Andy (thanks again for all your help!) already mentioned, gzip
> >>>>>> files apparently load significantly faster than bzip2 files. I
> >>>>>> experienced 200,000 vs. 100,000 triples/second in the parse nodes
> >>>>>> step (though colleagues had jobs on the machine too, which might
> >>>>>> have influenced the results).
> >>>>>>
> >>>>>> - During the extended SPO/POS/OSP sort periods, I saw only one or
> >>>>>> two gzip instances (used in the background), which perhaps were a
> >>>>>> bottleneck. I wonder if using pigz could improve parallelism there.
> >>>>>>
> >>>>>> If you think it useful, I am happy to share more details. If I can
> >>>>>> help with running some particular tests on a massively parallel
> >>>>>> machine, please let me know.
> >>>>>>
> >>>>>> Cheers, Joachim
> >>>>>>
> >>>>>> --
> >>>>>> Joachim Neubert
> >>>>>>
> >>>>>> ZBW - Leibniz Information Centre for Economics
> >>>>>> Neuer Jungfernstieg 21
> >>>>>> 20354 Hamburg
> >>>>>> Phone +49-40-42834-462
> >>>>>>
> >>>>>>
> >>>>
> >>>
> >>>
> >>> --
> >>>
> >>>
> >>> ---
> >>> Marco Neumann
> >>> KONA
> >
