I've just finished downloading the Wikidata latest-truthy.nt.gz (39 GB) and decompressing it (605 GB) in ~10 hours using Ubuntu 19.10 on a Raspberry Pi 4 with a USB3 1 TB HDD.
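The de-duplication step on a file this size is usually a single external sort rather than a separate sort-then-uniq pass. A rough sketch (file names, memory budget and temp path are hypothetical, not from the thread):

```shell
# Rough sketch: de-duplicate a decompressed N-Triples dump with an
# external sort.  LC_ALL=C forces fast byte-wise comparison, -u drops
# duplicate lines, -S caps the in-memory buffer, and -T keeps the
# temporary spill files on the large external disk.
LC_ALL=C sort -u -S 2G -T /mnt/usb/tmp \
    latest-truthy.nt > latest-truthy.dedup.nt
```

`sort -u` is equivalent to `sort | uniq` but avoids the second pass; on a small single-board machine the `-S` and `-T` settings dominate the runtime.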
I'll update you on the sort and uniq (from memory there were not that many duplicates).

Dick

On Wed, 20 May 2020 at 11:21, Wolfgang Fahl <[email protected]> wrote:

> Thank you Dick for your response.
>
> > Basically, you need hardware!
>
> That option is very limited with my budget, and my current servers with
> 64 GByte of RAM, up to 12 cores, 4 TB 7200 rpm disks and SSDs of up to
> 512 GByte seem reasonable to me. I'd rather wait a bit longer than pay
> for hardware, especially with the risk of things crashing anyway.
>
> The splitting option you mention seems to be a lot of extra hassle, and
> I assume it is based on the approach of "import all of Wikidata".
> Currently I see that the hurdles for doing such a "full import" are very
> high. For my use case I might be able to put up with some 3-5% of
> Wikidata, since I am basically interested in what
> https://www.wikidata.org/wiki/Wikidata:Scholia offers for the
> https://projects.tib.eu/confident/ ConfIDent project.
>
> What kind of tuning besides the hardware was effective for you?
>
> Does anybody have experience with partial dumps created by
> https://tools.wmflabs.org/wdumps/?
>
> Cheers
>
> Wolfgang
>
> On 20.05.20 at 11:22, Dick Murray wrote:
> > That's a blast from the past!
> >
> > Not all of the details from that exchange are on the Jena list because
> > Laura and myself took the conversation offline...
> >
> > The short story is I imported Wikidata in 3 days using an IBM 24-core
> > 512 GB RAM server and 16 1 TB SSDs. The swap was configured as striped
> > 1 TB SSDs. Any thrashing was absorbed by the 24 cores, i.e. there were
> > plenty of cycles for the OS to be doing housekeeping, and there was a
> > lot of housekeeping!
> >
> > Basically, you need hardware!
> >
> > I managed to reduce this time to a day by performing 4 imports in
> > parallel. This was only possible because my server could absorb this
> > amount of throughput.
> > Importing in parallel resulted in 4 TDBs which were queried using a
> > beta Jena extension (known internally as Mosaic). This has its own
> > issues, such as the requirement to de-duplicate 4 streams of quads to
> > answer COUNT(...) actions, using Java streams. This led to further
> > work whereby preprocessing was performed to guarantee that each quad
> > was unique across the 4 TDBs, which meant the .distinct() could be
> > skipped in the stream processing.
> >
> > About a year ago I performed the same test on a Ryzen 2950X based
> > system, using the same disks plus 3 M.2 drives, and got similar
> > results.
> >
> > You also need to consider what bzip2 compression level was used. Wiki
> > use bzip2 because of its aggressive compression, i.e. they want to
> > reduce the compressed file as much as possible.
> >
> > On Wed, 20 May 2020 at 06:56, Wolfgang Fahl <[email protected]> wrote:
> >
> >> Dear Apache Jena users,
> >>
> >> Some 2 years ago Laura Morlaes and Dick Murray had an exchange on
> >> this list on how to influence the performance of tdbloader. The issue
> >> is currently of interest for me again in the context of trying to
> >> load some 15 billion triples from a copy of Wikidata. At
> >> http://wiki.bitplan.com/index.php/Get_your_own_copy_of_WikiData I
> >> have documented what I am trying to accomplish, and a few days ago I
> >> placed a question on Stack Overflow
> >> https://stackoverflow.com/questions/61813248/jena-tdbloader2-performance-and-limits
> >> with the following three questions:
> >>
> >> *What is proven to speed up the import without investing in extra
> >> hardware?*
> >> e.g. splitting the files, changing VM arguments, running multiple
> >> processes ...
> >>
> >> *What explains the decreasing speed at higher numbers of triples and
> >> how can this be avoided?*
> >>
> >> *What successful multi-billion triple imports for Jena do you know
> >> of, and what are the circumstances for these?*
> >>
> >> There were some 50 views on the question so far and some comments,
> >> but there is no real hint yet on what could improve things.
> >>
> >> Especially the Java VM crashes that happened with different Java
> >> environments on the Mac OS X machine are disappointing, since even
> >> with a slow speed the import would have been finished after a while,
> >> but with a crash it's a never-ending story.
> >>
> >> I am curious to learn what your experience and advice is.
> >>
> >> Yours
> >>
> >> Wolfgang
> >>
> >> --
> >>
> >> Wolfgang Fahl
> >> Pater-Delp-Str. 1, D-47877 Willich Schiefbahn
> >> Tel. +49 2154 811-480, Fax +49 2154 811-481
> >> Web: http://www.bitplan.de
>
> --
> BITPlan - smart solutions
> Wolfgang Fahl
> Pater-Delp-Str. 1, D-47877 Willich Schiefbahn
> Tel. +49 2154 811-480, Fax +49 2154 811-481
> Web: http://www.bitplan.de
> BITPlan GmbH, Willich - HRB 6820 Krefeld, Steuer-Nr.: 10258040548,
> Geschäftsführer: Wolfgang Fahl
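The parallel-import approach discussed above (several simultaneous loads into separate TDBs) can be sketched roughly as follows, assuming GNU split and Jena's tdbloader2 script are on the PATH; the chunk and directory names are hypothetical, not from the thread:

```shell
# Rough sketch: split the dump into 4 line-aligned chunks and load each
# chunk into its own TDB directory in parallel.
split -n l/4 -d latest-truthy.nt chunk_      # produces chunk_00 .. chunk_03
for i in 00 01 02 03; do
    tdbloader2 --loc "tdb_$i" "chunk_$i" &   # one background loader per chunk
done
wait                                         # block until all 4 loads finish
```

The resulting 4 TDBs then need a query layer that merges the result streams and, unless the chunks were pre-deduplicated, filters them with .distinct(), which is the Mosaic work described earlier in the thread.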
