Thank you, Dick, for your response.

> Basically, you need hardware!
That option is very limited with my budget, and my current servers with
64 GByte RAM, up to 12 cores, 4 TB 7200 rpm disks, and SSDs of up to 512
GByte seem reasonable to me. I'd rather wait a bit longer than pay for
hardware, especially given the risk of things crashing anyway.

The splitting option you mention seems like a lot of extra hassle, and I
assume it is based on the approach of "import all of WikiData".
Currently I see that the hurdles for doing such a "full import" are very
high. For my use case I might be able to get by with some 3-5% of
Wikidata, since I am basically interested in what
https://www.wikidata.org/wiki/Wikidata:Scholia offers for the
https://projects.tib.eu/confident/ ConfIDent project.

What kind of tuning besides the hardware was effective for you?

Does anybody have experience with partial dumps created by
https://tools.wmflabs.org/wdumps/?
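
As an aside, the de-duplication step Dick describes below (merging quad streams from several TDBs and applying .distinct() to answer COUNT queries) can be sketched in plain Java streams. This is only an illustration; the quad strings and the class name DedupCount are made-up placeholders, not anything from Jena or the Mosaic extension itself:

```java
import java.util.List;
import java.util.stream.Stream;

public class DedupCount {

    // Merge per-TDB quad streams and count distinct quads.
    // If preprocessing already guarantees each quad is unique
    // across the TDBs, the .distinct() step can be skipped.
    public static long countDistinct(List<Stream<String>> streams) {
        return streams.stream()
                      .flatMap(s -> s)   // concatenate the per-TDB streams
                      .distinct()        // drop quads present in more than one TDB
                      .count();
    }

    public static void main(String[] args) {
        // Hypothetical N-Quads lines; "b" appears in both stores.
        Stream<String> tdbA = Stream.of("<s> <p> \"a\" <g> .", "<s> <p> \"b\" <g> .");
        Stream<String> tdbB = Stream.of("<s> <p> \"b\" <g> .", "<s> <p> \"c\" <g> .");
        System.out.println(countDistinct(List.of(tdbA, tdbB))); // prints 3
    }
}
```

The cost of that .distinct() on billions of quads is presumably why pre-guaranteeing uniqueness across the stores was worth the extra preprocessing.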

Cheers

  Wolfgang

Am 20.05.20 um 11:22 schrieb Dick Murray:
> That's a blast from the past!
>
> Not all of the details from that exchange are on the Jena list because
> Laura and myself took the conversation offline...
>
> The short story is I imported WikiData in 3 days using an IBM 24-core
> 512GB RAM server and 16 1TB SSDs. The swap was configured as striped
> 1TB SSDs. Any thrashing was absorbed by the 24 cores, i.e. there were
> plenty of cycles for the OS to be doing housekeeping, and there was a lot
> of housekeeping!
>
> Basically, you need hardware!
>
> I managed to reduce this time to a day by performing 4 imports in parallel.
> This was only possible because my server could absorb this amount of
> throughput.
>
> Importing in parallel resulted in 4 TDBs which were queried using a beta
> Jena extension (known as Mosaic internally). This has its own issues, such
> as the requirement to de-duplicate 4 streams of quads to answer COUNT(...)
> actions, using Java streams. This led to further work whereby preprocessing
> was performed to guarantee that each quad was unique across the 4 TDBs, which
> meant the .distinct() could be skipped in the stream processing.
>
> About a year ago I performed that same test on a Ryzen 2950X based system,
> using the same disks plus 3 M.2 drives and received similar results.
>
> You also need to consider what bzip2 compression level was used.
> Wikimedia uses bzip2 because of its aggressive compression, i.e. they want to
> reduce the compressed file as much as possible.
>
>
> On Wed, 20 May 2020 at 06:56, Wolfgang Fahl <[email protected]> wrote:
>
>> Dear Apache Jena users,
>>
>> Some 2 years ago Laura Morlaes and Dick Murray had an exchange on this
>> list on how to influence the performance of
>> tdbloader. The issue is currently of interest for me again in the context
>> of trying to load some 15 billion triples from a
>> copy of wikidata. At
>> http://wiki.bitplan.com/index.php/Get_your_own_copy_of_WikiData I have
>> documented what I am trying to accomplish,
>> and a few days ago I placed a question on stackoverflow
>> https://stackoverflow.com/questions/61813248/jena-tdbloader2-performance-and-limits
>> with the following three questions:
>>
>> *What is proven to speed up the import without investing into extra
>> hardware?*
>> e.g. splitting the files, changing VM arguments, running multiple
>> processes ...
>>
>> *What explains the decreasing speed at higher numbers of triples and how
>> can this be avoided?*
>>
>> *What successful multi-billion triple imports for Jena do you know of and
>> what are the circumstances for these?*
>>
>> There were some 50 views on the question so far and some comments, but there
>> is no real hint yet on what could improve things.
>>
>> Especially the Java VM crashes that happened with different Java
>> environments on the Mac OSX machine are disappointing, since even at a
>> slow speed the import would have finished after a while, but with a
>> crash it's a never-ending story.
>>
>> I am curious to learn what your experience and advice is.
>>
>> Yours
>>
>>   Wolfgang
>>
>> --
>>
>>
>> Wolfgang Fahl
>> Pater-Delp-Str. 1, D-47877 Willich Schiefbahn
>> Tel. +49 2154 811-480, Fax +49 2154 811-481
>> Web: http://www.bitplan.de
>>
>>
-- 

BITPlan - smart solutions
Wolfgang Fahl
Pater-Delp-Str. 1, D-47877 Willich Schiefbahn
Tel. +49 2154 811-480, Fax +49 2154 811-481
Web: http://www.bitplan.de
BITPlan GmbH, Willich - HRB 6820 Krefeld, Steuer-Nr.: 10258040548, 
Geschäftsführer: Wolfgang Fahl 

