That's a blast from the past! Not all of the details from that exchange are on the Jena list because Laura and I took the conversation offline...
The short story is that I imported Wikidata in 3 days using an IBM server with 24 cores, 512 GB RAM, and 16 1 TB SSDs. Swap was configured as striped 1 TB SSDs. Any thrashing was absorbed by the 24 cores, i.e. there were plenty of cycles for the OS to do housekeeping, and there was a lot of housekeeping! Basically, you need hardware!

I managed to reduce this time to a day by performing 4 imports in parallel. This was only possible because my server could absorb that amount of throughput. Importing in parallel resulted in 4 TDBs, which were queried using a beta Jena extension (known internally as Mosaic). This has its own issues, such as the requirement to de-duplicate 4 streams of quads to answer COUNT(...) actions, using Java streams. This led to further work in which preprocessing was performed to guarantee that each quad was unique across the 4 TDBs, which meant the .distinct() could be skipped in the stream processing.

About a year ago I performed the same test on a Ryzen 2950X based system, using the same disks plus 3 M.2 drives, and got similar results.

You also need to consider what bzip2 compression level was used to compress the dump. Wikimedia uses bzip2 because of its aggressive compression, i.e. they want to reduce the compressed file as much as possible.

On Wed, 20 May 2020 at 06:56, Wolfgang Fahl <[email protected]> wrote:

> Dear Apache Jena users,
>
> Some 2 years ago Laura Morlaes and Dick Murray had an exchange on this
> list on how to influence the performance of tdbloader. The issue is
> currently of interest for me again in the context of trying to load some
> 15 billion triples from a copy of wikidata.
> At http://wiki.bitplan.com/index.php/Get_your_own_copy_of_WikiData I have
> documented what I am trying to accomplish, and a few days ago I placed a
> question on Stack Overflow:
> https://stackoverflow.com/questions/61813248/jena-tdbloader2-performance-and-limits
> with the following three questions:
>
> *What is proven to speed up the import without investing in extra
> hardware?* e.g. splitting the files, changing VM arguments, running
> multiple processes ...
>
> *What explains the decreasing speed at higher numbers of triples and how
> can this be avoided?*
>
> *What successful multi-billion-triple imports for Jena do you know of,
> and what were the circumstances for these?*
>
> There have been some 50 views of the question so far and some comments,
> but there is no real hint yet on what could improve things.
>
> Especially the Java VM crashes that happened with different Java
> environments on the Mac OS X machine are disappointing: even at a slow
> speed the import would have finished after a while, but with a crash it's
> a never-ending story.
>
> I am curious to learn what your experience and advice is.
>
> Yours
>
> Wolfgang
>
> --
>
> Wolfgang Fahl
> Pater-Delp-Str. 1, D-47877 Willich Schiefbahn
> Tel. +49 2154 811-480, Fax +49 2154 811-481
> Web: http://www.bitplan.de
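P.S. To make the de-duplication point above concrete, here is a minimal sketch of the two counting strategies over 4 TDB result streams. The Mosaic extension was internal and never released, so the Quad record below is a hypothetical stand-in for Jena's quad type; only the stream logic is the point.

```java
import java.util.List;
import java.util.stream.Stream;

public class UnionCount {
    // Hypothetical stand-in for a quad (graph, subject, predicate, object);
    // records give us equals()/hashCode() for free, which distinct() needs.
    public record Quad(String g, String s, String p, String o) {}

    // COUNT(...) over the union of several TDBs: the same quad may appear
    // in more than one TDB, so a .distinct() pass is required for a
    // correct count.
    public static long countDistinct(List<Stream<Quad>> perTdbResults) {
        return perTdbResults.stream()
                .flatMap(s -> s)   // merge the per-TDB streams into one
                .distinct()        // drop duplicates across TDBs
                .count();
    }

    // If preprocessing guarantees each quad lands in exactly one TDB,
    // the distinct() pass can be skipped and the counts simply summed.
    public static long countPartitioned(List<Stream<Quad>> perTdbResults) {
        return perTdbResults.stream().mapToLong(Stream::count).sum();
    }

    public static void main(String[] args) {
        Quad a = new Quad("g", "s", "p", "o1");
        Quad b = new Quad("g", "s", "p", "o2");
        // quad 'a' appears in two TDBs, so a naive sum would count it twice
        List<Stream<Quad>> streams = List.of(Stream.of(a, b), Stream.of(a));
        System.out.println(countDistinct(streams)); // prints 2
    }
}
```

The win from the preprocessing step is exactly the removed .distinct(): distinct() has to buffer every quad seen so far, which is expensive at multi-billion-quad scale, whereas the partitioned variant is a streaming sum.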
