Thank you Dick for your response.

> Basically, you need hardware!

That option is very limited with my budget, and my current servers with 64 GByte RAM, up to 12 cores, 4 TB 7200 rpm disks and SSDs of up to 512 GByte seem reasonable to me. I'd rather wait a bit longer than pay for hardware, especially with the risk of things crashing anyway.
The splitting option you mention seems to be a lot of extra hassle, and I assume it is based on the approach of "import all of WikiData". Currently I see that the hurdles for doing such a "full import" are very high. For my use case I might be able to make do with some 3-5% of Wikidata, since I am basically interested in what https://www.wikidata.org/wiki/Wikidata:Scholia offers for the https://projects.tib.eu/confident/ ConfIDent project.

What kind of tuning besides the hardware was effective for you?

Does anybody have experience with partial dumps created by https://tools.wmflabs.org/wdumps/?

Cheers
  Wolfgang

On 20.05.20 at 11:22, Dick Murray wrote:
> That's a blast from the past!
>
> Not all of the details from that exchange are on the Jena list because
> Laura and myself took the conversation offline...
>
> The short story is I imported the WikiData in 3 days using an IBM 24-core
> 512GB RAM server and 16 1TB SSDs. The swap was configured to be striped
> 1TB SSDs. Any thrashing was absorbed by the 24 cores, i.e. there were
> plenty of cycles for the OS to be doing housekeeping, and there was a lot
> of housekeeping!
>
> Basically, you need hardware!
>
> I managed to reduce this time to a day by performing 4 imports in parallel.
> This was only possible because my server could absorb this amount of
> throughput.
>
> Importing in parallel resulted in 4 TDBs which were queried using a beta
> Jena extension (known as Mosaic internally). This has its own issues, such
> as the requirement to de-duplicate 4 streams of quads to answer COUNT(...)
> actions, using Java streams. This led to further work whereby preprocessing
> was performed to guarantee that each quad was unique across the 4 TDBs,
> which meant the .distinct() could be skipped in the stream processing.
>
> About a year ago I performed the same test on a Ryzen 2950X based system,
> using the same disks plus 3 M.2 drives, and received similar results.
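The de-duplication step you describe is interesting. If I understand it right, and if each per-TDB stream can be produced in sorted order, the distinct() could also be replaced by a k-way merge that needs only the previous element in memory. A minimal Python sketch of that idea (not Mosaic, and the toy quads are made up by me):

```python
import heapq
from typing import Iterable, Iterator


def merged_distinct(streams: Iterable[Iterable[str]]) -> Iterator[str]:
    """Merge several *sorted* quad streams, emitting each quad once.

    A stand-in for the de-duplication step: with sorted inputs,
    duplicates are always adjacent after the merge, so they can be
    dropped with O(1) extra memory instead of a large hash set.
    """
    previous = None
    for quad in heapq.merge(*streams):
        if quad != previous:
            yield quad
            previous = quad


# Toy example with line-based "quads"; real streams would come from
# the four TDB stores.
s1 = ["<a> <p> <b> <g> .", "<c> <p> <d> <g> ."]
s2 = ["<a> <p> <b> <g> .", "<e> <p> <f> <g> ."]
print(list(merged_distinct([s1, s2])))  # each quad appears once
```

Whether sorted streaming is feasible against the actual TDB indexes I do not know, so this is only a sketch of the general technique.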
> You also need to consider what bzip2 compression level was used to compress.
> Wiki use bzip2 because of its aggressive compression, i.e. they want to
> reduce the compressed file as much as possible.
>
> On Wed, 20 May 2020 at 06:56, Wolfgang Fahl <[email protected]> wrote:
>
>> Dear Apache Jena users,
>>
>> Some 2 years ago Laura Morales and Dick Murray had an exchange on this
>> list on how to influence the performance of tdbloader. The issue is
>> currently of interest to me again in the context of trying to load some
>> 15 billion triples from a copy of Wikidata. At
>> http://wiki.bitplan.com/index.php/Get_your_own_copy_of_WikiData I have
>> documented what I am trying to accomplish, and a few days ago I placed
>> a question on Stack Overflow
>> https://stackoverflow.com/questions/61813248/jena-tdbloader2-performance-and-limits
>> with the following three questions:
>>
>> *What is proven to speed up the import without investing in extra
>> hardware?* e.g. splitting the files, changing VM arguments, running
>> multiple processes ...
>>
>> *What explains the decreasing speed at higher numbers of triples and how
>> can this be avoided?*
>>
>> *What successful multi-billion triple imports for Jena do you know of and
>> what are the circumstances for these?*
>>
>> There were some 50 views on the question so far and some comments, but
>> there is no real hint yet on what could improve things.
>>
>> Especially the Java VM crashes that happened with different Java
>> environments on the Mac OS X machine are disappointing, since even at a
>> slow speed the import would have finished after a while, but with a
>> crash it's a never-ending story.
>>
>> I am curious to learn what your experience and advice is.
>>
>> Yours
>>
>> Wolfgang
>>
>> --
>>
>> Wolfgang Fahl
>> Pater-Delp-Str. 1, D-47877 Willich Schiefbahn
>> Tel. +49 2154 811-480, Fax +49 2154 811-481
>> Web: http://www.bitplan.de
>>
>> --
>> BITPlan - smart solutions
>> BITPlan GmbH, Willich - HRB 6820 Krefeld, Steuer-Nr.: 10258040548,
>> Geschäftsführer: Wolfgang Fahl
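P.S. Regarding the bzip2 point: rather than inflating the multi-hundred-GB dump to disk before importing, I am considering decompressing it on the fly and feeding the lines to whatever does the loading or splitting. A minimal Python sketch of the streaming part (the dump filename is hypothetical):

```python
import bz2
from typing import Iterator


def iter_dump_lines(path: str) -> Iterator[str]:
    """Stream lines from a bzip2-compressed dump without unpacking it
    to disk first; this trades import CPU time for disk space."""
    with bz2.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            yield line.rstrip("\n")


# Hypothetical usage; the real Wikidata dump is far larger:
# for line in iter_dump_lines("latest-all.nt.bz2"):
#     handle_triple(line)
```

Whether the extra decompression CPU hurts more than the saved disk I/O helps would of course need measuring on the actual hardware.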
