I've just finished downloading the Wikidata latest-truthy.nt.gz (39 GB) and decompressing it (605 GB) in ~10 hours using Ubuntu 19.10 on a Raspberry Pi 4 with a USB3 1 TB HDD.
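The de-duplication step on a file this size is usually a single external sort rather than a separate sort-then-uniq pass. A rough sketch (file names, memory budget and temp path are hypothetical, not from the thread):

```shell
# Rough sketch: de-duplicate a decompressed N-Triples dump with an
# external sort.  LC_ALL=C forces fast byte-wise comparison, -u drops
# duplicate lines, -S caps the in-memory buffer, and -T keeps the
# temporary spill files on the large external disk.
LC_ALL=C sort -u -S 2G -T /mnt/usb/tmp \
    latest-truthy.nt > latest-truthy.dedup.nt
```

`sort -u` is equivalent to `sort | uniq` but avoids the second pass; on a small single-board machine the `-S` and `-T` settings dominate the runtime.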
I'll update you on the sort and uniq (from memory there were not that many duplicates).

Dick

On Wed, 20 May 2020 at 11:21, Wolfgang Fahl <[email protected]> wrote:

> Thank you Dick for your response.
>
> > Basically, you need hardware!
>
> That option is very limited with my budget, and my current servers with
> 64 GByte of RAM, up to 12 cores, 4 TB 7200 rpm disks and SSDs of up to
> 512 GByte seem reasonable to me. I'd rather wait a bit longer than pay
> for hardware, especially with the risk of things crashing anyway.
>
> The splitting option you mention seems to be a lot of extra hassle, and
> I assume it is based on the approach of "import all of Wikidata".
> Currently I see that the hurdles for doing such a "full import" are very
> high. For my use case I might be able to put up with some 3-5% of
> Wikidata, since I am basically interested in what
> https://www.wikidata.org/wiki/Wikidata:Scholia offers for the
> https://projects.tib.eu/confident/ ConfIDent project.
>
> What kind of tuning besides the hardware was effective for you?
>
> Does anybody have experience with partial dumps created by
> https://tools.wmflabs.org/wdumps/?
>
> Cheers
>
> Wolfgang
>
> On 20.05.20 at 11:22, Dick Murray wrote:
> > That's a blast from the past!
> >
> > Not all of the details from that exchange are on the Jena list because
> > Laura and myself took the conversation offline...
> >
> > The short story is I imported Wikidata in 3 days using an IBM 24-core
> > 512 GB RAM server and 16 1 TB SSDs. The swap was configured as striped
> > 1 TB SSDs. Any thrashing was absorbed by the 24 cores, i.e. there were
> > plenty of cycles for the OS to be doing housekeeping, and there was a
> > lot of housekeeping!
> >
> > Basically, you need hardware!
> >
> > I managed to reduce this time to a day by performing 4 imports in
> > parallel. This was only possible because my server could absorb this
> > amount of throughput.
> > Importing in parallel resulted in 4 TDBs which were queried using a
> > beta Jena extension (known internally as Mosaic). This has its own
> > issues, such as the requirement to de-duplicate 4 streams of quads to
> > answer COUNT(...) actions, using Java streams. This led to further
> > work whereby preprocessing was performed to guarantee that each quad
> > was unique across the 4 TDBs, which meant the .distinct() could be
> > skipped in the stream processing.
> >
> > About a year ago I performed the same test on a Ryzen 2950X based
> > system, using the same disks plus 3 M.2 drives, and got similar
> > results.
> >
> > You also need to consider what bzip2 compression level was used. Wiki
> > use bzip2 because of its aggressive compression, i.e. they want to
> > reduce the compressed file as much as possible.
> >
> > On Wed, 20 May 2020 at 06:56, Wolfgang Fahl <[email protected]> wrote:
> >
> >> Dear Apache Jena users,
> >>
> >> Some 2 years ago Laura Morlaes and Dick Murray had an exchange on
> >> this list on how to influence the performance of tdbloader. The issue
> >> is currently of interest for me again in the context of trying to
> >> load some 15 billion triples from a copy of Wikidata. At
> >> http://wiki.bitplan.com/index.php/Get_your_own_copy_of_WikiData I
> >> have documented what I am trying to accomplish, and a few days ago I
> >> placed a question on Stack Overflow
> >> https://stackoverflow.com/questions/61813248/jena-tdbloader2-performance-and-limits
> >> with the following three questions:
> >>
> >> *What is proven to speed up the import without investing in extra
> >> hardware?*
> >> e.g. splitting the files, changing VM arguments, running multiple
> >> processes ...
> >>
> >> *What explains the decreasing speed at higher numbers of triples and
> >> how can this be avoided?*
> >>
> >> *What successful multi-billion triple imports for Jena do you know
> >> of, and what are the circumstances for these?*
> >>
> >> There were some 50 views on the question so far and some comments,
> >> but there is no real hint yet on what could improve things.
> >>
> >> Especially the Java VM crashes that happened with different Java
> >> environments on the Mac OS X machine are disappointing, since even
> >> with a slow speed the import would have been finished after a while,
> >> but with a crash it's a never-ending story.
> >>
> >> I am curious to learn what your experience and advice is.
> >>
> >> Yours
> >>
> >> Wolfgang
> >>
> >> --
> >>
> >> Wolfgang Fahl
> >> Pater-Delp-Str. 1, D-47877 Willich Schiefbahn
> >> Tel. +49 2154 811-480, Fax +49 2154 811-481
> >> Web: http://www.bitplan.de
>
> --
> BITPlan - smart solutions
> Wolfgang Fahl
> Pater-Delp-Str. 1, D-47877 Willich Schiefbahn
> Tel. +49 2154 811-480, Fax +49 2154 811-481
> Web: http://www.bitplan.de
> BITPlan GmbH, Willich - HRB 6820 Krefeld, Steuer-Nr.: 10258040548,
> Geschäftsführer: Wolfgang Fahl
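The parallel-import approach discussed above (several simultaneous loads into separate TDBs) can be sketched roughly as follows, assuming GNU split and Jena's tdbloader2 script are on the PATH; the chunk and directory names are hypothetical, not from the thread:

```shell
# Rough sketch: split the dump into 4 line-aligned chunks and load each
# chunk into its own TDB directory in parallel.
split -n l/4 -d latest-truthy.nt chunk_      # produces chunk_00 .. chunk_03
for i in 00 01 02 03; do
    tdbloader2 --loc "tdb_$i" "chunk_$i" &   # one background loader per chunk
done
wait                                         # block until all 4 loads finish
```

The resulting 4 TDBs then need a query layer that merges the result streams and, unless the chunks were pre-deduplicated, filters them with .distinct(), which is the Mosaic work described earlier in the thread.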
