Wouldn't it be a good idea to have a page in the Fuseki/TDB2
documentation with benchmark results and/or user-reported loading
statistics, including hardware specs?

It would also be useful to map such specs to the AWS instance types:
https://aws.amazon.com/ec2/instance-types/

On Mon, Jun 8, 2020 at 11:43 PM Andy Seaborne <[email protected]> wrote:
>
> Hi Johannes,
>
> On 08/06/2020 16:54, Hoffart, Johannes wrote:
> > Hi,
> >
> > I want to load the full Wikidata dump, available at 
> > https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.bz2 to use 
> > in Jena.
> >
> > I tried it using the tdb2.tdbloader with $JVM_ARGS set to -Xmx120G. 
> > Initially, the progress (measured by dataset size) is quick. It slows down 
> > considerably after a couple of hundred GB have been written, and finally, at 
> > around 500GB, progress almost halts.
>
> Loading performance is sensitive to the hardware used: it needs plenty
> of RAM and a high-performance SSD.
>
> Setting the heap size larger actually slows the process down. The
> database indexes are cached outside the heap, in the main OS filesystem
> cache, so a heap size of 120G takes space away from that cache.
> A heap size of ~8G should be more than enough.
>
> The other factor is the storage. A large SSD, and best of all an
> M.2-connected local SSD, is significantly faster.
>
> It can be worthwhile to build the database on a machine spec'ed for
> loading and move it elsewhere for query use. The database, once built,
> can be file-copied.
>
> It will take many hours to load under optimal conditions - it has been
> reported it takes over an hour just to count the lines in the
> latest-all.ttl.bz2 file using the standard unix tools (no java in
> sight!). I'm trying to just parse the file and the parser is taking
> hours. There are a lot of warnings (you can ignore them - they are just
> warnings, not errors).
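
For a rough idea of what that line count involves, here is a minimal sketch that streams the compressed dump and counts newlines without unpacking it to disk. The function name `count_lines` and the chunk size are my own choices for illustration; on the full dump this should be expected to run for a long time, like the unix tools.

```python
import bz2


def count_lines(path, chunk_size=1 << 20):
    """Count newline bytes in a bz2-compressed file by streaming it."""
    total = 0
    # bz2.open decompresses on the fly, so only one chunk is held in memory.
    with bz2.open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += chunk.count(b"\n")
    return total
```

Usage would be something like `count_lines("latest-all.ttl.bz2")`.
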
>
> latest-truthy is significantly smaller. Try getting the process working
> with that first (it's only in NT format, but you can just load the
> prefixes taken from the TTL version separately).
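
One way to pull the prefixes out of the TTL version, as a sketch only: it assumes the prefix declarations sit near the top of the file (Turtle allows them anywhere, so this is an assumption about the dump, not a guarantee). The function name `extract_prefixes` and the `max_lines` cut-off are hypothetical.

```python
import bz2


def extract_prefixes(ttl_bz2_path, max_lines=10_000):
    """Collect @prefix / PREFIX lines from the start of a bz2-compressed Turtle file."""
    prefixes = []
    with bz2.open(ttl_bz2_path, "rt", encoding="utf-8") as f:
        for i, line in enumerate(f):
            # Assumption: declarations appear within the first max_lines lines.
            if i >= max_lines:
                break
            stripped = line.strip()
            if stripped.lower().startswith(("@prefix", "prefix ")):
                prefixes.append(stripped)
    return prefixes
```

The returned lines could be written to a small `.ttl` file and loaded alongside the NT data.
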
>
> And check the download of any of these large files - I have had it
> truncate in one attempt I made.
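
A truncated download can be caught before starting a multi-hour load. The sketch below decompresses the whole file and reports whether it ended cleanly; it is roughly what `bzip2 -t` does. The function name `bz2_is_complete` is my own, and one caveat applies: a file cut exactly at the boundary between two bz2 streams would still pass, so comparing against a published checksum is the stronger check if one is available.

```python
import bz2


def bz2_is_complete(path, chunk_size=1 << 20):
    """Return True if the file decompresses cleanly through to the end."""
    try:
        with bz2.open(path, "rb") as f:
            # Read and discard; we only care whether decompression fails.
            while f.read(chunk_size):
                pass
    except (EOFError, OSError):
        # EOFError: data ended before the end-of-stream marker (truncation).
        # OSError: the data is not valid bz2 at all.
        return False
    return True
```
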
>
>      Andy
>
> > Did anyone ingest Wikidata into Jena before? What are the system 
> > requirements? Is there a specific tdb2.tdbloader configuration that would 
> > speed things up? For example building an index after data ingest?
>
> tdb2.tdbloader has options for loader algorithm. --loader=parallel is
> probably fastest if you have the SSD space.
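
Putting the heap and loader advice together, here is a small wrapper sketch. It assumes `tdb2.tdbloader` is on the PATH and that the Jena script picks up `JVM_ARGS` from the environment, as in the original post; the database directory and file names are placeholders, and the helper names are hypothetical.

```python
import os
import subprocess


def build_loader_command(db_dir, data_files, loader="parallel"):
    """Build the tdb2.tdbloader argument list for the given database directory."""
    return ["tdb2.tdbloader", f"--loader={loader}", "--loc", db_dir, *data_files]


def run_loader(db_dir, data_files, heap="8G", loader="parallel"):
    """Run the loader with a modest heap, leaving RAM for the OS file cache."""
    env = dict(os.environ, JVM_ARGS=f"-Xmx{heap}")
    cmd = build_loader_command(db_dir, data_files, loader)
    return subprocess.run(cmd, env=env, check=True)
```

For example, `run_loader("/data/wikidata-tdb2", ["latest-truthy.nt.bz2"])` would start a parallel load with an 8G heap.
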
>
> >
> > Thanks
> > Johannes
> >
> > Johannes Hoffart, Executive Director, Technology Division
> > Goldman Sachs Bank Europe SE | Marienturm | Taunusanlage 9-10 | D-60329 
> > Frankfurt am Main
> > Email: [email protected]<mailto:[email protected]> | Tel: +49 
> > (0)69 7532 3558
> > Vorstand: Dr. Wolfgang Fink (Vorsitzender) | Thomas Degn-Petersen | Dr. 
> > Matthias Bock
> > Vorsitzender des Aufsichtsrats: Dermot McDonogh
> > Sitz: Frankfurt am Main | Amtsgericht Frankfurt am Main HRB 114190
