Wouldn't it be a good idea to have a page in the Fuseki/TDB2 documentation with benchmark results and/or user-reported loading statistics, including hardware specs?
It would also be useful to map such specs to the AWS instance types:
https://aws.amazon.com/ec2/instance-types/

On Mon, Jun 8, 2020 at 11:43 PM Andy Seaborne <[email protected]> wrote:
>
> Hi Johannes,
>
> On 08/06/2020 16:54, Hoffart, Johannes wrote:
> > Hi,
> >
> > I want to load the full Wikidata dump, available at
> > https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.bz2 to use
> > in Jena.
> >
> > I tried it using the tdb2.tdbloader with $JVM_ARGS set to -Xmx120G.
> > Initially, the progress (measured by dataset size) is quick. It slows down
> > very much after a couple of 100GB written, and finally, at around 500GB,
> > the progress is almost halted.
>
> Loading performance is sensitive to the hardware used: large RAM and a
> high-performance SSD both help.
>
> Setting the heap size larger actually slows the process down. The
> database indexes are cached outside the heap, in the main OS filesystem
> cache, so a heap size of 120G takes space away from that cache.
> A heap size of ~8G should be more than enough.
>
> The other factor is the storage. A large SSD, and best of all an M.2
> connected local SSD, is significantly faster.
>
> It can be worthwhile to build the database on a machine spec'ed for
> loading and move it elsewhere for query use. The database, once built,
> can be file-copied.
>
> It will take many hours to load under optimal conditions - it has been
> reported it takes over an hour just to count the lines in the
> latest-all.ttl.bz2 file using the standard unix tools (no java in
> sight!). I'm trying to just parse the file and the parser is taking
> hours. There are a lot of warnings (you can ignore them - they are just
> warnings, not errors).
>
> latest-truthy is significantly smaller. Get the process working on
> that first (it's only in NT format but you can just load the prefixes
> taken from the TTL version separately).
>
> And check the download of any of these large files - I have had it
> truncate in one attempt I made.
>
> Andy
>
> > Did anyone ingest Wikidata into Jena before? What are the system
> > requirements? Is there a specific tdb2.tdbloader configuration that would
> > speed things up? For example building an index after data ingest?
>
> tdb2.tdbloader has options for loader algorithm. --loader=parallel is
> probably fastest if you have the SSD space.
>
> >
> > Thanks
> > Johannes
> >
> > Johannes Hoffart, Executive Director, Technology Division
> > Goldman Sachs Bank Europe SE | Marienturm | Taunusanlage 9-10 | D-60329
> > Frankfurt am Main
> > Email: [email protected] | Tel: +49 (0)69 7532 3558
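
To make Andy's two suggestions concrete (a small heap via JVM_ARGS and --loader=parallel), here is a minimal sketch in Python that only assembles the command line and environment; it does not run the loader. The database directory and the dump file name are placeholders I made up for illustration, and the function name build_loader_command is hypothetical.

```python
import os
import shlex


def build_loader_command(db_dir, data_file, loader="parallel", heap="8G"):
    """Return (env, argv) for a tdb2.tdbloader run; nothing is executed here."""
    env = dict(os.environ)
    # Keep the Java heap modest so the OS file cache has room for the indexes.
    env["JVM_ARGS"] = f"-Xmx{heap}"
    argv = ["tdb2.tdbloader", f"--loader={loader}", "--loc", db_dir, data_file]
    return env, argv


# Placeholder paths (assumptions, not taken from the thread).
env, argv = build_loader_command("/data/wikidata-tdb2", "latest-truthy.nt.bz2")
print("JVM_ARGS=" + env["JVM_ARGS"], shlex.join(argv))
# To actually start the load: subprocess.run(argv, env=env, check=True)
```

Keeping the command construction separate from execution makes it easy to log exactly what was run alongside the hardware specs, which is the kind of record a benchmark page would need.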

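On Andy's point about checking the download: one way to detect a truncated .bz2 file before starting a multi-hour load is to decompress it end to end and see whether the stream finishes cleanly. The sketch below uses only Python's standard bz2 module; the function name scan_bz2 is my own, and on a file the size of the full dump this pass will itself take a long time.

```python
import bz2


def scan_bz2(path, chunk_size=1 << 20):
    """Decompress `path` end to end; return (complete, newline_count)."""
    lines = 0
    try:
        with bz2.open(path, "rb") as stream:
            while True:
                chunk = stream.read(chunk_size)
                if not chunk:
                    # Reached the end of a well-formed stream.
                    return True, lines
                lines += chunk.count(b"\n")
    except EOFError:
        # The compressed data stopped before its end-of-stream marker.
        return False, lines
    except OSError:
        # The data is not valid bz2 at all.
        return False, lines


# Example call (the file name is a placeholder):
# complete, lines = scan_bz2("latest-truthy.nt.bz2")
```

The newline count comes for free during the scan, so the same pass also gives a rough triple count for N-Triples files.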