Thompsonbry.systap added a comment. This may not be the right ticket, but I did some experimentation with the data sets referenced above, looking at parameterization of the load. Using a 2011 Intel Mac Mini with 16GB of RAM and an SSD, the total load time across all data sets is 6 hours, which works out to roughly 20k triples per second (tps) over the 429M triples loaded. The best parameters are below. This configuration uses slightly more space on disk (66G vs 60G). It uses a much smaller branching factor for the OSP index, plus the small slot optimization on the RWStore, to attempt to co-locate the scattered OSP index updates. (The updates for this index are always scattered because the inserts are always clustered on the source vertex; this is just how it works out for every application I have seen.)
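As a quick sanity check on the throughput figure above (a minimal sketch using only the 429M-triple and 6-hour numbers quoted in this comment):

```python
# Back-of-the-envelope check of the bulk load throughput quoted above.
triples = 429_000_000          # total triples loaded
hours = 6                      # total wall-clock load time
tps = triples / (hours * 3600) # triples per second
print(f"{tps:,.0f} triples/second")  # roughly 20k tps
```

This matches the "basically 20k triples per second" claim (about 19.9k tps).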
I am using a write cache with 1000 native 1M buffers. You could increase this and probably reduce the IO wait further; I would suggest trying 2000 buffers and seeing what impact it has. You should also be able to realize a performance gain by defining some Wikidata-specific vocabularies to inline frequently used URIs into 2-3 bytes. This would reduce the average stride on the statement indices, since the predicate (link type) position will then typically be 2-3 bytes. It also improves query performance somewhat, since vocabulary items do not require dictionary joins (though we do cache the frequently used terms in the lexicon relation regardless). I generally approach vocabulary definition by simply capturing the frequently used predicates for the domain. However, it is also possible to write a SPARQL query that computes the most common predicates and then feed that into the vocabulary definition process. We can repeat this experimentation again once the new data sets are ready.

#
# Note: These options are applied when the journal and the triple store are
# first created.

##
## Journal options.
##

# The backing file. This contains all your data. You want to put this someplace
# safe. The default locator will wind up in the directory from which you start
# your servlet container.
com.bigdata.journal.AbstractJournal.file=bigdata.jnl

# The persistence engine. Use 'Disk' for the WORM or 'DiskRW' for the RWStore.
com.bigdata.journal.AbstractJournal.bufferMode=DiskRW

# Setup for the RWStore recycler rather than session protection.
com.bigdata.service.AbstractTransactionService.minReleaseAge=1

# Enable group commit. See http://wiki.blazegraph.com/wiki/index.php/GroupCommit
# Note: Group commit is a beta feature in BlazeGraph release 1.5.1.
#com.bigdata.journal.Journal.groupCommit=true

com.bigdata.btree.writeRetentionQueue.capacity=4000
com.bigdata.btree.BTree.branchingFactor=128

# 200M initial extent.
com.bigdata.journal.AbstractJournal.initialExtent=209715200
com.bigdata.journal.AbstractJournal.maximumExtent=209715200

# Create namespace (triples+RDR, no-inference, no text index).
com.bigdata.rdf.sail.truthMaintenance=false
com.bigdata.rdf.store.AbstractTripleStore.quads=false
com.bigdata.rdf.store.AbstractTripleStore.statementIdentifiers=true
com.bigdata.rdf.store.AbstractTripleStore.textIndex=false
com.bigdata.rdf.store.AbstractTripleStore.axiomsClass=com.bigdata.rdf.axioms.NoAxioms
# FIXME DEFINE AND USE WIKI DATA VOCABULARY CLASS

# Bump up the branching factor for the lexicon indices on the default kb.
com.bigdata.namespace.kb.lex.com.bigdata.btree.BTree.branchingFactor=400
com.bigdata.namespace.kb.lex.ID2TERM.com.bigdata.btree.BTree.branchingFactor=800
com.bigdata.namespace.kb.lex.TERM2ID.com.bigdata.btree.BTree.branchingFactor=128

# Bump up the branching factor for the statement indices on the default kb.
com.bigdata.namespace.kb.spo.com.bigdata.btree.BTree.branchingFactor=1024
com.bigdata.namespace.kb.spo.OSP.com.bigdata.btree.BTree.branchingFactor=64
com.bigdata.namespace.kb.spo.SPO.com.bigdata.btree.BTree.branchingFactor=600

# Larger statement buffer capacity for bulk loading.
com.bigdata.rdf.sail.bufferCapacity=100000

# Override the #of write cache buffers to improve bulk load performance.
# Requires enough native heap!
com.bigdata.journal.AbstractJournal.writeCacheBufferCount=1000

# Enable the small slot optimization!
com.bigdata.rwstore.RWStore.smallSlotType=1024

Thanks,
Bryan

TASK DETAIL
https://phabricator.wikimedia.org/T92308
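Following up on the vocabulary suggestion in the comment above: the "most common predicates" can be computed with a SPARQL aggregation query as described, or offline from an N-Triples dump. Below is a minimal offline sketch; the file contents and cutoff are hypothetical, and the N-Triples parsing is deliberately naive (it relies on the predicate being the second whitespace-delimited term of each statement).

```python
# Sketch: count predicate frequencies in N-Triples data, as input to a
# vocabulary definition. This is an illustration, not a Blazegraph API.
from collections import Counter

def top_predicates(ntriples_lines, n=10):
    counts = Counter()
    for line in ntriples_lines:
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        # In N-Triples the subject and predicate contain no whitespace,
        # so splitting twice yields [subject, predicate, rest-of-line].
        parts = line.split(None, 2)
        if len(parts) == 3:
            counts[parts[1]] += 1
    return counts.most_common(n)

# Hypothetical sample data.
sample = [
    '<http://ex/s1> <http://ex/p/label> "a" .',
    '<http://ex/s2> <http://ex/p/label> "b" .',
    '<http://ex/s1> <http://ex/p/link> <http://ex/s2> .',
]
print(top_predicates(sample, 2))
# [('<http://ex/p/label>', 2), ('<http://ex/p/link>', 1)]
```

The resulting predicate list would then be fed into the vocabulary definition process so those URIs inline into 2-3 bytes.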
