Thompsonbry.systap added a comment.

This may not be the right ticket, but I did some experimentation with the data 
sets I referenced above, looking at parameterization of the load.  Using a 2011 
Intel Mac Mini with 16 GB of RAM and an SSD, the total load time across all 
data sets was 6 hours, which works out to roughly 20k triples per second (tps) 
over the 429M triples loaded.  The best parameters are below.  This 
configuration used slightly more space on disk (66 GB vs. 60 GB).  It uses a 
much smaller branching factor for the OSP index and the small-slot optimization 
on the RWStore to attempt to co-locate the scattered OSP index updates (the 
updates for this index are always scattered because the inserts are always 
clustered on the source vertex; this is just how it works out for every 
application I have seen).
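As a quick sanity check on the quoted rate (pure arithmetic, nothing 
Blazegraph-specific):

```python
# 429M triples loaded in roughly 6 hours of wall-clock time.
triples = 429_000_000
seconds = 6 * 60 * 60
rate = triples / seconds
print(f"{rate:,.0f} triples/s")  # ≈ 19,861 triples/s, i.e. ~20k tps
```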

I am using a write cache with 1000 native 1 MB buffers.  You could increase 
this and probably reduce the IO wait further.  I would suggest trying 2000 
buffers and seeing what impact that has.
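Concretely, that would be the following override in the properties file (each 
native buffer is 1 MB, so 2000 buffers needs on the order of 2 GB of native 
heap, over and above the JVM heap):

```properties
com.bigdata.journal.AbstractJournal.writeCacheBufferCount=2000
```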

You should be able to realize a performance gain by defining some Wikidata- 
specific vocabularies to inline frequently used URIs into 2-3 bytes.  This 
would reduce the average stride on the statement indices, since the predicate 
(link type) position will then typically be 2-3 bytes.  It also improves query 
performance somewhat, since vocabulary items do not require dictionary joins 
(although we cache the frequently used terms in the lexicon relation 
regardless).  I generally approach vocabulary definition by simply capturing 
the frequently used predicates for the domain.  However, it is also possible 
to write a SPARQL query that computes the most common predicates and then feed 
that into the vocabulary definition process.
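As a sketch, a query along these lines would surface the candidate predicates 
(the LIMIT of 100 is an arbitrary cutoff, not a recommendation):

```sparql
# Count predicate usage across the whole store; the top entries are the
# best candidates for inlining via a vocabulary class.
SELECT ?p (COUNT(*) AS ?uses)
WHERE { ?s ?p ?o }
GROUP BY ?p
ORDER BY DESC(?uses)
LIMIT 100
```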

We can repeat this experimentation once the new data sets are ready.

  #
  # Note: These options are applied when the journal and the triple store are
  # first created.
  
  ##
  ## Journal options.
  ##
  
  # The backing file. This contains all your data.  You want to put this
  # someplace safe.  The default locator will wind up in the directory from
  # which you start your servlet container.
  com.bigdata.journal.AbstractJournal.file=bigdata.jnl
  
  # The persistence engine.  Use 'Disk' for the WORM or 'DiskRW' for the
  # RWStore.
  com.bigdata.journal.AbstractJournal.bufferMode=DiskRW
  
  # Setup for the RWStore recycler rather than session protection.
  com.bigdata.service.AbstractTransactionService.minReleaseAge=1
  
  # Enable group commit.
  # See http://wiki.blazegraph.com/wiki/index.php/GroupCommit
  # Note: Group commit is a beta feature in BlazeGraph release 1.5.1.
  #com.bigdata.journal.Journal.groupCommit=true
  
  com.bigdata.btree.writeRetentionQueue.capacity=4000
  com.bigdata.btree.BTree.branchingFactor=128
  
  # 200M initial extent.
  com.bigdata.journal.AbstractJournal.initialExtent=209715200
  com.bigdata.journal.AbstractJournal.maximumExtent=209715200
  
  # Create namespace (triples+RDR, no-inference, no text index)
  com.bigdata.rdf.sail.truthMaintenance=false
  com.bigdata.rdf.store.AbstractTripleStore.quads=false
  com.bigdata.rdf.store.AbstractTripleStore.statementIdentifiers=true
  com.bigdata.rdf.store.AbstractTripleStore.textIndex=false
  
  com.bigdata.rdf.store.AbstractTripleStore.axiomsClass=com.bigdata.rdf.axioms.NoAxioms
  # FIXME DEFINE AND USE WIKI DATA VOCABULARY CLASS
  # Bump up the branching factor for the lexicon indices on the default kb.
  com.bigdata.namespace.kb.lex.com.bigdata.btree.BTree.branchingFactor=400
  com.bigdata.namespace.kb.lex.ID2TERM.com.bigdata.btree.BTree.branchingFactor=800
  com.bigdata.namespace.kb.lex.TERM2ID.com.bigdata.btree.BTree.branchingFactor=128
  # Bump up the branching factor for the statement indices on the default kb.
  com.bigdata.namespace.kb.spo.com.bigdata.btree.BTree.branchingFactor=1024
  com.bigdata.namespace.kb.spo.OSP.com.bigdata.btree.BTree.branchingFactor=64
  com.bigdata.namespace.kb.spo.SPO.com.bigdata.btree.BTree.branchingFactor=600
  # larger statement buffer capacity for bulk loading.
  com.bigdata.rdf.sail.bufferCapacity=100000
  # Override the #of write cache buffers to improve bulk load performance.
  # Requires enough native heap!
  com.bigdata.journal.AbstractJournal.writeCacheBufferCount=1000
  
  # Enable small slot optimization!
  com.bigdata.rwstore.RWStore.smallSlotType=1024

Thanks,
Bryan


TASK DETAIL
  https://phabricator.wikimedia.org/T92308




