Hi, I migrated my Freebase Jena TDB setup to an SSD host and continued my tdbupdate delete experiments. The expectation was that transaction times would improve, and I did see an improvement:
batch 5000 triples - 1 min 20 sec / batch

Now when I reconfigure my batch size to 8000, to see whether the time stays at this level for a single batch, I get:

WARN  TDB :: Transaction not commited or aborted: Transaction: 1 : Mode=WRITE : State=ACTIVE : /data/servers/freebase_data/
Exception in thread "main" java.lang.StackOverflowError
        at org.apache.jena.atlas.iterator.RepeatApplyIterator.hasNext(RepeatApplyIterator.java:42)
        at com.hp.hpl.jena.tdb.solver.SolverLib$IterAbortable.hasNext(SolverLib.java:197)
        at org.apache.jena.atlas.iterator.RepeatApplyIterator.hasNext(RepeatApplyIterator.java:46)
        at com.hp.hpl.jena.tdb.solver.SolverLib$IterAbortable.hasNext(SolverLib.java:197)
        at org.apache.jena.atlas.iterator.RepeatApplyIterator.hasNext(RepeatApplyIterator.java:46)

Is this a bug? Please advise how I should proceed.

2014-06-18 14:40 GMT+01:00 Ewa Szwed <[email protected]>:

> Hi,
> Is it expected that tdbupdate deletes take much longer than inserts?
> For inserts I divided my set into batches of 500 000 triples, and I
> insert one batch in 5 minutes. For deletes I created a batch of 100 000
> triples; it ran all night and was still running. Can this be tuned?
> Now I am experimenting with small batches for deletes.
> Here are my results so far:
>   batch 1000 triples - 2 mins/batch
>   batch 2000 triples - 2.5 mins/batch
>   batch 4000 triples - 6 mins/batch
> I have enabled the GC log and I see that GC runs very often.
> Can JVM tuning improve the performance of tdbupdate deletes much?
>
>
> 2014-06-17 15:37 GMT+01:00 Andy Seaborne <[email protected]>:
>
>> On 17/06/14 09:17, Ewa Szwed wrote:
>>
>>> Hi,
>>> I have a question about the performance of the tdbupdate tool.
>>> I would like to update my Jena TDB store, which hosts Freebase data,
>>> with the delta I calculated between two consecutive Freebase dumps.
>>> I have 11 000 000 triples for deletion and 22 000 000 triples for
>>> insertion.
>>> My approach is to divide these big sets into smaller batches and use
>>> tdbupdate with the files passed as parameters for the SPARQL deletes
>>> and SPARQL inserts.
>>> I would expect the performance to be comparable to tdbloader, but
>>> this is not the case. Can this be improved? Can I use tdbloader
>>> instead of tdbupdate for the insert part of the update?
>>> I appreciate any comments.
>>>
>>
>> tdbloader only makes a difference when loading an empty dataset -
>> tdbloader knows how to manipulate the indexes in a better fashion than
>> simply letting the indexes update incrementally, and tdbloader2 knows
>> how to build the index B+trees directly.
>>
>> tdbupdate applies SPARQL updates and does not exploit the
>> implementation of TDB.
>>
>> Normally, I'd suggest considering working with N-Quads and doing the
>> insert and delete on those (if there are no bnodes), but the reload
>> size of Freebase makes that a bit of a nuisance.
>>
>> With 11e6 and 22e6 triples, batching in smaller groups is going to be
>> a good idea; otherwise the transaction mechanism will hit its limits
>> (there is spill-to-disk - see TDB.transactionJournalWriteBlockMode -
>> but it will be slow because there is a disk involved).
>>
>> Having the database on an SSD will make a big difference; it greatly
>> speeds up transactions.
>>
>> Andy
>>
>>
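Since the thread revolves around splitting the delete set into fixed-size batches, here is a minimal sketch of that step. It generates a small sample file in place of the real 11M-triple delete set; the file names, the 5000-triple batch size, and the commented tdbupdate invocation are illustrative assumptions, not a tested recipe. Note that DELETE DATA cannot contain blank nodes, which matches Andy's "if no bnodes" caveat:

```shell
# Illustrative only: stand-in for the real delete set (plain N-Triples, no bnodes).
seq 1 12000 | sed 's|.*|<http://example/s&> <http://example/p> "o" .|' > deletes.nt

BATCH=5000                              # triples per transaction; tune experimentally
split -l "$BATCH" deletes.nt batch-     # -> batch-aa, batch-ab, batch-ac

# Wrap each chunk as a SPARQL Update request (N-Triples is valid inside DELETE DATA).
for f in batch-??; do
    { echo 'DELETE DATA {'; cat "$f"; echo '}'; } > "$f.ru"
    rm "$f"
done

# Then run one transaction per file, e.g.:
#   for u in batch-*.ru; do tdbupdate --loc=/data/servers/freebase_data --update="$u"; done
```

Keeping each file a separate tdbupdate run keeps each batch in its own write transaction, which is the behaviour the timing experiments above depend on.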
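On the JVM-tuning question raised above: assuming tdbupdate is launched via the command scripts shipped in the Apache Jena distribution, those scripts pick up the JVM_ARGS environment variable, so the heap and GC-logging flags can be set without editing anything (if you invoke java directly instead, pass the flags on the java command line). The heap size and log path here are arbitrary illustrations, not recommendations:

```shell
# Assumption: tdbupdate is run via the Jena bin/ scripts, which honour JVM_ARGS.
# -Xmx raises the maximum heap; the GC flags keep the GC logging already in use.
export JVM_ARGS="-Xmx4G -verbose:gc -Xloggc:/tmp/tdbupdate-gc.log"

# Then run the batches as before, e.g.:
#   tdbupdate --loc=/data/servers/freebase_data --update=batch-aa.ru
```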
