Hi,

Is it expected that tdbupdate deletes take much longer than inserts?

For inserts, I divided my set into batches of 500 000 triples, and each
batch takes about 5 minutes to insert. For deletes, I created a batch of
100 000 triples; it has been running all night and is still running.
Can this be tuned?

I am now experimenting with smaller batches for deletes. Here are my
results so far:

batch of 1000 triples - 2 mins/batch
batch of 2000 triples - 2.5 mins/batch
batch of 4000 triples - 6 mins/batch

I have enabled GC logging and I see that GC runs very often. Can JVM
tuning significantly improve the performance of tdbupdate deletes?
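
To put the timings above on a common scale, here is a rough sketch that
converts them to triples per second and extrapolates to the 11 000 000
deletes mentioned in the original question. The function names are made up
for illustration, and the extrapolation assumes the per-batch time stays
constant over the whole run, which may well not hold.

```python
# Figures quoted in this thread: (triples per batch, minutes per batch).
INSERT_BATCH = (500_000, 5.0)
DELETE_BATCHES = [(1_000, 2.0), (2_000, 2.5), (4_000, 6.0)]
TOTAL_DELETES = 11_000_000  # size of the delete set in the original question


def triples_per_second(triples, minutes):
    """Average throughput for one batch."""
    return triples / (minutes * 60.0)


def estimated_days(total, rate):
    """Days needed for `total` triples at `rate` triples/s (assumes a constant rate)."""
    return total / rate / 86_400.0


if __name__ == "__main__":
    print(f"insert: {triples_per_second(*INSERT_BATCH):.0f} triples/s")
    for size, minutes in DELETE_BATCHES:
        rate = triples_per_second(size, minutes)
        days = estimated_days(TOTAL_DELETES, rate)
        print(f"delete batch {size}: {rate:.1f} triples/s, ~{days:.1f} days in total")
```

On these numbers the 2000-triple batches give the best delete throughput
of the three sizes tried, but it is still roughly two orders of magnitude
below the insert rate.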
2014-06-17 15:37 GMT+01:00 Andy Seaborne <[email protected]>:

> On 17/06/14 09:17, Ewa Szwed wrote:
>
>> Hi,
>> I have a question about the performance of the tdbupdate tool.
>> I would like to update my Jena TDB store that hosts Freebase data with
>> the delta that I calculated between 2 consecutive Freebase dumps.
>> I have 11 000 000 triples for deletion and 22 000 000 triples for
>> insertion.
>> My approach is to divide these big sets into smaller batches and use
>> tdbupdate with files passed as a parameter for SPARQL deletes and
>> SPARQL inserts.
>> I would expect the performance to be comparable to tdbloader, but this
>> is not the case. Can this be improved? Can I use tdbloader instead of
>> tdbupdate for the insert updates?
>> I appreciate any comment.
>>
>
> tdbloader only makes a difference when loading an empty dataset -
> tdbloader knows how to manipulate the indexes in a better fashion than
> simply letting the indexes incrementally update, and tdbloader2 knows
> how to build the index B+trees directly.
>
> tdbupdate is applying SPARQL updates and does not exploit the
> implementation of TDB.
>
> Normally, I'd suggest considering working with N-Quads and doing insert
> and delete on those (if no bnodes) but the reload size of Freebase makes
> that a bit of a nuisance.
>
> With 11e6 and 22e6, batching in smaller groups is going to be a good
> idea otherwise the transaction mechanism will (there is spill-to-disk
> (see TDB.transactionJournalWriteBlockMode) but it will be slow because
> there is a disk).
>
> Having the database on an SSD will make a big difference. It greatly
> speeds up transactions.
>
> Andy
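
The batching approach discussed in this thread could be scripted roughly
as follows. This is only a sketch under several assumptions: the delta is
available as an N-Triples file with one triple per line, it contains no
blank nodes (SPARQL `DELETE DATA` does not allow them), and the N-Triples
lines can be used unchanged inside a SPARQL update. The function names,
the output file naming and the default batch size are hypothetical.

```python
from pathlib import Path


def read_triples(path):
    """Yield non-empty, non-comment lines from an N-Triples file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line and not line.startswith("#"):
                yield line


def batches(items, size):
    """Group an iterable into lists of at most `size` items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def write_delete_batches(nt_path, out_dir, size=1000):
    """Write one SPARQL DELETE DATA file per batch and return the paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for index, batch in enumerate(batches(read_triples(nt_path), size)):
        target = out / f"delete-{index:06d}.ru"
        body = "\n".join(batch)
        # Each file is a single update request, so it can be applied on its own.
        target.write_text(f"DELETE DATA {{\n{body}\n}}\n", encoding="utf-8")
        written.append(target)
    return written
```

Each generated file can then be passed to tdbupdate one at a time, so the
batch size can be varied without regenerating the delta itself.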
