On 17/06/14 09:17, Ewa Szwed wrote:
Hi,
I have a question about the performance of the tdbupdate tool.
I would like to update my Jena TDB store that hosts freebase data with the
delta that I calculated between 2 consecutive freebase dumps.
I have 11 000 000 triples for deletion and 22 000 000 triples for
insertions.
My approach is that I divide these big sets into smaller batches - and use
tdbupdate with files passed as param for sparql deletes and sparql inserts.
I would expect the performance to be comparable to tdbloader, but this is
not the case. Can this be improved? Can I use tdbloader instead of
tdbupdate for the inserts?
I would appreciate any comments.


tdbloader only makes a difference when loading into an empty dataset. It knows how to manipulate the indexes in a better fashion than simply letting them update incrementally, and tdbloader2 knows how to build the index B+trees directly.

tdbupdate applies SPARQL updates and does not exploit the implementation of TDB.

Normally, I'd suggest working with N-Quads, doing the inserts and deletes on those (if there are no bnodes) and reloading, but the reload size of Freebase makes that a bit of a nuisance.
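For what it's worth, the file-level approach could look roughly like the sketch below. It is only an illustration under some assumptions: the dump and both delta files are line-oriented N-Quads (or N-Triples) with identical serialization of each statement, there are no bnodes, and the set of deletions fits in memory. The function name and file arguments are made up for the example.

```python
def apply_delta(dump_path, deletes_path, inserts_path, out_path):
    """Write (dump minus deletions) followed by insertions; return lines written.

    Statements are compared as exact text lines, so this only works when
    all files serialize a given statement identically (assumption).
    """
    # Hold the deletions in memory so the (much larger) dump can be streamed.
    with open(deletes_path, encoding="utf-8") as f:
        to_delete = {line.strip() for line in f if line.strip()}

    written = 0
    with open(out_path, "w", encoding="utf-8") as out:
        # Copy every dump statement that is not marked for deletion.
        with open(dump_path, encoding="utf-8") as dump:
            for line in dump:
                stmt = line.strip()
                if stmt and stmt not in to_delete:
                    out.write(stmt + "\n")
                    written += 1
        # Append the insertions; a duplicate statement is harmless
        # because loading it twice still yields one triple/quad.
        with open(inserts_path, encoding="utf-8") as ins:
            for line in ins:
                stmt = line.strip()
                if stmt:
                    out.write(stmt + "\n")
                    written += 1
    return written
```

The resulting file would then be bulk-loaded into a fresh, empty database, which is the case where the loaders help.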

With 11e6 deletions and 22e6 insertions, batching into smaller groups is a good idea. Otherwise the transaction mechanism has to hold a very large set of pending changes. There is spill-to-disk (see TDB.transactionJournalWriteBlockMode), but it will be slow because a disk is involved.
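As a rough illustration of the batching, a script along these lines could split a delta file into SPARQL Update files of bounded size. It is a sketch, not a tested recipe: it assumes the delta is in N-Triples with one triple per line and no bnodes (DELETE DATA does not allow them), and the function name, file naming scheme and default batch size are arbitrary choices for the example.

```python
from itertools import islice
from pathlib import Path


def write_update_batches(nt_path, out_dir, operation="INSERT", batch_size=100_000):
    """Split an N-Triples file into SPARQL Update files of at most batch_size triples.

    Returns the list of files written, in order.
    """
    if operation not in ("INSERT", "DELETE"):
        raise ValueError("operation must be INSERT or DELETE")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    written = []
    with open(nt_path, encoding="utf-8") as src:
        # Drop blank lines and N-Triples comment lines.
        triples = (line.strip() for line in src)
        triples = (t for t in triples if t and not t.startswith("#"))
        while True:
            batch = list(islice(triples, batch_size))
            if not batch:
                break
            target = out_dir / f"{operation.lower()}-{len(written):05d}.ru"
            body = "\n".join(batch)
            # N-Triples lines can be used as-is inside an "... DATA { }" block.
            target.write_text(f"{operation} DATA {{\n{body}\n}}\n", encoding="utf-8")
            written.append(target)
    return written
```

Each file can then be given to tdbupdate on its own (for example through its --update option), so that every batch is committed as a separate, smaller transaction. The best batch size would have to be found by experiment.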

Having the database on an SSD will make a big difference. It greatly speeds up transactions.

                Andy
