On 17/06/14 09:17, Ewa Szwed wrote:
Hi,
I have a question about the performance of the tdbupdate tool.
I would like to update my Jena TDB store that hosts freebase data with the
delta that I calculated between 2 consecutive freebase dumps.
I have 11 000 000 triples for deletion and 22 000 000 triples for
insertions.
My approach is to divide these big sets into smaller batches and use
tdbupdate, with files passed as parameters, for the SPARQL deletes and inserts.
I would expect the performance to be comparable to tdbloader, but this is
not the case. Can this be improved? Can I use tdbloader instead of
tdbupdate for the inserts?
I would appreciate any comments.
tdbloader only makes a difference when loading an empty dataset -
tdbloader knows how to manipulate the indexes in a better fashion than
simply letting the indexes update incrementally, and tdbloader2 knows how
to build the index B+trees directly.
tdbupdate is applying SPARQL updates and does not exploit the
implementation of TDB.
Normally, I'd suggest considering working with N-Quads and doing insert
and delete on those (if no bnodes) but the reload size of Freebase makes
that a bit of a nuisance.
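
To illustrate the N-Quads idea on a small scale: if every statement is written one per line in the same canonical form, a delta can be applied as plain set operations on the lines, and the result reloaded into an empty dataset with tdbloader. The sketch below assumes no blank nodes and that everything fits in memory, which is not the case for a full Freebase dump; `apply_delta` is a hypothetical helper name, not part of Jena.

```python
def apply_delta(old_lines, delete_lines, insert_lines):
    """Return the updated statement lines: (old - deletes) + inserts.

    Assumes one N-Quads statement per line, serialised identically in all
    three inputs, and no blank nodes (their labels are not stable).
    """
    def normalise(lines):
        # Drop blank lines and comment lines, ignore surrounding whitespace.
        return {
            ln.strip()
            for ln in lines
            if ln.strip() and not ln.lstrip().startswith("#")
        }

    old = normalise(old_lines)
    deletes = normalise(delete_lines)
    inserts = normalise(insert_lines)
    # Sorted only to make the output deterministic.
    return sorted((old - deletes) | inserts)
```

For data of Freebase size the same idea would need sorted files and a streaming merge rather than in-memory sets.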
With 11e6 and 22e6 triples, batching into smaller groups is going to be a
good idea, otherwise the transaction mechanism will have to hold very large
transactions (there is spill-to-disk, see
TDB.transactionJournalWriteBlockMode, but it will be slow because there is
a disk involved).
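
A minimal sketch of the batching, assuming the delta is available as N-Triples lines with no blank nodes: each batch is wrapped in a SPARQL `INSERT DATA` or `DELETE DATA` block so that every update is its own, smaller transaction. The function name `sparql_batches` and the default batch size are assumptions for illustration, not tuned values.

```python
from itertools import islice


def sparql_batches(triple_lines, operation, batch_size=100_000):
    """Yield SPARQL Update strings, each wrapping up to batch_size triples.

    triple_lines: iterable of N-Triples statements, one per line.
    operation:    "INSERT DATA" or "DELETE DATA".
    """
    if operation not in ("INSERT DATA", "DELETE DATA"):
        raise ValueError("operation must be 'INSERT DATA' or 'DELETE DATA'")

    # Skip empty lines; the input is consumed lazily, one batch at a time.
    statements = (ln.strip() for ln in triple_lines if ln.strip())
    while True:
        batch = list(islice(statements, batch_size))
        if not batch:
            return
        yield operation + " {\n" + "\n".join(batch) + "\n}\n"
```

Each yielded string can be written to its own file and passed to tdbupdate, for example something like `tdbupdate --loc=DB --update=batch-0001.ru` (check the exact options against your Jena version). Running the deletes before the inserts keeps the order of the delta.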
Having the database on an SSD will make a big difference. It greatly
speeds up transactions.
Andy