Re: tdbupdate vs tdbloader

Andy Seaborne Tue, 17 Jun 2014 07:38:29 -0700

On 17/06/14 09:17, Ewa Szwed wrote:

Hi,
I have a question about the performance of the tdbupdate tool.
I would like to update my Jena TDB store that hosts freebase data with the
delta that I calculated between 2 consecutive freebase dumps.
I have 11 000 000 triples for deletion and 22 000 000 triples for
insertions.
My approach is that I divide these big sets into smaller batches - and use
tdbupdate with files passed as param for sparql deletes and sparql inserts.
I would expect the performance to be comparable to tdbloader, but this is
not the case. Can this be improved? Can I use tdbloader instead of
tdbupdate for the inserts?
I would appreciate any comments.

tdbloader only makes a difference when loading an empty dataset - tdbloader knows how to manipulate the indexes in a better fashion than simply letting the indexes incrementally update, and tdbloader2 knows how to build the index B+trees directly.

tdbupdate is applying SPARQL updates and does not exploit the implementation of TDB.

Normally, I'd suggest considering working with N-Quads and doing insert and delete on those (if no bnodes) but the reload size of Freebase makes that a bit of a nuisance.
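To make the N-Quads route concrete, here is a minimal sketch of applying a delta to a dump as plain text. It assumes one statement per line, no bnodes, and that the dump and the delta files serialize each statement identically, since lines are compared as strings. The function name and file arguments are hypothetical, and holding 11e6 deletions in a Python set needs a fair amount of memory.

```python
def apply_delta(old_path, deletes_path, inserts_path, new_path):
    """Write new_path = (old - deletes) + inserts, comparing lines as text.

    Assumption: every file holds one N-Quads/N-Triples statement per line,
    written the same way in all files, and no blank nodes are used.
    Returns (statements kept from old, statements appended).
    """
    # Load the deletions into a set for fast membership tests.
    with open(deletes_path, encoding="utf-8") as f:
        deletes = {line.strip() for line in f if line.strip()}

    kept = 0
    added = 0
    with open(new_path, "w", encoding="utf-8") as out:
        # Stream the old dump and drop every statement marked for deletion.
        with open(old_path, encoding="utf-8") as old:
            for line in old:
                stmt = line.strip()
                if stmt and stmt not in deletes:
                    out.write(stmt + "\n")
                    kept += 1
        # Append the insertions; duplicates are harmless in an RDF load.
        with open(inserts_path, encoding="utf-8") as ins:
            for line in ins:
                stmt = line.strip()
                if stmt:
                    out.write(stmt + "\n")
                    added += 1
    return kept, added
```

The resulting file would then be loaded into a fresh, empty database with tdbloader, which is where the reload size becomes the cost.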

With 11e6 and 22e6, batching in smaller groups is going to be a good idea; otherwise the transaction mechanism will have to hold a very large set of changes (there is spill-to-disk, see TDB.transactionJournalWriteBlockMode, but it will be slow because a disk is involved).
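As a rough sketch of the batching, the helper below splits an N-Triples delta file into SPARQL Update files of a fixed size, each holding one INSERT DATA or DELETE DATA request. The function name, the output file naming and the default batch size of 100000 are illustrative assumptions, not tuned values; it also assumes the delta contains no bnodes, which DELETE DATA does not allow.

```python
from itertools import islice
from pathlib import Path


def write_update_batches(triples_path, out_dir, operation="INSERT", batch_size=100_000):
    """Split an N-Triples file into SPARQL Update files of at most batch_size triples.

    operation is "INSERT" or "DELETE"; each output file holds a single
    "<operation> DATA { ... }" request. Returns the list of files written.
    """
    if operation not in ("INSERT", "DELETE"):
        raise ValueError("operation must be INSERT or DELETE")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with open(triples_path, encoding="utf-8") as src:
        # Skip blank lines and comment lines so that only triples are batched.
        triples = (
            line.strip()
            for line in src
            if line.strip() and not line.lstrip().startswith("#")
        )
        while True:
            batch = list(islice(triples, batch_size))
            if not batch:
                break
            path = out_dir / f"{operation.lower()}-{len(written):05d}.ru"
            body = f"{operation} DATA {{\n" + "\n".join(batch) + "\n}\n"
            path.write_text(body, encoding="utf-8")
            written.append(path)
    return written
```

Each generated file could then be applied in its own transaction, for example with something like `tdbupdate --loc=DB --update=insert-00000.ru` (check the options against the installed Jena version). Running the delete batches before the insert batches keeps the order of the delta.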

Having the database on an SSD will make a big difference. It greatly speeds up transactions.


                Andy
