Hi,
Is it expected that tdbupdate deletes take much longer than inserts?
For inserts I divided my set into batches of 500 000 triples, and I
insert 1 batch in 5 minutes. For deletes I created a batch of 100 000
triples; it has been running the whole night and is still running. Can this
be tuned? I am now experimenting with smaller batches for deletes.
Here are my results so far:
batch of 1000 triples - 2 min/batch
batch of 2000 triples - 2.5 min/batch
batch of 4000 triples - 6 min/batch
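
To put these timings side by side, here is a small back-of-the-envelope sketch that converts them to triples per second and projects how long the full delete set (11 000 000 triples, from my earlier message) would take at the best delete rate observed so far. The dictionary labels and the function name are only illustrative. It gives roughly 1667 triples/s for inserts against 8 to 13 triples/s for deletes.

```python
# Throughput implied by the reported timings, in triples per second.
# Each entry is (triples per batch, minutes per batch).
runs = {
    "insert, 500000/batch": (500_000, 5.0),
    "delete, 1000/batch": (1_000, 2.0),
    "delete, 2000/batch": (2_000, 2.5),
    "delete, 4000/batch": (4_000, 6.0),
}


def triples_per_second(triples, minutes):
    """Convert a batch size and a duration in minutes to triples per second."""
    return triples / (minutes * 60.0)


rates = {name: triples_per_second(t, m) for name, (t, m) in runs.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.1f} triples/s")

# Rough projection for the whole 11 000 000-triple delete set,
# assuming the best delete rate seen so far stays constant.
best_delete = max(r for name, r in rates.items() if name.startswith("delete"))
days = 11_000_000 / best_delete / 86_400
print(f"projected delete time: {days:.1f} days")
```

At these rates the complete delete set would take on the order of nine to ten days, which is why I am asking about tuning.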
I have enabled GC logging and I see that GC runs very often.
Can JVM tuning significantly improve the performance of tdbupdate deletes?
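
For reference, this is a minimal sketch of how the small delete batches can be generated. It assumes the triples to delete are in an N-Triples file with one triple per line and no blank nodes (DELETE DATA does not allow them). The function name, the `delete-NNNNN.ru` file naming, and the default batch size are hypothetical choices for illustration, not anything prescribed by tdbupdate.

```python
from itertools import islice
from pathlib import Path


def write_delete_batches(triples_path, out_dir, batch_size=2000):
    """Split an N-Triples file into SPARQL Update files of batch_size triples each."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    with open(triples_path, encoding="utf-8") as src:
        # Skip blank lines and comments; every remaining line is one triple.
        stripped = (line.strip() for line in src)
        triples = (line for line in stripped if line and not line.startswith("#"))
        index = 0
        while True:
            batch = list(islice(triples, batch_size))
            if not batch:
                break
            # Hypothetical naming scheme: delete-00000.ru, delete-00001.ru, ...
            target = out / f"delete-{index:05d}.ru"
            body = "DELETE DATA {\n" + "\n".join(batch) + "\n}\n"
            target.write_text(body, encoding="utf-8")
            written.append(target)
            index += 1
    return written
```

Each generated file holds a single `DELETE DATA { ... }` request and can be passed to tdbupdate one at a time, so every batch runs as its own update.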


2014-06-17 15:37 GMT+01:00 Andy Seaborne <[email protected]>:

> On 17/06/14 09:17, Ewa Szwed wrote:
>
>> Hi,
>> I have a question about the performance of the tdbupdate tool.
>> I would like to update my Jena TDB store, which hosts Freebase data, with
>> the delta that I calculated between 2 consecutive Freebase dumps.
>> I have 11 000 000 triples for deletion and 22 000 000 triples for
>> insertion.
>> My approach is to divide these big sets into smaller batches and use
>> tdbupdate with files passed as a parameter for the SPARQL deletes and
>> SPARQL inserts.
>> I would expect the performance to be comparable to tdbloader, but this is
>> not the case. Can this be improved? Can I use tdbloader instead of
>> tdbupdate for the inserts?
>> I would appreciate any comments.
>>
>>
> tdbloader only makes a difference when loading an empty dataset -
> tdbloader knows how to manipulate the indexes in a better fashion than
> simply letting the indexes incrementally update, and tdbloader2 knows how to
> build the index B+trees directly.
>
> tdbupdate is applying SPARQL updates and does not exploit the
> implementation of TDB.
>
> Normally, I'd suggest considering working with N-Quads and doing insert
> and delete on those (if no bnodes) but the reload size of Freebase makes
> that a bit of a nuisance.
>
> With 11e6 and 22e6 triples, batching in smaller groups is going to be a good
> idea; otherwise the transaction mechanism will struggle (there is
> spill-to-disk, see TDB.transactionJournalWriteBlockMode, but it will be slow
> because there is a disk involved).
>
> Having the database on an SSD will make a big difference. It greatly
> speeds up transactions.
>
>                 Andy
>
>
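
As I read the N-Quads suggestion above, the delta would be applied to the text dump itself and the result reloaded into an empty dataset. A minimal sketch of that idea follows, assuming there are no blank nodes and that the dump and the delta files serialize each quad identically, so that plain line comparison is valid. The function name is hypothetical, and holding everything in memory like this is only realistic for small files; a Freebase-sized dump would need a sort-based or streaming variant.

```python
def apply_delta(dump_lines, delete_lines, insert_lines):
    """Return the dump lines minus deletions, followed by insertions not already present."""
    to_delete = {line.strip() for line in delete_lines if line.strip()}
    seen = set()
    result = []
    # Keep every dump line that is not marked for deletion, dropping duplicates.
    for line in dump_lines:
        line = line.strip()
        if line and line not in to_delete and line not in seen:
            seen.add(line)
            result.append(line)
    # Append insertions that the dump does not already contain.
    for line in insert_lines:
        line = line.strip()
        if line and line not in seen:
            seen.add(line)
            result.append(line)
    return result
```

The returned lines form the new dump, which could then be bulk-loaded from scratch rather than applied as incremental updates.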
