Hi, I migrated my Freebase Jena TDB setup to an SSD host and continued my tdbupdate delete experiments. The expectation was that transaction times would improve, and I did see an improvement:
batch 5000 triples - 1 min 20 sec / batch

Now when I reconfigure my batch size to 8000, to see whether the time stays at this level for a single batch, I get:

WARN  TDB :: Transaction not commited or aborted: Transaction: 1 : Mode=WRITE : State=ACTIVE : /data/servers/freebase_data/
Exception in thread "main" java.lang.StackOverflowError
        at org.apache.jena.atlas.iterator.RepeatApplyIterator.hasNext(RepeatApplyIterator.java:42)
        at com.hp.hpl.jena.tdb.solver.SolverLib$IterAbortable.hasNext(SolverLib.java:197)
        at org.apache.jena.atlas.iterator.RepeatApplyIterator.hasNext(RepeatApplyIterator.java:46)
        at com.hp.hpl.jena.tdb.solver.SolverLib$IterAbortable.hasNext(SolverLib.java:197)
        at org.apache.jena.atlas.iterator.RepeatApplyIterator.hasNext(RepeatApplyIterator.java:46)

Is this a bug? Please advise how I should proceed.

2014-06-18 14:40 GMT+01:00 Ewa Szwed <[email protected]>:

> Hi,
> Is it expected that tdbupdate deletes take much longer than inserts?
> For inserts I divided my set into batches of 500 000 triples, and I
> insert one batch in 5 minutes. For deletes I created a batch of 100 000
> triples; it ran all night and was still running. Can this be tuned?
> Now I am experimenting with small batches for deletes.
> Here are my results so far:
>   batch 1000 triples - 2 mins/batch
>   batch 2000 triples - 2.5 mins/batch
>   batch 4000 triples - 6 mins/batch
> I have enabled the GC log and I see that GC runs very often.
> Can JVM tuning improve the performance of tdbupdate deletes much?
>
>
> 2014-06-17 15:37 GMT+01:00 Andy Seaborne <[email protected]>:
>
>> On 17/06/14 09:17, Ewa Szwed wrote:
>>
>>> Hi,
>>> I have a question about the performance of the tdbupdate tool.
>>> I would like to update my Jena TDB store, which hosts Freebase data,
>>> with the delta I calculated between two consecutive Freebase dumps.
>>> I have 11 000 000 triples for deletion and 22 000 000 triples for
>>> insertion.
>>> My approach is to divide these big sets into smaller batches and use
>>> tdbupdate with the files passed as parameters for the SPARQL deletes
>>> and SPARQL inserts.
>>> I would expect the performance to be comparable to tdbloader, but
>>> this is not the case. Can this be improved? Can I use tdbloader
>>> instead of tdbupdate for the insert part of the update?
>>> I appreciate any comments.
>>>
>>
>> tdbloader only makes a difference when loading an empty dataset -
>> tdbloader knows how to manipulate the indexes in a better fashion than
>> simply letting the indexes update incrementally, and tdbloader2 knows
>> how to build the index B+trees directly.
>>
>> tdbupdate applies SPARQL updates and does not exploit the
>> implementation of TDB.
>>
>> Normally, I'd suggest considering working with N-Quads and doing the
>> insert and delete on those (if there are no bnodes), but the reload
>> size of Freebase makes that a bit of a nuisance.
>>
>> With 11e6 and 22e6 triples, batching in smaller groups is going to be
>> a good idea; otherwise the transaction mechanism will hit its limits
>> (there is spill-to-disk - see TDB.transactionJournalWriteBlockMode -
>> but it will be slow because there is a disk involved).
>>
>> Having the database on an SSD will make a big difference; it greatly
>> speeds up transactions.
>>
>> Andy
>>
>>
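Since the thread revolves around splitting the delete set into fixed-size batches, here is a minimal sketch of that step. It generates a small sample file in place of the real 11M-triple delete set; the file names, the 5000-triple batch size, and the commented tdbupdate invocation are illustrative assumptions, not a tested recipe. Note that DELETE DATA cannot contain blank nodes, which matches Andy's "if no bnodes" caveat:

```shell
# Illustrative only: stand-in for the real delete set (plain N-Triples, no bnodes).
seq 1 12000 | sed 's|.*|<http://example/s&> <http://example/p> "o" .|' > deletes.nt

BATCH=5000                              # triples per transaction; tune experimentally
split -l "$BATCH" deletes.nt batch-     # -> batch-aa, batch-ab, batch-ac

# Wrap each chunk as a SPARQL Update request (N-Triples is valid inside DELETE DATA).
for f in batch-??; do
    { echo 'DELETE DATA {'; cat "$f"; echo '}'; } > "$f.ru"
    rm "$f"
done

# Then run one transaction per file, e.g.:
#   for u in batch-*.ru; do tdbupdate --loc=/data/servers/freebase_data --update="$u"; done
```

Keeping each file a separate tdbupdate run keeps each batch in its own write transaction, which is the behaviour the timing experiments above depend on.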
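On the JVM-tuning question raised above: assuming tdbupdate is launched via the command scripts shipped in the Apache Jena distribution, those scripts pick up the JVM_ARGS environment variable, so the heap and GC-logging flags can be set without editing anything (if you invoke java directly instead, pass the flags on the java command line). The heap size and log path here are arbitrary illustrations, not recommendations:

```shell
# Assumption: tdbupdate is run via the Jena bin/ scripts, which honour JVM_ARGS.
# -Xmx raises the maximum heap; the GC flags keep the GC logging already in use.
export JVM_ARGS="-Xmx4G -verbose:gc -Xloggc:/tmp/tdbupdate-gc.log"

# Then run the batches as before, e.g.:
#   tdbupdate --loc=/data/servers/freebase_data --update=batch-aa.ru
```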
