Daniel,

Like most things in computing, this is a performance trade-off. Being able to remove nodes from the node table when they are no longer used would require one of two approaches in TDB:

1 - Reference counting of node usages
2 - Post-deletion scans for node usage

Both of these approaches would reduce the overall performance of TDB, while also potentially increasing the up-front minimum memory footprint.

Even if either of these approaches were implemented, it would still require significant re-architecting of TDB, since Node IDs are assigned in such a way that they are offsets into the Node Table. Deleting a node would therefore require reassigning every Node ID after the deleted node and rewriting all of the indices to reflect the deletion, which would render deletions in TDB so slow as to be unusable (the sketch below illustrates why). The other alternative would be a rewrite of TDB to use a hash-based node table, but again that is a significant re-architecting.
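To make the offset problem concrete, here is a purely illustrative sketch - this is NOT TDB's actual code, and the class and method names are invented for the example:

    // Purely illustrative -- not TDB's actual code. Models an append-only
    // node table in which a node's ID is its byte offset in the table.
    import java.nio.charset.StandardCharsets;
    import java.util.ArrayList;
    import java.util.List;

    class OffsetNodeTable {
        private final List<String> entries = new ArrayList<>(); // stand-in for the on-disk file
        private long nextOffset = 0;

        // Appending is cheap: the new node's ID is simply the current
        // end-of-table offset, handed out with no lookup or bookkeeping.
        long append(String node) {
            long id = nextOffset;
            entries.add(node);
            nextOffset += node.getBytes(StandardCharsets.UTF_8).length;
            return id;
        }

        // There is deliberately no delete(): physically removing an entry
        // would shift the offset of every entry after it, so every Node ID
        // already stored in the triple/quad indices would have to be
        // rewritten for the indices to remain valid.
    }

The point is that an ID is only meaningful relative to everything stored before it, which is exactly what makes appends cheap and physical deletes prohibitively expensive.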
Note that this design trade-off is not unique to TDB: many open-source and commercial RDF stores use similar approaches to encoding nodes, and thus make a similar trade-off.

If you are really concerned about the growth of the TDB dataset over time, then the workaround is to periodically dump the dataset out to N-Quads and reload it into a fresh dataset, as sketched below. Since the dump only contains nodes that are currently in use, this eliminates the storage requirements of all unused nodes.
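A minimal (untested) sketch of that dump/reload cycle using the Jena APIs - the directory paths and file name are placeholders, and the package names assume Jena 2.x:

    // Sketch of the dump/reload workaround; paths are placeholders.
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.InputStream;
    import java.io.OutputStream;

    import org.apache.jena.riot.Lang;
    import org.apache.jena.riot.RDFDataMgr;

    import com.hp.hpl.jena.query.Dataset;
    import com.hp.hpl.jena.tdb.TDBFactory;

    public class DumpReload {
        public static void main(String[] args) throws Exception {
            // 1. Dump the current dataset (all graphs) to an N-Quads file.
            Dataset old = TDBFactory.createDataset("/data/tdb");
            try (OutputStream out = new FileOutputStream("dump.nq")) {
                RDFDataMgr.write(out, old, Lang.NQUADS);
            }
            old.close();

            // 2. Load the dump into a brand-new TDB location. Only nodes
            //    actually referenced in the dump are re-created, so the
            //    unused node table entries are left behind.
            Dataset fresh = TDBFactory.createDataset("/data/tdb-fresh");
            try (InputStream in = new FileInputStream("dump.nq")) {
                RDFDataMgr.read(fresh, in, Lang.NQUADS);
            }
            fresh.close();
        }
    }

The tdbdump and tdbloader command-line tools do the same job: "tdbdump --loc=/data/tdb > dump.nq", then "tdbloader --loc=/data/tdb-fresh dump.nq".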
Rob

On 10/11/13 8:52 AM, "Daniel Gerber" <dger...@informatik.uni-leipzig.de> wrote:

> On 10.10.2013, at 12:37, Andy Seaborne <a...@apache.org> wrote:
>
>> On 10/10/13 10:37, Daniel Gerber wrote:
>>> Hi,
>>>
>>> I'm importing 20MB of data every day into a Jena TDB store. Before
>>> insertion, I'm deleting everything (model.removeAll()). But I noticed
>>> that the size of the index does not shrink, it even increases every
>>> day (it's now at 11GB and will soon hit physical limits). I found
>>> this question [1] on Stack Overflow but could not find any mailing
>>> list entry (so sorry for re-asking this question). Is there any way,
>>> except deletion, to reduce the size of a Jena TDB directory/index?
>>>
>>> Cheers, Daniel
>>>
>>> [1] http://stackoverflow.com/questions/11088082/how-to-reduce-the-size-of-the-tdb-backed-jena-dataset
>>
>> Daniel,
>>
>> Your question is a good one - the full answer depends on the details
>> of your setup though.
>>
>> The indexes won't shrink - TDB never gives disk space back to the OS -
>> but disk space is reused when reallocated within the same JVM. If you
>> are deleting, stopping, restarting (hence different JVMs), then there
>> can be leaks, but it sounds like this is not the case here, as the
>> "leak" in that case can be most of the database and you'd notice!
>>
>> The other issue is blank nodes - does your data have a significant
>> number of blank nodes? If so, each load is creating new blank nodes.
>> Nodes are not garbage collected, so old blank nodes (and unused URIs
>> and literals) remain in the node table.
>
> Hi Andy,
>
> Thanks for your insights. Well, yes, I do have blank nodes. So there is
> no way of manually cleaning up the node table? I wonder how this can be
> expected behavior. Who wants to run a database that grows by hundreds
> of MB every day (while importing 200k triples)?
>
>> If you are clearing out an entire database, then closing the database
>> (and removing it from the StoreConnection manager), deleting the
>> files, and then loading - which can be done with the bulk loader - may
>> work for you.
>
> Well, I can't simply delete everything, since I do have different
> graphs inside this directory. Do you see any chance to fix the issue?
>
> Cheers,
> Daniel
>
>> [[
>> Except on MSWindows64, where it is not possible to delete
>> memory-mapped files while the JVM is running (they don't get deleted).
>> ]]
>>
>> Andy