Thanks, Andy, for the responses. See my replies inline. One thing I noticed is that if I use SPARQL INSERT queries, the behavior is as expected.
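For concreteness, an INSERT DATA update of the kind being contrasted with putModel might look like the fragment below; it appends triples rather than replacing a graph. The URIs are made up for illustration only.

```sparql
# Appends these triples to the store; existing data is left in place.
INSERT DATA {
  <http://example.org/task/1> <http://example.org/status> "done" .
  <http://example.org/task/2> <http://example.org/status> "open" .
}
```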
On Mon, Feb 16, 2015 at 5:07 PM, Andy Seaborne <[email protected]> wrote:

> Still waiting for the answer to:
> [[
> Does the size stabilise?
> If not, do some files stabilise in size and others not?
> ]]

No, the size does not stabilize. I will have to check whether some files stabilize and others do not.

> On 16/02/15 18:14, Trevor Donaldson wrote:
>
>> I think my question got lost. Is it correct to add millions of triples to
>> the model and then persist the model once using putModel? I didn't want to
>> get a timeout or anything like that.
>
> Updates don't time out. Small numbers of millions is no big deal in a
> single append operation (INSERT DATA or POST / addModel).

*** I would be using putModel, not addModel. The reason for this is that I
need to update a model after removing certain statements.

> In your loop you have:
>
>     Resource subject = ResourceFactory.createResource("http://example.org/task/"+i);
>
> so a new URI is generated every time. Nodes are not recovered on delete
> (too expensive to reference-count them - see earlier in the thread).
>
> Batching updates may help performance.
>
> On Feb 13, 2015 5:11 PM, "Trevor Donaldson" <[email protected]> wrote:
>
>> I am using Fuseki2. I thought it manages the transactions for me. Is this
>> not the case?
>
> "Manages" in the sense that each HTTP interaction is a transaction.
> HTTP is a stateless protocol.

**** Yes, "manages" in the sense that each HTTP interaction is a transaction.
I was responding to your transaction question. I thought that Fuseki had all
the code to handle transactions. I may be wrong, though.

>> I was using DatasetFactory to interact with Fuseki.
>
> Was that meant to be DatasetAccessorFactory?

** Yes, I was typing from memory. It was meant to be DatasetAccessorFactory.
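A direct way to answer the "do some files stabilise in size?" question above is to snapshot the per-file sizes between update cycles and diff the snapshots. A minimal sketch (the database path here is an assumption, not taken from the thread):

```shell
# Record the allocated size of each file in the TDB database directory,
# one line per file, so snapshots taken between updates can be diffed.
# NOTE: the path below is an example - point DB at your own store.
DB=/path/to/databases/test_store
for f in "$DB"/*; do
  printf '%s\t%s\n' "$(du -k "$f" | cut -f1)" "$f"
done
```

Running this after each putModel cycle and comparing successive snapshots shows which files (node table vs. individual index files) keep growing.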
> Andy
>
> On Feb 13, 2015 12:10 PM, "Andy Seaborne" <[email protected]> wrote:
>
>> This may be related:
>>
>> https://issues.apache.org/jira/browse/JENA-804
>>
>> I say "may" because the exact patterns of use deeply affect the outcome.
>> In JENA-804 it is across transaction boundaries, which your "putModel"
>> isn't.
>>
>> (Are you really running without transactions?)
>>
>> Andy
>>
>> On 13/02/15 16:56, Andy Seaborne wrote:
>>
>>> Does the size stabilise?
>>> If not, do some files stabilise in size and others not?
>>>
>>> There are two places for growth:
>>>
>>> nodes - does the new data have new RDF terms in it? Old terms are not
>>> deleted, just left around to be reused, so if you are adding terms, the
>>> node table can grow. (Terms are not reference-counted - that would be
>>> very expensive for such a small data item.)
>>>
>>> TDB (current version) does not properly reuse freed-up space in indexes,
>>> but should do so within a transaction. put is delete-then-add, and some
>>> space should be reused.
>>>
>>> A proper fix to reuse space across transactions may require a database
>>> format change, but I haven't had time to work out the details. Off the
>>> top of my head, much reuse should be doable by moving the free-chain
>>> management onto the main database on a transaction, as it is the single
>>> active writer. The code is currently too cautious about old-generation
>>> readers, which I now see it need not be.
>>>
>>> Andy
>>>
>>> On 12/02/15 17:51, Trevor Donaldson wrote:
>>>
>>>> Any thoughts, anyone? If I change my model every hour with new data or
>>>> data to replace - let's say over a period of inserting years' worth of
>>>> triples - should I persist potentially millions of triples at one time
>>>> using putModel? Committing one time seems to be the only way to
>>>> mitigate the directory growing exponentially.
>>>> On Thu, Feb 12, 2015 at 9:53 AM, Trevor Donaldson <[email protected]> wrote:
>>>>
>>>>> Damian,
>>>>>
>>>>> I am using du -ksh ./* on the databases directory.
>>>>>
>>>>> I am getting:
>>>>> 25M ./test_store
>>>>>
>>>>> On Thu, Feb 12, 2015 at 9:35 AM, Damian Steer <[email protected]> wrote:
>>>>>
>>>>>> On 12/02/15 13:49, Trevor Donaldson wrote:
>>>>>>
>>>>>>> On Thu, Feb 12, 2015 at 6:32 AM, Trevor Donaldson <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I am in the middle of updating our store from RDB to TDB. I have
>>>>>>>> noticed a significant size increase in the amount of storage needed.
>>>>>>>> Currently RDB is able to hold all the data I need (4 third-party
>>>>>>>> services and 4 years of their data) and it equals ~12G. I started
>>>>>>>> inserting data from 1 third-party service, only 4 months of their
>>>>>>>> data, into TDB, and the TDB database size has already reached 15G.
>>>>>>>> Is this behavior expected?
>>>>>>
>>>>>> Hi Trevor,
>>>>>>
>>>>>> How are you measuring the space used? TDB files tend to be sparse, so
>>>>>> the disk use reported can be unreliable. Example from my system:
>>>>>>
>>>>>> 6.2M [...] 264M [...] GOSP.dat
>>>>>>
>>>>>> The first number (6.2M) is essentially the disk space taken; the
>>>>>> second (264M!) is the 'length' of the file.
>>>>>>
>>>>>> Damian
>>>>>>
>>>>>> --
>>>>>> Damian Steer
>>>>>> Senior Technical Researcher
>>>>>> Research IT
>>>>>> +44 (0) 117 928 7057
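Damian's sparse-file point is easy to verify: a file's 'length' (apparent size) and its allocated blocks can differ wildly. A quick demonstration with GNU coreutils (the filename is just an example):

```shell
# Make a 256 MiB sparse file: its 'length' is 256M but it allocates no
# data blocks, only a hole.
truncate -s 256M sparse.dat

# Apparent size - the file 'length', the same number ls -l reports:
du -m --apparent-size sparse.dat   # 256

# Allocated disk space - essentially zero for a hole-only file:
du -m sparse.dat                   # 0 on most filesystems
```

So `du -ksh` (as used above) reports real disk usage for TDB, while `ls -l` reports the much larger file length; comparing the two tells you how sparse the database files are.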
