Thanks Andy for your feedback. I realized I cannot use
datasetAccessor.putModel or datasetAccessor.add. After much trial and error
I have determined that the following allocates space (often doubling the
database size) and never releases it:
for (int i = 0; i < 1000; i++) {
    Resource s = ResourceFactory.createResource("http://example.org/A/" + i);
    Property p = ResourceFactory.createProperty("urn:my:id");
    RDFNode o = ResourceFactory.createTypedLiteral(i);
    Statement stmt = model.createStatement(s, p, o);
    model.add(stmt);
}
datasetAccessor.putModel(GRAPH_NAME, model); // same result whether I use
                                             // this or add(GRAPH_NAME, model)
One thing I have noticed is that if I use SPARQL INSERT and DELETE updates,
the database size does not keep growing. The SPARQL route is probably more
error-prone, but since it seems more reliable as far as database growth
goes, I will go that way; a sketch of what I mean is below.
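For the record, this is roughly what the SPARQL-update route looks like for
me. It is only a sketch: the endpoint URL and graph name are placeholders
for my real configuration, and the package names are the Jena 2.x ones:

    import com.hp.hpl.jena.update.UpdateExecutionFactory;
    import com.hp.hpl.jena.update.UpdateFactory;
    import com.hp.hpl.jena.update.UpdateRequest;

    // Build all the triples into a single INSERT DATA so the server sees
    // one HTTP request, and so one transaction. Note: a bare number in
    // SPARQL is an xsd:integer literal, close to (but not identical to)
    // what createTypedLiteral(i) produced above.
    StringBuilder data = new StringBuilder();
    for (int i = 0; i < 1000; i++) {
        data.append("<http://example.org/A/").append(i)
            .append("> <urn:my:id> ").append(i).append(" . ");
    }
    UpdateRequest insert = UpdateFactory.create(
        "INSERT DATA { GRAPH <http://example.org/graph> { " + data + " } }");
    UpdateExecutionFactory.createRemote(insert,
        "http://localhost:3030/test_store/update").execute();

    // Deletes go the same way, e.g. dropping one subject's statements:
    UpdateRequest delete = UpdateFactory.create(
        "DELETE WHERE { GRAPH <http://example.org/graph> "
            + "{ <http://example.org/A/0> ?p ?o } }");
    UpdateExecutionFactory.createRemote(delete,
        "http://localhost:3030/test_store/update").execute();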
On Mon, Feb 16, 2015 at 7:36 PM, Trevor Donaldson <[email protected]>
wrote:
> Thanks Andy for the responses. See my replies inline. One thing that I
> noticed is that if I use SPARQL INSERT queries, the behavior is as expected.
>
> On Mon, Feb 16, 2015 at 5:07 PM, Andy Seaborne <[email protected]> wrote:
>
>> Still waiting for the answer to:
>> [[
>> Does the size stabilise?
>> If not, do some files stabilise in size and others not?
>> ]]
>
> No, the size does not stabilize. I will have to check whether some files
> stabilize and others do not.
>
>> On 16/02/15 18:14, Trevor Donaldson wrote:
>>
>>> Hi, I think my question got lost. Is it correct to add millions of
>>> triples to the model and then persist the model once using putModel? I
>>> didn't want to hit a timeout or anything like that.
>>>
>>
>> Updates don't time out. Small numbers of millions is no big deal in a
>> single append operation (INSERT DATA or POST / addModel).
>>
> *** I would be using putModel, not addModel. The reason for this is
> that I need to update a model after removing certain statements.
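(To make the difference concrete, here is a minimal sketch of the two
calls; the service and graph URIs are placeholders for my setup, and model
is the Model built as in my snippet above. putModel replaces the whole
named graph, while add appends to it:

    import com.hp.hpl.jena.query.DatasetAccessor;
    import com.hp.hpl.jena.query.DatasetAccessorFactory;

    DatasetAccessor accessor = DatasetAccessorFactory.createHTTP(
        "http://localhost:3030/test_store/data");

    // HTTP PUT: replace the named graph with the contents of model.
    accessor.putModel("http://example.org/graph", model);

    // HTTP POST: append the contents of model to the named graph.
    accessor.add("http://example.org/graph", model);

Since I remove statements and then rewrite the graph, I need the replace
semantics of putModel.)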
>
>
>> In your loop you have:
>>
>> Resource subject = ResourceFactory.createResource("http://example.org/task/" + i);
>>
>> so a new URI is generated every time. Nodes are not recovered on delete
>> (too expensive to reference count them - see earlier in the thread).
>>
>> Batching updates may help performance.
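** To check I understand the batching suggestion: something like the sketch
below, where each chunk of statements goes to the server as one update
rather than one request per statement? (The batch size, endpoint URL, and
the sendBatch helper are mine, purely for illustration; imports as in the
sketch at the top of this mail.)

    // Accumulate triples and flush every BATCH statements.
    static final int BATCH = 10000;
    static final String UPDATE_URL = "http://localhost:3030/test_store/update";

    static void insertAll(int count) {
        StringBuilder buf = new StringBuilder();
        int pending = 0;
        for (int i = 0; i < count; i++) {
            buf.append("<http://example.org/A/").append(i)
               .append("> <urn:my:id> ").append(i).append(" . ");
            if (++pending == BATCH) {
                sendBatch(buf);
                pending = 0;
            }
        }
        if (pending > 0) sendBatch(buf); // flush the remainder
    }

    // One HTTP request (and so one server-side transaction) per batch.
    static void sendBatch(StringBuilder buf) {
        UpdateRequest req = UpdateFactory.create(
            "INSERT DATA { GRAPH <http://example.org/graph> { " + buf + " } }");
        UpdateExecutionFactory.createRemote(req, UPDATE_URL).execute();
        buf.setLength(0);
    }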
>>
>>> On Feb 13, 2015 5:11 PM, "Trevor Donaldson" <[email protected]> wrote:
>>>
>>>> I am using Fuseki2. I thought it manages the transactions for me. Is
>>>> this not the case?
>>>>
>>>
>> "Manages" in the sense that each HTTP interaction is a transaction.
>> HTTP is a stateless protocol.
>
> **** Yes, "manages" in the sense that each HTTP interaction is a
> transaction - I was responding to your transaction question. I thought
> that Fuseki had all the code to handle transactions. I may be wrong
> though.
>>
>>> I was using datasetfactory to interact with fuseki.
>>
>> Was that meant to be DatasetAccessorFactory?
>
>
> ** Yes, I was typing from memory; it was meant to be DatasetAccessorFactory.
>
>>
>>
>> Andy
>>
>>
>>>> On Feb 13, 2015 12:10 PM, "Andy Seaborne" <[email protected]> wrote:
>>>>
>>>>> This may be related:
>>>>>
>>>>> https://issues.apache.org/jira/browse/JENA-804
>>>>>
>>>>> I say "may" because the exact patterns of use deeply affect the outcome.
>>>>> In JENA-804 it is across transaction boundaries, which your "putModel"
>>>>> isn't.
>>>>>
>>>>> (Are you really running without transactions?)
>>>>>
>>>>> Andy
>>>>>
>>>>> On 13/02/15 16:56, Andy Seaborne wrote:
>>>>>
>>>>>> Does the size stabilise?
>>>>>> If not, do some files stabilise in size and others not?
>>>>>>
>>>>>> There are two places for growth:
>>>>>>
>>>>>> nodes - does the new data have new RDF terms in it? Old terms are not
>>>>>> deleted, just left around to be reused, so if you are adding new terms,
>>>>>> the node table can grow. (Terms are not reference counted - that would
>>>>>> be very expensive for such a small data item.)
>>>>>>
>>>>>> TDB (current version) does not properly reuse freed-up space in indexes,
>>>>>> but should do so within a transaction. A put is a delete-then-add, so
>>>>>> some space should be reused.
>>>>>>
>>>>>> A proper fix to reuse space across transactions may require a database
>>>>>> format change, but I haven't had time to work out the details. Off the
>>>>>> top of my head, though, much reuse should be doable by moving the
>>>>>> free-chain management onto the main database at a transaction, as it is
>>>>>> the single active writer. The code is currently too cautious about
>>>>>> old-generation readers, which I now see it need not be.
>>>>>>
>>>>>> Andy
>>>>>>
>>>>>> On 12/02/15 17:51, Trevor Donaldson wrote:
>>>>>>
>>>>>>> Any thoughts, anyone? I change my model every hour with new or
>>>>>>> replacement data. Let's say, over a period of inserting years' worth
>>>>>>> of triples, should I persist potentially millions of triples at one
>>>>>>> time using putModel? Committing once seems to be the only way to
>>>>>>> mitigate the directory growing so fast.
>>>>>>>
>>>>>>> On Thu, Feb 12, 2015 at 9:53 AM, Trevor Donaldson
>>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>>> Damian,
>>>>>>>>
>>>>>>>> I am using du -ksh ./* on the databases directory.
>>>>>>>>
>>>>>>>> I am getting:
>>>>>>>> 25M ./test_store
>>>>>>>>
>>>>>>>> On Thu, Feb 12, 2015 at 9:35 AM, Damian Steer <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> On 12/02/15 13:49, Trevor Donaldson wrote:
>>>>>>>>>
>>>>>>>>>> On Thu, Feb 12, 2015 at 6:32 AM, Trevor Donaldson
>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> I am in the middle of updating our store from RDB to TDB. I have
>>>>>>>>>>> noticed a significant size increase in the amount of storage
>>>>>>>>>>> needed. Currently RDB is able to hold all the data I need (4
>>>>>>>>>>> third-party services and 4 years of their data) and it equals
>>>>>>>>>>> ~12G. I started inserting data from 1 third-party service, only
>>>>>>>>>>> 4 months of their data, into TDB and the TDB database size has
>>>>>>>>>>> already reached 15G. Is this behavior expected?
>>>>>>>>>>>
>>>>>>>>>>> Hi Trevor,
>>>>>>>>>
>>>>>>>>> How are you measuring the space used? TDB files tend to be sparse,
>>>>>>>>> so the disk use reported can be unreliable. Example from my system:
>>>>>>>>>
>>>>>>>>> 6.2M [...] 264M [...] GOSP.dat
>>>>>>>>>
>>>>>>>>> The first number (6.2M) is essentially the disk space taken; the
>>>>>>>>> second (264M!) is the 'length' of the file.
>>>>>>>>>
>>>>>>>>> Damian
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Damian Steer
>>>>>>>>> Senior Technical Researcher
>>>>>>>>> Research IT
>>>>>>>>> +44 (0) 117 928 7057