Thanks, Andy, for the responses. See my replies inline. One thing I noticed is that if I use SPARQL INSERT queries, the behavior is as expected.
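For concreteness, an INSERT DATA update of the kind being contrasted with putModel might look like the fragment below; it appends triples rather than replacing a graph. The URIs are made up for illustration only.

```sparql
# Appends these triples to the store; existing data is left in place.
INSERT DATA {
  <http://example.org/task/1> <http://example.org/status> "done" .
  <http://example.org/task/2> <http://example.org/status> "open" .
}
```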
On Mon, Feb 16, 2015 at 5:07 PM, Andy Seaborne <[email protected]> wrote:

> Still waiting for the answer to:
> [[
> Does the size stabilise?
> If not, do some files stabilise in size and others not?
> ]]

No, the size does not stabilize. I will have to check whether some files stabilize and others do not.

> On 16/02/15 18:14, Trevor Donaldson wrote:
>
>> I think my question got lost. Is it correct to add millions of triples to
>> the model and then persist the model once using putModel? I didn't want to
>> get a timeout or anything like that.
>
> Updates don't time out. Small numbers of millions is no big deal in a
> single append operation (INSERT DATA or POST / addModel).

*** I would be using putModel, not addModel. The reason for this is that I
need to update a model after removing certain statements.

> In your loop you have:
>
>     Resource subject = ResourceFactory.createResource("http://example.org/task/"+i);
>
> so a new URI is generated every time. Nodes are not recovered on delete
> (too expensive to reference-count them - see earlier in the thread).
>
> Batching updates may help performance.
>
> On Feb 13, 2015 5:11 PM, "Trevor Donaldson" <[email protected]> wrote:
>
>> I am using Fuseki2. I thought it manages the transactions for me. Is this
>> not the case?
>
> "Manages" in the sense that each HTTP interaction is a transaction.
> HTTP is a stateless protocol.

**** Yes, "manages" in the sense that each HTTP interaction is a transaction.
I was responding to your transaction question. I thought that Fuseki had all
the code to handle transactions. I may be wrong, though.

>> I was using DatasetFactory to interact with Fuseki.
>
> Was that meant to be DatasetAccessorFactory?

** Yes, I was typing from memory. It was meant to be DatasetAccessorFactory.
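A direct way to answer the "do some files stabilise in size?" question above is to snapshot the per-file sizes between update cycles and diff the snapshots. A minimal sketch (the database path here is an assumption, not taken from the thread):

```shell
# Record the allocated size of each file in the TDB database directory,
# one line per file, so snapshots taken between updates can be diffed.
# NOTE: the path below is an example - point DB at your own store.
DB=/path/to/databases/test_store
for f in "$DB"/*; do
  printf '%s\t%s\n' "$(du -k "$f" | cut -f1)" "$f"
done
```

Running this after each putModel cycle and comparing successive snapshots shows which files (node table vs. individual index files) keep growing.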
> Andy
>
> On Feb 13, 2015 12:10 PM, "Andy Seaborne" <[email protected]> wrote:
>
>> This may be related:
>>
>> https://issues.apache.org/jira/browse/JENA-804
>>
>> I say "may" because the exact patterns of use deeply affect the outcome.
>> In JENA-804 it is across transaction boundaries, which your "putModel"
>> isn't.
>>
>> (Are you really running without transactions?)
>>
>> Andy
>>
>> On 13/02/15 16:56, Andy Seaborne wrote:
>>
>>> Does the size stabilise?
>>> If not, do some files stabilise in size and others not?
>>>
>>> There are two places for growth:
>>>
>>> nodes - does the new data have new RDF terms in it? Old terms are not
>>> deleted, just left around to be reused, so if you are adding terms, the
>>> node table can grow. (Terms are not reference-counted - that would be
>>> very expensive for such a small data item.)
>>>
>>> TDB (current version) does not properly reuse freed-up space in indexes,
>>> but should do so within a transaction. put is delete-then-add, and some
>>> space should be reused.
>>>
>>> A proper fix to reuse space across transactions may require a database
>>> format change, but I haven't had time to work out the details. Off the
>>> top of my head, much reuse should be doable by moving the free-chain
>>> management onto the main database on a transaction, as it is the single
>>> active writer. The code is currently too cautious about old-generation
>>> readers, which I now see it need not be.
>>>
>>> Andy
>>>
>>> On 12/02/15 17:51, Trevor Donaldson wrote:
>>>
>>>> Any thoughts, anyone? If I change my model every hour with new data or
>>>> data to replace - let's say over a period of inserting years' worth of
>>>> triples - should I persist potentially millions of triples at one time
>>>> using putModel? Committing one time seems to be the only way to
>>>> mitigate the directory growing exponentially.
>>>> On Thu, Feb 12, 2015 at 9:53 AM, Trevor Donaldson <[email protected]> wrote:
>>>>
>>>>> Damian,
>>>>>
>>>>> I am using du -ksh ./* on the databases directory.
>>>>>
>>>>> I am getting:
>>>>> 25M ./test_store
>>>>>
>>>>> On Thu, Feb 12, 2015 at 9:35 AM, Damian Steer <[email protected]> wrote:
>>>>>
>>>>>> On 12/02/15 13:49, Trevor Donaldson wrote:
>>>>>>
>>>>>>> On Thu, Feb 12, 2015 at 6:32 AM, Trevor Donaldson <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I am in the middle of updating our store from RDB to TDB. I have
>>>>>>>> noticed a significant size increase in the amount of storage needed.
>>>>>>>> Currently RDB is able to hold all the data I need (4 third-party
>>>>>>>> services and 4 years of their data) and it equals ~12G. I started
>>>>>>>> inserting data from 1 third-party service, only 4 months of their
>>>>>>>> data, into TDB, and the TDB database size has already reached 15G.
>>>>>>>> Is this behavior expected?
>>>>>>
>>>>>> Hi Trevor,
>>>>>>
>>>>>> How are you measuring the space used? TDB files tend to be sparse, so
>>>>>> the disk use reported can be unreliable. Example from my system:
>>>>>>
>>>>>> 6.2M [...] 264M [...] GOSP.dat
>>>>>>
>>>>>> The first number (6.2M) is essentially the disk space taken; the
>>>>>> second (264M!) is the 'length' of the file.
>>>>>>
>>>>>> Damian
>>>>>>
>>>>>> --
>>>>>> Damian Steer
>>>>>> Senior Technical Researcher
>>>>>> Research IT
>>>>>> +44 (0) 117 928 7057
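Damian's sparse-file point is easy to verify: a file's 'length' (apparent size) and its allocated blocks can differ wildly. A quick demonstration with GNU coreutils (the filename is just an example):

```shell
# Make a 256 MiB sparse file: its 'length' is 256M but it allocates no
# data blocks, only a hole.
truncate -s 256M sparse.dat

# Apparent size - the file 'length', the same number ls -l reports:
du -m --apparent-size sparse.dat   # 256

# Allocated disk space - essentially zero for a hole-only file:
du -m sparse.dat                   # 0 on most filesystems
```

So `du -ksh` (as used above) reports real disk usage for TDB, while `ls -l` reports the much larger file length; comparing the two tells you how sparse the database files are.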
