Well, it's primarily deletes that are problematic, for two reasons.

First is blank node equivalence: the internal IDs that TDB2 assigns to blank 
nodes are completely unrelated to the blank node IDs in the source data 
serialization, especially if that source data changes over time (because 
your data serializer may use different IDs each time).  Figuring out which 
blank nodes are new versus which are equivalent to existing ones is the 
sub-graph isomorphism problem, which is NP-complete.
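To illustrate the label problem, here's a minimal sketch (hypothetical triples, not a real Jena API) showing why comparing serializations by blank node label goes wrong: two dumps of the same one-triple graph that merely use different blank node labels look like a delete plus an insert.

```python
# Two serializations of the SAME one-triple graph, differing only in
# the blank node label each serializer happened to emit
# (hypothetical example data):
v1 = {("_:b0", "ex:name", '"Alice"')}
v2 = {("_:genid42", "ex:name", '"Alice"')}

# A naive set comparison treats the labels as significant, so it
# reports a spurious delete and a spurious insert:
print(v1 - v2)  # {('_:b0', 'ex:name', '"Alice"')} -- nothing was really removed
print(v2 - v1)  # {('_:genid42', 'ex:name', '"Alice"')} -- nothing was really added
```

Deciding that `_:b0` and `_:genid42` denote the same node requires matching the structure around them, which is where the isomorphism problem comes in.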

Secondly, in order to detect deletes you would need to build a completely new 
dataset from the data file and then compare the old and new datasets by 
looping over one and doing lookups against the other.  This would be 
extremely expensive in both time and resources, even for datasets that use 
no blank nodes.
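As a rough sketch of what that comparison involves (plain Python over tuples, not Jena code), a naive diff has to materialise both sides and touch every triple, and it is only correct when no blank nodes are involved:

```python
def diff(old_triples, new_triples):
    """Naive dataset diff: hold both sides as sets and compare.

    This requires materialising both datasets and examining every
    triple, which is why it is expensive for large datasets -- and
    it is only correct for data with no blank nodes.
    """
    old, new = set(old_triples), set(new_triples)
    deletes = old - new   # in the old dataset but absent from the new file
    inserts = new - old   # in the new file but absent from the old dataset
    return deletes, inserts

# Hypothetical example data:
old = {("ex:a", "ex:p", "1"), ("ex:a", "ex:p", "2")}
new = {("ex:a", "ex:p", "2"), ("ex:a", "ex:p", "3")}
deletes, inserts = diff(old, new)
print(deletes)  # {('ex:a', 'ex:p', '1')}
print(inserts)  # {('ex:a', 'ex:p', '3')}
```

A fresh bulk load skips all of this work, which is why it wins.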

Creating a fresh dataset will always be much faster.

Rob

On 13/06/2019, 09:34, "Laura Morales" <[email protected]> wrote:

    yes yes of course I can reload everything, that's what I do already. I 
    simply thought it might be quite handy, for instance, if I had a folder 
    containing an arbitrary number of RDF files, and as these files change I 
    could call a tdb2.tdbsync tool that automatically updates a TDB dataset 
    with only the changes (instead of reloading everything).
    
    
    > Sent: Thursday, June 13, 2019 at 10:26 AM
    > From: "Rob Vesse" <[email protected]>
    > To: [email protected]
    > Subject: Re: tdb2.tdbsync
    >
    > Can you not just do a fresh TDB load into a new dataset from the data 
    > file?
    >
    > This would be much faster and more performant than what you are 
    > proposing (in particular the delete handling would be very expensive)
    >
    > Rob
    