Finding differences between graphs (Was: Jena/Fuseki graph sync)

Claude Warren Thu, 07 Dec 2017 23:27:38 -0800

On Fri, Nov 24, 2017 at 12:19 PM, Laura Morales <[email protected]> wrote:


> > What about simply deleting the old graph and loading the triples of the
> > .nt file into the graph afterwards? I don't see any benefit of such a
> > "tool" - you could just write your own bash script for this if you need
> > this quite often.
>
> The advantage is with large graphs, such as wikidata. If I download their
> dumps once a week, it's much more efficient to only change a few triples
> instead of deleting the entire graph and recreating the whole TDB store.
>


Performing a diff between two graphs with blank nodes might be sped up
using bloom filters.

I have code that represents triples as bloom filters, and I know that
9-byte filters will work for very large graphs, so you could probably get
away with 8 bytes to make them fit in a standard integer size.
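
A minimal sketch of how a triple might be folded into an 8-byte filter.
This is not the code from the repository; the hash (slices of SHA-256),
the number of bits set per term (K = 3), the "s:"/"p:"/"o:" position tags
and the function names are all assumptions made for illustration.

```python
import hashlib

FILTER_BITS = 64  # 8 bytes, fits a standard 64-bit integer
K = 3             # bits set per term (assumed value, tune for your data)


def term_bits(term: str) -> int:
    """Return a 64-bit mask with up to K bits set for one RDF term."""
    digest = hashlib.sha256(term.encode("utf-8")).digest()
    mask = 0
    for i in range(K):
        # Take two bytes of the digest as the i-th hash value.
        h = int.from_bytes(digest[2 * i:2 * i + 2], "big")
        mask |= 1 << (h % FILTER_BITS)
    return mask


def triple_filter(s: str, p: str, o: str) -> int:
    """OR the term masks together into one filter for the triple.

    Terms are tagged with their position (an assumption of this sketch)
    so that swapping subject and object changes the filter input.
    """
    return term_bits("s:" + s) | term_bits("p:" + p) | term_bits("o:" + o)
```

With three terms and K = 3, each filter has at most nine bits set out of
64, so distinct triples can still share a filter and a match always needs
to be confirmed against the real triple.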

This is a multi-pass operation.

Create a bloom filter for each triple in graph A. Call this list A.

Step through graph B, creating a bloom filter for each triple. If the
triple in question has blank nodes, encode only the non-blank nodes.

If the bloom filter is not in list A, the triple is new.

If the bloom filter is in list A then the triple may still be new, so do a
direct lookup in graph A. If it is not found, add it.

If your filter list has a pointer to the triples that it represents
(remember there can be bloom filter collisions) then you can rapidly
determine if there is a match and you also have a good starting place to do
blank node comparisons to determine if the triples are equivalent.
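
The passes above could be sketched roughly as follows. This is an
illustration under stated assumptions, not the repository code: triples
are plain (s, p, o) string tuples, blank nodes are recognised by the
N-Triples "_:" prefix, and term_bits, triple_filter, same_shape and
candidate_additions are hypothetical names. Graph A's terms are encoded
the same way as graph B's (blank nodes skipped) so the filters are
comparable.

```python
import hashlib
from collections import defaultdict


def term_bits(term, k=3, bits=64):
    """Mask with up to k bits set for one term (assumed hashing scheme)."""
    digest = hashlib.sha256(term.encode("utf-8")).digest()
    mask = 0
    for i in range(k):
        mask |= 1 << (int.from_bytes(digest[2 * i:2 * i + 2], "big") % bits)
    return mask


def is_blank(term):
    return term.startswith("_:")


def triple_filter(triple):
    """Encode only the non-blank terms of the triple."""
    f = 0
    for tag, term in zip("spo", triple):
        if not is_blank(term):
            f |= term_bits(tag + ":" + term)
    return f


def same_shape(t1, t2):
    """Non-blank terms must be equal; blank positions must line up."""
    return all((is_blank(a) and is_blank(b)) or a == b
               for a, b in zip(t1, t2))


def candidate_additions(graph_a, graph_b):
    # Pass 1: list A, with each filter pointing back at its triples.
    index = defaultdict(list)
    for t in graph_a:
        index[triple_filter(t)].append(t)

    new, needs_bnode_check = [], []
    # Pass 2: step through graph B.
    for t in graph_b:
        candidates = index.get(triple_filter(t))
        if not candidates:
            new.append(t)  # filter not in list A: definitely new
            continue
        matches = [c for c in candidates if same_shape(c, t)]
        if not matches:
            new.append(t)  # filter collision only: still new
        elif any(is_blank(x) for x in t):
            # Starting points for a proper blank node comparison.
            needs_bnode_check.append((t, matches))
        # Otherwise the triple is already in graph A.
    return new, needs_bnode_check
```

The second list is only a set of candidates; deciding whether the blank
nodes really correspond still needs a separate comparison step. Running
the same function with the arguments swapped gives the candidate
deletions.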

If anyone is interested in trying this I have some triple/bloom filter code
in my github repository.

Claude

-- 
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren
