On 18/05/12 17:35, Paolo Castagna wrote:
Hi, I've just read this blog post from Andy:
http://www.epimorphics.com/web/wiki/epimorphics-builds-data-publish-platform-environment-agency
It describes a "quite simple" fault-tolerant and replicated data
publishing solution using Apache Jena and Fuseki. Interesting.
It's a master/slave architecture. The master (called the 'controller
server' in Andy's post) receives all updates and "calculates the
triples to be added, the triples to be removed" so that changes are
'idempotent', i.e. they can be re-applied multiple times (in the same
order!) with the same effect.
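To make concrete what I mean by 'idempotent', here is a minimal sketch
using Jena's in-memory dataset (the data and names here are made up by
me): a change script made only of ground triples, with no WHERE clause,
leaves the store in the same state however many times it is re-applied.

    import org.apache.jena.query.Dataset;
    import org.apache.jena.query.DatasetFactory;
    import org.apache.jena.update.UpdateAction;

    public class IdempotentChange {
        public static void main(String[] args) {
            Dataset ds = DatasetFactory.create(); // in-memory dataset

            // Only ground triples and no WHERE clause: re-applying the
            // script cannot produce a different result.
            String change =
                "PREFIX ex: <http://example.org/>\n" +
                "DELETE DATA { ex:station1 ex:level \"1.2\" } ;\n" +
                "INSERT DATA { ex:station1 ex:level \"1.5\" }";

            UpdateAction.parseExecute(change, ds); // first application
            UpdateAction.parseExecute(change, ds); // re-applied: same state

            System.out.println(ds.getDefaultModel().size()); // prints 1
        }
    }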
It would be interesting to know whether the 'controller server' exposes
a full SPARQL Update endpoint and/or the Graph Store HTTP Protocol and,
if so, how the triples to be added/removed are calculated.
(This is something I have wanted to learn for a while, but I have not
yet found the time... a small example would be wonderful! ;-)).
To conclude, I fully agree on the "quite simple design" and "simple
systems are easier to operate". The approach described can work well
in a lot of scenarios where the rate of updates/writes isn't
excessive and you have mostly reads (which I still believe to be the
case most of the time with RDF data, since it is often human
generated/curated). My hope is to see something similar in the 'open'
so that other Apache Jena and Fuseki users can benefit from a highly
available, open source publishing solution for RDF data (and can
focus their energies/efforts elsewhere: on the quality of their data
modelling, data, applications, user experience, etc.).
Paolo
PS: Disclaimer: I don't work for Epimorphics, those are just my
personal opinions and, last but not least, I love simplicity.
The controller is not the same as the replicas; it does not hold a copy
of the DB that gets updated and propagated. It takes the changes (the
data updates are CSV), converts them to RDF and calculates the adds and
deletes. This process produces DELETE DATA ... / INSERT DATA ... and
the script to reset the current view. It's not general SPARQL Update,
and not master/slave as such.
Think of it as a design pattern.
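As a sketch of the diff step (illustrative only, not our actual code;
it assumes the CSV has already been converted to RDF models, and that
there are no blank nodes, which aren't allowed in DELETE DATA anyway):

    import java.io.StringWriter;
    import org.apache.jena.rdf.model.Model;

    public class ChangeScript {

        // Compute the concrete adds/deletes between the previously
        // published state and the newly converted data, emitted as a
        // DELETE DATA / INSERT DATA script.
        static String diffToUpdate(Model previous, Model current) {
            Model toDelete = previous.difference(current); // in previous, not in current
            Model toInsert = current.difference(previous); // in current, not in previous

            StringBuilder script = new StringBuilder();
            script.append("DELETE DATA {\n").append(asTriples(toDelete)).append("} ;\n");
            script.append("INSERT DATA {\n").append(asTriples(toInsert)).append("}\n");
            return script.toString();
        }

        // N-Triples: one ground triple per line, valid inside a DATA block.
        static String asTriples(Model m) {
            StringWriter w = new StringWriter();
            m.write(w, "N-TRIPLES");
            return w.toString();
        }
    }

Because the resulting script contains only ground triples, it can be
replayed safely, which is the idempotency Paolo mentions.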
A different pattern would also be possible for more general updates,
assuming fail-stop nodes and no partitions, by briefly holding up
updates at the point where a new server is being introduced. Keeping a
"last transaction" id would help, but I'm not sure it's necessary.
This is useful for a few machines, but the restrictions become a burden
as the number grows, at which point a more complicated design would be
more useful.
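Purely as a sketch of the "last transaction" id idea (the names here
are invented; nothing like this is in the deployed system):

    import org.apache.jena.query.Dataset;
    import org.apache.jena.update.UpdateAction;

    public class Replica {
        private final Dataset dataset;
        private long lastTxnId = -1; // highest change script applied so far

        public Replica(Dataset dataset) {
            this.dataset = dataset;
        }

        // Change scripts arrive in order with monotonically increasing
        // ids; anything already seen is skipped rather than re-applied.
        public synchronized void apply(long txnId, String updateScript) {
            if (txnId <= lastTxnId)
                return;
            UpdateAction.parseExecute(updateScript, dataset);
            lastTxnId = txnId;
        }
    }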
Again, this is for a system that is mainly publishing, with some
updates, and where absolute consistency isn't needed; not a system with
a high rate and proportion of updates.
Horses for courses.
Andy