On 18/05/12 17:35, Paolo Castagna wrote:
Hi, I've just read this blog post from Andy:
http://www.epimorphics.com/web/wiki/epimorphics-builds-data-publish-platform-environment-agency

 It describes a "quite simple" fault-tolerant and replicated data
publishing solution using Apache Jena and Fuseki. Interesting.

It's a master/slave architecture. The master (called the 'controller
server' in Andy's post) receives all updates and "calculates the
triples to be added, the triples to be removed" so that changes are
idempotent (i.e. they can be reapplied multiple times, in the same
order, with the same effect).
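To make the idempotency point concrete, here is a small sketch of my own (not Epimorphics' actual code) of computing a changeset as a set difference between an old and a new state, with triples modelled as plain tuples. Applying the same changeset twice gives the same result as applying it once:

```python
# Illustration only: an idempotent changeset computed as a set difference.
# Triples are modelled as (subject, predicate, object) tuples.

def diff(old_triples, new_triples):
    """Return (to_add, to_remove) so that applying them to old_triples
    yields new_triples; reapplying the same changeset is a no-op."""
    old, new = set(old_triples), set(new_triples)
    return new - old, old - new

old = {("ex:s1", "ex:level", '"1.2"'), ("ex:s1", "ex:status", '"ok"')}
new = {("ex:s1", "ex:level", '"1.5"'), ("ex:s1", "ex:status", '"ok"')}

to_add, to_remove = diff(old, new)

# Applying remove-then-add twice gives the same result as applying it once.
applied_once = (old - to_remove) | to_add
applied_twice = (applied_once - to_remove) | to_add
assert applied_once == applied_twice == new
```

This is just the set-theoretic core; a real implementation would work over RDF graphs, where blank nodes make the diff harder than plain tuple comparison.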

It would be interesting to know if the 'controller server' exposes a
full SPARQL Update endpoint and/or the Graph Store HTTP Protocol and
if that is the case how triples to be added/removed are calculated.
(This is something I have wanted to learn for a while, but I still
have not found the time... a small example would be wonderful! ;-)).

To conclude, I fully agree on the "quite simple design" and "simple
systems are easier to operate". The approach described can work well
in many scenarios where the rate of updates/writes isn't excessive
and the workload is mostly reads (which I still believe to be the
case most of the time with RDF data, since the data is often human
generated/curated). My hope is to see something similar in the
'open', so that other Apache Jena and Fuseki users can benefit from
a highly available, open source publishing solution for RDF data
(and can focus their energies/efforts elsewhere: on the quality of
their data modeling, data, applications, user experience, etc.).

Paolo

PS: Disclaimer: I don't work for Epimorphics, those are just my
personal opinions and, last but not least, I love simplicity.

The controller is not the same as the replicas; it does not hold a copy of the DB that gets updated and propagated. It takes the changes (the data updates arrive as CSV), converts them to RDF and calculates the adds and deletes. This process produces DELETE DATA ... / INSERT DATA ... scripts, plus a script to reset the current view. It's not general SPARQL Update and not master/slave as such.
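As a rough illustration of the kind of output described (my own sketch, not the actual Epimorphics code, and with triple serialisation heavily simplified compared to real N-Triples), a computed changeset could be rendered as a DELETE DATA / INSERT DATA script like this:

```python
# Hypothetical sketch: turn a computed changeset into a
# DELETE DATA ... ; INSERT DATA ... update script for the replicas.
# Triples are (subject, predicate, object) strings; real code would
# serialise proper N-Triples and handle prefixes, literals, etc.

def to_update_script(to_remove, to_add):
    def block(keyword, triples):
        lines = "\n".join(f"  {s} {p} {o} ." for s, p, o in sorted(triples))
        return f"{keyword} DATA {{\n{lines}\n}}"
    parts = []
    if to_remove:
        parts.append(block("DELETE", to_remove))  # deletes first
    if to_add:
        parts.append(block("INSERT", to_add))
    return " ;\n".join(parts)

script = to_update_script(
    to_remove={("ex:s1", "ex:level", '"1.2"')},
    to_add={("ex:s1", "ex:level", '"1.5"')},
)
print(script)
```

Because the script only contains ground DELETE DATA / INSERT DATA blocks (no WHERE patterns), replaying it in order is harmless, which is what makes the propagation idempotent.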

Think of it as a design pattern.

A different pattern for more general updates would also be possible, assuming fail-stop nodes and no partitions, and briefly holding up updates at the point where a new server is being introduced. Keeping a "last transaction" id would help, but I'm not sure it's necessary. This is useful for a few machines, but the restrictions become a burden as the number goes up, at which point a more complicated design would be more useful. Again, this is for a system that is mainly publishing with some updates, where absolute consistency isn't needed, not a system with a high number and proportion of updates.
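To sketch how a "last transaction" id could help (this is my own reading of the idea, not a described implementation): if each changeset carries a monotonically increasing id and replicas record the last id applied, then replaying the log in order to a new or recovering replica is safe, since already-applied changesets are skipped.

```python
# Sketch (an assumption, not a described implementation): a replica that
# tracks the last transaction id so log replay in order is idempotent.

class Replica:
    def __init__(self):
        self.triples = set()
        self.last_txn = 0  # id of the last changeset applied

    def apply(self, txn_id, to_add, to_remove):
        if txn_id <= self.last_txn:
            return  # already applied: replaying is harmless
        self.triples = (self.triples - set(to_remove)) | set(to_add)
        self.last_txn = txn_id

# An ordered log of (txn_id, to_add, to_remove) changesets.
log = [
    (1, {("ex:s", "ex:p", '"a"')}, set()),
    (2, {("ex:s", "ex:p", '"b"')}, {("ex:s", "ex:p", '"a"')}),
]

r = Replica()
for txn in log:
    r.apply(*txn)
for txn in log:  # replaying the whole log changes nothing
    r.apply(*txn)
assert r.triples == {("ex:s", "ex:p", '"b"')}
```

The check is only sound if updates arrive in order, which matches the "in the same order!" caveat earlier in the thread; out-of-order delivery would need a proper consensus or log-shipping protocol.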

Horses for courses.

        Andy
