On 18/05/12 17:35, Paolo Castagna wrote:
Hi, I've just read this blog post from Andy:
http://www.epimorphics.com/web/wiki/epimorphics-builds-data-publish-platform-environment-agency
It describes a "quite simple" fault-tolerant and replicated data
publishing solution using Apache Jena and Fuseki. Interesting.
It's a master/slave architecture. The master (called the 'controller
server' in Andy's post) receives all updates and "calculates the
triples to be added, the triples to be removed" so that changes are
'idempotent', i.e. they can be re-applied multiple times (in the same
order!) with the same effect.
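To make concrete what I mean by 'idempotent', here is a minimal sketch
using Jena's in-memory dataset (the data and names here are made up by
me): a change script made only of ground triples, with no WHERE clause,
leaves the store in the same state however many times it is re-applied.

    import org.apache.jena.query.Dataset;
    import org.apache.jena.query.DatasetFactory;
    import org.apache.jena.update.UpdateAction;

    public class IdempotentChange {
        public static void main(String[] args) {
            Dataset ds = DatasetFactory.create(); // in-memory dataset

            // Only ground triples and no WHERE clause: re-applying the
            // script cannot produce a different result.
            String change =
                "PREFIX ex: <http://example.org/>\n" +
                "DELETE DATA { ex:station1 ex:level \"1.2\" } ;\n" +
                "INSERT DATA { ex:station1 ex:level \"1.5\" }";

            UpdateAction.parseExecute(change, ds); // first application
            UpdateAction.parseExecute(change, ds); // re-applied: same state

            System.out.println(ds.getDefaultModel().size()); // prints 1
        }
    }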
It would be interesting to know whether the 'controller server' exposes
a full SPARQL Update endpoint and/or the Graph Store HTTP Protocol and,
if so, how the triples to be added/removed are calculated.
(This is something I have wanted to learn for a while, but I have not
yet found the time... a small example would be wonderful! ;-)).
To conclude, I fully agree on the "quite simple design" and "simple
systems are easier to operate". The approach described can work well
in a lot of scenarios where the rate of updates/writes isn't
excessive and you have mostly reads (which I still believe to be the
case most of the time with RDF data, since it is often human
generated/curated). My hope is to see something similar in the 'open'
so that other Apache Jena and Fuseki users can benefit from a highly
available, open source publishing solution for RDF data (and can
focus their energies/efforts elsewhere: on the quality of their data
modelling, data, applications, user experience, etc.).
Paolo
PS: Disclaimer: I don't work for Epimorphics, those are just my
personal opinions and, last but not least, I love simplicity.
The controller is not the same as the replicas; it does not hold a copy
of the DB that gets updated and propagated. It takes the changes (the
data updates are CSV), converts them to RDF and calculates the adds and
deletes. This process produces DELETE DATA ... / INSERT DATA ... and
the script to reset the current view. It's not general SPARQL Update,
and not master/slave as such.
Think of it as a design pattern.
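As a sketch of the diff step (illustrative only, not our actual code;
it assumes the CSV has already been converted to RDF models, and that
there are no blank nodes, which aren't allowed in DELETE DATA anyway):

    import java.io.StringWriter;
    import org.apache.jena.rdf.model.Model;

    public class ChangeScript {

        // Compute the concrete adds/deletes between the previously
        // published state and the newly converted data, emitted as a
        // DELETE DATA / INSERT DATA script.
        static String diffToUpdate(Model previous, Model current) {
            Model toDelete = previous.difference(current); // in previous, not in current
            Model toInsert = current.difference(previous); // in current, not in previous

            StringBuilder script = new StringBuilder();
            script.append("DELETE DATA {\n").append(asTriples(toDelete)).append("} ;\n");
            script.append("INSERT DATA {\n").append(asTriples(toInsert)).append("}\n");
            return script.toString();
        }

        // N-Triples: one ground triple per line, valid inside a DATA block.
        static String asTriples(Model m) {
            StringWriter w = new StringWriter();
            m.write(w, "N-TRIPLES");
            return w.toString();
        }
    }

Because the resulting script contains only ground triples, it can be
replayed safely, which is the idempotency Paolo mentions.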
A different pattern would also be possible for more general updates,
assuming fail-stop nodes and no partitions, by briefly holding up
updates at the point where a new server is being introduced. Keeping a
"last transaction" id would help, but I'm not sure it's necessary.
This is useful for a few machines, but the restrictions become a burden
as the number grows, at which point a more complicated design would be
more useful.
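Purely as a sketch of the "last transaction" id idea (the names here
are invented; nothing like this is in the deployed system):

    import org.apache.jena.query.Dataset;
    import org.apache.jena.update.UpdateAction;

    public class Replica {
        private final Dataset dataset;
        private long lastTxnId = -1; // highest change script applied so far

        public Replica(Dataset dataset) {
            this.dataset = dataset;
        }

        // Change scripts arrive in order with monotonically increasing
        // ids; anything already seen is skipped rather than re-applied.
        public synchronized void apply(long txnId, String updateScript) {
            if (txnId <= lastTxnId)
                return;
            UpdateAction.parseExecute(updateScript, dataset);
            lastTxnId = txnId;
        }
    }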
Again, this is for a system that is mainly publishing, with some
updates, and where absolute consistency isn't needed; not a system with
a high rate and proportion of updates.
Horses for courses.
Andy