On 06/01/2023 15:37, Jonathan MERCIER wrote:
Hi Jonathan,
Hi Andy,
Could you say something about the usage patterns you are interested in
supporting? Size of data? Query load?
Yes, of course. We aim to store part of the UniProt ontology in order to
study metabolism on multiple layers: Organism/Gene/Protein/Reaction/Pathway.
Thus we will have a huge amount of public and private data (from both
academic research and industry).
So we have to use Apache Shiro to control who can access which data (by
tenant).
Shiro will do the authentication, with API-level security for authorization.
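For reference, the standalone Fuseki server reads an Apache Shiro
`shiro.ini` from its run directory, so per-tenant restrictions can be
sketched there. A hypothetical fragment (the dataset name `/metabolism`,
the user, and the role are assumptions; `authcBasic` and `roles[...]` are
standard Shiro filters):

```ini
# Hypothetical shiro.ini sketch - names are illustrative only
[users]
# username = password, role1, role2, ...
alice = secret, tenantA

[roles]
# tenantA may do anything it is granted below
tenantA = *

[urls]
# Require HTTP Basic auth and the tenantA role for this dataset's endpoints
/metabolism/** = authcBasic, roles[tenantA]
```

Note that Shiro gates access per URL (i.e. per dataset/endpoint), not per
triple; finer-grained control needs one of the mechanisms discussed below.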
To get access control on parts of the overall data, do you split the
data into separate triplestores? Do you use the per-graph access control
of Jena to get data-level security?
The per-graph access control works if (1) you can manage the data that
way with named graphs and (2) the access control is user- or role-based.
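To make the user/role case concrete, here is a minimal, hypothetical
sketch (not Jena's API - all names are invented) of the idea behind
per-graph access control: each role grants visibility of a set of named
graphs, and a request is evaluated only over the union of the graphs the
user's roles allow.

```python
# Hypothetical sketch of role-based, per-graph visibility (not Jena's API).
# Each role grants access to a set of named graphs; a user's visible graphs
# are the union over all of their roles.

ROLE_GRAPHS = {
    "academic": {"http://example/graph/public"},
    "industry": {"http://example/graph/public", "http://example/graph/private"},
}

def visible_graphs(user_roles):
    """Union of the named graphs every role of this user may read."""
    graphs = set()
    for role in user_roles:
        graphs |= ROLE_GRAPHS.get(role, set())
    return graphs

def allowed(user_roles, graph_iri):
    """True when at least one of the user's roles covers this graph."""
    return graph_iri in visible_graphs(user_roles)
```

The same shape is what condition (2) above refers to: the policy is
expressible purely as "role X sees graphs G1..Gn".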
In my day job, I'm working on another data access control system - we have
existing data which does not decompose into named graphs very easily, and
the access control rules don't fit a user/role basis (Role-Based Access
Control, RBAC).
Attribute-Based Access Control (ABAC) can go down to labelling the
access conditions on individual triples - and it also provides simple
triple-pattern matching (because sometimes many triples have the same
label, e.g. they have the same property).
The "attribute" part comes from having key/value boolean expressions for
access conditions, such as "department=engineering & status=employee"
which can be moved around with the data when sharing across enterprise
boundaries.
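As a toy illustration of such labels, the following sketch (invented for
this note, not the actual system) evaluates an "&"-conjunction of
key=value conditions against a user's attributes:

```python
# Toy evaluator for '&'-conjunctions of key=value attribute conditions,
# e.g. "department=engineering & status=employee". Illustration only -
# a real ABAC engine supports richer expressions.

def parse_label(label):
    """Parse 'k1=v1 & k2=v2' into a list of (key, value) pairs."""
    pairs = []
    for clause in label.split("&"):
        key, _, value = clause.strip().partition("=")
        pairs.append((key.strip(), value.strip()))
    return pairs

def grants_access(label, user_attributes):
    """True when every condition in the label holds for the user."""
    return all(user_attributes.get(k) == v for k, v in parse_label(label))

user = {"department": "engineering", "status": "employee"}
grants_access("department=engineering & status=employee", user)  # True
```

Because the label is just data, it can travel with the triples it
protects, which is the point made above about sharing across enterprise
boundaries.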
Currently the size of the data is estimated at around 1 TB.
We will provide a knowledge release from time to time, so most of the
time we will be doing read-only queries, and occasionally we will push
our new release (1 TB).
Then the full capabilities of RDF Delta may not be needed. It sounds like
an offline database build, with the DB copied to multiple triplestores
behind a load balancer.
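As a rough ops sketch of that pattern (host names and paths are invented;
`tdb2.tdbloader` is Jena's TDB2 bulk loader):

```shell
# Offline build: load the release into a fresh TDB2 database
tdb2.tdbloader --loc /data/build/DB2 release.ttl

# Copy the finished database to each Fuseki replica (hosts are hypothetical)
for host in fuseki1 fuseki2; do
    rsync -a /data/build/DB2/ "$host:/data/fuseki/DB2/"
done
# Then restart each Fuseki behind the load balancer to pick up the new DB
```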
Full 24x7 update with no single point of failure is nice but it is
complex. More servers (cost), more admin (more cost!).
Or, for a few non-time-critical incremental updates, a simple mode for
RDF Delta is a single patch server with a replicated filesystem.
This is a single point of failure for updates, but the Fuseki replicas
can provide query service throughout. It is simpler to operate.
Andy
There is a Lucene based text index.
Indeed, I see this; I will take a look at how to enable Lucene with TDB.
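Regarding jena-text: the index is usually enabled via an assembler file.
A hedged sketch over TDB2, following the shapes in the jena-text
documentation (the database location, index directory, and the choice of
`rdfs:label` as the indexed property are assumptions):

```turtle
@prefix :      <#> .
@prefix rdfs:  <http://www.w3.org/2000/01/rdf-schema#> .
@prefix text:  <http://jena.apache.org/text#> .
@prefix tdb2:  <http://jena.apache.org/2016/tdb#> .

# Text-indexed dataset wrapping a TDB2 database
:dataset a text:TextDataset ;
    text:dataset :tdb ;
    text:index   :index .

:tdb a tdb2:DatasetTDB2 ;
    tdb2:location "DB2" .

# Lucene index; directory and indexed property are illustrative
:index a text:TextIndexLucene ;
    text:directory <file:lucene-index> ;
    text:entityMap :entMap .

:entMap a text:EntityMap ;
    text:entityField  "uri" ;
    text:defaultField "label" ;
    text:map ( [ text:field "label" ; text:predicate rdfs:label ] ) .
```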
We will also take a look at the Fuseki API in order to use it from our
Python application (and, more rarely, Kotlin).
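Fuseki speaks the standard SPARQL 1.1 Protocol over HTTP, so a Python
client needs nothing beyond the standard library. A minimal sketch (the
endpoint URL is hypothetical):

```python
import json
import urllib.parse
import urllib.request

def query_fuseki(endpoint, query):
    """POST a SPARQL query to a Fuseki endpoint (SPARQL 1.1 Protocol).

    `endpoint` is e.g. "http://localhost:3030/dataset/query" (hypothetical).
    """
    data = urllib.parse.urlencode({"query": query}).encode()
    req = urllib.request.Request(
        endpoint, data=data,
        headers={"Accept": "application/sparql-results+json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def bindings_to_rows(results):
    """Flatten a SPARQL JSON results document into a list of plain dicts."""
    return [{var: b[var]["value"] for var in b}
            for b in results["results"]["bindings"]]
```

Libraries such as SPARQLWrapper wrap the same protocol, but the raw HTTP
form above shows there is no Fuseki-specific client dependency.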
We aim to perform some geospatial queries (maybe we would have to write a
plugin) in order to have a dedicated algorithm to walk through our
knowledge graph.
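On the geospatial side, Jena ships a GeoSPARQL module (jena-geosparql),
so a plugin may only be needed for custom graph-walking logic. A hedged
sketch of a query using the OGC GeoSPARQL vocabulary (the data shape and
polygon are invented):

```sparql
PREFIX geo:  <http://www.opengis.net/ont/geosparql#>
PREFIX geof: <http://www.opengis.net/def/function/geosparql/>

# Hypothetical data: find features whose geometry lies within a polygon
SELECT ?feature WHERE {
  ?feature geo:hasGeometry/geo:asWKT ?wkt .
  FILTER ( geof:sfWithin(?wkt,
      "POLYGON((0 0, 0 10, 10 10, 10 0, 0 0))"^^geo:wktLiteral) )
}
```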
2) Can we deploy a distributed TDB service in order to get efficient
queries?
It can scale sideways, with multiple copies of the database kept
consistent across a cluster of replicas, using a separate project (it
is not an Apache Foundation project) that provides high availability
and multiple query servers:
RDF Delta <https://afs.github.io/rdf-delta>
Thanks Andy I will take a look