On 06/01/2023 15:37, Jonathan MERCIER wrote:
Hi Jonathan,
Hi Andy,
Could you say something about the usage patterns you are interested in
supporting? Size of data? Query load?
Yes, of course. We aim to store part of the UniProt ontology in order to
study metabolism on multiple layers: Organism/Gene/Protein/Reaction/Pathway.
Thus we will have a huge amount of public and private data (from both
academic research and industry).
So we have to use Apache Shiro to control who can access which data (by
tenant).
Shiro will do the authentication, with API-level security for authorization.
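For reference, the standalone Fuseki server reads an Apache Shiro
`shiro.ini` from its run directory, so per-tenant restrictions can be
sketched there. A hypothetical fragment (the dataset name `/metabolism`,
the user, and the role are assumptions; `authcBasic` and `roles[...]` are
standard Shiro filters):

```ini
# Hypothetical shiro.ini sketch - names are illustrative only
[users]
# username = password, role1, role2, ...
alice = secret, tenantA

[roles]
# tenantA may do anything it is granted below
tenantA = *

[urls]
# Require HTTP Basic auth and the tenantA role for this dataset's endpoints
/metabolism/** = authcBasic, roles[tenantA]
```

Note that Shiro gates access per URL (i.e. per dataset/endpoint), not per
triple; finer-grained control needs one of the mechanisms discussed below.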
To get access control on parts of the overall data, do you split the
data into separate triplestores? Do you use the per-graph access control
of Jena to get data-level security?
The per-graph access control works if (1) you can manage the data that
way with named graphs and (2) the access control is user- or role-based.
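To make the user/role case concrete, here is a minimal, hypothetical
sketch (not Jena's API - all names are invented) of the idea behind
per-graph access control: each role grants visibility of a set of named
graphs, and a request is evaluated only over the union of the graphs the
user's roles allow.

```python
# Hypothetical sketch of role-based, per-graph visibility (not Jena's API).
# Each role grants access to a set of named graphs; a user's visible graphs
# are the union over all of their roles.

ROLE_GRAPHS = {
    "academic": {"http://example/graph/public"},
    "industry": {"http://example/graph/public", "http://example/graph/private"},
}

def visible_graphs(user_roles):
    """Union of the named graphs every role of this user may read."""
    graphs = set()
    for role in user_roles:
        graphs |= ROLE_GRAPHS.get(role, set())
    return graphs

def allowed(user_roles, graph_iri):
    """True when at least one of the user's roles covers this graph."""
    return graph_iri in visible_graphs(user_roles)
```

The same shape is what condition (2) above refers to: the policy is
expressible purely as "role X sees graphs G1..Gn".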
In my day job, I'm working on another data access control system - we have
existing data which does not decompose into named graphs very easily, and
the access control rules don't fit a user/role basis (Role-Based Access
Control, RBAC).
Attribute-Based Access Control (ABAC) can go down to labelling the
access conditions on individual triples - and it also provides simple
triple-pattern matching (because sometimes many triples have the same
label, e.g. they have the same property).
The "attribute" part comes from having key/value boolean expressions for
access conditions, such as "department=engineering & status=employee"
which can be moved around with the data when sharing across enterprise
boundaries.
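As a toy illustration of such labels, the following sketch (invented for
this note, not the actual system) evaluates an "&"-conjunction of
key=value conditions against a user's attributes:

```python
# Toy evaluator for '&'-conjunctions of key=value attribute conditions,
# e.g. "department=engineering & status=employee". Illustration only -
# a real ABAC engine supports richer expressions.

def parse_label(label):
    """Parse 'k1=v1 & k2=v2' into a list of (key, value) pairs."""
    pairs = []
    for clause in label.split("&"):
        key, _, value = clause.strip().partition("=")
        pairs.append((key.strip(), value.strip()))
    return pairs

def grants_access(label, user_attributes):
    """True when every condition in the label holds for the user."""
    return all(user_attributes.get(k) == v for k, v in parse_label(label))

user = {"department": "engineering", "status": "employee"}
grants_access("department=engineering & status=employee", user)  # True
```

Because the label is just data, it can travel with the triples it
protects, which is the point made above about sharing across enterprise
boundaries.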
Currently the size of the data is estimated at around 1 TB.
We will provide a knowledge release from time to time, so most of the
time we will be doing read-only queries, and occasionally we will push
our new release (1 TB).
Then the full capabilities of RDF Delta may not be needed. It sounds like
an offline database build, with the DB copied to multiple triplestores
behind a load balancer.
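As a rough ops sketch of that pattern (host names and paths are invented;
`tdb2.tdbloader` is Jena's TDB2 bulk loader):

```shell
# Offline build: load the release into a fresh TDB2 database
tdb2.tdbloader --loc /data/build/DB2 release.ttl

# Copy the finished database to each Fuseki replica (hosts are hypothetical)
for host in fuseki1 fuseki2; do
    rsync -a /data/build/DB2/ "$host:/data/fuseki/DB2/"
done
# Then restart each Fuseki behind the load balancer to pick up the new DB
```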
Full 24x7 update with no single point of failure is nice but it is
complex. More servers (cost), more admin (more cost!).
Or, for a few non-time-critical incremental updates, a simple mode for
RDF Delta is a single patch server with a replicated filesystem.
This is a single point of failure for updates, but the Fuseki replicas
can provide query service throughout. It is simpler to operate.
Andy
There is a Lucene based text index.
Indeed, I see this; I will take a look at how to enable Lucene with TDB.
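Regarding jena-text: the index is usually enabled via an assembler file.
A hedged sketch over TDB2, following the shapes in the jena-text
documentation (the database location, index directory, and the choice of
`rdfs:label` as the indexed property are assumptions):

```turtle
@prefix :      <#> .
@prefix rdfs:  <http://www.w3.org/2000/01/rdf-schema#> .
@prefix text:  <http://jena.apache.org/text#> .
@prefix tdb2:  <http://jena.apache.org/2016/tdb#> .

# Text-indexed dataset wrapping a TDB2 database
:dataset a text:TextDataset ;
    text:dataset :tdb ;
    text:index   :index .

:tdb a tdb2:DatasetTDB2 ;
    tdb2:location "DB2" .

# Lucene index; directory and indexed property are illustrative
:index a text:TextIndexLucene ;
    text:directory <file:lucene-index> ;
    text:entityMap :entMap .

:entMap a text:EntityMap ;
    text:entityField  "uri" ;
    text:defaultField "label" ;
    text:map ( [ text:field "label" ; text:predicate rdfs:label ] ) .
```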
We will also take a look at the Fuseki API in order to use it from our
Python application (and, more rarely, Kotlin).
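Fuseki speaks the standard SPARQL 1.1 Protocol over HTTP, so a Python
client needs nothing beyond the standard library. A minimal sketch (the
endpoint URL is hypothetical):

```python
import json
import urllib.parse
import urllib.request

def query_fuseki(endpoint, query):
    """POST a SPARQL query to a Fuseki endpoint (SPARQL 1.1 Protocol).

    `endpoint` is e.g. "http://localhost:3030/dataset/query" (hypothetical).
    """
    data = urllib.parse.urlencode({"query": query}).encode()
    req = urllib.request.Request(
        endpoint, data=data,
        headers={"Accept": "application/sparql-results+json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def bindings_to_rows(results):
    """Flatten a SPARQL JSON results document into a list of plain dicts."""
    return [{var: b[var]["value"] for var in b}
            for b in results["results"]["bindings"]]
```

Libraries such as SPARQLWrapper wrap the same protocol, but the raw HTTP
form above shows there is no Fuseki-specific client dependency.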
We aim to perform some geospatial queries (maybe we would have to write a
plugin) in order to have a dedicated algorithm to walk through our
knowledge graph.
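On the geospatial side, Jena ships a GeoSPARQL module (jena-geosparql),
so a plugin may only be needed for custom graph-walking logic. A hedged
sketch of a query using the OGC GeoSPARQL vocabulary (the data shape and
polygon are invented):

```sparql
PREFIX geo:  <http://www.opengis.net/ont/geosparql#>
PREFIX geof: <http://www.opengis.net/def/function/geosparql/>

# Hypothetical data: find features whose geometry lies within a polygon
SELECT ?feature WHERE {
  ?feature geo:hasGeometry/geo:asWKT ?wkt .
  FILTER ( geof:sfWithin(?wkt,
      "POLYGON((0 0, 0 10, 10 10, 10 0, 0 0))"^^geo:wktLiteral) )
}
```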
2) Can we deploy a distributed TDB service in order to get efficient
queries?
It can scale sideways, with multiple copies of the database kept
consistent across a cluster of replicas, using a separate project (it
is not an Apache Foundation project) that provides high availability
and multiple query servers:
RDF Delta <https://afs.github.io/rdf-delta>
Thanks Andy I will take a look