Re: Replacing FileManager with Dataset

2022-01-22 Thread Martynas Jusevičius
I meant  map in FileManager...

On Sat, Jan 22, 2022 at 9:14 PM Martynas Jusevičius
 wrote:
>
> Hi,
>
> We are using FileManager as part of the Ontology API. We noticed its
> Model cache-related methods are marked deprecated:
> https://jena.apache.org/documentation/javadoc/jena/org/apache/jena/util/FileManager.html
>
> I don't see how OntDocumentManager can work without some sort of model
> cache. But I think that FileManager (or at least the caching part)
> could be replaced with a Dataset implementation.  cache
> map in FileManager is essentially the same as the named graphs in the
> Dataset.
>
> One immediate advantage would be that all OntDocumentManager
> ontologies would be accessible using SPARQL. We have a use case to
> make them queryable, which led to this idea.
>
> I've made a PoC implementation. It uses DataManager as a subclass of
> FileManager with getModelCache() exposed as an immutable map, because
> FileManager itself provides no way to list the entries in the cache
> map.
> https://github.com/AtomGraph/LinkedDataHub/blob/develop/src/main/java/com/atomgraph/linkeddatahub/server/util/DataManagerDataset.java
>
> Thoughts?
>
> Martynas


Replacing FileManager with Dataset

2022-01-22 Thread Martynas Jusevičius
Hi,

We are using FileManager as part of the Ontology API. We noticed its
Model cache-related methods are marked deprecated:
https://jena.apache.org/documentation/javadoc/jena/org/apache/jena/util/FileManager.html

I don't see how OntDocumentManager can work without some sort of model
cache. But I think that FileManager (or at least the caching part)
could be replaced with a Dataset implementation.  cache
map in FileManager is essentially the same as the named graphs in the
Dataset.
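
To make the comparison concrete, here is a minimal sketch in plain Java (no Jena dependency) of a read-only, dataset-like view over a URI-to-model cache map. The class name CacheBackedGraphs and its methods are hypothetical and only loosely mirror the named-model accessors of a Dataset; the model type is left generic, and this is not the PoC linked in this message.

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical illustration only: not part of Jena or of the linked PoC.
// M stands in for whatever model/graph type the cache holds.
final class CacheBackedGraphs<M> {
    private final Map<String, M> cache;

    CacheBackedGraphs(Map<String, M> cache) {
        // Wrap rather than copy, so later cache entries show up as new named graphs.
        this.cache = Collections.unmodifiableMap(cache);
    }

    // Graph names are simply the cache keys (document URIs).
    Iterator<String> listNames() {
        return cache.keySet().iterator();
    }

    boolean containsNamedModel(String uri) {
        return cache.containsKey(uri);
    }

    Optional<M> getNamedModel(String uri) {
        return Optional.ofNullable(cache.get(uri));
    }

    public static void main(String[] args) {
        Map<String, String> cache = new LinkedHashMap<>();
        cache.put("http://example.org/ontology/a", "model A");
        CacheBackedGraphs<String> graphs = new CacheBackedGraphs<>(cache);
        cache.put("http://example.org/ontology/b", "model B"); // added after wrapping
        graphs.listNames().forEachRemaining(System.out::println);
    }
}
```

Because the view wraps the map instead of copying it, anything cached later becomes visible as a named graph without extra bookkeeping.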

One immediate advantage would be that all OntDocumentManager
ontologies would be accessible using SPARQL. We have a use case to
make them queryable, which led to this idea.

I've made a PoC implementation. It uses DataManager as a subclass of
FileManager with getModelCache() exposed as an immutable map, because
FileManager itself provides no way to list the entries in the cache
map.
https://github.com/AtomGraph/LinkedDataHub/blob/develop/src/main/java/com/atomgraph/linkeddatahub/server/util/DataManagerDataset.java

Thoughts?

Martynas


Re: Dynamically restricting graph access at SPARQL query time

2022-01-22 Thread Andy Seaborne




On 21/01/2022 15:26, Martynas Jusevičius wrote:

WebAccessControl ontology might be relevant here:
https://www.w3.org/wiki/WebAccessControl
We're using a request filter that controls access against
authorizations using SPARQL.




On Fri, Jan 21, 2022 at 4:13 PM Vilnis Termanis
 wrote:


Hi,


Hi Vilnis,


For a SPARQL query via Fuseki, we are trying to restrict visibility of
groups of triples (each with multiple subjects) dynamically, in order
to allow for generic queries to be executed by users (instead of
providing tinned ones).



Looking at the available ACL mechanisms in Jena/Fuseki, I assume
storing each of these groups as a distinct graph might be the way
forward. (The expectation is to be able to support 10^5 or higher
number of these.)


If each graph is in the same TDB dataset, graph numbers are not much 
different from any other node frequency. Millions of graphs are 
possible. It's all quads. 4 Node/NodeIds.


So it might be a way forward (details matter...)

Managing said dataset is another matter.

The description sounds a bit SOLID-like - see Martynas's comment
and -> https://inrupt.com.


I.e.: Given a user (external to Fuseki, e.g. presented via shiro via
LDAP/other), only consider triples from the set of graphs 1..N during
the query. (Where the allowed list of 1..N graphs is to be looked up
at the point of the query.)


How often is LDAP being accessed per query execution? Going off machine 
is a significant cost compared to triple access.  (From experience of 
others, LDAP servers can be "unhelpful" - e.g. big spread in the latency 
of requests based on environmental factors).


(shiro is only integrated for Fuseki/webapp - it does work with 
Fuseki/main but you have to add it. Current WIP should, eventually, 
improve this.)



From my limited understanding, some potential routes are:

a) jena-fuseki-access - Filters triples at storage level via "TDB Quad
Filter" support in TDB.


Yes. Filtering is a hook to use. Sounds like your UC might need its own 
filter code (in Java) for the policy.



However, the configuration of allowed graphs per user is static at runtime.


jena-fuseki-access is a layer on top of the filtering mechanism for the 
common case of ACLs on named graphs. That layer isn't compulsory for 
quad filtering. The code may be inspiration for setup of custom code.
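
A minimal sketch of what a dynamic policy could look like, in plain Java rather than against the TDB quad filter API: the user's allowed graphs are resolved once per query (for example, one directory call), and each quad is then checked with a set lookup. The names DynamicGraphFilter, Quad, and forUser are hypothetical, and quads are reduced to strings for illustration.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical illustration only: plain Java, not the TDB quad filter API.
final class DynamicGraphFilter {

    // A quad reduced to strings: graph name plus subject/predicate/object.
    static final class Quad {
        final String graph, subject, predicate, object;

        Quad(String graph, String subject, String predicate, String object) {
            this.graph = graph;
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }
    }

    // Resolve the user's graphs once, then test membership per quad.
    static Predicate<Quad> forUser(String user, Function<String, Set<String>> graphLookup) {
        Set<String> allowed = new HashSet<>(graphLookup.apply(user)); // snapshot for this query
        return quad -> allowed.contains(quad.graph);
    }

    public static void main(String[] args) {
        // Stand-in for an external lookup (LDAP or similar); assumed for this example.
        Function<String, Set<String>> lookup =
            user -> user.equals("alice") ? Set.of("urn:g1", "urn:g2") : Set.of();
        List<Quad> data = List.of(
            new Quad("urn:g1", "urn:s1", "urn:p", "urn:o"),
            new Quad("urn:g3", "urn:s2", "urn:p", "urn:o"));
        Predicate<Quad> filter = forUser("alice", lookup);
        List<String> visible = data.stream()
            .filter(filter)
            .map(q -> q.subject)
            .collect(Collectors.toList());
        System.out.println(visible);
    }
}
```

Taking the snapshot up front keeps the off-machine lookup out of the per-quad path, which is where its latency would hurt most.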



b) jena-permissions - Extends the SPARQL query engine with an Op
rewriter which allows a user-defined evaluator implementation to
allow/deny access to a graph/triple, given a specific user/principal.
(The specific yes/no evaluation responses are cached for the duration
of a query/operation.)


From what I know, should work. Claude may be able to say more.
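
The per-query caching of yes/no decisions mentioned above can be sketched in plain Java as follows. This is not the jena-permissions evaluator interface; the class PerQueryDecisionCache and the policy callback are hypothetical stand-ins.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiPredicate;

// Hypothetical illustration only: this is not the jena-permissions evaluator interface.
final class PerQueryDecisionCache {
    private final String principal;
    private final BiPredicate<String, String> policy; // (principal, graphUri) -> allowed?
    private final Map<String, Boolean> decisions = new HashMap<>();

    PerQueryDecisionCache(String principal, BiPredicate<String, String> policy) {
        this.principal = principal;
        this.policy = policy;
    }

    // Ask the policy at most once per graph; create a fresh instance for each query.
    boolean canRead(String graphUri) {
        return decisions.computeIfAbsent(graphUri, g -> policy.test(principal, g));
    }

    public static void main(String[] args) {
        int[] policyCalls = {0};
        PerQueryDecisionCache cache = new PerQueryDecisionCache("alice", (user, graph) -> {
            policyCalls[0]++;
            return graph.startsWith("urn:public:");
        });
        cache.canRead("urn:public:g1");
        cache.canRead("urn:public:g1"); // served from the cache
        cache.canRead("urn:private:g2");
        System.out.println("policy calls: " + policyCalls[0]);
    }
}
```

Scoping the cache to one query keeps decisions consistent within that query while still letting a permission change take effect on the next one.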


However, this can only be applied to a single graph as it stands.


A dataset is a collection of named graphs. Each graph can have 
jena-permissions wrapped around it.


Your UC description was about groups of triples, and then it slid into 
named graphs. Are the named graph (NG) points above and below about an 
implementation possibility, or is the incoming data already using named graphs?


This approach will not use the low-level filtering, but the difference
may not show. What matters is the number of visible graphs, not the total 
number.


Can possibly be combined with (c).


c) Parse & re-write the query to e.g. scope it using a fixed set of
"FROM" clauses. From some minimal testing (with ~200 FROM clauses)
this does not appear to perform well (compared to a tinned query which
explicitly restricts access via knowledge of the ontologies involved).
I appreciate that maybe having a large list of FROM clauses is an
anti-pattern.


Quite likely - depends on the query complexity and numbers. There's a 
hook in Fuseki query evaluation for this - did you try that or did you 
do it client-side?


If the query is a small amount of work, the setup overhead will be 
significant but it is (roughly) a fixed overhead so a longer running 
query is less impacted.
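
For reference, route (c) can be sketched as a naive string rewrite in plain Java. The class FromClauseScoper is hypothetical; it assumes the query has an explicit WHERE keyword, has no FROM clauses of its own, and does not contain the word WHERE earlier in a prefix, IRI, or comment. A real implementation would work on the parsed query instead.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical illustration only: naive string rewriting, not a SPARQL parser.
final class FromClauseScoper {
    private static final Pattern WHERE =
        Pattern.compile("\\bWHERE\\b", Pattern.CASE_INSENSITIVE);

    // Insert one FROM clause per allowed graph before the first WHERE keyword.
    static String scope(String query, List<String> graphUris) {
        Matcher m = WHERE.matcher(query);
        if (!m.find()) {
            throw new IllegalArgumentException("expected an explicit WHERE keyword");
        }
        String fromClauses = graphUris.stream()
            .map(uri -> "FROM <" + uri + ">\n")
            .collect(Collectors.joining());
        return query.substring(0, m.start()) + "\n" + fromClauses + query.substring(m.start());
    }

    public static void main(String[] args) {
        String scoped = scope("SELECT * WHERE { ?s ?p ?o }",
            List.of("urn:g1", "urn:g2"));
        System.out.println(scoped);
    }
}
```

The rewrite itself is cheap; the cost discussed above comes from the engine having to evaluate against the merged set of FROM graphs.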



My questions are:

1) Does filtering to a subset of graphs (from a large set of
graphs) to restrict access sound like a sensible thing to do? (Note
that each of these graphs would contain a set of multiple subjects -
i.e. we are not trying to filter by specific predicate/object values.)


Sounds possible - "sensible" depends on the details of the intended usage.


2) Would either extending jena-fuseki-access to support the
user-graph-list lookup dynamically OR extending jena-permissions to work
at dataset level be sensible things to do?


Functionally - yes, but lots of details matter.

And user-graph-list sounds SOLID-like.

In the SOLID approach the access path is known and it's the path that 
decides access or not.  Very different to filtering.



3) If the answer to either of (2) is yes - I'd be interested in
getting a better understanding of what would be involved to gauge the
size/effort of such an extension. I have had a look at the codebases for the
aforementioned projects, but my knowledge of TDB/ARQ/etc is very
limited. (We'd