Re: querying lots of quad files in block storage

2022-04-14 Thread Martynas Jusevičius
There was a related thread
https://www.mail-archive.com/users@jena.apache.org/msg18577.html

On Thu, 14 Apr 2022 at 22.42, Justin  wrote:

> Hello,
>
> I am looking to see if Jena is a good fit for querying many billion quads
> (in thousands of .nq files) sitting in block storage (like AWS S3). The .nq
> files don't change. New .nq files do get added to S3, however. Also update
> queries are not needed -- just selects, constructs, asks, etc.
>
> It would be easy to iterate over all the files and produce TDB2s in a
> filesystem (on AWS EBS or EFS)...
>
> Has anyone gone down this path and have some wisdom to share?
> I understand queries won't be as snappy as querying a single TDB2.
>
> Thanks,
> Justin
>


querying lots of quad files in block storage

2022-04-14 Thread Justin
Hello,

I am looking to see if Jena is a good fit for querying many billion quads
(in thousands of .nq files) sitting in block storage (like AWS S3). The .nq
files don't change. New .nq files do get added to S3, however. Also update
queries are not needed -- just selects, constructs, asks, etc.

It would be easy to iterate over all the files and produce TDB2s in a
filesystem (on AWS EBS or EFS)...

Has anyone gone down this path and have some wisdom to share?
I understand queries won't be as snappy as querying a single TDB2.

Thanks,
Justin


Re: Interaction between Text indexing, Fuseki services & Data Access Control

2022-04-14 Thread Vilnis Termanis
Hi,

Specifying the following in each service (which is to ignore text
indexing) now works (tested against 4.4.0):

ja:context [ ja:cxtName "http://jena.apache.org/text#index" ;
ja:cxtValue false ] ;

... but not if the associated dataset is an AccessControlledDataset.
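
For illustration, a minimal sketch of where that override could sit in a service definition. The service name "read-all" and the dataset resource :dataset_tdb are hypothetical placeholders, not taken from the actual configuration; only the ja:context entry is the fragment quoted above.

```turtle
PREFIX :       <#>
PREFIX fuseki: <http://jena.apache.org/fuseki#>
PREFIX ja:     <http://jena.hpl.hp.com/2005/11/Assembler#>

# Hypothetical read-only service over the shared dataset.
:service_read_all a fuseki:Service ;
    fuseki:name "read-all" ;
    fuseki:endpoint [
        fuseki:operation fuseki:query ;
        fuseki:name "sparql"
    ] ;
    # Service-level context override: ignore the text index for this service.
    ja:context [
        ja:cxtName "http://jena.apache.org/text#index" ;
        ja:cxtValue false
    ] ;
    # :dataset_tdb is assumed to be defined elsewhere in the same file.
    fuseki:dataset :dataset_tdb .
```

The same ja:context block would be repeated in every service that should not see the index; the "full access" service simply omits it.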

From my understanding, the issue is to do with the fact that
fuseki-access uses QueryExecutionFactory whilst fuseki-core uses
QueryExecDatasetBuilder. The latter takes the HttpAction's context
into account (which presumably leads to the inclusion of the service
context values) while the former does not. (In addition, it would
appear that fuseki-access does not honour query-specific timeouts due
to a similar reason.)

This patch seems to fix the issue:
https://github.com/vtermanis/jena/commit/e5cb112f829f305c1f76c8f5305f4394d8e9b04f

I know that this most likely is not the right way to address it.
(Should fuseki-access re-use some common QueryExecution building code
from fuseki-core?) I also wasn't sure how to add an automated (to
jena-integration-tests or with mocking in fuseki-access?) test-case
for this, but I can provide minimal manual steps.

Should I create a Jira ticket for this?

Regards,
Vilnis


On Tue, 12 Apr 2022 at 21:44, Vilnis Termanis
 wrote:
>
> Hi Andy,
>
> Thank you for the suggestion of in-config context overrides - I had
> not realised that was possible (with the newer style of defining
> Fuseki services) - that's really useful.
> We'll re-test the aforementioned 2b & 3b cases.
>
> Regards,
> Vilnis
>
> On Fri, 8 Apr 2022 at 11:51, Andy Seaborne  wrote:
> >
> > Hi Vilnis,
> >
> > On 07/04/2022 11:10, Vilnis Termanis wrote:
> > > Hi,
> > >
> > > In brief: Can Fuseki Data ACL be applied to text indexing?
> >
> > As a general point - a text index itself is not ACL aware. It is set up
> > ahead of time and does not index triples directly. The GeoSPARQL cache
> > is probably similar (I'm less familiar with the GeoSPARQL code).
> >
> > When the query is under the control of a trusted client, the pattern:
> >
> > WHERE {
> >   ?s a ex:Product ;
> >      text:query (rdfs:label 'printer') ;
> >      rdfs:label ?lbl
> > }
> >
> > can act as a check of the triple.
> >
> > If the query isn't controlled, then that won't work.
> >
> > (Has your usage style changed in the last year?)
> >
> > > And is it
> > > possible to selectively expose text index access per service for a
> > > shared dataset?
> >
> > Yes.
> >
> > The context setting can be set per dataset, per service or per endpoint
> > with ja:context [ ja:cxtName "NAME" ;  ja:cxtValue "VALUE" ] ;
> >
> > E.g.
> >     fuseki:endpoint [
> >         fuseki:operation fuseki:query ;
> >         fuseki:name "sparql" ;
> >         ja:context [
> >             ja:cxtName "NAME" ; ja:cxtValue "VALUE"
> >         ] ;
> >     ] ;
> >
> > >
> > > In detail:
> > >
> > > We're using a single TDB dataset (in unionDefaultGraph mode) with
> > > multiple services, wrapped with both ACL (AccessControlledDataset) as
> > > well as text indexing (TextDataset) and are hoping to provide the
> > > following Fuseki services:
> > >
> > > 1. "full access" - a) Read/write everything b) including text index
> > > 2. "selected graphs only" - a) Read only from selected graphs b) no index access
> > > 3. "read all" - a) Read everything b) no index access
> > >
> > > In the assembler configuration, datasets for the above services are
> > > respectively defined as (where all use the same underlying dataset):
> > > 1. TextDataset(DatasetTDB)
> > > 2. AccessControlledDataset(DatasetTDB)
> > > 3. DatasetTDB
> > >
> > > 1a & 1b work as expected, as do 2a & 3a. 2b & 3b however still allow
> > > access to text indexing, despite not being explicitly configured as
> > > such in their respective services.
> >
> > re: 2b/3b: That could be a bug or a configuration error.
> >
> > The context value is set on the text dataset. So if the server
> > configuration has a service that does not go through the text dataset,
> > the index should not be visible. There will be an entry in the server log.
> >
> > You don't actually need the DatasetGraphText if the index is only read
> > (i.e. preloaded and no runtime updates).
> >
> > >  From looking at code, I can see that index availability is based on
> > > the TextQuery.textIndex symbol in the execution context
> > > (TextQueryPF.java). This means that, as long as at least one service
> > > enabled text indexing on a dataset, any other services referencing the
> > > same underlying store will also use it.
> > > (Judging by comments in the code, the "instanceof DatasetGraphText"
> > > check is deprecated, even if the logic for now remains in
> > > chooseTextIndex()).
> > >
> > > So our questions are:
> > >
> > > I) Is it currently possible to disallow access to the text index for
> > > some services but not others (using the same underlying dataset)?
> >
> > Should be - see above.
> >
> > > II) If not, what might be best approach to implement such a
> > > restriction?