Re: Achieving reasonably performing federated queries

Olivier Rossel Tue, 23 Jul 2013 06:58:34 -0700

Same questions here.
So I +1 this question immensely!


On Tue, Jul 23, 2013 at 11:48 AM, Sarven Capadisli <[email protected]> wrote:

> Hi all,
>
> This is partly a summary of my recent experiences with federated queries
> and partly a request for your feedback on making /reasonably/ performing
> federated queries.
>
> The query in question is here [1]. Essentially there are two endpoints
> (which may or may not be the same), and they return the same pattern. There
> are millions of triples to get through, so throwing out false negatives
> (early on) is quite important. We assume that graph names are not known and
> that everything is accessible from the default graph. The endpoint which
> dispatches the two queries needs to filter out what's remaining. There are
> no common variables. This means that both endpoints need to do their own
> thing and then the patterns are joined.
>
> Needless to say, OPTIONALs that are in there are expensive, but they help
> a great deal in making sure to use only what's necessary i.e., either a
> refArea doesn't have an exactMatch or if there is an exactMatch, it
> contains the domain of the refArea that's at the other endpoint. Without
> OPTIONALs, the outer endpoint will end up with more possibilities to join.
> Using MINUS is more or less the same.
>
> By default, ARQ uses an optimizer to do a whole bunch of good stuff that's
> mostly foreign to me. What I'm aware of, however, is how it behaves when it
> comes to SERVICE calls. When the first SERVICE call comes back with n
> triples, the second SERVICE is called n times. Undoubtedly, this doesn't
> scale at all.
>
> To work around this, I've turned off the optimizer with
> Optimize.noOptimizer() [2], invoked from a simple class which is called
> from the parent endpoint's TDB assembler file. As expected, that allows the
> parent to make only two SERVICE calls.
>
> This is the current state of things. I'd like to take it further to get
> more out of this, but at this point, I need a different set of eyes.
>
> [I will prepare a chart for this, but this rough explanation might do for
> now] As there are different endpoints with different amounts of data, what
> I've experienced is that some of the quickest queries take around 3
> seconds. Those are typically queries with a low number of joins;
> ~150x150=22500 possibilities before the last filter kicks in. It gets heavy
> quite fast, as I've seen some queries take 30 seconds or more.
>
> The TDB optimizer stats file is up to date on all endpoints.
>
> I am completely open to how this query can be restructured, and would
> simply like to hear about your own experiences with federated queries.
>
> [1] http://csarven.ca/linked-statistical-data-analysis#federated-sparql-query
> [2] http://jena.apache.org/documentation/javadoc/arq/com/hp/hpl/jena/sparql/algebra/optimize/Optimize.html#noOptimizer()
>
> -Sarven
> http://csarven.ca/#i
>
>
>
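
To make the SERVICE call counts in the quoted message concrete, here is a toy model in Python. It is only a sketch under stated assumptions, not ARQ's actual execution: the Endpoint class and both functions are hypothetical stand-ins. It contrasts calling the second endpoint once per row returned by the first with calling each endpoint once and joining locally. With no common variables, both strategies produce the same ~150x150=22500 candidates, and only the number of remote calls differs.

```python
from itertools import product


class Endpoint:
    """Hypothetical stand-in for a remote SPARQL endpoint; counts its calls."""

    def __init__(self, rows):
        self.rows = list(rows)
        self.calls = 0

    def query(self):
        # Each invocation models one SERVICE round trip.
        self.calls += 1
        return list(self.rows)


def nested_service_calls(first, second):
    """Call the second endpoint once for every row the first one returned."""
    pairs = []
    for left in first.query():
        for right in second.query():
            pairs.append((left, right))
    return pairs


def independent_service_calls(first, second):
    """Call each endpoint exactly once, then join the results locally."""
    left_rows = first.query()
    right_rows = second.query()
    # No common variables, so the local join is a plain cross product.
    return list(product(left_rows, right_rows))


if __name__ == "__main__":
    a, b = Endpoint(range(150)), Endpoint(range(150))
    nested = nested_service_calls(a, b)
    print("nested:", a.calls + b.calls, "calls,", len(nested), "candidates")

    c, d = Endpoint(range(150)), Endpoint(range(150))
    joined = independent_service_calls(c, d)
    print("independent:", c.calls + d.calls, "calls,", len(joined), "candidates")
```

In this model the final filter still has to work through every candidate pair either way, which may be why run time grows quickly with the size of the two result sets even after the remote calls drop to two.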
