Re: Achieving reasonably performing federated queries

Diogo FC Patrao Tue, 23 Jul 2013 07:59:25 -0700

Hello

I observed the same behaviour as you did, and it led me to a few
considerations. I haven't checked the Jena sources, so I may be wrong
here.

As stated before in this list, ARQ doesn't have any federation-specific
optimization, so it behaves as if the cost of accessing local and federated
data were the same.

Consider the SQL query below:

SELECT * FROM A JOIN B ON (A.some_column = B.some_column)

There are at least two plans for solving that: (1) one that builds the
cross product of A and B and then filters the results according to the
condition; and (2) one that, for each element in A, looks for the elements
in B that satisfy the condition. To select the better plan, the SQL planner
takes into consideration whether the relevant columns are indexed and the
size of the tables involved, since both affect the cost of accessing rows.
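
To make the two plans concrete, here is a minimal Python sketch. It is not
how any real SQL planner is implemented; the tables A and B, their rows and
the function names are made up for illustration, and the dictionary only
stands in for an index on B.some_column.

```python
from collections import defaultdict
from itertools import product

# Hypothetical tables: each row is a dict with a "some_column" key.
A = [{"id": 1, "some_column": "x"}, {"id": 2, "some_column": "y"}]
B = [{"id": 10, "some_column": "x"}, {"id": 11, "some_column": "z"},
     {"id": 12, "some_column": "x"}]


def join_cross_product(a_rows, b_rows):
    """Plan (1): build every (a, b) pair, then filter on the join condition."""
    return [(a, b) for a, b in product(a_rows, b_rows)
            if a["some_column"] == b["some_column"]]


def join_index_lookup(a_rows, b_rows):
    """Plan (2): for each row of A, look up the matching rows of B."""
    index = defaultdict(list)  # stands in for an index on B.some_column
    for b in b_rows:
        index[b["some_column"]].append(b)
    return [(a, b) for a in a_rows for b in index.get(a["some_column"], [])]


pairs_1 = join_cross_product(A, B)
pairs_2 = join_index_lookup(A, B)
```

Both functions return the same pairs. They differ in the work done: plan (1)
examines len(A) * len(B) pairs, while plan (2) visits each row of A once and
then only the matching rows of B.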

The better plan for the query you posted would be (1), simply because of
the cost of accessing a remote service. But if the first SERVICEd query
returned just a few rows, it might be better to run the same query a couple
of times, as in (2), than to fetch all the results.
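
A rough way to see where the trade-off flips is a back-of-the-envelope cost
model. The Python sketch below is only an illustration under assumed
numbers: the function names, the 200 ms latency per request and the 1 ms per
transferred row are all made up, and ARQ does not compute anything like
this.

```python
def cost_fetch_all(latency, per_row, rows_first, rows_second):
    """Plan (1): one request per SERVICE, all rows shipped and joined locally."""
    return 2 * latency + per_row * (rows_first + rows_second)


def cost_per_row_lookup(latency, per_row, rows_first, matches_per_lookup):
    """Plan (2): one request for the first SERVICE, then one per returned row."""
    first = latency + per_row * rows_first
    lookups = rows_first * (latency + per_row * matches_per_lookup)
    return first + lookups


# Assumed costs in milliseconds (hypothetical values).
LATENCY_MS = 200
PER_ROW_MS = 1

# Few rows from the first SERVICE, a large second result set.
few_plan_1 = cost_fetch_all(LATENCY_MS, PER_ROW_MS, 3, 100000)
few_plan_2 = cost_per_row_lookup(LATENCY_MS, PER_ROW_MS, 3, 5)

# 150 rows on each side, as in the ~150x150 case mentioned in the thread.
many_plan_1 = cost_fetch_all(LATENCY_MS, PER_ROW_MS, 150, 150)
many_plan_2 = cost_per_row_lookup(LATENCY_MS, PER_ROW_MS, 150, 5)
```

With these assumed numbers, plan (2) wins when the first SERVICE returns
only 3 rows, and plan (1) wins clearly once it returns 150, because every
extra row costs one more round trip.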

As for optimizing the query, I would try splitting each query into a
UNION, one part with the OPTIONAL and the other without it. Getting the
subproperties can be expensive too, depending on which triplestore you're
querying. If it's Fuseki+TDB and you have access to the server
configuration, you could turn on RDFS inference. Also, the order of the
triple patterns can have a large influence on overall query performance:
put the patterns that return fewer results before the others.

Good luck!

--
diogo patrão

On Tue, Jul 23, 2013 at 10:56 AM, Olivier Rossel
<[email protected]> wrote:

> Same questions here.
> So I +1 this question immensely!
>
>
> On Tue, Jul 23, 2013 at 11:48 AM, Sarven Capadisli <[email protected]>
> wrote:
>
> > Hi all,
> >
> > This is partly a summary of my recent experiences with federated queries
> > and partly a request for your feedback on making /reasonably/ performing
> > federated queries.
> >
> > The query in question is here [1]. Essentially there are two endpoints
> > (which may or may not be the same), and they return the same pattern.
> > There are millions of triples to get through, so throwing out false
> > negatives (early on) is quite important. We assume that graph names are
> > not known and that everything is accessible from the default graph. The
> > endpoint which dispatches the two queries needs to filter out what's
> > remaining. There are no common variables. This means that both endpoints
> > need to do their own thing and then the patterns are joined.
> >
> > Needless to say, the OPTIONALs in there are expensive, but they help
> > a great deal in making sure to use only what's necessary, i.e., either a
> > refArea doesn't have an exactMatch or, if there is an exactMatch, it
> > contains the domain of the refArea that's at the other endpoint. Without
> > OPTIONALs, the outer endpoint will end up with more possibilities to
> > join. Using MINUS is more or less the same.
> >
> > By default, ARQ uses an optimizer to do a whole bunch of good stuff
> > that's mostly foreign to me. What I'm aware of, however, is how it
> > behaves when it comes to SERVICE calls. When the first SERVICE call comes
> > back with n triples, the second SERVICE is called n times. Undoubtedly,
> > this doesn't scale at all.
> >
> > To work around this, I've turned off the optimizer with
> > Optimize.noOptimizer() [2], using a simple class which is called from the
> > parent endpoint's TDB assembler file. As expected, that allows the parent
> > to make only two SERVICE calls.
> >
> > This is the current state of things. I'd like to take it further to get
> > more out of this, but at this point, I need a different set of eyes.
> >
> > [I will prepare a chart for this, but this rough explanation might do for
> > now] As there are different endpoints with different amounts of data,
> > what I've experienced is that some of the quickest queries take around 3
> > seconds. Those are typically queries with a low number of joins:
> > ~150x150=22500 possibilities before the last filter kicks in. It gets
> > heavy quite fast, as I've seen some queries take 30 seconds or more.
> >
> > The TDB optimizer stats file is up to date on all endpoints.
> >
> > I am completely open to how this query can be restructured, or would
> > simply like to hear about your own experiences with federated queries.
> >
> > [1] http://csarven.ca/linked-statistical-data-analysis#federated-sparql-query
> > [2] http://jena.apache.org/documentation/javadoc/arq/com/hp/hpl/jena/sparql/algebra/optimize/Optimize.html#noOptimizer()
> >
> >
> > -Sarven
> > http://csarven.ca/#i
> >
> >
> >
>
