Re: Achieving reasonably performing federated queries

Claude Warren Wed, 24 Jul 2013 00:49:36 -0700

I did something like this a year ago (I should probably write it up).  In
our case we had what we called a "roadmap" that could identify properties
across various SPARQL endpoints that were logically the same (e.g.
foo:molecularWeight, bar:molecular_weight and baz:atomic_weight might all
be the same).  We constructed a vocabulary that we would query in and then
mapped that vocabulary to the schemas/vocabularies of the endpoints we were
interested in.  A custom query engine would then transform the query in our
constructed vocabulary into queries that the endpoints could understand,
and then map the results back to the vocabulary.  Our endpoints were large
and scattered around the world.  (This was a medical/drug/genomic
application.)
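
A minimal sketch of the roadmap idea, for illustration only and not the
original code. The endpoint URLs, the "ours:" prefix, and the function names
are all hypothetical, and PREFIX declarations are left out. It rewrites
property names with a regular expression, whereas a real engine would rewrite
the parsed query rather than its text.

```python
import re

# Hypothetical roadmap: one property in the constructed vocabulary mapped to
# the logically equivalent property at each endpoint.
ROADMAP = {
    "ours:molecularWeight": {
        "http://foo.example/sparql": "foo:molecularWeight",
        "http://bar.example/sparql": "bar:molecular_weight",
        "http://baz.example/sparql": "baz:atomic_weight",
    },
}


def rewrite_for_endpoint(query, endpoint):
    """Replace each constructed-vocabulary property with the endpoint's term."""
    for canonical, per_endpoint in ROADMAP.items():
        local = per_endpoint.get(endpoint)
        if local is None:
            continue  # this endpoint has no equivalent; leave the term alone
        # Match the whole prefixed name only, not a longer name containing it.
        pattern = r"(?<![\w:])" + re.escape(canonical) + r"(?![\w-])"
        query = re.sub(pattern, lambda m, repl=local: repl, query)
    return query


def map_back(term, endpoint):
    """Map an endpoint-specific term in a result back to our vocabulary."""
    for canonical, per_endpoint in ROADMAP.items():
        if per_endpoint.get(endpoint) == term:
            return canonical
    return term  # not a mapped term; return it unchanged
```

For example, rewrite_for_endpoint("SELECT ?d ?w WHERE { ?d ours:molecularWeight ?w }",
"http://bar.example/sparql") yields the same query with bar:molecular_weight
in place of ours:molecularWeight.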


Issues we had to deal with:
1) endpoints may or may not be up at the time of query.
2) some endpoints may not respond quickly enough.

To deal with issue 1 we identified endpoints that had identical data
sets (e.g. clones) and would list them as alternatives.  The system would
periodically poll all the known endpoints and determine which were up and
which had the best response time.  The alternative that had the best
response time would be selected at query time.
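
The selection step could be sketched roughly as below. This is an
illustration under assumptions, not the original code: the function names
are hypothetical, "ask" stands for whatever cheap request the poller sends
(for instance an ASK query) and is supplied by the caller, and the
"max_seconds" cutoff is an added assumption for skipping endpoints that are
too slow.

```python
import time


def probe(endpoint, ask):
    """Return the response time in seconds, or None if the endpoint is down.

    `ask` is a caller-supplied function (hypothetical) that sends a cheap
    request to the endpoint and raises an exception on failure.
    """
    start = time.monotonic()
    try:
        ask(endpoint)
    except Exception:
        return None
    return time.monotonic() - start


def poll(endpoints, ask):
    """Probe every known endpoint once; meant to be run periodically."""
    return {ep: probe(ep, ask) for ep in endpoints}


def pick(alternatives, timings, max_seconds=None):
    """Choose the fastest live alternative, or None if none qualifies."""
    live = [(timings[ep], ep) for ep in alternatives
            if timings.get(ep) is not None]
    if max_seconds is not None:
        # Assumed extension: drop endpoints that answered too slowly.
        live = [(t, ep) for t, ep in live if t <= max_seconds]
    return min(live)[1] if live else None
```

At query time the engine would call pick() with the list of clones for a
data set and the timings from the most recent poll.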

All of this was packaged as a single jar and could run on a laptop as well
as a server.  We did not use Fuseki but built the application on top of
Jetty using much of the Fuseki code as a pattern.

Let me know if you are interested and I will endeavour to put together a
better description of how it all worked.

Claude



On Tue, Jul 23, 2013 at 3:57 PM, Diogo FC Patrao <[email protected]> wrote:

> Hello
>
> I observed the same behaviour as you did and have some considerations that
> are product of that. I haven't checked the Jena sources, so I may be wrong
> here.
>
> As stated before in this list, ARQ doesn't have any federation-specific
> optimization, so it behaves as if the cost of accessing local and federated
> data were the same.
>
> Consider the SQL query below:
>
> SELECT * FROM A JOIN B ON (A.some_column = B.some_column)
>
> There are at least two plans for solving that: (1) one that makes the
> cross product of A and B, and then filters the results according to the
> condition; and (2) one that, for each element in A, looks for elements in B
> that satisfy the condition.  To select the better plan, the SQL planner
> takes into consideration whether the relevant columns are indexed and the
> size of the tables involved, all factors that affect the cost of accessing
> rows.
>
> The better plan for the query you posted would be (1), simply because of
> the cost of accessing a remote service. But if the first SERVICEd query
> returns just a few rows, it may be better to run the same query a few
> times, as in (2), than to fetch all the results.
>
> As for optimizing the query, I would try separating each query into a
> UNION, one part with the OPTIONAL, the other without it. Getting the
> subproperties, depending on which triplestore you're querying, can be
> expensive too. If it's Fuseki+TDB and you have access to the server
> configuration, you could turn on RDFS inference. Also, the order of the
> triple patterns can greatly influence overall query performance: put the
> patterns that return fewer results before the others.
>
> Good luck!
>
> --
> diogo patrão
>
>
>
>
> On Tue, Jul 23, 2013 at 10:56 AM, Olivier Rossel
> <[email protected]> wrote:
>
> > Same questions here.
> > So I +1 this question immensely!
> >
> >
> > On Tue, Jul 23, 2013 at 11:48 AM, Sarven Capadisli <[email protected]>
> > wrote:
> >
> > > Hi all,
> > >
> > > This is partly a summary of my recent experiences with federated
> > > queries and partly a request for your feedback on making /reasonably/
> > > performing federated queries.
> > >
> > > The query in question is here [1]. Essentially there are two endpoints
> > > (which may or may not be the same), and they return the same pattern.
> > > There are millions of triples to get through, so throwing out false
> > > negatives (early on) is quite important. We assume that graph names are
> > > not known and that everything is accessible from the default graph. The
> > > endpoint which dispatches the two queries needs to filter out what's
> > > remaining. There are no common variables. This means that both endpoints
> > > need to do their own thing and then the patterns are joined.
> > >
> > > Needless to say, OPTIONALs that are in there are expensive, but they
> > > help a great deal in making sure to use only what's necessary i.e.,
> > > either a refArea doesn't have an exactMatch or if there is an
> > > exactMatch, it contains the domain of the refArea that's at the other
> > > endpoint. Without OPTIONALs, the outer endpoint will end up with more
> > > possibilities to join. Using MINUS is more or less the same.
> > >
> > > By default, ARQ uses an optimizer to do a whole bunch of good stuff
> > > that's mostly foreign to me. What I'm aware of however is how it behaves
> > > when it comes to SERVICE calls. When the first SERVICE call comes back
> > > with n triples, the second SERVICE is called n times. Undoubtedly, this
> > > doesn't scale at all.
> > >
> > > To work around this, I've turned off the optimizer with
> > > Optimize.noOptimizer() [2] with a simple class which is called from the
> > > parent endpoint's TDB assembler file. As expected, that allows the
> > > parent to make only two SERVICE calls.
> > >
> > > This is the current state of things. I'd like to take it further to get
> > > more out of this, but at this point, I need a different set of eyes.
> > >
> > > [I will prepare a chart for this, but this rough explanation might do
> > > for now] As there are different endpoints with different amounts of
> > > data, what I've experienced is that some of the fastest queries take
> > > around 3 seconds. Those are typically queries with a low number of
> > > joins: ~150x150=22500 possibilities before the last filter kicks in. It
> > > gets heavy quite fast, as I've seen some queries take 30 seconds or
> > > more.
> > >
> > > The TDB optimizer stats file is up to date on all endpoints.
> > >
> > > I am completely open to how this query can be restructured, or would
> > > simply like to hear about your own experiences with federated queries.
> > >
> > > [1] http://csarven.ca/linked-statistical-data-analysis#federated-sparql-query
> > > [2] http://jena.apache.org/documentation/javadoc/arq/com/hp/hpl/jena/sparql/algebra/optimize/Optimize.html#noOptimizer()
> > >
> > >
> > > -Sarven
> > > http://csarven.ca/#i
> > >
> > >
> > >
> >
>



-- 
I like: Like Like - The likeliest place on the web<http://like-like.xenei.com>
Identity: https://www.identify.nu/[email protected]
LinkedIn: http://www.linkedin.com/in/claudewarren
