Hello, I observed the same behaviour as you did, and have a few thoughts as a result. I haven't checked the Jena sources, so I may be wrong here.

As stated before on this list, ARQ doesn't have any federation-specific optimization, so it behaves as if the cost of accessing local and federated data were the same. Consider the SQL query below:

    SELECT * FROM A JOIN B ON (A.some_column = B.some_column)

There are at least two plans for solving it: (1) one that computes the cross product of A and B and then filters the results according to the condition; and (2) one that, for each element in A, looks up the elements in B that satisfy the condition. To select the better plan, the SQL planner takes into consideration whether the relevant columns are indexed and the size of the tables involved, both of which affect the cost of accessing rows.

For the query you posted, the better plan would be (1), simply because of the cost of accessing a remote service. But if the first SERVICE query returned just a few rows, it might be better to run the second query a couple of times, as in (2), than to fetch all of its results.

As for optimizing the query, I would try splitting each query into a UNION, one part with the OPTIONAL and the other without it.

Getting the subproperties can be expensive too, depending on which triplestore you're querying. If it's Fuseki+TDB and you have access to the server configuration, you could turn on RDFS inference.

Also, the order of the triple patterns can have a large influence on overall query performance: put the patterns that return fewer results before the others.

Good luck!

-- 
diogo patrão


On Tue, Jul 23, 2013 at 10:56 AM, Olivier Rossel <[email protected]> wrote:

> Same interrogations here.
> So I +1 this question immensely!
>
>
> On Tue, Jul 23, 2013 at 11:48 AM, Sarven Capadisli <[email protected]> wrote:
>
> > Hi all,
> >
> > This is partly a summary of my recent experiences with federated
> > queries and partly a request for your feedback on making /reasonably/
> > performing federated queries.
> >
> > The query in question is here [1].
> > Essentially there are two endpoints (which may or may not be the
> > same), and they return the same pattern. There are millions of
> > triples to get through, so throwing out false negatives (early on)
> > is quite important. We assume that graph names are not known and
> > that everything is accessible from the default graph. The endpoint
> > which dispatches the two queries needs to filter out what's
> > remaining. There are no common variables. This means that both
> > endpoints need to do their own thing and then the patterns are
> > joined.
> >
> > Needless to say, OPTIONALs that are in there are expensive, but they
> > help a great deal in making sure to use only what's necessary i.e.,
> > either a refArea doesn't have an exactMatch or if there is an
> > exactMatch, it contains the domain of the refArea that's at the
> > other endpoint. Without OPTIONALs, the outer endpoint will end up
> > with more possibilities to join. Using MINUS is more or less the
> > same.
> >
> > By default, ARQ uses an optimizer to do a whole bunch of good stuff
> > that's mostly foreign to me. What I'm aware of however is how it
> > behaves when it comes to SERVICE calls. When the first SERVICE call
> > comes back with n number of triples, the second SERVICE is called n
> > times. Undoubtedly, this doesn't scale at all.
> >
> > To work around this, I've turned off the optimizer with
> > Optimize.noOptimizer() [2] with a simple class which is called from
> > the parent endpoint's TDB assembler file. As expected, that allows
> > the parent to make only two SERVICE calls.
> >
> > This is the current state of things. I'd like to take it further to
> > get more out of this, but at this point, I need a different set of
> > eyes.
> >
> > [I will prepare a chart for this, but this rough explanation might
> > do for now] As there are different endpoints with different amounts
> > of data, what I've experienced is that some of the fastest queries
> > take around 3 seconds. That's typically queries with low number of
> > joins; ~150x150=22500 possibilities before the last filter kicks in.
> > It gets heavy quite fast, as I've seen some queries take 30 seconds
> > or more.
> >
> > The TDB optimizer stats file is up to date on all endpoints.
> >
> > I am completely open to how this query can be restructured, or would
> > simply like to hear about your own experiences with federated
> > queries.
> >
> > [1] http://csarven.ca/linked-statistical-data-analysis#federated-sparql-query
> > [2] http://jena.apache.org/documentation/javadoc/arq/com/hp/hpl/jena/sparql/algebra/optimize/Optimize.html#noOptimizer()
> >
> > -Sarven
> > http://csarven.ca/#i
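
P.S. To make the difference between plans (1) and (2) a bit more concrete, here is a rough toy simulation in Python. It is only a sketch and not how ARQ is implemented: FakeEndpoint, join_fetch_once and join_per_row are made-up names, the "endpoints" are in-memory lists, and the 150 rows per side are borrowed from the ~150x150 case mentioned in the thread. It only counts round trips; it says nothing about real latencies.

```python
from itertools import product


class FakeEndpoint:
    """Hypothetical stand-in for a remote SPARQL endpoint; counts round trips."""

    def __init__(self, rows):
        self.rows = list(rows)
        self.calls = 0

    def query(self):
        # Each call models one SERVICE request over the network.
        self.calls += 1
        return list(self.rows)


def join_fetch_once(a, b, keep):
    # Plan (1): one request per endpoint, then cross product + filter locally.
    left, right = a.query(), b.query()
    return [(x, y) for x, y in product(left, right) if keep(x, y)]


def join_per_row(a, b, keep):
    # Plan (2): one request to A, then one request to B for every row of A.
    out = []
    for x in a.query():
        for y in b.query():
            if keep(x, y):
                out.append((x, y))
    return out


def same(x, y):
    # Assumed final filter, applied by the dispatching endpoint.
    return x == y


a1, b1 = FakeEndpoint(range(150)), FakeEndpoint(range(150))
r1 = join_fetch_once(a1, b1, same)
calls_plan1 = a1.calls + b1.calls

a2, b2 = FakeEndpoint(range(150)), FakeEndpoint(range(150))
r2 = join_per_row(a2, b2, same)
calls_plan2 = a2.calls + b2.calls

print(calls_plan1, calls_plan2)  # 2 151
```

Both plans produce the same joined rows; what changes is the number of requests (2 versus n + 1). With no common variables between the two SERVICE blocks, the per-row requests in plan (2) cannot be narrowed by the bindings from the first one, so each of them would return the full result again.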
