I have the same questions here, so I +1 this immensely!

On Tue, Jul 23, 2013 at 11:48 AM, Sarven Capadisli <[email protected]> wrote:
> Hi all,
>
> This is partly a summary of my recent experiences with federated queries
> and partly a request for your feedback on making /reasonably/ performing
> federated queries.
>
> The query in question is here [1]. Essentially there are two endpoints
> (which may or may not be the same), and they return the same pattern. There
> are millions of triples to get through, so throwing out false negatives
> (early on) is quite important. We assume that graph names are not known and
> that everything is accessible from the default graph. The endpoint which
> dispatches the two queries needs to filter out what's remaining. There are
> no common variables. This means that both endpoints need to do their own
> thing and then the patterns are joined.
>
> Needless to say, the OPTIONALs that are in there are expensive, but they
> help a great deal in making sure to use only what's necessary, i.e., either
> a refArea doesn't have an exactMatch or, if there is an exactMatch, it
> contains the domain of the refArea that's at the other endpoint. Without
> OPTIONALs, the outer endpoint will end up with more possibilities to join.
> Using MINUS is more or less the same.
>
> By default, ARQ uses an optimizer to do a whole bunch of good stuff that's
> mostly foreign to me. What I'm aware of, however, is how it behaves when it
> comes to SERVICE calls. When the first SERVICE call comes back with n
> triples, the second SERVICE is called n times. Undoubtedly, this doesn't
> scale at all.
>
> To work around this, I've turned off the optimizer with
> Optimize.noOptimizer() [2] with a simple class which is called from the
> parent endpoint's TDB assembler file. As expected, that allows the parent
> to make only two SERVICE calls.
>
> This is the current state of things. I'd like to take it further to get
> more out of this, but at this point, I need a different set of eyes.
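
To make the SERVICE call counts described above concrete, here is a rough sketch in plain Python. It is not ARQ code and does not model the optimizer itself; make_endpoint, per_row_strategy and two_call_strategy are hypothetical names made up for illustration, and the "endpoints" are just in-memory lists that count how often they are asked for results.

```python
from typing import Callable, Dict, List, Tuple

Service = Callable[[], List[str]]


def make_endpoint(rows: List[str]) -> Tuple[Service, Dict[str, int]]:
    """Hypothetical stand-in for a remote SERVICE; counts how often it is called."""
    stats = {"calls": 0}

    def service() -> List[str]:
        stats["calls"] += 1
        return list(rows)

    return service, stats


def per_row_strategy(first: Service, second: Service) -> List[Tuple[str, str]]:
    """Call the second SERVICE once for every row returned by the first."""
    pairs = []
    for a in first():
        for b in second():  # one remote call per row of the first result
            pairs.append((a, b))
    return pairs


def two_call_strategy(first: Service, second: Service) -> List[Tuple[str, str]]:
    """Call each SERVICE exactly once, then join the two results locally."""
    left = first()
    right = second()
    return [(a, b) for a in left for b in right]


# Per-row evaluation: the second endpoint is hit once per row of the first.
first, first_stats = make_endpoint([f"a{i}" for i in range(150)])
second, second_stats = make_endpoint([f"b{i}" for i in range(150)])
per_row_strategy(first, second)
print(first_stats["calls"], second_stats["calls"])  # 1 150

# Two-call evaluation: each endpoint is hit exactly once.
first, first_stats = make_endpoint([f"a{i}" for i in range(150)])
second, second_stats = make_endpoint([f"b{i}" for i in range(150)])
two_call_strategy(first, second)
print(first_stats["calls"], second_stats["calls"])  # 1 1
```

Both strategies produce the same pairs; only the number of remote round trips differs, which is where the time goes when each call has to wade through millions of triples.
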
>
> [I will prepare a chart for this, but this rough explanation might do for
> now] As there are different endpoints with different amounts of data, what
> I've experienced is that some of the fastest queries take around 3
> seconds. Those are typically queries with a low number of joins;
> ~150x150=22500 possibilities before the last filter kicks in. It gets heavy
> quite fast, as I've seen some queries take 30 seconds or more.
>
> The TDB optimizer stats file is up to date on all endpoints.
>
> I am completely open to how this query can be restructured, or would simply
> like to hear about your own experiences with federated queries.
>
> [1] http://csarven.ca/linked-statistical-data-analysis#federated-sparql-query
> [2] http://jena.apache.org/documentation/javadoc/arq/com/hp/hpl/jena/sparql/algebra/optimize/Optimize.html#noOptimizer()
>
> -Sarven
> http://csarven.ca/#i
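
For the ~150x150 figure, a small Python sketch of why the join gets heavy when the two patterns share no variables: every row from one side has to be paired with every row from the other before the final filter can discard anything. The IRIs and the same_code predicate below are made up for illustration and are only a stand-in for the real FILTER in [1].

```python
from itertools import product
from typing import Callable, List, Tuple


def join_then_filter(
    left: List[str],
    right: List[str],
    keep: Callable[[str, str], bool],
) -> Tuple[int, List[Tuple[str, str]]]:
    """Cross-join two result sets with no shared variable, then filter the pairs."""
    candidates = list(product(left, right))  # len(left) * len(right) pairs
    survivors = [pair for pair in candidates if keep(*pair)]
    return len(candidates), survivors


def same_code(a: str, b: str) -> bool:
    """Hypothetical filter: keep pairs whose IRIs end in the same area code."""
    return a.rsplit("/", 1)[-1] == b.rsplit("/", 1)[-1]


# Assumed example data: 150 refArea IRIs per endpoint.
left = [f"http://a.example/area/{i}" for i in range(150)]
right = [f"http://b.example/area/{i}" for i in range(150)]

n_candidates, survivors = join_then_filter(left, right, same_code)
print(n_candidates, len(survivors))  # 22500 150
```

The candidate count grows with the product of the two result sizes, so anything that trims either side before the join (as the OPTIONALs are meant to do) pays off quadratically.
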
