Hi all,

This is partly a summary of my recent experiences with federated queries and partly a request for your feedback on making /reasonably/ performing federated queries.

The query in question is here [1]. Essentially there are two endpoints (which may or may not be the same), and they return the same pattern. There are millions of triples to get through, so throwing out false negatives (early on) is quite important. We assume that graph names are not known and that everything is accessible from the default graph. The endpoint which dispatches the two queries needs to filter out what's remaining. There are no common variables. This means that both endpoints need to do their own thing and then the patterns are joined.

Needless to say, the OPTIONALs in there are expensive, but they help a great deal in making sure that only what's necessary is used, i.e., either a refArea doesn't have an exactMatch, or if there is an exactMatch, it contains the domain of the refArea that's at the other endpoint. Without the OPTIONALs, the outer endpoint ends up with more possibilities to join. Using MINUS is more or less the same.

By default, ARQ uses an optimizer to do a whole bunch of good stuff that's mostly foreign to me. What I am aware of, however, is how it behaves when it comes to SERVICE calls. When the first SERVICE call comes back with n triples, the second SERVICE is called n times. Undoubtedly, this doesn't scale at all.
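
To make the difference in request counts concrete, here is a rough sketch in plain Java. It does not use Jena at all: FakeEndpoint, perRowDispatch and independentDispatch are hypothetical names made up for illustration, and the "endpoints" are just in-memory lists that count how often they are asked. It only models the number of requests, not what ARQ actually does internally.

```java
import java.util.ArrayList;
import java.util.List;

final class ServiceCallSketch {

    /** Hypothetical stand-in for a remote endpoint: returns fixed rows and counts requests. */
    static final class FakeEndpoint {
        private final List<String> rows = new ArrayList<>();
        int requests = 0;

        FakeEndpoint(String prefix, int size) {
            for (int i = 0; i < size; i++) {
                rows.add(prefix + i);
            }
        }

        List<String> query() {
            requests++;
            return rows;
        }
    }

    /** The second endpoint is queried again for every row the first one returned. */
    static void perRowDispatch(FakeEndpoint first, FakeEndpoint second) {
        for (String ignored : first.query()) {
            second.query();
        }
    }

    /** Each endpoint is queried exactly once; any joining would happen locally afterwards. */
    static void independentDispatch(FakeEndpoint first, FakeEndpoint second) {
        first.query();
        second.query();
    }

    public static void main(String[] args) {
        FakeEndpoint a = new FakeEndpoint("a", 150);
        FakeEndpoint b = new FakeEndpoint("b", 150);
        perRowDispatch(a, b);
        System.out.println("per-row dispatch: " + (a.requests + b.requests) + " requests");

        FakeEndpoint c = new FakeEndpoint("c", 150);
        FakeEndpoint d = new FakeEndpoint("d", 150);
        independentDispatch(c, d);
        System.out.println("independent dispatch: " + (c.requests + d.requests) + " requests");
    }
}
```

With 150 rows from the first endpoint, the per-row variant issues 1 + 150 requests, while the independent variant always issues 2, regardless of n.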

To work around this, I've turned off the optimizer with Optimize.noOptimizer() [2], via a simple class which is called from the parent endpoint's TDB assembler file. As expected, that allows the parent to make only two SERVICE calls.

This is the current state of things. I'd like to take it further to get more out of this, but at this point, I need a different set of eyes.

[I will prepare a chart for this, but this rough explanation might do for now] As there are different endpoints with different amounts of data, what I've experienced is that some of the quickest queries take around 3 seconds. Those are typically queries with a low number of joins; ~150x150=22500 possibilities before the last filter kicks in. It gets heavy quite fast, as I've seen some queries take 30 seconds or more.
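
As a rough illustration of why this gets heavy quickly: with no common variables, every row from one endpoint is a candidate partner for every row from the other until the final filter runs. The sketch below is plain Java with made-up data. CrossJoinSketch, rows, candidatePairs and survivors are hypothetical names, and the equality test is only a stand-in for whatever the real FILTER in [1] compares.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiPredicate;

final class CrossJoinSketch {

    /** Builds n placeholder rows ("area0", "area1", ...) standing in for one endpoint's results. */
    static List<String> rows(int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            out.add("area" + i);
        }
        return out;
    }

    /** Number of pairs the dispatching endpoint has to consider before filtering. */
    static long candidatePairs(int leftRows, int rightRows) {
        return (long) leftRows * rightRows;
    }

    /** Walks the full cross product and counts the pairs that pass the final filter. */
    static int survivors(List<String> left, List<String> right,
                         BiPredicate<String, String> keep) {
        int kept = 0;
        for (String l : left) {
            for (String r : right) {
                if (keep.test(l, r)) {
                    kept++;
                }
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> left = rows(150);
        List<String> right = rows(150);
        System.out.println("candidates: " + candidatePairs(left.size(), right.size()));
        // Assumed filter: keep a pair only when both sides name the same area.
        System.out.println("survivors:  " + survivors(left, right, String::equals));
    }
}
```

Under these assumptions, 150 rows on each side give 22500 candidate pairs, of which only 150 survive the filter. The candidate count grows with the product of the two result sizes, so anything that shrinks either side before the join pays off disproportionately.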

The TDB optimizer stats file is up to date on all endpoints.

I am completely open to suggestions on how this query can be restructured, and would also simply like to hear about your own experiences with federated queries.

[1] http://csarven.ca/linked-statistical-data-analysis#federated-sparql-query
[2] http://jena.apache.org/documentation/javadoc/arq/com/hp/hpl/jena/sparql/algebra/optimize/Optimize.html#noOptimizer()

-Sarven
http://csarven.ca/#i

