Re: Achieving reasonably performing federated queries

Andy Seaborne Thu, 25 Jul 2013 08:13:23 -0700

It may help if ARQ did a hash join in this case - getting the data from the two SERVICEs could even be done in parallel (except that in turn may be unacceptable).

The advantage of the current approach is that it does not run out of memory - it does not consume temporary RAM in proportion to the data size. But it's not a free choice and may be slower (regardless of being inappropriate for this SERVICE situation).
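
To make the trade-off concrete, here is a minimal sketch in plain Java. It is not ARQ or TDB code: the class JoinSketch, its methods and the row() helper are hypothetical names made up for illustration, and it assumes every row binds the join variable. hashJoin holds one whole input in memory and reads each side once. indexJoin holds only one row at a time but performs one lookup per left-hand row, which for a SERVICE would presumably mean one remote request each.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical illustration only; not part of Jena.
final class JoinSketch {

    // Build a row (variable -> value) from alternating key/value strings.
    static Map<String, String> row(String... kv) {
        Map<String, String> m = new HashMap<>();
        for (int i = 0; i + 1 < kv.length; i += 2) {
            m.put(kv[i], kv[i + 1]);
        }
        return m;
    }

    // Hash join: memory use grows with the size of the left input.
    static List<Map<String, String>> hashJoin(List<Map<String, String>> left,
                                              List<Map<String, String>> right,
                                              String joinVar) {
        // Build phase: index every left row by its join value.
        Map<String, List<Map<String, String>>> index = new HashMap<>();
        for (Map<String, String> l : left) {
            index.computeIfAbsent(l.get(joinVar), k -> new ArrayList<>()).add(l);
        }
        // Probe phase: stream the right rows past the index.
        List<Map<String, String>> out = new ArrayList<>();
        for (Map<String, String> r : right) {
            List<Map<String, String>> matches = index.getOrDefault(
                    r.get(joinVar), Collections.<Map<String, String>>emptyList());
            for (Map<String, String> l : matches) {
                Map<String, String> merged = new HashMap<>(l);
                merged.putAll(r);
                out.add(merged);
            }
        }
        return out;
    }

    // Index-style join: no build table, but one lookup per left row.
    static List<Map<String, String>> indexJoin(
            List<Map<String, String>> left,
            Function<String, List<Map<String, String>>> lookup,
            String joinVar) {
        List<Map<String, String>> out = new ArrayList<>();
        for (Map<String, String> l : left) {
            for (Map<String, String> r : lookup.apply(l.get(joinVar))) {
                Map<String, String> merged = new HashMap<>(l);
                merged.putAll(r);
                out.add(merged);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map<String, String>> left = new ArrayList<>();
        left.add(row("s", "a", "x", "1"));
        left.add(row("s", "b", "x", "2"));
        List<Map<String, String>> right = new ArrayList<>();
        right.add(row("s", "a", "y", "9"));
        System.out.println(hashJoin(left, right, "s").size());
    }
}
```

Both methods return the same rows for the same inputs; what differs is where the cost lands (RAM for the build table versus number of lookups).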


There isn't code in TDB to do that, currently.

        Andy

On 25/07/13 14:33, Rob Vesse wrote:

Yes you should be able to add the following:

--set arq:optIndexJoinStrategy=false

I'm not 100% sure that the short form will work; you may need to use the
fully expanded form:

--set http://jena.hpl.hp.com/ARQ#optIndexJoinStrategy=false


It should work - it's a bug if it doesn't.


However as noted in my email this is new in 2.10.2-SNAPSHOT builds so
unless you are using the latest SNAPSHOTs this would have no effect.  In
all previous releases this particular optimization was always on.

Rob


On 7/25/13 1:56 PM, "Diogo FC Patrao" <[email protected]> wrote:

Hello

The better plan for the query you posted would be (1), simply because of
the cost of accessing a remote service. But if the first SERVICE query
returned just a few rows, it might be better to run the same query a
couple of times, as in (2), than to fetch all the results.


I agree. I started out with (2) because ARQ did that by default. However,
it soon became clear that wasn't going to work out, so I explored a way to
do (1). I'm now doing (1) but trying to get more out of it. I have to take
a closer look at Rob Vesse's suggestion:
ARQ.getContext().set(ARQ.optIndexJoinStrategy, false);



Yes; it is a great feature that we can turn on and off certain
optimizations!

Rob, can we turn that on and off from the ARQ command line?

As for optimizing the query, I would try separating each query into a
UNION, one part with the OPTIONAL, the other without it. Getting the
subproperties, depending on which triplestore you're querying, can be
expensive too. If it's Fuseki+TDB and you have access to the server
configuration, you could turn on RDFS inference. Also, the order of the
triples can greatly influence overall query performance - put the
triples that return fewer results before the others.

Good luck!


I'm not sure I see how UNION can be used as per your suggestion such that
the results contain values for each field. Only one of the variables in
OPTIONAL is used towards the final output. Duplicating the earlier pattern
plus what was in OPTIONAL is probably not ideal. Did I misunderstand you?


Yes, but that was an idea based solely on my experience with RDB. Writing

SELECT * FROM A WHERE type_id in (1,2)

can be slower than

SELECT * FROM A WHERE type_id = 1
UNION ALL
SELECT * FROM A WHERE type_id = 2

, believe it or not. I've never really worked with OPTIONALs, so I'm
guessing here. But I think it's worth a shot.

I'll test it with only RDFS inference.


The SPARQL will look better too.

cheers!

dfcp

Based on my tests, the order of the statements is as good as it gets.

Thanks for the suggestions.

-Sarven
