It may help if ARQ did a hash join in this case - getting the data from
the two SERVICEs could even be done in parallel (except that in turn may
be unacceptable).
The advantage of the current approach is that it does not run out of
memory - it does not consume temporary RAM in proportion to the data
size. But it's not a free choice and may be slower (regardless of being
inappropriate for this SERVICE situation).
There isn't code in TDB to do that, currently.
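
To make the memory trade-off concrete, here is a minimal, generic sketch of a hash join over two sets of solution rows. It is not ARQ or TDB code: the class name `HashJoinSketch`, the method `hashJoin`, and the use of `Map<String, String>` for a solution row are all hypothetical, and it assumes the two sides share exactly one variable.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: names and row representation are assumptions, not Jena APIs.
public class HashJoinSketch {

    /** Joins two lists of solution rows on a single shared variable. */
    static List<Map<String, String>> hashJoin(List<Map<String, String>> left,
                                              List<Map<String, String>> right,
                                              String joinVar) {
        // Build phase: index every left row by its join value.
        // This table is the memory cost; it grows with the size of 'left'.
        Map<String, List<Map<String, String>>> table = new HashMap<>();
        for (Map<String, String> row : left) {
            String key = row.get(joinVar);
            if (key == null) continue; // sketch only: rows without the join variable are skipped
            table.computeIfAbsent(key, k -> new ArrayList<>()).add(row);
        }

        // Probe phase: pass over the right side once and look up matches.
        List<Map<String, String>> out = new ArrayList<>();
        for (Map<String, String> row : right) {
            String key = row.get(joinVar);
            if (key == null) continue;
            for (Map<String, String> match : table.getOrDefault(key, List.of())) {
                Map<String, String> merged = new HashMap<>(match);
                merged.putAll(row); // assumes joinVar is the only shared variable
                out.add(merged);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map<String, String>> left = List.of(
            Map.of("s", "a", "x", "1"),
            Map.of("s", "b", "x", "2"));
        List<Map<String, String>> right = List.of(
            Map.of("s", "a", "y", "10"),
            Map.of("s", "c", "y", "30"));
        System.out.println(hashJoin(left, right, "s"));
    }
}
```

Each side is read exactly once, which for two SERVICEs would mean one remote request each, but the whole build side has to sit in memory until the probe finishes.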
Andy
On 25/07/13 14:33, Rob Vesse wrote:
Yes you should be able to add the following:
--set arq:optIndexJoinStrategy=false
I'm not 100% sure that the short form will work; you may need to use the
fully expanded form:
--set http://jena.hpl.hp.com/ARQ#optIndexJoinStrategy=false
It should work - it's a bug if it doesn't.
However as noted in my email this is new in 2.10.2-SNAPSHOT builds so
unless you are using the latest SNAPSHOTs this would have no effect. In
all previous releases this particular optimization was always on.
Rob
On 7/25/13 1:56 PM, "Diogo FC Patrao" wrote:
Hello
The better plan for the query you posted would be (1), simply because of
the cost of accessing a remote service. But if the first SERVICEd query
would return just a few rows, maybe it would be better to run the same
query a couple of times, as in (2), than to get all results.
I agree. I started out with (2) because ARQ did that by default. However,
it soon became clear that wasn't going to work out, so I explored a way to
do (1). Now I'm doing (1), but I'm trying to get more out of it. I have to
take a closer look at Rob Vesse's suggestion:
ARQ.getContext().set(ARQ.optIndexJoinStrategy, false);
Yes; it is a great feature that we can turn certain optimizations on and
off!
Rob, can we turn that on and off from the ARQ command line?
As for optimizing the query, I would try splitting each query into a
UNION, one part with the OPTIONAL, the other without it. Getting the
subproperties, depending on which triplestore you're querying, can be
expensive too. If it's Fuseki+TDB and you have access to the server
configuration, you could turn on RDFS inference. Also, the order of the
triple patterns can greatly influence overall query performance - put the
patterns that return fewer results before the others.
Good luck!
I'm not sure I see how UNION can be used as per your suggestion such that
the results contain values for each field. Only one of the variables in
OPTIONAL is used towards the final output. Duplicating the earlier pattern
plus what was in OPTIONAL is probably not ideal. Did I misunderstand you?
Yes, but that was an idea based solely on my experience with RDB. Writing
SELECT * FROM A WHERE type_id in (1,2)
can be slower than
SELECT * FROM A WHERE type_id = 1
UNION ALL
SELECT * FROM A WHERE type_id = 2
, believe it or not. I have never really worked with OPTIONALs, so I'm
pulling this out of thin air. But I think it's worth a shot.
I'll test it with only RDFS inference.
The SPARQL will look better too.
cheers!
dfcp
Based on my tests, the order of the statements is as good as it gets.
Thanks for the suggestions.
-Sarven