On 01/11/16 12:38, Osma Suominen wrote:
Hi,
Some further observations. The query I sent earlier was a minimal
example, and it was possible to fix it by just moving the VALUES block.
But a slightly more realistic (closer to the original query I'm having
problems with) example involves a UNION and cannot be fixed so easily -
placing the VALUES block first doesn't help:
--cut--
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT *
WHERE {
  VALUES ?uri { <http://www.yso.fi/onto/yso/p864> }
  { ?s ?p ?uri }
  UNION
  { ?uri ?p ?o
    OPTIONAL {
      ?x skos:member ?o .
      FILTER NOT EXISTS {
        ?x skos:member ?other .
        FILTER NOT EXISTS {
          ?other skos:broader ?uri
                              ^^^^ [*]
        }
      }
    }
  }
}
--cut--
Jena 3.1.0 tdbquery: 0.9 seconds
Jena 3.1.1-SNAPSHOT tdbquery: 12.8 seconds
I'm aware that in SPARQL, evaluation proceeds from the inside out and
Jena ARQ has moved more and more in this direction with recent releases,
which may also explain this change.
Evaluation has always been inside-out, then optimized to use
stream-based index joins.
However, in this case "inside out" is confusing because the query has a
double negation: a FILTER NOT EXISTS nested inside another
FILTER NOT EXISTS.
As of 3.1.1 (JENA-1171), EXISTS patterns are analysed, whereas
previously they were skipped, which could lead to wrong answers.
Osma - could you please try putting the VALUES in each arm of the UNION,
which gets you to something like the first example?
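Something like the following, as an untested sketch: it is the query
above with the VALUES block repeated inside each arm of the UNION, so
that ?uri is bound within the same group as the OPTIONAL. Nothing else
is changed, and whether it brings back the 3.1.0 timing is exactly what
needs checking.

```sparql
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT *
WHERE {
  {
    # First arm: bind ?uri locally, then match incoming triples.
    VALUES ?uri { <http://www.yso.fi/onto/yso/p864> }
    ?s ?p ?uri
  }
  UNION
  {
    # Second arm: the same binding, placed before the OPTIONAL.
    VALUES ?uri { <http://www.yso.fi/onto/yso/p864> }
    ?uri ?p ?o
    OPTIONAL {
      ?x skos:member ?o .
      FILTER NOT EXISTS {
        ?x skos:member ?other .
        FILTER NOT EXISTS {
          ?other skos:broader ?uri
        }
      }
    }
  }
}
```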
The issue is [*]: the variable ?uri is used again inside an OPTIONAL.
It is possible that ?uri will range over more than the VALUES setting
and affect the OPTIONAL, yet the inner EXISTS usage does not set ?uri,
and it is not propagated to be joined with the set value.
As Wikipedia says of correlated subqueries in SQL:
"Because the subquery is evaluated once for each row processed by the
outer query, it can be inefficient."
But how should VALUES blocks be
placed for optimal query execution? It seems like a waste not to
propagate those fixed bindings into inner parts of the query, even
though that may violate the inside-out order.
It can be pushed in because:
join(A, union(B,C)) == union(join(A,B), join(A,C))
Now, if A is a complex expression, that is (probably) a bad idea.
If A is a small VALUES block, then it makes sense. It isn't done, though.
In the above query, I don't know where to place the VALUES so that the
binding for ?uri (in effect, changing the variable to a constant) would
be applied in all parts of the query.
See above.
Placing the VALUES block at the bottom of the query (outside the WHERE
block) doesn't help either. In fact execution time increases to 17
seconds with 3.1.1-SNAPSHOT (but is unchanged with 3.1.0).
I tried --engine=ref and it was extremely slow with 3.1.0 as well, so in
that sense nothing has changed; only an optimization has been dropped
somewhere.
Should I report this as an issue? Or am I just doing something wrong?
-Osma
On 01/11/16 11:03, Osma Suominen wrote:
Hi,
I'm investigating a performance regression we're seeing with the current
Jena 3.1.1-SNAPSHOT compared to 3.1.0.
The data in graph <http://www.yso.fi/onto/yso/> is the YSO ontology,
available from http://api.finto.fi/download/yso/yso-skos.ttl
This query used to take about 0.2 seconds (with 3.1.0) and now takes
about 10 seconds:
--cut--
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT *
FROM NAMED <http://www.yso.fi/onto/yso/>
WHERE {
  ?uri ?p ?o .
  OPTIONAL {
    ?x skos:member ?o .
    FILTER NOT EXISTS {
      ?x skos:member ?other .
      FILTER NOT EXISTS {
        ?other skos:broader ?uri
      }
    }
  }
  VALUES ?uri { <http://www.yso.fi/onto/yso/p864> }
}
--cut--
If I move the VALUES block to the top of the query, right after WHERE,
then the query becomes fast again.
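For reference, the fast variant looks like this. It is the same query
with only the VALUES block moved to the top; no other line is changed.

```sparql
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT *
FROM NAMED <http://www.yso.fi/onto/yso/>
WHERE {
  # VALUES moved here from the end of the WHERE block.
  VALUES ?uri { <http://www.yso.fi/onto/yso/p864> }
  ?uri ?p ?o .
  OPTIONAL {
    ?x skos:member ?o .
    FILTER NOT EXISTS {
      ?x skos:member ?other .
      FILTER NOT EXISTS {
        ?other skos:broader ?uri
      }
    }
  }
}
```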
Is the placement of the VALUES block supposed to affect query evaluation
order in this way? It appears to me that in the slow version, ?uri is
not bound inside the inner FILTER NOT EXISTS, which causes an explosion
of results internally.
-Osma