In Jena 2.x (RDF 1.0), rewriting ?var = "string" wasn't a valid optimization because the filter is not the same as writing the string into the graph pattern. In RDF 1.1 it is the same, so the optimization became practical. At the same time, the whole of filter placement was rewritten to make it more comprehensive.

The disjunction used by string equality was not handled for later placement:

https://issues.apache.org/jira/browse/JENA-1235
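
As a rough illustration of the rewrite described above, here is a toy Python sketch of how an equality disjunction such as (?var3 = "str1" || ?var3 = "str2") can be expanded into one branch per constant, with the constant substituted into the triple patterns. This mirrors the shape of the (disjunction (assign ...)) algebra that Jena 3.1.0 prints in the original report further down the thread. It is not Jena's code: the function names and the tuple representation of triples are assumptions made for the example.

```python
from typing import Dict, List, Tuple

# A triple pattern as plain strings; terms starting with "?" are variables.
Triple = Tuple[str, str, str]


def substitute(patterns: List[Triple], var: str, value: str) -> List[Triple]:
    """Replace every occurrence of `var` in the patterns with `value`."""
    return [
        tuple(value if term == var else term for term in triple)
        for triple in patterns
    ]


def expand_equality_disjunction(
    patterns: List[Triple], var: str, values: List[str]
) -> List[Tuple[Dict[str, str], List[Triple]]]:
    """Model FILTER(?var = v1 || ?var = v2 ...) as one branch per constant.

    Each branch pairs the assignment that restores the variable binding
    with the patterns rewritten to use the constant directly.
    """
    return [({var: value}, substitute(patterns, var, value)) for value in values]


bgp = [("?var2", ":p1", "?var4"), ("?var2", ":p2", "?var3")]
for assignment, rewritten in expand_equality_disjunction(
    bgp, "?var3", ['"str1"', '"str2"']
):
    print(assignment, rewritten)
```

The substitution is only sound when the filter and the pattern match exactly the same terms, which is the RDF 1.1 situation described above.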

I have noticed other cases where order of triples and bgps makes
quite a difference in execution time, but I can't figure out any
science to it. Are there any guidelines for ordering the components
of a complex query (including UNION and OPTIONAL clauses) to optimize
performance? Can you tell anything by a static analysis of the sparql
algebra?

BGP optimization and UNION/OPTIONAL are different optimization steps. For TDB, BGP reordering uses the stats file if present; otherwise it uses the fixed strategy of putting the most grounded triple pattern first. If reorderings rank equally, the original order is retained.
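
To make the "most grounded triple pattern first" rule concrete, here is a minimal Python sketch. It is a simplified model, not TDB's implementation: the helper names are hypothetical, triples are plain string tuples, and groundedness is just the count of non-variable terms. A real optimizer may also weigh which position is grounded and which variables earlier patterns have already bound. The stable sort keeps the original order for equally ranked patterns.

```python
from typing import List, Tuple

# A triple pattern as plain strings; terms starting with "?" are variables.
Triple = Tuple[str, str, str]


def grounded_count(triple: Triple) -> int:
    """Count the terms of a triple pattern that are not variables."""
    return sum(1 for term in triple if not term.startswith("?"))


def reorder_bgp(patterns: List[Triple]) -> List[Triple]:
    """Put the most grounded patterns first.

    sorted() is stable, so patterns with the same count keep their
    original relative order.
    """
    return sorted(patterns, key=lambda triple: -grounded_count(triple))


bgp = [
    ("?var1", ":p2", "?var3"),
    ("?var1", ":p4", '"str3"'),
    ("?var1", ":p3", "?var2"),
    ("?var1", ":p5", '"str4"'),
]
for triple in reorder_bgp(bgp):
    print(triple)
```

In this example the two patterns with a constant object move ahead of the two with a variable object, and each pair stays in the order it was written.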

For UNION/OPTIONAL, there are quite a number of optimizations, equality and filter placement being two of the major ones.

There is a tension between these two high level optimizations and BGP reordering (which is done later). They both may be beneficial yet telling which is better is quite hard and very data dependent.

        Andy


On 18/09/16 22:07, Paul Tyson wrote:
I looked at some more queries that worked in jena 2.x but seem to hang
in 3.x. They all follow the same pattern of a complex FILTER query on
string values. Rewriting the filter conditions into subgraph patterns
solved the problem.

Is this a defect induced by algebra optimizations in 3.x? Or, is it more
proper to apply string filters in the manner you suggested, by enclosing
them in subgraph patterns close to the triples they filter?

There was one case that was a little more complex. The original query
was like:

CONSTRUCT {
?var1 :p1 false .
}
WHERE {
FILTER ((?var2 != "str1" && !strstarts(?var3,"str2")))
?var1 :p2 ?var3 ;
 :p3 ?var2 ;
 :p4 "str3" ;
 :p5 "str4" ;
 :p6 "str5" .
FILTER NOT EXISTS {
FILTER (((?var4 = "str6" || ?var4 = "str7" || ?var4 = "str8" || ?var4 =
"str9" || ?var4 = "str10" || ?var4 = "str11" || ?var4 = "str12" || ?var4
= "str13")))
?var5 :p7 ?var4 ;
 :p8 ?var3 .
}
}

I initially rewrote the FILTER NOT EXISTS clause to read:

FILTER NOT EXISTS {
{FILTER (((?var4 = "str6" || ?var4 = "str7" || ?var4 = "str8" || ?var4 =
"str9" || ?var4 = "str10" || ?var4 = "str11" || ?var4 = "str12" || ?var4
= "str13")))
?var5 :p7 ?var4 .}
?var5 :p8 ?var3 .
}

which still seemed to hang. Reordering the FILTER NOT EXISTS bgp to the
following solved the problem.

FILTER NOT EXISTS {
?var5 :p8 ?var3 .
{FILTER (((?var4 = "str6" || ?var4 = "str7" || ?var4 = "str8" || ?var4 =
"str9" || ?var4 = "str10" || ?var4 = "str11" || ?var4 = "str12" || ?var4
= "str13")))
?var5 :p7 ?var4 .}
}

I have noticed other cases where order of triples and bgps makes quite a
difference in execution time, but I can't figure out any science to it.
Are there any guidelines for ordering the components of a complex query
(including UNION and OPTIONAL clauses) to optimize performance? Can you
tell anything by a static analysis of the sparql algebra?

Regards,
--Paul



On Fri, 2016-09-16 at 08:37 -0500, Paul Tyson wrote:
Andy,

With that rewrite, the 3.x tdbquery works as expected.

I will investigate further this weekend and send other queries that don't work 
in 3.x.

Regards,
--Paul

On Sep 16, 2016, at 04:26, Andy Seaborne <a...@apache.org> wrote:

Paul,  If you could try the query below which mimics the effect of placing the 
?var4 filter part, it will help determine if this is a filter placement issue 
or not.

The difference is that the first basic graph pattern is inside a {} with the 
relevant part of the filter expression.

   Andy


PREFIX  :     <http://example/>

SELECT  *
WHERE
 { FILTER ( ( ?var3 = "str1" ) || ( ?var3 = "str2" ) )
   { ?var2  :p1  ?var4 ;
            :p2  ?var3
     FILTER ( ! ( ( ( ?var4 = "" ) ||
              ( ?var4 = "str3" ) ) ||
              regex(?var4, "pat1") ) )
   }
   {   { ?var1  :p3  ?var4 }
     UNION
       { ?var1  :p4  ?var4 }
   }
 }


   Andy


On 14/09/16 13:15, Paul Tyson wrote:
On Wed, 2016-09-14 at 10:57 +0100, Andy Seaborne wrote:
Hi Paul,

It's difficult to tell what's going on from your report. Plain strings
are not quite identical in RDF 1.0 and RDF 1.1 so I hope you have
reloaded the data for running Jena 3.x.

I admit I have not studied the subtleties around string literals with
and without datatype tags. None of my data loadfiles have tagged string
literals, nor do my queries. Are you saying they should?


On less data, does either case produce the wrong answers?

I'll produce a smaller dataset to test.

The regex is not being pushed inwards in the same way which may be an
issue - it "all depends" on the data.

A smaller query exhibiting a timing difference would be very helpful.
Are all parts of the FILTER necessary for the effect?
Yes, they eliminate spurious matches.


   Andy

Unrelated:

{
?var1 :p3 ?var4 .
} UNION {
?var1 :p4 ?var4 .
}

can be written

?var1 (:p3|:p4) ?var4
Yes, but I generate these queries from RIF source, and UNION is easier
for the general RIF statement "Or(x,y)". The surface syntax doesn't make
any difference in the algebra, does it?

Regards,
--Paul

On 14/09/16 02:01, Paul Tyson wrote:
I have some queries that worked fine in jena-2.13.0 but not in
jena-3.1.0, using the same data.

For a long time I've been running a couple dozen queries regularly over
a large (900M triples) TDB, using jena-2.13.0. When I recently upgraded
to jena-3.1.0, I found that 5 of these queries would not return (ran
forever). qparse revealed that the sparql algebra is quite different in
2.13.0 and 3.1.0 (or apparently any 3.n.n version).

Here is a sample query that worked in 2.13.0 but not in 3.1.0, along
with the algebra given by qparse --explain for 2.13.0 and 3.1.0:

prefix : <http://example.org>
CONSTRUCT {
?var1 <http://www.w3.org/2004/02/skos/core#exactMatch> ?var2 .
}
WHERE {
FILTER (((?var3 = "str1" || ?var3 = "str2") && !(?var4 = "" || ?var4 =
"str3" || regex(?var4,"pat1"))))
?var2 :p1 ?var4 ; :p2 ?var3 .
{{
?var1 :p3 ?var4 .
} UNION {
?var1 :p4 ?var4 .
}}
}

Jena-2.13.0 produces algebra:
(prefix ((: <http://example.org>))
 (sequence
   (filter (|| (= ?var3 "str1") (= ?var3 "str2"))
     (sequence
       (filter (! (|| (|| (= ?var4 "") (= ?var4 "str3")) (regex ?var4
"pat1")))
         (bgp (triple ?var2 :p1 ?var4)))
       (bgp (triple ?var2 :p2 ?var3))))
   (union
     (bgp (triple ?var1 :p3 ?var4))
     (bgp (triple ?var1 :p4 ?var4)))))

Jena-3.1.0 produces algebra:
(prefix ((: <http://example.org>))
 (filter (! (|| (|| (= ?var4 "") (= ?var4 "str3")) (regex ?var4
"pat1")))
   (disjunction
     (assign ((?var3 "str1"))
       (sequence
         (bgp
           (triple ?var2 :p1 ?var4)
           (triple ?var2 :p2 "str1")
         )
         (union
           (bgp (triple ?var1 :p3 ?var4))
           (bgp (triple ?var1 :p4 ?var4)))))
     (assign ((?var3 "str2"))
       (sequence
         (bgp
           (triple ?var2 :p1 ?var4)
           (triple ?var2 :p2 "str2")
         )
         (union
           (bgp (triple ?var1 :p3 ?var4))
           (bgp (triple ?var1 :p4 ?var4))))))))

Thanks for any insight or assistance into this problem.

Regards,
--Paul




