Hello Oliver,

The cost of "variable P" heavily depends on the sample query.

1. In good case the overhead is zero. It's possible when graphs (G) are
known or almost all data of the dataset are in very numerous small
graphs, so it's cheaper to read whole graphs than to try to use
additional indices that needs known P first.

2. If the value of P can be found before search, as if
:display :startRelation ?startRel .
is matched before
?gene ?startRel ?start .
then it may cost zero if statistics of the desired predicate (or
predicates) are close to average statistics across all predicates of the
database. In that case the optimizer will build an execution plan using
"average" statistics but the plan will match one built using statistics
specific for a specific predicate. If the desired predicate is very
frequent or very infrequent then the optimizer can miss, resulting in
poor plan; the cost of the planning error can be big for single-process
installations and blocking for clusters. Note that rdf:type is the most
common example of predicate with far-from-average statistics.

3. In some cases, known G can eliminate the extra cost of unknown P. As
a common rule, if you know G or if you can easily calculate G at the
beginning of the query then the pattern
<calculate ?g here>
GRAPH ?g { <interesting part of query> }
is worth trying. Some customer reported 500x performance on single
server.

4. If the query is executed by an ODBC client or the like and P is a
parameter passed from outside then it may be practical to inline P into
the text of the query rather than using ?:p parameter passing notation.
At least, it's practical to have a separate variant of the query for
parameter set to rdf:type

Best Regards,

Ivan Mikhailov
OpenLink Software
http://virtuoso.openlinksw.com

On Fri, 2016-07-22 at 17:09 +0200, Olivier Dameron wrote:
> Dear Virtuosans,
>    I noticed that queries can be much slower when using variables as
> properties (even if the variable can only have one value) e.g.:
> 
> DATA (extract):
> :display :startRelation :posStart.
> :gene0 rdf:type :Gene.
> :gene0 :posStart "53416"^^xsd:numeric.
> :gene1 rdf:type :Gene.
> :gene1 :posStart "29513"^^xsd:numeric.
> ...
> 
> SIMPLE QUERY:
> ?gene :posStart ?start.
> 
> SLOWER QUERY:
> :display :startRelation ?startRel .
> ?gene ?startRel ?start.
> 
>    I assume that the query engine first tries to match the "?gene
> ?startRel ?start" constraint, whereas begining by ":display
> :startRelation ?startRel." would define the value of ?startRel which
> would be used to find the start positions of the genes.
>    I can live with the simple query, but the second one would make the
> development of our application easier. Is there anything we could do to
> improve the performance of the second query?
> 
> Thank you!
> kind regards
> Olivier Dameron
> 
> NB: for the record, below is the script I used for generating the
> dataset, as well as the two queries
> 
> ===== generateDataSet.py
> #! /usr/bin/env python
> 
> import random
> 
> nbGenes = 50000
> geneLengthMax = 100
> chromosomeLength = 100000
> 
> with open("debugDataset.ttl", "w") as dataFile:
>       dataFile.write("""
> @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
> @prefix : <http://www.univ-rennes1.fr/odameron/debugVirtuoso/>.
> 
> :display :startRelation :posStart.
> :display :stopRelation :posStop.
> 
> """)
>       for i in range(nbGenes):
>               geneIdent = ":gene" + str(i)
>               dataFile.write("\n" + geneIdent + " rdf:type :Gene.\n")
>               posStart = random.randint(1, chromosomeLength-geneLengthMax)
>               posStop = posStart + random.randint(1, geneLengthMax)
>               dataFile.write(geneIdent + " :posStart \"" + str(posStart) +
> "\"^^xsd:numeric.\n")
>               dataFile.write(geneIdent + " :posStop \"" + str(posStop) +
> "\"^^xsd:numeric.\n")
> 
> 
> ===== getOverlap.sparql
> PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
> PREFIX : <http://www.univ-rennes1.fr/odameron/debugVirtuoso/>
> 
> SELECT (count(*) as ?nbOverlap)
> 
> WHERE {
>   ?gene1 a :Gene;
>     :posStart ?start1;
>     :posStop ?stop1.
> 
>   ?gene2 a :Gene;
>     :posStart ?start2;
>     :posStop ?stop2.
>   FILTER (?start1 < ?start2 && ?start1 < ?stop2 && ?start2 < ?stop1)
> }
> 
> 
> ===== getOverlapSLOW.sparql
> PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
> PREFIX : <http://www.univ-rennes1.fr/odameron/debugVirtuoso/>
> 
> SELECT (count(*) as ?nbOverlap)
> 
> WHERE {
>   :display :startRelation ?startRel.
>   :display :stopRelation ?stopRel.
> 
>   ?gene1 a :Gene;
>     ?startRel ?start1;
>     ?stopRel ?stop1.
> 
>   ?gene2 a :Gene;
>     ?startRel ?start2;
>     ?stopRel ?stop2.
>   FILTER (?start1 < ?start2 && ?start1 < ?stop2 && ?start2 < ?stop1)
> }
> 
> 
> 
> ------------------------------------------------------------------------------
> What NetFlow Analyzer can do for you? Monitors network bandwidth and traffic
> patterns at an interface-level. Reveals which users, apps, and protocols are 
> consuming the most bandwidth. Provides multi-vendor support for NetFlow, 
> J-Flow, sFlow and other flows. Make informed decisions using capacity planning
> reports.http://sdm.link/zohodev2dev
> _______________________________________________
> Virtuoso-users mailing list
> Virtuoso-users@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/virtuoso-users



------------------------------------------------------------------------------
What NetFlow Analyzer can do for you? Monitors network bandwidth and traffic
patterns at an interface-level. Reveals which users, apps, and protocols are 
consuming the most bandwidth. Provides multi-vendor support for NetFlow, 
J-Flow, sFlow and other flows. Make informed decisions using capacity planning
reports.http://sdm.link/zohodev2dev
_______________________________________________
Virtuoso-users mailing list
Virtuoso-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/virtuoso-users

Reply via email to