Hello Olivier,

The cost of a "variable P" (a variable in the predicate position) depends heavily on the particular query.

1. In the best case the overhead is zero. This happens when the graphs (G) are known, or when almost all of the data is spread across a very large number of small graphs, so that it is cheaper to read whole graphs than to use additional indices that need a known P first.

2. If the value of P can be found before the search, as when

       :display :startRelation ?startRel .

   is matched before

       ?gene ?startRel ?start .

   then the cost may be zero, provided the statistics of the desired predicate (or predicates) are close to the average statistics across all predicates in the database. In that case the optimizer builds an execution plan using "average" statistics, but the plan matches the one it would have built from statistics specific to that predicate. If the desired predicate is very frequent or very infrequent, the optimizer can miss, resulting in a poor plan; the cost of the planning error can be big for single-process installations and blocking for clusters. Note that rdf:type is the most common example of a predicate with far-from-average statistics.

3. In some cases, a known G can eliminate the extra cost of an unknown P. As a general rule, if you know G, or can easily calculate G at the beginning of the query, then the pattern

       <calculate ?g here> GRAPH ?g { <interesting part of query> }

   is worth trying. A customer reported a 500x performance improvement on a single server.

4. If the query is executed by an ODBC client or the like and P is a parameter passed from outside, then it may be practical to inline P into the text of the query rather than use the ?:p parameter-passing notation. At the very least, it is practical to keep a separate variant of the query for the case where the parameter is set to rdf:type.

Best Regards,

Ivan Mikhailov
OpenLink Software
http://virtuoso.openlinksw.com

On Fri, 2016-07-22 at 17:09 +0200, Olivier Dameron wrote:
> Dear Virtuosans,
> I noticed that queries can be much slower when using variables as
> properties (even if the variable can only have one value) e.g.:
>
> DATA (extract):
> :display :startRelation :posStart.
> :gene0 rdf:type :Gene.
> :gene0 :posStart "53416"^^xsd:numeric.
> :gene1 rdf:type :Gene.
> :gene1 :posStart "29513"^^xsd:numeric.
> ...
>
> SIMPLE QUERY:
> ?gene :posStart ?start.
>
> SLOWER QUERY:
> :display :startRelation ?startRel .
> ?gene ?startRel ?start.
>
> I assume that the query engine first tries to match the "?gene
> ?startRel ?start" constraint, whereas begining by ":display
> :startRelation ?startRel." would define the value of ?startRel which
> would be used to find the start positions of the genes.
> I can live with the simple query, but the second one would make the
> development of our application easier. Is there anything we could do to
> improve the performance of the second query?
>
> Thank you!
> kind regards
> Olivier Dameron
>
> NB: for the record, below is the script I used for generating the
> dataset, as well as the two queries
>
> ===== generateDataSet.py
> #! /usr/bin/env python
>
> import random
>
> nbGenes = 50000
> geneLengthMax = 100
> chromosomeLength = 100000
>
> with open("debugDataset.ttl", "w") as dataFile:
>     dataFile.write("""
> @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
> @prefix : <http://www.univ-rennes1.fr/odameron/debugVirtuoso/>.
>
> :display :startRelation :posStart.
> :display :stopRelation :posStop.
>
> """)
>     for i in range(nbGenes):
>         geneIdent = ":gene" + str(i)
>         dataFile.write("\n" + geneIdent + " rdf:type :Gene.\n")
>         posStart = random.randint(1, chromosomeLength-geneLengthMax)
>         posStop = posStart + random.randint(1, geneLengthMax)
>         dataFile.write(geneIdent + " :posStart \"" + str(posStart) +
>                        "\"^^xsd:numeric.\n")
>         dataFile.write(geneIdent + " :posStop \"" + str(posStop) +
>                        "\"^^xsd:numeric.\n")
>
>
> ===== getOverlap.sparql
> PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
> PREFIX : <http://www.univ-rennes1.fr/odameron/debugVirtuoso/>
>
> SELECT (count(*) as ?nbOverlap)
>
> WHERE {
>     ?gene1 a :Gene;
>         :posStart ?start1;
>         :posStop ?stop1.
>
>     ?gene2 a :Gene;
>         :posStart ?start2;
>         :posStop ?stop2.
>     FILTER (?start1 < ?start2 && ?start1 < ?stop2 && ?start2 < ?stop1)
> }
>
>
> ===== getOverlapSLOW.sparql
> PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
> PREFIX : <http://www.univ-rennes1.fr/odameron/debugVirtuoso/>
>
> SELECT (count(*) as ?nbOverlap)
>
> WHERE {
>     :display :startRelation ?startRel.
>     :display :stopRelation ?stopRel.
>
>     ?gene1 a :Gene;
>         ?startRel ?start1;
>         ?stopRel ?stop1.
>
>     ?gene2 a :Gene;
>         ?startRel ?start2;
>         ?stopRel ?stop2.
>     FILTER (?start1 < ?start2 && ?start1 < ?stop2 && ?start2 < ?stop1)
> }
>
>
>
> ------------------------------------------------------------------------------
> What NetFlow Analyzer can do for you? Monitors network bandwidth and traffic
> patterns at an interface-level. Reveals which users, apps, and protocols are
> consuming the most bandwidth. Provides multi-vendor support for NetFlow,
> J-Flow, sFlow and other flows. Make informed decisions using capacity planning
> reports.http://sdm.link/zohodev2dev
> _______________________________________________
> Virtuoso-users mailing list
> Virtuoso-users@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/virtuoso-users
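
Addendum: points 3 and 4 of the reply above (inline P into the query text, and wrap the pattern in GRAPH when G is known) could be sketched on the client side roughly as follows. This is only an illustration, in Python because the dataset script above is Python. The helper names iri_ref and build_overlap_query are hypothetical and are not part of Virtuoso; the sketch assumes the application has already looked up the values of ?startRel and ?stopRel with a separate small query, and that you know the IRI of the graph the data was loaded into. Because the predicate is pasted into the query text, the helper rejects characters that are not allowed inside a SPARQL IRI reference.

```python
import re

# Characters that the SPARQL grammar does not allow inside <...> (IRIREF).
_IRI_FORBIDDEN = re.compile(r'[\x00-\x20<>"{}|^`\\]')

# Namespace used by the example dataset in the quoted message.
PREFIX = "http://www.univ-rennes1.fr/odameron/debugVirtuoso/"


def iri_ref(iri):
    """Hypothetical helper: return <iri>, refusing unsafe characters."""
    if not iri or _IRI_FORBIDDEN.search(iri):
        raise ValueError("not a safe IRI: %r" % (iri,))
    return "<" + iri + ">"


def build_overlap_query(start_pred, stop_pred, graph=None):
    """Hypothetical helper: overlap query with both predicates inlined.

    start_pred and stop_pred are full IRIs that become constants in the
    query text, so the optimizer sees a known P. If graph is given, the
    pattern is wrapped in GRAPH <graph> { ... } (known G).
    """
    start, stop = iri_ref(start_pred), iri_ref(stop_pred)
    gene = iri_ref(PREFIX + "Gene")
    pattern = "\n".join([
        "?gene1 a %s ; %s ?start1 ; %s ?stop1 ." % (gene, start, stop),
        "?gene2 a %s ; %s ?start2 ; %s ?stop2 ." % (gene, start, stop),
        "FILTER (?start1 < ?start2 && ?start1 < ?stop2 && ?start2 < ?stop1)",
    ])
    if graph is not None:
        pattern = "GRAPH %s {\n%s\n}" % (iri_ref(graph), pattern)
    return "SELECT (count(*) AS ?nbOverlap)\nWHERE {\n%s\n}" % pattern


if __name__ == "__main__":
    # Predicates as they would come back from the lookup on :display.
    print(build_overlap_query(PREFIX + "posStart", PREFIX + "posStop"))
```

The generated text is equivalent to getOverlap.sparql, so the application keeps its configurable predicates while the server only ever sees constant ones. Whether this actually helps on a given installation depends on the statistics discussed in point 2, so it is worth timing both variants.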