Re: [Virtuoso-users] bif:contains - using a string variable as search term

Hugh Williams Tue, 14 Jun 2011 18:07:35 +0000

Hi Robert,

We are working on a new open source snapshot build that we can provide to you. 
It may improve performance somewhat, although there are no guarantees, as the 
ORDER BY will always result in a full table scan first to order the rows and 
only then limit the results. What datasets do you have loaded in your store? 
Is there a sample dataset we could load locally to see this issue first hand?
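
To illustrate why ORDER BY ... LIMIT 1 cannot stop early, here is a minimal 
Python sketch. It is not Virtuoso code: the sample rows and the helper names 
(counted, newest_row) are made up for illustration. Assuming there is no index 
that already delivers rows in ?created order, every candidate row has to be 
produced before the newest one is known, even when only one row is kept:

```python
from typing import Iterable, Iterator, Tuple

Row = Tuple[str, str]  # (title, created) with ISO-8601 date strings


def counted(rows: Iterable[Row], counter: list) -> Iterator[Row]:
    """Yield rows unchanged while counting how many were consumed."""
    for row in rows:
        counter[0] += 1
        yield row


def newest_row(rows: Iterable[Row]) -> Row:
    """Rough equivalent of ORDER BY DESC(?created) LIMIT 1 on unsorted input.

    Only one row is kept at a time, but every row is still examined.
    """
    return max(rows, key=lambda row: row[1])


# Hypothetical sample data, deliberately not sorted by creation date.
rows = [
    ("News", "2011-06-10"),
    ("Sport", "2011-06-13"),
    ("Weather", "2011-06-11"),
]

seen = [0]
result = newest_row(counted(rows, seen))
```

In this sketch seen[0] ends up equal to the number of input rows: keeping only 
the top row avoids a full sort, but not the full scan.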


For the query you were getting the syntax error with, try running the following:

        PREFIX dc: <http://purl.org/dc/elements/1.1/> 
        SELECT DISTINCT *
        FROM <http://dbpedia.org>
        WHERE
        {{
            SELECT DISTINCT ?title ?u ?created
            WHERE
            {
              ?prog dc:title ?title ;
                    dc:created ?created .

              ?u dbpprop:name ?name .
              FILTER (bif:isnotnull (bif:strstr (str(?name), ?title)))
            } LIMIT 1
        }} ORDER BY DESC (?created)

Note the double curly brackets, i.e. "{{ ... }}". I also had to add ?created to 
the result list; otherwise you get another error: "SP031: SPARQL compiler: 
Variable 'created' is used in the query result set but not assigned".
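
One caveat about the shape of this query, sketched in Python with made-up data 
(the rows and names below are illustrative only, not Virtuoso internals): the 
LIMIT 1 sits inside the subquery and the ORDER BY outside it, so the ordering is 
applied to the single row the subquery returned. That row is not necessarily the 
newest one.

```python
# Hypothetical (title, created) rows, in the arbitrary order a store might return them.
rows = [
    ("News", "2011-06-10"),
    ("Sport", "2011-06-13"),
    ("Weather", "2011-06-11"),
]


def by_created_desc(items):
    """Sort rows newest first, like ORDER BY DESC(?created)."""
    return sorted(items, key=lambda row: row[1], reverse=True)


# Shape of the nested query: LIMIT 1 first, then ORDER BY on what is left.
limit_then_order = by_created_desc(rows[:1])

# Shape of the original query: ORDER BY over everything, then LIMIT 1.
order_then_limit = by_created_desc(rows)[:1]
```

With this sample data the first form returns the "News" row and the second the 
"Sport" row. The nested form is quick because it stops after one match, but to 
get the newest row the ordering has to happen before the limit.
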
Best Regards
Hugh Williams
Professional Services
OpenLink Software
Web: http://www.openlinksw.com
Support: http://support.openlinksw.com
Forums: http://boards.openlinksw.com/support
Twitter: http://twitter.com/OpenLink

On 13 Jun 2011, at 23:05, Robert Globisch wrote:

> Hi Hugh,
> 
> thank you!
> 
> There seems to be no noticeable decrease in execution time when I use this 
> ORDER BY clause.
> 
> In the meantime I did some further tests. It seems to me that the ORDER BY 
> clause causes these massive performance slowdowns.
> When I use the following query without the ORDER BY clause, the performance is 
> quite acceptable, even with the dc:created property included.
> 
> [Note: to improve performance I now use dbpprop:name instead of rdfs:label. 
> I think it will avoid labels of concepts, ontologies, etc. that I do not need 
> for my results.]
> 
> 
> See:
> 
> PREFIX dc: <http://purl.org/dc/elements/1.1/> 
> select distinct ?title ?u
> from <http://dbpedia.org>
> WHERE
> {
> ?prog dc:title ?title;
> dc:created ?created.
> 
> ?u dbpprop:name ?name .
> FILTER (bif:isnotnull (bif:strstr ( str(?name), ?title)))
> }
> LIMIT     1
> 
> > execution time on my 4gb quadcore system: about 60s with LIMIT 1 (2, 4)  ; 
> > and about 120s with LIMIT 8 
> 
> When I add the ORDER BY clause (within the search pattern, or outside it as in 
> your suggestion) the query is not usable anymore (execution time: about 10 minutes).
> My aim is to find the result(s) for my most recently created triple (the newest 
> one); that is why I added the dc:created property.
> 
> As far as I understand, the ORDER BY clause sorts the whole table first, 
> right? Not only the limited results?
> Could this be bypassed in some way? 
> 
> Maybe by putting all my dc:title triples into a separate graph, sorting them by 
> ?created, and putting them into a subquery using the bif:strstr function?
> Phil M pointed out a similar solution on 
> http://stackoverflow.com/questions/1154546/sorting-sparql-results-by-date?answertab=votes#tab-top
> (storing the creation date in a separate graph).
> 
> In this context I will also refer to my question on semanticweb.com:
> 
> http://answers.semanticweb.com/questions/10014/limit-before-order-by-clause
> 
> Unfortunately this subquery does not work. At the second SELECT clause I get: 
> syntax error at 'SELECT' before 'distinct' 
> 
> 
>         PREFIX dc: <http://purl.org/dc/elements/1.1/> 
>         SELECT DISTINCT *
>         from <http://dbpedia.org>
>         WHERE
>         {
>             SELECT DISTINCT?title ?u 
>             WHERE
>             {
>               ?prog dc:title ?title;
>               dc:created ?created.
>         
>             ?u dbpprop:name ?name .
>             FILTER (bif:isnotnull (bif:strstr ( str(?name), ?title)))
>            } LIMIT 1
>         } ORDER BY DESC (?created)
> 
> 
> Best regards,
> 
> Robert
> 
> 
> On 13.06.2011 11:19, Hugh Williams wrote:
>> 
>> Hi Robert,
>> 
>> Development suggests the query:
>> 
>> sparql
>> PREFIX dc: <http://purl.org/dc/elements/1.1/> 
>> select distinct ?title ?u
>> from <http://dbpedia.org>
>> WHERE
>> {
>> ?prog dc:title ?title .
>> ?u rdfs:label ?label .
>> FILTER (bif:isnotnull (bif:strstr (?label, ?title)))
>> }
>> ORDER BY DESC ((select ?created where { ?prog dc:created ?created. } ))
>> LIMIT     1
>> 
>> should be the fastest. Note the FROM <graph> clause and the implicit hint to
>> the optimizer that ?created can be calculated later and does not affect
>> filtering (i.e. the presence of ?created is not essential).
>> 
>> Best Regards
>> Hugh Williams
>> Professional Services
>> OpenLink Software
>> Web: http://www.openlinksw.com
>> Support: http://support.openlinksw.com
>> Forums: http://boards.openlinksw.com/support
>> Twitter: http://twitter.com/OpenLink
>> 
>> On 12 Jun 2011, at 15:28, Robert Globisch wrote:
>> 
>>> Hi Hugh,
>>> 
>>> that's me, yes. Hello :)
>>> 
>>> When I remove the dc:created property (bound to my ?prog variable) it 
>>> gets a lot faster on my Thinkpad (TP) and quad-core system.
>>> I need the dc:created property to order the results by the creation date 
>>> (the time you tuned in to a channel) of my files loaded into the store.
>>> As you can see, removing it improves execution time massively.
>>> 
>>> I ran the explain function for the following query using the virtuoso.db 
>>> loaded with the whole en.dbpedia dataset.
>>> I hope that is what you wanted.
>>> 
>>> 
>>> ************************************************************************************
>>> ************************************************************************************
>>> PREFIX po: <http://purl.org/ontology/po/>
>>> PREFIX skos: <http://www.w3.org/2004/02/skos/core#> 
>>> PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> 
>>> PREFIX dc: <http://purl.org/dc/elements/1.1/> 
>>> PREFIX dbpprop: <http://dbpedia.org/property/>
>>> PREFIX fn: <http://www.w3.org/2005/xpath-functions#>
>>> PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
>>> PREFIX foaf: <http://xmlns.com/foaf/0.1/>
>>> PREFIX dcterms: <http://purl.org/dc/terms/>
>>> 
>>> select distinct ?title ?u
>>> WHERE
>>> {
>>> ?prog dc:title ?title;
>>> dc:created ?created.
>>> 
>>> 
>>> ?u rdfs:label ?label.
>>> 
>>> FILTER (bif:isnotnull (bif:strstr (?label, ?title)))
>>> 
>>> 
>>> }
>>> ORDER BY DESC (?created)
>>> LIMIT     1
>>> 
>>> 
>>> 
>>> ************************************************************************************
>>> ************************************************************************************
>>> 
>>> 
>>> Result:
>>> 
>>> REPORT
>>> VARCHAR
>>> _______________________________________________________________________________
>>> 
>>> {
>>> Subquery 21
>>> {
>>> Fork 50
>>> {
>>> END Node
>>> 
>>> After test:
>>>       0: if ( 0  1(=)  1 ) then 10 else 3 unkn 10
>>>       3: if ( 0  1(=)  1 ) then 10 else 6 unkn 10
>>>       6: if ( 0  1(=)  1 ) then 10 else 9 unkn 10
>>>       9: BReturn 1
>>>       10: BReturn 0
>>> from DB.DBA.RDF_QUAD by RDF_QUAD_POGS   1.1e+002 rows
>>> Key RDF_QUAD_POGS  ASC ($27 "s-13-1-t0.O", $26 "s-13-1-t0.S")
>>>  inlined <col=554 P =  #dc/elements/1.1/title >
>>>  Local Test
>>>       0: if ( 0  1(=)  1 ) then 4 else 3 unkn 4
>>>       3: BReturn 1
>>>       4: BReturn 0
>>> 
>>> 
>>> Precode:
>>>       0: $30 "__ro2sq" := Call __ro2sq ($27 "s-13-1-t0.O")
>>>       5: BReturn 0
>>> from DB.DBA.RDF_QUAD by RDF_QUAD       0.23 rows
>>> Key RDF_QUAD  ASC ($32 "s-13-1-t1.O")
>>>  inlined <col=554 P =  #dc/elements/1.1/created > , <col=553 S = $26 
>>> "s-13-1-t0.S">
>>> 
>>> from DB.DBA.RDF_QUAD by RDF_QUAD   9.6e+006 rows
>>> Key RDF_QUAD  ASC ($37 "s-13-1-t2.O", $36 "s-13-1-t2.S")
>>>  inlined <col=554 P =  #label >
>>> 
>>> 
>>> After test:
>>>       0: $40 "__ro2sq" := Call __ro2sq ($37 "s-13-1-t2.O")
>>>       5: $41 "strstr" := Call strstr ($40 "__ro2sq", $30 "__ro2sq")
>>>       10: $42 "isnotnull" := Call isnotnull ($41 "strstr")
>>>       15: if ( 0  1(=) $42 "isnotnull") then 19 else 18 unkn 19
>>>       18: BReturn 1
>>>       19: BReturn 0
>>> 
>>> After code:
>>>       0: $43 "__id2i" := Call __id2i ($36 "s-13-1-t2.S")
>>>       5: BReturn 0
>>> Distinct (HASH) ($27 "s-13-1-t0.O", $36 "s-13-1-t2.S")
>>> 
>>> Precode:
>>>       0: $49 "__ro2sq" := Call __ro2sq ($32 "s-13-1-t1.O")
>>>       5: BReturn 0
>>> Sort (HASH) (TOP  1  ) ($49 "__ro2sq") -> ($30 "__ro2sq", $43 "__id2i")
>>> 
>>> }
>>> top order by node
>>> 
>>> After code:
>>>       0: $22 "title" :=  := artm $30 "__ro2sq"
>>>       4: $23 "u" :=  := artm $43 "__id2i"
>>>       8: BReturn 0
>>> Subquery Select($22 "title", $23 "u", <$39 "<DB.DBA.RDF_QUAD s-13-1-t2>" 
>>> spec 5>, <$34 "<DB.DBA.RDF_QUAD s-13-1-t1>" spec 5>, <$29 "<DB.DBA.RDF_QUAD 
>>> s-13-1-t0>" spec 5>)
>>> }
>>> 
>>> 
>>> After code:
>>>       0: $70 "title" := Call __ro2sq ($22 "title")
>>>       5: $71 "u" := Call __ro2sq ($23 "u")
>>>       10: BReturn 0
>>> Select ($70 "title", $71 "u")
>>> }
>>> 
>>> 69 Rows. -- 328 msec.
>>> 
>>> ************************************************************************************
>>> ************************************************************************************
>>> 
>>> 
>>> Best regards,
>>> 
>>> Robert 
>>> 
>>> 
>>> 
>>> On 12.06.2011 15:39, Hugh Williams wrote:
>>>> 
>>>> Hi Robert,
>>>> 
>>>> I presume you are also "Robbet <[email protected]>" who posted similar 
>>>> questions on the vos mailing list?
>>>> 
>>>> Can you use the Virtuoso explain function to generate a compiler query 
>>>> execution plan so we can see how this is being constructed, as detailed at:
>>>> 
>>>>  http://docs.openlinksw.com/virtuoso/fn_explain.html
>>>> 
>>>> 
>>>> It is also not clear to me what the figures you state in the following mean:
>>>> 
>>>>>> As soon as i remove the dc:created property query gets about 10-100x 
>>>>>> faster
>>>>>> (TP: from 3,5mins > 30s / Quad core: 7 mins  > 5,5mins).
>>>> 
>>>> What is TP, and what are the timing differences with and without the 
>>>> dc:created property?
>>>> 
>>>> Best Regards
>>>> Hugh Williams
>>>> Professional Services
>>>> OpenLink Software
>>>> Web: http://www.openlinksw.com
>>>> Support: http://support.openlinksw.com
>>>> Forums: http://boards.openlinksw.com/support
>>>> Twitter: http://twitter.com/OpenLink
>>>> 
>>>> On 12 Jun 2011, at 12:43, Kingsley Idehen wrote:
>>>> 
>>>>> On 6/12/11 1:22 AM, Robert Globisch wrote:
>>>>>> 
>>>>>> Hello Kingsley,
>>>>>> 
>>>>>> i will need your help once again. Actually i'm a bit frustrated :/
>>>>>> 
>>>>>> During the last few hours i made some test examples to find out how my 
>>>>>> query performs:
>>>>>> 
>>>>>> First i created a new virtuoso.db with the labels_en.nt dbpedia dataset 
>>>>>> only (virtuoso.db size about 2.6GB).
>>>>>> I added some of my own triples. Only a few with some dc: and po 
>>>>>> properties. (see attachment - example file).
>>>>>> 
>>>>>> Afterwards I ran the following query with the free text search index 
>>>>>> disabled / enabled to get matching title strings within dbpedia:
>>>>>> 
>>>>>> SELECT distinct ?title ?label
>>>>>> 
>>>>>> WHERE 
>>>>>> {
>>>>>> 
>>>>>> ?prog dc:title ?title;
>>>>>> dc:created ?created.
>>>>>> 
>>>>>> ?dbpedia rdfs:label ?label
>>>>>> 
>>>>>> FILTER (bif:isnotnull (bif:strstr (?label, ?title)))
>>>>>> 
>>>>>> }
>>>>>> LIMIT 1
>>>>>> 
>>>>>> 
>>>>>> Execution time on an Intel QuadCore system with 4gb of ram (as already 
>>>>>> discussed) was about 7 minutes (with free text enabled / disabled).
>>>>>> I performed same query on the whole de.dbpedia data set (separate 
>>>>>> virtuoso.db - size about 8,5 GB) on a small Thinkpad (AMD Dual Core with 
>>>>>> 4gb ram)
>>>>>> and it took about 3,5 minutes to execute. Some interesting fact i 
>>>>>> noticed: As soon as i remove the dc:created property query gets about 
>>>>>> 10-100x faster
>>>>>> (TP: from 3,5mins > 30s / Quad core: 7 mins  > 5,5mins).
>>>>>> 
>>>>>> 
>>>>>> Is there anything left i could do to increase performance besides 
>>>>>> hosting it on a more powerful system?
>>> 
>> 
>
