Re: Multifilter SPARQL queries

Andy Seaborne Thu, 03 Apr 2014 06:59:23 -0700

On 02/04/14 19:28, Adeeb Noor wrote:

Thanks, Andy, it works.


Another question please,


A new thread is easier.

I have this very long query with too many filters
and my TDB is 15G. I have a problem with the time it takes to get the results back.
For instance, when I add a LIMIT it returns fast, but with no limit it takes
forever. I was wondering if there is a way to do caching in the query.
Below is the example:

SELECT DISTINCT  *
WHERE {


Which TDB optimizer is in use?  (which .opt file?)

Choosing between reordering triples and where to put filters is a delicate balance. Maybe this is the issue, or maybe it's just expensive because 'replace' is expensive (I don't know if it is or not - it's doing regex stuff).


Try:

1/ Removing FILTERs and see what the effect on speed is
2/ Reordering the parts of the pattern and see if it makes a difference.

?d ddids:has_pharmgkb_variantLocation_association ?snp1 .
FILTER (replace(str(?snp1),"([^_]*_){3}","") =
replace(str(?snp2),"([^_]*_){3}","") )
FILTER (replace(str(?snp1),"([^_]*_){3}","") =
replace(str(?snp3),"([^_]*_){3}","") )


ddidd:C0123931 ddids:has_pharmgkb_variantLocation_association ?snp2 .
?drug2 ddids:has_pharmgkb_variantLocation_association ?snp3 .
FILTER (replace(str(?snp2),"([^_]*_){3}","") =
replace(str(?snp3),"([^_]*_){3}","") )

ddidd:C0123931 ddids:drugBank_enzyme ?enzyme .
?drug2 ddids:drugBank_enzyme ?enzyme .

ddidd:C0123931 ddids:has_pharmgkb_gene_association ?gene1 .
?drug2 ddids:has_pharmgkb_gene_association ?gene2 .
FILTER (replace(str(?gene1),"([^_]*_){3}","") =
replace(str(?gene2),"([^_]*_){3}","") )


?drug2 ddids:drugBank_category "Approved"^^xsd:string .


?enzyme ddids:label ?lenzyme.
?drug2 rdfs:label ?ldrug2.
ddidd:C0123931 ddids:label ?ldrug .
?d a ddids:Disease .
?d ddids:label ?disease.


FILTER ( str (ddidd:C0123931) < str (?drug2) )
FILTER ( str (?gene1) < str (?gene2) )
FILTER ( str (?snp2) < str (?snp3) )

}
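
A possible variant to try alongside the two suggestions above: compute each stripped value once with BIND and compare the results, instead of calling replace() inside every FILTER. This is only a sketch. The names ?key1, ?key2 and ?key3 are made up for illustration, the PREFIX declarations are omitted as in the original query, only the variant-location part of the pattern is shown, and whether it is actually faster on TDB is untested.

```sparql
# Sketch only: ?key1..?key3 are illustrative names, PREFIXes omitted as in the original.
SELECT DISTINCT ?d ?drug2 ?snp1 ?snp2 ?snp3
WHERE {
  ddidd:C0123931 ddids:has_pharmgkb_variantLocation_association ?snp2 .
  BIND (replace(str(?snp2), "([^_]*_){3}", "") AS ?key2)

  ?drug2 ddids:has_pharmgkb_variantLocation_association ?snp3 .
  BIND (replace(str(?snp3), "([^_]*_){3}", "") AS ?key3)
  FILTER (?key2 = ?key3)

  ?d ddids:has_pharmgkb_variantLocation_association ?snp1 .
  BIND (replace(str(?snp1), "([^_]*_){3}", "") AS ?key1)
  # ?key1 = ?key3 then follows from the two equalities, so no third test is needed.
  FILTER (?key1 = ?key2)
}
```

Each replace() is evaluated once per binding of its variable here, and the third pairwise comparison in the original is redundant because string equality is transitive.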


On Sat, Mar 29, 2014 at 4:02 AM, Andy Seaborne <[email protected]> wrote:

On 28/03/14 23:26, Adeeb Noor wrote:

thanks Dave for the very useful answers. I have to check my KB and then
decide which way to go.

Another silly question: how can I remove the duplicates in my result below?

SELECT DISTINCT *

WHERE {

?s ddids:x-kegg.pathway ?o.

?s2 ddids:x-kegg.pathway ?o.

FILTER (?s != ?s2 ) }


There needs to be a way to impose an arbitrary order on ?s and ?s2, so that
?s differs from ?s2 in some way that lets you choose one over the other:

FILTER ( str(?s) < str(?s2) )

Or, and this is less general as you compose patterns, project the column
and do DISTINCT

SELECT DISTINCT ?s ?o
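
Applied to the query in question, the first suggestion might look like the following sketch (PREFIX declarations omitted, as in the original). Of the two rows in the result shown, only the one with ?s = ddidd:C1514505 should remain, because that string sorts before ddidd:C1879725.

```sparql
# Keep one of each symmetric pair by imposing an arbitrary string order.
SELECT DISTINCT *
WHERE {
  ?s  ddids:x-kegg.pathway ?o .
  ?s2 ddids:x-kegg.pathway ?o .
  FILTER ( str(?s) < str(?s2) )   # also excludes ?s = ?s2
}
```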

         Andy

------------------------------------------------------------------------------------
| s              | o                                              | s2             |
====================================================================================
| ddidd:C1514505 | <http://identifiers.org/kegg.pathway/hsa00590> | ddidd:C1879725 |
| ddidd:C1879725 | <http://identifiers.org/kegg.pathway/hsa00590> | ddidd:C1514505 |
------------------------------------------------------------------------------------



On Fri, Mar 28, 2014 at 9:15 AM, Dave Reynolds <[email protected]> wrote:


  On 27/03/14 17:52, Adeeb Noor wrote:


  Hi Dave:


Thank you so much for the very helpful comments, it is now more clear to me
than before.

I totally agree that I need to figure out why I need to use something over
the other.

In my case for example, I have this huge TDB with 16GB that has lots of
biomedical data. I would like for example to find a gene that is associated
with at least 3 different phenotypes. I can do this in any of the
following ways:

1- OWL (pellet)
<owl:Class rdf:about="https://csel.cs.colorado.edu/~noor/Drug_Disease_ontology/DDID.owl#multiDiseases">
    <owl:equivalentClass>
        <owl:Restriction>
            <owl:onProperty rdf:resource="https://csel.cs.colorado.edu/~noor/Drug_Disease_ontology/DDID.owl#gene_associated_with_disease"/>
            <owl:onClass rdf:resource="https://csel.cs.colorado.edu/~noor/Drug_Disease_ontology/DDID.owl#Disease"/>
            <owl:minQualifiedCardinality rdf:datatype="&xsd;nonNegativeInteger">3</owl:minQualifiedCardinality>
        </owl:Restriction>
    </owl:equivalentClass>
    <rdfs:subClassOf rdf:resource="https://csel.cs.colorado.edu/~noor/Drug_Disease_ontology/DDID.rdf#DDID"/>
</owl:Class>

2- Construct:

CONSTRUCT {

?s ddids:gene_associated_with_disease ?o .

?s ddids:gene_associated_with_disease ?o1 .

?s ddids:gene_associated_with_disease ?o2 .}

WHERE {

?s ddids:gene_associated_with_disease ?o .

?s ddids:gene_associated_with_disease ?o1 .

?s ddids:gene_associated_with_disease ?o2 .

FILTER (?o != ?o1 )

FILTER (?o != ?o2 )

FILTER (?o1 != ?o2 )

}
and store the result of the construct into a new TDB and work on it.

3- sparql update

INSERT {
?s ddids:gene_has_multiple_association ?o }

WHERE {

?s ddids:gene_associated_with_disease ?o .

?s ddids:gene_associated_with_disease ?o1 .

?s ddids:gene_associated_with_disease ?o2 .

FILTER (?o != ?o1 )

FILTER (?o != ?o2 )

FILTER (?o1 != ?o2 )

}
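
For comparison, the "at least 3 different diseases" test can also be written as an aggregate rather than a three-way self-join. This is a different technique from the queries above, offered as a sketch only: it assumes a SPARQL 1.1 engine, omits the PREFIX declarations as the original does, and, like the FILTER version, only counts distinct URIs.

```sparql
# Genes linked to at least three distinct disease URIs.
SELECT ?s (COUNT(DISTINCT ?o) AS ?diseases)
WHERE {
  ?s ddids:gene_associated_with_disease ?o .
}
GROUP BY ?s
HAVING (COUNT(DISTINCT ?o) >= 3)
```

The self-join form returns every ordering of each qualifying triple of diseases, while this form returns one row per gene.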
The three methods will at the end give me the same answer, but the
performance is different.

Not necessarily the same answer.

Your test is making a unique name assumption: just because the three
disease values have different URIs, they are treated as different diseases.

I seem to recall that Pellet can be asked to make a default-UNA (i.e. go
outside the specs), so you could arrange for Pellet to generate similar
results, but it shouldn't do so by default.

Imagine what would happen if you now add:

    :d1   owl:sameAs   :d2 .
    :d2   owl:sameAs   :d3 .

where :d1-3 are all associated with the same gene.
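
To make the point concrete, here is a small self-contained sketch. The example.org namespace, :g1 and the inline VALUES data are invented for illustration. A plain SPARQL count over the asserted triples reports three diseases for :g1, whereas an OWL reasoner that honours the two owl:sameAs statements treats :d1, :d2 and :d3 as one individual, so membership in a minimum-cardinality-3 class would not be entailed from these statements.

```sparql
PREFIX : <http://example.org/>

# Hypothetical data, inlined with VALUES so the query runs on its own.
# Suppose the store also asserts:  :d1 owl:sameAs :d2 .  :d2 owl:sameAs :d3 .
SELECT ?gene (COUNT(DISTINCT ?disease) AS ?n)
WHERE {
  VALUES (?gene ?disease) { (:g1 :d1) (:g1 :d2) (:g1 :d3) }
}
GROUP BY ?gene
# Without inference this counts by URI: one row, ?gene = :g1, ?n = 3.
```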


If I want to do this test in owl, it takes around 14 hours to complete,
in construct 2 mins, and sparql updates less than a minute.

What do you think Dave ?

Like I say, it depends what you want.

If that's the only question you want to answer and you can justify making
a strong unique name assumption then you can use #3 and it's certainly
more scalable.

If you need better guarantees of correctness but only need to operate over
parts of the data then you could use things like #2 to pull relevant data
into a small in-memory store and then do OWL reasoning over that. Though
you would have to pull in all relevant statements involving the resources
in your core query (c.f. the sameAs example) and even then there can be
indirect consequences that you miss.

Dave
