Re: Unexpectedly slow query

hueyl16 Mon, 27 May 2013 04:33:27 -0700

Thanks for the info!

Just to sum up my options:
1. Use fixed.opt instead of stats.opt. (I am currently doing that. It works fine; 
other queries are a little slower than before, but acceptably so.)


2. Add BIND(1 AS ?x) whenever I am using the dynamic subPropertyOf 
relationships

3. Add ( (TERM TERM VAR) 2 ) to the stats file


4. How about I add:
     VALUES ?dynProp { ec:Has_Dyspnea_Score ec:Has_Dysphagia_Score }
     ?pat ?dynProp ?finding .


   Would that work and be stylistically acceptable? I am leaning towards option 
4 since I do not have to manipulate the TDB store and it can be easily 
understood by other developers.
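
For illustration, the full query with option 4 might look like the sketch below. It assumes the same prefixes as in the plans quoted further down and simply swaps the rdfs:subPropertyOf pattern for the VALUES block. I have not verified what plan the optimizer produces for it.

```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX nci: <http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#>
PREFIX ec:  <http://www.siemens.com/euroCAT/2011/8/euroCAT.owl#>

SELECT *
WHERE
  { # Fixed list of sub-properties, replacing
    # ?findingProp rdfs:subPropertyOf ec:Has_Finding
    VALUES ?dynProp { ec:Has_Dyspnea_Score ec:Has_Dysphagia_Score }
    ?pat rdf:type nci:Patient .
    ?pat ec:Has_Id ?patId .
    ?pat ?dynProp ?finding .
    ?finding rdf:type ?findingType
  }
```

The list of properties would have to be kept in sync with the ontology by hand (or generated at design time), since the query no longer reads the sub-property hierarchy from the store.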




 On a more philosophical note:
You mentioned that this use of sub properties is "unusual" since the property 
is often assumed to be something constant. That raises the question: If this is 
not one of the proper/expected uses of sub-properties, what is their purpose? 

I am trying to allow for queries that either ask for Dyspnea and/or Dysphagia 
scores explicitly or all types of Findings in general. IMO I have two options:
1. Just use "Has_Finding". If the user wants only Dyspnea scores, I can 
restrict the object.
      ?pat ec:Has_Finding ?finding .
      ?finding rdfs:subClassOf* nci:Dyspnea_Score

2. Use sub-properties as outlined so far. I went with this one since I need to 
know the possible sub-types of findings at design-time, when no data triples 
are asserted. And I need to be able to tell that "Dyspnea_Score" is an "entity" 
or property of Patient, but its sub-classes are real-life values for that 
entity, e.g. "Dyspnea_Score_1". 
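
Spelled out as a complete query, option 1 would be roughly the following. This is only a sketch: the prefixes are the ones used in this thread, and it presupposes that the data asserts ec:Has_Finding triples directly, which is not how my data is loaded today (only the sub-properties are asserted, and there is no inference).

```sparql
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX nci:  <http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#>
PREFIX ec:   <http://www.siemens.com/euroCAT/2011/8/euroCAT.owl#>

SELECT *
WHERE
  { ?pat rdf:type nci:Patient .
    ?pat ec:Has_Id ?patId .
    # Constant property: requires explicit ec:Has_Finding triples
    ?pat ec:Has_Finding ?finding .
    # Restrict the object to Dyspnea scores via the class hierarchy
    ?finding rdfs:subClassOf* nci:Dyspnea_Score
  }
```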

My current use-case is fairly hierarchical/relational. I am using the ontology 
for query expansion. But in the future we do want the additional semantic 
information and easy incorporation of other ontologies.

Am I trying to squeeze something into a triple store that I should not?

-Wolfgang

-----Original Message-----
From: Andy Seaborne <[email protected]>
To: users <[email protected]>
Sent: Sat, May 25, 2013 7:20 pm
Subject: Re: Unexpectedly slow query


On 24/05/13 10:23, [email protected] wrote:
> Sorry about the ecIn: vs. ec: namespaces/prefixes. I removed the ecIn
> namespace in the process. But in general, the Has_Id property is not the
> crucial one, even though it is also put into the wrong place in the
> execution plan.
>
> My original query (modified for prefixes) is:
>
>
>   SELECT  *
>     WHERE
>       { ?pat rdf:type nci:Patient .
>         ?pat ec:Has_Id ?patId .
>         ?findingProp rdfs:subPropertyOf ec:Has_Finding .
>         ?pat ?findingProp ?finding .
>         ?finding rdf:type ?findingType
>       }
> This to me also represents the most efficient order of triple patterns for
> the execution plan. But the execution plan I initially got, without
> regenerating the stats file after inserting all my data triples, was:
>
> (?pat <http://www.siemens.com/euroCAT/2011/8/euroCAT.owl/instances#Has_Id> ?patId)
> (?pat rdf:type <http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#Patient>)
> (?findingProp rdfs:subPropertyOf <http://www.siemens.com/euroCAT/2011/8/euroCAT.owl#Has_Finding>)
> (?finding rdf:type ?findingType)
> (?pat ?findingProp ?finding)
>
>
> The main problem here is that (?pat ?findingProp ?finding) comes after
> (?finding rdf:type ?findingType).
> The other problem is that (?pat ec:Has_Id ?patId) comes first, even though
> (?pat rdf:type nci:Patient) is more restrictive.
>
> After regenerating the stats file I got this:

That looks possible using a stats file generated by 2.10.1 - that isn't the

>
> (?findingProp rdfs:subPropertyOf <http://www.siemens.com/euroCAT/2011/8/euroCAT.owl#Has_Finding>)
> (?pat rdf:type <http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#Patient>)
> (?pat <http://www.siemens.com/euroCAT/2011/8/euroCAT.owl#Has_Id> ?patId)
> (?finding rdf:type ?findingType)
> (?pat ?findingProp ?finding)
>
>
>
> This version fixed the ec:Has_Id problem since it is now located
> after the (?pat a nci:Patient) triple pattern. But we still have the problem
> with the last two lines. They should be reversed since
> (?pat ?findingProp ?finding) is way more restrictive than
> (?finding a ?findingType). The new stats file did however
> cause the (?findingProp rdfs:subPropertyOf ec:Has_Finding) triple
> pattern to be moved to the top.
>
> ?findingProp is bound to the predicates ec:Has_Dysnpea_Score and
> ec:Has_Dysphagia_Score, which are sub-properties of ec:Has_Finding.

The stats analysis is only partially dynamic.  It runs at the start of a 
basic graph pattern, not incrementally during a pattern.  This matters 
for TDB because TDB does not solve patterns using node values; it uses 
internal NodeIds (which are 64 bits).  It's a trade-off.

The system does not "know" that ?findingProp will be bound to only 
ec:Has_Dysnpea_Score and ec:Has_Dysphagia_Score.

The optimizer has a step of merging blocks, so putting extra {} in the query 
doesn't help.  A trick is to put in a BIND:

  SELECT  *
    WHERE
      {
         ?findingProp rdfs:subPropertyOf ec:Has_Finding .
         BIND(1 as ?x)
         ?pat rdf:type nci:Patient .
         ?pat ec:Has_Id ?patId .
         ?pat ?findingProp ?finding .
         ?finding rdf:type ?findingType
      }

Then it executes the part after the BIND on each value of ?findingProp as a 
Node, not at the TDB level -- if there are only two, that might be good.

The optimizer is not doing the best job it could because at the last 
step (two remaining triples to order) it has:

     3       -1 : TERM TERM ?finding
     4   556563 : ?finding <::type> ?findingType

and TERM TERM ?finding (TERM means it is grounded at that point in the 
execution) should estimate as a small number; then it'll flip the last 
two triples.

where the initial input is:
    id    count   Pattern
     0      100 : ?pat <::type> <::Patient>
     1   290525 : ?pat <::Has_Id> ?patId
     2       10 : ?findingProp <::subPropertyOf> <::Has_Finding>
     3       -1 : ?pat ?findingProp ?finding
     4   556563 : ?finding <::type> ?findingType

If you add

   ( (TERM TERM VAR) 2 )

to the stats rules, it seems to redo the order.

e.g. at the end
...
   (<http://.../euroCAT.owl#Has_Frac_Cumulative_Dose> 22873)
   ( (TERM TERM VAR) 2 )
   (other 0))

Issue recorded as JENA-460

It is the use of rdfs:subPropertyOf, which relates something in the 
subject position to something in the property position, that is unusual 
(and several RDF systems assume the property is a constant in the query 
and need to do a scan for a var in the property position).

> The stats file shows entries for ec:Has_Dysnpea_Score and
> ec:Has_Dysphagia_Score (counts are 7 and 8). But it chooses to
> put (?pat ?findingProp ?finding) last and process
> (?finding a ?findingType) first, which matches pretty much
> everything in the entire store first since
> ?finding is not bound to something more specific yet.
>
> To sum things up: Regenerating the stats file after I imported all my
> data individuals solved the ec:Has_Id problem. But it does not address the
> ?findingProp problem.
>
> I hope this helps clarify the overall picture, and thank you very
> much for your help!
>
> -Wolfgang

        Andy

>
>
>
>
> -----Original Message-----
> From: Andy Seaborne <[email protected]>
> To: users <[email protected]>
> Sent: Thu, May 23, 2013 8:40 pm
> Subject: Re: Unexpectedly slow query
>
>
> Wolfgang,
>
> I am confused as to what the setup is that you have.  Let's step back and
> establish what the setup is here.
>
> 1/ The original query you sent was:
>
>     SELECT  *
>     WHERE
>       { ?pat rdf:type nci:Patient .
>         ?pat ecIn:Has_Id ?patId .
>         ?findingProp rdfs:subPropertyOf ec:Has_Finding .
>         ?pat ?findingProp ?finding .
>         ?finding rdf:type ?findingType
>       }
>
> but your emails refer to  ec:Has_Id (not ecIn:)
>
> ecIn: <http://www.siemens.com/euroCAT/2011/8/euroCAT.owl/instances#>
> ec:   <http://www.siemens.com/euroCAT/2011/8/euroCAT.owl#>
>
> which is it?
>
> There are no properties 'euroCAT.owl/instances#' in stats file.
>
> 2/ The stats file is not generated by the recent releases, which handle
>
> ?? rdf:type :SomeType
>
> much better.  You only need to regenerate the stats file using the
> latest release - it'll work well with previous releases.
>
> 3/ The database has no inference capabilities itself.
> What are you expecting the :Has_Finding to do to influence the rest of
> the query plan?
>
>       Andy
>
> On 23/05/13 08:46, [email protected] wrote:
>> I am using Sparql queries via Dataset and QueryExecution. No in-memory
>> inference defined. The stats file with my predicates is attached. It has
>> those for Has_Id, Has_Dyspnea_Score and Has_Dysphagia_Score, but it
>> apparently cannot infer that the latter two are also Has_Finding and
>> therefore can be used to narrow down the result set.
>>
>> -Wolfgang
>>
>>
>>
>> -----Original Message-----
>> From: Andy Seaborne <[email protected]>
>> To: users <[email protected]>
>> Sent: Sat, May 18, 2013 1:17 pm
>> Subject: Re: Unexpectedly slow query
>>
>> On 17/05/13 13:27,[email protected]  wrote:
>>>
>>>
>>> I ran tdbstats again on the fully loaded triple store (with all the patient
>>> data as individuals and their relationships). My properties appear now in the
>>> stats file. But only the properties with explicit triples, not the inferred
>>> parent properties. E.g. I am using the following property type hierarchy:
>>
>> What is your inference setup?
>>
>>>
>>> Has_Finding
>>>      - Has_Dysnpea_Score
>>>      - Has_Dysphagia_Score
>>>      - Is_Dead
>>>
>>>
>>> There are no explicit triples stating e.g. that a patient Has_Finding
>>> Dyspnea_Score_2. But there are triples using the sub-properties, e.g. Patient
>>> Has_Dysnpea_Score Dyspnea_Score_2.
>>
>> Can you share the stats file?  I can't investigate the situation without
>> a test case.
>>
>>>
>>> The stats file now contains entries for the sub-properties, but not for
>>> Has_Finding.
>>>
>>> The execution plan changed slightly though, but the crucial triple patterns
>>> are still in the "wrong" order.
>>>
>>> It used to be:
>>> (?pat <http://www.siemens.com/euroCAT/2011/8/euroCAT.owl/instances#Has_Id> ?patId)
>>
>> The original stats file had no mention of #Has_Id and actually said (at
>> the end) that missing predicates were to be counted as having zero
>> occurrences.  The optimizer puts one of these first because the rest of
>> the pattern will never be reached if it's accurate.
>>
>>> (?pat rdf:type <http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#Patient>)
>>> (?findingProp rdfs:subPropertyOf <http://www.siemens.com/euroCAT/2011/8/euroCAT.owl#Has_Finding>)
>>> (?finding rdf:type ?findingType)
>>> (?pat ?findingProp ?finding)
>>>
>>> Now it is:
>>> (?findingProp rdfs:subPropertyOf <http://www.siemens.com/euroCAT/2011/8/euroCAT.owl#Has_Finding>)
>>> (?pat rdf:type <http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#Patient>)
>>> (?pat <http://www.siemens.com/euroCAT/2011/8/euroCAT.owl#Has_Id> ?patId)
>>> (?finding rdf:type ?findingType)
>>> (?pat ?findingProp ?finding)
>>
>>>
>>> The triple pattern for the Has_Finding sub-properties moved to the start, but
>>> the crucial (?finding rdf:type ?findingType) is still evaluated before (?pat
>>> ?findingProp ?finding) -> the query is still taking a very long time.
>>>
>>> I can go ahead and use "fixed.opt" instead of "stats.opt", but I am still
>>> interested in whether there is a solution to this problem. I am using Jena
>>> 2.10.0.
>>>
>>> Hope this info helps!
>>>
>>> -Wolfgang
>>>
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: hueyl16 <[email protected]>
>>> To: users <[email protected]>
>>> Sent: Fri, May 17, 2013 1:43 pm
>>> Subject: Re: Unexpectedly slow query
>>>
>>>
>>>    I was wondering about that too. I could only find entries related to NCI
>>> terms. How or when is the stats file generated?
>>>
>>> I am using the .bat versions of the tdbloader for importing the NCIt first and
>>> then my own ontology, which contains Has_Id and Has_Finding plus more.
>>>
>>>
>>> I also ran tdbstats once but it did not change the stats file, just printed
>>> it.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Andy Seaborne <[email protected]>
>>> To: users <[email protected]>
>>> Sent: Fri, May 17, 2013 1:25 pm
>>> Subject: Re: Unexpectedly slow query
>>>
>>>
>>> (I now have the stats file)
>>>
>>> Wolfgang,
>>>
>>> I don't see entries for ec:Has_Id and ec:Has_Finding.
>>>
>>>     Andy
>>>
>>>
>>>
>>>
>>>
>>>
>>
>
>
>
>
