Thanks for the info!
Just to sum up my options:
1. Use fixed.opt instead of stats.opt (I am currently doing that. It works
fine; other queries are a little slower than before, but acceptably so)
2. Add BIND(1 AS ?x) whenever I am using the dynamic subPropertyOf
relationships
3. Add ( (TERM TERM VAR) 2 ) to the stats file
4. How about I add:
VALUES ?dynProp { ec:Has_Dyspnea_Score ec:Has_Dysphagia_Score }
?pat ?dynProp ?finding .
Would that work and be stylistically acceptable? I am leaning towards option
4 since I do not have to manipulate the TDB store and it can be easily
understood by other developers.
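For concreteness, here is a sketch of option 4 as a complete query. The prefix IRIs are the ones that appear in the execution plans quoted below; I have not checked which plan TDB actually produces for this form, so the performance benefit is an assumption.

```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX nci: <http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#>
PREFIX ec:  <http://www.siemens.com/euroCAT/2011/8/euroCAT.owl#>

SELECT *
WHERE
{
  # Enumerate the sub-properties explicitly instead of looking them up
  # via rdfs:subPropertyOf at query time.
  VALUES ?dynProp { ec:Has_Dyspnea_Score ec:Has_Dysphagia_Score }
  ?pat rdf:type nci:Patient .
  ?pat ec:Has_Id ?patId .
  ?pat ?dynProp ?finding .
  ?finding rdf:type ?findingType
}
```

The drawback is that the list of sub-properties has to be kept in sync with the ontology by hand (or generated by a separate lookup query).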
On a more philosophical note:
You mentioned that this use of sub properties is "unusual" since the property
is often assumed to be something constant. That raises the question: If this is
not one of the proper/expected uses of sub-properties, what is their purpose?
I am trying to allow for queries that either ask for Dyspnea and/or Dysphagia
scores explicitly or all types of Findings in general. IMO I have two options:
1. Just use "Has_Finding". If the user wants only Dyspnea scores, I can
restrict the object.
?pat ec:Has_Finding ?finding .
?finding rdfs:subClassOf* nci:Dyspnea_Score
2. Use sub-properties as outlined so far. I went with this one since I need to
know the possible sub-types of findings at design-time, when no data triples
are asserted. And I need to be able to tell that "Dyspnea_Score" is an "entity"
or property of Patient, but its sub-classes are real-life values for that
entity, e.g. "Dyspnea_Score_1".
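Written out as a complete query, option 1 might look like the sketch below. It assumes that explicit ec:Has_Finding triples are asserted (or inferred), which is not the case in my current store, and it uses the prefix IRIs from the execution plans quoted below.

```sparql
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX nci:  <http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#>
PREFIX ec:   <http://www.siemens.com/euroCAT/2011/8/euroCAT.owl#>

SELECT *
WHERE
{
  ?pat rdf:type nci:Patient .
  ?pat ec:Has_Id ?patId .
  # Constant predicate: requires explicit Has_Finding triples (assumption).
  ?pat ec:Has_Finding ?finding .
  # Restrict the object to Dyspnea_Score or any of its sub-classes.
  ?finding rdfs:subClassOf* nci:Dyspnea_Score
}
```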
My current use-case is fairly hierarchical/relational. I am using the ontology
for query expansion. But in the future we do want the additional semantic
information and easy incorporation of other ontologies.
Am I trying to squeeze something into a triple store that I should not?
-Wolfgang
-----Original Message-----
From: Andy Seaborne <[email protected]>
To: users <[email protected]>
Sent: Sat, May 25, 2013 7:20 pm
Subject: Re: Unexpectedly slow query
On 24/05/13 10:23, [email protected] wrote:
> Sorry about the ecIn: vs. ec: namespaces/prefixes. I removed the ecIn
> namespace in the process. But in general, the Has_Id property is not the
> crucial one, even though it is also put into the wrong place in the
> execution plan.
>
> My original query (modified for prefixes) is:
>
>
> SELECT *
> WHERE
> { ?pat rdf:type nci:Patient .
> ?pat ec:Has_Id ?patId .
> ?findingProp rdfs:subPropertyOf ec:Has_Finding .
> ?pat ?findingProp ?finding .
> ?finding rdf:type ?findingType
> }
> This to me also represents the most efficient order of triple patterns for
> the execution plan. But the execution plan I initially got without
> regenerating the stats file after inserting all my data triples was:
>
> (?pat <http://www.siemens.com/euroCAT/2011/8/euroCAT.owl/instances#Has_Id>
?patId)
> (?pat rdf:type <http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#Patient>)
> (?findingProp rdfs:subPropertyOf
> <http://www.siemens.com/euroCAT/2011/8/euroCAT.owl#Has_Finding>)
> (?finding rdf:type ?findingType)
> (?pat ?findingProp ?finding)
>
>
> The main problem here is that (?pat ?findingProp ?finding) comes after
> (?finding rdf:type ?findingType).
> The other problem is that (?pat ec:Has_Id ?patId) comes first, even though
> (?pat rdf:type nci:Patient) is more restrictive.
>
> After regenerating the stats file I got this:
That looks possible using a stats file generated by 2.10.1 - that isn't the
>
> (?findingProp rdfs:subPropertyOf
> <http://www.siemens.com/euroCAT/2011/8/euroCAT.owl#Has_Finding>)
> (?pat rdf:type <http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#Patient>)
> (?pat <http://www.siemens.com/euroCAT/2011/8/euroCAT.owl#Has_Id> ?patId)
> (?finding rdf:type ?findingType)
> (?pat ?findingProp ?finding)
>
>
>
> This version fixed the ec:Has_Id problem since it is now located
> after the (?pat a nci:Patient) triple pattern. But we still have the problem
> with the last two lines. They should be reversed since
> (?pat ?findingProp ?finding) is way more restrictive than
> (?finding a ?findingType). The new stats file did however
> cause the (?findingProp rdfs:subPropertyOf ec:Has_Finding) triple
> pattern to be moved to the top.
>
> ?findingProp is bound to the predicates ec:Has_Dysnpea_Score and
> ec:Has_Dysphagia_Score, which are sub-properties of ec:Has_Finding.
The stats analysis is only partially dynamic. It runs at the start of a
basic graph pattern, not incrementally during a pattern. This matters
for TDB because TDB does not solve patterns using node values; it uses
internal NodeIds (which are 64 bits). It's a trade-off.
The system does not "know" that ?findingProp will be bound to only
ec:Has_Dysnpea_Score and ec:Has_Dysphagia_Score.
The optimizer has a step of merging blocks, so putting extra {} in the query
doesn't help. A trick is to put in a BIND:
SELECT *
WHERE
{
?findingProp rdfs:subPropertyOf ec:Has_Finding .
BIND(1 as ?x)
?pat rdf:type nci:Patient .
?pat ec:Has_Id ?patId .
?pat ?findingProp ?finding .
?finding rdf:type ?findingType
}
Then it executes the part after the BIND once for each value of ?findingProp,
as a Node, not at the TDB level -- if there are only two, that might be good.
The optimizer is not doing the best job it could because at the last
step (two remaining triples to order) it has:
3 -1 : TERM TERM ?finding
4 556563 : ?finding <::type> ?findingType
and TERM TERM ?finding (TERM means it is grounded at that point in the
execution) should be estimated as a small number; then it'll flip the last
two triples.
where the initial input is:
id count Pattern
0 100 : ?pat <::type> <::Patient>
1 290525 : ?pat <::Has_Id> ?patId
2 10 : ?findingProp <::subPropertyOf> <::Has_Finding>
3 -1 : ?pat ?findingProp ?finding
4 556563 : ?finding <::type> ?findingType
If you add
( (TERM TERM VAR) 2 )
to the stats rules, it seems to redo the order.
e.g. at the end
...
(<http://.../euroCAT.owl#Has_Frac_Cumulative_Dose> 22873)
( (TERM TERM VAR) 2 )
(other 0))
Issue recorded as JENA-460
It is the use of rdfs:subPropertyOf, which relates something in the
subject position to something in the property position, that is unusual
(several RDF systems assume the property is a constant in the query
and need to do a scan for a variable in the property position).
> The stats file shows entries for ec:Has_Dysnpea_Score and
> ec:Has_Dysphagia_Score (counts are 7 and 8). But it chooses to
> put (?pat ?findingProp ?finding) last and process
> (?finding a ?findingType) first, which matches pretty much
> everything in the entire store first since
> ?finding is not bound to something more specific yet.
> To sum things up: Regenerating the stats file after I imported all my
> data individuals solved the ec:Has_Id problem. But it does not address the
> ?findingProp problem.
>
> I hope this helps clarify the overall picture and thank you very
> much for your help!
>
> -Wolfgang
Andy
>
>
>
>
> -----Original Message-----
> From: Andy Seaborne <[email protected]>
> To: users <[email protected]>
> Sent: Thu, May 23, 2013 8:40 pm
> Subject: Re: Unexpectedly slow query
>
>
> Wolfgang,
>
> I'm confused as to what the setup is that you have. Let's step back and
> establish what the setup is here.
>
> 1/ The original query you sent was:
>
> SELECT *
> WHERE
> { ?pat rdf:type nci:Patient .
> ?pat ecIn:Has_Id ?patId .
> ?findingProp rdfs:subPropertyOf ec:Has_Finding .
> ?pat ?findingProp ?finding .
> ?finding rdf:type ?findingType
> }
>
> but your emails refer to ec:Has_Id (not ecIn:)
>
> ecIn: <http://www.siemens.com/euroCAT/2011/8/euroCAT.owl/instances#>
> ec: <http://www.siemens.com/euroCAT/2011/8/euroCAT.owl#>
>
> which is it?
>
> There are no properties 'euroCAT.owl/instances#' in stats file.
>
> 2/ The stats file is not generated by the recent releases, which handle
>
> ?? rdf:type :SomeType
>
> much better. You only need to regenerate the stats file using the
> latest release - it'll work well with previous releases.
>
> 3/ The database has no inference capabilities itself.
> What are you expecting the :Has_Finding to do to influence the rest of
> the query plan?
>
> Andy
>
> On 23/05/13 08:46, [email protected] wrote:
>> I am using Sparql queries via Dataset and QueryExecution. No in-memory
>> inference defined. The stats file with my predicates is attached. It has
>> those for Has_Id, Has_Dyspnea_Score and Has_Dysphagia_Score, but it
>> apparently cannot infer that the latter two are also Has_Finding and
>> therefore can be used to narrow down the result set.
>>
>> -Wolfgang
>>
>>
>>
>> -----Original Message-----
>> From: Andy Seaborne <[email protected]>
>> To: users <[email protected]>
>> Sent: Sat, May 18, 2013 1:17 pm
>> Subject: Re: Unexpectedly slow query
>>
>> On 17/05/13 13:27, [email protected] wrote:
>>>
>>>
>>> I ran tdbstats again on the fully loaded triple store (with all the patient
>> data as individuals and their relationships). My properties appear now in the
>> stats file. But only the properties with explicit triples, not the inferred
>> parent properties. E.g. I am using the following property type hierarchy:
>>
>> What is your inference setup?
>>
>>>
>>> Has_Finding
>>> - Has_Dysnpea_Score
>>> - Has_Dysphagia_Score
>>> - Is_Dead
>>>
>>>
>>> There are no explicit triples stating e.g. that a patient Has_Finding
>> Dyspnea_Score_2. But there are triples using the sub-properties, e.g. Patient
>> Has_Dysnpea_Score Dyspnea_Score_2.
>>
>> Can you share the stats file? I can't investigate the situation without
>> a test case.
>>
>>>
>>> The stats file now contains entries for the sub-properties, but not for
>> Has_Finding.
>>>
>>> The execution plan changed slightly though, but the crucial triple patterns
>> are still in the "wrong" order.
>>>
>>> It used to be:
>>> (?pat <http://www.siemens.com/euroCAT/2011/8/euroCAT.owl/instances#Has_Id>
>> ?patId)
>>
>> The original stats file had no mention of #Has_Id and actually said (at
>> the end) that missing predicates were to be counted as having zero
>> occurrences. The optimizer puts one of these first because the rest of
>> the pattern will never be reached if it's accurate.
>>
>>> (?pat rdf:type <http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#Patient>)
>>> (?findingProp rdfs:subPropertyOf
>>> <http://www.siemens.com/euroCAT/2011/8/euroCAT.owl#Has_Finding>)
>>> (?finding rdf:type ?findingType)
>>> (?pat ?findingProp ?finding)
>>>
>>> Now it is:
>>> (?findingProp rdfs:subPropertyOf
>>> <http://www.siemens.com/euroCAT/2011/8/euroCAT.owl#Has_Finding>)
>>> (?pat rdf:type <http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#Patient>)
>>> (?pat <http://www.siemens.com/euroCAT/2011/8/euroCAT.owl#Has_Id> ?patId)
>>> (?finding rdf:type ?findingType)
>>> (?pat ?findingProp ?finding)
>>
>>>
>>> The triple pattern for the Has_Finding sub-properties moved to the start,
but
>> the crucial (?finding rdf:type ?findingType) is still evaluated before (?pat
>> ?findingProp ?finding) -> the query is still taking a very long time.
>>>
>>> I can go ahead and use "fixed.opt" instead of "stats.opt", but I am still
>> interested in whether there is a solution to this problem. I am using Jena
>> 2.10.0.
>>>
>>> Hope this info helps!
>>>
>>> -Wolfgang
>>>
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: hueyl16 <[email protected]>
>>> To: users <[email protected]>
>>> Sent: Fri, May 17, 2013 1:43 pm
>>> Subject: Re: Unexpectedly slow query
>>>
>>>
>>> I was wondering about that too. I could only find entries related to NCI
>> terms. How or when is the stats file generated?
>>>
>>> I am using the .bat versions of the tdbloader for importing the NCIt first
> and
>> then my own ontology, which contains Has_Id and Has_Finding plus more.
>>>
>>>
>>> I also ran tdbstats once but it did not change the stats file, just printed
>> it.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Andy Seaborne <[email protected]>
>>> To: users <[email protected]>
>>> Sent: Fri, May 17, 2013 1:25 pm
>>> Subject: Re: Unexpectedly slow query
>>>
>>>
>>> (I now have the stats file)
>>>
>>> Wolfgang,
>>>
>>> I don't see entries for ec:Has_Id and ec:Has_Finding.
>>>
>>> Andy
>>>
>>>
>>>
>>>
>>>
>>>
>>
>
>
>
>