On 24/05/13 10:23, [email protected] wrote:
Sorry about the ecIn: vs. ec: namespaces/prefixes. I removed the ecIn namespace
in the process. But in general, the Has_Id property is not the crucial one,
even though it is also put into the wrong place in the execution plan.
My original query (modified for prefixes) is:
SELECT *
WHERE
{ ?pat rdf:type nci:Patient .
?pat ec:Has_Id ?patId .
?findingProp rdfs:subPropertyOf ec:Has_Finding .
?pat ?findingProp ?finding .
?finding rdf:type ?findingType
}
This to me also represents the most efficient order of triple patterns for the
execution plan. But the execution plan I initially got without regenerating the
stats file after inserting all my data triples was:
(?pat <http://www.siemens.com/euroCAT/2011/8/euroCAT.owl/instances#Has_Id>
?patId)
(?pat rdf:type <http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#Patient>)
(?findingProp rdfs:subPropertyOf
<http://www.siemens.com/euroCAT/2011/8/euroCAT.owl#Has_Finding>)
(?finding rdf:type ?findingType)
(?pat ?findingProp ?finding)
The main problem here is that (?pat ?findingProp ?finding) comes after
(?finding rdf:type ?findingType) .
The other problem is that (?pat ec:Has_Id ?patId) comes first, even though (?pat
rdf:type nci:Patient) is more restrictive.
After regenerating the stats file I got this:
That looks possible using a stats file generated by 2.10.1 - that isn't the
(?findingProp rdfs:subPropertyOf
<http://www.siemens.com/euroCAT/2011/8/euroCAT.owl#Has_Finding>)
(?pat rdf:type <http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#Patient>)
(?pat <http://www.siemens.com/euroCAT/2011/8/euroCAT.owl#Has_Id> ?patId)
(?finding rdf:type ?findingType)
(?pat ?findingProp ?finding)
This version fixed the ec:Has_Id problem, since it is now located
after the (?pat a nci:Patient) triple pattern. But we still have the problem
with the last two lines. They should be reversed since
(?pat ?findingProp ?finding) is way more restrictive than
(?finding a ?findingType). The new stats file did however
cause the (?findingProp rdfs:subPropertyOf ec:Has_Finding) triple
pattern to be moved to the top.
?findingProp is bound to the predicates ec:Has_Dysnpea_Score and
ec:Has_Dysphagia_Score, which are sub-properties of ec:Has_Finding.
The stats analysis is only partially dynamic. It runs at the start of a
basic graph pattern, not incrementally during a pattern. This matters
for TDB because TDB does not solve patterns using node values; it uses
internal NodeIds (which are 64 bits). It's a trade-off.
The system does not "know" that ?findingProp will be bound to only
ec:Has_Dysnpea_Score and ec:Has_Dysphagia_Score.
The optimizer has a step of merging blocks, so putting extra {} in the query
doesn't help. A trick is to put in a BIND:
SELECT *
WHERE
{
?findingProp rdfs:subPropertyOf ec:Has_Finding .
BIND(1 as ?x)
?pat rdf:type nci:Patient .
?pat ec:Has_Id ?patId .
?pat ?findingProp ?finding .
?finding rdf:type ?findingType
}
Then it executes the part after the BIND on each value of ?findingProp as a Node,
not at the TDB level -- if there are only two, that might be good.
The optimizer is not doing the best job it could because at the last
step (two remaining triples to order) it has:
3 -1 : TERM TERM ?finding
4 556563 : ?finding <::type> ?findingType
and TERM TERM ?finding (TERM means it is grounded at that point in the
execution) should estimate as a small number; then it'll flip the last
two triples.
where the initial input is:
id count Pattern
0 100 : ?pat <::type> <::Patient>
1 290525 : ?pat <::Has_Id> ?patId
2 10 : ?findingProp <::subPropertyOf> <::Has_Finding>
3 -1 : ?pat ?findingProp ?finding
4 556563 : ?finding <::type> ?findingType
If you add
( (TERM TERM VAR) 2 )
to the stats rules, it seems to redo the order.
e.g. at the end:
...
(<http://.../euroCAT.owl#Has_Frac_Cumulative_Dose> 22873)
( (TERM TERM VAR) 2 )
(other 0))
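To make the effect of that extra rule concrete, here is a minimal sketch of a greedy reordering loop. It is a toy model, not Jena's actual optimizer code: the function names are made up for illustration, a pattern with a stats count simply keeps that count even after its variables become bound, and a pattern with no estimate (-1) is deferred until nothing else is left. The counts are the ones from the table above.

```python
# Toy model of greedy triple-pattern reordering (NOT Jena's implementation).
# Counts come from the table in this thread; -1 means "no estimate".
PATTERNS = {
    0: (100,    ("?pat", "rdf:type", "nci:Patient")),
    1: (290525, ("?pat", "ec:Has_Id", "?patId")),
    2: (10,     ("?findingProp", "rdfs:subPropertyOf", "ec:Has_Finding")),
    3: (-1,     ("?pat", "?findingProp", "?finding")),
    4: (556563, ("?finding", "rdf:type", "?findingType")),
}

def is_var(term):
    return term.startswith("?")

def estimate(count, triple, bound, term_term_var=None):
    """Estimate one pattern given the variables bound so far."""
    if count >= 0:
        return count  # toy simplification: keep the stats count as-is
    # Classify each slot as TERM (constant or already bound) or VAR.
    shape = tuple("VAR" if is_var(t) and t not in bound else "TERM"
                  for t in triple)
    if shape == ("TERM", "TERM", "VAR") and term_term_var is not None:
        return term_term_var  # the added ((TERM TERM VAR) n) rule
    return -1  # still no estimate

def reorder(patterns, term_term_var=None):
    """Repeatedly pick the cheapest estimated pattern; unknowns go last."""
    remaining = dict(patterns)
    bound, order = set(), []
    while remaining:
        scored = {i: estimate(c, t, bound, term_term_var)
                  for i, (c, t) in remaining.items()}
        known = {i: w for i, w in scored.items() if w >= 0}
        pick = min(known, key=known.get) if known else min(scored)
        order.append(pick)
        _, triple = remaining.pop(pick)
        bound.update(t for t in triple if is_var(t))
    return order

print(reorder(PATTERNS))                   # [2, 0, 1, 4, 3]
print(reorder(PATTERNS, term_term_var=2))  # [2, 0, 3, 1, 4]
```

Without the rule the toy reproduces the reported plan, with pattern 3 last. With the rule, pattern 3 is picked before pattern 4. In this toy it also moves ahead of the Has_Id pattern, only because the toy never rescales a count once the subject is bound; the exact position in a real plan may differ.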
Issue recorded as JENA-460
It is the use of rdfs:subPropertyOf, which relates something in the
subject position to something in the property position, that is unusual
(several RDF systems assume the property is a constant in the query
and need to do a scan when there is a variable in the property position).
The stats file shows entries for ec:Has_Dysnpea_Score and
ec:Has_Dysphagia_Score (counts are 7 and 8). But it chooses to
put (?pat ?findingProp ?finding) last and process
(?finding a ?findingType) first, which matches pretty much
everything in the entire store first since
?finding is not bound to something more specific yet.
To sum things up: Regenerating the stats file after I imported all my
data individuals solved the ec:Has_Id problem. But it does not address the
?findingProp problem.
I hope this helps clarify the overall picture and thank you very
much for your help!
-Wolfgang
Andy
-----Original Message-----
From: Andy Seaborne <[email protected]>
To: users <[email protected]>
Sent: Thu, May 23, 2013 8:40 pm
Subject: Re: Unexpectedly slow query
Wolfgang,
I'm confused as to what the setup is that you have. Let's step back and
establish what the setup is here.
1/ The original query you sent was:
SELECT *
WHERE
{ ?pat rdf:type nci:Patient .
?pat ecIn:Has_Id ?patId .
?findingProp rdfs:subPropertyOf ec:Has_Finding .
?pat ?findingProp ?finding .
?finding rdf:type ?findingType
}
but your emails refer to ec:Has_Id (not ecIn:)
ecIn: <http://www.siemens.com/euroCAT/2011/8/euroCAT.owl/instances#>
ec: <http://www.siemens.com/euroCAT/2011/8/euroCAT.owl#>
which is it?
There are no 'euroCAT.owl/instances#' properties in the stats file.
2/ The stats file was not generated by a recent release; recent releases handle
?? rdf:type :SomeType
much better. You only need to regenerate the stats file using the
latest release - it'll work well with previous releases.
3/ The database has no inference capabilities itself.
What are you expecting the :Has_Finding to do to influence the rest of
the query plan?
Andy
On 23/05/13 08:46, [email protected] wrote:
I am using SPARQL queries via Dataset and QueryExecution. No in-memory
inference defined. The stats file with my predicates is attached. It has
those for Has_Id, Has_Dyspnea_Score and Has_Dysphagia_Score, but it
apparently cannot infer that the latter two are also Has_Finding and
therefore can be used to narrow down the result set.
-Wolfgang
-----Original Message-----
From: Andy Seaborne <[email protected]>
To: users <[email protected]>
Sent: Sat, May 18, 2013 1:17 pm
Subject: Re: Unexpectedly slow query
On 17/05/13 13:27, [email protected] wrote:
I ran tdbstats again on the fully loaded triple store (with all the patient
data as individuals and their relationships). My properties appear now in the
stats file. But only the properties with explicit triples, not the inferred
parent properties. E.g. I am using the following property type hierarchy:
What is your inference setup?
Has_Finding
- Has_Dysnpea_Score
- Has_Dysphagia_Score
- Is_Dead
There are no explicit triples stating e.g. that a patient Has_Finding
Dyspnea_Score_2. But there are triples using the sub-properties, e.g. Patient
Has_Dysnpea_Score Dyspnea_Score_2.
Can you share the stats file? I can't investigate the situation without
a test case.
The stats file now contains entries for the sub-properties, but not for
Has_Finding.
The execution plan changed slightly though, but the crucial triple patterns
are still in the "wrong" order.
It used to be:
(?pat <http://www.siemens.com/euroCAT/2011/8/euroCAT.owl/instances#Has_Id>
?patId)
The original stats file had no mention of #Has_Id and actually said (at
the end) that missing predicates were to be counted as having zero
occurrences. The optimizer puts one of these first because the rest of
the pattern will never be reached if it's accurate.
(?pat rdf:type <http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#Patient>)
(?findingProp rdfs:subPropertyOf
<http://www.siemens.com/euroCAT/2011/8/euroCAT.owl#Has_Finding>)
(?finding rdf:type ?findingType)
(?pat ?findingProp ?finding)
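The remark above about missing predicates counting as zero can be illustrated with a small sketch. It is a toy lookup, not Jena's stats matcher: the table and function name are hypothetical, the two counts are borrowed from elsewhere in this thread purely for illustration, and the catch-all entry stands in for the (other 0) rule at the end of the stats file.

```python
# Toy predicate-count lookup with a catch-all rule (NOT Jena's stats matcher).
# The counts are illustrative figures borrowed from this thread.
STATS = {
    "rdf:type": 556563,
    "rdfs:subPropertyOf": 10,
    "other": 0,  # stands in for the (other 0) rule
}

def predicate_estimate(predicate, stats):
    """Return the count for a predicate, falling back to the 'other' rule."""
    return stats.get(predicate, stats["other"])

patterns = [
    ("?pat", "rdf:type", "nci:Patient"),
    ("?pat", "ecIn:Has_Id", "?patId"),  # predicate not listed in STATS
    ("?findingProp", "rdfs:subPropertyOf", "ec:Has_Finding"),
]

# The pattern with the cheapest estimate is placed first.
first = min(patterns, key=lambda t: predicate_estimate(t[1], STATS))
print(first)  # ('?pat', 'ecIn:Has_Id', '?patId')
```

Because the unlisted predicate falls through to the zero estimate, it sorts ahead of everything else, which is consistent with the Has_Id pattern appearing first in the original plan.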
Now it is:
(?findingProp rdfs:subPropertyOf
<http://www.siemens.com/euroCAT/2011/8/euroCAT.owl#Has_Finding>)
(?pat rdf:type <http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#Patient>)
(?pat <http://www.siemens.com/euroCAT/2011/8/euroCAT.owl#Has_Id> ?patId)
(?finding rdf:type ?findingType)
(?pat ?findingProp ?finding)
The triple pattern for the Has_Finding sub-properties moved to the start, but
the crucial (?finding rdf:type ?findingType) is still evaluated before (?pat
?findingProp ?finding) -> the query is still taking a very long time.
I can go ahead and use "fixed.opt" instead of "stats.opt", but I am still
interested in whether there is a solution to this problem. I am using Jena
2.10.0.
Hope this info helps!
-Wolfgang
-----Original Message-----
From: hueyl16 <[email protected]>
To: users <[email protected]>
Sent: Fri, May 17, 2013 1:43 pm
Subject: Re: Unexpectedly slow query
I was wondering about that too. I could only find entries related to NCI
terms. How or when is the stats file generated?
I am using the .bat versions of the tdbloader for importing the NCIt first and
then my own ontology, which contains Has_Id and Has_Finding plus more.
I also ran tdbstats once but it did not change the stats file, just printed
it.
-----Original Message-----
From: Andy Seaborne <[email protected]>
To: users <[email protected]>
Sent: Fri, May 17, 2013 1:25 pm
Subject: Re: Unexpectedly slow query
(I now have the stats file)
Wolfgang,
I don't see entries for ec:Has_Id and ec:Has_Finding.
Andy