On 27/01/14 10:58, Ewa Szwed wrote:
Hi,
I am working on a project that utilizes Jena TDB store to host full
Freebase data set.
I am at a stage now that all the data is loaded and I have written a couple
of Sparql queries to extract information about Freebase topics.
What I am trying to do now is to improve the performance of some of the
queries I have written.
For example the query to extract all the children of all the people from
Freebase and to format the output as it is desired on our side is as
follows:
prefix fb: <http://rdf.freebase.com/ns/>
prefix fn: <http://www.w3.org/2005/xpath-functions#>
select ?entity ?mID ?children ?gender ?wikipedia_url ?dob
where {
?mID_raw fb:type.object.type fb:people.person .
?mID_raw fb:type.object.name ?entity .
?mID_raw fb:people.person.children ?child .
?mID_raw fb:common.topic.topic_equivalent_webpage ?wikipedia_url .
?child fb:type.object.name ?children .
?child fb:people.person.date_of_birth ?dob .
?child fb:people.person.gender ?gend .
?gend fb:type.object.name ?child_gender .
BIND(REPLACE(str(?mID_raw), "http://rdf.freebase.com/ns/", "") as ?mID)
BIND(REPLACE(?child_gender, "Male", "Son") AS ?child_gender_conv_1)
BIND(REPLACE(?child_gender_conv_1, "Female", "Daughter") AS ?gender)
FILTER (lang(?entity) = "en" && lang(?children) = "en" &&
lang(?child_gender) = "en" && regex (str(?wikipedia_url),
"en.wikipedia", "i") && !regex (str(?wikipedia_url), "curid=", "i")) .
}
order by ?dob
At the moment it takes almost 14 hours to execute this query.
In general, would it be expected time for such an extraction or could this
be 'somehow' improved?
Which version of the code are you running?
I have measured that when I remove all th>
e BINDings lines from this query,
my execution time gets reduced to 4 hours so I conclude that BINDing are
expensive in general.
BIND is not fundamental expensive but str(?mID_raw) is forcing the
engine to actually go and get the string of the URI. It didn't in the
pattern matching part. If the filter then rejects the row, you've gone
an expensive thign for no reason.
(You are running a version that is less aggregessive about filter
placement - it got improved recently).
Putting the parts of the FILTER before the BINDs, may help although its
all string bashing so it needs to go to the node dictionary to get the
string for the URI or literal.
Is there a way to replace BINDing with some other constructs to achieve the
same formatting but with with better performance?
Are there any 'best practices' to follows here in general?
I have also experimented with Jena optimizers and noticed that although the
stats optimizer is recommended one I tend to get 10% better performance
with the fixed (fixed.opt) one?
Is there any general rule which one should be used here?
What hardware do you have?
Lastly I have observed strange repeated lines in logs (with logging and
debugging turned on and running with tdbquery)
Query run: (notice limit 1)
ORDER ... LIMIT 1 is optimized but still requires the entire query to be
executed, then the LIMIT done.
(just LIMIT 1 is much faster).
prefix fb: <http://rdf.freebase.com/ns/>
prefix fn: <http://www.w3.org/2005/xpath-functions#>
select ?entity ?mID ?children ?gender ?wikipedia_url ?dob
where {
?mID_raw fb:type.object.type fb:people.person .
?mID_raw fb:type.object.name ?entity .
?mID_raw fb:people.person.children ?child .
?mID_raw fb:common.topic.topic_equivalent_webpage ?wikipedia_url .
?child fb:type.object.name ?children .
OPTIONAL{ ?child fb:people.person.date_of_birth ?dob .}
?child fb:people.person.gender ?gend .
?gend fb:type.object.name ?child_gender .
BIND(REPLACE(str(?mID_raw), "http://rdf.freebase.com/ns/", "") as ?mID)
BIND(REPLACE(?child_gender, "Male", "Son") AS ?child_gender_conv_1)
BIND(REPLACE(?child_gender_conv_1, "Female", "Daughter") AS ?gender)
FILTER (lang(?entity) = "en" && lang(?children) = "en" &&
lang(?child_gender) = "en" && regex (str(?wikipedia_url),
"en.wikipedia", "i") && !regex (str(?wikipedia_url), "curid=", "i")) .
}
order by ?dob
limit 1
Output:
Is repeated many many times.
Yes - it's a consequence of the index-join-like execution strategy.
Moving the OPTIONAL later, after the fixed matches, may help.
Is this expected and how this should be interpreted?
*In general I am trying to understand if I could optimize my freebase set
extraction queries in any possible way and would appreciate any
comment/feedback here.*
Regards,
Ewa
Andy