Hi,

I am working on a project that uses a Jena TDB store to host the full Freebase data set. All the data is now loaded, and I have written a couple of SPARQL queries to extract information about Freebase topics. I am now trying to improve the performance of some of these queries. For example, the query that extracts all the children of all the people in Freebase, and formats the output the way we need it on our side, is as follows:

    prefix fb: <http://rdf.freebase.com/ns/>
    prefix fn: <http://www.w3.org/2005/xpath-functions#>

    select ?entity ?mID ?children ?gender ?wikipedia_url ?dob
    where {
      ?mID_raw fb:type.object.type fb:people.person .
      ?mID_raw fb:type.object.name ?entity .
      ?mID_raw fb:people.person.children ?child .
      ?mID_raw fb:common.topic.topic_equivalent_webpage ?wikipedia_url .
      ?child fb:type.object.name ?children .
      ?child fb:people.person.date_of_birth ?dob .
      ?child fb:people.person.gender ?gend .
      ?gend fb:type.object.name ?child_gender .
      BIND(REPLACE(str(?mID_raw), "http://rdf.freebase.com/ns/", "") as ?mID)
      BIND(REPLACE(?child_gender, "Male", "Son") AS ?child_gender_conv_1)
      BIND(REPLACE(?child_gender_conv_1, "Female", "Daughter") AS ?gender)
      FILTER (lang(?entity) = "en"
              && lang(?children) = "en"
              && lang(?child_gender) = "en"
              && regex (str(?wikipedia_url), "en.wikipedia", "i")
              && !regex (str(?wikipedia_url), "curid=", "i")) .
    }
    order by ?dob

At the moment it takes almost 14 hours to execute this query. Is that the expected time for such an extraction, or could it be improved somehow? I have measured that when I remove all the BIND lines from this query, the execution time drops to 4 hours, so I conclude that BIND is expensive in general. Is there a way to replace BIND with some other construct that achieves the same formatting with better performance? Are there any best practices to follow here in general?

I have also experimented with the Jena optimizers and noticed that, although the stats optimizer is the recommended one, I tend to get 10% better performance with the fixed one (fixed.opt). Is there a general rule for which one should be used here?
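
To make the BIND question concrete, here is one variation that might be worth timing against the original. It is only a sketch: whether it runs any faster on TDB is an assumption that has not been measured. It swaps the regex-based REPLACE and regex calls for the plain string functions STRAFTER, IF and CONTAINS from SPARQL 1.1. The results are not strictly identical: ?gender comes back as a plain literal rather than an @en-tagged one, and CONTAINS matches "en.wikipedia" literally, whereas in the regex the "." matches any character.

```sparql
prefix fb: <http://rdf.freebase.com/ns/>

select ?entity ?mID ?children ?gender ?wikipedia_url ?dob
where {
  ?mID_raw fb:type.object.type fb:people.person .
  ?mID_raw fb:type.object.name ?entity .
  ?mID_raw fb:people.person.children ?child .
  ?mID_raw fb:common.topic.topic_equivalent_webpage ?wikipedia_url .
  ?child fb:type.object.name ?children .
  ?child fb:people.person.date_of_birth ?dob .
  ?child fb:people.person.gender ?gend .
  ?gend fb:type.object.name ?child_gender .

  # Plain substring after the namespace instead of a regex REPLACE.
  BIND(STRAFTER(STR(?mID_raw), "http://rdf.freebase.com/ns/") AS ?mID)

  # One conditional instead of two chained REPLACE calls.
  # Any other gender label is passed through unchanged.
  BIND(IF(STR(?child_gender) = "Male", "Son",
       IF(STR(?child_gender) = "Female", "Daughter",
          STR(?child_gender))) AS ?gender)

  FILTER (lang(?entity) = "en"
          && lang(?children) = "en"
          && lang(?child_gender) = "en"
          # Case-insensitive substring tests instead of regex(..., "i").
          && CONTAINS(LCASE(STR(?wikipedia_url)), "en.wikipedia")
          && !CONTAINS(LCASE(STR(?wikipedia_url)), "curid="))
}
order by ?dob
```

Running both versions with a small LIMIT first would show whether the output still matches what is needed before committing to a full run.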

Lastly, I have observed strange repeated lines in the logs (with logging and debugging turned on, and running with tdbquery).

Query run (notice the limit 1):

    prefix fb: <http://rdf.freebase.com/ns/>
    prefix fn: <http://www.w3.org/2005/xpath-functions#>

    select ?entity ?mID ?children ?gender ?wikipedia_url ?dob
    where {
      ?mID_raw fb:type.object.type fb:people.person .
      ?mID_raw fb:type.object.name ?entity .
      ?mID_raw fb:people.person.children ?child .
      ?mID_raw fb:common.topic.topic_equivalent_webpage ?wikipedia_url .
      ?child fb:type.object.name ?children .
      OPTIONAL{ ?child fb:people.person.date_of_birth ?dob .}
      ?child fb:people.person.gender ?gend .
      ?gend fb:type.object.name ?child_gender .
      BIND(REPLACE(str(?mID_raw), "http://rdf.freebase.com/ns/", "") as ?mID)
      BIND(REPLACE(?child_gender, "Male", "Son") AS ?child_gender_conv_1)
      BIND(REPLACE(?child_gender_conv_1, "Female", "Daughter") AS ?gender)
      FILTER (lang(?entity) = "en"
              && lang(?children) = "en"
              && lang(?child_gender) = "en"
              && regex (str(?wikipedia_url), "en.wikipedia", "i")
              && !regex (str(?wikipedia_url), "curid=", "i")) .
    }
    order by ?dob
    limit 1

Output:

    06:14:50 INFO exec :: ALGEBRA
      (slice _ 1
        (project (?entity ?mID ?children ?gender ?wikipedia_url ?dob)
          (filter (exprlist
                    (= (lang ?entity) "en")
                    (= (lang ?children) "en")
                    (= (lang ?child_gender) "en")
                    (regex (str ?wikipedia_url) "en.wikipedia" "i")
                    (! (regex (str ?wikipedia_url) "curid=" "i")))
            (extend ((?mID (replace (str ?mID_raw) "http://rdf.freebase.com/ns/" ""))
                     (?child_gender_conv_1 (replace ?child_gender "Male" "Son"))
                     (?gender (replace ?child_gender_conv_1 "Female" "Daughter")))
              (sequence
                (conditional
                  (quadpattern
                    (quad <urn:x-arq:DefaultGraphNode> ?mID_raw <http://rdf.freebase.com/ns/type.object.type> <http://rdf.freebase.com/ns/people.person>)
                    (quad <urn:x-arq:DefaultGraphNode> ?mID_raw <http://rdf.freebase.com/ns/type.object.name> ?entity)
                    (quad <urn:x-arq:DefaultGraphNode> ?mID_raw <http://rdf.freebase.com/ns/people.person.children> ?child)
                    (quad <urn:x-arq:DefaultGraphNode> ?mID_raw <http://rdf.freebase.com/ns/common.topic.topic_equivalent_webpage> ?wikipedia_url)
                    (quad <urn:x-arq:DefaultGraphNode> ?child <http://rdf.freebase.com/ns/type.object.name> ?children)
                  )
                  (quadpattern (quad <urn:x-arq:DefaultGraphNode> ?child <http://rdf.freebase.com/ns/people.person.date_of_birth> ?dob)))
                (quadpattern
                  (quad <urn:x-arq:DefaultGraphNode> ?child <http://rdf.freebase.com/ns/people.person.gender> ?gend)
                  (quad <urn:x-arq:DefaultGraphNode> ?gend <http://rdf.freebase.com/ns/type.object.name> ?child_gender)
                ))))))
    06:14:50 INFO exec :: Execute
      (?mID_raw <http://rdf.freebase.com/ns/type.object.type> <http://rdf.freebase.com/ns/people.person>)
      (?mID_raw <http://rdf.freebase.com/ns/type.object.name> ?entity)
      (?mID_raw <http://rdf.freebase.com/ns/people.person.children> ?child)
      (?mID_raw <http://rdf.freebase.com/ns/common.topic.topic_equivalent_webpage> ?wikipedia_url)
      (?child <http://rdf.freebase.com/ns/type.object.name> ?children)
    06:14:50 INFO exec :: Execute ::
      (<http://rdf.freebase.com/ns/m.0j2btth> <http://rdf.freebase.com/ns/people.person.date_of_birth> ?dob)
    06:14:50 INFO exec :: Execute
      (?child <http://rdf.freebase.com/ns/people.person.gender> ?gend)
      (?gend <http://rdf.freebase.com/ns/type.object.name> ?child_gender)
    06:14:50 INFO exec :: Execute ::
      (<http://rdf.freebase.com/ns/m.0j2btth> <http://rdf.freebase.com/ns/people.person.date_of_birth> ?dob)
    06:14:50 INFO exec :: Execute ::
      (<http://rdf.freebase.com/ns/m.0j2btth> <http://rdf.freebase.com/ns/people.person.date_of_birth> ?dob)
    06:14:50 INFO exec :: Execute ::
      (<http://rdf.freebase.com/ns/m.0j2btth> <http://rdf.freebase.com/ns/people.person.date_of_birth> ?dob)
    06:14:50 INFO exec :: Execute ::
      (<http://rdf.freebase.com/ns/m.0j2btth> <http://rdf.freebase.com/ns/people.person.date_of_birth> ?dob)
    06:14:50 INFO exec :: Execute ::
      (<http://rdf.freebase.com/ns/m.0j2btth> <http://rdf.freebase.com/ns/people.person.date_of_birth> ?dob)
    06:14:50 INFO exec :: Execute ::
      (<http://rdf.freebase.com/ns/m.0j2btth> <http://rdf.freebase.com/ns/people.person.date_of_birth> ?dob)
    06:14:50 INFO exec :: Execute ::
      (<http://rdf.freebase.com/ns/m.0j2btth> <http://rdf.freebase.com/ns/people.person.date_of_birth> ?dob)
    06:14:50 INFO exec :: Execute ::

This bit:

      (<http://rdf.freebase.com/ns/m.0j2btth> <http://rdf.freebase.com/ns/people.person.date_of_birth> ?dob)
    06:14:50 INFO exec :: Execute ::

is repeated many, many times. Is this expected, and how should it be interpreted?

*In general I am trying to understand whether I could optimize my Freebase extraction queries in any possible way, and would appreciate any comment/feedback here.*

Regards,
Ewa
