Thanks Andy. Please find my comments inline. ---------- Forwarded message ---------- From: Andy Seaborne <[email protected]> Date: 2014-01-27 Subject: Re: Spaql queries optimizations for Freebase set To: [email protected]
On 27/01/14 10:58, Ewa Szwed wrote: > Hi, > I am working on a project that utilizes Jena TDB store to host full > Freebase data set. > I am at a stage now that all the data is loaded and I have written a couple > of Sparql queries to extract information about Freebase topics. > What I am trying to do now is to improve the performance of some of the > queries I have written. > For example the query to extract all the children of all the people from > Freebase and to format the output as it is desired on our side is as > follows: > > prefix fb: <http://rdf.freebase.com/ns/> > prefix fn: <http://www.w3.org/2005/xpath-functions#> > select ?entity ?mID ?children ?gender ?wikipedia_url ?dob > where { > ?mID_raw fb:type.object.type fb:people.person . > ?mID_raw fb:type.object.name ?entity . > ?mID_raw fb:people.person.children ?child . > ?mID_raw fb:common.topic.topic_equivalent_webpage ?wikipedia_url . > ?child fb:type.object.name ?children . > ?child fb:people.person.date_of_birth ?dob . > ?child fb:people.person.gender ?gend . > ?gend fb:type.object.name ?child_gender . > BIND(REPLACE(str(?mID_raw), "http://rdf.freebase.com/ns/", "") as > ?mID) > BIND(REPLACE(?child_gender, "Male", "Son") AS ?child_gender_conv_1) > BIND(REPLACE(?child_gender_conv_1, "Female", "Daughter") AS ?gender) > FILTER (lang(?entity) = "en" && lang(?children) = "en" && > lang(?child_gender) = "en" && regex (str(?wikipedia_url), > "en.wikipedia", "i") && !regex (str(?wikipedia_url), "curid=", "i")) . > } > order by ?dob > > At the moment it takes almost 14 hours to execute this query. > In general, would it be expected time for such an extraction or could this > be 'somehow' improved? > Which version of the code are you running? What do u mean? I am running apache jena 2.11 and jena fuseki 1.0. I have measured that when I remove all th> > e BINDings lines from this query, > my execution time gets reduced to 4 hours so I conclude that BINDing are > expensive in general. > BIND is not fundamental expensive but str(?mID_raw) is forcing the engine to actually go and get the string of the URI. It didn't in the pattern matching part. If the filter then rejects the row, you've gone an expensive thign for no reason. (You are running a version that is less aggregessive about filter placement - it got improved recently). Putting the parts of the FILTER before the BINDs, may help although its all string bashing so it needs to go to the node dictionary to get the string for the URI or literal. Is there a way to replace BINDing with some other constructs to achieve the > same formatting but with with better performance? > Are there any 'best practices' to follows here in general? > > I have also experimented with Jena optimizers and noticed that although the > stats optimizer is recommended one I tend to get 10% better performance > with the fixed (fixed.opt) one? > Is there any general rule which one should be used here? > What hardware do you have? I run this on a virtual machine with Linux Red Hat distribution and of 8 cores CPU, 64 GB RAM and 1.2 TB hard drive. > > Lastly I have observed strange repeated lines in logs (with logging and > debugging turned on and running with tdbquery) > > Query run: (notice limit 1) > ORDER ... LIMIT 1 is optimized but still requires the entire query to be executed, then the LIMIT done. (just LIMIT 1 is much faster). > prefix fb: <http://rdf.freebase.com/ns/> > prefix fn: <http://www.w3.org/2005/xpath-functions#> > select ?entity ?mID ?children ?gender ?wikipedia_url ?dob > where { > ?mID_raw fb:type.object.type fb:people.person . > ?mID_raw fb:type.object.name ?entity . > ?mID_raw fb:people.person.children ?child . > ?mID_raw fb:common.topic.topic_equivalent_webpage ?wikipedia_url . > ?child fb:type.object.name ?children . > OPTIONAL{ ?child fb:people.person.date_of_birth ?dob .} > ?child fb:people.person.gender ?gend . > ?gend fb:type.object.name ?child_gender . > BIND(REPLACE(str(?mID_raw), "http://rdf.freebase.com/ns/", "") as > ?mID) > BIND(REPLACE(?child_gender, "Male", "Son") AS ?child_gender_conv_1) > BIND(REPLACE(?child_gender_conv_1, "Female", "Daughter") AS ?gender) > FILTER (lang(?entity) = "en" && lang(?children) = "en" && > lang(?child_gender) = "en" && regex (str(?wikipedia_url), > "en.wikipedia", "i") && !regex (str(?wikipedia_url), "curid=", "i")) . > } > order by ?dob > limit 1 > > Output: > > > Is repeated many many times. > Yes - it's a consequence of the index-join-like execution strategy. Moving the OPTIONAL later, after the fixed matches, may help. Is this expected and how this should be interpreted? > > > *In general I am trying to understand if I could optimize my freebase set > > extraction queries in any possible way and would appreciate any > comment/feedback here.* > > Regards, > > Ewa > > Andy
