Comments inline:

On 27/01/2014 02:58, "Ewa Szwed" <[email protected]> wrote:

>Hi,
>I am working on a project that utilizes Jena TDB store to host full
>Freebase data set.
>I am at a stage now that all the data is loaded and I have written a
>couple
>of Sparql queries to extract information about Freebase topics.
>What I am trying to do now is to improve the performance of some of the
>queries I have written.
>For example the query to extract all the children of all the people from
>Freebase and to format the output as it is desired on our side is as
>follows:
>
>prefix fb: <http://rdf.freebase.com/ns/>
>prefix fn: <http://www.w3.org/2005/xpath-functions#>
>select ?entity ?mID ?children ?gender ?wikipedia_url ?dob
>where {
> ?mID_raw fb:type.object.type fb:people.person .
> ?mID_raw fb:type.object.name ?entity .
> ?mID_raw fb:people.person.children ?child .
> ?mID_raw fb:common.topic.topic_equivalent_webpage ?wikipedia_url .
> ?child fb:type.object.name ?children .
> ?child fb:people.person.date_of_birth ?dob .
> ?child fb:people.person.gender ?gend .
> ?gend fb:type.object.name ?child_gender .
> BIND(REPLACE(str(?mID_raw), "http://rdf.freebase.com/ns/", "") as
>?mID)
> BIND(REPLACE(?child_gender, "Male", "Son") AS ?child_gender_conv_1)
> BIND(REPLACE(?child_gender_conv_1, "Female", "Daughter") AS ?gender)
> FILTER (lang(?entity) = "en" && lang(?children) = "en" &&
>lang(?child_gender) = "en" && regex (str(?wikipedia_url),
>"en.wikipedia", "i") && !regex (str(?wikipedia_url), "curid=", "i")) .
>}
>order by ?dob
>
>At the moment it takes almost 14 hours to execute this query.
>In general, would it be expected time for such an extraction or could this
>be 'somehow' improved?

None of your FILTER conditions refer to variables assigned by your BIND statements, but you put your FILTER after your BINDs. While in principle this makes no difference (since both are scoped to the containing graph pattern), in reality it blocks some useful optimisations that ARQ can apply.

In practice, this means that the BIND expressions are calculated for all the intermediate results, even those that don't meet the FILTER condition. You get the following algebra (i.e. the execution plan) for this query:

(base <http://example/base/>
  (prefix ((fn: <http://www.w3.org/2005/xpath-functions#>)
           (fb: <http://rdf.freebase.com/ns/>))
    (project (?entity ?mID ?children ?gender ?wikipedia_url ?dob)
      (order (?dob)
        (filter (exprlist (= (lang ?entity) "en") (= (lang ?children) "en") (= (lang ?child_gender) "en") (regex (str ?wikipedia_url) "en.wikipedia" "i") (! (regex (str ?wikipedia_url) "curid=" "i")))
          (extend ((?mID (replace (str ?mID_raw) "http://rdf.freebase.com/ns/" "")) (?child_gender_conv_1 (replace ?child_gender "Male" "Son")) (?gender (replace ?child_gender_conv_1 "Female" "Daughter")))
            (bgp
              (triple ?mID_raw fb:type.object.type fb:people.person)
              (triple ?mID_raw fb:type.object.name ?entity)
              (triple ?mID_raw fb:people.person.children ?child)
              (triple ?mID_raw fb:common.topic.topic_equivalent_webpage ?wikipedia_url)
              (triple ?child fb:type.object.name ?children)
              (triple ?child fb:people.person.date_of_birth ?dob)
              (triple ?child fb:people.person.gender ?gend)
              (triple ?gend fb:type.object.name ?child_gender)
            )))))))

The presence of these BINDs also blocks ARQ's ability to push filters deeper into the query, which is going to hurt performance.

If you rewrite your query like so:

prefix fb: <http://rdf.freebase.com/ns/>
prefix fn: <http://www.w3.org/2005/xpath-functions#>
select ?entity ?mID ?children ?gender ?wikipedia_url ?dob
where {
  {
    ?mID_raw fb:type.object.type fb:people.person .
    ?mID_raw fb:type.object.name ?entity .
    ?mID_raw fb:people.person.children ?child .
    ?mID_raw fb:common.topic.topic_equivalent_webpage ?wikipedia_url .
    ?child fb:type.object.name ?children .
    ?child fb:people.person.date_of_birth ?dob .
    ?child fb:people.person.gender ?gend .
    ?gend fb:type.object.name ?child_gender .
    FILTER (lang(?entity) = "en" && lang(?children) = "en" && lang(?child_gender) = "en" && regex (str(?wikipedia_url), "en.wikipedia", "i") && !regex (str(?wikipedia_url), "curid=", "i")) .
  }
  BIND(REPLACE(str(?mID_raw), "http://rdf.freebase.com/ns/", "") as ?mID)
  BIND(REPLACE(?child_gender, "Male", "Son") AS ?child_gender_conv_1)
  BIND(REPLACE(?child_gender_conv_1, "Female", "Daughter") AS ?gender)
}
order by ?dob

I.e. enclose the basic graph pattern and the filter together in their own group, and then apply the BINDs afterwards.

Then you get the following query algebra:

(base <http://example/base/>
  (prefix ((fn: <http://www.w3.org/2005/xpath-functions#>)
           (fb: <http://rdf.freebase.com/ns/>))
    (project (?entity ?mID ?children ?gender ?wikipedia_url ?dob)
      (order (?dob)
        (extend ((?mID (replace (str ?mID_raw) "http://rdf.freebase.com/ns/" "")) (?child_gender_conv_1 (replace ?child_gender "Male" "Son")) (?gender (replace ?child_gender_conv_1 "Female" "Daughter")))
          (filter (exprlist (= (lang ?entity) "en") (= (lang ?children) "en") (= (lang ?child_gender) "en") (regex (str ?wikipedia_url) "en.wikipedia" "i") (! (regex (str ?wikipedia_url) "curid=" "i")))
            (bgp
              (triple ?mID_raw fb:type.object.type fb:people.person)
              (triple ?mID_raw fb:type.object.name ?entity)
              (triple ?mID_raw fb:people.person.children ?child)
              (triple ?mID_raw fb:common.topic.topic_equivalent_webpage ?wikipedia_url)
              (triple ?child fb:type.object.name ?children)
              (triple ?child fb:people.person.date_of_birth ?dob)
              (triple ?child fb:people.person.gender ?gend)
              (triple ?gend fb:type.object.name ?child_gender)
            )))))))

So with this query you force ARQ to do the filtering before it spends time on the BINDs, which contain the more expensive string manipulation expressions. This avoids wasting effort calculating those expressions on data that, in your original query, is only going to be thrown out by the filter.

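To make the ordering argument concrete, here is a minimal sketch in plain Python. It does not use Jena or ARQ at all: the row dictionaries, the function names and the call counter are all made up for illustration, and str.replace merely stands in for SPARQL's REPLACE. It only shows that applying the filter before the extend step gives the same rows while evaluating the string expression fewer times.

```python
# Illustrative sketch only: plain Python standing in for ARQ's filter/extend
# operators. All names and data here are hypothetical.
calls = {"replace": 0}


def counted_replace(text, old, new):
    """Stand-in for SPARQL REPLACE that counts how often it is evaluated."""
    calls["replace"] += 1
    return text.replace(old, new)


def extend(rows):
    """Like (extend ...): compute a new variable for every incoming row."""
    for row in rows:
        out = dict(row)
        out["gender"] = counted_replace(row["child_gender"], "Male", "Son")
        yield out


def keep_english(rows):
    """Like (filter ...): keep only rows whose language tag is 'en'."""
    return (row for row in rows if row["lang"] == "en")


def bind_then_filter(rows):
    # Shape of the original plan: extend every row, then filter.
    return keep_english(extend(rows))


def filter_then_bind(rows):
    # Shape of the rewritten plan: filter first, extend only the survivors.
    return extend(keep_english(rows))


def run(plan, rows):
    calls["replace"] = 0
    results = list(plan(rows))
    return results, calls["replace"]


rows = [
    {"child_gender": "Male", "lang": "en"},
    {"child_gender": "Male", "lang": "de"},
    {"child_gender": "Male", "lang": "fr"},
    {"child_gender": "Female", "lang": "en"},
]

slow_results, slow_calls = run(bind_then_filter, rows)
fast_results, fast_calls = run(filter_then_bind, rows)
print(slow_calls, fast_calls)
```

On a Freebase-sized dataset the proportion of rows discarded by the language and URL filters is likely to be far larger than in this toy input, which is where the saving would come from.
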
>
>I have measured that when I remove all the BINDings lines from this query,
>my execution time gets reduced to 4 hours so I conclude that BINDing are
>expensive in general.
>Is there a way to replace BINDing with some other constructs to achieve
>the
>same formatting but with with better performance?

Either try the changes I suggest above, or do the BINDs in your presentation layer (since they appear to be purely presentation logic).

>Are there any 'best practices' to follows here in general?
>
>I have also experimented with Jena optimizers and noticed that although
>the
>stats optimizer is recommended one I tend to get 10% better performance
>with the fixed (fixed.opt) one?
>Is there any general rule which one should be used here?

Fixed vs stats optimiser is always going to be data dependent. Particularly in cases where there is a large variety in the data, the fixed optimiser often turns out to be better than the stats-based optimiser. This is because the estimation error of a stats-based optimiser increases with the size of the graph pattern. See http://www.csd.uoc.gr/~hy561/papers/storageaccess/optimization/Characteristic%20Sets.pdf for a comparison of different stats-based optimisers and explanations of the estimation error problem.

>
>Lastly I have observed strange repeated lines in logs (with logging and
>debugging turned on and running with tdbquery)
>
>Query run: (notice limit 1)
>
>prefix fb: <http://rdf.freebase.com/ns/>
>prefix fn: <http://www.w3.org/2005/xpath-functions#>
>select ?entity ?mID ?children ?gender ?wikipedia_url ?dob
>where {
> ?mID_raw fb:type.object.type fb:people.person .
> ?mID_raw fb:type.object.name ?entity .
> ?mID_raw fb:people.person.children ?child .
> ?mID_raw fb:common.topic.topic_equivalent_webpage ?wikipedia_url .
> ?child fb:type.object.name ?children .
> OPTIONAL{ ?child fb:people.person.date_of_birth ?dob .}
> ?child fb:people.person.gender ?gend .
> ?gend fb:type.object.name ?child_gender .
> BIND(REPLACE(str(?mID_raw), "http://rdf.freebase.com/ns/", "") as
>?mID)
> BIND(REPLACE(?child_gender, "Male", "Son") AS ?child_gender_conv_1)
> BIND(REPLACE(?child_gender_conv_1, "Female", "Daughter") AS ?gender)
> FILTER (lang(?entity) = "en" && lang(?children) = "en" &&
>lang(?child_gender) = "en" && regex (str(?wikipedia_url),
>"en.wikipedia", "i") && !regex (str(?wikipedia_url), "curid=", "i")) .
>}
>order by ?dob
>limit 1
>
>Output:
>
> >06:14:50 INFO exec :: ALGEBRA
> > (slice _ 1
> > (project (?entity ?mID ?children ?gender ?wikipedia_url ?dob)
> > (filter (exprlist (= (lang ?entity) "en") (= (lang ?children)
> >"en") (= (lang ?child_gender) "en") (regex (str ?wikipedia_url)
> >"en.wikipedia" "i") (! (regex (str ?wikipedia_url) "curid=" "i")))
> > (extend ((?mID (replace (str ?mID_raw)
> >"http://rdf.freebase.com/ns/" "")) (?child_gender_conv_1 (replace
> >?child_gender "Male" "Son")) (?gender (replace ?child_gender_conv_1
> >"Female" "Daughter")))
> > (sequence
> > (conditional
> > (quadpattern
> > (quad <urn:x-arq:DefaultGraphNode> ?mID_raw
> ><http://rdf.freebase.com/ns/type.object.type>
> ><http://rdf.freebase.com/ns/people.person>)
> > (quad <urn:x-arq:DefaultGraphNode> ?mID_raw
> ><http://rdf.freebase.com/ns/type.object.name> ?entity)
> > (quad <urn:x-arq:DefaultGraphNode> ?mID_raw
> ><http://rdf.freebase.com/ns/people.person.children> ?child)
> > (quad <urn:x-arq:DefaultGraphNode> ?mID_raw
> ><http://rdf.freebase.com/ns/common.topic.topic_equivalent_webpage>
> >?wikipedia_url)
> > (quad <urn:x-arq:DefaultGraphNode> ?child
> ><http://rdf.freebase.com/ns/type.object.name> ?children)
> > )
> > (quadpattern (quad <urn:x-arq:DefaultGraphNode> ?child
> ><http://rdf.freebase.com/ns/people.person.date_of_birth> ?dob)))
> > (quadpattern
> > (quad <urn:x-arq:DefaultGraphNode> ?child
> ><http://rdf.freebase.com/ns/people.person.gender> ?gend)
> > (quad <urn:x-arq:DefaultGraphNode> ?gend
> ><http://rdf.freebase.com/ns/type.object.name> ?child_gender)
> > ))))))
> >06:14:50 INFO exec :: Execute
> >(?mID_raw <http://rdf.freebase.com/ns/type.object.type>
> ><http://rdf.freebase.com/ns/people.person>)
> >(?mID_raw <http://rdf.freebase.com/ns/type.object.name> ?entity)
> >(?mID_raw <http://rdf.freebase.com/ns/people.person.children> ?child)
> >(?mID_raw
> ><http://rdf.freebase.com/ns/common.topic.topic_equivalent_webpage>
> >?wikipedia_url)
> >(?child <http://rdf.freebase.com/ns/type.object.name> ?children)
> >06:14:50 INFO exec :: Execute ::
> >(<http://rdf.freebase.com/ns/m.0j2btth>
> ><http://rdf.freebase.com/ns/people.person.date_of_birth> ?dob)
> >06:14:50 INFO exec :: Execute
> >(?child <http://rdf.freebase.com/ns/people.person.gender> ?gend)
> >(?gend <http://rdf.freebase.com/ns/type.object.name> ?child_gender)
> >06:14:50 INFO exec :: Execute ::
> >(<http://rdf.freebase.com/ns/m.0j2btth>
> ><http://rdf.freebase.com/ns/people.person.date_of_birth> ?dob)
> >06:14:50 INFO exec :: Execute ::
> >(<http://rdf.freebase.com/ns/m.0j2btth>
> ><http://rdf.freebase.com/ns/people.person.date_of_birth> ?dob)
> >06:14:50 INFO exec :: Execute ::
> >(<http://rdf.freebase.com/ns/m.0j2btth>
> ><http://rdf.freebase.com/ns/people.person.date_of_birth> ?dob)
> >06:14:50 INFO exec :: Execute ::
> >(<http://rdf.freebase.com/ns/m.0j2btth>
> ><http://rdf.freebase.com/ns/people.person.date_of_birth> ?dob)
> >06:14:50 INFO exec :: Execute ::
> >(<http://rdf.freebase.com/ns/m.0j2btth>
> ><http://rdf.freebase.com/ns/people.person.date_of_birth> ?dob)
> >06:14:50 INFO exec :: Execute ::
> >(<http://rdf.freebase.com/ns/m.0j2btth>
> ><http://rdf.freebase.com/ns/people.person.date_of_birth> ?dob)
> >06:14:50 INFO exec :: Execute ::
>
>This bit:
>
> >(<http://rdf.freebase.com/ns/m.0j2btth>
> ><http://rdf.freebase.com/ns/people.person.date_of_birth> ?dob)
> >06:14:50 INFO exec :: Execute ::
>
>Is repeated many many times.
>
>Is this expected and how this should be interpreted?

This has to do with an optimisation that ARQ applies to implement an index join strategy. When you have sequences of patterns with common variables, rather than executing them entirely separately and then joining the results together, ARQ takes the output of one pattern and passes it to multiple executions of the next pattern. This tends to yield serious performance boosts, since it improves the selectivity of the subsequent pattern and reduces the amount of intermediate data that is brought into memory only to be thrown out.

>
>
>*In general I am trying to understand if I could optimize my freebase set
>extraction queries in any possible way and would appreciate any
>comment/feedback here.*

Another thing to note is that you are using TDB, so you need to be careful how you've allocated your RAM. TDB uses memory-mapped files, so all the data and indices are off-heap and the JVM heap is only used for intermediate results. Since ARQ uses streaming evaluation wherever possible, the heap memory requirement is generally relatively low. If you have set the JVM heap too large, you are going to force the OS to swap your JVM heap and TDB's memory-mapped files in and out of RAM, which is going to completely hose performance.

Some more details on your environment (OS, RAM, JVM allocation, dataset size as a triple/quad count) would be useful if you want more tips on configuring your system.

Rob

>
>Regards,
>
>Ewa
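
For anyone who wants to picture the index join strategy mentioned in the reply, here is a rough sketch in plain Python. It is not ARQ or TDB code: the toy triples, the dictionary index and the function names are all hypothetical. Each binding produced by the first pattern drives one small, selective lookup for the second pattern, which would also be consistent with seeing one "Execute" log line per incoming binding.

```python
from collections import defaultdict

# Toy data, made up for illustration: (subject, predicate, object) triples.
triples = [
    ("alice", "children", "bob"),
    ("alice", "children", "carol"),
    ("dave", "children", "erin"),
    ("bob", "dob", "1990-01-01"),
    ("carol", "dob", "1992-05-17"),
    ("zed", "dob", "1980-03-03"),
]

# A simple (subject, predicate) -> objects index, standing in for a store index.
index = defaultdict(list)
for s, p, o in triples:
    index[(s, p)].append(o)


def scan(predicate):
    """First pattern: ?parent <predicate> ?child, evaluated once."""
    return [(s, o) for s, p, o in triples if p == predicate]


def index_join():
    """Feed each binding of ?child into a fresh lookup of ?child dob ?dob."""
    lookups = 0
    results = []
    for parent, child in scan("children"):
        lookups += 1  # one execution of the second pattern per binding
        for dob in index.get((child, "dob"), []):
            results.append((parent, child, dob))
    return results, lookups


results, lookups = index_join()
print(results, lookups)
```

Note that the unrelated ("zed", "dob", ...) triple is never touched, because the second pattern is only ever evaluated with ?child already bound.
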
