Re: SPARQL queries optimizations for Freebase set

Rob Vesse Mon, 27 Jan 2014 09:25:00 -0800

Comments inline:

On 27/01/2014 02:58, "Ewa Szwed" <[email protected]> wrote:


>Hi,
>I am working on a project that utilizes Jena TDB store to host full
>Freebase data set.
>I am at a stage now that all the data is loaded and I have written a
>couple
>of Sparql queries to extract information about Freebase topics.
>What I am trying to do now is to improve the performance of some of the
>queries I have written.
>For example the query to extract all the children of all the people from
>Freebase and to format the output as it is desired on our side is as
>follows:
>
>prefix fb: <http://rdf.freebase.com/ns/>
>prefix fn: <http://www.w3.org/2005/xpath-functions#>
>select ?entity ?mID ?children ?gender ?wikipedia_url ?dob
>where {
>    ?mID_raw fb:type.object.type fb:people.person .
>    ?mID_raw fb:type.object.name ?entity .
>    ?mID_raw fb:people.person.children ?child .
>    ?mID_raw fb:common.topic.topic_equivalent_webpage ?wikipedia_url .
>    ?child fb:type.object.name ?children .
>    ?child fb:people.person.date_of_birth ?dob .
>    ?child fb:people.person.gender ?gend .
>    ?gend fb:type.object.name ?child_gender .
>    BIND(REPLACE(str(?mID_raw), "http://rdf.freebase.com/ns/", "") as
>?mID)
>    BIND(REPLACE(?child_gender, "Male", "Son") AS ?child_gender_conv_1)
>    BIND(REPLACE(?child_gender_conv_1, "Female", "Daughter") AS ?gender)
>    FILTER (lang(?entity) = "en" && lang(?children) = "en" &&
>lang(?child_gender) = "en" && regex (str(?wikipedia_url),
>"en.wikipedia", "i") && !regex (str(?wikipedia_url), "curid=", "i")) .
>}
>order by ?dob
>
>At the moment it takes almost 14 hours to execute this query.
>In general, would it be expected time for such an extraction or could this
>be 'somehow' improved?

None of your FILTER conditions refer to variables introduced by your BIND
statements, yet you put your FILTER after your BINDs.  While in principle
this makes no difference (since both are scoped to the containing graph
pattern), in reality it blocks some useful optimisations that ARQ can
apply.

In practice this means that the BIND expressions are calculated for
all the intermediate results, even those that don't meet the FILTER
condition.  You get the following algebra (i.e. the execution plan) for
this query:

(base <http://example/base/>
 (prefix ((fn: <http://www.w3.org/2005/xpath-functions#>)
  (fb: <http://rdf.freebase.com/ns/>))
 (project (?entity ?mID ?children ?gender ?wikipedia_url ?dob)
  (order (?dob)
   (filter (exprlist (= (lang ?entity) "en") (= (lang ?children) "en") (=
(lang ?child_gender) "en") (regex (str ?wikipedia_url) "en.wikipedia" "i")
(! (regex (str ?wikipedia_url) "curid=" "i")))
    (extend ((?mID (replace (str ?mID_raw) "http://rdf.freebase.com/ns/"
"")) (?child_gender_conv_1 (replace ?child_gender "Male" "Son")) (?gender
(replace ?child_gender_conv_1 "Female" "Daughter")))
     (bgp
      (triple ?mID_raw fb:type.object.type fb:people.person)
      (triple ?mID_raw fb:type.object.name ?entity)
      (triple ?mID_raw fb:people.person.children ?child)
      (triple ?mID_raw fb:common.topic.topic_equivalent_webpage
?wikipedia_url)
      (triple ?child fb:type.object.name ?children)
      (triple ?child fb:people.person.date_of_birth ?dob)
      (triple ?child fb:people.person.gender ?gend)
      (triple ?gend fb:type.object.name ?child_gender)
 )))))))


The presence of these BINDs also blocks ARQ's ability to push filters
deeper into the query, which is going to hurt performance.

If you rewrite your query like so:

prefix fb: <http://rdf.freebase.com/ns/>
prefix fn: <http://www.w3.org/2005/xpath-functions#>
select ?entity ?mID ?children ?gender ?wikipedia_url ?dob
where {
 {
  ?mID_raw fb:type.object.type fb:people.person .
  ?mID_raw fb:type.object.name ?entity .
  ?mID_raw fb:people.person.children ?child .
  ?mID_raw fb:common.topic.topic_equivalent_webpage ?wikipedia_url .
  ?child fb:type.object.name ?children .
  ?child fb:people.person.date_of_birth ?dob .
  ?child fb:people.person.gender ?gend .
  ?gend fb:type.object.name ?child_gender .
  FILTER (lang(?entity) = "en" && lang(?children) = "en" &&
  lang(?child_gender) = "en" && regex (str(?wikipedia_url),
  "en.wikipedia", "i") && !regex (str(?wikipedia_url), "curid=", "i")) .
 }
 BIND(REPLACE(str(?mID_raw), "http://rdf.freebase.com/ns/", "") as ?mID)
 BIND(REPLACE(?child_gender, "Male", "Son") AS ?child_gender_conv_1)
 BIND(REPLACE(?child_gender_conv_1, "Female", "Daughter") AS ?gender)
}
order by ?dob


I.e. enclose the basic graph pattern and the FILTER together in their own
group and then apply the BINDs afterwards.

Then you get the following query algebra:

(base <http://example/base/>
 (prefix ((fn: <http://www.w3.org/2005/xpath-functions#>)
  (fb: <http://rdf.freebase.com/ns/>))
 (project (?entity ?mID ?children ?gender ?wikipedia_url ?dob)
  (order (?dob)
   (extend ((?mID (replace (str ?mID_raw) "http://rdf.freebase.com/ns/"
"")) (?child_gender_conv_1 (replace ?child_gender "Male" "Son")) (?gender
(replace ?child_gender_conv_1 "Female" "Daughter")))
    (filter (exprlist (= (lang ?entity) "en") (= (lang ?children) "en") (=
(lang ?child_gender) "en") (regex (str ?wikipedia_url) "en.wikipedia" "i")
(! (regex (str ?wikipedia_url) "curid=" "i")))
     (bgp
      (triple ?mID_raw fb:type.object.type fb:people.person)
      (triple ?mID_raw fb:type.object.name ?entity)
      (triple ?mID_raw fb:people.person.children ?child)
      (triple ?mID_raw fb:common.topic.topic_equivalent_webpage
?wikipedia_url)
      (triple ?child fb:type.object.name ?children)
      (triple ?child fb:people.person.date_of_birth ?dob)
      (triple ?child fb:people.person.gender ?gend)
      (triple ?gend fb:type.object.name ?child_gender)
     )))))))


So with this query you force ARQ to do the filtering before you spend the
time doing the BINDs, which contain the more expensive string manipulation
expressions.  This avoids wasting effort calculating these expressions on
data that, in your original query, is only going to be thrown out by the
filter.
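
As a rough illustration of why the ordering matters (a plain Python
sketch, not ARQ code: the rows and IDs are made up, and the filter uses
simple substring tests rather than regex), counting how often the
formatting step runs in each ordering:

```python
NS = "http://rdf.freebase.com/ns/"

def passes_filter(row):
    # Simplified stand-in for the FILTER (substring tests, not regex).
    url = row["wikipedia_url"].lower()
    return row["lang"] == "en" and "en.wikipedia" in url and "curid=" not in url

def format_row(row, counter):
    # Stand-in for the BIND/REPLACE work; counter[0] records how often it runs.
    counter[0] += 1
    return {**row, "mID": row["mID_raw"].replace(NS, "")}

def bind_then_filter(rows):
    # Original query shape: format every row, then discard most of them.
    counter = [0]
    formatted = [format_row(r, counter) for r in rows]
    return [r for r in formatted if passes_filter(r)], counter[0]

def filter_then_bind(rows):
    # Rewritten query shape: discard first, format only the survivors.
    counter = [0]
    kept = [r for r in rows if passes_filter(r)]
    return [format_row(r, counter) for r in kept], counter[0]

# Made-up rows for illustration only.
rows = [
    {"mID_raw": NS + "m.01", "lang": "en",
     "wikipedia_url": "http://en.wikipedia.org/wiki/A"},
    {"mID_raw": NS + "m.02", "lang": "de",
     "wikipedia_url": "http://de.wikipedia.org/wiki/B"},
    {"mID_raw": NS + "m.03", "lang": "en",
     "wikipedia_url": "http://en.wikipedia.org/wiki/index.html?curid=3"},
]

slow_result, slow_calls = bind_then_filter(rows)
fast_result, fast_calls = filter_then_bind(rows)
print(slow_calls, fast_calls, slow_result == fast_result)
```

Both orderings return the same rows; only the number of formatting calls
differs, and that gap grows with the share of rows the filter rejects.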


>
>I have measured that when I remove all the BINDings lines from this query,
>my execution time gets reduced to 4 hours so I conclude that BINDing are
>expensive in general.
>Is there a way to replace BINDing with some other constructs to achieve
>the
>same formatting but with better performance?

Either try the changes I suggest above or do the BINDs in your
presentation layer (since they appear to be purely presentation logic).
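
A minimal sketch of what that could look like on the client side,
assuming each result row reaches your code as a dict of plain strings
(the `present` helper and the sample values are hypothetical, not part of
any Jena API):

```python
NS = "http://rdf.freebase.com/ns/"
GENDER_LABELS = {"Male": "Son", "Female": "Daughter"}

def present(row):
    """Apply the formatting the three BINDs did, to one result row."""
    m_id = row["mID_raw"]
    if m_id.startswith(NS):
        m_id = m_id[len(NS):]  # strip the namespace prefix
    # Exact-match lookup; unknown labels pass through unchanged.
    gender = GENDER_LABELS.get(row["child_gender"], row["child_gender"])
    return {
        "entity": row["entity"],
        "mID": m_id,
        "children": row["children"],
        "gender": gender,
        "wikipedia_url": row["wikipedia_url"],
        "dob": row["dob"],
    }

# Made-up row for illustration only.
example = {
    "mID_raw": NS + "m.0abc",
    "entity": "Parent Name",
    "children": "Child Name",
    "child_gender": "Female",
    "wikipedia_url": "http://en.wikipedia.org/wiki/Parent_Name",
    "dob": "1990-01-01",
}
formatted = present(example)
print(formatted["mID"], formatted["gender"])
```

Note that the dictionary lookup matches the whole label, whereas REPLACE
substitutes within the string, so the two only agree when the gender
label is exactly "Male" or "Female".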

>Are there any 'best practices' to follow here in general?
>
>I have also experimented with Jena optimizers and noticed that although
>the
>stats optimizer is recommended one I tend to get 10% better performance
>with the fixed (fixed.opt) one?
>Is there any general rule which one should be used here?

Fixed vs stats optimiser is always going to be data dependent.
Particularly in cases where there is a large variety in the data, the
fixed optimiser often turns out to be better than the stats-based
optimiser.

This is because the estimation error of a stats-based optimiser increases
with the size of the graph pattern; see
http://www.csd.uoc.gr/~hy561/papers/storageaccess/optimization/Characteristic%20Sets.pdf
for some comparison of different stats-based optimisers and
explanations of the estimation error problem.
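
To give a feel for how quickly such errors can grow, here is a
deliberately simplified piece of arithmetic (my own assumption for
illustration, not a result from the paper): suppose every triple
pattern's estimate is off by the same factor and the optimiser multiplies
the estimates together as if the patterns were independent.

```python
def compounded_error(per_pattern_factor, n_patterns):
    """Overall misestimation factor if each pattern is off by the same factor."""
    return per_pattern_factor ** n_patterns

# 8 is the number of triple patterns in the query above; 1.5 is an
# arbitrary illustrative per-pattern error factor.
for n in (1, 4, 8):
    print(n, round(compounded_error(1.5, n), 2))
```

Under these assumptions a modest per-pattern error compounds into a large
error for an eight-pattern query, which is one way a fixed ordering can
end up beating a stats-driven one.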

>
>Lastly I have observed strange repeated lines in logs (with logging and
>debugging turned on and running with tdbquery)
>
>Query run: (notice limit 1)
>
>prefix fb: <http://rdf.freebase.com/ns/>
>prefix fn: <http://www.w3.org/2005/xpath-functions#>
>select ?entity ?mID ?children ?gender ?wikipedia_url ?dob
>where {
>    ?mID_raw fb:type.object.type fb:people.person .
>    ?mID_raw fb:type.object.name ?entity .
>    ?mID_raw fb:people.person.children ?child .
>    ?mID_raw fb:common.topic.topic_equivalent_webpage ?wikipedia_url .
>    ?child fb:type.object.name ?children .
>    OPTIONAL{ ?child fb:people.person.date_of_birth ?dob .}
>    ?child fb:people.person.gender ?gend .
>    ?gend fb:type.object.name ?child_gender .
>    BIND(REPLACE(str(?mID_raw), "http://rdf.freebase.com/ns/", "") as
>?mID)
>    BIND(REPLACE(?child_gender, "Male", "Son") AS ?child_gender_conv_1)
>    BIND(REPLACE(?child_gender_conv_1, "Female", "Daughter") AS ?gender)
>    FILTER (lang(?entity) = "en" && lang(?children) = "en" &&
>lang(?child_gender) = "en" && regex (str(?wikipedia_url),
>"en.wikipedia", "i") && !regex (str(?wikipedia_url), "curid=", "i")) .
>}
>order by ?dob
>limit 1
>
>Output:
>
>06:14:50 INFO  exec                 :: ALGEBRA
>   (slice _ 1
>     (project (?entity ?mID ?children ?gender ?wikipedia_url ?dob)
>       (filter (exprlist (= (lang ?entity) "en") (= (lang ?children)
>"en") (= (lang ?child_gender) "en") (regex (str ?wikipedia_url)
>"en.wikipedia" "i") (! (regex (str ?wikipedia_url) "curid=" "i")))
>         (extend ((?mID (replace (str ?mID_raw)
>"http://rdf.freebase.com/ns/" "")) (?child_gender_conv_1 (replace
>?child_gender "Male" "Son")) (?gender (replace ?child_gender_conv_1
>"Female" "Daughter")))
>           (sequence
>             (conditional
>               (quadpattern
>                 (quad <urn:x-arq:DefaultGraphNode> ?mID_raw
><http://rdf.freebase.com/ns/type.object.type>
><http://rdf.freebase.com/ns/people.person>)
>                 (quad <urn:x-arq:DefaultGraphNode> ?mID_raw
><http://rdf.freebase.com/ns/type.object.name> ?entity)
>                 (quad <urn:x-arq:DefaultGraphNode> ?mID_raw
><http://rdf.freebase.com/ns/people.person.children> ?child)
>                 (quad <urn:x-arq:DefaultGraphNode> ?mID_raw
><http://rdf.freebase.com/ns/common.topic.topic_equivalent_webpage>
>?wikipedia_url)
>                 (quad <urn:x-arq:DefaultGraphNode> ?child
><http://rdf.freebase.com/ns/type.object.name> ?children)
>               )
>               (quadpattern (quad <urn:x-arq:DefaultGraphNode> ?child
><http://rdf.freebase.com/ns/people.person.date_of_birth> ?dob)))
>             (quadpattern
>               (quad <urn:x-arq:DefaultGraphNode> ?child
><http://rdf.freebase.com/ns/people.person.gender> ?gend)
>               (quad <urn:x-arq:DefaultGraphNode> ?gend
><http://rdf.freebase.com/ns/type.object.name> ?child_gender)
>             ))))))
>06:14:50 INFO  exec                 :: Execute
>(?mID_raw <http://rdf.freebase.com/ns/type.object.type>
><http://rdf.freebase.com/ns/people.person>)
>(?mID_raw <http://rdf.freebase.com/ns/type.object.name> ?entity)
>(?mID_raw <http://rdf.freebase.com/ns/people.person.children> ?child)
>(?mID_raw
><http://rdf.freebase.com/ns/common.topic.topic_equivalent_webpage>
>?wikipedia_url)
>(?child <http://rdf.freebase.com/ns/type.object.name> ?children)
>06:14:50 INFO  exec                 :: Execute ::
>(<http://rdf.freebase.com/ns/m.0j2btth>
><http://rdf.freebase.com/ns/people.person.date_of_birth> ?dob)
>06:14:50 INFO  exec                 :: Execute
>(?child <http://rdf.freebase.com/ns/people.person.gender> ?gend)
>(?gend <http://rdf.freebase.com/ns/type.object.name> ?child_gender)
>06:14:50 INFO  exec                 :: Execute ::
>(<http://rdf.freebase.com/ns/m.0j2btth>
><http://rdf.freebase.com/ns/people.person.date_of_birth> ?dob)
>06:14:50 INFO  exec                 :: Execute ::
>(<http://rdf.freebase.com/ns/m.0j2btth>
><http://rdf.freebase.com/ns/people.person.date_of_birth> ?dob)
>06:14:50 INFO  exec                 :: Execute ::
>(<http://rdf.freebase.com/ns/m.0j2btth>
><http://rdf.freebase.com/ns/people.person.date_of_birth> ?dob)
>06:14:50 INFO  exec                 :: Execute ::
>(<http://rdf.freebase.com/ns/m.0j2btth>
><http://rdf.freebase.com/ns/people.person.date_of_birth> ?dob)
>06:14:50 INFO  exec                 :: Execute ::
>(<http://rdf.freebase.com/ns/m.0j2btth>
><http://rdf.freebase.com/ns/people.person.date_of_birth> ?dob)
>06:14:50 INFO  exec                 :: Execute ::
>(<http://rdf.freebase.com/ns/m.0j2btth>
><http://rdf.freebase.com/ns/people.person.date_of_birth> ?dob)
>06:14:50 INFO  exec                 :: Execute ::
>
>
>This bit:
>
>(<http://rdf.freebase.com/ns/m.0j2btth>
><http://rdf.freebase.com/ns/people.person.date_of_birth> ?dob)
>06:14:50 INFO  exec                 :: Execute ::
>
>is repeated many, many times.
>
>Is this expected, and how should it be interpreted?

This has to do with an optimisation that ARQ does to implement an index
join strategy.  When you have sequences of patterns with common variables,
rather than executing them entirely separately and then joining them
together, ARQ takes each solution from one pattern and substitutes it into
a separate execution of the next pattern, which is why you see the same
pattern executed many times with ?child already bound.  This tends to
yield serious performance boosts since it improves the selectivity of the
subsequent pattern and reduces the amount of intermediate data that is
brought into memory only to be thrown out.
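
A toy version of the idea, in plain Python rather than ARQ's actual
implementation (the triples, IDs and the `index_join` helper are all made
up for illustration, and it does an inner join rather than the OPTIONAL
in your query):

```python
from collections import defaultdict

def build_index(triples):
    """Index objects by (subject, predicate) so a bound subject is a direct lookup."""
    index = defaultdict(list)
    for s, p, o in triples:
        index[(s, p)].append(o)
    return index

def index_join(left_rows, var, predicate, out_var, index, log):
    """Run the right-hand pattern once per incoming row, with `var` already bound."""
    for row in left_rows:
        subject = row[var]
        log.append((subject, predicate))  # comparable to one "Execute ::" entry
        for obj in index.get((subject, predicate), []):
            yield {**row, out_var: obj}

triples = [
    ("m.child1", "date_of_birth", "1990-01-01"),
    ("m.child1", "gender", "m.female"),
]
index = build_index(triples)

# The same child can appear in several incoming rows, e.g. one per URL.
left_rows = [
    {"child": "m.child1", "wikipedia_url": "url-a"},
    {"child": "m.child1", "wikipedia_url": "url-b"},
    {"child": "m.child2", "wikipedia_url": "url-c"},
]

log = []
results = list(index_join(left_rows, "child", "date_of_birth", "dob", index, log))
for entry in log:
    print(entry)
```

The log ends up with one entry per incoming row, so a child that appears
in several rows produces the same bound pattern several times, which
would be consistent with the repeated lines in your output.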

>
>
>*In general I am trying to understand if I could optimize my freebase set
>extraction queries in any possible way and would appreciate any
>comment/feedback here.*

Another thing to note is that you are using TDB, so you need to be careful
how you've allocated your RAM.  TDB uses memory mapped files, so all the
data and indices are off-heap and the JVM heap is only used for
intermediate results.  Since ARQ uses streaming evaluation wherever
possible, the heap memory requirement is generally relatively low.

If you have set the JVM heap too large you are going to be forcing the OS
to swap your JVM heap and TDB's memory mapped files in and out of RAM,
which is going to completely hose performance.  Some more details on your
environment - OS, RAM, JVM allocation, dataset size (triple/quad count) -
would be useful if you want more tips on configuring your system.

Rob

>
>Regards,
>
>Ewa
