Hi,
I am working on a project that uses a Jena TDB store to host the full
Freebase data set.
All the data is now loaded, and I have written a couple of SPARQL queries
to extract information about Freebase topics.
I am now trying to improve the performance of some of these queries.
For example, the following query extracts the children of every person in
Freebase and formats the output the way we need it:

prefix fb: <http://rdf.freebase.com/ns/>
prefix fn: <http://www.w3.org/2005/xpath-functions#>
select ?entity ?mID ?children ?gender ?wikipedia_url ?dob
where {
    ?mID_raw fb:type.object.type fb:people.person .
    ?mID_raw fb:type.object.name ?entity .
    ?mID_raw fb:people.person.children ?child .
    ?mID_raw fb:common.topic.topic_equivalent_webpage ?wikipedia_url .
    ?child fb:type.object.name ?children .
    ?child fb:people.person.date_of_birth ?dob .
    ?child fb:people.person.gender ?gend .
    ?gend fb:type.object.name ?child_gender .
    BIND(REPLACE(str(?mID_raw), "http://rdf.freebase.com/ns/", "") as ?mID)
    BIND(REPLACE(?child_gender, "Male", "Son") AS ?child_gender_conv_1)
    BIND(REPLACE(?child_gender_conv_1, "Female", "Daughter") AS ?gender)
    FILTER (lang(?entity) = "en" && lang(?children) = "en" &&
            lang(?child_gender) = "en" &&
            regex(str(?wikipedia_url), "en.wikipedia", "i") &&
            !regex(str(?wikipedia_url), "curid=", "i")) .
}
order by ?dob

At the moment this query takes almost 14 hours to execute.
Is that an expected time for such an extraction, or can it be improved
somehow?

When I remove all the BIND lines from this query, the execution time drops
to 4 hours, so I conclude that the BINDs are expensive.
Is there a way to replace BIND with some other construct that achieves the
same formatting with better performance?
Are there any general 'best practices' to follow here?
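
To make clear what formatting the three BIND lines perform, here is a minimal
sketch of the same transformation done as client-side post-processing in
Python. It is only an illustration under assumptions, not something I have
timed: it assumes the query is run without the BIND lines and that each
result row exposes ?mID_raw and ?child_gender as plain strings. The function
name format_row is hypothetical.

```python
import re

FREEBASE_NS = "http://rdf.freebase.com/ns/"


def format_row(mid_raw, child_gender):
    """Mirror the three BIND(REPLACE(...)) lines for a single result row.

    mid_raw      -- the ?mID_raw URI as a string (assumed input)
    child_gender -- the ?child_gender label as a string (assumed input)
    """
    # BIND(REPLACE(str(?mID_raw), "http://rdf.freebase.com/ns/", "") AS ?mID)
    mid = re.sub(FREEBASE_NS, "", mid_raw)
    # BIND(REPLACE(?child_gender, "Male", "Son") AS ?child_gender_conv_1)
    conv = re.sub("Male", "Son", child_gender)
    # BIND(REPLACE(?child_gender_conv_1, "Female", "Daughter") AS ?gender)
    gender = re.sub("Female", "Daughter", conv)
    return mid, gender


print(format_row("http://rdf.freebase.com/ns/m.0j2btth", "Female"))
```

Like SPARQL REPLACE, re.sub treats the pattern as a regular expression and
matches case-sensitively by default, so "Female" is not touched by the
"Male" replacement.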

I have also experimented with the Jena optimizers and noticed that,
although the stats optimizer is the recommended one, I tend to get about
10% better performance with the fixed one (fixed.opt).
Is there a general rule for which one should be used?

Lastly, I have observed strange repeated lines in the logs (with logging and
debugging turned on, running with tdbquery).

Query run (note the limit 1):

prefix fb: <http://rdf.freebase.com/ns/>
prefix fn: <http://www.w3.org/2005/xpath-functions#>
select ?entity ?mID ?children ?gender ?wikipedia_url ?dob
where {
    ?mID_raw fb:type.object.type fb:people.person .
    ?mID_raw fb:type.object.name ?entity .
    ?mID_raw fb:people.person.children ?child .
    ?mID_raw fb:common.topic.topic_equivalent_webpage ?wikipedia_url .
    ?child fb:type.object.name ?children .
    OPTIONAL{ ?child fb:people.person.date_of_birth ?dob .}
    ?child fb:people.person.gender ?gend .
    ?gend fb:type.object.name ?child_gender .
    BIND(REPLACE(str(?mID_raw), "http://rdf.freebase.com/ns/", "") as ?mID)
    BIND(REPLACE(?child_gender, "Male", "Son") AS ?child_gender_conv_1)
    BIND(REPLACE(?child_gender_conv_1, "Female", "Daughter") AS ?gender)
    FILTER (lang(?entity) = "en" && lang(?children) = "en" &&
            lang(?child_gender) = "en" &&
            regex(str(?wikipedia_url), "en.wikipedia", "i") &&
            !regex(str(?wikipedia_url), "curid=", "i")) .
}
order by ?dob
limit 1

Output:

06:14:50 INFO  exec                 :: ALGEBRA
   (slice _ 1
     (project (?entity ?mID ?children ?gender ?wikipedia_url ?dob)
       (filter (exprlist (= (lang ?entity) "en") (= (lang ?children) "en")
                         (= (lang ?child_gender) "en")
                         (regex (str ?wikipedia_url) "en.wikipedia" "i")
                         (! (regex (str ?wikipedia_url) "curid=" "i")))
         (extend ((?mID (replace (str ?mID_raw) "http://rdf.freebase.com/ns/" ""))
                  (?child_gender_conv_1 (replace ?child_gender "Male" "Son"))
                  (?gender (replace ?child_gender_conv_1 "Female" "Daughter")))
           (sequence
             (conditional
               (quadpattern
                 (quad <urn:x-arq:DefaultGraphNode> ?mID_raw <http://rdf.freebase.com/ns/type.object.type> <http://rdf.freebase.com/ns/people.person>)
                 (quad <urn:x-arq:DefaultGraphNode> ?mID_raw <http://rdf.freebase.com/ns/type.object.name> ?entity)
                 (quad <urn:x-arq:DefaultGraphNode> ?mID_raw <http://rdf.freebase.com/ns/people.person.children> ?child)
                 (quad <urn:x-arq:DefaultGraphNode> ?mID_raw <http://rdf.freebase.com/ns/common.topic.topic_equivalent_webpage> ?wikipedia_url)
                 (quad <urn:x-arq:DefaultGraphNode> ?child <http://rdf.freebase.com/ns/type.object.name> ?children)
               )
               (quadpattern (quad <urn:x-arq:DefaultGraphNode> ?child <http://rdf.freebase.com/ns/people.person.date_of_birth> ?dob)))
             (quadpattern
               (quad <urn:x-arq:DefaultGraphNode> ?child <http://rdf.freebase.com/ns/people.person.gender> ?gend)
               (quad <urn:x-arq:DefaultGraphNode> ?gend <http://rdf.freebase.com/ns/type.object.name> ?child_gender)
             ))))))

06:14:50 INFO  exec                 :: Execute
(?mID_raw <http://rdf.freebase.com/ns/type.object.type> <http://rdf.freebase.com/ns/people.person>)
(?mID_raw <http://rdf.freebase.com/ns/type.object.name> ?entity)
(?mID_raw <http://rdf.freebase.com/ns/people.person.children> ?child)
(?mID_raw <http://rdf.freebase.com/ns/common.topic.topic_equivalent_webpage> ?wikipedia_url)
(?child <http://rdf.freebase.com/ns/type.object.name> ?children)
06:14:50 INFO  exec                 :: Execute ::
(<http://rdf.freebase.com/ns/m.0j2btth> <http://rdf.freebase.com/ns/people.person.date_of_birth> ?dob)
06:14:50 INFO  exec                 :: Execute
(?child <http://rdf.freebase.com/ns/people.person.gender> ?gend)
(?gend <http://rdf.freebase.com/ns/type.object.name> ?child_gender)
06:14:50 INFO  exec                 :: Execute ::
(<http://rdf.freebase.com/ns/m.0j2btth> <http://rdf.freebase.com/ns/people.person.date_of_birth> ?dob)
06:14:50 INFO  exec                 :: Execute ::
(<http://rdf.freebase.com/ns/m.0j2btth> <http://rdf.freebase.com/ns/people.person.date_of_birth> ?dob)
06:14:50 INFO  exec                 :: Execute ::
(<http://rdf.freebase.com/ns/m.0j2btth> <http://rdf.freebase.com/ns/people.person.date_of_birth> ?dob)
06:14:50 INFO  exec                 :: Execute ::
(<http://rdf.freebase.com/ns/m.0j2btth> <http://rdf.freebase.com/ns/people.person.date_of_birth> ?dob)
06:14:50 INFO  exec                 :: Execute ::
(<http://rdf.freebase.com/ns/m.0j2btth> <http://rdf.freebase.com/ns/people.person.date_of_birth> ?dob)
06:14:50 INFO  exec                 :: Execute ::
(<http://rdf.freebase.com/ns/m.0j2btth> <http://rdf.freebase.com/ns/people.person.date_of_birth> ?dob)
06:14:50 INFO  exec                 :: Execute ::


This bit:

(<http://rdf.freebase.com/ns/m.0j2btth> <http://rdf.freebase.com/ns/people.person.date_of_birth> ?dob)
06:14:50 INFO  exec                 :: Execute ::

is repeated many, many times.

Is this expected, and how should it be interpreted?


*In general, I am trying to understand whether my Freebase extraction
queries can be optimized in any way, and I would appreciate any
comments or feedback.*

Regards,

Ewa
