Hello,
Thank you both for your answers.
Here are some more details about my setup:
I run Jena on a virtual machine with a Linux CentOS 6.5 distribution, an
8-core CPU, 64 GB RAM and a 1.6 TB drive.
For loading (tdbloader) I set the max heap to 4 GB, as that worked best
for me before.
I have now set the heap to 12 GB for Fuseki and the same for tdbquery.
We run quite different queries, and their execution time on the November
data varies from a couple of minutes to a couple of hours (usually not more
than 3, and that is really our maximum).
Here are 2 examples:

age at death:

 prefix fb: <http://rdf.freebase.com/ns/>
 prefix fn: <http://www.w3.org/2005/xpath-functions#>
 select ?entity ?mID ?age_at_death ?wikipedia_url
 where
 {
    {
         ?mID_raw fb:type.object.type fb:people.person .
         ?mID_raw fb:type.object.type fb:people.deceased_person .
         ?mID_raw fb:type.object.name ?entity .
         ?mID_raw fb:people.deceased_person.date_of_death ?date_of_death .
         ?mID_raw fb:people.person.date_of_birth ?date_of_birth .
         ?mID_raw fb:common.topic.topic_equivalent_webpage ?wikipedia_url .

         FILTER (lang(?entity) = "en"
                 && regex(str(?wikipedia_url), "en.wikipedia", "i")
                 && !regex(str(?wikipedia_url), "curid=", "i")) .
    }
    BIND(REPLACE(str(?mID_raw), "http://rdf.freebase.com/ns/", "") AS ?mID)
    BIND(fn:year-from-dateTime(?date_of_birth) AS ?year_of_birth)
    BIND(fn:year-from-dateTime(?date_of_death) AS ?year_of_death)
    BIND(str(floor(fn:days-from-duration(?date_of_death - ?date_of_birth)
         / 365)) AS ?age)
    BIND(fn:concat(?age, " (", ?year_of_birth, "-", ?year_of_death, ")")
         AS ?age_at_death)
 }

The age-at-death query takes less than 5 minutes on the November index and
more than 10 hours on the April index. :(
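
One thing I have been wondering about myself is whether the two regex()
calls and the XPath date functions add to the cost. Below is an untested
sketch of the same query using only SPARQL 1.1 built-ins: CONTAINS instead
of case-insensitive regex (which assumes the Wikipedia URLs are lower
case), and a YEAR() difference instead of the day count divided by 365
(which can be off by one around birthdays):

 prefix fb: <http://rdf.freebase.com/ns/>
 select ?entity ?mID ?age_at_death ?wikipedia_url
 where
 {
    ?mID_raw fb:type.object.type fb:people.person .
    ?mID_raw fb:type.object.type fb:people.deceased_person .
    ?mID_raw fb:type.object.name ?entity .
    ?mID_raw fb:people.deceased_person.date_of_death ?date_of_death .
    ?mID_raw fb:people.person.date_of_birth ?date_of_birth .
    ?mID_raw fb:common.topic.topic_equivalent_webpage ?wikipedia_url .

    # cheaper, case-sensitive string tests in place of regex(..., "i")
    FILTER (lang(?entity) = "en"
            && CONTAINS(str(?wikipedia_url), "en.wikipedia")
            && !CONTAINS(str(?wikipedia_url), "curid=")) .

    BIND(REPLACE(str(?mID_raw), "http://rdf.freebase.com/ns/", "") AS ?mID)
    # SPARQL 1.1 built-in YEAR() instead of fn:year-from-dateTime
    BIND(YEAR(?date_of_birth) AS ?year_of_birth)
    BIND(YEAR(?date_of_death) AS ?year_of_death)
    # age approximated as the difference of the years
    BIND(CONCAT(str(?year_of_death - ?year_of_birth), " (",
                str(?year_of_birth), "-", str(?year_of_death), ")")
         AS ?age_at_death)
 }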

art:

prefix fb: <http://rdf.freebase.com/ns/>
prefix fn: <http://www.w3.org/2005/xpath-functions#>
select ?entity ?mID ?artist ?group_uri
where {
   {
        ?mID_raw fb:type.object.type fb:visual_art.artwork .
        ?mID_raw fb:type.object.name ?entity .
        ?mID_raw fb:visual_art.artwork.artist ?group_uri .
        ?group_uri fb:type.object.name ?artist .
        FILTER (lang(?entity) = "en" && lang(?artist) = "en") .
   }
   BIND(REPLACE(str(?mID_raw), "http://rdf.freebase.com/ns/", "") AS ?mID)
} order by ?mID
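
To see how much of the art query's time goes into the final sort over the
full result set, I was thinking of first timing an unordered, limited
variant, just as a diagnostic (not the real query):

prefix fb: <http://rdf.freebase.com/ns/>
select ?entity ?mID_raw ?artist
where {
   ?mID_raw fb:type.object.type fb:visual_art.artwork .
   ?mID_raw fb:type.object.name ?entity .
   ?mID_raw fb:visual_art.artwork.artist ?group_uri .
   ?group_uri fb:type.object.name ?artist .
   FILTER (lang(?entity) = "en" && lang(?artist) = "en") .
}
limit 1000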



I am aware that the data set is really massive and that it is growing fast.
I would like to ask for a recommendation.
Would you advise changing the product?
We would really want to do that only as a last resort.
Yesterday I found this document:
http://eprints.soton.ac.uk/266974/1/www2009fixedref.pdf
Is it being worked on?
Best regards and thank you.



2014-05-13 14:51 GMT+01:00 Andy Seaborne <[email protected]>:

> As Rob says, details matter here.  The amount of data has risen
> considerably (assuming the version of the code is the same in April
> as it was back in November), and the size of the machine and the style
> of queries being asked can also be factors.
>
> What queries are you asking?
>
> Use of an SSD also makes a big difference, to loading and potentially
> to query if the dataset is a lot larger than RAM.  More RAM is good
> for query.
>
> You can load on a different machine (with SSD) and copy the database
> around if that helps.
>
> On 13 May 2014 10:22, Rob Vesse <[email protected]> wrote:
> > "Is this significant drop in performance sth expected or maybe I have sth
> > fundamentally wrong in my set up - which I would need to track and fix."
> >
> > We can't tell unless you actually tell us about your setup: OS, RAM, JVM
> > settings, type of disk the database resides upon, etc - the more details
> > you can provide the better
> >
> > One important thing to be aware of is that TDB uses memory mapped files
> so
> > you don't want to set the heap size too high since most of TDB memory
> > usage is off heap though depending on your queries you'll need the heap
> to
> > be reasonably sized as otherwise GC and spill-to-disk will slow down
> query
> > evaluation
> >
> > In general your dataset is at the upper limit of what TDB can reasonably
> > handle and if you are trying to build a business on top of a triple store
> > then you may want to consider commercial options
> >
> > Rob
> >
> >
> > On 12/05/2014 15:54, "Ewa Szwed" <[email protected]> wrote:
> >
> >>Hello,
> >>This is me again. :)
> >>I have the following (very big) problem.
> >>Last November I loaded a Freebase dump into Jena TDB, was able to work
> >>with it reasonably well, and got quite good performance for most of my
> >>queries.
> >>Recently I updated my Jena TDB store with a dump from April.
> >>Here are some numbers to show the difference between these 2 instances:
> >>
> >>                            November 2013             April 2014
> >>  Full time of import       262,052 sec / 3.03 days   716,121 sec / 8.29 days
> >>  Number of triples         1,826,551,456             2,489,221,915
> >>  Index size (whole dir)    174 GB                    333 GB
> >>
> >>My problem is that my new instance is not performing at all.
> >>The queries that previously ran for a couple of minutes now take a
> >>couple of hours, and that is not acceptable for my business. :(
> >>So I would like to ask if there is a practical index size limit for
> >>Jena TDB.  Is there anything I can do to improve its performance?
> >>Is this significant drop in performance something expected, or maybe I
> >>have something fundamentally wrong in my setup, which I would need to
> >>track down and fix?
> >>Please advise.
> >>Regards,
> >>Ewa Szwed
> >
> >
> >
> >
>
