On 14/05/14 11:16, Ewa Szwed wrote:
Hello,
Thank you both for your answers.
Here are some more details about my setup:
I run Jena on a virtual machine with the Linux CentOS 6.5 distribution,
an 8-core CPU, 64 GB RAM and a 1.6 TB drive.
For the loading (tdbloader) I set the max heap to 4 GB, as that worked
best for me before.
Brian's point about tdbloader1 vs tdbloader2 applies.
tdbloader2 produces better databases if you are not going to be doing
incremental updates afterwards (if you are, the choice does not matter).
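For a fresh bulk load, the invocation might look something like this
(paths and filenames are illustrative):

  # Build a new database from scratch; tdbloader2 (Linux/OSX) uses
  # external sorting to build the indexes, so it copes better when
  # the data is much larger than RAM. Compressed input is fine.
  tdbloader2 --loc /data/freebase-tdb freebase-dump.nt.gz

tdbloader2 can only build a database starting from empty, which is why
it is not suitable for incremental updates.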
Now I have set the heap for Fuseki to 12 GB, and the same for tdbquery.
No need to make it that large unless you need it for other reasons. A
lot of the caching is not in the Java heap.
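For example, something like this (database location and service name
are illustrative, and this assumes the standard fuseki-server script,
which picks up JVM_ARGS):

  # A modest heap is usually enough for Fuseki over TDB; the block
  # caches sit in memory-mapped files outside the Java heap, so the
  # rest of RAM is best left to the OS file cache.
  JVM_ARGS="-Xmx2G" fuseki-server --loc=/data/freebase-tdb /freebase

The heap only needs to be big enough for query execution itself
(joins, sorts, building results).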
We run a variety of queries, and on the November data their execution
time varies from a couple of minutes to a couple of hours (usually not
more than 3, and that is really our max).
Here are 2 examples:
age at death:
prefix fb: <http://rdf.freebase.com/ns/>
prefix fn: <http://www.w3.org/2005/xpath-functions#>
select ?entity ?mID ?age_at_death ?wikipedia_url
where
{
{
?mID_raw fb:type.object.type fb:people.person .
?mID_raw fb:type.object.type fb:people.deceased_person .
?mID_raw fb:type.object.name ?entity .
?mID_raw fb:people.deceased_person.date_of_death ?date_of_death .
?mID_raw fb:people.person.date_of_birth ?date_of_birth .
?mID_raw fb:common.topic.topic_equivalent_webpage ?wikipedia_url .
FILTER (lang(?entity) = "en" && regex (str(?wikipedia_url),
"en.wikipedia", "i") && !regex (str(?wikipedia_url), "curid=", "i")).
}
BIND(REPLACE(str(?mID_raw), "http://rdf.freebase.com/ns/", "") as ?mID)
BIND(fn:year-from-dateTime(?date_of_birth) AS ?year_of_birth)
BIND(fn:year-from-dateTime(?date_of_death) AS ?year_of_death)
BIND(str(floor(fn:days-from-duration(?date_of_death - ?date_of_birth) /
365)) as ?age)
BIND(fn:concat(?age, " (", ?year_of_birth, "-", ?year_of_death, ")" ) AS
?age_at_death)
}
The age-at-death query takes less than 5 minutes on the November index
and more than 10 hours on the April index. :(
How many results does that give?
(which versions of the software?)
(assuming these are both cache-warm timings - a cold query is slow without an SSD)
So it sounds like the query is going to disk now when it used not to.
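A quick way to check the result size is a count with tdbquery (the
database path is illustrative):

  tdbquery --loc=/data/freebase-tdb \
    'PREFIX fb: <http://rdf.freebase.com/ns/>
     SELECT (COUNT(*) AS ?n)
     WHERE { ?p fb:type.object.type fb:people.deceased_person . }'

Counting each triple pattern separately also shows which one grew the
most between the two dumps.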
art:
prefix fb: <http://rdf.freebase.com/ns/>
prefix fn: <http://www.w3.org/2005/xpath-functions#>
select ?entity ?mID ?artist ?group_uri
where {
{
?mID_raw fb:type.object.type fb:visual_art.artwork .
?mID_raw fb:type.object.name ?entity .
?mID_raw fb:visual_art.artwork.artist ?group_uri .
?group_uri fb:type.object.name ?artist .
FILTER (lang(?entity) = "en" && lang(?artist) = "en").
}
BIND(REPLACE(str(?mID_raw), "http://rdf.freebase.com/ns/", "") as ?mID)
} order by ?mID
So that one has a sort, and as the data grows, the cost of the sort
grows faster than the data does.
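If only the first part of the sorted output is ever consumed, combining
the ORDER BY with a LIMIT can let the engine keep a bounded top-N
instead of sorting everything (the limit value below is just
illustrative):

  } order by ?mID limit 10000

If all rows are needed in order, the full sort is unavoidable.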
I am aware that the data size is really massive and that it is growing fast.
I guess I would like to ask for a recommendation.
Would you advise changing the product?
For my product, we would really only want to do that as a last resort.
Yesterday I found this document:
http://eprints.soton.ac.uk/266974/1/www2009fixedref.pdf
Is it being worked on?
Sort of. There is work-in-progress (indeed, progressing today) on a
cluster store but it is not production ready.
Andy
Best regards and thank you.
2014-05-13 14:51 GMT+01:00 Andy Seaborne <[email protected]>:
As Rob says, details matter here. The amount of data has risen
considerably (assuming the version of the code is the same in April as
it was in November), and the size of the machine and the style of
queries being asked can also be factors.
What queries are you asking?
Use of an SSD also makes a big difference, to loading and potentially
to query if the dataset is a lot larger than RAM. More RAM is good
for query.
You can load on a different machine (with an SSD) and copy the database
over, if that helps.
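For example (hostnames and paths are illustrative):

  # A TDB database is just the files in its directory; copy it while
  # nothing has it open, then point Fuseki/tdbquery at the new location.
  rsync -a /data/freebase-tdb/ queryhost:/data/freebase-tdb/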
On 13 May 2014 10:22, Rob Vesse <[email protected]> wrote:
"Is this significant drop in performance sth expected or maybe I have sth
fundamentally wrong in my set up - which I would need to track and fix."
We can't tell unless you actually tell us about your setup: OS, RAM, JVM
settings, type of disk the database resides on, etc. The more details
you can provide the better.
One important thing to be aware of is that TDB uses memory-mapped files,
so you don't want to set the heap size too high, since most of TDB's
memory usage is off-heap. Depending on your queries, though, you'll need
the heap to be reasonably sized, as otherwise GC and spill-to-disk will
slow down query evaluation.
In general your dataset is at the upper limit of what TDB can reasonably
handle, and if you are trying to build a business on top of a triple
store then you may want to consider commercial options.
Rob
On 12/05/2014 15:54, "Ewa Szwed" <[email protected]> wrote:
Hello,
This is me again. :)
I have the following (very big) problem.
Last November I loaded a Freebase dump into Jena TDB; I was able to
work with it reasonably well and got quite good performance for most of
my queries.
Recently I updated my Jena TDB store with a dump from April.
Here are some numbers to show the difference between these 2 instances.
                          November 2013              April 2014
  Full time of import     262,052 sec / 3.03 days    716,121 sec / 8.29 days
  Number of triples       1,826,551,456              2,489,221,915
  Index size (whole dir)  174 GB                     333 GB
My problem is that my new instance is not performing at all.
Queries that previously ran in a couple of minutes now take a couple of
hours, and that is not acceptable for my business. :(
So I would like to ask if there is a practical limit on index size for
Jena TDB. Is there anything I can do to improve its performance?
Is this significant drop in performance expected, or do I have something
fundamentally wrong in my setup that I need to track down and fix?
Please advise.
Regards,
Ewa Szwed