2014-05-14 18:35 GMT+01:00 Andy Seaborne <[email protected]>:

> On 14/05/14 11:16, Ewa Szwed wrote:
>
>> Hello,
>> Thank you both for your answers.
>> Here are some more details about my setup:
>> I run Jena on a virtual machine with a Linux CentOS 6.5 distribution,
>> an 8-core CPU, 64 GB RAM and a 1.6 TB drive.
>> For the loading (tdbloader) I set the max heap to 4 GB, as that worked
>> best for me before.
>
> Brian's point about tdbloader1 vs tdbloader2 applies.
>
> tdbloader2 produces better databases if you are not going to be doing
> incremental updates (in which case it does not matter).
>
>> Now I have set the heap for Fuseki to 12 GB, and the same for tdbquery.
>
> No need to make it that large unless you need it for other reasons. A
> lot of the caching is not in the Java heap.

Hi Andy, thank you for all these comments. Can you elaborate a little
more on this caching? I can see that the second time I run a query on
Fuseki I get better results, but these better results are maintained
even when I restart Fuseki. The same happens when I run a query using
tdbquery. How is this information kept?
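A likely explanation, as a hedged aside: TDB's index files are
memory-mapped, so the cached pages live in the operating system's
file-system cache, outside the Java heap. That cache belongs to the OS
rather than to the Fuseki process, so pages stay resident across a
Fuseki or tdbquery restart until memory pressure evicts them - which
would account for warm timings surviving a restart. A minimal Java
sketch of the mechanism (the file name and sizes are invented for
illustration; this is plain java.nio, not a TDB API):

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MmapDemo {
    public static void main(String[] args) throws Exception {
        // Map (a prefix of) an index file read-only, the way TDB maps
        // its B+tree files. The mapped bytes live in the OS page cache,
        // not the Java heap, so -Xmx barely affects them.
        RandomAccessFile raf = new RandomAccessFile("DB/SPO.dat", "r");
        try {
            FileChannel ch = raf.getChannel();
            long len = Math.min(ch.size(), 64 * 1024);
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, len);

            // Reading faults pages into the OS cache. They remain cached
            // after this JVM exits, until the OS evicts them - which is
            // why a restarted Fuseki can still be "warm".
            long sum = 0;
            while (buf.hasRemaining()) {
                sum += buf.get();
            }
            System.out.println("read " + len + " bytes, checksum " + sum);
        } finally {
            raf.close();
        }
    }
}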
>> We really run different queries, and the execution time on November
>> data for them varies from a couple of minutes to a couple of hours
>> (usually not more than 3, and this is really our max).
>> Here are 2 examples:
>>
>> age at death:
>>
>> prefix fb: <http://rdf.freebase.com/ns/>
>> prefix fn: <http://www.w3.org/2005/xpath-functions#>
>> select ?entity ?mID ?age_at_death ?wikipedia_url
>> where
>> {
>>   {
>>     ?mID_raw fb:type.object.type fb:people.person .
>>     ?mID_raw fb:type.object.type fb:people.deceased_person .
>>     ?mID_raw fb:type.object.name ?entity .
>>     ?mID_raw fb:people.deceased_person.date_of_death ?date_of_death .
>>     ?mID_raw fb:people.person.date_of_birth ?date_of_birth .
>>     ?mID_raw fb:common.topic.topic_equivalent_webpage ?wikipedia_url .
>>     FILTER (lang(?entity) = "en" &&
>>             regex(str(?wikipedia_url), "en.wikipedia", "i") &&
>>             !regex(str(?wikipedia_url), "curid=", "i")) .
>>   }
>>   BIND(REPLACE(str(?mID_raw), "http://rdf.freebase.com/ns/", "") AS ?mID)
>>   BIND(fn:year-from-dateTime(?date_of_birth) AS ?year_of_birth)
>>   BIND(fn:year-from-dateTime(?date_of_death) AS ?year_of_death)
>>   BIND(str(floor(fn:days-from-duration(?date_of_death - ?date_of_birth) / 365)) AS ?age)
>>   BIND(fn:concat(?age, " (", ?year_of_birth, "-", ?year_of_death, ")") AS ?age_at_death)
>> }
>>
>> age at death takes less than 5 minutes on the November index and more
>> than 10 hours on the April index. :(
>
> How many results does that give?
>
> (Which versions of the software?)
>
> (Assuming these are both cache-warm timings - a cold query is slow
> without an SSD.)
>
> So it sounds like the query is going to disk now when it used not to.
>
>> art:
>>
>> prefix fb: <http://rdf.freebase.com/ns/>
>> prefix fn: <http://www.w3.org/2005/xpath-functions#>
>> select ?entity ?mID ?artist ?group_uri
>> where {
>>   {
>>     ?mID_raw fb:type.object.type fb:visual_art.artwork .
>>     ?mID_raw fb:type.object.name ?entity .
>>     ?mID_raw fb:visual_art.artwork.artist ?group_uri .
>>     ?group_uri fb:type.object.name ?artist .
>>     FILTER (lang(?entity) = "en" && lang(?artist) = "en") .
>>   }
>>   BIND(REPLACE(str(?mID_raw), "http://rdf.freebase.com/ns/", "") AS ?mID)
>> } order by ?mID
>
> So that one has a sort, and as the data grows the sort grows faster.
>
>> I am aware that the data size is really massive and that it is
>> growing fast. I guess I would like to ask for a recommendation.
>> Would you advise changing the product?
>> We would really want to do that only as a last resort.
>> Yesterday I found this document:
>> http://eprints.soton.ac.uk/266974/1/www2009fixedref.pdf
>> Is it being worked on?
>
> Sort of. There is work-in-progress (indeed, progressing today) on a
> cluster store, but it is not production ready.
>
>     Andy
>
>> Best regards and thank you.
>>
>>
>> 2014-05-13 14:51 GMT+01:00 Andy Seaborne <[email protected]>:
>>
>>> As Rob says, details matter here. The amount of data has risen
>>> considerably (assuming the version of the code is the same in April
>>> as earlier in November), and the size of the machine and the style
>>> of queries being asked can be factors.
>>>
>>> What queries are you asking?
>>>
>>> Use of an SSD also makes a big difference, to loading and potentially
>>> to query if the dataset is a lot larger than RAM. More RAM is good
>>> for query.
>>>
>>> You can load on a different machine (with SSD) and copy the database
>>> over, if that helps.
>>> On 13 May 2014 10:22, Rob Vesse <[email protected]> wrote:
>>>
>>>> "Is this significant drop in performance something expected, or
>>>> maybe I have something fundamentally wrong in my setup - which I
>>>> would need to track and fix."
>>>>
>>>> We can't tell unless you actually tell us about your setup: OS, RAM,
>>>> JVM settings, type of disk the database resides upon, etc. - the
>>>> more details you can provide the better.
>>>>
>>>> One important thing to be aware of is that TDB uses memory-mapped
>>>> files, so you don't want to set the heap size too high, since most
>>>> of TDB's memory usage is off heap. That said, depending on your
>>>> queries you'll need the heap to be reasonably sized, as otherwise GC
>>>> and spill-to-disk will slow down query evaluation.
>>>>
>>>> In general your dataset is at the upper limit of what TDB can
>>>> reasonably handle, and if you are trying to build a business on top
>>>> of a triple store then you may want to consider commercial options.
>>>>
>>>> Rob
>>>>
>>>>
>>>> On 12/05/2014 15:54, "Ewa Szwed" <[email protected]> wrote:
>>>>
>>>>> Hello,
>>>>> This is me again. :)
>>>>> I have the following (very big) problem.
>>>>> Last year in November I loaded a Freebase dump into Jena TDB, and I
>>>>> was able to work with it reasonably well and got quite good
>>>>> performance for most of my queries.
>>>>> Recently I updated my Jena TDB store with a dump from April.
>>>>> Here are some numbers to show the difference between these 2
>>>>> instances:
>>>>>
>>>>>                            November 2013             April 2014
>>>>> Full time of import        262,052 sec / 3.03 days   716,121 sec / 8.29 days
>>>>> Number of triples          1,826,551,456             2,489,221,915
>>>>> Index size (whole dir)     174 GB                    333 GB
>>>>>
>>>>> My problem is that my new instance is not performing at all.
>>>>> The queries that previously ran for a couple of minutes now take a
>>>>> couple of hours, and that is not acceptable for my business. :(
>>>>> So I would like to ask if there is a practical index size limit for
>>>>> Jena TDB. Is there anything I can do to improve its performance?
>>>>> Is this significant drop in performance something expected, or
>>>>> maybe I have something fundamentally wrong in my setup - which I
>>>>> would need to track and fix?
>>>>> Please advise.
>>>>> Regards,
>>>>> Ewa Szwed
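A cheap way to answer Andy's "how many results does that give?"
question above, and to check whether the April database is disk-bound,
is a COUNT-only probe run twice in a row (cold, then warm). A hedged
Java sketch follows - the class name and database path are invented,
while the calls are standard Jena ARQ/TDB (2.x package names, current
at the time of this thread):

import com.hp.hpl.jena.query.Dataset;
import com.hp.hpl.jena.query.QueryExecution;
import com.hp.hpl.jena.query.QueryExecutionFactory;
import com.hp.hpl.jena.query.ResultSet;
import com.hp.hpl.jena.tdb.TDBFactory;

public class ProbeCount {
    public static void main(String[] args) {
        // Open the on-disk TDB database (directory path is illustrative).
        Dataset ds = TDBFactory.createDataset("/data/freebase-2014-04/DB");

        // COUNT-only probe over the core pattern of the "age at death"
        // query: it reports the result size without paying for the BINDs,
        // the regex FILTERs or result serialisation.
        String probe =
            "prefix fb: <http://rdf.freebase.com/ns/>\n" +
            "select (count(*) as ?n) where {\n" +
            "  ?p fb:type.object.type fb:people.person .\n" +
            "  ?p fb:type.object.type fb:people.deceased_person .\n" +
            "  ?p fb:people.deceased_person.date_of_death ?dod .\n" +
            "  ?p fb:people.person.date_of_birth ?dob .\n" +
            "}";

        long start = System.currentTimeMillis();
        QueryExecution qe = QueryExecutionFactory.create(probe, ds);
        try {
            ResultSet rs = qe.execSelect();
            System.out.println("matches: " + rs.next().getLiteral("n").getLong());
        } finally {
            qe.close();
        }
        System.out.println("elapsed ms: " + (System.currentTimeMillis() - start));

        ds.close();
    }
}

Run twice, the first (cold) time minus the second (warm) time is
roughly the disk component. The same trick applied to the art query
with the "order by ?mID" removed would show how much of its cost is
the sort that Andy pointed at.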
