Hello,
Thank you both for your answers. Here are some more details about my setup: I run Jena on a virtual machine with the Linux CentOS 6.5 distribution, an 8-core CPU, 64 GB RAM and a 1.6 TB drive. For loading (tdbloader) I set the max heap to 4 GB, as that worked best for me before. I have now set the heap for Fuseki to 12 GB, and the same for tdbquery.
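For completeness, this is roughly how I apply those settings (a sketch only, assuming the stock Jena and Fuseki launcher scripts, which pick up JVM_ARGS; the database location and file names here are illustrative, not my real paths):

    # tdbloader: 4 GB max heap worked best for bulk loading
    JVM_ARGS="-Xmx4G" ./bin/tdbloader --loc=/data/tdb/freebase freebase-dump.nt.gz

    # Fuseki and tdbquery: 12 GB max heap each
    JVM_ARGS="-Xmx12G" ./fuseki-server --loc=/data/tdb/freebase /freebase
    JVM_ARGS="-Xmx12G" ./bin/tdbquery --loc=/data/tdb/freebase --query=age-at-death.rq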
We really run quite different queries, and their execution time on the November data varies from a couple of minutes to a couple of hours (usually not more than 3, and that is really our maximum). Here are 2 examples:

age at death:

prefix fb: <http://rdf.freebase.com/ns/>
prefix fn: <http://www.w3.org/2005/xpath-functions#>

select ?entity ?mID ?age_at_death ?wikipedia_url
where {
  {
    ?mID_raw fb:type.object.type fb:people.person .
    ?mID_raw fb:type.object.type fb:people.deceased_person .
    ?mID_raw fb:type.object.name ?entity .
    ?mID_raw fb:people.deceased_person.date_of_death ?date_of_death .
    ?mID_raw fb:people.person.date_of_birth ?date_of_birth .
    ?mID_raw fb:common.topic.topic_equivalent_webpage ?wikipedia_url .
    FILTER (lang(?entity) = "en"
            && regex(str(?wikipedia_url), "en.wikipedia", "i")
            && !regex(str(?wikipedia_url), "curid=", "i")) .
  }
  BIND(REPLACE(str(?mID_raw), "http://rdf.freebase.com/ns/", "") AS ?mID)
  BIND(fn:year-from-dateTime(?date_of_birth) AS ?year_of_birth)
  BIND(fn:year-from-dateTime(?date_of_death) AS ?year_of_death)
  BIND(str(floor(fn:days-from-duration(?date_of_death - ?date_of_birth) / 365)) AS ?age)
  BIND(fn:concat(?age, " (", ?year_of_birth, "-", ?year_of_death, ")") AS ?age_at_death)
}

"age at death" takes less than 5 minutes on the November index and more than 10 hours on the April index. :(

art:

prefix fb: <http://rdf.freebase.com/ns/>
prefix fn: <http://www.w3.org/2005/xpath-functions#>

select ?entity ?mID ?artist ?group_uri
where {
  {
    ?mID_raw fb:type.object.type fb:visual_art.artwork .
    ?mID_raw fb:type.object.name ?entity .
    ?mID_raw fb:visual_art.artwork.artist ?group_uri .
    ?group_uri fb:type.object.name ?artist .
    FILTER (lang(?entity) = "en" && lang(?artist) = "en") .
  }
  BIND(REPLACE(str(?mID_raw), "http://rdf.freebase.com/ns/", "") AS ?mID)
}
order by ?mID
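Regarding the filters in the first query: one variant I have been considering, though I have not benchmarked it yet, replaces the regex tests with plain string functions (a sketch only; it assumes the Wikipedia URLs are consistently lower-case, since CONTAINS is case-sensitive where the "i"-flagged regex was not):

    FILTER ( langMatches(lang(?entity), "en")
             && CONTAINS(str(?wikipedia_url), "en.wikipedia")
             && !CONTAINS(str(?wikipedia_url), "curid=") )

This would avoid evaluating two regular expressions per candidate binding, though I do not know how much of the slowdown they account for.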
I am aware that the data size is really massive and that it is growing fast. I guess I would like to ask for a recommendation. Would you advise changing the product? We would really want to do that only as a last resort. Yesterday I found this document: http://eprints.soton.ac.uk/266974/1/www2009fixedref.pdf
Is it being worked on?
Best regards and thank you.

2014-05-13 14:51 GMT+01:00 Andy Seaborne <[email protected]>:

> As Rob says, details matter here. The amount of data has risen
> considerably (assuming the version of the code is the same in April
> as it was back in November), and the size of the machine and the
> style of queries being asked can be factors.
>
> What queries are you asking?
>
> Use of an SSD also makes a big difference, to loading and potentially
> to query if the dataset is a lot larger than RAM. More RAM is good
> for query.
>
> You can load on a different machine (with SSD) and copy the database
> about if that helps.
>
> On 13 May 2014 10:22, Rob Vesse <[email protected]> wrote:
> > "Is this significant drop in performance something expected or maybe
> > I have something fundamentally wrong in my setup - which I would need
> > to track and fix."
> >
> > We can't tell unless you actually tell us about your setup: OS, RAM,
> > JVM settings, type of disk the database resides upon, etc - the more
> > details you can provide the better.
> >
> > One important thing to be aware of is that TDB uses memory-mapped
> > files, so you don't want to set the heap size too high, since most of
> > TDB's memory usage is off-heap. That said, depending on your queries,
> > you'll need the heap to be reasonably sized, as otherwise GC and
> > spill-to-disk will slow down query evaluation.
> >
> > In general your dataset is at the upper limit of what TDB can
> > reasonably handle, and if you are trying to build a business on top
> > of a triple store then you may want to consider commercial options.
> >
> > Rob
> >
> >
> > On 12/05/2014 15:54, "Ewa Szwed" <[email protected]> wrote:
> >
> >> Hello,
> >> This is me again. :)
> >> I have the following (very big) problem.
> >> Last November I loaded the Freebase dump into Jena TDB, and I was
> >> able to work with it reasonably well and got quite good performance
> >> for most of my queries.
> >> Recently I updated my Jena TDB store with a dump from April.
> >> Here are some numbers to show the difference between these 2 instances:
> >>
> >>                            November 2013             April 2014
> >>   Full time of import      262,052 sec / 3.03 days   716,121 sec / 8.29 days
> >>   Number of triples        1,826,551,456             2,489,221,915
> >>   Index size (whole dir)   174 GB                    333 GB
> >>
> >> My problem is that my new instance is not performing at all.
> >> Queries that previously ran for a couple of minutes now take a couple
> >> of hours, and that is not acceptable for my business. :(
> >> So I would like to ask if there is a practical index size limit for
> >> Jena TDB. Is there anything I can do to improve its performance?
> >> Is this significant drop in performance something expected, or maybe
> >> I have something fundamentally wrong in my setup - which I would need
> >> to track and fix?
> >> Please advise.
> >> Regards,
> >> Ewa Szwed
