2014-05-14 18:35 GMT+01:00 Andy Seaborne <[email protected]>:

> On 14/05/14 11:16, Ewa Szwed wrote:
>
>> Hello,
>> Thank you both for your answers.
>> Here are some more details about my setup:
>> I run Jena on a virtual machine with a Linux CentOS 6.5 distribution,
>> an 8-core CPU, 64 GB RAM and a 1.6 TB drive.
>> For the loading (tdbloader) I set the max heap to 4 GB, as that worked
>> best for me before.
>
> Brian's point about tdbloader1 vs tdbloader2 applies.
>
> tdbloader2 produces better databases if you are not going to be doing
> incremental updates (in which case it does not matter).
>
>> Now I have set the heap for Fuseki to 12 GB, and the same for tdbquery.
>
> No need to make it that large unless you need it for other reasons. A
> lot of the caching is not in the Java heap.

Hi Andy, thank you for all these comments. Can you elaborate a little
more on this caching? I can see that the second time I run a query on
Fuseki I get better results, but these better results are maintained
even when I restart Fuseki. The same happens when I run a query using
tdbquery. How is this information kept?
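A likely explanation, as a hedged aside: TDB's index files are
memory-mapped, so the cached pages live in the operating system's
file-system cache, outside the Java heap. That cache belongs to the OS
rather than to the Fuseki process, so pages stay resident across a
Fuseki or tdbquery restart until memory pressure evicts them - which
would account for warm timings surviving a restart. A minimal Java
sketch of the mechanism (the file name and sizes are invented for
illustration; this is plain java.nio, not a TDB API):

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MmapDemo {
    public static void main(String[] args) throws Exception {
        // Map (a prefix of) an index file read-only, the way TDB maps
        // its B+tree files. The mapped bytes live in the OS page cache,
        // not the Java heap, so -Xmx barely affects them.
        RandomAccessFile raf = new RandomAccessFile("DB/SPO.dat", "r");
        try {
            FileChannel ch = raf.getChannel();
            long len = Math.min(ch.size(), 64 * 1024);
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, len);

            // Reading faults pages into the OS cache. They remain cached
            // after this JVM exits, until the OS evicts them - which is
            // why a restarted Fuseki can still be "warm".
            long sum = 0;
            while (buf.hasRemaining()) {
                sum += buf.get();
            }
            System.out.println("read " + len + " bytes, checksum " + sum);
        } finally {
            raf.close();
        }
    }
}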
>> We really run different queries, and the execution time on November
>> data for them varies from a couple of minutes to a couple of hours
>> (usually not more than 3, and this is really our max).
>> Here are 2 examples:
>>
>> age at death:
>>
>> prefix fb: <http://rdf.freebase.com/ns/>
>> prefix fn: <http://www.w3.org/2005/xpath-functions#>
>> select ?entity ?mID ?age_at_death ?wikipedia_url
>> where
>> {
>>   {
>>     ?mID_raw fb:type.object.type fb:people.person .
>>     ?mID_raw fb:type.object.type fb:people.deceased_person .
>>     ?mID_raw fb:type.object.name ?entity .
>>     ?mID_raw fb:people.deceased_person.date_of_death ?date_of_death .
>>     ?mID_raw fb:people.person.date_of_birth ?date_of_birth .
>>     ?mID_raw fb:common.topic.topic_equivalent_webpage ?wikipedia_url .
>>     FILTER (lang(?entity) = "en" &&
>>             regex(str(?wikipedia_url), "en.wikipedia", "i") &&
>>             !regex(str(?wikipedia_url), "curid=", "i")) .
>>   }
>>   BIND(REPLACE(str(?mID_raw), "http://rdf.freebase.com/ns/", "") AS ?mID)
>>   BIND(fn:year-from-dateTime(?date_of_birth) AS ?year_of_birth)
>>   BIND(fn:year-from-dateTime(?date_of_death) AS ?year_of_death)
>>   BIND(str(floor(fn:days-from-duration(?date_of_death - ?date_of_birth) / 365)) AS ?age)
>>   BIND(fn:concat(?age, " (", ?year_of_birth, "-", ?year_of_death, ")") AS ?age_at_death)
>> }
>>
>> age at death takes less than 5 minutes on the November index and more
>> than 10 hours on the April index. :(
>
> How many results does that give?
>
> (Which versions of the software?)
>
> (Assuming these are both cache-warm timings - a cold query is slow
> without an SSD.)
>
> So it sounds like the query is going to disk now when it used not to.
>
>> art:
>>
>> prefix fb: <http://rdf.freebase.com/ns/>
>> prefix fn: <http://www.w3.org/2005/xpath-functions#>
>> select ?entity ?mID ?artist ?group_uri
>> where {
>>   {
>>     ?mID_raw fb:type.object.type fb:visual_art.artwork .
>>     ?mID_raw fb:type.object.name ?entity .
>>     ?mID_raw fb:visual_art.artwork.artist ?group_uri .
>>     ?group_uri fb:type.object.name ?artist .
>>     FILTER (lang(?entity) = "en" && lang(?artist) = "en") .
>>   }
>>   BIND(REPLACE(str(?mID_raw), "http://rdf.freebase.com/ns/", "") AS ?mID)
>> } order by ?mID
>
> So that one has a sort, and as the data grows the sort grows faster.
>
>> I am aware that the data size is really massive and that it is
>> growing fast. I guess I would like to ask for a recommendation.
>> Would you advise changing the product?
>> We would really want to do that only as a last resort.
>> Yesterday I found this document:
>> http://eprints.soton.ac.uk/266974/1/www2009fixedref.pdf
>> Is it being worked on?
>
> Sort of. There is work-in-progress (indeed, progressing today) on a
> cluster store, but it is not production ready.
>
>     Andy
>
>> Best regards and thank you.
>>
>>
>> 2014-05-13 14:51 GMT+01:00 Andy Seaborne <[email protected]>:
>>
>>> As Rob says, details matter here. The amount of data has risen
>>> considerably (assuming the version of the code is the same in April
>>> as earlier in November), and the size of the machine and the style
>>> of queries being asked can be factors.
>>>
>>> What queries are you asking?
>>>
>>> Use of an SSD also makes a big difference, to loading and potentially
>>> to query if the dataset is a lot larger than RAM. More RAM is good
>>> for query.
>>>
>>> You can load on a different machine (with SSD) and copy the database
>>> over, if that helps.
>>> On 13 May 2014 10:22, Rob Vesse <[email protected]> wrote:
>>>
>>>> "Is this significant drop in performance something expected, or
>>>> maybe I have something fundamentally wrong in my setup - which I
>>>> would need to track and fix."
>>>>
>>>> We can't tell unless you actually tell us about your setup: OS, RAM,
>>>> JVM settings, type of disk the database resides upon, etc. - the
>>>> more details you can provide the better.
>>>>
>>>> One important thing to be aware of is that TDB uses memory-mapped
>>>> files, so you don't want to set the heap size too high, since most
>>>> of TDB's memory usage is off heap. That said, depending on your
>>>> queries you'll need the heap to be reasonably sized, as otherwise GC
>>>> and spill-to-disk will slow down query evaluation.
>>>>
>>>> In general your dataset is at the upper limit of what TDB can
>>>> reasonably handle, and if you are trying to build a business on top
>>>> of a triple store then you may want to consider commercial options.
>>>>
>>>> Rob
>>>>
>>>>
>>>> On 12/05/2014 15:54, "Ewa Szwed" <[email protected]> wrote:
>>>>
>>>>> Hello,
>>>>> This is me again. :)
>>>>> I have the following (very big) problem.
>>>>> Last year in November I loaded a Freebase dump into Jena TDB, and I
>>>>> was able to work with it reasonably well and got quite good
>>>>> performance for most of my queries.
>>>>> Recently I updated my Jena TDB store with a dump from April.
>>>>> Here are some numbers to show the difference between these 2
>>>>> instances:
>>>>>
>>>>>                            November 2013             April 2014
>>>>> Full time of import        262,052 sec / 3.03 days   716,121 sec / 8.29 days
>>>>> Number of triples          1,826,551,456             2,489,221,915
>>>>> Index size (whole dir)     174 GB                    333 GB
>>>>>
>>>>> My problem is that my new instance is not performing at all.
>>>>> The queries that previously ran for a couple of minutes now take a
>>>>> couple of hours, and that is not acceptable for my business. :(
>>>>> So I would like to ask if there is a practical index size limit for
>>>>> Jena TDB. Is there anything I can do to improve its performance?
>>>>> Is this significant drop in performance something expected, or
>>>>> maybe I have something fundamentally wrong in my setup - which I
>>>>> would need to track and fix?
>>>>> Please advise.
>>>>> Regards,
>>>>> Ewa Szwed
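A cheap way to answer Andy's "how many results does that give?"
question above, and to check whether the April database is disk-bound,
is a COUNT-only probe run twice in a row (cold, then warm). A hedged
Java sketch follows - the class name and database path are invented,
while the calls are standard Jena ARQ/TDB (2.x package names, current
at the time of this thread):

import com.hp.hpl.jena.query.Dataset;
import com.hp.hpl.jena.query.QueryExecution;
import com.hp.hpl.jena.query.QueryExecutionFactory;
import com.hp.hpl.jena.query.ResultSet;
import com.hp.hpl.jena.tdb.TDBFactory;

public class ProbeCount {
    public static void main(String[] args) {
        // Open the on-disk TDB database (directory path is illustrative).
        Dataset ds = TDBFactory.createDataset("/data/freebase-2014-04/DB");

        // COUNT-only probe over the core pattern of the "age at death"
        // query: it reports the result size without paying for the BINDs,
        // the regex FILTERs or result serialisation.
        String probe =
            "prefix fb: <http://rdf.freebase.com/ns/>\n" +
            "select (count(*) as ?n) where {\n" +
            "  ?p fb:type.object.type fb:people.person .\n" +
            "  ?p fb:type.object.type fb:people.deceased_person .\n" +
            "  ?p fb:people.deceased_person.date_of_death ?dod .\n" +
            "  ?p fb:people.person.date_of_birth ?dob .\n" +
            "}";

        long start = System.currentTimeMillis();
        QueryExecution qe = QueryExecutionFactory.create(probe, ds);
        try {
            ResultSet rs = qe.execSelect();
            System.out.println("matches: " + rs.next().getLiteral("n").getLong());
        } finally {
            qe.close();
        }
        System.out.println("elapsed ms: " + (System.currentTimeMillis() - start));

        ds.close();
    }
}

Run twice, the first (cold) time minus the second (warm) time is
roughly the disk component. The same trick applied to the art query
with the "order by ?mID" removed would show how much of its cost is
the sort that Andy pointed at.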
