Hi all,

The project I'm working on currently has a TDB store with approximately 100M triples, and its size is increasing quite quickly. A typical SPARQL query for getting data out of the system takes ages, sometimes more than 10-20 minutes, which is not really user friendly. Therefore I need to know how I can speed things up.
I'm running the whole system on a machine with an Intel Xeon E312xx and 32 GB of RAM, and I often get OutOfMemory exceptions; the google.cache used by Jena seems to be what is causing the problem. Are the figures I'm describing (machine specs, response time, etc.) normal? Is it too big/too small?

For the moment, we have decided to split the graph into pieces, that is, to generate a new named graph every now and then so that the amount of information stored in the "current" graph is smaller. When we restrict a query to a set of graphs, things work better. Although this solution works, when we merge the graphs for historical queries we face the same problem as before. So, how can we increase the speed?

I cannot disclose the dataset or any part of it, but I will try to explain it:

- Ids for entities are approximately 255 random ASCII characters. Does the size of the ids affect the speed of SPARQL queries? If yes, can I apply a Lucene index to the ids in order to reduce the query time?
- The depth of the graph (the chain of relationships between pieces of information) is around 7-8 levels at most, but most of the time queries need to link 3-4 levels.
- Most of the queries include several patterns like: ?x myont:hasattribute ?b. ?a rdf:type ?b. That is, they check the class and subclasses of entities. Is there any way to speed up the inference, since when I ask for the parent class I also get the child classes defined in my ontology?
- I know that the "." in a query acts more or less like a logical AND. Does the order of the statements affect performance? Should I start with the most restrictive ones? Should I start with the simplest ones, i.e. checking number values, etc.?
- Some of the queries use spatial and time filtering. Is it worth implementing support for spatial searches with SPARQL? Is there any kind of index for time searches?

Any help is more than welcome.

Regards,
Jorge
