On 13.10.2017 09:48, George News wrote:
> Hi all,
>
> Thanks a lot for your answers. I have "negotiated" with the admins of
> the project and I will be giving you examples of the queries and data ;)
>
> We really need to enhance performance. BTW, is Virtuoso good at
> inference, or will I have the same issues?

I don't think that people here have that much experience with Virtuoso.
I used Virtuoso 7.x occasionally, but I can't say whether it's "good".
Basically, it also applies rule-based inference, but as far as I know
the focus was never on inference performance. The convenience was also
not that good from my point of view.

Virtuoso 8.x is supposed to ship a more powerful reasoning engine than
before - at least that's what some blog posts announce. However, there
are no benchmarks yet.
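If it really is the dynamic inference that kills the performance, one
workaround in Jena is to materialise the closure once and afterwards
query the plain TDB store without any reasoner attached. A rough
sketch (the paths are placeholders; note the memory caveat in the
comments):

    import org.apache.jena.query.Dataset;
    import org.apache.jena.rdf.model.InfModel;
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;
    import org.apache.jena.riot.RDFDataMgr;
    import org.apache.jena.tdb.TDBFactory;

    public class MaterializeClosure {
        public static void main(String[] args) {
            // placeholders: adjust the TDB location and ontology file
            Dataset dataset = TDBFactory.createDataset("/data/tdb");
            Model base = dataset.getDefaultModel();
            Model schema = RDFDataMgr.loadModel("ontology.ttl");

            // wrap the data in an RDFS inference model (in memory)
            InfModel inf = ModelFactory.createRDFSModel(schema, base);

            // snapshot all entailments, then write them back to TDB;
            // this enumerates the full closure in memory, so it only
            // works if the closure fits into the heap
            Model closure = ModelFactory.createDefaultModel().add(inf);
            base.add(closure);
        }
    }

After that, type and hierarchy queries are answered by ordinary index
lookups instead of the in-memory rule engine.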
> Thanks again.
> Regards,
> Jorge
>
> On 2017-10-11 15:47, Rob Vesse wrote:
>> Comments inline:
>>
>> On 11/10/2017 11:57, "George News" <[email protected]> wrote:
>>
>> Hi all,
>>
>> The project I'm working on currently has a TDB store with
>> approximately 100M triples, and the size is increasing quite
>> quickly. When I run a typical SPARQL query to get data from the
>> system, it takes ages, sometimes more than 10-20 minutes.
>> Performance-wise this is not really user-friendly, so I need to
>> know how I can increase the speed.
>>
>> I'm running the whole system on a machine with an Intel Xeon E312xx
>> and 32 GB RAM, and I often get OutOfMemory exceptions; the
>> google.cache that Jena uses is the one that seems to be causing the
>> problem.
>>
>> Specific stack traces would be useful to understand where the cache
>> is being exploded. Certain kinds of query may use the cache more
>> heavily than others, so some elaboration on the general
>> construction of your queries would be interesting.
>>
>> Are the figures I'm quoting normal (machine specs, response time,
>> etc.)? Is it too big/too small?
>>
>> The size of the data seems small relative to the size of the
>> machine. You don't say whether you changed the JVM heap size. Most
>> memory usage in TDB is off-heap via memory-mapped files, so setting
>> too large a heap can negatively impact performance.
>>
>> The response times seem very poor, but that may be down to the
>> nature of your queries and data structure; since you are unable to
>> show those, we can only provide generalisations.
>>
>> For the moment, we have decided to split the graph into pieces,
>> that is, generating a new named graph every now and then so that
>> the amount of information stored in the "current" graph is smaller.
>> Restricting the query to a set of graphs then makes things work
>> better.
>>
>> Although this solution works, when we merge the graphs for
>> historical queries we face the same problem as before. So how can
>> we increase the speed?
>>
>> I cannot disclose the dataset or part of it, but I will try to
>> explain it somehow.
>>
>> - IDs for entities are approximately 255 random ASCII characters.
>> Does the size of the IDs affect the speed of SPARQL queries? If
>> yes, can I apply a Lucene index to the IDs in order to reduce the
>> query time?
>>
>> It depends on the nature of the query. All terms are mapped into
>> 64-bit internal identifiers; these are only mapped back to the
>> original terms as and when the query engine and/or result
>> serialisation requires it. A cache is used to speed up the mapping
>> in both directions, so depending on the nature of the queries and
>> your system load you may be thrashing this cache.
>>
>> - The depth of the graph, i.e. of the information relationships, is
>> around 7-8 levels at most, but most of the time it is only required
>> to link 3-4 levels.
>>
>> Difficult to say how this impacts performance, because it really
>> depends on how you are querying that structure.
>>
>> - Most of the queries include several patterns like:
>>
>>   ?x myont:hasattribute ?b .
>>   ?a rdf:type ?b .
>>
>> Therefore they check the class and subclasses of entities. Is there
>> any way to speed up the inference, given that when I ask for the
>> parent class I also want to get the child classes defined in my
>> ontology?
>>
>> So are you actively using inference? If you are, that will
>> significantly degrade performance, because the inference closure is
>> computed entirely in memory, i.e. not in TDB, so with inference
>> turned on you will get minimal performance benefit from using TDB.
>>
>> If you only need simple inference like class and property
>> hierarchies, you may be better served by asserting those statically
>> using SPARQL updates and not using dynamic inference.
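To illustrate Rob's suggestion: asserting the class hierarchy
statically boils down to a single SPARQL update, which can be run via
Jena's UpdateAction. Roughly like this (the store location is a
placeholder):

    import org.apache.jena.query.Dataset;
    import org.apache.jena.tdb.TDBFactory;
    import org.apache.jena.update.UpdateAction;

    public class AssertTypes {
        public static void main(String[] args) {
            // placeholder location; run once, or after each bulk load
            Dataset dataset = TDBFactory.createDataset("/data/tdb");
            String update =
                "PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n" +
                "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n" +
                "INSERT { ?s rdf:type ?super }\n" +
                "WHERE  { ?s rdf:type ?sub . ?sub rdfs:subClassOf+ ?super }";
            // afterwards, plain (non-inference) queries see instances
            // of a class under all of its superclasses as well
            UpdateAction.parseExecute(update, dataset);
        }
    }

The same idea works for property hierarchies via rdfs:subPropertyOf+.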
>> - I know the "." in a query acts more or less like a logical AND.
>> Does the order of the statements have implications for performance?
>> Should I start with the most restrictive ones? Or should I start
>> with the simplest ones, e.g. checking number values?
>>
>> Yes and no. TDB will attempt to do the necessary scans in an
>> optimal order based on its knowledge of the statistics of the data.
>> However, this only applies within a single basic graph pattern,
>> i.e. one { } block, so depending on the structure of your query you
>> may need to do some manual reordering. Also, if inference is
>> involved, that may interact.
>>
>> - Some of the queries use spatial and time filtering. Is it worth
>> implementing support for spatial searches with SPARQL? Is there any
>> kind of index for time searches?
>>
>> There is a geospatial indexing extension, but there is no temporal
>> indexing provided by Jena.
>>
>> Any help is more than welcome.
>>
>> Without more detail it is difficult to provide more specific help.
>>
>> Rob
>>
>> Regards,
>> Jorge
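On the manual reordering point above: when TDB's statistics-based
optimiser doesn't pick a good order, the usual trick is to put the
most selective pattern first inside the { } block and the broad scans
last. A sketch with made-up vocabulary (myont:hasId etc. are only
stand-ins for your own properties):

    import org.apache.jena.query.*;
    import org.apache.jena.tdb.TDBFactory;

    public class OrderedQuery {
        public static void main(String[] args) {
            // placeholder location
            Dataset dataset = TDBFactory.createDataset("/data/tdb");
            String sparql =
                "PREFIX myont: <http://example.org/myont#>\n" +
                "PREFIX rdf:   <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n" +
                "SELECT ?x ?b WHERE {\n" +
                "  ?x myont:hasId \"abc123\" .\n" +     // most selective first
                "  ?x myont:hasattribute ?b .\n" +
                "  ?b rdf:type myont:Attribute .\n" +   // broadest scan last
                "}";
            try (QueryExecution qexec = QueryExecutionFactory.create(sparql, dataset)) {
                ResultSet rs = qexec.execSelect();
                while (rs.hasNext()) {
                    System.out.println(rs.next());
                }
            }
        }
    }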

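And on the time-filtering question: since Jena provides no temporal
index, the common approach is a plain range FILTER over xsd:dateTime
values, keeping the triple patterns as selective as possible so the
scan stays small. Again a sketch with a made-up property
(myont:timestamp is a stand-in):

    import org.apache.jena.query.*;
    import org.apache.jena.tdb.TDBFactory;

    public class TimeWindowQuery {
        public static void main(String[] args) {
            // placeholder location
            Dataset dataset = TDBFactory.createDataset("/data/tdb");
            String sparql =
                "PREFIX myont: <http://example.org/myont#>\n" +
                "PREFIX xsd:   <http://www.w3.org/2001/XMLSchema#>\n" +
                "SELECT ?obs ?t WHERE {\n" +
                "  ?obs myont:timestamp ?t .\n" +
                // the range check is a FILTER, not an index lookup
                "  FILTER (?t >= \"2017-10-01T00:00:00Z\"^^xsd:dateTime &&\n" +
                "          ?t <  \"2017-10-08T00:00:00Z\"^^xsd:dateTime)\n" +
                "}";
            try (QueryExecution qexec = QueryExecutionFactory.create(sparql, dataset)) {
                ResultSet rs = qexec.execSelect();
                while (rs.hasNext()) {
                    System.out.println(rs.next());
                }
            }
        }
    }

Splitting the data into time-sliced named graphs, as you already do,
effectively plays the role of a coarse temporal index here.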