On 13.10.2017 09:48, George News wrote:
> Hi all,
>
> Thanks a lot for your answers. I have "negotiated" with the admins of
> the project and I will be giving you examples of the queries and data ;)
>
> We really need to enhance performance. BTW, is Virtuoso good at
> inference, or will I have the same issues?

I don't think that people here have that much experience with Virtuoso.
I used Virtuoso 7.x occasionally, but I can't say whether it's "good".
Basically, it also applies rule-based inference, but as far as I know
the focus was never on inference performance. The convenience was also
not that good from my point of view.

Virtuoso 8.x is supposed to ship a more powerful reasoning engine than
before - at least that's what some blog posts announce. However, there
are no benchmarks yet.
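If it really is the dynamic inference that kills the performance, one
workaround in Jena is to materialise the closure once and afterwards
query the plain TDB store without any reasoner attached. A rough
sketch (the paths are placeholders; note the memory caveat in the
comments):

    import org.apache.jena.query.Dataset;
    import org.apache.jena.rdf.model.InfModel;
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;
    import org.apache.jena.riot.RDFDataMgr;
    import org.apache.jena.tdb.TDBFactory;

    public class MaterializeClosure {
        public static void main(String[] args) {
            // placeholders: adjust the TDB location and ontology file
            Dataset dataset = TDBFactory.createDataset("/data/tdb");
            Model base = dataset.getDefaultModel();
            Model schema = RDFDataMgr.loadModel("ontology.ttl");

            // wrap the data in an RDFS inference model (in memory)
            InfModel inf = ModelFactory.createRDFSModel(schema, base);

            // snapshot all entailments, then write them back to TDB;
            // this enumerates the full closure in memory, so it only
            // works if the closure fits into the heap
            Model closure = ModelFactory.createDefaultModel().add(inf);
            base.add(closure);
        }
    }

After that, type and hierarchy queries are answered by ordinary index
lookups instead of the in-memory rule engine.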
> Thanks again.
> Regards,
> Jorge
>
> On 2017-10-11 15:47, Rob Vesse wrote:
>> Comments inline:
>>
>> On 11/10/2017 11:57, "George News" <[email protected]> wrote:
>>
>> Hi all,
>>
>> The project I'm working on currently has a TDB store with
>> approximately 100M triples, and the size is increasing quite
>> quickly. When I run a typical SPARQL query to get data from the
>> system, it takes ages, sometimes more than 10-20 minutes.
>> Performance-wise this is not really user-friendly, so I need to
>> know how I can increase the speed.
>>
>> I'm running the whole system on a machine with an Intel Xeon E312xx
>> and 32 GB RAM, and I often get OutOfMemory exceptions; the
>> google.cache that Jena uses is the one that seems to be causing the
>> problem.
>>
>> Specific stack traces would be useful to understand where the cache
>> is being exploded. Certain kinds of query may use the cache more
>> heavily than others, so some elaboration on the general
>> construction of your queries would be interesting.
>>
>> Are the figures I'm quoting normal (machine specs, response time,
>> etc.)? Is it too big/too small?
>>
>> The size of the data seems small relative to the size of the
>> machine. You don't say whether you changed the JVM heap size. Most
>> memory usage in TDB is off-heap via memory-mapped files, so setting
>> too large a heap can negatively impact performance.
>>
>> The response times seem very poor, but that may be down to the
>> nature of your queries and data structure; since you are unable to
>> show those, we can only provide generalisations.
>>
>> For the moment, we have decided to split the graph into pieces,
>> that is, generating a new named graph every now and then so that
>> the amount of information stored in the "current" graph is smaller.
>> Restricting the query to a set of graphs then makes things work
>> better.
>>
>> Although this solution works, when we merge the graphs for
>> historical queries we face the same problem as before. So how can
>> we increase the speed?
>>
>> I cannot disclose the dataset or part of it, but I will try to
>> explain it somehow.
>>
>> - IDs for entities are approximately 255 random ASCII characters.
>> Does the size of the IDs affect the speed of SPARQL queries? If
>> yes, can I apply a Lucene index to the IDs in order to reduce the
>> query time?
>>
>> It depends on the nature of the query. All terms are mapped into
>> 64-bit internal identifiers; these are only mapped back to the
>> original terms as and when the query engine and/or result
>> serialisation requires it. A cache is used to speed up the mapping
>> in both directions, so depending on the nature of the queries and
>> your system load you may be thrashing this cache.
>>
>> - The depth of the graph, i.e. of the information relationships, is
>> around 7-8 levels at most, but most of the time it is only required
>> to link 3-4 levels.
>>
>> Difficult to say how this impacts performance, because it really
>> depends on how you are querying that structure.
>>
>> - Most of the queries include several patterns like:
>>
>>   ?x myont:hasattribute ?b .
>>   ?a rdf:type ?b .
>>
>> Therefore they check the class and subclasses of entities. Is there
>> any way to speed up the inference, given that when I ask for the
>> parent class I also want to get the child classes defined in my
>> ontology?
>>
>> So are you actively using inference? If you are, that will
>> significantly degrade performance, because the inference closure is
>> computed entirely in memory, i.e. not in TDB, so with inference
>> turned on you will get minimal performance benefit from using TDB.
>>
>> If you only need simple inference like class and property
>> hierarchies, you may be better served by asserting those statically
>> using SPARQL updates and not using dynamic inference.
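To illustrate Rob's suggestion: asserting the class hierarchy
statically boils down to a single SPARQL update, which can be run via
Jena's UpdateAction. Roughly like this (the store location is a
placeholder):

    import org.apache.jena.query.Dataset;
    import org.apache.jena.tdb.TDBFactory;
    import org.apache.jena.update.UpdateAction;

    public class AssertTypes {
        public static void main(String[] args) {
            // placeholder location; run once, or after each bulk load
            Dataset dataset = TDBFactory.createDataset("/data/tdb");
            String update =
                "PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n" +
                "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n" +
                "INSERT { ?s rdf:type ?super }\n" +
                "WHERE  { ?s rdf:type ?sub . ?sub rdfs:subClassOf+ ?super }";
            // afterwards, plain (non-inference) queries see instances
            // of a class under all of its superclasses as well
            UpdateAction.parseExecute(update, dataset);
        }
    }

The same idea works for property hierarchies via rdfs:subPropertyOf+.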
>> - I know the "." in a query acts more or less like a logical AND.
>> Does the order of the statements have implications for performance?
>> Should I start with the most restrictive ones? Or should I start
>> with the simplest ones, e.g. checking number values?
>>
>> Yes and no. TDB will attempt to do the necessary scans in an
>> optimal order based on its knowledge of the statistics of the data.
>> However, this only applies within a single basic graph pattern,
>> i.e. one { } block, so depending on the structure of your query you
>> may need to do some manual reordering. Also, if inference is
>> involved, that may interact.
>>
>> - Some of the queries use spatial and time filtering. Is it worth
>> implementing support for spatial searches with SPARQL? Is there any
>> kind of index for time searches?
>>
>> There is a geospatial indexing extension, but there is no temporal
>> indexing provided by Jena.
>>
>> Any help is more than welcome.
>>
>> Without more detail it is difficult to provide more specific help.
>>
>> Rob
>>
>> Regards,
>> Jorge
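On the manual reordering point above: when TDB's statistics-based
optimiser doesn't pick a good order, the usual trick is to put the
most selective pattern first inside the { } block and the broad scans
last. A sketch with made-up vocabulary (myont:hasId etc. are only
stand-ins for your own properties):

    import org.apache.jena.query.*;
    import org.apache.jena.tdb.TDBFactory;

    public class OrderedQuery {
        public static void main(String[] args) {
            // placeholder location
            Dataset dataset = TDBFactory.createDataset("/data/tdb");
            String sparql =
                "PREFIX myont: <http://example.org/myont#>\n" +
                "PREFIX rdf:   <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n" +
                "SELECT ?x ?b WHERE {\n" +
                "  ?x myont:hasId \"abc123\" .\n" +     // most selective first
                "  ?x myont:hasattribute ?b .\n" +
                "  ?b rdf:type myont:Attribute .\n" +   // broadest scan last
                "}";
            try (QueryExecution qexec = QueryExecutionFactory.create(sparql, dataset)) {
                ResultSet rs = qexec.execSelect();
                while (rs.hasNext()) {
                    System.out.println(rs.next());
                }
            }
        }
    }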

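And on the time-filtering question: since Jena provides no temporal
index, the common approach is a plain range FILTER over xsd:dateTime
values, keeping the triple patterns as selective as possible so the
scan stays small. Again a sketch with a made-up property
(myont:timestamp is a stand-in):

    import org.apache.jena.query.*;
    import org.apache.jena.tdb.TDBFactory;

    public class TimeWindowQuery {
        public static void main(String[] args) {
            // placeholder location
            Dataset dataset = TDBFactory.createDataset("/data/tdb");
            String sparql =
                "PREFIX myont: <http://example.org/myont#>\n" +
                "PREFIX xsd:   <http://www.w3.org/2001/XMLSchema#>\n" +
                "SELECT ?obs ?t WHERE {\n" +
                "  ?obs myont:timestamp ?t .\n" +
                // the range check is a FILTER, not an index lookup
                "  FILTER (?t >= \"2017-10-01T00:00:00Z\"^^xsd:dateTime &&\n" +
                "          ?t <  \"2017-10-08T00:00:00Z\"^^xsd:dateTime)\n" +
                "}";
            try (QueryExecution qexec = QueryExecutionFactory.create(sparql, dataset)) {
                ResultSet rs = qexec.execSelect();
                while (rs.hasNext()) {
                    System.out.println(rs.next());
                }
            }
        }
    }

Splitting the data into time-sliced named graphs, as you already do,
effectively plays the role of a coarse temporal index here.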