On 05/06/14 12:47, Osma Suominen wrote:
Hi Andy!

On 05/06/14 14:28, Andy Seaborne wrote:
Osma,

What are the long running queries?

Here's one (sorry for the messed up line breaks):

(the query in question may just be a symptom and not the root cause - that is, it's not expensive in itself, but something else happening at the same time caused it to go wrong)

At a quick look:

?s text:query ('ampumavä*' 100000000)

Lucene is also taking RAM, partly the Lucene engine, partly to deal with limit/unique results.

(and if Solr then it has to completely buffer responses as well).

How many results are there to "?s text:query ('ampumavä*' 100000000)"?
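(For context, in jena-text the second argument to text:query is the maximum number of Lucene hits to materialise, so the huge limit above forces a correspondingly large result set. A hedged sketch - 1000 is an illustrative cap, not a recommendation:)

```sparql
# Sketch: the same match with a bounded hit count (1000 is illustrative).
?s text:query ('ampumavä*' 1000)
```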

...

This type of query normally takes less than one second. Now it took more
than 15 minutes.

It is possible that updates are unflushed (a bit surprising given the
length of time since the overnight updates) - you can check this by
looking at the journal file; if it is zero length, there are no
outstanding commits waiting to be flushed.
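(That check is easy to script. A minimal sketch - the dataset directory path is whatever your TDB database uses; nothing here is Fuseki API:)

```python
# Minimal sketch: a TDB journal (*.jrnl) of zero length means there are
# no outstanding commits waiting to be flushed to the data files.
from pathlib import Path

def pending_journals(db_dir):
    """Return the *.jrnl files under db_dir that are non-empty."""
    return [p for p in Path(db_dir).glob("*.jrnl") if p.stat().st_size > 0]
```

An empty list from `pending_journals("/fuseki/databases/ds")` (path is an assumption) would mean the journal is clean.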

It's too late now to check what size the journal was when this happened
yesterday, but I checked it for today, and all three *.jrnl files had
a size of zero and a timestamp of 04:57, which was the time when the
last updates finished. So at least last night, the flush completed very
soon after the update.

OK - it's highly unlikely to be updates then.

But the size locked down by unflushed commits does not change due to
read load.

Queries can use a lot of memory, and several running in parallel could
also cause an OOME.

It is possible that there happened to be a large number of parallel
queries. However, looking at the log files, I couldn't find any notable
peak.

How many datasets are on this server?

Only one TDB dataset (no inference), with a jena-text index. Within this
dataset there are about thirty named graphs, and most queries are
limited to a single named graph.

On AWS, we have seen virtualization hardware "go bad" (I can't explain
it any better). Only seen on old hardware (m1 generation). A server,
for no reason we can determine [*], simply starts having very high load
and makes very slow progress, but is functionally fine. Because it's
randomly going slow, queries build up, which can mean more memory in
active use at any one point, so an OOME is possible. This is not a
common occurrence.

[*] We allowed for cycle stealing from co-resident VMs - but this
slowdown is on the order of 10x.

This is not AWS, but a VMware setup. It is possible that there was some
contention for resources from other VMs running on the same hardware
that could have caused queries to run slower and thus build up and use
lots of memory.

As first aid, I will increase the amount of memory from 6GB to 8GB. I
am also investigating ways to detect this situation from outside Fuseki
(which means I must be able to replicate it first - I will try running
Fuseki with low amounts of memory) so that I can give it a kick if it
happens again.
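(One possible shape for that outside check - a sketch only; the endpoint URL, the trivial ASK query, and the 5-second budget are all assumptions to be tuned:)

```python
# Sketch of an external liveness probe: send a trivial SPARQL ASK and
# treat a slow or failed response as a sign Fuseki needs a kick.
import urllib.error
import urllib.parse
import urllib.request

def fuseki_alive(endpoint, timeout=5.0):
    """Return True if the endpoint answers a trivial query within timeout."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": "ASK {}"})
    req = urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # covers URLError, timeouts, connection refused
        return False
```

A cron job could call `fuseki_alive("http://localhost:3030/ds/query")` (URL is an assumption) and restart the service on repeated failures.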

-Osma



On 05/06/14 12:29, Osma Suominen wrote:
> This is a backend server for a web application which is frequented by
> GoogleBot and others (including local pings to test that the service is
> up every 5 minutes or so) which together pretty much ensure that there
> is always some activity going on. But there is an alternate backend
> server running another Fuseki to which I can divert query traffic
> temporarily if necessary.

We have had to make sure bots didn't trigger expensive queries - for example, sorts. We have also had bots that ignore "nofollow" (and robots.txt).

We also use max=3 on the httpd front end to limit the impact of parallel requests. Not perfect by any means but it does no harm.
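(For reference, one way that looks in httpd's mod_proxy - a sketch with assumed paths and port; `max=` caps the connection pool to the backend, which in the common single-process MPM setups bounds parallel backend requests:)

```apache
# Sketch (path and port are assumptions): at most 3 connections to Fuseki.
ProxyPass        /fuseki http://localhost:3030/fuseki max=3
ProxyPassReverse /fuseki http://localhost:3030/fuseki
```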

        Andy
