i think we can agree that *knowing* when the OOM occurs is the minimal 
requirement; triggering an alert (email, etc.) would be the first thing to get 
into your script
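
one way to wire that alert up is the JVM's OnOutOfMemoryError hook, which runs 
an external command when an OOME is thrown. a minimal sketch (the script path, 
log location, and alert address are assumptions for illustration, not Solr 
defaults):

```shell
#!/bin/sh
# hypothetical oom_alert.sh -- the JVM runs this on OOME if started with e.g.:
#   java -Xmx4g -XX:OnOutOfMemoryError="/opt/solr/bin/oom_alert.sh" -jar start.jar
# the log path and alert address below are assumptions for this sketch

LOG="${LOG:-/tmp/solr_oom_events.log}"

# record when the OOM happened so there is a trail even if the alert fails
echo "$(date '+%Y-%m-%d %H:%M:%S') OutOfMemoryError in Solr JVM" >> "$LOG"

# alert a human; 'mail' is one option, a pager or webhook works just as well
# echo "Solr hit an OOM at $(date)" | mail -s "Solr OOM" ops@example.com
```

newer solr start scripts ship a similar hook (oom_solr.sh, as far as i recall) 
that kills the instance instead, so check what your version already wires in 
before adding your own.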

once you know when the OOM conditions are occurring you can start to get at the 
root cause or remedy (adjust heap sizes, or adjust the input side that is 
triggering the OOM). the correct remedy will obviously require some deeper 
investigation into the actual solr usage at the point of OOM and the gc logs 
(you have these being generated too, i hope). just bumping Xmx because you 
hit an OOM during an abusive query is no guarantee of a fix, and it is likely 
going to cost you OS cache memory space which you want to leave available for 
holding the actual index data. the real fix would be cleaning up the query (if 
that is possible)
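
if gc logging isn't on yet, the HotSpot flags for java 7/8 look roughly like 
this (the log path is an assumption; on java 9+ these were replaced by the 
unified `-Xlog:gc*` option):

```shell
java -Xmx4g \
  -Xloggc:/var/log/solr/gc.log \
  -XX:+PrintGCDetails \
  -XX:+PrintGCDateStamps \
  -XX:+PrintGCApplicationStoppedTime \
  -jar start.jar
```

the stopped-time flag is worth having since long pauses right before the OOM 
are usually the first clue in these logs.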

fundamentally, it's a preference thing, but i'm personally not a fan of auto 
restarts, as the problem that triggered the original OOM (say, an expensive, 
poorly constructed query) may just come back and you get into an oscillating 
situation of restart after restart. i generally want a human involved when 
error conditions which should be outliers (like OOM) are happening
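
if you do want some automation on top of the alert, a sketch of the 
"notify, don't restart" approach against solr's ping handler (the url, 
threshold, and function names here are made up for illustration):

```shell
#!/bin/sh
# hypothetical health check that pages a human instead of restarting solr;
# SOLR_URL and THRESHOLD_MS are assumptions for this sketch
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr/collection1/admin/ping}"
THRESHOLD_MS="${THRESHOLD_MS:-5000}"

# true when the measured ping time exceeds the threshold
is_unhealthy() {
    [ "$1" -gt "$2" ]
}

check_once() {
    # curl reports total request time in seconds; convert to whole ms
    secs=$(curl -s -o /dev/null -w '%{time_total}' --max-time 30 "$SOLR_URL" || echo 999)
    ms=$(awk -v s="$secs" 'BEGIN { printf "%d", s * 1000 }')
    if is_unhealthy "$ms" "$THRESHOLD_MS"; then
        echo "ALERT: solr ping took ${ms}ms (threshold ${THRESHOLD_MS}ms)"
        # notify a human here (mail, pager, webhook) -- no kill, no restart
    fi
}
```

run it from cron every minute or so; the point is that the action on a breach 
is a page, not a kill and restart.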


________________________________________
From: Salman Akram <salman.ak...@northbaysolutions.net>
Sent: Monday, October 20, 2014 08:47
To: Solr Group
Subject: Re: Recovering from Out of Mem

" That's why it is considered better to crash the program and restart it
for OOME."

In the end aren't you also saying the same thing or I misunderstood
something?

We don't get this issue on the master server (indexing). Our real concern is
the slaves, where it happens only sometimes (rarely), so it's not an obvious
heap config issue, but when it does happen our failover (moving to another
slave) doesn't even kick in, since there is no error. I just want a good way to
know when there is an OOM so we can shift to a failover or have that server
restarted.




On Mon, Oct 20, 2014 at 7:25 PM, Shawn Heisey <apa...@elyograg.org> wrote:

> On 10/19/2014 11:32 PM, Ramzi Alqrainy wrote:
> > You can create a script to ping on Solr every 10 sec. if no response,
> then
> > restart it (Kill process id and run Solr again).
> > This is the fastest and easiest way to do that on windows.
>
> I wouldn't do this myself.  Any temporary problem that results in a long
> query time might result in a true outage while Solr restarts.  If OOME
> is a problem, then you can deal with that by providing a program for
> Java to call when OOME occurs.
>
> Sending notification when ping times get excessive is a good idea, but I
> wouldn't make it automatically restart, unless you've got a threshold
> for that action so it only happens when the ping time is *REALLY* high.
>
> The real fix for OOME is to make the heap larger or to reduce the heap
> requirements by changing how Solr is configured or used.
>
> http://wiki.apache.org/solr/SolrPerformanceProblems#Java_Heap
>
> Writing a program that has deterministic behavior in an out of memory
> condition is very difficult.  The Lucene devs *have* done this hard work
> in the lower levels of IndexWriter and the specific Directory
> implementations, so that OOME doesn't cause *index corruption*.
>
> In general, once OOME happens, program operation (and in some cases the
> status of the most recently indexed documents) is completely
> undetermined.  We can be sure that the data which has already been
> written to disk will be correct, but nothing beyond that.  That's why it
> is considered better to crash the program and restart it for OOME.
>
> Thanks,
> Shawn
>
>


--
Regards,

Salman Akram
