Re: Recovering from Out of Mem

2014-10-21 Thread Shawn Heisey
On 10/21/2014 1:24 AM, Salman Akram wrote:
> Yes so the most imp thing is what's the best way to 'know' that there is
> OOM? Some script of a ping with 1-2 mins time?

To touch on both your question and that posed by Toke Eskildsen:

Java itself has a configuration option (-XX:OnOutOfMemoryError, as
discussed earlier in this thread) to call a program or script when OOME
occurs.  The idea is that this script should kill the application and
start it back up.  Lucene and the way it builds indexes have built-in
protections that should keep the index from becoming corrupt when the
app is killed at an unknown location.

Even relatively simple applications tend to have several layers;
complicated ones may have dozens or hundreds of layers.  Lucene and Solr
are not simple.  Dealing with all possible fallout from OOME in
application code is *hard*.  It involves extra code and careful
planning.  Engineering a safe exit from the entire application is even
harder, something that even an experienced programmer who's in charge of
the entire application might not be able to easily do.

I see a number of try/catch cases in the code where a Throwable is
trapped, rather than an Exception.  This means that OOME will not result
in the entire program dying.  We might want OOME to result in the
program dying ... but getting there will involve a lot of tedious work
examining existing code to determine what errors are possible and how to
handle each one specifically, allowing OOME to bubble up to the point
where it can kill the app.  Even then, it is likely to only kill Solr,
not the servlet container ... which means that it probably won't restart
without the OOM config option on the JRE.

> The reason I want auto restart or at least some error (so that it can
> switch to another slave) is I want to have a good sleep if something goes
> wrong at night so that the systems keep on working and can look into
> details in the morning. That's the whole purpose of having a fail over
> implemented.
> 
> On a side note the instance where we had this OOM didn't have an explicit
> Xmx set (on 64 bit Windows) so in that case is there some default max?
> There was ample mem available so why would it throw OOM?

The default max heap is dependent on the specific java implementation --
whether it's 32 bit or 64 bit, whether it's a client JVM or a server
JVM, etc.  And it will usually depend on how much memory the system has
installed, too.  The first answer on this SO question will let you find
out what the default is for your system:

http://stackoverflow.com/questions/4667483/how-is-the-default-java-heap-size-determined

If you're getting OOME, then some aspect of your memory configuration
wasn't large enough for your index, configuration, or query pattern.
One thing that can get exceeded when everything looks like it should be
fine is PermGen.  An error stacktrace was never included on this thread,
so we have no idea exactly what kind of error we're dealing with.
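
A quick way to check the effective default on a given machine is to ask
the JVM to print its final flag values and pick out MaxHeapSize.  The
sketch below assumes a HotSpot JVM on the PATH and a POSIX shell (on
Windows that means something like Cygwin, or filtering the same java
output with findstr instead).  The helper names are made up for
illustration.

```shell
#!/bin/sh
# Sketch: report the JVM's effective maximum heap size.
# Assumes a HotSpot JVM that supports -XX:+PrintFlagsFinal.

# Read PrintFlagsFinal output on stdin and print the MaxHeapSize value
# in bytes. Lines look like: "uintx MaxHeapSize := 4164943872 {product}".
max_heap_bytes() {
  awk '$2 == "MaxHeapSize" { print $4; exit }'
}

# Convert a byte count to whole MiB.
bytes_to_mib() {
  echo $(( $1 / 1048576 ))
}

# Usage (not run automatically):
#   bytes=$(java -XX:+PrintFlagsFinal -version 2>/dev/null | max_heap_bytes)
#   echo "default max heap: $(bytes_to_mib "$bytes") MiB"
```

If the reported value is much smaller than the installed memory, that
would explain an OOME on a machine that otherwise looks like it has
plenty of free RAM.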

Thanks,
Shawn



Re: Recovering from Out of Mem

2014-10-21 Thread Salman Akram
Yes, so the most important thing is: what's the best way to 'know' that
there is an OOM? Some script that does a ping with a 1-2 minute time limit?

The reason I want auto restart or at least some error (so that it can
switch to another slave) is I want to have a good sleep if something goes
wrong at night so that the systems keep on working and can look into
details in the morning. That's the whole purpose of having a fail over
implemented.

On a side note, the instance where we had this OOM didn't have an explicit
Xmx set (on 64-bit Windows), so in that case is there some default max?
There was ample memory available, so why would it throw OOM?



-- 
Regards,

Salman Akram


Re: Recovering from Out of Mem

2014-10-21 Thread Toke Eskildsen
On Mon, 2014-10-20 at 16:25 +0200, Shawn Heisey wrote:
> In general, once OOME happens, program operation (and in some cases the
> status of the most recently indexed documents) is completely
> undetermined.  We can be sure that the data which has already been
> written to disk will be correct, but nothing beyond that.  That's why it
> is considered better to crash the program and restart it for OOME.

Any idea why Lucene/Solr does not do this by itself? It could be
optional (with default to "yes, please crash"), but it seems to me that
shutting down on OOM would be the right thing to do. The need to set a
user-supplied system-specific watch-mechanism on the JVM to get a
reliable Solr is a) not done in a lot of cases and b) prone to errors.

If System.exit() is not available due to JVM options or shutdown is not
reliable for other reasons, the searcher could be marked as unreliable so
that all calls would result in an error "Service unavailable due to OOM.
Please restart", forcing action instead of silent "something's wrong,
but we don't know what".

- Toke Eskildsen, State and University Library




Re: Recovering from Out of Mem

2014-10-20 Thread Boogie Shafer

i think we can agree that the basic requirement of *knowing* when the OOM 
occurs is the minimal requirement, triggering an alert (email, etc) would be 
the first thing to get into your script

once you know when the OOM conditions are occurring you can start to get to the
root cause or remedy (adjust heap sizes, or adjust the input side that is
triggering the OOM). the correct remedy will obviously require some deeper
investigation into the actual solr usage at the point of OOM and the gc logs
(you have these being generated too i hope). just bumping the Xmx because you
hit an OOM during an abusive query is no guarantee of a fix and is likely going
to cost you OS cache memory space which you want to leave available for holding
the actual index data. the real fix would be cleaning up the query (if that is
possible)

fundamentally, it's a preference thing, but i'm personally not a fan of auto
restarts as the problem that triggered the original OOM (say an expensive
poorly constructed query) may just come back and you get into an oscillating
situation of restart after restart. i generally want a human involved when
error conditions which should be outliers (like OOM) are happening
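
for anyone who does want automatic restarts anyway, one way to avoid the
restart-after-restart loop is to cap how many restarts are allowed in a
time window and fall back to alerting a human once the cap is hit. a
minimal sketch, assuming a POSIX shell; the function name, state file and
limits are all hypothetical:

```shell
#!/bin/sh
# Sketch: allow at most MAX restarts per WINDOW seconds.
# The state file holds one epoch timestamp per recorded restart.

# allow_restart STATE_FILE NOW_EPOCH MAX_RESTARTS WINDOW_S
# Returns 0 (and records NOW) if a restart is still allowed,
# returns 1 if the cap for the current window has been reached.
allow_restart() {
  state=$1; now=$2; max=$3; window=$4
  touch "$state"
  # Count recorded restarts that are newer than the window.
  count=$(awk -v now="$now" -v w="$window" \
    'now - $1 < w { n++ } END { print n + 0 }' "$state")
  if [ "$count" -ge "$max" ]; then
    return 1
  fi
  echo "$now" >> "$state"
  return 0
}

# Usage inside an OOM hook (paths and address are placeholders):
#   if allow_restart /var/tmp/solr_restarts "$(date +%s)" 3 3600; then
#     /path/to/restart_solr.sh
#   else
#     echo "$(date)" | mail -s "Solr OOM: restart cap hit - $HOSTNAME" you@example.com
#   fi
```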





Re: Recovering from Out of Mem

2014-10-20 Thread Salman Akram
" That's why it is considered better to crash the program and restart it
for OOME."

In the end aren't you also saying the same thing, or did I misunderstand
something?

We don't get this issue on the master server (indexing). Our real concern is
the slave, where it happens only rarely, so it is not an obvious heap config
issue. But when it happens our failover (moving to another slave) doesn't even
work, as there is no error. So I just want a good way to know if there is an
OOM and then shift to a failover or just have that server restarted.






-- 
Regards,

Salman Akram


Re: Recovering from Out of Mem

2014-10-20 Thread Shawn Heisey
On 10/19/2014 11:32 PM, Ramzi Alqrainy wrote:
> You can create a script to ping on Solr every 10 sec. if no response, then
> restart it (Kill process id and run Solr again).
> This is the fastest and easiest way to do that on windows.

I wouldn't do this myself.  Any temporary problem that results in a long
query time might result in a true outage while Solr restarts.  If OOME
is a problem, then you can deal with that by providing a program for
Java to call when OOME occurs.

Sending notification when ping times get excessive is a good idea, but I
wouldn't make it automatically restart, unless you've got a threshold
for that action so it only happens when the ping time is *REALLY* high.
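
As a rough sketch of that two-threshold idea (notify early, restart only
when the ping time is extreme), something like the following could work.
It assumes curl and a POSIX shell are available; the ping URL, core name,
thresholds, and function names are placeholders, not anything Solr ships.

```shell
#!/bin/sh
# Sketch: Solr ping check with separate alert and restart thresholds.
# URL, core name and thresholds are assumptions; adjust for your setup.

SOLR_PING_URL=${SOLR_PING_URL:-http://localhost:8983/solr/collection1/admin/ping}
WARN_MS=${WARN_MS:-5000}          # notify a human above this
RESTART_MS=${RESTART_MS:-120000}  # only consider a restart above this

# ping_ms URL TIMEOUT_S -> elapsed milliseconds; if curl fails or times
# out, the full timeout is reported. HTTP error responses still count as
# a response here.
ping_ms() {
  url=$1; timeout_s=$2
  if t=$(curl -s -o /dev/null --max-time "$timeout_s" -w '%{time_total}' "$url"); then
    awk -v t="$t" 'BEGIN { printf "%d\n", t * 1000 }'
  else
    echo $(( timeout_s * 1000 ))
  fi
}

# classify_ping ELAPSED_MS WARN_MS RESTART_MS -> ok | alert | restart
classify_ping() {
  if [ "$1" -ge "$3" ]; then
    echo restart
  elif [ "$1" -ge "$2" ]; then
    echo alert
  else
    echo ok
  fi
}

# Example cron-style use (not run when this file is sourced):
#   ms=$(ping_ms "$SOLR_PING_URL" $(( RESTART_MS / 1000 )))
#   case $(classify_ping "$ms" "$WARN_MS" "$RESTART_MS") in
#     alert)   echo "ping took ${ms} ms" | mail -s "Solr slow - $HOSTNAME" you@example.com ;;
#     restart) /path/to/restart_solr.sh ;;
#   esac
```

Keeping the restart threshold far above the alert threshold means a
single slow query produces an email rather than an outage.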

The real fix for OOME is to make the heap larger or to reduce the heap
requirements by changing how Solr is configured or used.

http://wiki.apache.org/solr/SolrPerformanceProblems#Java_Heap

Writing a program that has deterministic behavior in an out of memory
condition is very difficult.  The Lucene devs *have* done this hard work
in the lower levels of IndexWriter and the specific Directory
implementations, so that OOME doesn't cause *index corruption*.

In general, once OOME happens, program operation (and in some cases the
status of the most recently indexed documents) is completely
undetermined.  We can be sure that the data which has already been
written to disk will be correct, but nothing beyond that.  That's why it
is considered better to crash the program and restart it for OOME.

Thanks,
Shawn



Re: Recovering from Out of Mem

2014-10-19 Thread Salman Akram
I assume you will have to write a script to restart the service as well?


Re: Recovering from Out of Mem

2014-10-19 Thread Ramzi Alqrainy
You can create a script to ping Solr every 10 seconds; if there is no
response, restart it (kill the process ID and run Solr again).
This is the fastest and easiest way to do that on Windows.





Re: Recovering from Out of Mem

2014-10-17 Thread Tim Potter
You'd still want to kill it ... so you'll need to register a cmd script
with the JVM using -XX:OnOutOfMemoryError=kill.cmd and then you could
either

1) trap the PID at startup using something like:

rem !SOLR_PID! below only expands with delayed expansion enabled
setlocal enabledelayedexpansion
title SolrCloud
for /F "tokens=2 delims= " %%A in ('TASKLIST /FI ^"WINDOWTITLE eq SolrCloud^" /NH') do (
  set /A SOLR_PID=%%A
  echo !SOLR_PID!>solr.pid
)

or

2) if you keep track of the port (which all my Windows scripts do), then
you can do:

rem kill whatever process is listening on %SOLR_PORT%
For /f "tokens=5" %%j in ('netstat -aon ^| find /i "listening" ^| find ":%SOLR_PORT%"') do (
  taskkill /t /f /pid %%j > nul 2>&1
)




Re: Recovering from Out of Mem

2014-10-17 Thread Salman Akram
I know this might sound weird but any easy way to do it in Windows?




-- 
Regards,

Salman Akram


Re: Recovering from Out of Mem

2014-10-14 Thread Boogie Shafer
yago,

you can put more complex restart logic as shown in the examples below or just 
do something similar to the java_oom.sh i posted earlier where you just spit 
out an email alert and deal with service restarts and troubleshooting manually


e.g. something like the following for a java_error.sh will drop an email with a 
timestamp



echo `date` | mail -s "Java Error: General - $HOSTNAME" not...@domain.com



From: Tim Potter 
Sent: Tuesday, October 14, 2014 07:35
To: solr-user@lucene.apache.org
Subject: Re: Recovering from Out of Mem

jfyi - the bin/solr script does the following:

-XX:OnOutOfMemoryError="$SOLR_TIP/bin/oom_solr.sh $SOLR_PORT" where
$SOLR_PORT is the port Solr is bound to, e.g. 8983

The oom_solr.sh script looks like:

SOLR_PORT=$1
SOLR_PID=`ps waux | grep start.jar | grep $SOLR_PORT | grep -v grep | awk '{print $2}' | sort -r`
if [ "$SOLR_PID" == "" ]; then
  echo "Couldn't find Solr process running on port $SOLR_PORT!"
  exit
fi
NOW=$(date +"%F%T")
(
echo "Running OOM killer script for process $SOLR_PID for Solr on port $SOLR_PORT"
kill -9 $SOLR_PID
echo "Killed process $SOLR_PID"
) | tee solr_oom_killer-$SOLR_PORT-$NOW.log


I usually run Solr behind a supervisor type process (supervisord or
upstart) that will restart it if the process dies.


On Tue, Oct 14, 2014 at 8:09 AM, Markus Jelsma  wrote:

> This will do:
> kill -9 `ps aux | grep -v grep | grep tomcat6 | awk '{print $2}'`
>
> pkill should also work
>
> On Tuesday 14 October 2014 07:02:03 Yago Riveiro wrote:
> > Boogie,
> >
> >
> >
> >
> > Any example for java_error.sh script?
> >
> >
> > —
> > /Yago Riveiro
> >
> > On Tue, Oct 14, 2014 at 2:48 PM, Boogie Shafer <
> boogie.sha...@proquest.com>
> >
> > wrote:
> > > a really simple approach is to have the OOM generate an email
> > > e.g.
> > > 1) create a simple script (call it java_oom.sh) and drop it in your
> > > tomcat bin dir
> > >
> > > echo `date` | mail -s "Java Error: OutOfMemory - $HOSTNAME" not...@domain.com
> > >
> > > 2) configure your java options (in setenv.sh or similar) to trigger
> > > heap dump and the email script when OOM occurs
> > >
> > > # config error behaviors
> > > CATALINA_OPTS="$CATALINA_OPTS -XX:+HeapDumpOnOutOfMemoryError
> > > -XX:HeapDumpPath=$TOMCAT_DIR/temp/tomcat-dump.hprof
> > > -XX:OnError=$TOMCAT_DIR/bin/java_error.sh
> > > -XX:OnOutOfMemoryError=$TOMCAT_DIR/bin/java_oom.sh
> > > -XX:ErrorFile=$TOMCAT_DIR/temp/java_error%p.log"
> > >
> > > From: Mark Miller
> > > Sent: Tuesday, October 14, 2014 06:30
> > > To: solr-user@lucene.apache.org
> > > Subject: Re: Recovering from Out of Mem
> > > Best is to pass the Java cmd line option that kills the process on
> > > OOM and setup a supervisor on the process to restart it.  You need a
> > > somewhat recent release for this to work properly though. - Mark
> > >
> > >> On Oct 14, 2014, at 9:06 AM, Salman Akram
> > >>  wrote:
> > >>
> > >> I know there are some suggestions to avoid OOM issue e.g. setting
> > >> appropriate Max Heap size etc. However, what's the best way to
> > >> recover from it as it goes into non-responding state? We are using
> > >> Tomcat on back end.
> > >>
> > >> The scenario is that once we face OOM issue it keeps on taking
> > >> queries (doesn't give any error) but they just time out. So even
> > >> though we have a fail over system implemented but we don't have a
> > >> way to distinguish if these are real time out queries OR due to OOM.
> > >>
> > >> --
> > >> Regards,
> > >>
> > >> Salman Akram
>
>

Re: Recovering from Out of Mem

2014-10-14 Thread Tim Potter
jfyi - the bin/solr script does the following:

-XX:OnOutOfMemoryError="$SOLR_TIP/bin/oom_solr.sh $SOLR_PORT" where
$SOLR_PORT is the port Solr is bound to, e.g. 8983

The oom_solr.sh script looks like:

SOLR_PORT=$1

SOLR_PID=`ps waux | grep start.jar | grep $SOLR_PORT | grep -v grep | awk '{print $2}' | sort -r`

if [ "$SOLR_PID" == "" ]; then
  echo "Couldn't find Solr process running on port $SOLR_PORT!"
  exit
fi

NOW=$(date +"%F%T")

(
echo "Running OOM killer script for process $SOLR_PID for Solr on port $SOLR_PORT"
kill -9 $SOLR_PID
echo "Killed process $SOLR_PID"
) | tee solr_oom_killer-$SOLR_PORT-$NOW.log


I usually run Solr behind a supervisor type process (supervisord or
upstart) that will restart it if the process dies.
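
For illustration only, the restart behaviour that a supervisor provides can
be approximated with a small shell loop. This is a rough sketch and not a
replacement for supervisord or upstart; the function name supervise and the
RESTART_DELAY variable are made up for this example.

```shell
#!/bin/sh
# supervise MAX_RESTARTS CMD [ARGS...]
# Hypothetical helper: re-run CMD until it exits with status 0 or the
# restart budget is used up. A process killed with "kill -9" reports a
# non-zero status, so it is restarted.
supervise() {
  max_restarts=$1; shift
  restarts=0
  while :; do
    # A clean exit ends supervision.
    "$@" && return 0
    status=$?
    # Give up and report the last failure once the budget is spent.
    [ "$restarts" -ge "$max_restarts" ] && return "$status"
    restarts=$((restarts + 1))
    echo "exit status $status, restart $restarts of $max_restarts" >&2
    # RESTART_DELAY (seconds) is an assumption of this sketch, default 5.
    sleep "${RESTART_DELAY:-5}"
  done
}
```

Something like "supervise 3 java -jar start.jar" would then restart the
process up to three times after the OOM script has killed it. A real
supervisor adds logging, backoff and boot-time startup on top of this.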




Re: Recovering from Out of Mem

2014-10-14 Thread Markus Jelsma
This will do:
kill -9 `ps aux | grep -v grep | grep tomcat6 | awk '{print $2}'`

pkill should also work
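
As a hedged sketch of the same idea, the grep/grep/awk pipeline above can be
folded into a single awk filter. The function name pids_matching is made up
for this example. Where pkill is available, "pkill -9 -f tomcat6" is the
shorter equivalent.

```shell
#!/bin/sh
# pids_matching PATTERN
# Hypothetical helper: reads "ps aux" style output on stdin and prints the
# PID column (field 2) of every line that matches PATTERN. Lines containing
# "awk" are skipped so the filter does not report its own process; this
# also hides any unrelated process whose command line contains "awk".
pids_matching() {
  awk -v pat="$1" '$0 ~ pat && !/awk/ { print $2 }'
}
```

Usage would look like: kill -9 $(ps aux | pids_matching tomcat6)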




Re: Recovering from Out of Mem

2014-10-14 Thread Yago Riveiro
Boogie,

Do you have an example of the java_error.sh script?

—
/Yago Riveiro


Re: Recovering from Out of Mem

2014-10-14 Thread Markus Jelsma
And don't forget to set the proper permissions on the script, so that the
tomcat or jetty user can execute it.
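
A minimal sketch of that permission step, assuming the hook lives in the
Tomcat bin dir as java_oom.sh; the function name make_hook_executable is
made up for this example.

```shell
#!/bin/sh
# make_hook_executable SCRIPT
# Hypothetical helper: make SCRIPT readable and executable by all users,
# then confirm the execute bit took effect. The -x test checks the user
# running this function, not the tomcat or jetty user.
make_hook_executable() {
  script=$1
  chmod 755 "$script" || return 1
  [ -x "$script" ]
}
```

For example: make_hook_executable "$TOMCAT_DIR/bin/java_oom.sh". If the
container runs under a restricted account, it is worth repeating the check
as that account, since directory permissions along the path matter too.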

Markus




Re: Recovering from Out of Mem

2014-10-14 Thread Boogie Shafer

A really simple approach is to have the OOM generate an email, e.g.:

1) Create a simple script (call it java_oom.sh) and drop it in your Tomcat
bin dir:

echo `date` | mail -s "Java Error: OutOfMemory - $HOSTNAME" not...@domain.com

2) Configure your Java options (in setenv.sh or similar) to trigger a heap
dump and the email script when an OOM occurs:

# config error behaviors
CATALINA_OPTS="$CATALINA_OPTS -XX:+HeapDumpOnOutOfMemoryError 
-XX:HeapDumpPath=$TOMCAT_DIR/temp/tomcat-dump.hprof 
-XX:OnError=$TOMCAT_DIR/bin/java_error.sh 
-XX:OnOutOfMemoryError=$TOMCAT_DIR/bin/java_oom.sh 
-XX:ErrorFile=$TOMCAT_DIR/temp/java_error%p.log"
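
The one-line notifier from step 1 could be fleshed out roughly as follows.
This is a sketch under assumptions: the OOM_NOTIFY_TO and MAILER variables
and the function names are invented for the example, and the recipient is
taken from the environment because the JVM option above calls the script
without arguments.

```shell
#!/bin/sh
# java_oom.sh: hypothetical expanded version of the one-line notifier.

# Build the subject line. HOSTNAME is not always set in a plain sh, so
# fall back to "uname -n".
oom_subject() {
  printf 'Java Error: OutOfMemory - %s' "${HOSTNAME:-$(uname -n)}"
}

# Mail the current date to the given recipient. MAILER is an assumption
# that allows a different mail command to be swapped in; default is mail.
notify_oom() {
  recipient=$1
  date | "${MAILER:-mail}" -s "$(oom_subject)" "$recipient"
}

# Send only when a recipient has been configured (assumed variable name).
if [ -n "${OOM_NOTIFY_TO:-}" ]; then
  notify_oom "$OOM_NOTIFY_TO"
fi
```

With this layout, setenv.sh would also need to export OOM_NOTIFY_TO so that
the script inherits it from the JVM's environment.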





Re: Recovering from Out of Mem

2014-10-14 Thread Mark Miller
The best approach is to pass the Java command-line option that kills the
process on OOM and set up a supervisor on the process to restart it. You
need a somewhat recent release for this to work properly, though.
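
A rough sketch of such a launch line, assuming a HotSpot JVM, where
-XX:OnOutOfMemoryError runs the given command and %p is replaced with the
JVM's own process ID. The wrapper name run_java_kill_on_oom and the JAVA_BIN
variable are made up for this example.

```shell
#!/bin/sh
# run_java_kill_on_oom [JAVA ARGS...]
# Hypothetical wrapper: start Java so that the JVM kills itself on the
# first OutOfMemoryError. The single quotes keep "kill -9 %p" together as
# one argument; the JVM, not the shell, expands %p.
# JAVA_BIN is an assumption of this sketch and defaults to "java".
run_java_kill_on_oom() {
  "${JAVA_BIN:-java}" -XX:OnOutOfMemoryError='kill -9 %p' "$@"
}
```

For example: run_java_kill_on_oom -jar start.jar. The supervisor then sees
the process die and starts it again.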

- Mark

> On Oct 14, 2014, at 9:06 AM, Salman Akram 
>  wrote:
> 
> I know there are some suggestions for avoiding OOM issues, e.g. setting an
> appropriate max heap size. However, what's the best way to recover from
> an OOM once Solr goes into a non-responding state? We are using Tomcat on
> the back end.
> 
> The scenario is that once we hit an OOM, Solr keeps accepting queries
> (it doesn't return any error) but they just time out. So even though we
> have a failover system implemented, we have no way to tell whether these
> are genuine query timeouts or the result of an OOM.
> 
> -- 
> Regards,
> 
> Salman Akram