Re: Recovering from Out of Mem
On 10/21/2014 1:24 AM, Salman Akram wrote:
> Yes so the most imp thing is what's the best way to 'know' that there is
> OOM? Some script of a ping with 1-2 mins time?

To touch on both your question and that posed by Toke Eskildsen:

Java itself has a configuration option to call a program or script when OOME occurs. The idea is that this script should kill the application and start it back up. Lucene and the way it builds indexes have built-in protections that should keep the index from becoming corrupt when the app is killed at an unknown location.

Even relatively simple applications tend to have several layers; complicated ones may have dozens or hundreds of layers. Lucene and Solr are not simple.

Dealing with all possible fallout from OOME in application code is *hard*. It involves extra code and careful planning. Engineering a safe exit from the entire application is even harder, something that even an experienced programmer who's in charge of the entire application might not be able to easily do.

I see a number of try/catch cases in the code where a Throwable is trapped, rather than an Exception. This means that OOME will not result in the entire program dying. We might want OOME to result in the program dying ... but getting there will involve a lot of tedious work examining existing code to determine what errors are possible and how to handle each one specifically, allowing OOME to bubble up to the point where it can kill the app. Even then, it is likely to only kill Solr, not the servlet container ... which means that it probably won't restart without the OOM config option on the JRE.

> The reason I want auto restart or at least some error (so that it can
> switch to another slave) is I want to have a good sleep if something goes
> wrong at night so that the systems keep on working and can look into
> details in the morning. That's the whole purpose of having a fail over
> implemented.
>
> On a side note the instance where we had this OOM didn't have an explicit
> Xmx set (on 64 bit Windows) so in that case is there some default max?
> There was ample mem available so why would it throw OOM?

The default max heap is dependent on the specific Java implementation -- whether it's 32 bit or 64 bit, whether it's a client JVM or a server JVM, etc. And it will usually depend on how much memory the system has installed, too. The first answer on this SO question will let you find out what the default is for your system:

http://stackoverflow.com/questions/4667483/how-is-the-default-java-heap-size-determined

If you're getting OOME, then some aspect of your memory configuration wasn't large enough for your index, configuration, or query pattern. One thing that can get exceeded when everything looks like it should be fine is PermGen. An error stacktrace was never included on this thread, so we have no idea exactly what kind of error we're dealing with.

Thanks,
Shawn
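For the default-heap question, a minimal sketch of reading the value the JVM actually picked. It assumes a HotSpot JVM, where `-XX:+PrintFlagsFinal` prints the final flag values, and a POSIX shell with awk; on Windows the same flag works but the filtering would need `findstr` or similar. The function name `max_heap_mib` is made up for this example.

```shell
# Read `java -XX:+PrintFlagsFinal -version` output on stdin and print the
# effective MaxHeapSize in MiB. Each flag line has the layout
# "<type> <name> <=|:=> <value> {kind}", so the value is field 4.
max_heap_mib() {
    awk '$2 == "MaxHeapSize" { printf "%d\n", $4 / 1048576 }'
}

# Usage (prints nothing if the JVM does not support the flag):
#   java -XX:+PrintFlagsFinal -version 2>/dev/null | max_heap_mib
```

If the number is much smaller than the installed RAM, that would be consistent with an OOME despite "ample mem available", since the heap limit, not free system memory, is what the JVM enforces.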
Re: Recovering from Out of Mem
Yes, so the most important thing is: what's the best way to 'know' that there is an OOM? Some script doing a ping every 1-2 mins?

The reason I want auto restart, or at least some error (so that it can switch to another slave), is that I want to have a good sleep if something goes wrong at night, so that the systems keep on working and I can look into the details in the morning. That's the whole purpose of having a failover implemented.

On a side note, the instance where we had this OOM didn't have an explicit Xmx set (on 64 bit Windows), so in that case is there some default max? There was ample mem available, so why would it throw OOM?

On Mon, Oct 20, 2014 at 9:00 PM, Boogie Shafer wrote:
> i think we can agree that the basic requirement of *knowing* when the OOM
> occurs is the minimal requirement, triggering an alert (email, etc) would
> be the first thing to get into your script

--
Regards,
Salman Akram
Re: Recovering from Out of Mem
On Mon, 2014-10-20 at 16:25 +0200, Shawn Heisey wrote:
> In general, once OOME happens, program operation (and in some cases the
> status of the most recently indexed documents) is completely
> undetermined. We can be sure that the data which has already been
> written to disk will be correct, but nothing beyond that. That's why it
> is considered better to crash the program and restart it for OOME.

Any idea why Lucene/Solr does not do this by itself? It could be optional (with default to "yes, please crash"), but it seems to me that shutting down on OOM would be the right thing to do. The need to set a user-supplied system-specific watch-mechanism on the JVM to get a reliable Solr is a) not done in a lot of cases and b) prone to errors.

If System.exit() is not available due to JVM options or shutdown is not reliable for other reasons, the searcher could be marked as unreliable so that all calls would result in an error "Service unavailable due to OOM. Please restart", forcing action instead of silent "something's wrong, but we don't know what".

- Toke Eskildsen, State and University Library
Re: Recovering from Out of Mem
i think we can agree that the basic requirement of *knowing* when the OOM occurs is the minimal requirement, triggering an alert (email, etc) would be the first thing to get into your script

once you know when the OOM conditions are occurring you can start to get to the root cause or remedy (adjust heap sizes, or adjust the input side that is triggering the OOM). the correct remedy will obviously require some deeper investigation into the actual solr usage at the point of OOM and the gc logs (you have these being generated too i hope). just bumping the Xmx because you hit an OOM during an abusive query is no guarantee of a fix and is likely going to cost you OS cache memory space which you want to leave available for holding the actual index data. the real fix would be cleaning up the query (if that is possible)

fundamentally, it's a preference thing, but i'm personally not a fan of auto restarts as the problem that triggered the original OOM (say an expensive poorly constructed query) may just come back and you get into an oscillating situation of restart after restart. i generally want a human involved when error conditions which should be outliers (like OOM) are happening

From: Salman Akram
Sent: Monday, October 20, 2014 08:47
To: Solr Group
Subject: Re: Recovering from Out of Mem

" That's why it is considered better to crash the program and restart it for OOME."

In the end aren't you also saying the same thing or I misunderstood something?
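The restart-after-restart oscillation described in this message can be bounded with a simple guard. The following is only a sketch of that idea: the function name `restart_allowed`, the timestamp log file, and the limits are all hypothetical, and it assumes a POSIX shell with awk.

```shell
# Allow an automatic restart only if fewer than $3 restarts were recorded in
# the last $4 seconds. $1 is a file with one epoch timestamp per past restart,
# $2 is the current epoch time. Returns 0 (and records the restart) when a
# restart is allowed, 1 when a human should look at it instead.
restart_allowed() {
    log_file=$1
    now=$2
    max=$3
    window=$4
    [ -f "$log_file" ] || : > "$log_file"
    # Count the recorded restarts that fall inside the window ending at $now.
    recent=$(awk -v now="$now" -v w="$window" \
        '$1 > now - w { n++ } END { print n + 0 }' "$log_file")
    if [ "$recent" -ge "$max" ]; then
        return 1
    fi
    echo "$now" >> "$log_file"
    return 0
}
```

An OOM hook or supervisor wrapper could call this before restarting and fall back to an alert-only path (for example the mail one-liner from earlier in the thread) once the limit is hit, which keeps a repeating bad query from cycling the server all night.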
Re: Recovering from Out of Mem
"That's why it is considered better to crash the program and restart it for OOME."

In the end, aren't you also saying the same thing, or have I misunderstood something?

We don't get this issue on the master server (indexing). Our real concern is the slave, where it happens only sometimes (rarely), so it is not an obvious heap config issue. But when it happens our failover doesn't even work (moving to another slave) as there is no error, so I just want a good way to know if there is an OOM and shift to a failover or just have that server restarted.

On Mon, Oct 20, 2014 at 7:25 PM, Shawn Heisey wrote:
> On 10/19/2014 11:32 PM, Ramzi Alqrainy wrote:
> > You can create a script to ping on Solr every 10 sec. if no response, then
> > restart it (Kill process id and run Solr again).
> > This is the fastest and easiest way to do that on windows.
>
> I wouldn't do this myself. Any temporary problem that results in a long
> query time might result in a true outage while Solr restarts. If OOME
> is a problem, then you can deal with that by providing a program for
> Java to call when OOME occurs.

--
Regards,
Salman Akram
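One way to give a failover check something concrete to test, sketched under assumptions: the script registered with `-XX:OnOutOfMemoryError` leaves a marker file behind, and the health check used by the failover treats that marker as "unhealthy". The marker name `solr.oom` and both function names are hypothetical and not part of Solr or Tomcat.

```shell
# Called by the OOM hook script: record that an OutOfMemoryError was seen.
# $1 is a directory both the hook and the health check can access.
mark_oom() {
    date > "$1/solr.oom"
}

# Health check: exit status 0 while no OOM marker exists, 1 once one does.
is_healthy() {
    [ ! -e "$1/solr.oom" ]
}
```

The marker would have to be removed as part of a clean restart, otherwise the node stays marked unhealthy after it has recovered.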
Re: Recovering from Out of Mem
On 10/19/2014 11:32 PM, Ramzi Alqrainy wrote:
> You can create a script to ping on Solr every 10 sec. if no response, then
> restart it (Kill process id and run Solr again).
> This is the fastest and easiest way to do that on windows.

I wouldn't do this myself. Any temporary problem that results in a long query time might result in a true outage while Solr restarts. If OOME is a problem, then you can deal with that by providing a program for Java to call when OOME occurs.

Sending notification when ping times get excessive is a good idea, but I wouldn't make it automatically restart, unless you've got a threshold for that action so it only happens when the ping time is *REALLY* high.

The real fix for OOME is to make the heap larger or to reduce the heap requirements by changing how Solr is configured or used.

http://wiki.apache.org/solr/SolrPerformanceProblems#Java_Heap

Writing a program that has deterministic behavior in an out of memory condition is very difficult. The Lucene devs *have* done this hard work in the lower levels of IndexWriter and the specific Directory implementations, so that OOME doesn't cause *index corruption*.

In general, once OOME happens, program operation (and in some cases the status of the most recently indexed documents) is completely undetermined. We can be sure that the data which has already been written to disk will be correct, but nothing beyond that. That's why it is considered better to crash the program and restart it for OOME.

Thanks,
Shawn
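The two-threshold idea above (notify early, restart only when the ping time is extreme) might look like the sketch below. The function name `classify_ping` and the threshold values in the usage comment are illustrative assumptions, not recommendations from the thread.

```shell
# Map a measured ping time to an action.
# $1 = ping time in ms, $2 = alert threshold in ms, $3 = restart threshold in ms.
classify_ping() {
    ms=$1
    alert_ms=$2
    restart_ms=$3
    if [ "$ms" -ge "$restart_ms" ]; then
        echo restart
    elif [ "$ms" -ge "$alert_ms" ]; then
        echo alert
    else
        echo ok
    fi
}

# Usage: alert at 2 s, consider a restart only at 60 s.
#   classify_ping "$measured_ms" 2000 60000
```

Keeping the restart threshold far above the alert threshold is what prevents a merely slow query from turning into a real outage.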
Re: Recovering from Out of Mem
I assume you will have to write a script to restart the service as well?

On Fri, Oct 17, 2014 at 7:17 PM, Tim Potter wrote:
> You'd still want to kill it ... so you'll need to register a cmd script
> with the JVM using -XX:OnOutOfMemoryError=kill.cmd ...
Re: Recovering from Out of Mem
You can create a script to ping Solr every 10 seconds; if there is no response, restart it (kill the process ID and run Solr again). This is the fastest and easiest way to do that on Windows.
Re: Recovering from Out of Mem
You'd still want to kill it ... so you'll need to register a cmd script with the JVM using -XX:OnOutOfMemoryError=kill.cmd and then you could either

1) trap the PID at startup using something like:

    setlocal enabledelayedexpansion
    title SolrCloud
    for /F "tokens=2 delims= " %%A in ('TASKLIST /FI ^"WINDOWTITLE eq SolrCloud^" /NH') do (
      set /A SOLR_PID=%%A
      REM !SOLR_PID! is only expanded when delayed expansion is enabled
      echo !SOLR_PID!>solr.pid
    )

or

2) if you keep track of the port (which all my Windows scripts do), then you can do:

    For /f "tokens=5" %%j in ('netstat -aon ^| find /i "listening" ^| find ":%SOLR_PORT%"') do (
      taskkill /t /f /pid %%j > nul 2>&1
    )

On Fri, Oct 17, 2014 at 1:11 AM, Salman Akram <salman.ak...@northbaysolutions.net> wrote:
> I know this might sound weird but any easy way to do it in Windows?
Re: Recovering from Out of Mem
I know this might sound weird but any easy way to do it in Windows?

On Tue, Oct 14, 2014 at 7:51 PM, Boogie Shafer wrote:
> yago,
>
> you can put more complex restart logic as shown in the examples below or
> just do something similar to the java_oom.sh i posted earlier where you
> just spit out an email alert and deal with service restarts and
> troubleshooting manually

--
Regards,
Salman Akram
Re: Recovering from Out of Mem
yago,

you can put more complex restart logic as shown in the examples below or just do something similar to the java_oom.sh i posted earlier where you just spit out an email alert and deal with service restarts and troubleshooting manually

e.g. something like the following for a java_error.sh will drop an email with a timestamp

    echo `date` | mail -s "Java Error: General - $HOSTNAME" not...@domain.com

From: Tim Potter
Sent: Tuesday, October 14, 2014 07:35
To: solr-user@lucene.apache.org
Subject: Re: Recovering from Out of Mem

jfyi - the bin/solr script does the following:

-XX:OnOutOfMemoryError="$SOLR_TIP/bin/oom_solr.sh $SOLR_PORT" where $SOLR_PORT is the port Solr is bound to, e.g. 8983
Re: Recovering from Out of Mem
jfyi - the bin/solr script does the following: -XX:OnOutOfMemoryError="$SOLR_TIP/bin/oom_solr.sh $SOLR_PORT" where $SOLR_PORT is the port Solr is bound to, e.g. 8983 The oom_solr.sh script looks like: SOLR_PORT=$1 SOLR_PID=`ps waux | grep start.jar | grep $SOLR_PORT | grep -v grep | awk '{print $2}' | sort -r` if [ "$SOLR_PID" == "" ]; then echo "Couldn't find Solr process running on port $SOLR_PORT!" exit fi NOW=$(date +"%F%T") ( echo "Running OOM killer script for process $SOLR_PID for Solr on port $SOLR_PORT" kill -9 $SOLR_PID echo "Killed process $SOLR_PID" ) | tee solr_oom_killer-$SOLR_PORT-$NOW.log I usually run Solr behind a supervisor type process (supervisord or upstart) that will restart it if the process dies. On Tue, Oct 14, 2014 at 8:09 AM, Markus Jelsma wrote: > This will do: > kill -9 `ps aux | grep -v grep | grep tomcat6 | awk '{print $2}'` > > pkill should also work > > On Tuesday 14 October 2014 07:02:03 Yago Riveiro wrote: > > Boogie, > > > > > > > > > > Any example for java_error.sh script? > > > > > > — > > /Yago Riveiro > > > > On Tue, Oct 14, 2014 at 2:48 PM, Boogie Shafer < > boogie.sha...@proquest.com> > > > > wrote: > > > a really simple approach is to have the OOM generate an email > > > e.g. 
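The PID lookup in the oom_solr.sh script above chains several grep calls. As a rough sketch only, the same filter can be written as one awk call that reads ps-style lines on stdin. The function name `pids_for_port` is hypothetical and is not part of bin/solr. Like the original pipeline, it assumes the port number appears somewhere on the Solr command line and that the PID is the second column.

```shell
# Hypothetical helper (not part of bin/solr): print the PID column of every
# ps-style line on stdin that mentions both start.jar and the given port.
# The port test is a plain substring match, so any line containing the same
# digits elsewhere would also match, as with the original grep pipeline.
pids_for_port() {
  awk -v port="$1" '/start\.jar/ && index($0, port) > 0 { print $2 }'
}

# Usage (assumption: a ps that accepts the same flags as in the thread):
#   ps waux | pids_for_port 8983
```

Because it only reads stdin, the filter can be checked against canned ps output without a running Solr.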
Re: Recovering from Out of Mem
This will do:

    kill -9 `ps aux | grep -v grep | grep tomcat6 | awk '{print $2}'`

pkill should also work

On Tuesday 14 October 2014 07:02:03 Yago Riveiro wrote:
> Boogie,
>
> Any example for java_error.sh script?
>
> --
> /Yago Riveiro
>
> On Tue, Oct 14, 2014 at 2:48 PM, Boogie Shafer wrote:
> > a really simple approach is to have the OOM generate an email
> > e.g.
> > 1) create a simple script (call it java_oom.sh) and drop it in your tomcat
> > bin dir
> >
> >     echo `date` | mail -s "Java Error: OutOfMemory - $HOSTNAME" not...@domain.com
> >
> > 2) configure your java options (in setenv.sh or similar) to trigger heap
> > dump and the email script when OOM occurs
> >
> >     # config error behaviors
> >     CATALINA_OPTS="$CATALINA_OPTS -XX:+HeapDumpOnOutOfMemoryError
> >     -XX:HeapDumpPath=$TOMCAT_DIR/temp/tomcat-dump.hprof
> >     -XX:OnError=$TOMCAT_DIR/bin/java_error.sh
> >     -XX:OnOutOfMemoryError=$TOMCAT_DIR/bin/java_oom.sh
> >     -XX:ErrorFile=$TOMCAT_DIR/temp/java_error%p.log"
> >
> > ____________________
> > From: Mark Miller
> > Sent: Tuesday, October 14, 2014 06:30
> > To: solr-user@lucene.apache.org
> > Subject: Re: Recovering from Out of Mem
> >
> > Best is to pass the Java cmd line option that kills the process on OOM and
> > setup a supervisor on the process to restart it. You need a somewhat
> > recent release for this to work properly though. - Mark
> >
> >> On Oct 14, 2014, at 9:06 AM, Salman Akram wrote:
> >>
> >> I know there are some suggestions to avoid OOM issue e.g. setting
> >> appropriate Max Heap size etc. However, what's the best way to recover
> >> from it as it goes into non-responding state? We are using Tomcat on back end.
> >>
> >> The scenario is that once we face OOM issue it keeps on taking queries
> >> (doesn't give any error) but they just time out. So even though we have a
> >> fail over system implemented but we don't have a way to distinguish if
> >> these are real time out queries OR due to OOM.
> >>
> >> --
> >> Regards,
> >>
> >> Salman Akram
Re: Recovering from Out of Mem
Boogie,

Any example for java_error.sh script?

--
/Yago Riveiro

On Tue, Oct 14, 2014 at 2:48 PM, Boogie Shafer wrote:
> a really simple approach is to have the OOM generate an email
> e.g.
> 1) create a simple script (call it java_oom.sh) and drop it in your tomcat
> bin dir
>
>     echo `date` | mail -s "Java Error: OutOfMemory - $HOSTNAME" not...@domain.com
>
> 2) configure your java options (in setenv.sh or similar) to trigger heap dump
> and the email script when OOM occurs
>
>     # config error behaviors
>     CATALINA_OPTS="$CATALINA_OPTS -XX:+HeapDumpOnOutOfMemoryError
>     -XX:HeapDumpPath=$TOMCAT_DIR/temp/tomcat-dump.hprof
>     -XX:OnError=$TOMCAT_DIR/bin/java_error.sh
>     -XX:OnOutOfMemoryError=$TOMCAT_DIR/bin/java_oom.sh
>     -XX:ErrorFile=$TOMCAT_DIR/temp/java_error%p.log"
>
> From: Mark Miller
> Sent: Tuesday, October 14, 2014 06:30
> To: solr-user@lucene.apache.org
> Subject: Re: Recovering from Out of Mem
>
> Best is to pass the Java cmd line option that kills the process on OOM and
> setup a supervisor on the process to restart it. You need a somewhat recent
> release for this to work properly though.
> - Mark
>
>> On Oct 14, 2014, at 9:06 AM, Salman Akram wrote:
>>
>> I know there are some suggestions to avoid OOM issue e.g. setting
>> appropriate Max Heap size etc. However, what's the best way to recover from
>> it as it goes into non-responding state? We are using Tomcat on back end.
>>
>> The scenario is that once we face OOM issue it keeps on taking queries
>> (doesn't give any error) but they just time out. So even though we have a
>> fail over system implemented but we don't have a way to distinguish if
>> these are real time out queries OR due to OOM.
>>
>> --
>> Regards,
>>
>> Salman Akram
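The thread does not show a java_error.sh, so what follows is only a guess at what one might contain, modelled on the java_oom.sh email idea quoted above. It assumes the JVM is started with -XX:OnError="$TOMCAT_DIR/bin/java_error.sh %p" so that the script receives the JVM's PID as its first argument. The function names, the log path, and the NOTIFY_TO variable are all hypothetical.

```shell
# Hypothetical java_error.sh helpers; nothing here comes from Tomcat or Solr.

# Append one line describing the fatal error to a log file.
log_fatal_error() {
  pid=$1
  logfile=$2
  printf '%s Java fatal error, pid %s, host %s\n' \
    "$(date '+%Y-%m-%d %H:%M:%S')" "$pid" "${HOSTNAME:-$(uname -n)}" >> "$logfile"
}

# Mail the newest log line, but only if a mail command exists and
# NOTIFY_TO (an assumed variable) names a recipient.
notify_fatal_error() {
  logfile=$1
  if command -v mail >/dev/null 2>&1 && [ -n "${NOTIFY_TO:-}" ]; then
    tail -n 1 "$logfile" | mail -s "Java Error: fatal - ${HOSTNAME:-$(uname -n)}" "$NOTIFY_TO"
  fi
}

# Usage inside the script body (paths are assumptions):
#   log_fatal_error "$1" "$TOMCAT_DIR/temp/java_error_hook.log"
#   notify_fatal_error "$TOMCAT_DIR/temp/java_error_hook.log"
```

Writing to a local log first means there is still a record if mail is not configured on the host.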
Re: Recovering from Out of Mem
And don't forget to set the proper permissions on the script for the tomcat or jetty user.

Markus

On Tuesday 14 October 2014 13:47:47 Boogie Shafer wrote:
> a really simple approach is to have the OOM generate an email
>
> e.g.
>
> 1) create a simple script (call it java_oom.sh) and drop it in your tomcat
> bin dir
>
>     echo `date` | mail -s "Java Error: OutOfMemory - $HOSTNAME" not...@domain.com
>
> 2) configure your java options (in setenv.sh or similar) to trigger heap
> dump and the email script when OOM occurs
>
>     # config error behaviors
>     CATALINA_OPTS="$CATALINA_OPTS -XX:+HeapDumpOnOutOfMemoryError
>     -XX:HeapDumpPath=$TOMCAT_DIR/temp/tomcat-dump.hprof
>     -XX:OnError=$TOMCAT_DIR/bin/java_error.sh
>     -XX:OnOutOfMemoryError=$TOMCAT_DIR/bin/java_oom.sh
>     -XX:ErrorFile=$TOMCAT_DIR/temp/java_error%p.log"
>
> From: Mark Miller
> Sent: Tuesday, October 14, 2014 06:30
> To: solr-user@lucene.apache.org
> Subject: Re: Recovering from Out of Mem
>
> Best is to pass the Java cmd line option that kills the process on OOM and
> setup a supervisor on the process to restart it. You need a somewhat
> recent release for this to work properly though.
>
> - Mark
>
> > On Oct 14, 2014, at 9:06 AM, Salman Akram wrote:
> >
> > I know there are some suggestions to avoid OOM issue e.g. setting
> > appropriate Max Heap size etc. However, what's the best way to recover
> > from it as it goes into non-responding state? We are using Tomcat on back end.
> >
> > The scenario is that once we face OOM issue it keeps on taking queries
> > (doesn't give any error) but they just time out. So even though we have a
> > fail over system implemented but we don't have a way to distinguish if
> > these are real time out queries OR due to OOM.
> >
> > --
> > Regards,
> >
> > Salman Akram
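The permissions point can be sketched roughly as below. The helper name `prepare_hook_script`, the mode 750, and the tomcat:tomcat owner are illustrative assumptions. The right owner is whichever account actually runs the servlet container on your host.

```shell
# Hypothetical helper: make a hook script executable for its owner and group,
# and optionally hand it over to the account that runs the container.
prepare_hook_script() {
  script=$1
  owner=${2:-}
  chmod 750 "$script" || return 1
  if [ -n "$owner" ]; then
    # Changing ownership usually requires root.
    chown "$owner" "$script" || return 1
  fi
}

# Usage (run as root; owner name is an assumption):
#   prepare_hook_script "$TOMCAT_DIR/bin/java_oom.sh" tomcat:tomcat
```

If the JVM cannot execute the script, the OOM hook will not do its job, so it is worth checking with `ls -l` before relying on it.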
Re: Recovering from Out of Mem
a really simple approach is to have the OOM generate an email

e.g.

1) create a simple script (call it java_oom.sh) and drop it in your tomcat bin dir

    echo `date` | mail -s "Java Error: OutOfMemory - $HOSTNAME" not...@domain.com

2) configure your java options (in setenv.sh or similar) to trigger heap dump and the email script when OOM occurs

    # config error behaviors
    CATALINA_OPTS="$CATALINA_OPTS -XX:+HeapDumpOnOutOfMemoryError
    -XX:HeapDumpPath=$TOMCAT_DIR/temp/tomcat-dump.hprof
    -XX:OnError=$TOMCAT_DIR/bin/java_error.sh
    -XX:OnOutOfMemoryError=$TOMCAT_DIR/bin/java_oom.sh
    -XX:ErrorFile=$TOMCAT_DIR/temp/java_error%p.log"

From: Mark Miller
Sent: Tuesday, October 14, 2014 06:30
To: solr-user@lucene.apache.org
Subject: Re: Recovering from Out of Mem

Best is to pass the Java cmd line option that kills the process on OOM and setup a supervisor on the process to restart it. You need a somewhat recent release for this to work properly though.

- Mark

> On Oct 14, 2014, at 9:06 AM, Salman Akram wrote:
>
> I know there are some suggestions to avoid OOM issue e.g. setting
> appropriate Max Heap size etc. However, what's the best way to recover from
> it as it goes into non-responding state? We are using Tomcat on back end.
>
> The scenario is that once we face OOM issue it keeps on taking queries
> (doesn't give any error) but they just time out. So even though we have a
> fail over system implemented but we don't have a way to distinguish if
> these are real time out queries OR due to OOM.
>
> --
> Regards,
>
> Salman Akram
Re: Recovering from Out of Mem
Best is to pass the Java cmd line option that kills the process on OOM and set up a supervisor on the process to restart it. You need a somewhat recent release for this to work properly though.

- Mark

> On Oct 14, 2014, at 9:06 AM, Salman Akram wrote:
>
> I know there are some suggestions to avoid OOM issue e.g. setting
> appropriate Max Heap size etc. However, what's the best way to recover from
> it as it goes into non-responding state? We are using Tomcat on back end.
>
> The scenario is that once we face OOM issue it keeps on taking queries
> (doesn't give any error) but they just time out. So even though we have a
> fail over system implemented but we don't have a way to distinguish if
> these are real time out queries OR due to OOM.
>
> --
> Regards,
>
> Salman Akram
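To make the "kill on OOM, then let a supervisor restart it" idea concrete, here is a minimal sketch. The JVM flag in the comment is one common way to have HotSpot kill itself on the first OutOfMemoryError. The `supervise` function and the RESTART_DELAY variable are hypothetical, a toy stand-in for a real supervisor such as supervisord or upstart rather than a replacement for one.

```shell
# Assumed JVM flag (HotSpot expands %p to the JVM's own PID):
#   java -XX:OnOutOfMemoryError="kill -9 %p" -jar start.jar

# Hypothetical minimal supervisor: rerun a command until it exits cleanly
# or until it has been restarted max_restarts times.
supervise() {
  max_restarts=$1
  shift
  restarts=0
  while :; do
    if "$@"; then
      return 0
    else
      status=$?
    fi
    restarts=$((restarts + 1))
    if [ "$restarts" -gt "$max_restarts" ]; then
      # Give up and report the last exit status.
      return "$status"
    fi
    # Pause between restarts; RESTART_DELAY is an assumed tunable.
    sleep "${RESTART_DELAY:-5}"
  done
}

# Usage (command line is an assumption):
#   supervise 10 java -XX:OnOutOfMemoryError="kill -9 %p" -jar start.jar
```

The restart cap keeps a process that dies immediately on every start from looping forever. A real supervisor adds logging, backoff, and start at boot on top of this.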