I’ll add that whenever I’ve had a Solr instance shut down, for me it’s been a 
hardware failure. Either the RAM or the disk got a “glitch”. Both of these are 
relatively fragile, wear-and-tear parts of the machine, and should be expected 
to fail and be replaced from time to time. Solr is pretty aggressive with its 
logging, so there are always a lot of writes happening, and of course reads. 
If the disk or the memory has any issues, it can lock Solr up and bring it 
down, more so if you have any spellcheck dictionaries or suggesters being 
built on startup. 

This is just my experience and could be wrong (most likely wrong), but we 
always keep extra drives and memory around the server room for this reason. At 
least once or twice a year we have a disk failure in the RAID and need to swap 
in a new one. 

Good luck, though. Solr should also be logging its failures, so it would be 
good to look there too.
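
For looking through the logs, here is a minimal sketch in Java that pulls out 
any line mentioning the OutOfMemoryError Shawn describes below. The class name 
OomLogScan, the method findOomLines, and the default file name are all 
hypothetical; pass the real path to your solr.log as the first argument. As 
Shawn notes, the exception is not always logged, so an empty result does not 
rule it out.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class OomLogScan {
    // Returns every line of the given log file that mentions OutOfMemoryError.
    static List<String> findOomLines(Path log) throws IOException {
        try (Stream<String> lines = Files.lines(log)) {
            return lines.filter(line -> line.contains("OutOfMemoryError"))
                        .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical default; pass the real location of solr.log instead.
        Path log = Paths.get(args.length > 0 ? args[0] : "solr.log");
        findOomLines(log).forEach(System.out::println);
    }
}
```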

> On Jun 9, 2020, at 2:35 AM, Shawn Heisey <apa...@elyograg.org> wrote:
> 
> On 5/14/2020 7:22 AM, Ryan W wrote:
>> I manage a site where solr has stopped running a couple times in the past
>> week. The server hasn't been rebooted, so that's not the reason.  What else
>> causes solr to stop running?  How can I investigate why this is happening?
> 
> Any situation where Solr stops running and nobody requested the stop is a 
> result of a serious problem that must be thoroughly investigated.  I think 
> it's a bad idea for Solr to automatically restart when it stops unexpectedly. 
>  Chances are that whatever caused the crash is going to simply make the crash 
> happen again until the problem is solved. Automatically restarting could hide 
> problems from the system administrator.
> 
> The only way a Solr auto-restart would be acceptable to me is if it sends a 
> high priority alert to the sysadmin EVERY time it executes an auto-restart.  
> It really is that bad of a problem.
> 
> The causes of Solr crashes (that I can think of) include the following. I 
> believe I have listed these four options from most likely to least likely:
> 
> * Java OutOfMemoryError exceptions.  On non-windows systems, the "bin/solr" 
> script starts Solr with an option that results in Solr's death anytime one of 
> these exceptions occurs.  We do this because program operation is 
> indeterminate and completely unpredictable when OOME occurs, so it's far 
> safer to stop running.  That exception can be caused by several things, some 
> of which actually do not involve memory at all.  If you're running on Windows 
> via the bin\solr.cmd command, then this will not happen ... but OOME could 
> still cause a crash, because as I already mentioned, program operation is 
> unpredictable when OOME occurs.
> 
> * The OS kills Solr because system memory is completely exhausted and Solr is 
> the process using the most memory.  Linux calls this the "oom-killer" ... I 
> am pretty sure something like it exists on most operating systems.
> 
> * Corruption somewhere in the system.  Could be in Java, the OS, Solr, or 
> data used by any of those.
> 
> * A very serious bug in Solr's code that we haven't discovered yet.
> 
> I included that last one simply for completeness.  A bug that causes a crash 
> *COULD* exist, but as of right now, we have not seen any supporting evidence.
> 
> My guess is that Java OutOfMemoryError is the cause here, but I can't be 
> certain.  If that is happening, then some resource (which might not be 
> memory) is fully depleted.  We would need to see the full OutOfMemoryError 
> exception in order to determine why it is happening. Sometimes the exception 
> is logged in solr.log, sometimes it isn't.  We cannot predict what part of 
> the code will be running when OOME occurs, so it would be nearly impossible 
> for us to guarantee logging.  OOME can happen ANYWHERE - even in code that 
> the compiler thinks is immune to exceptions.
> 
> Side note to fellow committers:  I wonder if we should implement an uncaught 
> exception handler in Solr.  I have found in my own programs that it helps 
> figure out thorny problems.  And while I am on the subject of handlers that 
> might not be general knowledge, I didn't find a shutdown hook or a security 
> manager outside of tests.
> 
> Thanks,
> Shawn

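On Shawn's side note about an uncaught exception handler: a minimal sketch of 
what that looks like in plain Java, using the JDK's 
Thread.setDefaultUncaughtExceptionHandler. This is not Solr code, and the 
class, field, and thread names are made up for illustration. With an 
OutOfMemoryError the handler itself may not get to run, so it is an aid for 
diagnosis, not a guarantee.

```java
public class UncaughtHandlerDemo {
    // Last throwable seen by the handler, kept so callers can inspect it.
    static volatile Throwable lastUncaught;

    // Installs a JVM-wide handler for exceptions that no thread catches.
    static void install() {
        Thread.setDefaultUncaughtExceptionHandler((thread, error) -> {
            lastUncaught = error;
            System.err.println("Uncaught in thread " + thread.getName() + ": " + error);
        });
    }

    public static void main(String[] args) throws InterruptedException {
        install();
        // Hypothetical worker that dies with an exception nobody catches.
        Thread worker = new Thread(() -> {
            throw new IllegalStateException("boom");
        }, "worker-1");
        worker.start();
        worker.join();
    }
}
```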