To add to what Dave said, if you have a particular machine that’s prone to suddenly stopping, that’s usually a red flag that you should seriously think about hardware issues.
If the problem strikes different machines, then I agree with Shawn that the first thing I’d be suspicious of is OOM errors.

FWIW,
Erick

> On Jun 9, 2020, at 6:05 AM, Dave <hastings.recurs...@gmail.com> wrote:
>
> I’ll add that whenever I’ve had a solr instance shut down, for me it’s been a
> hardware failure. Either the ram or the disk got a “glitch” and both of these
> are relatively fragile and wear and tear type parts of the machine, and
> should be expected to fail and be replaced from time to time. Solr is pretty
> aggressive with its logging so there are a lot of writes always happening and
> of course reads, if the disk has any issues or the memory it can lock it up
> and bring her down, more so if you have any spellcheck dictionaries or
> suggesters being built on start up.
>
> Just my experience with this, could be wrong (most likely wrong) but we
> always have extra drives and memory around the server room for this reason.
> At least once or twice a year we will have a disk failure in the raid and
> need to swap in a new one.
>
> Good luck though, also solr should be logging it’s failures so it would be
> good to look there too
>
>> On Jun 9, 2020, at 2:35 AM, Shawn Heisey <apa...@elyograg.org> wrote:
>>
>> On 5/14/2020 7:22 AM, Ryan W wrote:
>>> I manage a site where solr has stopped running a couple times in the past
>>> week. The server hasn't been rebooted, so that's not the reason. What else
>>> causes solr to stop running? How can I investigate why this is happening?
>>
>> Any situation where Solr stops running and nobody requested the stop is a
>> result of a serious problem that must be thoroughly investigated. I think
>> it's a bad idea for Solr to automatically restart when it stops
>> unexpectedly. Chances are that whatever caused the crash is going to simply
>> make the crash happen again until the problem is solved. Automatically
>> restarting could hide problems from the system administrator.
>>
>> The only way a Solr auto-restart would be acceptable to me is if it sends a
>> high priority alert to the sysadmin EVERY time it executes an auto-restart.
>> It really is that bad of a problem.
>>
>> The causes of Solr crashes (that I can think of) include the following. I
>> believe I have listed these four options from most likely to least likely:
>>
>> * Java OutOfMemoryError exceptions. On non-windows systems, the "bin/solr"
>> script starts Solr with an option that results in Solr's death anytime one
>> of these exceptions occurs. We do this because program operation is
>> indeterminate and completely unpredictable when OOME occurs, so it's far
>> safer to stop running. That exception can be caused by several things, some
>> of which actually do not involve memory at all. If you're running on
>> Windows via the bin\solr.cmd command, then this will not happen ... but OOME
>> could still cause a crash, because as I already mentioned, program operation
>> is unpredictable when OOME occurs.
>>
>> * The OS kills Solr because system memory is completely exhausted and Solr
>> is the process using the most memory. Linux calls this the "oom-killer" ...
>> I am pretty sure something like it exists on most operating systems.
>>
>> * Corruption somewhere in the system. Could be in Java, the OS, Solr, or
>> data used by any of those.
>>
>> * A very serious bug in Solr's code that we haven't discovered yet.
>>
>> I included that last one simply for completeness. A bug that causes a crash
>> *COULD* exist, but as of right now, we have not seen any supporting evidence.
>>
>> My guess is that Java OutOfMemoryError is the cause here, but I can't be
>> certain. If that is happening, then some resource (which might not be
>> memory) is fully depleted. We would need to see the full OutOfMemoryError
>> exception in order to determine why it is happening. Sometimes the exception
>> is logged in solr.log, sometimes it isn't. We cannot predict what part of
>> the code will be running when OOME occurs, so it would be nearly impossible
>> for us to guarantee logging. OOME can happen ANYWHERE - even in code that
>> the compiler thinks is immune to exceptions.
>>
>> Side note to fellow committers: I wonder if we should implement an uncaught
>> exception handler in Solr. I have found in my own programs that it helps
>> figure out thorny problems. And while I am on the subject of handlers that
>> might not be general knowledge, I didn't find a shutdown hook or a security
>> manager outside of tests.
>>
>> Thanks,
>> Shawn
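
On Shawn's side note about an uncaught exception handler and a shutdown hook: below is a minimal sketch in plain Java of what such handlers could look like in a standalone program. It is not Solr's code and not a proposed patch. The class name CrashDiagnostics, its methods, and the message formats are all hypothetical; the only real APIs used are the standard JDK calls Thread.setDefaultUncaughtExceptionHandler and Runtime.addShutdownHook. As Shawn points out, logging after an OutOfMemoryError cannot be guaranteed, so this is best effort at most.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.time.Instant;

/** Hypothetical helper (not part of Solr): records why a JVM is going down. */
final class CrashDiagnostics {

    private CrashDiagnostics() {}

    /** Formats one log entry for an exception that escaped a thread. */
    static String describe(Thread thread, Throwable error) {
        StringWriter trace = new StringWriter();
        error.printStackTrace(new PrintWriter(trace, true));
        return "FATAL uncaught exception in thread '" + thread.getName() + "': " + trace;
    }

    /** Installs a JVM-wide uncaught exception handler and a shutdown hook. */
    static void install() {
        Thread.setDefaultUncaughtExceptionHandler((thread, error) -> {
            // Best effort only: after an OutOfMemoryError even this may fail.
            System.err.println(Instant.now() + " " + describe(thread, error));
        });

        // Runs on normal exit and on SIGTERM, but not on SIGKILL or a hard JVM crash.
        Runtime.getRuntime().addShutdownHook(new Thread(
                () -> System.err.println(Instant.now() + " JVM shutting down"),
                "shutdown-logger"));
    }

    /** Demo: a worker thread dies with an exception nobody catches. */
    public static void main(String[] args) throws InterruptedException {
        install();
        Thread worker = new Thread(() -> {
            throw new IllegalStateException("simulated failure");
        }, "demo-worker");
        worker.start();
        worker.join();
    }
}
```

One thing to keep in mind with this approach: a shutdown hook does not run when the process receives SIGKILL, which is what the Linux oom-killer sends. So a log that stops without the hypothetical "JVM shutting down" line would itself suggest the process was killed from outside rather than stopped cleanly.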