I agree with Otis' response. To add a few details: there is a ".out" file in the logs/ directory that captures the stdout of each of these daemons, and in case of an OOM crash it prints something like this:

# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill -9 %p"
#   Executing /bin/sh -c "kill -9 <pid>"...

On Tue, Dec 2, 2014 at 11:06 AM, Otis Gospodnetic <[email protected]> wrote:

> Hi Ming,
>
> 1) There typically is an OOM message from the JVM itself
>
> 2) I would monitor the server instead of relying on log messages mentioning
> OOMs. For example, in SPM <http://sematext.com/spm/> we have "hearbeat
> alerts" that tell us when we stop hearing from RegionServers and other
> types of servers. It also helps when servers simply die for reasons other
> than OOM.
>
> 3) You could (should?) monitor individual memory pools and possibly set
> alerts or anomaly detection on those. If you have that, if there was an
> OOM, you will typically see one of the memory pools approach 100%
> utilization. I personally really like this report in SPM because it gives
> a bit more insight than just "heap size/utilization". So I'd point the
> admin to this sort of monitoring report.
>
> 4) High GC counts/time, or jump in those metrics, and then typically also
> jump in CPU usage is what often precedes OOMs.
>
> Otis
> --
> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> Solr & Elasticsearch Support * http://sematext.com/
>
>
> On Tue, Dec 2, 2014 at 12:22 AM, Liu, Ming (HPIT-GADSC) <[email protected]>
> wrote:
>
> > Hi, all,
> >
> > Recently, one of our HBase 0.98.5 instance meet with issues: when run some
> > specific workload, all region servers will suddenly shut down at same time,
> > but master is still running. When I check the log, in master log, I can see
> > messages like
> > 2014-12-01 08:28:11,072 DEBUG [main-EventThread] master.ServerManager:
> > Added=n008.cluster,60020,1417413986550 to dead servers, submitted shutdown
> > handler to be executed meta=false
> > And on n008, regionserver log file, there is no ERROR message, the last
> > log entry looks very like a ZooKeeper startup message. The log just stopped
> > with that last ZooKeeper startup message, and the Region Server process was
> > gone when we check with 'jps'.
> >
> > We then increased the heap size of regionserver, and it work fine.
> > RegionServer no longer disappear. So we doubt there was a Out Of Memory
> > issue, so the region server processes are killed. But my questions are:
> >
> > 1. What log message will indicate there is a OOM? Since the region
> > server is 'kill -9', so I think there is no message can tell this.
> >
> > 2. If there is no typical log message about OOM, then how can an
> > admin make sure there is a region server OOM happened? We just guess, but
> > can not make sure. We hope there is a method to tell OOM occured for sure.
> >
> > 3. Does the Zookeeper message appears every time with RegionServer
> > OOM (if it is a OOM). Or it is just a random event just in our system?
> >
> > So in sum, I want to know what is the typical clue that people can make
> > sure there is a OOM issue in HBase region server?
> >
> > Thank you,
> > Ming
> >
>

-- 
Bharath Vissapragada
<http://www.cloudera.com>
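
P.S. In case it helps, here is a rough sketch of how an admin could check the ".out" files for the JVM's OOM marker instead of reading them by hand. It is only an illustration: the function name find_oom_lines is made up for this example, and the "*.out*" pattern assumes the files sit directly in the logs directory and that rotated copies keep ".out" in their names. Adjust both for your installation.

```python
from pathlib import Path

# Marker the JVM writes to stdout on an OOM, as in the sample output above.
OOM_MARKER = "java.lang.OutOfMemoryError"


def find_oom_lines(log_dir):
    """Return (file name, line number, line) for each OOM marker found.

    Only files whose names match "*.out*" directly inside log_dir are
    scanned (an assumption about how the daemon stdout files are named).
    """
    hits = []
    for out_file in sorted(Path(log_dir).glob("*.out*")):
        if not out_file.is_file():
            continue
        # errors="replace" keeps the scan going if a file has odd bytes.
        with out_file.open(errors="replace") as fh:
            for lineno, line in enumerate(fh, start=1):
                if OOM_MARKER in line:
                    hits.append((out_file.name, lineno, line.rstrip()))
    return hits
```

Calling find_oom_lines("logs") from the HBase install directory (again, the path is an assumption) returns an empty list when no daemon has reported an OOM on stdout. This only complements the monitoring Otis describes, since a process killed for other reasons leaves no such line.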
