I don't know if it's related, but I saw a dead regionserver a little while ago too. In our case the .out file showed a JVM crash, along with an hs_err dump in the HBase home directory (attached below). We were running 0.90.0 at the time and upgraded to 0.90.1 in hopes of a fix, but nothing changed.
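In case it is useful to anyone, this is roughly how we spot the crash evidence on our boxes; the paths are from our own layout, so adjust them to yours:

    # our log directory, adjust as needed
    grep -i "fatal error" /var/log/hbase/*.out
    # the JVM writes hs_err_pid<pid>.log to its working directory by default,
    # which in our case is the hbase home
    ls -lt $HBASE_HOME/hs_err_pid*.log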
The crash happened while we were running a RowCounter job with 12 simultaneous tasks on each of 11 machines, 132 simultaneous tasks in total. The table has ~100k rows at ~700kB per row. We were trying different max_region_size values, and that instance had 100M; we have ~1000 regions in total. Our machines have 24G of RAM and Ganglia shows no resource starvation or swapping. This happened about a week ago, but I wasn't able to test further so I delayed asking here; if it has any relation to the problem Deniz is having, this log might be useful.

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007fd02e23825b, pid=30204, tid=140531828942608
#
# JRE version: 6.0_23-b05
# Java VM: Java HotSpot(TM) 64-Bit Server VM (19.0-b09 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# V  [libjvm.so+0x30325b]
#
# If you would like to submit a bug report, please visit:
#   http://java.sun.com/webapps/bugreport/crash.jsp
#

---------------  T H R E A D  ---------------

Current thread (0x000000004013d800):  ConcurrentGCThread [stack: 0x00007fd01dae6000,0x00007fd01dbe7000] [id=30221]

siginfo:si_signo=SIGSEGV: si_errno=0, si_code=1 (SEGV_MAPERR), si_addr=0x0000000000000018

Registers:
RAX=0x000000004013cbd8, RBX=0x00007fd02e8c6960, RCX=0x0000000000000003, RDX=0x0000000000000000
RSP=0x00007fd01dbe58c0, RBP=0x00007fd01dbe58e0, RSI=0x00007fd02e8aa9b0, RDI=0x0000000000000010
R8 =0x00000000175f6400, R9 =0x000000000000000c, R10=0x00007fd02e8aa754, R11=0x00000000000209bc
R12=0x00007fd01dbe5a00, R13=0x00000006c3272000, R14=0x000000004013c9c0, R15=0x00007fd01dbe5ab0
RIP=0x00007fd02e23825b, EFL=0x0000000000010246, CSGSFS=0x0000000000000033, ERR=0x0000000000000004
  TRAPNO=0x000000000000000e

Register to memory mapping:

RAX=0x000000004013cbd8
0x000000004013cbd8 is pointing to unknown location

RBX=0x00007fd02e8c6960
0x00007fd02e8c6960: <offset 0x991960> in /usr/lib64/jvm/java-1.6.0-sun-1.6.0/jre/lib/amd64/server/libjvm.so at 0x00007fd02df35000

RCX=0x0000000000000003
0x0000000000000003 is pointing to unknown location

RDX=0x0000000000000000
0x0000000000000000 is pointing to unknown location

RSP=0x00007fd01dbe58c0
0x00007fd01dbe58c0 is pointing to unknown location

RBP=0x00007fd01dbe58e0
0x00007fd01dbe58e0 is pointing to unknown location

RSI=0x00007fd02e8aa9b0
0x00007fd02e8aa9b0: <offset 0x9759b0> in /usr/lib64/jvm/java-1.6.0-sun-1.6.0/jre/lib/amd64/server/libjvm.so at 0x00007fd02df35000

RDI=0x0000000000000010
0x0000000000000010 is pointing to unknown location

R8 =0x00000000175f6400
0x00000000175f6400 is pointing to unknown location

R9 =0x000000000000000c
0x000000000000000c is pointing to unknown location

R10=0x00007fd02e8aa754
0x00007fd02e8aa754: <offset 0x975754> in /usr/lib64/jvm/java-1.6.0-sun-1.6.0/jre/lib/amd64/server/libjvm.so at 0x00007fd02df35000

R11=0x00000000000209bc
0x00000000000209bc is pointing to unknown location

R12=0x00007fd01dbe5a00
0x00007fd01dbe5a00 is pointing to unknown location

R13=0x00000006c3272000

On Tue, Mar 8, 2011 at 6:21 PM, 陈加俊 <[email protected]> wrote:
> Htable had disabled when ctrl+c ?
>
> 2011/3/8, M.Deniz OKTAR <[email protected]>:
> > Something new came up!
> >
> > I tried to truncate the 'usertable' which had ~12M entries.
> >
> > Shell stayed at "disabling table" for a long time. The process was there
> > but there were no requests. So I quit the state by ctrl-c.
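A side note on the truncate above: as far as I know the shell's truncate is just the three commands below run back to back, so interrupting it while it still says "disabling table" can leave regions half closed, which would fit the PENDING_CLOSE messages in the master log further down. The table and family names here are only examples, not something I verified against your schema:

    hbase> disable 'usertable'
    hbase> drop 'usertable'
    hbase> create 'usertable', 'f1'   # recreated with whatever column families the table had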
> >
> > Then tried count 'usertable' to see if data remains; the shell gave an error
> > and one of the regionservers had a log such as below.
> >
> > The master logs were also similar (tried to disable again, and the master
> > log is from that trial)
> >
> >
> > Regionserver 2:
> >
> > 2011-03-08 16:47:24,852 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; Region is not online: usertable,,1299593459085.d37bb124feaf8f5d08e51064a36596f8.
> > 2011-03-08 16:47:27,765 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=39.63 MB, free=4.65 GB, max=4.68 GB, blocks=35, accesses=376070, hits=12035, hitRatio=3.20%%, cachingAccesses=12070, cachingHits=12035, cachingHitsRatio=99.71%%, evictions=0, evicted=0, evictedPerRun=NaN
> > 2011-03-08 16:47:28,863 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; Region is not online: usertable,,1299593459085.d37bb124feaf8f5d08e51064a36596f8.
> > 2011-03-08 16:47:28,865 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: org.apache.hadoop.hbase.UnknownScannerException: Name: -1
> >         at org.apache.hadoop.hbase.regionserver.HRegionServer.next(HRegionServer.java:1795)
> >         at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown Source)
> >         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >         at java.lang.reflect.Method.invoke(Method.java:597)
> >         at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039)
> >
> >
> >
> > Masterserver:
> > .
> > .
> > . (same thing)
> > 2011-03-08 16:51:34,679 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=usertable,user1948102037,1299592536693.d5bae6bbe54aa182e1215ab626e0011e.
> >
> >
> > --
> > deniz
> >
> >
> > On Tue, Mar 8, 2011 at 4:34 PM, M.Deniz OKTAR <[email protected]> wrote:
> >
> >> Hi all,
> >>
> >> Thanks for the support. I've been trying to replicate the problem since
> >> this morning. Before doing that, I played with the configuration. I used
> >> to have only one user and set all the permissions according to that. Now
> >> I've followed the Cloudera manuals and set permissions for the hdfs and
> >> mapred users (changed the hbase-env.sh).
> >>
> >> I had 2 trials; on both, the Yahoo test failed because of receiving lots
> >> of "0"s, but the region servers didn't die. At some points in the test
> >> (also when it failed), the hbase master gave exceptions about not being
> >> able to reach one of the servers. I also lost the ssh connection to that
> >> server, but after a while it recovered (also hmaster). The last thing in
> >> the regionserver logs was that it was going for a flush.
> >>
> >> I'll be going over the tests again and provide you with clean log files
> >> from all servers. (hadoop, hbase, namenode, masternode logs)
> >>
> >> If you have any suggestions or directions for me to better diagnose the
> >> problem, that would be lovely.
> >>
> >> btw: these servers do not have ECC memory but I do not see any corruption
> >> in data.
> >>
> >> Thanks!
> >>
> >> --
> >> deniz
> >>
> >>
> >> On Mon, Mar 7, 2011 at 7:47 PM, Jean-Daniel Cryans <[email protected]> wrote:
> >>
> >>> Along with a bigger portion of the log, it might be good to check if
> >>> there's anything in the .out file that looks like a jvm error.
> >>>
> >>> J-D
> >>>
> >>> On Mon, Mar 7, 2011 at 9:22 AM, M.Deniz OKTAR <[email protected]> wrote:
> >>> > I ran every kind of benchmark I could find on those machines and they
> >>> > seemed to work fine. Did memory/disk tests too.
> >>> >
> >>> > The master node or other nodes provide some information and exceptions
> >>> > about not being able to reach the dead node.
> >>> >
> >>> > Btw, sometimes the process does not die but loses the connection.
> >>> >
> >>> > --
> >>> >
> >>> > deniz
> >>> >
> >>> > On Mon, Mar 7, 2011 at 7:19 PM, Stack <[email protected]> wrote:
> >>> >
> >>> >> I'm stumped. I have nothing to go on when no death throes or
> >>> >> complaints. This hardware for sure is healthy? Other stuff runs w/o
> >>> >> issue?
> >>> >> St.Ack
> >>> >>
> >>> >> On Mon, Mar 7, 2011 at 8:48 AM, M.Deniz OKTAR <[email protected]> wrote:
> >>> >> > I don't know if it's normal, but I see a lot of '0's in the test
> >>> >> > results when it tends to fail, such as:
> >>> >> >
> >>> >> > 1196 sec: 7394901 operations; 0 current ops/sec;
> >>> >> >
> >>> >> > --
> >>> >> > deniz
> >>> >> >
> >>> >> > On Mon, Mar 7, 2011 at 6:46 PM, M.Deniz OKTAR <[email protected]> wrote:
> >>> >> >
> >>> >> >> Hi,
> >>> >> >>
> >>> >> >> Thanks for the effort, answers below:
> >>> >> >>
> >>> >> >> On Mon, Mar 7, 2011 at 6:08 PM, Stack <[email protected]> wrote:
> >>> >> >>
> >>> >> >>> On Mon, Mar 7, 2011 at 5:43 AM, M.Deniz OKTAR <[email protected]> wrote:
> >>> >> >>> > We have a 5 node cluster, 4 of them being region servers. I am
> >>> >> >>> > running a custom workload with YCSB, and while the data is loading
> >>> >> >>> > (heavy insert) at least one of the region servers dies after about
> >>> >> >>> > 600000 operations.
> >>> >> >>>
> >>> >> >>> Tell us the character of your 'custom workload' please.
> >>> >> >>>
> >>> >> >> The workload is below; the part that fails is the loading part
> >>> >> >> (-load), which inserts all the records first:
> >>> >> >>
> >>> >> >> recordcount=10000000
> >>> >> >> operationcount=3000000
> >>> >> >> workload=com.yahoo.ycsb.workloads.CoreWorkload
> >>> >> >>
> >>> >> >> readallfields=true
> >>> >> >>
> >>> >> >> readproportion=0.5
> >>> >> >> updateproportion=0.1
> >>> >> >> scanproportion=0
> >>> >> >> insertproportion=0.35
> >>> >> >> readmodifywriteproportion=0.05
> >>> >> >>
> >>> >> >> requestdistribution=zipfian
> >>> >> >>
> >>> >> >>> > There are no abnormalities in the logs as far as I can see; the
> >>> >> >>> > only common point is that all of them (in different trials,
> >>> >> >>> > different region servers fail) request a flush as the last log
> >>> >> >>> > entries, given below. The .out files are empty. I am looking at
> >>> >> >>> > the /var/log/hbase folder for logs. Running Sun Java 6, latest
> >>> >> >>> > version. I couldn't find any logs that indicate a problem with
> >>> >> >>> > Java. Tried the tests with OpenJDK and had the same results.
> >>> >> >>> >
> >>> >> >>>
> >>> >> >>> Its strange that flush is the last thing in your log. The process is
> >>> >> >>> dead? We are exiting w/o a note in logs? Thats unusual. We usually
> >>> >> >>> scream loudly when dying.
> >>> >> >>>
> >>> >> >> Yes, that's the strange part. The last line is a flush, as if the
> >>> >> >> process never failed. Yes, the process is dead and hbase cannot see
> >>> >> >> the node.
> >>> >> >>
> >>> >> >>> > I have set ulimits (50000) and xceivers (20000) for multiple users
> >>> >> >>> > and am certain that they are correct.
> >>> >> >>>
> >>> >> >>> The first line in an hbase log prints out the ulimit it sees. You
> >>> >> >>> might check that the hbase process for sure is picking up your
> >>> >> >>> ulimit setting.
> >>> >> >>>
> >>> >> >> That was a mistake I made a couple of days ago; I checked it with
> >>> >> >> cat /proc/<pid of regionserver>/limits and all related users like
> >>> >> >> 'hbase' have those limits. Checked the logs:
> >>> >> >>
> >>> >> >> Mon Mar 7 06:41:15 EET 2011 Starting regionserver on test-1
> >>> >> >> ulimit -n 52768
> >>> >> >>
> >>> >> >>> > Also in the kernel logs, there are no apparent problems.
> >>> >> >>> >
> >>> >> >>> (The mystery compounds)
> >>> >> >>>
> >>> >> >>> > 2011-03-07 15:07:58,301 DEBUG org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction requested for usertable,user1030079237,1299502934627.257739740f58da96d5c5ef51a7d3efc3. because regionserver60020.cacheFlusher; priority=3, compaction queue size=18
> >>> >> >>> > 2011-03-07 15:07:58,301 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: NOT flushing memstore for region usertable,user1601881548,1299502135191.f8efb9aa0922fa8a6a53fc49b8155ebc., flushing=false, writesEnabled=false
> >>> >> >>> > 2011-03-07 15:07:58,301 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Started memstore flush for usertable,user1662209069,1299502135191.9fa929e6fb439843cffb604dea3f88f6., current region memstore size 68.6m
> >>> >> >>> > 2011-03-07 15:07:58,310 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Flush requested on usertable,user1601881548,1299502135191.f8efb9aa0922fa8a6a53fc49b8155ebc.
> >>> >> >>> > -end of log file-
> >>> >> >>> > ---
> >>> >> >>> >
> >>> >> >>> Nothing more?
> >>> >> >>>
> >>> >> >> No, nothing after that. But quite a lot of logs before that; I can
> >>> >> >> send them if you'd like.
> >>> >> >>
> >>> >> >>> Thanks,
> >>> >> >>> St.Ack
> >>> >> >>
> >>> >> >> Thanks a lot!
> >>> >> >>
> >>> >> >
> >>> >>
> >>> >
> >>>
> >>
> >>
> >
> >
> --
> Sent from my mobile device
>
> Thanks & Best regards
> jiajun

--
erdem agaoglu
