Thats the weird thing, Region is still alive. Just paused for a while and I don't know what are the causes of those long pauses. Checked the garbage collector logs, nothing was taking too long.
I'm suspecting hardware. -- Deniz 2011/3/11 Jean-Daniel Cryans <[email protected]> > iletken-test-2 died. > > J-D > > 2011/3/10 M.Deniz OKTAR <[email protected]>: > > Hi, > > Still working on the issue. This is one of the last trials I am doing > before > > ordering a new cluster. > > I was going through yahoo benchmark again and hbase became non responsive > > for a long time, (about 100 secs) benchmark results were 0 throughput for > > that time and eventually, benchmark failed. > > I didn't get any exceptions but this one, on the Master node, > iletken-test-0 > > is also the Master node, so it was trying to recover a file from the same > > node. > > Any suggestions? > > Thanks. > > --- > > > > org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting hlog 3 > of > > 14: > > > hdfs://iletken-test-0:30001/hbase3/.logs/iletken-test-2,60020,1299794182845/iletken-test-2%3A60020.1299794988383, > > length=86340652 > > > > 2011-03-11 00:15:54,825 INFO org.apache.hadoop.hbase.util.FSUtils: > > Recovering file > > > hdfs://iletken-test-0:30001/hbase3/.logs/iletken-test-2,60020,1299794182845/iletken-test-2%3A60020.1299794988383 > > > > 2011-03-11 00:15:56,675 WARN org.apache.hadoop.ipc.HBaseServer: IPC > Server > > listener on 60000: readAndProcess threw exception java.io.IOException: > > Connection reset by peer. Count of bytes read: 0 > > > > java.io.IOException: Connection reset by peer > > > > at sun.nio.ch.FileDispatcher.read0(Native Method) > > > > at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21) > > > > -- > > deniz > > > > On Thu, Mar 10, 2011 at 12:51 AM, Jean-Daniel Cryans < > [email protected]> > > wrote: > >> > >> This is a JVM error, and there seems to be a lot of them in the recent > >> versions. I personally recommend using u16 or u17. > >> > >> J-D > >> > >> On Wed, Mar 9, 2011 at 1:01 AM, Erdem Agaoglu <[email protected]> > >> wrote: > >> > I don't know if it's related but i've seen a dead regionserver a > little > >> > while ago too. But in our case .out file showed some JVM crash along > >> > with a > >> > hs_err dump in hbase home (attached below). We were running 0.90.0 at > >> > the > >> > moment and we upgraded to 0.90.1 in hopes of a fix but nothing > changed. > >> > > >> > The crash happened when we ran RowCounter job, with 12 simultaneous > >> > tasks on > >> > 11 machines, 132 simultaneous tasks total. Table has ~100k rows with > >> > ~700kB > >> > per row. We were trying different max_region_size values, and that > >> > instance > >> > had 100M. We have ~1000 regions total. Our machines have 24G ram and > >> > ganglia > >> > shows no resource starvation nor swapping. > >> > > >> > These happened about a week ago, but i wasn't able to test further so > i > >> > delayed asking here, but if it has any relation to problem Deniz's > >> > having, > >> > this log might be useful. > >> > > >> > # > >> > # A fatal error has been detected by the Java Runtime Environment: > >> > # > >> > # SIGSEGV (0xb) at pc=0x00007fd02e23825b, pid=30204, > >> > tid=140531828942608 > >> > # > >> > # JRE version: 6.0_23-b05 > >> > # Java VM: Java HotSpot(TM) 64-Bit Server VM (19.0-b09 mixed mode > >> > linux-amd64 compressed oops) > >> > # Problematic frame: > >> > # V [libjvm.so+0x30325b] > >> > # > >> > # If you would like to submit a bug report, please visit: > >> > # http://java.sun.com/webapps/bugreport/crash.jsp > >> > # > >> > > >> > --------------- T H R E A D --------------- > >> > > >> > Current thread (0x000000004013d800): ConcurrentGCThread [stack: > >> > 0x00007fd01dae6000,0x00007fd01dbe7000] [id=30221] > >> > > >> > siginfo:si_signo=SIGSEGV: si_errno=0, si_code=1 (SEGV_MAPERR), > >> > si_addr=0x0000000000000018 > >> > > >> > Registers: > >> > RAX=0x000000004013cbd8, RBX=0x00007fd02e8c6960, > RCX=0x0000000000000003, > >> > RDX=0x0000000000000000 > >> > RSP=0x00007fd01dbe58c0, RBP=0x00007fd01dbe58e0, > RSI=0x00007fd02e8aa9b0, > >> > RDI=0x0000000000000010 > >> > R8 =0x00000000175f6400, R9 =0x000000000000000c, > R10=0x00007fd02e8aa754, > >> > R11=0x00000000000209bc > >> > R12=0x00007fd01dbe5a00, R13=0x00000006c3272000, > R14=0x000000004013c9c0, > >> > R15=0x00007fd01dbe5ab0 > >> > RIP=0x00007fd02e23825b, EFL=0x0000000000010246, > >> > CSGSFS=0x0000000000000033, > >> > ERR=0x0000000000000004 > >> > TRAPNO=0x000000000000000e > >> > > >> > Register to memory mapping: > >> > > >> > RAX=0x000000004013cbd8 > >> > 0x000000004013cbd8 is pointing to unknown location > >> > > >> > RBX=0x00007fd02e8c6960 > >> > 0x00007fd02e8c6960: <offset 0x991960> in > >> > /usr/lib64/jvm/java-1.6.0-sun-1.6.0/jre/lib/amd64/server/libjvm.so at > >> > 0x00007fd02df35000 > >> > > >> > RCX=0x0000000000000003 > >> > 0x0000000000000003 is pointing to unknown location > >> > > >> > RDX=0x0000000000000000 > >> > 0x0000000000000000 is pointing to unknown location > >> > > >> > RSP=0x00007fd01dbe58c0 > >> > 0x00007fd01dbe58c0 is pointing to unknown location > >> > > >> > RBP=0x00007fd01dbe58e0 > >> > 0x00007fd01dbe58e0 is pointing to unknown location > >> > > >> > RSI=0x00007fd02e8aa9b0 > >> > 0x00007fd02e8aa9b0: <offset 0x9759b0> in > >> > /usr/lib64/jvm/java-1.6.0-sun-1.6.0/jre/lib/amd64/server/libjvm.so at > >> > 0x00007fd02df35000 > >> > > >> > RDI=0x0000000000000010 > >> > 0x0000000000000010 is pointing to unknown location > >> > > >> > R8 =0x00000000175f6400 > >> > 0x00000000175f6400 is pointing to unknown location > >> > > >> > R9 =0x000000000000000c > >> > 0x000000000000000c is pointing to unknown location > >> > > >> > R10=0x00007fd02e8aa754 > >> > 0x00007fd02e8aa754: <offset 0x975754> in > >> > /usr/lib64/jvm/java-1.6.0-sun-1.6.0/jre/lib/amd64/server/libjvm.so at > >> > 0x00007fd02df35000 > >> > > >> > R11=0x00000000000209bc > >> > 0x00000000000209bc is pointing to unknown location > >> > > >> > R12=0x00007fd01dbe5a00 > >> > 0x00007fd01dbe5a00 is pointing to unknown location > >> > > >> > R13=0x00000006c3272000 > >> > > >> > > >> > On Tue, Mar 8, 2011 at 6:21 PM, 陈加俊 <[email protected]> wrote: > >> > > >> >> Htable had disabled when ctr+c ? > >> >> > >> >> 2011/3/8, M.Deniz OKTAR <[email protected]>: > >> >> > Something new came up! > >> >> > > >> >> > I tried to truncate the 'usertable' which had ~12M entries. > >> >> > > >> >> > Shell stayed at "disabling table" for a long time. The processes > was > >> >> there > >> >> > but there were no requests. So I quit the state by ctrl-c. > >> >> > > >> >> > Then tried count 'usertable' to see if data remains, shell gave an > >> >> > error > >> >> and > >> >> > one of the regionservers had a log such as below, > >> >> > > >> >> > The master logs were also similar (tried to disable again, and the > >> >> > master > >> >> > log is from that trial) > >> >> > > >> >> > > >> >> > Regionserver 2: > >> >> > > >> >> > 2011-03-08 16:47:24,852 DEBUG > >> >> > org.apache.hadoop.hbase.regionserver.HRegionServer: > >> >> > NotServingRegionException; Region is not online: > >> >> > usertable,,1299593459085.d37bb124feaf8f5d08e51064a36596f8. > >> >> > 2011-03-08 16:47:27,765 DEBUG > >> >> > org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: > >> >> > total=39.63 > >> >> MB, > >> >> > free=4.65 GB, max=4.68 GB, blocks=35, accesses=376070, hits=12035, > >> >> > hitRatio=3.20%%, cachingAccesses=12070, cachingHits=12035, > >> >> > cachingHitsRatio=99.71%%, evictions=0, evicted=0, evictedPerRun=NaN > >> >> > 2011-03-08 16:47:28,863 DEBUG > >> >> > org.apache.hadoop.hbase.regionserver.HRegionServer: > >> >> > NotServingRegionException; Region is not online: > >> >> > usertable,,1299593459085.d37bb124feaf8f5d08e51064a36596f8. > >> >> > 2011-03-08 16:47:28,865 ERROR > >> >> > org.apache.hadoop.hbase.regionserver.HRegionServer: > >> >> > org.apache.hadoop.hbase.UnknownScannerException: Name: -1 > >> >> > at > >> >> > > >> >> > >> >> > org.apache.hadoop.hbase.regionserver.HRegionServer.next(HRegionServer.java:1795) > >> >> > at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown > >> >> > Source) > >> >> > at > >> >> > > >> >> > >> >> > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > >> >> > at java.lang.reflect.Method.invoke(Method.java:597) > >> >> > at > >> >> > org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570) > >> >> > at > >> >> > > >> >> > >> >> > org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039) > >> >> > > >> >> > > >> >> > > >> >> > Masterserver: > >> >> > . > >> >> > . > >> >> > . (same thing) > >> >> > 2011-03-08 16:51:34,679 INFO > >> >> > org.apache.hadoop.hbase.master.AssignmentManager: Region has been > >> >> > PENDING_CLOSE for too long, running forced unassign again on > >> >> > > >> >> > >> >> > region=usertable,user1948102037,1299592536693.d5bae6bbe54aa182e1215ab626e0011e. > >> >> > > >> >> > > >> >> > -- > >> >> > deniz > >> >> > > >> >> > > >> >> > On Tue, Mar 8, 2011 at 4:34 PM, M.Deniz OKTAR < > [email protected]> > >> >> wrote: > >> >> > > >> >> >> Hi all, > >> >> >> > >> >> >> Thanks for the support. I'v been trying to replicate the problem > >> >> >> since > >> >> >> this > >> >> >> morning. Before doing that, played with the configuration. I used > to > >> >> have > >> >> >> only one user and set all the permissions according to that. Now > I'v > >> >> >> followed the cloudera manuals and set permissions for hdfs and > >> >> >> mapred > >> >> >> users. > >> >> >> (changed the hbase-env.sh) > >> >> >> > >> >> >> I had 2 trials, on both the yahoo test failed because of receiving > >> >> >> lost > >> >> of > >> >> >> "0"s but the region servers didn't die. At some points in the > test, > >> >> (also > >> >> >> when failed) , hbase master gave exceptions about not being able > to > >> >> reach > >> >> >> one of the servers. I also lost the ssh connection to that server, > >> >> >> but > >> >> >> after > >> >> >> a while it recovered. (also hmaster) The last thing in the > >> >> >> regionserver > >> >> >> logs > >> >> >> was that it was going for a flush. > >> >> >> > >> >> >> I'll be going over the tests again and provide you with clean log > >> >> >> files > >> >> >> from all servers. (hadoop, hbase, namenode, masternode logs) > >> >> >> > >> >> >> If you have any suggestions or directions for me to better > diagnose > >> >> >> the > >> >> >> problem, that would be lovely. > >> >> >> > >> >> >> btw: these servers do not have ECC memory but I do not see any > >> >> corruption > >> >> >> in data. > >> >> >> > >> >> >> Thanks! > >> >> >> > >> >> >> -- > >> >> >> deniz > >> >> >> > >> >> >> > >> >> >> On Mon, Mar 7, 2011 at 7:47 PM, Jean-Daniel Cryans > >> >> >> <[email protected]>wrote: > >> >> >> > >> >> >>> Along with a bigger portion of the log, it be might good to check > >> >> >>> if > >> >> >>> there's anything in the .out file that looks like a jvm error. > >> >> >>> > >> >> >>> J-D > >> >> >>> > >> >> >>> On Mon, Mar 7, 2011 at 9:22 AM, M.Deniz OKTAR > >> >> >>> <[email protected]> > >> >> >>> wrote: > >> >> >>> > I run every kind of benchmark I could find on those machines > and > >> >> >>> > they > >> >> >>> seemed > >> >> >>> > to work fine. Did memory/disk tests too. > >> >> >>> > > >> >> >>> > The master node or other nodes provide some information and > >> >> exceptions > >> >> >>> about > >> >> >>> > that they can't reach to the dead node. > >> >> >>> > > >> >> >>> > Btw sometimes the process does not die but looses the > connection. > >> >> >>> > > >> >> >>> > -- > >> >> >>> > > >> >> >>> > deniz > >> >> >>> > > >> >> >>> > On Mon, Mar 7, 2011 at 7:19 PM, Stack <[email protected]> > wrote: > >> >> >>> > > >> >> >>> >> I'm stumped. I have nothing to go on when no death throes or > >> >> >>> >> complaints. This hardware for sure is healthy? Other stuff > >> >> >>> >> runs > >> >> w/o > >> >> >>> >> issue? > >> >> >>> >> St.Ack > >> >> >>> >> > >> >> >>> >> On Mon, Mar 7, 2011 at 8:48 AM, M.Deniz OKTAR < > >> >> [email protected]> > >> >> >>> >> wrote: > >> >> >>> >> > I don't know if its normal but I see alot of '0's in the > test > >> >> >>> >> > results > >> >> >>> >> when > >> >> >>> >> > it tends to fail, such as: > >> >> >>> >> > > >> >> >>> >> > 1196 sec: 7394901 operations; 0 current ops/sec; > >> >> >>> >> > > >> >> >>> >> > -- > >> >> >>> >> > deniz > >> >> >>> >> > > >> >> >>> >> > On Mon, Mar 7, 2011 at 6:46 PM, M.Deniz OKTAR < > >> >> [email protected] > >> >> >>> > > >> >> >>> >> wrote: > >> >> >>> >> > > >> >> >>> >> >> Hi, > >> >> >>> >> >> > >> >> >>> >> >> Thanks for the effort, answers below: > >> >> >>> >> >> > >> >> >>> >> >> > >> >> >>> >> >> > >> >> >>> >> >> > >> >> >>> >> >> On Mon, Mar 7, 2011 at 6:08 PM, Stack <[email protected]> > >> >> >>> >> >> wrote: > >> >> >>> >> >> > >> >> >>> >> >>> On Mon, Mar 7, 2011 at 5:43 AM, M.Deniz OKTAR < > >> >> >>> [email protected]> > >> >> >>> >> >>> wrote: > >> >> >>> >> >>> > We have a 5 node cluster, 4 of them being region > servers. > >> >> >>> >> >>> > I am > >> >> >>> >> running a > >> >> >>> >> >>> > custom workload with YCSB and when the data is loading > >> >> >>> >> >>> > (heavy > >> >> >>> insert) > >> >> >>> >> at > >> >> >>> >> >>> > least one of the region servers are dying after about > >> >> >>> >> >>> > 600000 > >> >> >>> >> operations. > >> >> >>> >> >>> > >> >> >>> >> >>> > >> >> >>> >> >>> Tell us the character of your 'custom workload' please. > >> >> >>> >> >>> > >> >> >>> >> >>> > >> >> >>> >> >> The workload is below, the part that fails is the loading > >> >> >>> >> >> part > >> >> >>> (-load) > >> >> >>> >> >> which inserts all the records first) > >> >> >>> >> >> > >> >> >>> >> >> recordcount=10000000 > >> >> >>> >> >> operationcount=3000000 > >> >> >>> >> >> workload=com.yahoo.ycsb.workloads.CoreWorkload > >> >> >>> >> >> > >> >> >>> >> >> readallfields=true > >> >> >>> >> >> > >> >> >>> >> >> readproportion=0.5 > >> >> >>> >> >> updateproportion=0.1 > >> >> >>> >> >> scanproportion=0 > >> >> >>> >> >> insertproportion=0.35 > >> >> >>> >> >> readmodifywriteproportion=0.05 > >> >> >>> >> >> > >> >> >>> >> >> requestdistribution=zipfian > >> >> >>> >> >> > >> >> >>> >> >> > >> >> >>> >> >> > >> >> >>> >> >> > >> >> >>> >> >>> > >> >> >>> >> >>> > There are no abnormalities in the logs as far as I can > >> >> >>> >> >>> > see, > >> >> the > >> >> >>> only > >> >> >>> >> >>> common > >> >> >>> >> >>> > point is that all of them(in different trials, different > >> >> region > >> >> >>> >> servers > >> >> >>> >> >>> > fail) request for a flush as the last logs, given below. > >> >> >>> >> >>> > .out > >> >> >>> files > >> >> >>> >> are > >> >> >>> >> >>> > empty. I am looking at the /var/log/hbase folder for > logs. > >> >> >>> Running > >> >> >>> >> sun > >> >> >>> >> >>> java > >> >> >>> >> >>> > 6 latest version. I couldn't find any logs that > indicates > >> >> >>> >> >>> > a > >> >> >>> problem > >> >> >>> >> with > >> >> >>> >> >>> > java. Tried the tests with openjdk and had the same > >> >> >>> >> >>> > results. > >> >> >>> >> >>> > > >> >> >>> >> >>> > >> >> >>> >> >>> Its strange that flush is the last thing in your log. The > >> >> process > >> >> >>> is > >> >> >>> >> >>> dead? We are exiting w/o a note in logs? Thats unusual. > >> >> >>> >> >>> We > >> >> >>> usually > >> >> >>> >> >>> scream loudly when dying. > >> >> >>> >> >>> > >> >> >>> >> >> > >> >> >>> >> >> Yes, thats the strange part. The last line is a flush as if > >> >> >>> >> >> the > >> >> >>> process > >> >> >>> >> >> never failed. Yes, the process is dead and hbase cannot see > >> >> >>> >> >> the > >> >> >>> node. > >> >> >>> >> >> > >> >> >>> >> >> > >> >> >>> >> >>> > >> >> >>> >> >>> > I have set ulimits(50000) and xceivers(20000) for > multiple > >> >> users > >> >> >>> and > >> >> >>> >> >>> certain > >> >> >>> >> >>> > that they are correct. > >> >> >>> >> >>> > >> >> >>> >> >>> The first line in an hbase log prints out the ulimit it > >> >> >>> >> >>> sees. > >> >> You > >> >> >>> >> >>> might check that the hbase process for sure is picking up > >> >> >>> >> >>> your > >> >> >>> ulimit > >> >> >>> >> >>> setting. > >> >> >>> >> >>> > >> >> >>> >> >>> That was a mistake I did a couple of days ago, checked it > >> >> >>> >> >>> with > >> >> cat > >> >> >>> >> >> /proc/<pid of reginserver>/limits and all related users > like > >> >> >>> 'hbase' > >> >> >>> >> has > >> >> >>> >> >> those limits. Checked the logs: > >> >> >>> >> >> > >> >> >>> >> >> Mon Mar 7 06:41:15 EET 2011 Starting regionserver on > test-1 > >> >> >>> >> >> ulimit -n 52768 > >> >> >>> >> >> > >> >> >>> >> >>> > >> >> >>> >> >>> > Also in the kernel logs, there are no apparent problems. > >> >> >>> >> >>> > > >> >> >>> >> >>> > >> >> >>> >> >>> (The mystery compounds) > >> >> >>> >> >>> > >> >> >>> >> >>> > 2011-03-07 15:07:58,301 DEBUG > >> >> >>> >> >>> > org.apache.hadoop.hbase.regionserver.CompactSplitThread: > >> >> >>> Compaction > >> >> >>> >> >>> > requested for > >> >> >>> >> >>> > > >> >> >>> >> > >> >> >>> > >> >> > >> >> > usertable,user1030079237,1299502934627.257739740f58da96d5c5ef51a7d3efc3. > >> >> >>> >> >>> > because regionserver60020.cacheFlusher; priority=3, > >> >> >>> >> >>> > compaction > >> >> >>> queue > >> >> >>> >> >>> size=18 > >> >> >>> >> >>> > 2011-03-07 15:07:58,301 DEBUG > >> >> >>> >> >>> org.apache.hadoop.hbase.regionserver.HRegion: > >> >> >>> >> >>> > NOT flushing memstore for region > >> >> >>> >> >>> > > >> >> >>> >> >>> > >> >> >>> >> > >> >> >>> > >> >> > >> >> > usertable,user1601881548,1299502135191.f8efb9aa0922fa8a6a53fc49b8155ebc., > >> >> >>> >> >>> > flushing=false, writesEnabled=false > >> >> >>> >> >>> > 2011-03-07 15:07:58,301 DEBUG > >> >> >>> >> >>> org.apache.hadoop.hbase.regionserver.HRegion: > >> >> >>> >> >>> > Started memstore flush for > >> >> >>> >> >>> > > >> >> >>> >> >>> > >> >> >>> >> > >> >> >>> > >> >> > >> >> > usertable,user1662209069,1299502135191.9fa929e6fb439843cffb604dea3f88f6., > >> >> >>> >> >>> > current region memstore size 68.6m > >> >> >>> >> >>> > 2011-03-07 15:07:58,310 DEBUG > >> >> >>> >> >>> org.apache.hadoop.hbase.regionserver.HRegion: > >> >> >>> >> >>> > Flush requested on > >> >> >>> >> >>> > > >> >> >>> >> > >> >> >>> > >> >> > >> >> > usertable,user1601881548,1299502135191.f8efb9aa0922fa8a6a53fc49b8155ebc. > >> >> >>> >> >>> > -end of log file- > >> >> >>> >> >>> > --- > >> >> >>> >> >>> > > >> >> >>> >> >>> > >> >> >>> >> >>> Nothing more? > >> >> >>> >> >>> > >> >> >>> >> >>> > >> >> >>> >> >> No, nothing after that. But quite a lot of logs before > that, > >> >> >>> >> >> I > >> >> can > >> >> >>> send > >> >> >>> >> >> them if you'd like. > >> >> >>> >> >> > >> >> >>> >> >> > >> >> >>> >> >> > >> >> >>> >> >>> Thanks, > >> >> >>> >> >>> St.Ack > >> >> >>> >> >>> > >> >> >>> >> >> > >> >> >>> >> >> Thanks alot! > >> >> >>> >> >> > >> >> >>> >> >> > >> >> >>> >> > > >> >> >>> >> > >> >> >>> > > >> >> >>> > >> >> >> > >> >> >> > >> >> > > >> >> > >> >> -- > >> >> 从我的移动设备发送 > >> >> > >> >> Thanks & Best regards > >> >> jiajun > >> >> > >> > > >> > > >> > > >> > -- > >> > erdem agaoglu > >> > > > > > >
