Hi,

Still working on the issue. This is one of the last trials I am doing before ordering a new cluster.

I was going through the Yahoo benchmark again and HBase became unresponsive for a long time (about 100 seconds). The benchmark reported 0 throughput for that period and eventually failed.

The only exception I got was the one below, on the Master node. iletken-test-0 is also the Master node, so it was trying to recover a file from the same node.

Any suggestions? Thanks.

---
org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting hlog 3 of 14: hdfs://iletken-test-0:30001/hbase3/.logs/iletken-test-2,60020,1299794182845/iletken-test-2%3A60020.1299794988383, length=86340652
2011-03-11 00:15:54,825 INFO org.apache.hadoop.hbase.util.FSUtils: Recovering file hdfs://iletken-test-0:30001/hbase3/.logs/iletken-test-2,60020,1299794182845/iletken-test-2%3A60020.1299794988383
2011-03-11 00:15:56,675 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server listener on 60000: readAndProcess threw exception java.io.IOException: Connection reset by peer. Count of bytes read: 0
java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcher.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)

--
deniz

On Thu, Mar 10, 2011 at 12:51 AM, Jean-Daniel Cryans <[email protected]> wrote:

> This is a JVM error, and there seems to be a lot of them in the recent versions. I personally recommend using u16 or u17.
>
> J-D
>
> On Wed, Mar 9, 2011 at 1:01 AM, Erdem Agaoglu <[email protected]> wrote:
> > I don't know if it's related but i've seen a dead regionserver a little while ago too. But in our case .out file showed some JVM crash along with a hs_err dump in hbase home (attached below). We were running 0.90.0 at the moment and we upgraded to 0.90.1 in hopes of a fix but nothing changed.
> >
> > The crash happened when we ran RowCounter job, with 12 simultaneous tasks on 11 machines, 132 simultaneous tasks total. Table has ~100k rows with ~700kB per row. We were trying different max_region_size values, and that instance had 100M. We have ~1000 regions total. Our machines have 24G ram and ganglia shows no resource starvation nor swapping.
> >
> > These happened about a week ago, but i wasn't able to test further so i delayed asking here, but if it has any relation to problem Deniz's having, this log might be useful.
> >
> > #
> > # A fatal error has been detected by the Java Runtime Environment:
> > #
> > # SIGSEGV (0xb) at pc=0x00007fd02e23825b, pid=30204, tid=140531828942608
> > #
> > # JRE version: 6.0_23-b05
> > # Java VM: Java HotSpot(TM) 64-Bit Server VM (19.0-b09 mixed mode linux-amd64 compressed oops)
> > # Problematic frame:
> > # V [libjvm.so+0x30325b]
> > #
> > # If you would like to submit a bug report, please visit:
> > # http://java.sun.com/webapps/bugreport/crash.jsp
> > #
> >
> > --------------- T H R E A D ---------------
> >
> > Current thread (0x000000004013d800): ConcurrentGCThread [stack: 0x00007fd01dae6000,0x00007fd01dbe7000] [id=30221]
> >
> > siginfo:si_signo=SIGSEGV: si_errno=0, si_code=1 (SEGV_MAPERR), si_addr=0x0000000000000018
> >
> > Registers:
> > RAX=0x000000004013cbd8, RBX=0x00007fd02e8c6960, RCX=0x0000000000000003, RDX=0x0000000000000000
> > RSP=0x00007fd01dbe58c0, RBP=0x00007fd01dbe58e0, RSI=0x00007fd02e8aa9b0, RDI=0x0000000000000010
> > R8 =0x00000000175f6400, R9 =0x000000000000000c, R10=0x00007fd02e8aa754, R11=0x00000000000209bc
> > R12=0x00007fd01dbe5a00, R13=0x00000006c3272000, R14=0x000000004013c9c0, R15=0x00007fd01dbe5ab0
> > RIP=0x00007fd02e23825b, EFL=0x0000000000010246, CSGSFS=0x0000000000000033, ERR=0x0000000000000004
> > TRAPNO=0x000000000000000e
> >
> > Register to memory mapping:
> >
> > RAX=0x000000004013cbd8
> > 0x000000004013cbd8 is pointing to unknown location
> >
> > RBX=0x00007fd02e8c6960
> > 0x00007fd02e8c6960: <offset 0x991960> in /usr/lib64/jvm/java-1.6.0-sun-1.6.0/jre/lib/amd64/server/libjvm.so at 0x00007fd02df35000
> >
> > RCX=0x0000000000000003
> > 0x0000000000000003 is pointing to unknown location
> >
> > RDX=0x0000000000000000
> > 0x0000000000000000 is pointing to unknown location
> >
> > RSP=0x00007fd01dbe58c0
> > 0x00007fd01dbe58c0 is pointing to unknown location
> >
> > RBP=0x00007fd01dbe58e0
> > 0x00007fd01dbe58e0 is pointing to unknown location
> >
> > RSI=0x00007fd02e8aa9b0
> > 0x00007fd02e8aa9b0: <offset 0x9759b0> in /usr/lib64/jvm/java-1.6.0-sun-1.6.0/jre/lib/amd64/server/libjvm.so at 0x00007fd02df35000
> >
> > RDI=0x0000000000000010
> > 0x0000000000000010 is pointing to unknown location
> >
> > R8 =0x00000000175f6400
> > 0x00000000175f6400 is pointing to unknown location
> >
> > R9 =0x000000000000000c
> > 0x000000000000000c is pointing to unknown location
> >
> > R10=0x00007fd02e8aa754
> > 0x00007fd02e8aa754: <offset 0x975754> in /usr/lib64/jvm/java-1.6.0-sun-1.6.0/jre/lib/amd64/server/libjvm.so at 0x00007fd02df35000
> >
> > R11=0x00000000000209bc
> > 0x00000000000209bc is pointing to unknown location
> >
> > R12=0x00007fd01dbe5a00
> > 0x00007fd01dbe5a00 is pointing to unknown location
> >
> > R13=0x00000006c3272000
> >
> >
> > On Tue, Mar 8, 2011 at 6:21 PM, 陈加俊 <[email protected]> wrote:
> >
> >> Htable had disabled when ctr+c ?
> >>
> >> 2011/3/8, M.Deniz OKTAR <[email protected]>:
> >> > Something new came up!
> >> >
> >> > I tried to truncate the 'usertable' which had ~12M entries.
> >> >
> >> > Shell stayed at "disabling table" for a long time. The process was there but there were no requests. So I quit the state by ctrl-c.
> >> >
> >> > Then tried count 'usertable' to see if data remains, shell gave an error and one of the regionservers had a log such as below,
> >> >
> >> > The master logs were also similar (tried to disable again, and the master log is from that trial)
> >> >
> >> >
> >> > Regionserver 2:
> >> >
> >> > 2011-03-08 16:47:24,852 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; Region is not online: usertable,,1299593459085.d37bb124feaf8f5d08e51064a36596f8.
> >> > 2011-03-08 16:47:27,765 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=39.63 MB, free=4.65 GB, max=4.68 GB, blocks=35, accesses=376070, hits=12035, hitRatio=3.20%%, cachingAccesses=12070, cachingHits=12035, cachingHitsRatio=99.71%%, evictions=0, evicted=0, evictedPerRun=NaN
> >> > 2011-03-08 16:47:28,863 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; Region is not online: usertable,,1299593459085.d37bb124feaf8f5d08e51064a36596f8.
> >> > 2011-03-08 16:47:28,865 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer:
> >> > org.apache.hadoop.hbase.UnknownScannerException: Name: -1
> >> >         at org.apache.hadoop.hbase.regionserver.HRegionServer.next(HRegionServer.java:1795)
> >> >         at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown Source)
> >> >         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >> >         at java.lang.reflect.Method.invoke(Method.java:597)
> >> >         at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
> >> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039)
> >> >
> >> >
> >> > Masterserver:
> >> > .
> >> > .
> >> > .
> >> > (same thing)
> >> > 2011-03-08 16:51:34,679 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=usertable,user1948102037,1299592536693.d5bae6bbe54aa182e1215ab626e0011e.
> >> >
> >> > --
> >> > deniz
> >> >
> >> > On Tue, Mar 8, 2011 at 4:34 PM, M.Deniz OKTAR <[email protected]> wrote:
> >> >
> >> >> Hi all,
> >> >>
> >> >> Thanks for the support. I've been trying to replicate the problem since this morning. Before doing that, played with the configuration. I used to have only one user and set all the permissions according to that. Now I've followed the cloudera manuals and set permissions for hdfs and mapred users. (changed the hbase-env.sh)
> >> >>
> >> >> I had 2 trials, on both the yahoo test failed because of receiving lots of "0"s but the region servers didn't die. At some points in the test (also when failed), hbase master gave exceptions about not being able to reach one of the servers. I also lost the ssh connection to that server, but after a while it recovered. (also hmaster) The last thing in the regionserver logs was that it was going for a flush.
> >> >>
> >> >> I'll be going over the tests again and provide you with clean log files from all servers. (hadoop, hbase, namenode, masternode logs)
> >> >>
> >> >> If you have any suggestions or directions for me to better diagnose the problem, that would be lovely.
> >> >>
> >> >> btw: these servers do not have ECC memory but I do not see any corruption in data.
> >> >>
> >> >> Thanks!
> >> >>
> >> >> --
> >> >> deniz
> >> >>
> >> >> On Mon, Mar 7, 2011 at 7:47 PM, Jean-Daniel Cryans <[email protected]> wrote:
> >> >>
> >> >>> Along with a bigger portion of the log, it might be good to check if there's anything in the .out file that looks like a jvm error.
> >> >>>
> >> >>> J-D
> >> >>>
> >> >>> On Mon, Mar 7, 2011 at 9:22 AM, M.Deniz OKTAR <[email protected]> wrote:
> >> >>> > I run every kind of benchmark I could find on those machines and they seemed to work fine. Did memory/disk tests too.
> >> >>> >
> >> >>> > The master node or other nodes provide some information and exceptions about that they can't reach to the dead node.
> >> >>> >
> >> >>> > Btw sometimes the process does not die but loses the connection.
> >> >>> >
> >> >>> > --
> >> >>> > deniz
> >> >>> >
> >> >>> > On Mon, Mar 7, 2011 at 7:19 PM, Stack <[email protected]> wrote:
> >> >>> >
> >> >>> >> I'm stumped. I have nothing to go on when no death throes or complaints. This hardware for sure is healthy? Other stuff runs w/o issue?
> >> >>> >> St.Ack
> >> >>> >>
> >> >>> >> On Mon, Mar 7, 2011 at 8:48 AM, M.Deniz OKTAR <[email protected]> wrote:
> >> >>> >> > I don't know if it's normal but I see a lot of '0's in the test results when it tends to fail, such as:
> >> >>> >> >
> >> >>> >> > 1196 sec: 7394901 operations; 0 current ops/sec;
> >> >>> >> >
> >> >>> >> > --
> >> >>> >> > deniz
> >> >>> >> >
> >> >>> >> > On Mon, Mar 7, 2011 at 6:46 PM, M.Deniz OKTAR <[email protected]> wrote:
> >> >>> >> >
> >> >>> >> >> Hi,
> >> >>> >> >>
> >> >>> >> >> Thanks for the effort, answers below:
> >> >>> >> >>
> >> >>> >> >> On Mon, Mar 7, 2011 at 6:08 PM, Stack <[email protected]> wrote:
> >> >>> >> >>
> >> >>> >> >>> On Mon, Mar 7, 2011 at 5:43 AM, M.Deniz OKTAR <[email protected]> wrote:
> >> >>> >> >>> > We have a 5 node cluster, 4 of them being region servers. I am running a custom workload with YCSB and when the data is loading (heavy insert) at least one of the region servers is dying after about 600000 operations.
> >> >>> >> >>>
> >> >>> >> >>> Tell us the character of your 'custom workload' please.
> >> >>> >> >>>
> >> >>> >> >> The workload is below, the part that fails is the loading part (-load), which inserts all the records first.
> >> >>> >> >>
> >> >>> >> >> recordcount=10000000
> >> >>> >> >> operationcount=3000000
> >> >>> >> >> workload=com.yahoo.ycsb.workloads.CoreWorkload
> >> >>> >> >>
> >> >>> >> >> readallfields=true
> >> >>> >> >>
> >> >>> >> >> readproportion=0.5
> >> >>> >> >> updateproportion=0.1
> >> >>> >> >> scanproportion=0
> >> >>> >> >> insertproportion=0.35
> >> >>> >> >> readmodifywriteproportion=0.05
> >> >>> >> >>
> >> >>> >> >> requestdistribution=zipfian
> >> >>> >> >>
> >> >>> >> >>> > There are no abnormalities in the logs as far as I can see, the only common point is that all of them (in different trials, different region servers fail) request for a flush as the last logs, given below. .out files are empty. I am looking at the /var/log/hbase folder for logs. Running sun java 6 latest version. I couldn't find any logs that indicate a problem with java. Tried the tests with openjdk and had the same results.
> >> >>> >> >>>
> >> >>> >> >>> It's strange that flush is the last thing in your log. The process is dead? We are exiting w/o a note in logs? That's unusual. We usually scream loudly when dying.
> >> >>> >> >>>
> >> >>> >> >> Yes, that's the strange part. The last line is a flush as if the process never failed. Yes, the process is dead and hbase cannot see the node.
> >> >>> >> >>
> >> >>> >> >>> > I have set ulimits(50000) and xceivers(20000) for multiple users and certain that they are correct.
> >> >>> >> >>>
> >> >>> >> >>> The first line in an hbase log prints out the ulimit it sees. You might check that the hbase process for sure is picking up your ulimit setting.
> >> >>> >> >>>
> >> >>> >> >> That was a mistake I did a couple of days ago, checked it with cat /proc/<pid of regionserver>/limits and all related users like 'hbase' have those limits. Checked the logs:
> >> >>> >> >>
> >> >>> >> >> Mon Mar 7 06:41:15 EET 2011 Starting regionserver on test-1
> >> >>> >> >> ulimit -n 52768
> >> >>> >> >>
> >> >>> >> >>> > Also in the kernel logs, there are no apparent problems.
> >> >>> >> >>>
> >> >>> >> >>> (The mystery compounds)
> >> >>> >> >>>
> >> >>> >> >>> > 2011-03-07 15:07:58,301 DEBUG org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction requested for usertable,user1030079237,1299502934627.257739740f58da96d5c5ef51a7d3efc3. because regionserver60020.cacheFlusher; priority=3, compaction queue size=18
> >> >>> >> >>> > 2011-03-07 15:07:58,301 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: NOT flushing memstore for region usertable,user1601881548,1299502135191.f8efb9aa0922fa8a6a53fc49b8155ebc., flushing=false, writesEnabled=false
> >> >>> >> >>> > 2011-03-07 15:07:58,301 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Started memstore flush for usertable,user1662209069,1299502135191.9fa929e6fb439843cffb604dea3f88f6., current region memstore size 68.6m
> >> >>> >> >>> > 2011-03-07 15:07:58,310 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Flush requested on usertable,user1601881548,1299502135191.f8efb9aa0922fa8a6a53fc49b8155ebc.
> >> >>> >> >>> > -end of log file-
> >> >>> >> >>> > ---
> >> >>> >> >>>
> >> >>> >> >>> Nothing more?
> >> >>> >> >>>
> >> >>> >> >> No, nothing after that. But quite a lot of logs before that, I can send them if you'd like.
> >> >>> >> >>
> >> >>> >> >>> Thanks,
> >> >>> >> >>> St.Ack
> >> >>> >> >>>
> >> >>> >> >> Thanks a lot!
> >> >>> >> >>
> >>
> >> --
> >> Sent from my mobile device
> >>
> >> Thanks & Best regards
> >> jiajun
> >
> > --
> > erdem agaoglu

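One way to line up the zero-throughput periods with the master and regionserver logs is to pull the stall windows out of the benchmark output. The following is only a minimal sketch, assuming the status lines keep the form quoted in this thread ("1196 sec: 7394901 operations; 0 current ops/sec;"). The function name find_stalls and the sample lines used to exercise it are hypothetical and are not taken from any real run.

```python
import re

# Matches benchmark status lines of the form quoted in the thread, e.g.
#   1196 sec: 7394901 operations; 0 current ops/sec;
STATUS_RE = re.compile(
    r"(\d+) sec: (\d+) operations; (\d+(?:\.\d+)?) current ops/sec"
)


def find_stalls(lines):
    """Return (first_sec, last_sec) pairs for runs of zero-throughput lines.

    find_stalls is a hypothetical helper written for this sketch; it is
    not part of YCSB or HBase.
    """
    stalls = []
    start = None  # elapsed seconds at the first zero line of the current run
    last = None   # elapsed seconds at the latest zero line of the current run
    for line in lines:
        match = STATUS_RE.search(line)
        if match is None:
            continue  # skip anything that is not a status line
        elapsed = int(match.group(1))
        rate = float(match.group(3))
        if rate == 0:
            if start is None:
                start = elapsed
            last = elapsed
        elif start is not None:
            # Throughput resumed, so close the current stall window.
            stalls.append((start, last))
            start = None
    if start is not None:
        # The output ended while still stalled.
        stalls.append((start, last))
    return stalls


if __name__ == "__main__":
    # Made-up sample lines in the same format, for illustration only.
    sample = [
        "1176 sec: 7390000 operations; 410.5 current ops/sec;",
        "1186 sec: 7394901 operations; 0 current ops/sec;",
        "1196 sec: 7394901 operations; 0 current ops/sec;",
        "1206 sec: 7399000 operations; 409.9 current ops/sec;",
    ]
    for first, final in find_stalls(sample):
        print(f"stall from {first}s to {final}s")
```

Adding each window's elapsed seconds to the benchmark start time gives wall-clock ranges that can be compared against the timestamps in the HLogSplitter and "Connection reset by peer" entries above.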