iletken-test-2 died.

J-D
2011/3/10 M.Deniz OKTAR <[email protected]>:
> Hi,
>
> Still working on the issue. This is one of the last trials I am doing
> before ordering a new cluster.
>
> I was going through the Yahoo benchmark again and HBase became
> unresponsive for a long time (about 100 secs); the benchmark reported 0
> throughput for that period and eventually failed.
>
> I didn't get any exceptions except the one below, on the master node.
> iletken-test-0 is also the master node, so it was trying to recover a
> file from the same node.
>
> Any suggestions?
>
> Thanks.
>
> ---
>
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting hlog 3 of 14: hdfs://iletken-test-0:30001/hbase3/.logs/iletken-test-2,60020,1299794182845/iletken-test-2%3A60020.1299794988383, length=86340652
> 2011-03-11 00:15:54,825 INFO org.apache.hadoop.hbase.util.FSUtils: Recovering file hdfs://iletken-test-0:30001/hbase3/.logs/iletken-test-2,60020,1299794182845/iletken-test-2%3A60020.1299794988383
> 2011-03-11 00:15:56,675 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server listener on 60000: readAndProcess threw exception java.io.IOException: Connection reset by peer. Count of bytes read: 0
> java.io.IOException: Connection reset by peer
>         at sun.nio.ch.FileDispatcher.read0(Native Method)
>         at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
>
> --
> deniz

On Thu, Mar 10, 2011 at 12:51 AM, Jean-Daniel Cryans <[email protected]> wrote:
> This is a JVM error, and there seem to be a lot of them in recent
> versions. I personally recommend using u16 or u17.
>
> J-D

On Wed, Mar 9, 2011 at 1:01 AM, Erdem Agaoglu <[email protected]> wrote:
> I don't know if it's related, but I've seen a dead regionserver a little
> while ago too. In our case, though, the .out file showed a JVM crash,
> along with an hs_err dump in the hbase home (attached below). We were
> running 0.90.0 at the time and upgraded to 0.90.1 in hopes of a fix, but
> nothing changed.
>
> The crash happened when we ran a RowCounter job with 12 simultaneous
> tasks on 11 machines, 132 simultaneous tasks total. The table has ~100k
> rows with ~700kB per row. We were trying different max_region_size
> values, and that instance had 100M. We have ~1000 regions total. Our
> machines have 24G RAM, and ganglia shows no resource starvation nor
> swapping.
>
> These happened about a week ago, but I wasn't able to test further, so I
> delayed asking here. If it has any relation to the problem Deniz is
> having, this log might be useful.
>
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> # SIGSEGV (0xb) at pc=0x00007fd02e23825b, pid=30204, tid=140531828942608
> #
> # JRE version: 6.0_23-b05
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (19.0-b09 mixed mode linux-amd64 compressed oops)
> # Problematic frame:
> # V  [libjvm.so+0x30325b]
> #
> # If you would like to submit a bug report, please visit:
> #   http://java.sun.com/webapps/bugreport/crash.jsp
> #
>
> ---------------  T H R E A D  ---------------
>
> Current thread (0x000000004013d800): ConcurrentGCThread [stack: 0x00007fd01dae6000,0x00007fd01dbe7000] [id=30221]
>
> siginfo: si_signo=SIGSEGV: si_errno=0, si_code=1 (SEGV_MAPERR), si_addr=0x0000000000000018
>
> Registers:
> RAX=0x000000004013cbd8, RBX=0x00007fd02e8c6960, RCX=0x0000000000000003, RDX=0x0000000000000000
> RSP=0x00007fd01dbe58c0, RBP=0x00007fd01dbe58e0, RSI=0x00007fd02e8aa9b0, RDI=0x0000000000000010
> R8 =0x00000000175f6400, R9 =0x000000000000000c, R10=0x00007fd02e8aa754, R11=0x00000000000209bc
> R12=0x00007fd01dbe5a00, R13=0x00000006c3272000, R14=0x000000004013c9c0, R15=0x00007fd01dbe5ab0
> RIP=0x00007fd02e23825b, EFL=0x0000000000010246, CSGSFS=0x0000000000000033, ERR=0x0000000000000004
> TRAPNO=0x000000000000000e
>
> Register to memory mapping:
>
> RAX=0x000000004013cbd8 is pointing to unknown location
> RBX=0x00007fd02e8c6960: <offset 0x991960> in /usr/lib64/jvm/java-1.6.0-sun-1.6.0/jre/lib/amd64/server/libjvm.so at 0x00007fd02df35000
> RCX=0x0000000000000003 is pointing to unknown location
> RDX=0x0000000000000000 is pointing to unknown location
> RSP=0x00007fd01dbe58c0 is pointing to unknown location
> RBP=0x00007fd01dbe58e0 is pointing to unknown location
> RSI=0x00007fd02e8aa9b0: <offset 0x9759b0> in /usr/lib64/jvm/java-1.6.0-sun-1.6.0/jre/lib/amd64/server/libjvm.so at 0x00007fd02df35000
> RDI=0x0000000000000010 is pointing to unknown location
> R8 =0x00000000175f6400 is pointing to unknown location
> R9 =0x000000000000000c is pointing to unknown location
> R10=0x00007fd02e8aa754: <offset 0x975754> in /usr/lib64/jvm/java-1.6.0-sun-1.6.0/jre/lib/amd64/server/libjvm.so at 0x00007fd02df35000
> R11=0x00000000000209bc is pointing to unknown location
> R12=0x00007fd01dbe5a00 is pointing to unknown location
> R13=0x00000006c3272000
>
> --
> erdem agaoglu

On Tue, Mar 8, 2011 at 6:21 PM, 陈加俊 <[email protected]> wrote:
> Had the table already been disabled when you hit Ctrl+C?
>
> --
> Sent from my mobile device
>
> Thanks & Best regards
> jiajun

2011/3/8, M.Deniz OKTAR <[email protected]>:
> Something new came up!
>
> I tried to truncate the 'usertable', which had ~12M entries.
>
> The shell stayed at "disabling table" for a long time. The process was
> still there, but there were no requests, so I quit with Ctrl+C.
>
> Then I tried count 'usertable' to see if the data remained; the shell
> gave an error, and one of the regionservers had a log like the one
> below. The master logs were similar (I tried to disable again, and the
> master log is from that trial).
>
> Regionserver 2:
>
> 2011-03-08 16:47:24,852 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; Region is not online: usertable,,1299593459085.d37bb124feaf8f5d08e51064a36596f8.
> 2011-03-08 16:47:27,765 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=39.63 MB, free=4.65 GB, max=4.68 GB, blocks=35, accesses=376070, hits=12035, hitRatio=3.20%%, cachingAccesses=12070, cachingHits=12035, cachingHitsRatio=99.71%%, evictions=0, evicted=0, evictedPerRun=NaN
> 2011-03-08 16:47:28,863 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; Region is not online: usertable,,1299593459085.d37bb124feaf8f5d08e51064a36596f8.
> 2011-03-08 16:47:28,865 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: org.apache.hadoop.hbase.UnknownScannerException: Name: -1
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.next(HRegionServer.java:1795)
>         at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
>         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039)
>
> Masterserver (after many repetitions of the same message):
>
> 2011-03-08 16:51:34,679 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=usertable,user1948102037,1299592536693.d5bae6bbe54aa182e1215ab626e0011e.
>
> --
> deniz
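A note on the truncate behavior above: in the 0.90-era shell, truncate is essentially a client-side disable / drop / recreate sequence, so Ctrl+C only kills the shell; the master keeps closing the table's regions, which is consistent with the forced-unassign PENDING_CLOSE messages in the master log. A minimal sketch of the equivalent client code, assuming the 0.90 HBaseAdmin API ('family' is a placeholder; the shell actually recreates from the table descriptor it captures before the drop):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class TruncateSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HBaseAdmin admin = new HBaseAdmin(conf);
            // The master closes every region of the table here; killing the
            // client mid-call does not stop that server-side work.
            admin.disableTable("usertable");
            admin.deleteTable("usertable");
            // Recreate an empty table. 'family' is a placeholder; a real
            // truncate would reuse the descriptor saved before the drop.
            HTableDescriptor desc = new HTableDescriptor("usertable");
            desc.addFamily(new HColumnDescriptor("family"));
            admin.createTable(desc);
        }
    }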
On Tue, Mar 8, 2011 at 4:34 PM, M.Deniz OKTAR <[email protected]> wrote:
> Hi all,
>
> Thanks for the support. I've been trying to replicate the problem since
> this morning. Before doing that, I played with the configuration. I used
> to have only one user and had set all the permissions according to that;
> now I've followed the Cloudera manuals and set permissions for the hdfs
> and mapred users (changed hbase-env.sh).
>
> I had 2 trials; on both, the Yahoo test failed after reporting lots of
> "0"s, but the region servers didn't die. At some points in the test
> (including when it failed), the HBase master threw exceptions about not
> being able to reach one of the servers. I also lost the ssh connection
> to that server, but after a while it recovered (so did the hmaster). The
> last thing in the regionserver logs was that it was going for a flush.
>
> I'll go over the tests again and provide clean log files from all
> servers (hadoop, hbase, namenode, masternode logs).
>
> If you have any suggestions or directions to better diagnose the
> problem, that would be lovely.
>
> btw: these servers do not have ECC memory, but I do not see any
> corruption in the data.
>
> Thanks!
>
> --
> deniz

On Mon, Mar 7, 2011 at 7:47 PM, Jean-Daniel Cryans <[email protected]> wrote:
> Along with a bigger portion of the log, it might be good to check if
> there's anything in the .out file that looks like a JVM error.
>
> J-D

On Mon, Mar 7, 2011 at 9:22 AM, M.Deniz OKTAR <[email protected]> wrote:
> I ran every kind of benchmark I could find on those machines, and they
> seemed to work fine. Did memory/disk tests too.
>
> The master node and the other nodes log some information and exceptions
> about not being able to reach the dead node.
>
> Btw, sometimes the process does not die but loses the connection.
>
> --
> deniz

On Mon, Mar 7, 2011 at 7:19 PM, Stack <[email protected]> wrote:
> I'm stumped. I have nothing to go on when there are no death throes or
> complaints. This hardware for sure is healthy? Other stuff runs w/o
> issue?
>
> St.Ack

On Mon, Mar 7, 2011 at 8:48 AM, M.Deniz OKTAR <[email protected]> wrote:
> I don't know if it's normal, but I see a lot of '0's in the test results
> when it tends to fail, such as:
>
> 1196 sec: 7394901 operations; 0 current ops/sec;
>
> --
> deniz
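Since the servers die without any final log entry, it may help to record exactly when the master loses them during a run. A small watcher sketch that polls the master's view of the cluster, assuming the 0.90-era client API (the poll interval and output format are arbitrary):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.ClusterStatus;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class ClusterWatcher {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HBaseAdmin admin = new HBaseAdmin(conf);
            while (true) {
                // The master's view: counts of live and dead regionservers.
                ClusterStatus status = admin.getClusterStatus();
                System.out.println(System.currentTimeMillis()
                        + " live=" + status.getServers()
                        + " dead=" + status.getDeadServers());
                Thread.sleep(5000);  // poll every 5 seconds
            }
        }
    }

The printed timestamps give a precise window to correlate against the regionserver's last flush lines and the kernel log.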
>> >> >>> >> >>> >> >> >>> >> >>> >> >> >>> >> >> The workload is below, the part that fails is the loading >> >> >>> >> >> part >> >> >>> (-load) >> >> >>> >> >> which inserts all the records first) >> >> >>> >> >> >> >> >>> >> >> recordcount=10000000 >> >> >>> >> >> operationcount=3000000 >> >> >>> >> >> workload=com.yahoo.ycsb.workloads.CoreWorkload >> >> >>> >> >> >> >> >>> >> >> readallfields=true >> >> >>> >> >> >> >> >>> >> >> readproportion=0.5 >> >> >>> >> >> updateproportion=0.1 >> >> >>> >> >> scanproportion=0 >> >> >>> >> >> insertproportion=0.35 >> >> >>> >> >> readmodifywriteproportion=0.05 >> >> >>> >> >> >> >> >>> >> >> requestdistribution=zipfian >> >> >>> >> >> >> >> >>> >> >> >> >> >>> >> >> >> >> >>> >> >> >> >> >>> >> >>> >> >> >>> >> >>> > There are no abnormalities in the logs as far as I can >> >> >>> >> >>> > see, >> >> the >> >> >>> only >> >> >>> >> >>> common >> >> >>> >> >>> > point is that all of them(in different trials, different >> >> region >> >> >>> >> servers >> >> >>> >> >>> > fail) request for a flush as the last logs, given below. >> >> >>> >> >>> > .out >> >> >>> files >> >> >>> >> are >> >> >>> >> >>> > empty. I am looking at the /var/log/hbase folder for logs. >> >> >>> Running >> >> >>> >> sun >> >> >>> >> >>> java >> >> >>> >> >>> > 6 latest version. I couldn't find any logs that indicates >> >> >>> >> >>> > a >> >> >>> problem >> >> >>> >> with >> >> >>> >> >>> > java. Tried the tests with openjdk and had the same >> >> >>> >> >>> > results. >> >> >>> >> >>> > >> >> >>> >> >>> >> >> >>> >> >>> Its strange that flush is the last thing in your log. The >> >> process >> >> >>> is >> >> >>> >> >>> dead? We are exiting w/o a note in logs? Thats unusual. >> >> >>> >> >>> We >> >> >>> usually >> >> >>> >> >>> scream loudly when dying. >> >> >>> >> >>> >> >> >>> >> >> >> >> >>> >> >> Yes, thats the strange part. The last line is a flush as if >> >> >>> >> >> the >> >> >>> process >> >> >>> >> >> never failed. Yes, the process is dead and hbase cannot see >> >> >>> >> >> the >> >> >>> node. >> >> >>> >> >> >> >> >>> >> >> >> >> >>> >> >>> >> >> >>> >> >>> > I have set ulimits(50000) and xceivers(20000) for multiple >> >> users >> >> >>> and >> >> >>> >> >>> certain >> >> >>> >> >>> > that they are correct. >> >> >>> >> >>> >> >> >>> >> >>> The first line in an hbase log prints out the ulimit it >> >> >>> >> >>> sees. >> >> You >> >> >>> >> >>> might check that the hbase process for sure is picking up >> >> >>> >> >>> your >> >> >>> ulimit >> >> >>> >> >>> setting. >> >> >>> >> >>> >> >> >>> >> >>> That was a mistake I did a couple of days ago, checked it >> >> >>> >> >>> with >> >> cat >> >> >>> >> >> /proc/<pid of reginserver>/limits and all related users like >> >> >>> 'hbase' >> >> >>> >> has >> >> >>> >> >> those limits. Checked the logs: >> >> >>> >> >> >> >> >>> >> >> Mon Mar 7 06:41:15 EET 2011 Starting regionserver on test-1 >> >> >>> >> >> ulimit -n 52768 >> >> >>> >> >> >> >> >>> >> >>> >> >> >>> >> >>> > Also in the kernel logs, there are no apparent problems. >> >> >>> >> >>> > >> >> >>> >> >>> >> >> >>> >> >>> (The mystery compounds) >> >> >>> >> >>> >> >> >>> >> >>> > 2011-03-07 15:07:58,301 DEBUG >> >> >>> >> >>> > org.apache.hadoop.hbase.regionserver.CompactSplitThread: >> >> >>> Compaction >> >> >>> >> >>> > requested for >> >> >>> >> >>> > >> >> >>> >> >> >> >>> >> >> >> >> usertable,user1030079237,1299502934627.257739740f58da96d5c5ef51a7d3efc3. 
>> >> >>> >> >>> > because regionserver60020.cacheFlusher; priority=3, >> >> >>> >> >>> > compaction >> >> >>> queue >> >> >>> >> >>> size=18 >> >> >>> >> >>> > 2011-03-07 15:07:58,301 DEBUG >> >> >>> >> >>> org.apache.hadoop.hbase.regionserver.HRegion: >> >> >>> >> >>> > NOT flushing memstore for region >> >> >>> >> >>> > >> >> >>> >> >>> >> >> >>> >> >> >> >>> >> >> >> >> usertable,user1601881548,1299502135191.f8efb9aa0922fa8a6a53fc49b8155ebc., >> >> >>> >> >>> > flushing=false, writesEnabled=false >> >> >>> >> >>> > 2011-03-07 15:07:58,301 DEBUG >> >> >>> >> >>> org.apache.hadoop.hbase.regionserver.HRegion: >> >> >>> >> >>> > Started memstore flush for >> >> >>> >> >>> > >> >> >>> >> >>> >> >> >>> >> >> >> >>> >> >> >> >> usertable,user1662209069,1299502135191.9fa929e6fb439843cffb604dea3f88f6., >> >> >>> >> >>> > current region memstore size 68.6m >> >> >>> >> >>> > 2011-03-07 15:07:58,310 DEBUG >> >> >>> >> >>> org.apache.hadoop.hbase.regionserver.HRegion: >> >> >>> >> >>> > Flush requested on >> >> >>> >> >>> > >> >> >>> >> >> >> >>> >> >> >> >> usertable,user1601881548,1299502135191.f8efb9aa0922fa8a6a53fc49b8155ebc. >> >> >>> >> >>> > -end of log file- >> >> >>> >> >>> > --- >> >> >>> >> >>> > >> >> >>> >> >>> >> >> >>> >> >>> Nothing more? >> >> >>> >> >>> >> >> >>> >> >>> >> >> >>> >> >> No, nothing after that. But quite a lot of logs before that, >> >> >>> >> >> I >> >> can >> >> >>> send >> >> >>> >> >> them if you'd like. >> >> >>> >> >> >> >> >>> >> >> >> >> >>> >> >> >> >> >>> >> >>> Thanks, >> >> >>> >> >>> St.Ack >> >> >>> >> >>> >> >> >>> >> >> >> >> >>> >> >> Thanks alot! >> >> >>> >> >> >> >> >>> >> >> >> >> >>> >> > >> >> >>> >> >> >> >>> > >> >> >>> >> >> >> >> >> >> >> >> > >> >> >> >> -- >> >> 从我的移动设备发送 >> >> >> >> Thanks & Best regards >> >> jiajun >> >> >> > >> > >> > >> > -- >> > erdem agaoglu >> > > >
