This is a JVM error, and there seem to be a lot of them in the recent versions. I personally recommend using u16 or u17.
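For anyone scripting this check across a cluster, a minimal sketch (the function name is illustrative, not from the thread) that pulls the update number out of a `1.6.0_NN` style version string, so a deploy script could flag JVMs outside the recommended u16/u17 range:

```shell
# Extract the update number from a Sun JDK 6 version string such as
# "1.6.0_23" (the format reported in the hs_err dump below).
jvm_update() {
  echo "${1##*_}"   # strip everything up to the last underscore
}

jvm_update 1.6.0_23   # prints 23 -- the JVM that crashed in this thread
jvm_update 1.6.0_17   # prints 17 -- within the recommended range
```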
J-D

On Wed, Mar 9, 2011 at 1:01 AM, Erdem Agaoglu <[email protected]> wrote:
> I don't know if it's related but i've seen a dead regionserver a little
> while ago too. But in our case the .out file showed some JVM crash along
> with a hs_err dump in hbase home (attached below). We were running 0.90.0
> at the moment and we upgraded to 0.90.1 in hopes of a fix but nothing
> changed.
>
> The crash happened when we ran a RowCounter job, with 12 simultaneous
> tasks on 11 machines, 132 simultaneous tasks total. The table has ~100k
> rows with ~700kB per row. We were trying different max_region_size values,
> and that instance had 100M. We have ~1000 regions total. Our machines have
> 24G ram and ganglia shows no resource starvation nor swapping.
>
> These happened about a week ago, but i wasn't able to test further so i
> delayed asking here, but if it has any relation to the problem Deniz's
> having, this log might be useful.
>
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> # SIGSEGV (0xb) at pc=0x00007fd02e23825b, pid=30204, tid=140531828942608
> #
> # JRE version: 6.0_23-b05
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (19.0-b09 mixed mode
> linux-amd64 compressed oops)
> # Problematic frame:
> # V  [libjvm.so+0x30325b]
> #
> # If you would like to submit a bug report, please visit:
> #   http://java.sun.com/webapps/bugreport/crash.jsp
> #
>
> ---------------  T H R E A D  ---------------
>
> Current thread (0x000000004013d800): ConcurrentGCThread [stack:
> 0x00007fd01dae6000,0x00007fd01dbe7000] [id=30221]
>
> siginfo:si_signo=SIGSEGV: si_errno=0, si_code=1 (SEGV_MAPERR),
> si_addr=0x0000000000000018
>
> Registers:
> RAX=0x000000004013cbd8, RBX=0x00007fd02e8c6960, RCX=0x0000000000000003,
> RDX=0x0000000000000000
> RSP=0x00007fd01dbe58c0, RBP=0x00007fd01dbe58e0, RSI=0x00007fd02e8aa9b0,
> RDI=0x0000000000000010
> R8 =0x00000000175f6400, R9 =0x000000000000000c, R10=0x00007fd02e8aa754,
> R11=0x00000000000209bc
> R12=0x00007fd01dbe5a00, R13=0x00000006c3272000, R14=0x000000004013c9c0,
> R15=0x00007fd01dbe5ab0
> RIP=0x00007fd02e23825b, EFL=0x0000000000010246, CSGSFS=0x0000000000000033,
> ERR=0x0000000000000004
> TRAPNO=0x000000000000000e
>
> Register to memory mapping:
>
> RAX=0x000000004013cbd8
> 0x000000004013cbd8 is pointing to unknown location
>
> RBX=0x00007fd02e8c6960
> 0x00007fd02e8c6960: <offset 0x991960> in
> /usr/lib64/jvm/java-1.6.0-sun-1.6.0/jre/lib/amd64/server/libjvm.so at
> 0x00007fd02df35000
>
> RCX=0x0000000000000003
> 0x0000000000000003 is pointing to unknown location
>
> RDX=0x0000000000000000
> 0x0000000000000000 is pointing to unknown location
>
> RSP=0x00007fd01dbe58c0
> 0x00007fd01dbe58c0 is pointing to unknown location
>
> RBP=0x00007fd01dbe58e0
> 0x00007fd01dbe58e0 is pointing to unknown location
>
> RSI=0x00007fd02e8aa9b0
> 0x00007fd02e8aa9b0: <offset 0x9759b0> in
> /usr/lib64/jvm/java-1.6.0-sun-1.6.0/jre/lib/amd64/server/libjvm.so at
> 0x00007fd02df35000
>
> RDI=0x0000000000000010
> 0x0000000000000010 is pointing to unknown location
>
> R8 =0x00000000175f6400
> 0x00000000175f6400 is pointing to unknown location
>
> R9 =0x000000000000000c
> 0x000000000000000c is pointing to unknown location
>
> R10=0x00007fd02e8aa754
> 0x00007fd02e8aa754: <offset 0x975754> in
> /usr/lib64/jvm/java-1.6.0-sun-1.6.0/jre/lib/amd64/server/libjvm.so at
> 0x00007fd02df35000
>
> R11=0x00000000000209bc
> 0x00000000000209bc is pointing to unknown location
>
> R12=0x00007fd01dbe5a00
> 0x00007fd01dbe5a00 is pointing to unknown location
>
> R13=0x00000006c3272000
>
>
> On Tue, Mar 8, 2011 at 6:21 PM, 陈加俊 <[email protected]> wrote:
>
>> Had the HTable been disabled when you hit ctrl+c?
>>
>> 2011/3/8, M.Deniz OKTAR <[email protected]>:
>> > Something new came up!
>> >
>> > I tried to truncate the 'usertable' which had ~12M entries.
>> >
>> > Shell stayed at "disabling table" for a long time. The process was
>> > there but there were no requests. So I quit the state by ctrl-c.
>> >
>> > Then tried count 'usertable' to see if data remains; shell gave an
>> > error and one of the regionservers had a log such as below.
>> >
>> > The master logs were also similar (tried to disable again, and the
>> > master log is from that trial)
>> >
>> > Regionserver 2:
>> >
>> > 2011-03-08 16:47:24,852 DEBUG
>> > org.apache.hadoop.hbase.regionserver.HRegionServer:
>> > NotServingRegionException; Region is not online:
>> > usertable,,1299593459085.d37bb124feaf8f5d08e51064a36596f8.
>> > 2011-03-08 16:47:27,765 DEBUG
>> > org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=39.63 MB,
>> > free=4.65 GB, max=4.68 GB, blocks=35, accesses=376070, hits=12035,
>> > hitRatio=3.20%%, cachingAccesses=12070, cachingHits=12035,
>> > cachingHitsRatio=99.71%%, evictions=0, evicted=0, evictedPerRun=NaN
>> > 2011-03-08 16:47:28,863 DEBUG
>> > org.apache.hadoop.hbase.regionserver.HRegionServer:
>> > NotServingRegionException; Region is not online:
>> > usertable,,1299593459085.d37bb124feaf8f5d08e51064a36596f8.
>> > 2011-03-08 16:47:28,865 ERROR
>> > org.apache.hadoop.hbase.regionserver.HRegionServer:
>> > org.apache.hadoop.hbase.UnknownScannerException: Name: -1
>> >     at org.apache.hadoop.hbase.regionserver.HRegionServer.next(HRegionServer.java:1795)
>> >     at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown Source)
>> >     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>> >     at java.lang.reflect.Method.invoke(Method.java:597)
>> >     at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
>> >     at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039)
>> >
>> > Masterserver:
>> > .
>> > .
>> > . (same thing)
>> > 2011-03-08 16:51:34,679 INFO
>> > org.apache.hadoop.hbase.master.AssignmentManager: Region has been
>> > PENDING_CLOSE for too long, running forced unassign again on
>> > region=usertable,user1948102037,1299592536693.d5bae6bbe54aa182e1215ab626e0011e.
>> >
>> > --
>> > deniz
>> >
>> > On Tue, Mar 8, 2011 at 4:34 PM, M.Deniz OKTAR <[email protected]> wrote:
>> >
>> >> Hi all,
>> >>
>> >> Thanks for the support. I've been trying to replicate the problem
>> >> since this morning. Before doing that, I played with the
>> >> configuration. I used to have only one user and set all the
>> >> permissions according to that. Now I've followed the cloudera manuals
>> >> and set permissions for the hdfs and mapred users. (changed the
>> >> hbase-env.sh)
>> >>
>> >> I had 2 trials; in both the yahoo test failed because of receiving
>> >> lots of "0"s, but the region servers didn't die. At some points in the
>> >> test (also when it failed), the hbase master gave exceptions about not
>> >> being able to reach one of the servers. I also lost the ssh connection
>> >> to that server, but after a while it recovered. (also hmaster) The
>> >> last thing in the regionserver logs was that it was going for a flush.
>> >>
>> >> I'll be going over the tests again and provide you with clean log
>> >> files from all servers. (hadoop, hbase, namenode, masternode logs)
>> >>
>> >> If you have any suggestions or directions for me to better diagnose
>> >> the problem, that would be lovely.
>> >>
>> >> btw: these servers do not have ECC memory but I do not see any
>> >> corruption in data.
>> >>
>> >> Thanks!
>> >>
>> >> --
>> >> deniz
>> >>
>> >> On Mon, Mar 7, 2011 at 7:47 PM, Jean-Daniel Cryans <[email protected]> wrote:
>> >>
>> >>> Along with a bigger portion of the log, it might be good to check if
>> >>> there's anything in the .out file that looks like a jvm error.
>> >>>
>> >>> J-D
>> >>>
>> >>> On Mon, Mar 7, 2011 at 9:22 AM, M.Deniz OKTAR <[email protected]> wrote:
>> >>> > I ran every kind of benchmark I could find on those machines and
>> >>> > they seemed to work fine. Did memory/disk tests too.
>> >>> >
>> >>> > The master node or other nodes provide some information and
>> >>> > exceptions about not being able to reach the dead node.
>> >>> >
>> >>> > Btw sometimes the process does not die but loses the connection.
>> >>> >
>> >>> > --
>> >>> > deniz
>> >>> >
>> >>> > On Mon, Mar 7, 2011 at 7:19 PM, Stack <[email protected]> wrote:
>> >>> >
>> >>> >> I'm stumped. I have nothing to go on when there are no death
>> >>> >> throes or complaints. This hardware for sure is healthy? Other
>> >>> >> stuff runs w/o issue?
>> >>> >> St.Ack
>> >>> >>
>> >>> >> On Mon, Mar 7, 2011 at 8:48 AM, M.Deniz OKTAR <[email protected]> wrote:
>> >>> >> > I don't know if it's normal but I see a lot of '0's in the test
>> >>> >> > results when it tends to fail, such as:
>> >>> >> >
>> >>> >> > 1196 sec: 7394901 operations; 0 current ops/sec;
>> >>> >> >
>> >>> >> > --
>> >>> >> > deniz
>> >>> >> >
>> >>> >> > On Mon, Mar 7, 2011 at 6:46 PM, M.Deniz OKTAR <[email protected]> wrote:
>> >>> >> >
>> >>> >> >> Hi,
>> >>> >> >>
>> >>> >> >> Thanks for the effort, answers below:
>> >>> >> >>
>> >>> >> >> On Mon, Mar 7, 2011 at 6:08 PM, Stack <[email protected]> wrote:
>> >>> >> >>
>> >>> >> >>> On Mon, Mar 7, 2011 at 5:43 AM, M.Deniz OKTAR <[email protected]> wrote:
>> >>> >> >>> > We have a 5 node cluster, 4 of them being region servers. I
>> >>> >> >>> > am running a custom workload with YCSB, and when the data is
>> >>> >> >>> > loading (heavy insert) at least one of the region servers
>> >>> >> >>> > dies after about 600000 operations.
>> >>> >> >>>
>> >>> >> >>> Tell us the character of your 'custom workload' please.
>> >>> >> >>>
>> >>> >> >> The workload is below; the part that fails is the loading part
>> >>> >> >> (-load), which inserts all the records first.
>> >>> >> >>
>> >>> >> >> recordcount=10000000
>> >>> >> >> operationcount=3000000
>> >>> >> >> workload=com.yahoo.ycsb.workloads.CoreWorkload
>> >>> >> >>
>> >>> >> >> readallfields=true
>> >>> >> >>
>> >>> >> >> readproportion=0.5
>> >>> >> >> updateproportion=0.1
>> >>> >> >> scanproportion=0
>> >>> >> >> insertproportion=0.35
>> >>> >> >> readmodifywriteproportion=0.05
>> >>> >> >>
>> >>> >> >> requestdistribution=zipfian
>> >>> >> >>
>> >>> >> >>> > There are no abnormalities in the logs as far as I can see;
>> >>> >> >>> > the only common point is that all of them (in different
>> >>> >> >>> > trials, different region servers fail) request a flush as
>> >>> >> >>> > the last log entries, given below. .out files are empty. I
>> >>> >> >>> > am looking at the /var/log/hbase folder for logs. Running
>> >>> >> >>> > sun java 6 latest version. I couldn't find any logs that
>> >>> >> >>> > indicate a problem with java. Tried the tests with openjdk
>> >>> >> >>> > and had the same results.
>> >>> >> >>>
>> >>> >> >>> It's strange that flush is the last thing in your log. The
>> >>> >> >>> process is dead? We are exiting w/o a note in logs? That's
>> >>> >> >>> unusual. We usually scream loudly when dying.
>> >>> >> >>
>> >>> >> >> Yes, that's the strange part. The last line is a flush as if
>> >>> >> >> the process never failed. Yes, the process is dead and hbase
>> >>> >> >> cannot see the node.
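For scale, a back-of-envelope sketch of what the load phase above writes. Note the field sizes are assumptions: the quoted workload file does not set `fieldcount` or `fieldlength`, so CoreWorkload's defaults of 10 fields x 100 bytes per record are used here.

```shell
# Rough raw-data size of the YCSB load phase described in the thread.
# fieldcount/fieldlength are assumed CoreWorkload defaults, since the
# quoted workload file does not override them.
recordcount=10000000
fieldcount=10
fieldlength=100

raw_bytes=$((recordcount * fieldcount * fieldlength))
echo "$((raw_bytes / 1024 / 1024 / 1024)) GB raw"   # prints "9 GB raw", before HBase overhead
```

So each of the 4 regionservers absorbs a few GB of inserts during -load, which is consistent with the steady stream of memstore flushes in the logs.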
>> >>> >> >>
>> >>> >> >>> > I have set ulimits(50000) and xceivers(20000) for multiple
>> >>> >> >>> > users and am certain that they are correct.
>> >>> >> >>>
>> >>> >> >>> The first line in an hbase log prints out the ulimit it sees.
>> >>> >> >>> You might check that the hbase process for sure is picking up
>> >>> >> >>> your ulimit setting.
>> >>> >> >>
>> >>> >> >> That was a mistake I made a couple of days ago; checked it with
>> >>> >> >> cat /proc/<pid of regionserver>/limits and all related users
>> >>> >> >> like 'hbase' have those limits. Checked the logs:
>> >>> >> >>
>> >>> >> >> Mon Mar 7 06:41:15 EET 2011 Starting regionserver on test-1
>> >>> >> >> ulimit -n 52768
>> >>> >> >>
>> >>> >> >>> > Also in the kernel logs, there are no apparent problems.
>> >>> >> >>>
>> >>> >> >>> (The mystery compounds)
>> >>> >> >>>
>> >>> >> >>> > 2011-03-07 15:07:58,301 DEBUG
>> >>> >> >>> > org.apache.hadoop.hbase.regionserver.CompactSplitThread:
>> >>> >> >>> > Compaction requested for
>> >>> >> >>> > usertable,user1030079237,1299502934627.257739740f58da96d5c5ef51a7d3efc3.
>> >>> >> >>> > because regionserver60020.cacheFlusher; priority=3,
>> >>> >> >>> > compaction queue size=18
>> >>> >> >>> > 2011-03-07 15:07:58,301 DEBUG
>> >>> >> >>> > org.apache.hadoop.hbase.regionserver.HRegion:
>> >>> >> >>> > NOT flushing memstore for region
>> >>> >> >>> > usertable,user1601881548,1299502135191.f8efb9aa0922fa8a6a53fc49b8155ebc.,
>> >>> >> >>> > flushing=false, writesEnabled=false
>> >>> >> >>> > 2011-03-07 15:07:58,301 DEBUG
>> >>> >> >>> > org.apache.hadoop.hbase.regionserver.HRegion:
>> >>> >> >>> > Started memstore flush for
>> >>> >> >>> > usertable,user1662209069,1299502135191.9fa929e6fb439843cffb604dea3f88f6.,
>> >>> >> >>> > current region memstore size 68.6m
>> >>> >> >>> > 2011-03-07 15:07:58,310 DEBUG
>> >>> >> >>> > org.apache.hadoop.hbase.regionserver.HRegion:
>> >>> >> >>> > Flush requested on
>> >>> >> >>> > usertable,user1601881548,1299502135191.f8efb9aa0922fa8a6a53fc49b8155ebc.
>> >>> >> >>> > -end of log file-
>> >>> >> >>> > ---
>> >>> >> >>>
>> >>> >> >>> Nothing more?
>> >>> >> >>
>> >>> >> >> No, nothing after that. But quite a lot of logs before that; I
>> >>> >> >> can send them if you'd like.
>> >>> >> >>
>> >>> >> >>> Thanks,
>> >>> >> >>> St.Ack
>> >>> >> >>
>> >>> >> >> Thanks a lot!
>>
>> --
>> Sent from my mobile device
>>
>> Thanks & Best regards
>> jiajun
>
> --
> erdem agaoglu
