I don't know if it's related, but I saw a dead regionserver a little while ago too. In our case the .out file showed a JVM crash, along with an hs_err dump in the HBase home directory (attached below). We were running 0.90.0 at the time and upgraded to 0.90.1 in hopes of a fix, but nothing changed.
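In case it is useful to anyone, this is roughly how we spot the crash evidence on our boxes; the paths are from our own layout, so adjust them to yours:

    # our log directory, adjust as needed
    grep -i "fatal error" /var/log/hbase/*.out
    # the JVM writes hs_err_pid<pid>.log to its working directory by default,
    # which in our case is the hbase home
    ls -lt $HBASE_HOME/hs_err_pid*.log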
The crash happened while we were running a RowCounter job with 12 simultaneous tasks on each of 11 machines, 132 simultaneous tasks in total. The table has ~100k rows at ~700kB per row. We were trying different max_region_size values, and that instance had 100M; we have ~1000 regions in total. Our machines have 24G of RAM and Ganglia shows no resource starvation or swapping. This happened about a week ago, but I wasn't able to test further so I delayed asking here; if it has any relation to the problem Deniz is having, this log might be useful.

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007fd02e23825b, pid=30204, tid=140531828942608
#
# JRE version: 6.0_23-b05
# Java VM: Java HotSpot(TM) 64-Bit Server VM (19.0-b09 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# V  [libjvm.so+0x30325b]
#
# If you would like to submit a bug report, please visit:
#   http://java.sun.com/webapps/bugreport/crash.jsp
#

---------------  T H R E A D  ---------------

Current thread (0x000000004013d800):  ConcurrentGCThread [stack: 0x00007fd01dae6000,0x00007fd01dbe7000] [id=30221]

siginfo:si_signo=SIGSEGV: si_errno=0, si_code=1 (SEGV_MAPERR), si_addr=0x0000000000000018

Registers:
RAX=0x000000004013cbd8, RBX=0x00007fd02e8c6960, RCX=0x0000000000000003, RDX=0x0000000000000000
RSP=0x00007fd01dbe58c0, RBP=0x00007fd01dbe58e0, RSI=0x00007fd02e8aa9b0, RDI=0x0000000000000010
R8 =0x00000000175f6400, R9 =0x000000000000000c, R10=0x00007fd02e8aa754, R11=0x00000000000209bc
R12=0x00007fd01dbe5a00, R13=0x00000006c3272000, R14=0x000000004013c9c0, R15=0x00007fd01dbe5ab0
RIP=0x00007fd02e23825b, EFL=0x0000000000010246, CSGSFS=0x0000000000000033, ERR=0x0000000000000004
  TRAPNO=0x000000000000000e

Register to memory mapping:

RAX=0x000000004013cbd8
0x000000004013cbd8 is pointing to unknown location

RBX=0x00007fd02e8c6960
0x00007fd02e8c6960: <offset 0x991960> in /usr/lib64/jvm/java-1.6.0-sun-1.6.0/jre/lib/amd64/server/libjvm.so at 0x00007fd02df35000

RCX=0x0000000000000003
0x0000000000000003 is pointing to unknown location

RDX=0x0000000000000000
0x0000000000000000 is pointing to unknown location

RSP=0x00007fd01dbe58c0
0x00007fd01dbe58c0 is pointing to unknown location

RBP=0x00007fd01dbe58e0
0x00007fd01dbe58e0 is pointing to unknown location

RSI=0x00007fd02e8aa9b0
0x00007fd02e8aa9b0: <offset 0x9759b0> in /usr/lib64/jvm/java-1.6.0-sun-1.6.0/jre/lib/amd64/server/libjvm.so at 0x00007fd02df35000

RDI=0x0000000000000010
0x0000000000000010 is pointing to unknown location

R8 =0x00000000175f6400
0x00000000175f6400 is pointing to unknown location

R9 =0x000000000000000c
0x000000000000000c is pointing to unknown location

R10=0x00007fd02e8aa754
0x00007fd02e8aa754: <offset 0x975754> in /usr/lib64/jvm/java-1.6.0-sun-1.6.0/jre/lib/amd64/server/libjvm.so at 0x00007fd02df35000

R11=0x00000000000209bc
0x00000000000209bc is pointing to unknown location

R12=0x00007fd01dbe5a00
0x00007fd01dbe5a00 is pointing to unknown location

R13=0x00000006c3272000

On Tue, Mar 8, 2011 at 6:21 PM, 陈加俊 <[email protected]> wrote:
> Htable had disabled when ctrl+c ?
>
> 2011/3/8, M.Deniz OKTAR <[email protected]>:
> > Something new came up!
> >
> > I tried to truncate the 'usertable' which had ~12M entries.
> >
> > Shell stayed at "disabling table" for a long time. The process was there
> > but there were no requests. So I quit the state by ctrl-c.
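A side note on the truncate above: as far as I know the shell's truncate is just the three commands below run back to back, so interrupting it while it still says "disabling table" can leave regions half closed, which would fit the PENDING_CLOSE messages in the master log further down. The table and family names here are only examples, not something I verified against your schema:

    hbase> disable 'usertable'
    hbase> drop 'usertable'
    hbase> create 'usertable', 'f1'   # recreated with whatever column families the table had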
> >
> > Then tried count 'usertable' to see if data remains; the shell gave an error
> > and one of the regionservers had a log such as below.
> >
> > The master logs were also similar (tried to disable again, and the master
> > log is from that trial)
> >
> >
> > Regionserver 2:
> >
> > 2011-03-08 16:47:24,852 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; Region is not online: usertable,,1299593459085.d37bb124feaf8f5d08e51064a36596f8.
> > 2011-03-08 16:47:27,765 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=39.63 MB, free=4.65 GB, max=4.68 GB, blocks=35, accesses=376070, hits=12035, hitRatio=3.20%%, cachingAccesses=12070, cachingHits=12035, cachingHitsRatio=99.71%%, evictions=0, evicted=0, evictedPerRun=NaN
> > 2011-03-08 16:47:28,863 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; Region is not online: usertable,,1299593459085.d37bb124feaf8f5d08e51064a36596f8.
> > 2011-03-08 16:47:28,865 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: org.apache.hadoop.hbase.UnknownScannerException: Name: -1
> >         at org.apache.hadoop.hbase.regionserver.HRegionServer.next(HRegionServer.java:1795)
> >         at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown Source)
> >         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >         at java.lang.reflect.Method.invoke(Method.java:597)
> >         at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039)
> >
> >
> >
> > Masterserver:
> > .
> > .
> > . (same thing)
> > 2011-03-08 16:51:34,679 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=usertable,user1948102037,1299592536693.d5bae6bbe54aa182e1215ab626e0011e.
> >
> >
> > --
> > deniz
> >
> >
> > On Tue, Mar 8, 2011 at 4:34 PM, M.Deniz OKTAR <[email protected]> wrote:
> >
> >> Hi all,
> >>
> >> Thanks for the support. I've been trying to replicate the problem since
> >> this morning. Before doing that, I played with the configuration. I used
> >> to have only one user and set all the permissions according to that. Now
> >> I've followed the Cloudera manuals and set permissions for the hdfs and
> >> mapred users (changed the hbase-env.sh).
> >>
> >> I had 2 trials; on both, the Yahoo test failed because of receiving lots
> >> of "0"s, but the region servers didn't die. At some points in the test
> >> (also when it failed), the hbase master gave exceptions about not being
> >> able to reach one of the servers. I also lost the ssh connection to that
> >> server, but after a while it recovered (also hmaster). The last thing in
> >> the regionserver logs was that it was going for a flush.
> >>
> >> I'll be going over the tests again and provide you with clean log files
> >> from all servers. (hadoop, hbase, namenode, masternode logs)
> >>
> >> If you have any suggestions or directions for me to better diagnose the
> >> problem, that would be lovely.
> >>
> >> btw: these servers do not have ECC memory but I do not see any corruption
> >> in data.
> >>
> >> Thanks!
> >>
> >> --
> >> deniz
> >>
> >>
> >> On Mon, Mar 7, 2011 at 7:47 PM, Jean-Daniel Cryans <[email protected]> wrote:
> >>
> >>> Along with a bigger portion of the log, it might be good to check if
> >>> there's anything in the .out file that looks like a jvm error.
> >>>
> >>> J-D
> >>>
> >>> On Mon, Mar 7, 2011 at 9:22 AM, M.Deniz OKTAR <[email protected]> wrote:
> >>> > I ran every kind of benchmark I could find on those machines and they
> >>> > seemed to work fine. Did memory/disk tests too.
> >>> >
> >>> > The master node or other nodes provide some information and exceptions
> >>> > about not being able to reach the dead node.
> >>> >
> >>> > Btw, sometimes the process does not die but loses the connection.
> >>> >
> >>> > --
> >>> >
> >>> > deniz
> >>> >
> >>> > On Mon, Mar 7, 2011 at 7:19 PM, Stack <[email protected]> wrote:
> >>> >
> >>> >> I'm stumped. I have nothing to go on when no death throes or
> >>> >> complaints. This hardware for sure is healthy? Other stuff runs w/o
> >>> >> issue?
> >>> >> St.Ack
> >>> >>
> >>> >> On Mon, Mar 7, 2011 at 8:48 AM, M.Deniz OKTAR <[email protected]> wrote:
> >>> >> > I don't know if it's normal, but I see a lot of '0's in the test
> >>> >> > results when it tends to fail, such as:
> >>> >> >
> >>> >> > 1196 sec: 7394901 operations; 0 current ops/sec;
> >>> >> >
> >>> >> > --
> >>> >> > deniz
> >>> >> >
> >>> >> > On Mon, Mar 7, 2011 at 6:46 PM, M.Deniz OKTAR <[email protected]> wrote:
> >>> >> >
> >>> >> >> Hi,
> >>> >> >>
> >>> >> >> Thanks for the effort, answers below:
> >>> >> >>
> >>> >> >> On Mon, Mar 7, 2011 at 6:08 PM, Stack <[email protected]> wrote:
> >>> >> >>
> >>> >> >>> On Mon, Mar 7, 2011 at 5:43 AM, M.Deniz OKTAR <[email protected]> wrote:
> >>> >> >>> > We have a 5 node cluster, 4 of them being region servers. I am
> >>> >> >>> > running a custom workload with YCSB, and while the data is loading
> >>> >> >>> > (heavy insert) at least one of the region servers dies after about
> >>> >> >>> > 600000 operations.
> >>> >> >>>
> >>> >> >>> Tell us the character of your 'custom workload' please.
> >>> >> >>>
> >>> >> >> The workload is below; the part that fails is the loading part
> >>> >> >> (-load), which inserts all the records first:
> >>> >> >>
> >>> >> >> recordcount=10000000
> >>> >> >> operationcount=3000000
> >>> >> >> workload=com.yahoo.ycsb.workloads.CoreWorkload
> >>> >> >>
> >>> >> >> readallfields=true
> >>> >> >>
> >>> >> >> readproportion=0.5
> >>> >> >> updateproportion=0.1
> >>> >> >> scanproportion=0
> >>> >> >> insertproportion=0.35
> >>> >> >> readmodifywriteproportion=0.05
> >>> >> >>
> >>> >> >> requestdistribution=zipfian
> >>> >> >>
> >>> >> >>> > There are no abnormalities in the logs as far as I can see; the
> >>> >> >>> > only common point is that all of them (in different trials,
> >>> >> >>> > different region servers fail) request a flush as the last log
> >>> >> >>> > entries, given below. The .out files are empty. I am looking at
> >>> >> >>> > the /var/log/hbase folder for logs. Running Sun Java 6, latest
> >>> >> >>> > version. I couldn't find any logs that indicate a problem with
> >>> >> >>> > Java. Tried the tests with OpenJDK and had the same results.
> >>> >> >>> >
> >>> >> >>>
> >>> >> >>> Its strange that flush is the last thing in your log. The process is
> >>> >> >>> dead? We are exiting w/o a note in logs? Thats unusual. We usually
> >>> >> >>> scream loudly when dying.
> >>> >> >>>
> >>> >> >> Yes, that's the strange part. The last line is a flush, as if the
> >>> >> >> process never failed. Yes, the process is dead and hbase cannot see
> >>> >> >> the node.
> >>> >> >>
> >>> >> >>> > I have set ulimits (50000) and xceivers (20000) for multiple users
> >>> >> >>> > and am certain that they are correct.
> >>> >> >>>
> >>> >> >>> The first line in an hbase log prints out the ulimit it sees. You
> >>> >> >>> might check that the hbase process for sure is picking up your
> >>> >> >>> ulimit setting.
> >>> >> >>>
> >>> >> >> That was a mistake I made a couple of days ago; I checked it with
> >>> >> >> cat /proc/<pid of regionserver>/limits and all related users like
> >>> >> >> 'hbase' have those limits. Checked the logs:
> >>> >> >>
> >>> >> >> Mon Mar 7 06:41:15 EET 2011 Starting regionserver on test-1
> >>> >> >> ulimit -n 52768
> >>> >> >>
> >>> >> >>> > Also in the kernel logs, there are no apparent problems.
> >>> >> >>> >
> >>> >> >>> (The mystery compounds)
> >>> >> >>>
> >>> >> >>> > 2011-03-07 15:07:58,301 DEBUG org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction requested for usertable,user1030079237,1299502934627.257739740f58da96d5c5ef51a7d3efc3. because regionserver60020.cacheFlusher; priority=3, compaction queue size=18
> >>> >> >>> > 2011-03-07 15:07:58,301 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: NOT flushing memstore for region usertable,user1601881548,1299502135191.f8efb9aa0922fa8a6a53fc49b8155ebc., flushing=false, writesEnabled=false
> >>> >> >>> > 2011-03-07 15:07:58,301 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Started memstore flush for usertable,user1662209069,1299502135191.9fa929e6fb439843cffb604dea3f88f6., current region memstore size 68.6m
> >>> >> >>> > 2011-03-07 15:07:58,310 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Flush requested on usertable,user1601881548,1299502135191.f8efb9aa0922fa8a6a53fc49b8155ebc.
> >>> >> >>> > -end of log file-
> >>> >> >>> > ---
> >>> >> >>> >
> >>> >> >>> Nothing more?
> >>> >> >>>
> >>> >> >> No, nothing after that. But quite a lot of logs before that; I can
> >>> >> >> send them if you'd like.
> >>> >> >>
> >>> >> >>> Thanks,
> >>> >> >>> St.Ack
> >>> >> >>
> >>> >> >> Thanks a lot!
> >>> >> >>
> >>> >> >
> >>> >>
> >>> >
> >>>
> >>
> >>
> >
> >
> --
> Sent from my mobile device
>
> Thanks & Best regards
> jiajun

--
erdem agaoglu
