This is a JVM error, and there seem to be a lot of them in the recent versions. I personally recommend using u16 or u17.
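For anyone scripting this check across a cluster, a minimal sketch (the function name is illustrative, not from the thread) that pulls the update number out of a `1.6.0_NN` style version string, so a deploy script could flag JVMs outside the recommended u16/u17 range:

```shell
# Extract the update number from a Sun JDK 6 version string such as
# "1.6.0_23" (the format reported in the hs_err dump below).
jvm_update() {
  echo "${1##*_}"   # strip everything up to the last underscore
}

jvm_update 1.6.0_23   # prints 23 -- the JVM that crashed in this thread
jvm_update 1.6.0_17   # prints 17 -- within the recommended range
```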
J-D

On Wed, Mar 9, 2011 at 1:01 AM, Erdem Agaoglu <[email protected]> wrote:
> I don't know if it's related but i've seen a dead regionserver a little
> while ago too. But in our case the .out file showed some JVM crash along
> with a hs_err dump in hbase home (attached below). We were running 0.90.0
> at the moment and we upgraded to 0.90.1 in hopes of a fix but nothing
> changed.
>
> The crash happened when we ran a RowCounter job, with 12 simultaneous
> tasks on 11 machines, 132 simultaneous tasks total. The table has ~100k
> rows with ~700kB per row. We were trying different max_region_size values,
> and that instance had 100M. We have ~1000 regions total. Our machines have
> 24G ram and ganglia shows no resource starvation nor swapping.
>
> These happened about a week ago, but i wasn't able to test further so i
> delayed asking here, but if it has any relation to the problem Deniz's
> having, this log might be useful.
>
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> # SIGSEGV (0xb) at pc=0x00007fd02e23825b, pid=30204, tid=140531828942608
> #
> # JRE version: 6.0_23-b05
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (19.0-b09 mixed mode
> linux-amd64 compressed oops)
> # Problematic frame:
> # V  [libjvm.so+0x30325b]
> #
> # If you would like to submit a bug report, please visit:
> #   http://java.sun.com/webapps/bugreport/crash.jsp
> #
>
> ---------------  T H R E A D  ---------------
>
> Current thread (0x000000004013d800): ConcurrentGCThread [stack:
> 0x00007fd01dae6000,0x00007fd01dbe7000] [id=30221]
>
> siginfo:si_signo=SIGSEGV: si_errno=0, si_code=1 (SEGV_MAPERR),
> si_addr=0x0000000000000018
>
> Registers:
> RAX=0x000000004013cbd8, RBX=0x00007fd02e8c6960, RCX=0x0000000000000003,
> RDX=0x0000000000000000
> RSP=0x00007fd01dbe58c0, RBP=0x00007fd01dbe58e0, RSI=0x00007fd02e8aa9b0,
> RDI=0x0000000000000010
> R8 =0x00000000175f6400, R9 =0x000000000000000c, R10=0x00007fd02e8aa754,
> R11=0x00000000000209bc
> R12=0x00007fd01dbe5a00, R13=0x00000006c3272000, R14=0x000000004013c9c0,
> R15=0x00007fd01dbe5ab0
> RIP=0x00007fd02e23825b, EFL=0x0000000000010246, CSGSFS=0x0000000000000033,
> ERR=0x0000000000000004
> TRAPNO=0x000000000000000e
>
> Register to memory mapping:
>
> RAX=0x000000004013cbd8
> 0x000000004013cbd8 is pointing to unknown location
>
> RBX=0x00007fd02e8c6960
> 0x00007fd02e8c6960: <offset 0x991960> in
> /usr/lib64/jvm/java-1.6.0-sun-1.6.0/jre/lib/amd64/server/libjvm.so at
> 0x00007fd02df35000
>
> RCX=0x0000000000000003
> 0x0000000000000003 is pointing to unknown location
>
> RDX=0x0000000000000000
> 0x0000000000000000 is pointing to unknown location
>
> RSP=0x00007fd01dbe58c0
> 0x00007fd01dbe58c0 is pointing to unknown location
>
> RBP=0x00007fd01dbe58e0
> 0x00007fd01dbe58e0 is pointing to unknown location
>
> RSI=0x00007fd02e8aa9b0
> 0x00007fd02e8aa9b0: <offset 0x9759b0> in
> /usr/lib64/jvm/java-1.6.0-sun-1.6.0/jre/lib/amd64/server/libjvm.so at
> 0x00007fd02df35000
>
> RDI=0x0000000000000010
> 0x0000000000000010 is pointing to unknown location
>
> R8 =0x00000000175f6400
> 0x00000000175f6400 is pointing to unknown location
>
> R9 =0x000000000000000c
> 0x000000000000000c is pointing to unknown location
>
> R10=0x00007fd02e8aa754
> 0x00007fd02e8aa754: <offset 0x975754> in
> /usr/lib64/jvm/java-1.6.0-sun-1.6.0/jre/lib/amd64/server/libjvm.so at
> 0x00007fd02df35000
>
> R11=0x00000000000209bc
> 0x00000000000209bc is pointing to unknown location
>
> R12=0x00007fd01dbe5a00
> 0x00007fd01dbe5a00 is pointing to unknown location
>
> R13=0x00000006c3272000
>
>
> On Tue, Mar 8, 2011 at 6:21 PM, 陈加俊 <[email protected]> wrote:
>
>> Had the HTable been disabled when you hit ctrl+c?
>>
>> 2011/3/8, M.Deniz OKTAR <[email protected]>:
>> > Something new came up!
>> >
>> > I tried to truncate the 'usertable' which had ~12M entries.
>> >
>> > Shell stayed at "disabling table" for a long time. The process was
>> > there but there were no requests. So I quit the state by ctrl-c.
>> >
>> > Then tried count 'usertable' to see if data remains; shell gave an
>> > error and one of the regionservers had a log such as below.
>> >
>> > The master logs were also similar (tried to disable again, and the
>> > master log is from that trial)
>> >
>> > Regionserver 2:
>> >
>> > 2011-03-08 16:47:24,852 DEBUG
>> > org.apache.hadoop.hbase.regionserver.HRegionServer:
>> > NotServingRegionException; Region is not online:
>> > usertable,,1299593459085.d37bb124feaf8f5d08e51064a36596f8.
>> > 2011-03-08 16:47:27,765 DEBUG
>> > org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=39.63 MB,
>> > free=4.65 GB, max=4.68 GB, blocks=35, accesses=376070, hits=12035,
>> > hitRatio=3.20%%, cachingAccesses=12070, cachingHits=12035,
>> > cachingHitsRatio=99.71%%, evictions=0, evicted=0, evictedPerRun=NaN
>> > 2011-03-08 16:47:28,863 DEBUG
>> > org.apache.hadoop.hbase.regionserver.HRegionServer:
>> > NotServingRegionException; Region is not online:
>> > usertable,,1299593459085.d37bb124feaf8f5d08e51064a36596f8.
>> > 2011-03-08 16:47:28,865 ERROR
>> > org.apache.hadoop.hbase.regionserver.HRegionServer:
>> > org.apache.hadoop.hbase.UnknownScannerException: Name: -1
>> >     at org.apache.hadoop.hbase.regionserver.HRegionServer.next(HRegionServer.java:1795)
>> >     at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown Source)
>> >     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>> >     at java.lang.reflect.Method.invoke(Method.java:597)
>> >     at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
>> >     at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039)
>> >
>> > Masterserver:
>> > .
>> > .
>> > . (same thing)
>> > 2011-03-08 16:51:34,679 INFO
>> > org.apache.hadoop.hbase.master.AssignmentManager: Region has been
>> > PENDING_CLOSE for too long, running forced unassign again on
>> > region=usertable,user1948102037,1299592536693.d5bae6bbe54aa182e1215ab626e0011e.
>> >
>> > --
>> > deniz
>> >
>> > On Tue, Mar 8, 2011 at 4:34 PM, M.Deniz OKTAR <[email protected]> wrote:
>> >
>> >> Hi all,
>> >>
>> >> Thanks for the support. I've been trying to replicate the problem
>> >> since this morning. Before doing that, I played with the
>> >> configuration. I used to have only one user and set all the
>> >> permissions according to that. Now I've followed the cloudera manuals
>> >> and set permissions for the hdfs and mapred users. (changed the
>> >> hbase-env.sh)
>> >>
>> >> I had 2 trials; in both the yahoo test failed because of receiving
>> >> lots of "0"s, but the region servers didn't die. At some points in the
>> >> test (also when it failed), the hbase master gave exceptions about not
>> >> being able to reach one of the servers. I also lost the ssh connection
>> >> to that server, but after a while it recovered. (also hmaster) The
>> >> last thing in the regionserver logs was that it was going for a flush.
>> >>
>> >> I'll be going over the tests again and provide you with clean log
>> >> files from all servers. (hadoop, hbase, namenode, masternode logs)
>> >>
>> >> If you have any suggestions or directions for me to better diagnose
>> >> the problem, that would be lovely.
>> >>
>> >> btw: these servers do not have ECC memory but I do not see any
>> >> corruption in data.
>> >>
>> >> Thanks!
>> >>
>> >> --
>> >> deniz
>> >>
>> >> On Mon, Mar 7, 2011 at 7:47 PM, Jean-Daniel Cryans <[email protected]> wrote:
>> >>
>> >>> Along with a bigger portion of the log, it might be good to check if
>> >>> there's anything in the .out file that looks like a jvm error.
>> >>>
>> >>> J-D
>> >>>
>> >>> On Mon, Mar 7, 2011 at 9:22 AM, M.Deniz OKTAR <[email protected]> wrote:
>> >>> > I ran every kind of benchmark I could find on those machines and
>> >>> > they seemed to work fine. Did memory/disk tests too.
>> >>> >
>> >>> > The master node or other nodes provide some information and
>> >>> > exceptions about not being able to reach the dead node.
>> >>> >
>> >>> > Btw sometimes the process does not die but loses the connection.
>> >>> >
>> >>> > --
>> >>> > deniz
>> >>> >
>> >>> > On Mon, Mar 7, 2011 at 7:19 PM, Stack <[email protected]> wrote:
>> >>> >
>> >>> >> I'm stumped. I have nothing to go on when there are no death
>> >>> >> throes or complaints. This hardware for sure is healthy? Other
>> >>> >> stuff runs w/o issue?
>> >>> >> St.Ack
>> >>> >>
>> >>> >> On Mon, Mar 7, 2011 at 8:48 AM, M.Deniz OKTAR <[email protected]> wrote:
>> >>> >> > I don't know if it's normal but I see a lot of '0's in the test
>> >>> >> > results when it tends to fail, such as:
>> >>> >> >
>> >>> >> > 1196 sec: 7394901 operations; 0 current ops/sec;
>> >>> >> >
>> >>> >> > --
>> >>> >> > deniz
>> >>> >> >
>> >>> >> > On Mon, Mar 7, 2011 at 6:46 PM, M.Deniz OKTAR <[email protected]> wrote:
>> >>> >> >
>> >>> >> >> Hi,
>> >>> >> >>
>> >>> >> >> Thanks for the effort, answers below:
>> >>> >> >>
>> >>> >> >> On Mon, Mar 7, 2011 at 6:08 PM, Stack <[email protected]> wrote:
>> >>> >> >>
>> >>> >> >>> On Mon, Mar 7, 2011 at 5:43 AM, M.Deniz OKTAR <[email protected]> wrote:
>> >>> >> >>> > We have a 5 node cluster, 4 of them being region servers. I
>> >>> >> >>> > am running a custom workload with YCSB, and when the data is
>> >>> >> >>> > loading (heavy insert) at least one of the region servers
>> >>> >> >>> > dies after about 600000 operations.
>> >>> >> >>>
>> >>> >> >>> Tell us the character of your 'custom workload' please.
>> >>> >> >>>
>> >>> >> >> The workload is below; the part that fails is the loading part
>> >>> >> >> (-load), which inserts all the records first.
>> >>> >> >>
>> >>> >> >> recordcount=10000000
>> >>> >> >> operationcount=3000000
>> >>> >> >> workload=com.yahoo.ycsb.workloads.CoreWorkload
>> >>> >> >>
>> >>> >> >> readallfields=true
>> >>> >> >>
>> >>> >> >> readproportion=0.5
>> >>> >> >> updateproportion=0.1
>> >>> >> >> scanproportion=0
>> >>> >> >> insertproportion=0.35
>> >>> >> >> readmodifywriteproportion=0.05
>> >>> >> >>
>> >>> >> >> requestdistribution=zipfian
>> >>> >> >>
>> >>> >> >>> > There are no abnormalities in the logs as far as I can see;
>> >>> >> >>> > the only common point is that all of them (in different
>> >>> >> >>> > trials, different region servers fail) request a flush as
>> >>> >> >>> > the last log entries, given below. .out files are empty. I
>> >>> >> >>> > am looking at the /var/log/hbase folder for logs. Running
>> >>> >> >>> > sun java 6 latest version. I couldn't find any logs that
>> >>> >> >>> > indicate a problem with java. Tried the tests with openjdk
>> >>> >> >>> > and had the same results.
>> >>> >> >>>
>> >>> >> >>> It's strange that flush is the last thing in your log. The
>> >>> >> >>> process is dead? We are exiting w/o a note in logs? That's
>> >>> >> >>> unusual. We usually scream loudly when dying.
>> >>> >> >>
>> >>> >> >> Yes, that's the strange part. The last line is a flush as if
>> >>> >> >> the process never failed. Yes, the process is dead and hbase
>> >>> >> >> cannot see the node.
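For scale, a back-of-envelope sketch of what the load phase above writes. Note the field sizes are assumptions: the quoted workload file does not set `fieldcount` or `fieldlength`, so CoreWorkload's defaults of 10 fields x 100 bytes per record are used here.

```shell
# Rough raw-data size of the YCSB load phase described in the thread.
# fieldcount/fieldlength are assumed CoreWorkload defaults, since the
# quoted workload file does not override them.
recordcount=10000000
fieldcount=10
fieldlength=100

raw_bytes=$((recordcount * fieldcount * fieldlength))
echo "$((raw_bytes / 1024 / 1024 / 1024)) GB raw"   # prints "9 GB raw", before HBase overhead
```

So each of the 4 regionservers absorbs a few GB of inserts during -load, which is consistent with the steady stream of memstore flushes in the logs.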
>> >>> >> >>
>> >>> >> >>> > I have set ulimits(50000) and xceivers(20000) for multiple
>> >>> >> >>> > users and am certain that they are correct.
>> >>> >> >>>
>> >>> >> >>> The first line in an hbase log prints out the ulimit it sees.
>> >>> >> >>> You might check that the hbase process for sure is picking up
>> >>> >> >>> your ulimit setting.
>> >>> >> >>
>> >>> >> >> That was a mistake I made a couple of days ago; checked it with
>> >>> >> >> cat /proc/<pid of regionserver>/limits and all related users
>> >>> >> >> like 'hbase' have those limits. Checked the logs:
>> >>> >> >>
>> >>> >> >> Mon Mar 7 06:41:15 EET 2011 Starting regionserver on test-1
>> >>> >> >> ulimit -n 52768
>> >>> >> >>
>> >>> >> >>> > Also in the kernel logs, there are no apparent problems.
>> >>> >> >>>
>> >>> >> >>> (The mystery compounds)
>> >>> >> >>>
>> >>> >> >>> > 2011-03-07 15:07:58,301 DEBUG
>> >>> >> >>> > org.apache.hadoop.hbase.regionserver.CompactSplitThread:
>> >>> >> >>> > Compaction requested for
>> >>> >> >>> > usertable,user1030079237,1299502934627.257739740f58da96d5c5ef51a7d3efc3.
>> >>> >> >>> > because regionserver60020.cacheFlusher; priority=3,
>> >>> >> >>> > compaction queue size=18
>> >>> >> >>> > 2011-03-07 15:07:58,301 DEBUG
>> >>> >> >>> > org.apache.hadoop.hbase.regionserver.HRegion:
>> >>> >> >>> > NOT flushing memstore for region
>> >>> >> >>> > usertable,user1601881548,1299502135191.f8efb9aa0922fa8a6a53fc49b8155ebc.,
>> >>> >> >>> > flushing=false, writesEnabled=false
>> >>> >> >>> > 2011-03-07 15:07:58,301 DEBUG
>> >>> >> >>> > org.apache.hadoop.hbase.regionserver.HRegion:
>> >>> >> >>> > Started memstore flush for
>> >>> >> >>> > usertable,user1662209069,1299502135191.9fa929e6fb439843cffb604dea3f88f6.,
>> >>> >> >>> > current region memstore size 68.6m
>> >>> >> >>> > 2011-03-07 15:07:58,310 DEBUG
>> >>> >> >>> > org.apache.hadoop.hbase.regionserver.HRegion:
>> >>> >> >>> > Flush requested on
>> >>> >> >>> > usertable,user1601881548,1299502135191.f8efb9aa0922fa8a6a53fc49b8155ebc.
>> >>> >> >>> > -end of log file-
>> >>> >> >>> > ---
>> >>> >> >>>
>> >>> >> >>> Nothing more?
>> >>> >> >>
>> >>> >> >> No, nothing after that. But quite a lot of logs before that; I
>> >>> >> >> can send them if you'd like.
>> >>> >> >>
>> >>> >> >>> Thanks,
>> >>> >> >>> St.Ack
>> >>> >> >>
>> >>> >> >> Thanks a lot!
>>
>> --
>> Sent from my mobile device
>>
>> Thanks & Best regards
>> jiajun
>
> --
> erdem agaoglu
