That looks like someone trying to connect to the master but not doing the handshake properly. Do you have old versions of HBase around the place? Or some other process connecting to the HBase Master?
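One way to chase down the "some other process connecting to the Master" theory is to list the established TCP sessions on the master RPC port and see which hosts they come from. A minimal sketch, assuming the default master port 60000 and Linux `netstat -tan` output; `filter_established` is a hypothetical helper, not an HBase tool:

```shell
# filter_established: given `netstat -tan` style lines on stdin, print the
# remote endpoint of every ESTABLISHED connection whose local port is $1.
filter_established() {
  awk -v p=":$1$" '$4 ~ p && $6 == "ESTABLISHED" {print $5}'
}

# On the live master you would run something like:
#   netstat -tan | filter_established 60000 | sort | uniq -c | sort -rn
# and look for hosts that are not part of the cluster.
```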
As to unresponsive for 100 seconds, what was going on on your cluster? Any clues in the Master log or in any of the regionserver logs?
St.Ack

2011/3/10 M.Deniz OKTAR <[email protected]>:
> Hi,
>
> Still working on the issue. This is one of the last trials I am doing before
> ordering a new cluster.
>
> I was going through the Yahoo benchmark again and HBase became unresponsive
> for a long time (about 100 secs); benchmark results were 0 throughput for
> that time and eventually the benchmark failed.
>
> I didn't get any exceptions but this one, on the Master node; iletken-test-0
> is also the Master node, so it was trying to recover a file from the same
> node.
>
> Any suggestions?
>
> Thanks.
>
> ---
>
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting hlog 3 of 14:
> hdfs://iletken-test-0:30001/hbase3/.logs/iletken-test-2,60020,1299794182845/iletken-test-2%3A60020.1299794988383,
> length=86340652
> 2011-03-11 00:15:54,825 INFO org.apache.hadoop.hbase.util.FSUtils: Recovering file
> hdfs://iletken-test-0:30001/hbase3/.logs/iletken-test-2,60020,1299794182845/iletken-test-2%3A60020.1299794988383
> 2011-03-11 00:15:56,675 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server
> listener on 60000: readAndProcess threw exception java.io.IOException:
> Connection reset by peer. Count of bytes read: 0
> java.io.IOException: Connection reset by peer
>         at sun.nio.ch.FileDispatcher.read0(Native Method)
>         at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
>
> --
> deniz
>
> On Thu, Mar 10, 2011 at 12:51 AM, Jean-Daniel Cryans
> <[email protected]> wrote:
>
>> This is a JVM error, and there seems to be a lot of them in the recent
>> versions. I personally recommend using u16 or u17.
>>
>> J-D
>>
>> On Wed, Mar 9, 2011 at 1:01 AM, Erdem Agaoglu <[email protected]>
>> wrote:
>> > I don't know if it's related but I've seen a dead regionserver a little
>> > while ago too.
>> > But in our case the .out file showed a JVM crash along with a
>> > hs_err dump in the HBase home (attached below). We were running 0.90.0 at
>> > the time and we upgraded to 0.90.1 in hopes of a fix, but nothing changed.
>> >
>> > The crash happened when we ran a RowCounter job, with 12 simultaneous tasks
>> > on 11 machines, 132 simultaneous tasks total. The table has ~100k rows with
>> > ~700kB per row. We were trying different max_region_size values, and that
>> > instance had 100M. We have ~1000 regions total. Our machines have 24G RAM
>> > and ganglia shows no resource starvation nor swapping.
>> >
>> > These happened about a week ago, but I wasn't able to test further so I
>> > delayed asking here; if it has any relation to the problem Deniz is having,
>> > this log might be useful.
>> >
>> > #
>> > # A fatal error has been detected by the Java Runtime Environment:
>> > #
>> > # SIGSEGV (0xb) at pc=0x00007fd02e23825b, pid=30204, tid=140531828942608
>> > #
>> > # JRE version: 6.0_23-b05
>> > # Java VM: Java HotSpot(TM) 64-Bit Server VM (19.0-b09 mixed mode linux-amd64 compressed oops)
>> > # Problematic frame:
>> > # V  [libjvm.so+0x30325b]
>> > #
>> > # If you would like to submit a bug report, please visit:
>> > #   http://java.sun.com/webapps/bugreport/crash.jsp
>> > #
>> >
>> > ---------------  T H R E A D  ---------------
>> >
>> > Current thread (0x000000004013d800):  ConcurrentGCThread [stack:
>> > 0x00007fd01dae6000,0x00007fd01dbe7000] [id=30221]
>> >
>> > siginfo:si_signo=SIGSEGV: si_errno=0, si_code=1 (SEGV_MAPERR),
>> > si_addr=0x0000000000000018
>> >
>> > Registers:
>> > RAX=0x000000004013cbd8, RBX=0x00007fd02e8c6960, RCX=0x0000000000000003,
>> > RDX=0x0000000000000000
>> > RSP=0x00007fd01dbe58c0, RBP=0x00007fd01dbe58e0, RSI=0x00007fd02e8aa9b0,
>> > RDI=0x0000000000000010
>> > R8 =0x00000000175f6400, R9 =0x000000000000000c, R10=0x00007fd02e8aa754,
>> > R11=0x00000000000209bc
>> > R12=0x00007fd01dbe5a00, R13=0x00000006c3272000,
>> > R14=0x000000004013c9c0, R15=0x00007fd01dbe5ab0
>> > RIP=0x00007fd02e23825b, EFL=0x0000000000010246, CSGSFS=0x0000000000000033,
>> > ERR=0x0000000000000004
>> > TRAPNO=0x000000000000000e
>> >
>> > Register to memory mapping:
>> >
>> > RAX=0x000000004013cbd8
>> > 0x000000004013cbd8 is pointing to unknown location
>> >
>> > RBX=0x00007fd02e8c6960
>> > 0x00007fd02e8c6960: <offset 0x991960> in
>> > /usr/lib64/jvm/java-1.6.0-sun-1.6.0/jre/lib/amd64/server/libjvm.so at
>> > 0x00007fd02df35000
>> >
>> > RCX=0x0000000000000003
>> > 0x0000000000000003 is pointing to unknown location
>> >
>> > RDX=0x0000000000000000
>> > 0x0000000000000000 is pointing to unknown location
>> >
>> > RSP=0x00007fd01dbe58c0
>> > 0x00007fd01dbe58c0 is pointing to unknown location
>> >
>> > RBP=0x00007fd01dbe58e0
>> > 0x00007fd01dbe58e0 is pointing to unknown location
>> >
>> > RSI=0x00007fd02e8aa9b0
>> > 0x00007fd02e8aa9b0: <offset 0x9759b0> in
>> > /usr/lib64/jvm/java-1.6.0-sun-1.6.0/jre/lib/amd64/server/libjvm.so at
>> > 0x00007fd02df35000
>> >
>> > RDI=0x0000000000000010
>> > 0x0000000000000010 is pointing to unknown location
>> >
>> > R8 =0x00000000175f6400
>> > 0x00000000175f6400 is pointing to unknown location
>> >
>> > R9 =0x000000000000000c
>> > 0x000000000000000c is pointing to unknown location
>> >
>> > R10=0x00007fd02e8aa754
>> > 0x00007fd02e8aa754: <offset 0x975754> in
>> > /usr/lib64/jvm/java-1.6.0-sun-1.6.0/jre/lib/amd64/server/libjvm.so at
>> > 0x00007fd02df35000
>> >
>> > R11=0x00000000000209bc
>> > 0x00000000000209bc is pointing to unknown location
>> >
>> > R12=0x00007fd01dbe5a00
>> > 0x00007fd01dbe5a00 is pointing to unknown location
>> >
>> > R13=0x00000006c3272000
>> >
>> >
>> > On Tue, Mar 8, 2011 at 6:21 PM, 陈加俊 <[email protected]> wrote:
>> >
>> >> Had the HTable been disabled when you hit Ctrl-C?
>> >>
>> >> 2011/3/8, M.Deniz OKTAR <[email protected]>:
>> >> > Something new came up!
>> >> >
>> >> > I tried to truncate the 'usertable' which had ~12M entries.
>> >> >
>> >> > The shell stayed at "disabling table" for a long time. The process was
>> >> > there but there were no requests, so I quit that state with Ctrl-C.
>> >> >
>> >> > Then I tried count 'usertable' to see if the data remained; the shell
>> >> > gave an error and one of the regionservers had a log such as the one below.
>> >> >
>> >> > The master logs were also similar (I tried to disable again, and the
>> >> > master log is from that trial).
>> >> >
>> >> > Regionserver 2:
>> >> >
>> >> > 2011-03-08 16:47:24,852 DEBUG
>> >> > org.apache.hadoop.hbase.regionserver.HRegionServer:
>> >> > NotServingRegionException; Region is not online:
>> >> > usertable,,1299593459085.d37bb124feaf8f5d08e51064a36596f8.
>> >> > 2011-03-08 16:47:27,765 DEBUG
>> >> > org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=39.63 MB,
>> >> > free=4.65 GB, max=4.68 GB, blocks=35, accesses=376070, hits=12035,
>> >> > hitRatio=3.20%%, cachingAccesses=12070, cachingHits=12035,
>> >> > cachingHitsRatio=99.71%%, evictions=0, evicted=0, evictedPerRun=NaN
>> >> > 2011-03-08 16:47:28,863 DEBUG
>> >> > org.apache.hadoop.hbase.regionserver.HRegionServer:
>> >> > NotServingRegionException; Region is not online:
>> >> > usertable,,1299593459085.d37bb124feaf8f5d08e51064a36596f8.
>> >> > 2011-03-08 16:47:28,865 ERROR
>> >> > org.apache.hadoop.hbase.regionserver.HRegionServer:
>> >> > org.apache.hadoop.hbase.UnknownScannerException: Name: -1
>> >> >         at org.apache.hadoop.hbase.regionserver.HRegionServer.next(HRegionServer.java:1795)
>> >> >         at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown Source)
>> >> >         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>> >> >         at java.lang.reflect.Method.invoke(Method.java:597)
>> >> >         at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
>> >> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039)
>> >> >
>> >> > Masterserver:
>> >> > .
>> >> > .
>> >> > . (same thing)
>> >> > 2011-03-08 16:51:34,679 INFO
>> >> > org.apache.hadoop.hbase.master.AssignmentManager: Region has been
>> >> > PENDING_CLOSE for too long, running forced unassign again on
>> >> > region=usertable,user1948102037,1299592536693.d5bae6bbe54aa182e1215ab626e0011e.
>> >> >
>> >> > --
>> >> > deniz
>> >> >
>> >> > On Tue, Mar 8, 2011 at 4:34 PM, M.Deniz OKTAR <[email protected]> wrote:
>> >> >
>> >> >> Hi all,
>> >> >>
>> >> >> Thanks for the support. I've been trying to replicate the problem since
>> >> >> this morning. Before doing that, I played with the configuration. I used
>> >> >> to have only one user and set all the permissions according to that. Now
>> >> >> I've followed the Cloudera manuals and set permissions for the hdfs and
>> >> >> mapred users (changed hbase-env.sh).
>> >> >>
>> >> >> I had 2 trials; in both, the Yahoo test failed because of receiving lots
>> >> >> of "0"s, but the region servers didn't die. At some points in the test
>> >> >> (also when it failed), the HBase master gave exceptions about not being
>> >> >> able to reach one of the servers.
>> >> >> I also lost the ssh connection to that server, but after a while it
>> >> >> recovered (hmaster did too). The last thing in the regionserver logs was
>> >> >> that it was going for a flush.
>> >> >>
>> >> >> I'll be going over the tests again and provide you with clean log files
>> >> >> from all servers (hadoop, hbase, namenode, masternode logs).
>> >> >>
>> >> >> If you have any suggestions or directions for me to better diagnose the
>> >> >> problem, that would be lovely.
>> >> >>
>> >> >> btw: these servers do not have ECC memory but I do not see any
>> >> >> corruption in the data.
>> >> >>
>> >> >> Thanks!
>> >> >>
>> >> >> --
>> >> >> deniz
>> >> >>
>> >> >> On Mon, Mar 7, 2011 at 7:47 PM, Jean-Daniel Cryans
>> >> >> <[email protected]> wrote:
>> >> >>
>> >> >>> Along with a bigger portion of the log, it might be good to check if
>> >> >>> there's anything in the .out file that looks like a JVM error.
>> >> >>>
>> >> >>> J-D
>> >> >>>
>> >> >>> On Mon, Mar 7, 2011 at 9:22 AM, M.Deniz OKTAR <[email protected]>
>> >> >>> wrote:
>> >> >>> > I ran every kind of benchmark I could find on those machines and
>> >> >>> > they seemed to work fine. Did memory/disk tests too.
>> >> >>> >
>> >> >>> > The master node and the other nodes provide some information and
>> >> >>> > exceptions about not being able to reach the dead node.
>> >> >>> >
>> >> >>> > Btw, sometimes the process does not die but loses the connection.
>> >> >>> >
>> >> >>> > --
>> >> >>> > deniz
>> >> >>> >
>> >> >>> > On Mon, Mar 7, 2011 at 7:19 PM, Stack <[email protected]> wrote:
>> >> >>> >
>> >> >>> >> I'm stumped. I have nothing to go on when there are no death throes
>> >> >>> >> or complaints. This hardware for sure is healthy? Other stuff runs
>> >> >>> >> w/o issue?
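J-D's suggestion above (look for a JVM error in the .out files) can be scripted so nothing gets missed across nodes. A sketch under the assumptions that a crash leaves either a non-empty .out file or a HotSpot hs_err_pid*.log dump in the directory you point it at; `check_for_crash` is a hypothetical helper, and the example paths may differ on your install:

```shell
# check_for_crash: scan a directory for signs of a JVM crash --
# non-empty .out files and HotSpot hs_err_pid*.log dump files.
check_for_crash() {
  dir=$1
  find "$dir" -name '*.out' -size +0c 2>/dev/null
  find "$dir" -name 'hs_err_pid*.log' 2>/dev/null
}

# e.g. on each node:
#   check_for_crash /var/log/hbase
#   check_for_crash /usr/lib/hbase   # hs_err dumps land in the JVM's working dir
```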
>> >>> >> St.Ack
>> >>> >>
>> >>> >> On Mon, Mar 7, 2011 at 8:48 AM, M.Deniz OKTAR <[email protected]>
>> >>> >> wrote:
>> >>> >> > I don't know if it's normal but I see a lot of '0's in the test
>> >>> >> > results when it tends to fail, such as:
>> >>> >> >
>> >>> >> > 1196 sec: 7394901 operations; 0 current ops/sec;
>> >>> >> >
>> >>> >> > --
>> >>> >> > deniz
>> >>> >> >
>> >>> >> > On Mon, Mar 7, 2011 at 6:46 PM, M.Deniz OKTAR <[email protected]>
>> >>> >> > wrote:
>> >>> >> >
>> >>> >> >> Hi,
>> >>> >> >>
>> >>> >> >> Thanks for the effort, answers below:
>> >>> >> >>
>> >>> >> >> On Mon, Mar 7, 2011 at 6:08 PM, Stack <[email protected]> wrote:
>> >>> >> >>
>> >>> >> >>> On Mon, Mar 7, 2011 at 5:43 AM, M.Deniz OKTAR <[email protected]>
>> >>> >> >>> wrote:
>> >>> >> >>> > We have a 5 node cluster, 4 of them being region servers. I am
>> >>> >> >>> > running a custom workload with YCSB and when the data is loading
>> >>> >> >>> > (heavy insert) at least one of the region servers dies after
>> >>> >> >>> > about 600000 operations.
>> >>> >> >>>
>> >>> >> >>> Tell us the character of your 'custom workload' please.
>> >> >>> >> >>> >> >> >>> >> >>> >> >> >>> >> >> The workload is below, the part that fails is the loading part >> >> >>> (-load) >> >> >>> >> >> which inserts all the records first) >> >> >>> >> >> >> >> >>> >> >> recordcount=10000000 >> >> >>> >> >> operationcount=3000000 >> >> >>> >> >> workload=com.yahoo.ycsb.workloads.CoreWorkload >> >> >>> >> >> >> >> >>> >> >> readallfields=true >> >> >>> >> >> >> >> >>> >> >> readproportion=0.5 >> >> >>> >> >> updateproportion=0.1 >> >> >>> >> >> scanproportion=0 >> >> >>> >> >> insertproportion=0.35 >> >> >>> >> >> readmodifywriteproportion=0.05 >> >> >>> >> >> >> >> >>> >> >> requestdistribution=zipfian >> >> >>> >> >> >> >> >>> >> >> >> >> >>> >> >> >> >> >>> >> >> >> >> >>> >> >>> >> >> >>> >> >>> > There are no abnormalities in the logs as far as I can see, >> >> the >> >> >>> only >> >> >>> >> >>> common >> >> >>> >> >>> > point is that all of them(in different trials, different >> >> region >> >> >>> >> servers >> >> >>> >> >>> > fail) request for a flush as the last logs, given below. >> .out >> >> >>> files >> >> >>> >> are >> >> >>> >> >>> > empty. I am looking at the /var/log/hbase folder for logs. >> >> >>> Running >> >> >>> >> sun >> >> >>> >> >>> java >> >> >>> >> >>> > 6 latest version. I couldn't find any logs that indicates a >> >> >>> problem >> >> >>> >> with >> >> >>> >> >>> > java. Tried the tests with openjdk and had the same >> results. >> >> >>> >> >>> > >> >> >>> >> >>> >> >> >>> >> >>> Its strange that flush is the last thing in your log. The >> >> process >> >> >>> is >> >> >>> >> >>> dead? We are exiting w/o a note in logs? Thats unusual. We >> >> >>> usually >> >> >>> >> >>> scream loudly when dying. >> >> >>> >> >>> >> >> >>> >> >> >> >> >>> >> >> Yes, thats the strange part. The last line is a flush as if >> the >> >> >>> process >> >> >>> >> >> never failed. Yes, the process is dead and hbase cannot see >> the >> >> >>> node. 
>> >> >>> >> >> >> >> >>> >> >> >> >> >>> >> >>> >> >> >>> >> >>> > I have set ulimits(50000) and xceivers(20000) for multiple >> >> users >> >> >>> and >> >> >>> >> >>> certain >> >> >>> >> >>> > that they are correct. >> >> >>> >> >>> >> >> >>> >> >>> The first line in an hbase log prints out the ulimit it sees. >> >> You >> >> >>> >> >>> might check that the hbase process for sure is picking up >> your >> >> >>> ulimit >> >> >>> >> >>> setting. >> >> >>> >> >>> >> >> >>> >> >>> That was a mistake I did a couple of days ago, checked it >> with >> >> cat >> >> >>> >> >> /proc/<pid of reginserver>/limits and all related users like >> >> >>> 'hbase' >> >> >>> >> has >> >> >>> >> >> those limits. Checked the logs: >> >> >>> >> >> >> >> >>> >> >> Mon Mar 7 06:41:15 EET 2011 Starting regionserver on test-1 >> >> >>> >> >> ulimit -n 52768 >> >> >>> >> >> >> >> >>> >> >>> >> >> >>> >> >>> > Also in the kernel logs, there are no apparent problems. >> >> >>> >> >>> > >> >> >>> >> >>> >> >> >>> >> >>> (The mystery compounds) >> >> >>> >> >>> >> >> >>> >> >>> > 2011-03-07 15:07:58,301 DEBUG >> >> >>> >> >>> > org.apache.hadoop.hbase.regionserver.CompactSplitThread: >> >> >>> Compaction >> >> >>> >> >>> > requested for >> >> >>> >> >>> > >> >> >>> >> >> >> >>> >> >> usertable,user1030079237,1299502934627.257739740f58da96d5c5ef51a7d3efc3. 
>> >>> >> >>> > because regionserver60020.cacheFlusher; priority=3, compaction
>> >>> >> >>> > queue size=18
>> >>> >> >>> > 2011-03-07 15:07:58,301 DEBUG
>> >>> >> >>> > org.apache.hadoop.hbase.regionserver.HRegion:
>> >>> >> >>> > NOT flushing memstore for region
>> >>> >> >>> > usertable,user1601881548,1299502135191.f8efb9aa0922fa8a6a53fc49b8155ebc.,
>> >>> >> >>> > flushing=false, writesEnabled=false
>> >>> >> >>> > 2011-03-07 15:07:58,301 DEBUG
>> >>> >> >>> > org.apache.hadoop.hbase.regionserver.HRegion:
>> >>> >> >>> > Started memstore flush for
>> >>> >> >>> > usertable,user1662209069,1299502135191.9fa929e6fb439843cffb604dea3f88f6.,
>> >>> >> >>> > current region memstore size 68.6m
>> >>> >> >>> > 2011-03-07 15:07:58,310 DEBUG
>> >>> >> >>> > org.apache.hadoop.hbase.regionserver.HRegion:
>> >>> >> >>> > Flush requested on
>> >>> >> >>> > usertable,user1601881548,1299502135191.f8efb9aa0922fa8a6a53fc49b8155ebc.
>> >>> >> >>> > -end of log file-
>> >>> >> >>> > ---
>> >>> >> >>>
>> >>> >> >>> Nothing more?
>> >>> >> >>
>> >>> >> >> No, nothing after that. But quite a lot of logs before that; I can
>> >>> >> >> send them if you'd like.
>> >>> >> >>
>> >>> >> >>> Thanks,
>> >>> >> >>> St.Ack
>> >>> >> >>
>> >>> >> >> Thanks a lot!
>> >>
>> >> --
>> >> Sent from my mobile device
>> >>
>> >> Thanks & Best regards
>> >> jiajun
>> >
>> > --
>> > erdem agaoglu
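The ulimit check discussed in the thread (reading /proc/<pid>/limits for the running regionserver) can be wrapped in a helper so it is easy to repeat on every node. A sketch assuming Linux's /proc layout; `max_open_files` is a hypothetical helper, and the `pgrep` pattern is an assumption about how the process name appears:

```shell
# max_open_files: print the "Max open files" soft limit that a running
# process actually has, straight from /proc/<pid>/limits.
max_open_files() {
  awk '/^Max open files/ {print $4}' "/proc/$1/limits"
}

# e.g. on a regionserver host:
#   max_open_files "$(pgrep -f HRegionServer)"
# and compare against the "ulimit -n" line at the top of the hbase log.
```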
