iletken-test-2 died.

J-D

2011/3/10 M.Deniz OKTAR <[email protected]>:
> Hi,
> Still working on the issue. This is one of the last trials I am doing before
> ordering a new cluster.
> I was going through the Yahoo benchmark again and HBase became non-responsive
> for a long time (about 100 secs); the benchmark reported 0 throughput for
> that period and eventually failed.
> I didn't get any exceptions except this one, on the master node. iletken-test-0
> is also the master node, so it was trying to recover a file from the same
> node.
> Any suggestions?
> Thanks.
> ---
>
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting hlog 3 of 14: hdfs://iletken-test-0:30001/hbase3/.logs/iletken-test-2,60020,1299794182845/iletken-test-2%3A60020.1299794988383, length=86340652
>
> 2011-03-11 00:15:54,825 INFO org.apache.hadoop.hbase.util.FSUtils: Recovering file hdfs://iletken-test-0:30001/hbase3/.logs/iletken-test-2,60020,1299794182845/iletken-test-2%3A60020.1299794988383
>
> 2011-03-11 00:15:56,675 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server listener on 60000: readAndProcess threw exception java.io.IOException: Connection reset by peer. Count of bytes read: 0
>
> java.io.IOException: Connection reset by peer
>
> at sun.nio.ch.FileDispatcher.read0(Native Method)
>
> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
>
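> In case it helps, this is roughly how I check the blocks behind that HLog and
> the datanode status while the recovery hangs (a sketch; fsck and dfsadmin are
> stock Hadoop commands, the /hbase3 path is copied from the log above):
>
> hadoop fsck /hbase3/.logs -files -blocks -locations
> hadoop dfsadmin -report
>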
> --
> deniz
>
> On Thu, Mar 10, 2011 at 12:51 AM, Jean-Daniel Cryans <[email protected]>
> wrote:
>>
>> This is a JVM error, and there seem to be a lot of them in the recent
>> versions. I personally recommend using u16 or u17.
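>>
>> If you switch, a minimal sketch of pinning HBase to that JDK is just to point
>> hbase-env.sh at it (the install path below is an assumption, use wherever
>> u16/u17 lives on your boxes):
>>
>> # conf/hbase-env.sh
>> export JAVA_HOME=/usr/java/jdk1.6.0_17   # hypothetical path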
>>
>> J-D
>>
>> On Wed, Mar 9, 2011 at 1:01 AM, Erdem Agaoglu <[email protected]>
>> wrote:
>> > I don't know if it's related, but I saw a dead regionserver a little
>> > while ago too. In our case the .out file showed a JVM crash along with a
>> > hs_err dump in the HBase home (attached below). We were running 0.90.0 at
>> > the time and upgraded to 0.90.1 in hopes of a fix, but nothing changed.
>> >
>> > The crash happened when we ran a RowCounter job with 12 simultaneous
>> > tasks on 11 machines, 132 simultaneous tasks total. The table has ~100k
>> > rows with ~700kB per row. We were trying different max region size
>> > values, and that instance had 100M. We have ~1000 regions total. Our
>> > machines have 24G RAM and Ganglia shows no resource starvation or
>> > swapping.
>> >
>> > This happened about a week ago, but I wasn't able to test further so I
>> > delayed asking here. If it has any relation to the problem Deniz is
>> > having, this log might be useful.
>> >
>> > #
>> > # A fatal error has been detected by the Java Runtime Environment:
>> > #
>> > #  SIGSEGV (0xb) at pc=0x00007fd02e23825b, pid=30204,
>> > tid=140531828942608
>> > #
>> > # JRE version: 6.0_23-b05
>> > # Java VM: Java HotSpot(TM) 64-Bit Server VM (19.0-b09 mixed mode
>> > linux-amd64 compressed oops)
>> > # Problematic frame:
>> > # V  [libjvm.so+0x30325b]
>> > #
>> > # If you would like to submit a bug report, please visit:
>> > #   http://java.sun.com/webapps/bugreport/crash.jsp
>> > #
>> >
>> > ---------------  T H R E A D  ---------------
>> >
>> > Current thread (0x000000004013d800):  ConcurrentGCThread [stack:
>> > 0x00007fd01dae6000,0x00007fd01dbe7000] [id=30221]
>> >
>> > siginfo:si_signo=SIGSEGV: si_errno=0, si_code=1 (SEGV_MAPERR),
>> > si_addr=0x0000000000000018
>> >
>> > Registers:
>> > RAX=0x000000004013cbd8, RBX=0x00007fd02e8c6960, RCX=0x0000000000000003,
>> > RDX=0x0000000000000000
>> > RSP=0x00007fd01dbe58c0, RBP=0x00007fd01dbe58e0, RSI=0x00007fd02e8aa9b0,
>> > RDI=0x0000000000000010
>> > R8 =0x00000000175f6400, R9 =0x000000000000000c, R10=0x00007fd02e8aa754,
>> > R11=0x00000000000209bc
>> > R12=0x00007fd01dbe5a00, R13=0x00000006c3272000, R14=0x000000004013c9c0,
>> > R15=0x00007fd01dbe5ab0
>> > RIP=0x00007fd02e23825b, EFL=0x0000000000010246,
>> > CSGSFS=0x0000000000000033,
>> > ERR=0x0000000000000004
>> >  TRAPNO=0x000000000000000e
>> >
>> > Register to memory mapping:
>> >
>> > RAX=0x000000004013cbd8
>> > 0x000000004013cbd8 is pointing to unknown location
>> >
>> > RBX=0x00007fd02e8c6960
>> > 0x00007fd02e8c6960: <offset 0x991960> in
>> > /usr/lib64/jvm/java-1.6.0-sun-1.6.0/jre/lib/amd64/server/libjvm.so at
>> > 0x00007fd02df35000
>> >
>> > RCX=0x0000000000000003
>> > 0x0000000000000003 is pointing to unknown location
>> >
>> > RDX=0x0000000000000000
>> > 0x0000000000000000 is pointing to unknown location
>> >
>> > RSP=0x00007fd01dbe58c0
>> > 0x00007fd01dbe58c0 is pointing to unknown location
>> >
>> > RBP=0x00007fd01dbe58e0
>> > 0x00007fd01dbe58e0 is pointing to unknown location
>> >
>> > RSI=0x00007fd02e8aa9b0
>> > 0x00007fd02e8aa9b0: <offset 0x9759b0> in
>> > /usr/lib64/jvm/java-1.6.0-sun-1.6.0/jre/lib/amd64/server/libjvm.so at
>> > 0x00007fd02df35000
>> >
>> > RDI=0x0000000000000010
>> > 0x0000000000000010 is pointing to unknown location
>> >
>> > R8 =0x00000000175f6400
>> > 0x00000000175f6400 is pointing to unknown location
>> >
>> > R9 =0x000000000000000c
>> > 0x000000000000000c is pointing to unknown location
>> >
>> > R10=0x00007fd02e8aa754
>> > 0x00007fd02e8aa754: <offset 0x975754> in
>> > /usr/lib64/jvm/java-1.6.0-sun-1.6.0/jre/lib/amd64/server/libjvm.so at
>> > 0x00007fd02df35000
>> >
>> > R11=0x00000000000209bc
>> > 0x00000000000209bc is pointing to unknown location
>> >
>> > R12=0x00007fd01dbe5a00
>> > 0x00007fd01dbe5a00 is pointing to unknown location
>> >
>> > R13=0x00000006c3272000
>> >
>> >
>> > On Tue, Mar 8, 2011 at 6:21 PM, 陈加俊 <[email protected]> wrote:
>> >
>> >> Had the HTable already been disabled when you hit ctrl-c?
>> >>
>> >> 2011/3/8, M.Deniz OKTAR <[email protected]>:
>> >> > Something new came up!
>> >> >
>> >> > I tried to truncate the 'usertable' which had ~12M entries.
>> >> >
>> >> > The shell stayed at "disabling table" for a long time. The process was
>> >> > there but there were no requests, so I quit by hitting ctrl-c.
>> >> >
>> >> > Then I tried count 'usertable' to see if any data remained; the shell
>> >> > gave an error and one of the regionservers had a log like the one below.
>> >> >
>> >> > The master logs were also similar (I tried to disable again, and the
>> >> > master log is from that trial).
>> >> >
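>> >> > For reference, the shell sequence was roughly this (a sketch; truncate
>> >> > does the disable/drop/recreate itself, 'usertable' is the YCSB table):
>> >> >
>> >> > truncate 'usertable'   # hung at "disabling table", killed with ctrl-c
>> >> > count 'usertable'      # then failed with the errors below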
>> >> >
>> >> > Regionserver 2:
>> >> >
>> >> > 2011-03-08 16:47:24,852 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; Region is not online: usertable,,1299593459085.d37bb124feaf8f5d08e51064a36596f8.
>> >> > 2011-03-08 16:47:27,765 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=39.63 MB, free=4.65 GB, max=4.68 GB, blocks=35, accesses=376070, hits=12035, hitRatio=3.20%%, cachingAccesses=12070, cachingHits=12035, cachingHitsRatio=99.71%%, evictions=0, evicted=0, evictedPerRun=NaN
>> >> > 2011-03-08 16:47:28,863 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; Region is not online: usertable,,1299593459085.d37bb124feaf8f5d08e51064a36596f8.
>> >> > 2011-03-08 16:47:28,865 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: org.apache.hadoop.hbase.UnknownScannerException: Name: -1
>> >> >         at org.apache.hadoop.hbase.regionserver.HRegionServer.next(HRegionServer.java:1795)
>> >> >         at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown Source)
>> >> >         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>> >> >         at java.lang.reflect.Method.invoke(Method.java:597)
>> >> >         at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
>> >> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039)
>> >> >
>> >> >
>> >> >
>> >> > Masterserver:
>> >> > .
>> >> > .
>> >> > . (same thing)
>> >> > 2011-03-08 16:51:34,679 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=usertable,user1948102037,1299592536693.d5bae6bbe54aa182e1215ab626e0011e.
>> >> >
>> >> >
>> >> > --
>> >> > deniz
>> >> >
>> >> >
>> >> > On Tue, Mar 8, 2011 at 4:34 PM, M.Deniz OKTAR <[email protected]>
>> >> wrote:
>> >> >
>> >> >> Hi all,
>> >> >>
>> >> >> Thanks for the support. I've been trying to replicate the problem
>> >> >> since this morning. Before doing that, I played with the configuration.
>> >> >> I used to have only one user and set all the permissions according to
>> >> >> that. Now I've followed the Cloudera manuals and set permissions for
>> >> >> the hdfs and mapred users (changed hbase-env.sh).
>> >> >>
>> >> >> I had 2 trials; in both, the Yahoo test failed because it received
>> >> >> lots of "0"s, but the region servers didn't die. At some points in the
>> >> >> test (also when it failed), the HBase master gave exceptions about not
>> >> >> being able to reach one of the servers. I also lost the ssh connection
>> >> >> to that server, but after a while it recovered (the HMaster did too).
>> >> >> The last thing in the regionserver logs was that it was going for a
>> >> >> flush.
>> >> >>
>> >> >> I'll be going over the tests again and provide you with clean log
>> >> >> files
>> >> >> from all servers. (hadoop, hbase, namenode, masternode logs)
>> >> >>
>> >> >> If you have any suggestions or directions for me to better diagnose
>> >> >> the
>> >> >> problem, that would be lovely.
>> >> >>
>> >> >> btw: these servers do not have ECC memory, but I do not see any
>> >> >> corruption in the data.
>> >> >>
>> >> >> Thanks!
>> >> >>
>> >> >> --
>> >> >> deniz
>> >> >>
>> >> >>
>> >> >> On Mon, Mar 7, 2011 at 7:47 PM, Jean-Daniel Cryans
>> >> >> <[email protected]>wrote:
>> >> >>
>> >> >>> Along with a bigger portion of the log, it might be good to check
>> >> >>> if there's anything in the .out file that looks like a JVM error.
>> >> >>>
>> >> >>> J-D
>> >> >>>
>> >> >>> On Mon, Mar 7, 2011 at 9:22 AM, M.Deniz OKTAR
>> >> >>> <[email protected]>
>> >> >>> wrote:
>> >> >>> > I ran every kind of benchmark I could find on those machines and
>> >> >>> > they seemed to work fine. Did memory/disk tests too.
>> >> >>> >
>> >> >>> > The master node and the other nodes provide some information and
>> >> >>> > exceptions saying that they can't reach the dead node.
>> >> >>> >
>> >> >>> > Btw, sometimes the process does not die but loses its connection.
>> >> >>> >
>> >> >>> > --
>> >> >>> >
>> >> >>> > deniz
>> >> >>> >
>> >> >>> > On Mon, Mar 7, 2011 at 7:19 PM, Stack <[email protected]> wrote:
>> >> >>> >
>> >> >>> >> I'm stumped.  I have nothing to go on when no death throes or
>> >> >>> >> complaints.  This hardware for sure is healthy?  Other stuff
>> >> >>> >> runs
>> >> w/o
>> >> >>> >> issue?
>> >> >>> >> St.Ack
>> >> >>> >>
>> >> >>> >> On Mon, Mar 7, 2011 at 8:48 AM, M.Deniz OKTAR <
>> >> [email protected]>
>> >> >>> >> wrote:
>> >> >>> >> > I don't know if it's normal, but I see a lot of '0's in the test
>> >> >>> >> > results when it tends to fail, such as:
>> >> >>> >> >
>> >> >>> >> >  1196 sec: 7394901 operations; 0 current ops/sec;
>> >> >>> >> >
>> >> >>> >> > --
>> >> >>> >> > deniz
>> >> >>> >> >
>> >> >>> >> > On Mon, Mar 7, 2011 at 6:46 PM, M.Deniz OKTAR <
>> >> [email protected]
>> >> >>> >
>> >> >>> >> wrote:
>> >> >>> >> >
>> >> >>> >> >> Hi,
>> >> >>> >> >>
>> >> >>> >> >> Thanks for the effort, answers below:
>> >> >>> >> >>
>> >> >>> >> >>
>> >> >>> >> >>
>> >> >>> >> >>
>> >> >>> >> >> On Mon, Mar 7, 2011 at 6:08 PM, Stack <[email protected]>
>> >> >>> >> >> wrote:
>> >> >>> >> >>
>> >> >>> >> >>> On Mon, Mar 7, 2011 at 5:43 AM, M.Deniz OKTAR <
>> >> >>> [email protected]>
>> >> >>> >> >>> wrote:
>> >> >>> >> >>> > We have a 5 node cluster, 4 of them being region servers. I am
>> >> >>> >> >>> > running a custom workload with YCSB, and when the data is
>> >> >>> >> >>> > loading (heavy insert) at least one of the region servers dies
>> >> >>> >> >>> > after about 600000 operations.
>> >> >>> >> >>>
>> >> >>> >> >>>
>> >> >>> >> >>> Tell us the character of your 'custom workload' please.
>> >> >>> >> >>>
>> >> >>> >> >>>
>> >> >>> >> >> The workload is below; the part that fails is the loading part
>> >> >>> >> >> (-load), which inserts all the records first:
>> >> >>> >> >>
>> >> >>> >> >> recordcount=10000000
>> >> >>> >> >> operationcount=3000000
>> >> >>> >> >> workload=com.yahoo.ycsb.workloads.CoreWorkload
>> >> >>> >> >>
>> >> >>> >> >> readallfields=true
>> >> >>> >> >>
>> >> >>> >> >> readproportion=0.5
>> >> >>> >> >> updateproportion=0.1
>> >> >>> >> >> scanproportion=0
>> >> >>> >> >> insertproportion=0.35
>> >> >>> >> >> readmodifywriteproportion=0.05
>> >> >>> >> >>
>> >> >>> >> >> requestdistribution=zipfian
>> >> >>> >> >>
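>> >> >>> >> >> The load phase that fails is started roughly like this (a sketch
>> >> >>> >> >> from memory; the classpath, workload file name and column family
>> >> >>> >> >> are specific to our setup):
>> >> >>> >> >>
>> >> >>> >> >> java -cp build/ycsb.jar:db/hbase/lib/* com.yahoo.ycsb.Client -load \
>> >> >>> >> >>   -db com.yahoo.ycsb.db.HBaseClient -P workloads/myworkload \
>> >> >>> >> >>   -p columnfamily=family -threads 10 -s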
>> >> >>> >> >>
>> >> >>> >> >>
>> >> >>> >> >>
>> >> >>> >> >>>
>> >> >>> >> >>> > There are no abnormalities in the logs as far as I can see; the
>> >> >>> >> >>> > only common point is that all of them (in different trials,
>> >> >>> >> >>> > different region servers fail) request a flush as the last log
>> >> >>> >> >>> > entries, given below. The .out files are empty. I am looking at
>> >> >>> >> >>> > the /var/log/hbase folder for logs. Running the latest version
>> >> >>> >> >>> > of Sun Java 6. I couldn't find any logs that indicate a problem
>> >> >>> >> >>> > with Java. Tried the tests with OpenJDK and had the same results.
>> >> >>> >> >>> >
>> >> >>> >> >>>
>> >> >>> >> >>> It's strange that a flush is the last thing in your log.  The
>> >> >>> >> >>> process is dead?  We are exiting w/o a note in the logs?  That's
>> >> >>> >> >>> unusual.  We usually scream loudly when dying.
>> >> >>> >> >>>
>> >> >>> >> >>
>> >> >>> >> >> Yes, that's the strange part. The last line is a flush, as if the
>> >> >>> >> >> process never failed. Yes, the process is dead and HBase cannot
>> >> >>> >> >> see the node.
>> >> >>> >> >>
>> >> >>> >> >>
>> >> >>> >> >>>
>> >> >>> >> >>> > I have set ulimits (50000) and xceivers (20000) for multiple
>> >> >>> >> >>> > users and am certain that they are correct.
>> >> >>> >> >>>
>> >> >>> >> >>> The first line in an hbase log prints out the ulimit it sees.
>> >> >>> >> >>> You might check that the hbase process for sure is picking up
>> >> >>> >> >>> your ulimit setting.
>> >> >>> >> >>>
>> >> >>> >> >> That was a mistake I made a couple of days ago. I checked it with
>> >> >>> >> >> cat /proc/<pid of regionserver>/limits and all related users like
>> >> >>> >> >> 'hbase' have those limits. Checked the logs:
>> >> >>> >> >>
>> >> >>> >> >> Mon Mar  7 06:41:15 EET 2011 Starting regionserver on test-1
>> >> >>> >> >> ulimit -n 52768
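>> >> >>> >> >>
>> >> >>> >> >> To double-check a live regionserver I do roughly this (a sketch;
>> >> >>> >> >> assumes jps from the Sun JDK is on the PATH):
>> >> >>> >> >>
>> >> >>> >> >> RS_PID=$(jps | grep HRegionServer | awk '{print $1}')
>> >> >>> >> >> grep 'open files' /proc/$RS_PID/limits
>> >> >>> >> >> # xceivers are a datanode-side setting, in hdfs-site.xml:
>> >> >>> >> >> #   dfs.datanode.max.xcievers = 20000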
>> >> >>> >> >>
>> >> >>> >> >>>
>> >> >>> >> >>> > Also in the kernel logs, there are no apparent problems.
>> >> >>> >> >>> >
>> >> >>> >> >>>
>> >> >>> >> >>> (The mystery compounds)
>> >> >>> >> >>>
>> >> >>> >> >>> > 2011-03-07 15:07:58,301 DEBUG org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction requested for usertable,user1030079237,1299502934627.257739740f58da96d5c5ef51a7d3efc3. because regionserver60020.cacheFlusher; priority=3, compaction queue size=18
>> >> >>> >> >>> > 2011-03-07 15:07:58,301 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: NOT flushing memstore for region usertable,user1601881548,1299502135191.f8efb9aa0922fa8a6a53fc49b8155ebc., flushing=false, writesEnabled=false
>> >> >>> >> >>> > 2011-03-07 15:07:58,301 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Started memstore flush for usertable,user1662209069,1299502135191.9fa929e6fb439843cffb604dea3f88f6., current region memstore size 68.6m
>> >> >>> >> >>> > 2011-03-07 15:07:58,310 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Flush requested on usertable,user1601881548,1299502135191.f8efb9aa0922fa8a6a53fc49b8155ebc.
>> >> >>> >> >>> > -end of log file-
>> >> >>> >> >>> > ---
>> >> >>> >> >>> >
>> >> >>> >> >>>
>> >> >>> >> >>> Nothing more?
>> >> >>> >> >>>
>> >> >>> >> >>>
>> >> >>> >> >> No, nothing after that. But there are quite a lot of logs before
>> >> >>> >> >> that; I can send them if you'd like.
>> >> >>> >> >>
>> >> >>> >> >>
>> >> >>> >> >>
>> >> >>> >> >>> Thanks,
>> >> >>> >> >>> St.Ack
>> >> >>> >> >>>
>> >> >>> >> >>
>> >> >>> >> >> Thanks a lot!
>> >> >>> >> >>
>> >> >>> >> >>
>> >> >>> >> >
>> >> >>> >>
>> >> >>> >
>> >> >>>
>> >> >>
>> >> >>
>> >> >
>> >>
>> >> --
>> >> Sent from my mobile device
>> >>
>> >> Thanks & Best regards
>> >> jiajun
>> >>
>> >
>> >
>> >
>> > --
>> > erdem agaoglu
>> >
>
>
