Re: region servers dying - flush request - YCSB

M.Deniz OKTAR Thu, 10 Mar 2011 14:34:19 -0800

Hi,

Still working on the issue. This is one of the last trials I am doing before
ordering a new cluster.


I was going through yahoo benchmark again and hbase became non responsive
for a long time, (about 100 secs) benchmark results were 0 throughput for
that time and eventually, benchmark failed.

I didn't get any exceptions but this one, on the Master node, iletken-test-0
is also the Master node, so it was trying to recover a file from the same
node.

Any suggestions?

Thanks.

---


   org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting hlog 3
of 14:
hdfs://iletken-test-0:30001/hbase3/.logs/iletken-test-2,60020,1299794182845/iletken-test-2%3A60020.1299794988383,
length=86340652

2011-03-11 00:15:54,825 INFO org.apache.hadoop.hbase.util.FSUtils:
Recovering file
hdfs://iletken-test-0:30001/hbase3/.logs/iletken-test-2,60020,1299794182845/iletken-test-2%3A60020.1299794988383

2011-03-11 00:15:56,675 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server
listener on 60000: readAndProcess threw exception java.io.IOException:
Connection reset by peer. Count of bytes read: 0

java.io.IOException: Connection reset by peer

at sun.nio.ch.FileDispatcher.read0(Native Method)

at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)



--
deniz

On Thu, Mar 10, 2011 at 12:51 AM, Jean-Daniel Cryans <[email protected]>wrote:

> This is a JVM error, and there seems to be a lot of them in the recent
> versions. I personally recommend using u16 or u17.
>
> J-D
>
> On Wed, Mar 9, 2011 at 1:01 AM, Erdem Agaoglu <[email protected]>
> wrote:
> > I don't know if it's related but i've seen a dead regionserver a little
> > while ago too. But in our case .out file showed some JVM crash along with
> a
> > hs_err dump in hbase home (attached below). We were running 0.90.0 at the
> > moment and we upgraded to 0.90.1 in hopes of a fix but nothing changed.
> >
> > The crash happened when we ran RowCounter job, with 12 simultaneous tasks
> on
> > 11 machines, 132 simultaneous tasks total.  Table has ~100k rows with
> ~700kB
> > per row. We were trying different max_region_size values, and that
> instance
> > had 100M. We have ~1000 regions total. Our machines have 24G ram and
> ganglia
> > shows no resource starvation nor swapping.
> >
> > These happened about a week ago, but i wasn't able to test further so i
> > delayed asking here, but if it has any relation to problem Deniz's
> having,
> > this log might be useful.
> >
> > #
> > # A fatal error has been detected by the Java Runtime Environment:
> > #
> > #  SIGSEGV (0xb) at pc=0x00007fd02e23825b, pid=30204, tid=140531828942608
> > #
> > # JRE version: 6.0_23-b05
> > # Java VM: Java HotSpot(TM) 64-Bit Server VM (19.0-b09 mixed mode
> > linux-amd64 compressed oops)
> > # Problematic frame:
> > # V  [libjvm.so+0x30325b]
> > #
> > # If you would like to submit a bug report, please visit:
> > #   http://java.sun.com/webapps/bugreport/crash.jsp
> > #
> >
> > ---------------  T H R E A D  ---------------
> >
> > Current thread (0x000000004013d800):  ConcurrentGCThread [stack:
> > 0x00007fd01dae6000,0x00007fd01dbe7000] [id=30221]
> >
> > siginfo:si_signo=SIGSEGV: si_errno=0, si_code=1 (SEGV_MAPERR),
> > si_addr=0x0000000000000018
> >
> > Registers:
> > RAX=0x000000004013cbd8, RBX=0x00007fd02e8c6960, RCX=0x0000000000000003,
> > RDX=0x0000000000000000
> > RSP=0x00007fd01dbe58c0, RBP=0x00007fd01dbe58e0, RSI=0x00007fd02e8aa9b0,
> > RDI=0x0000000000000010
> > R8 =0x00000000175f6400, R9 =0x000000000000000c, R10=0x00007fd02e8aa754,
> > R11=0x00000000000209bc
> > R12=0x00007fd01dbe5a00, R13=0x00000006c3272000, R14=0x000000004013c9c0,
> > R15=0x00007fd01dbe5ab0
> > RIP=0x00007fd02e23825b, EFL=0x0000000000010246,
> CSGSFS=0x0000000000000033,
> > ERR=0x0000000000000004
> >  TRAPNO=0x000000000000000e
> >
> > Register to memory mapping:
> >
> > RAX=0x000000004013cbd8
> > 0x000000004013cbd8 is pointing to unknown location
> >
> > RBX=0x00007fd02e8c6960
> > 0x00007fd02e8c6960: <offset 0x991960> in
> > /usr/lib64/jvm/java-1.6.0-sun-1.6.0/jre/lib/amd64/server/libjvm.so at
> > 0x00007fd02df35000
> >
> > RCX=0x0000000000000003
> > 0x0000000000000003 is pointing to unknown location
> >
> > RDX=0x0000000000000000
> > 0x0000000000000000 is pointing to unknown location
> >
> > RSP=0x00007fd01dbe58c0
> > 0x00007fd01dbe58c0 is pointing to unknown location
> >
> > RBP=0x00007fd01dbe58e0
> > 0x00007fd01dbe58e0 is pointing to unknown location
> >
> > RSI=0x00007fd02e8aa9b0
> > 0x00007fd02e8aa9b0: <offset 0x9759b0> in
> > /usr/lib64/jvm/java-1.6.0-sun-1.6.0/jre/lib/amd64/server/libjvm.so at
> > 0x00007fd02df35000
> >
> > RDI=0x0000000000000010
> > 0x0000000000000010 is pointing to unknown location
> >
> > R8 =0x00000000175f6400
> > 0x00000000175f6400 is pointing to unknown location
> >
> > R9 =0x000000000000000c
> > 0x000000000000000c is pointing to unknown location
> >
> > R10=0x00007fd02e8aa754
> > 0x00007fd02e8aa754: <offset 0x975754> in
> > /usr/lib64/jvm/java-1.6.0-sun-1.6.0/jre/lib/amd64/server/libjvm.so at
> > 0x00007fd02df35000
> >
> > R11=0x00000000000209bc
> > 0x00000000000209bc is pointing to unknown location
> >
> > R12=0x00007fd01dbe5a00
> > 0x00007fd01dbe5a00 is pointing to unknown location
> >
> > R13=0x00000006c3272000
> >
> >
> > On Tue, Mar 8, 2011 at 6:21 PM, 陈加俊 <[email protected]> wrote:
> >
> >> Htable had disabled when ctr+c ?
> >>
> >> 2011/3/8, M.Deniz OKTAR <[email protected]>:
> >> > Something new came up!
> >> >
> >> > I tried to truncate the 'usertable' which had ~12M entries.
> >> >
> >> > Shell stayed at "disabling table" for a long time. The processes was
> >> there
> >> > but there were no requests. So I quit the state by ctrl-c.
> >> >
> >> > Then tried count 'usertable' to see if data remains, shell gave an
> error
> >> and
> >> > one of the regionservers had a log such as below,
> >> >
> >> > The master logs were also similar (tried to disable again, and the
> master
> >> > log is from that trial)
> >> >
> >> >
> >> > Regionserver 2:
> >> >
> >> > 2011-03-08 16:47:24,852 DEBUG
> >> > org.apache.hadoop.hbase.regionserver.HRegionServer:
> >> > NotServingRegionException; Region is not online:
> >> > usertable,,1299593459085.d37bb124feaf8f5d08e51064a36596f8.
> >> > 2011-03-08 16:47:27,765 DEBUG
> >> > org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=39.63
> >> MB,
> >> > free=4.65 GB, max=4.68 GB, blocks=35, accesses=376070, hits=12035,
> >> > hitRatio=3.20%%, cachingAccesses=12070, cachingHits=12035,
> >> > cachingHitsRatio=99.71%%, evictions=0, evicted=0, evictedPerRun=NaN
> >> > 2011-03-08 16:47:28,863 DEBUG
> >> > org.apache.hadoop.hbase.regionserver.HRegionServer:
> >> > NotServingRegionException; Region is not online:
> >> > usertable,,1299593459085.d37bb124feaf8f5d08e51064a36596f8.
> >> > 2011-03-08 16:47:28,865 ERROR
> >> > org.apache.hadoop.hbase.regionserver.HRegionServer:
> >> > org.apache.hadoop.hbase.UnknownScannerException: Name: -1
> >> >         at
> >> >
> >>
> org.apache.hadoop.hbase.regionserver.HRegionServer.next(HRegionServer.java:1795)
> >> >         at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown
> Source)
> >> >         at
> >> >
> >>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >> >         at java.lang.reflect.Method.invoke(Method.java:597)
> >> >         at
> >> > org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
> >> >         at
> >> >
> >>
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039)
> >> >
> >> >
> >> >
> >> > Masterserver:
> >> > .
> >> > .
> >> > . (same thing)
> >> > 2011-03-08 16:51:34,679 INFO
> >> > org.apache.hadoop.hbase.master.AssignmentManager: Region has been
> >> > PENDING_CLOSE for too long, running forced unassign again on
> >> >
> >>
> region=usertable,user1948102037,1299592536693.d5bae6bbe54aa182e1215ab626e0011e.
> >> >
> >> >
> >> > --
> >> > deniz
> >> >
> >> >
> >> > On Tue, Mar 8, 2011 at 4:34 PM, M.Deniz OKTAR <[email protected]>
> >> wrote:
> >> >
> >> >> Hi all,
> >> >>
> >> >> Thanks for the support. I'v been trying to replicate the problem
> since
> >> >> this
> >> >> morning. Before doing that, played with the configuration. I used to
> >> have
> >> >> only one user and set all the permissions according to that. Now I'v
> >> >> followed the cloudera manuals and set permissions for hdfs and mapred
> >> >> users.
> >> >> (changed the hbase-env.sh)
> >> >>
> >> >> I had 2 trials, on both the yahoo test failed because of receiving
> lost
> >> of
> >> >> "0"s but the region servers didn't die. At some points in the test,
> >> (also
> >> >> when failed) , hbase master gave exceptions about not being able to
> >> reach
> >> >> one of the servers. I also lost the ssh connection to that server,
> but
> >> >> after
> >> >> a while it recovered. (also hmaster) The last thing in the
> regionserver
> >> >> logs
> >> >> was that it was going for a flush.
> >> >>
> >> >> I'll be going over the tests again and provide you with clean log
> files
> >> >> from all servers. (hadoop, hbase, namenode, masternode logs)
> >> >>
> >> >> If you have any suggestions or directions for me to better diagnose
> the
> >> >> problem, that would be lovely.
> >> >>
> >> >> btw: these servers do not have ECC memory but I do not see any
> >> corruption
> >> >> in data.
> >> >>
> >> >> Thanks!
> >> >>
> >> >> --
> >> >> deniz
> >> >>
> >> >>
> >> >> On Mon, Mar 7, 2011 at 7:47 PM, Jean-Daniel Cryans
> >> >> <[email protected]>wrote:
> >> >>
> >> >>> Along with a bigger portion of the log, it be might good to check if
> >> >>> there's anything in the .out file that looks like a jvm error.
> >> >>>
> >> >>> J-D
> >> >>>
> >> >>> On Mon, Mar 7, 2011 at 9:22 AM, M.Deniz OKTAR <
> [email protected]>
> >> >>> wrote:
> >> >>> > I run every kind of benchmark I could find on those machines and
> they
> >> >>> seemed
> >> >>> > to work fine. Did memory/disk tests too.
> >> >>> >
> >> >>> > The master node or other nodes provide some information and
> >> exceptions
> >> >>> about
> >> >>> > that they can't reach to the dead node.
> >> >>> >
> >> >>> > Btw sometimes the process does not die but looses the connection.
> >> >>> >
> >> >>> > --
> >> >>> >
> >> >>> > deniz
> >> >>> >
> >> >>> > On Mon, Mar 7, 2011 at 7:19 PM, Stack <[email protected]> wrote:
> >> >>> >
> >> >>> >> I'm stumped.  I have nothing to go on when no death throes or
> >> >>> >> complaints.  This hardware for sure is healthy?  Other stuff runs
> >> w/o
> >> >>> >> issue?
> >> >>> >> St.Ack
> >> >>> >>
> >> >>> >> On Mon, Mar 7, 2011 at 8:48 AM, M.Deniz OKTAR <
> >> [email protected]>
> >> >>> >> wrote:
> >> >>> >> > I don't know if its normal but I see alot of '0's in the test
> >> >>> >> > results
> >> >>> >> when
> >> >>> >> > it tends to fail, such as:
> >> >>> >> >
> >> >>> >> >  1196 sec: 7394901 operations; 0 current ops/sec;
> >> >>> >> >
> >> >>> >> > --
> >> >>> >> > deniz
> >> >>> >> >
> >> >>> >> > On Mon, Mar 7, 2011 at 6:46 PM, M.Deniz OKTAR <
> >> [email protected]
> >> >>> >
> >> >>> >> wrote:
> >> >>> >> >
> >> >>> >> >> Hi,
> >> >>> >> >>
> >> >>> >> >> Thanks for the effort, answers below:
> >> >>> >> >>
> >> >>> >> >>
> >> >>> >> >>
> >> >>> >> >>
> >> >>> >> >> On Mon, Mar 7, 2011 at 6:08 PM, Stack <[email protected]>
> wrote:
> >> >>> >> >>
> >> >>> >> >>> On Mon, Mar 7, 2011 at 5:43 AM, M.Deniz OKTAR <
> >> >>> [email protected]>
> >> >>> >> >>> wrote:
> >> >>> >> >>> > We have a 5 node cluster, 4 of them being region servers. I
> am
> >> >>> >> running a
> >> >>> >> >>> > custom workload with YCSB and when the data is loading
> (heavy
> >> >>> insert)
> >> >>> >> at
> >> >>> >> >>> > least one of the region servers are dying after about
> 600000
> >> >>> >> operations.
> >> >>> >> >>>
> >> >>> >> >>>
> >> >>> >> >>> Tell us the character of your 'custom workload' please.
> >> >>> >> >>>
> >> >>> >> >>>
> >> >>> >> >> The workload is below, the part that fails is the loading part
> >> >>> (-load)
> >> >>> >> >> which inserts all the records first)
> >> >>> >> >>
> >> >>> >> >> recordcount=10000000
> >> >>> >> >> operationcount=3000000
> >> >>> >> >> workload=com.yahoo.ycsb.workloads.CoreWorkload
> >> >>> >> >>
> >> >>> >> >> readallfields=true
> >> >>> >> >>
> >> >>> >> >> readproportion=0.5
> >> >>> >> >> updateproportion=0.1
> >> >>> >> >> scanproportion=0
> >> >>> >> >> insertproportion=0.35
> >> >>> >> >> readmodifywriteproportion=0.05
> >> >>> >> >>
> >> >>> >> >> requestdistribution=zipfian
> >> >>> >> >>
> >> >>> >> >>
> >> >>> >> >>
> >> >>> >> >>
> >> >>> >> >>>
> >> >>> >> >>> > There are no abnormalities in the logs as far as I can see,
> >> the
> >> >>> only
> >> >>> >> >>> common
> >> >>> >> >>> > point is that all of them(in different trials, different
> >> region
> >> >>> >> servers
> >> >>> >> >>> > fail) request for a flush as the last logs, given below.
> .out
> >> >>> files
> >> >>> >> are
> >> >>> >> >>> > empty. I am looking at the /var/log/hbase folder for logs.
> >> >>> Running
> >> >>> >> sun
> >> >>> >> >>> java
> >> >>> >> >>> > 6 latest version. I couldn't find any logs that indicates a
> >> >>> problem
> >> >>> >> with
> >> >>> >> >>> > java. Tried the tests with openjdk and had the same
> results.
> >> >>> >> >>> >
> >> >>> >> >>>
> >> >>> >> >>> Its strange that flush is the last thing in your log.  The
> >> process
> >> >>> is
> >> >>> >> >>> dead?  We are exiting w/o a note in logs?  Thats unusual.  We
> >> >>> usually
> >> >>> >> >>> scream loudly when dying.
> >> >>> >> >>>
> >> >>> >> >>
> >> >>> >> >> Yes, thats the strange part. The last line is a flush as if
> the
> >> >>> process
> >> >>> >> >> never failed. Yes, the process is dead and hbase cannot see
> the
> >> >>> node.
> >> >>> >> >>
> >> >>> >> >>
> >> >>> >> >>>
> >> >>> >> >>> > I have set ulimits(50000) and xceivers(20000) for multiple
> >> users
> >> >>> and
> >> >>> >> >>> certain
> >> >>> >> >>> > that they are correct.
> >> >>> >> >>>
> >> >>> >> >>> The first line in an hbase log prints out the ulimit it sees.
> >>  You
> >> >>> >> >>> might check that the hbase process for sure is picking up
> your
> >> >>> ulimit
> >> >>> >> >>> setting.
> >> >>> >> >>>
> >> >>> >> >>> That was a mistake I did a couple of days ago, checked it
> with
> >> cat
> >> >>> >> >> /proc/<pid of reginserver>/limits  and all related users like
> >> >>> 'hbase'
> >> >>> >> has
> >> >>> >> >> those limits. Checked the logs:
> >> >>> >> >>
> >> >>> >> >> Mon Mar  7 06:41:15 EET 2011 Starting regionserver on test-1
> >> >>> >> >> ulimit -n 52768
> >> >>> >> >>
> >> >>> >> >>>
> >> >>> >> >>> > Also in the kernel logs, there are no apparent problems.
> >> >>> >> >>> >
> >> >>> >> >>>
> >> >>> >> >>> (The mystery compounds)
> >> >>> >> >>>
> >> >>> >> >>> > 2011-03-07 15:07:58,301 DEBUG
> >> >>> >> >>> > org.apache.hadoop.hbase.regionserver.CompactSplitThread:
> >> >>> Compaction
> >> >>> >> >>> > requested for
> >> >>> >> >>> >
> >> >>> >>
> >> >>>
> >> usertable,user1030079237,1299502934627.257739740f58da96d5c5ef51a7d3efc3.
> >> >>> >> >>> > because regionserver60020.cacheFlusher; priority=3,
> compaction
> >> >>> queue
> >> >>> >> >>> size=18
> >> >>> >> >>> > 2011-03-07 15:07:58,301 DEBUG
> >> >>> >> >>> org.apache.hadoop.hbase.regionserver.HRegion:
> >> >>> >> >>> > NOT flushing memstore for region
> >> >>> >> >>> >
> >> >>> >> >>>
> >> >>> >>
> >> >>>
> >>
> usertable,user1601881548,1299502135191.f8efb9aa0922fa8a6a53fc49b8155ebc.,
> >> >>> >> >>> > flushing=false, writesEnabled=false
> >> >>> >> >>> > 2011-03-07 15:07:58,301 DEBUG
> >> >>> >> >>> org.apache.hadoop.hbase.regionserver.HRegion:
> >> >>> >> >>> > Started memstore flush for
> >> >>> >> >>> >
> >> >>> >> >>>
> >> >>> >>
> >> >>>
> >>
> usertable,user1662209069,1299502135191.9fa929e6fb439843cffb604dea3f88f6.,
> >> >>> >> >>> > current region memstore size 68.6m
> >> >>> >> >>> > 2011-03-07 15:07:58,310 DEBUG
> >> >>> >> >>> org.apache.hadoop.hbase.regionserver.HRegion:
> >> >>> >> >>> > Flush requested on
> >> >>> >> >>> >
> >> >>> >>
> >> >>>
> >> usertable,user1601881548,1299502135191.f8efb9aa0922fa8a6a53fc49b8155ebc.
> >> >>> >> >>> > -end of log file-
> >> >>> >> >>> > ---
> >> >>> >> >>> >
> >> >>> >> >>>
> >> >>> >> >>> Nothing more?
> >> >>> >> >>>
> >> >>> >> >>>
> >> >>> >> >> No, nothing after that. But quite a lot of logs before that, I
> >> can
> >> >>> send
> >> >>> >> >> them if you'd like.
> >> >>> >> >>
> >> >>> >> >>
> >> >>> >> >>
> >> >>> >> >>> Thanks,
> >> >>> >> >>> St.Ack
> >> >>> >> >>>
> >> >>> >> >>
> >> >>> >> >> Thanks alot!
> >> >>> >> >>
> >> >>> >> >>
> >> >>> >> >
> >> >>> >>
> >> >>> >
> >> >>>
> >> >>
> >> >>
> >> >
> >>
> >> --
> >> 从我的移动设备发送
> >>
> >> Thanks & Best regards
> >> jiajun
> >>
> >
> >
> >
> > --
> > erdem agaoglu
> >
>

Re: region servers dying - flush request - YCSB

Reply via email to