Looks like the time was spent in HRegion.internalFlushcache(). There is not much information in the server log.
I suggest you try out the patch in HBASE-6466 (Enable multi-thread for memstore flush). It has been running in Taobao's production system.

BTW, have you noticed the following in the log?

2012-08-10 17:30:17,927 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: Too many consecutive RollWriter requests, it's a sign of the total number of live datanodes is lower than the tolerable replicas.

How many datanodes did you use?

Thanks

On Fri, Aug 10, 2012 at 3:12 PM, Espinoza,Carlos <[email protected]> wrote:
> We were able to replicate this behavior in a pseudo-distributed HBase
> (hbase-0.94.1) environment. We wrote a test program that creates a test
> table "MyTestTable" and populates it with random rows, then creates a
> row with 60,000 columns and repeatedly updates it. Each column has an
> 18-byte qualifier and a 50-byte value. In our tests, when we ran the
> program, we usually never got beyond 15 updates before it would flush
> for a really long time. The rows being updated are about 4 MB each
> (minus any HBase metadata).
>
> It doesn't seem to be caused by GC. I turned on GC logging and didn't
> see any long pauses. This is the GC log during the flush:
> http://pastebin.com/vJKKXDx5
>
> This is the region server log with debug on during the same flush:
> http://pastebin.com/Fh5213mg
>
> This is the test program we wrote:
> http://pastebin.com/aZ0k5tx2
>
> You should be able to just compile it and run it against a running
> HBase cluster.
> $ java TestTable
>
> Carlos
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of
> Jean-Daniel Cryans
> Sent: Monday, August 06, 2012 8:04 PM
> To: [email protected]
> Subject: Re: Extremely long flush times
>
> Hi Ron,
>
> I think this is caused by GC; for example, there are 7 minutes between
> the last blocking message and the end of the flush:
>
> 2012-08-06 11:44:36,348 INFO
> org.apache.hadoop.hbase.regionserver.HRegion: Blocking updates for 'IPC
> Server handler 42 on 9009' on region
> Works,73658752:eng,1337915462169.1cfc92481a74ec278337871563d189d6.:
> memstore size 271.9m is >= than blocking 256.0m size
> 2012-08-06 11:51:28,139 INFO
> org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of
> ~155.4m/162905016, currentsize=116.5m/122172880 for region
> Works,73658752:eng,1337915462169.1cfc92481a74ec278337871563d189d6. in
> 501337ms, sequenceid=1792264092, compaction requested=true
>
> During that time I would expect that clients got socket timeouts, tried
> to write again, and data would just pile up in the RPC queue.
>
> Do you have the GC log handy? Also, did you set the ZooKeeper session
> timeout higher than the default of 3 minutes?
>
> Thx,
>
> J-D
>
> On Mon, Aug 6, 2012 at 2:37 PM, Buckley,Ron <[email protected]> wrote:
> > We ran into this situation today on our production instance; I'm
> > wondering if anyone has seen or knows a way around it.
> >
> > We were repeatedly updating the same 60,000-column-wide row. With
> > each update, we were updating nearly every column of the row. Each
> > update generated nearly 9 MB in the WAL.
> >
> > This worked OK most of the time.
> > We saw frequent messages like this in the region server log,
> > indicating a 2.6-second flush time:
> >
> > 2012-08-06 11:42:52,316 INFO
> > org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush
> > of ~155.4m/162931400, currentsize=2.7k/2760 for region
> > Works,73658752:eng,1337915462169.1cfc92481a74ec278337871563d189d6. in
> > 2610ms, sequenceid=1792263953, compaction requested=false
> >
> > However, sometimes we would see flush times of around 8-10 minutes:
> >
> > 2012-08-06 11:51:28,139 INFO
> > org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush
> > of ~155.4m/162905016, currentsize=116.5m/122172880 for region
> > Works,73658752:eng,1337915462169.1cfc92481a74ec278337871563d189d6. in
> > 501337ms, sequenceid=1792264092, compaction requested=true
> >
> > During these long flushes, we would end up taking all the IPC handler
> > slots into that region server that we had configured (50). It looked
> > to our application like the region server was unavailable.
> >
> > After the long delay, the region server would continue running fine,
> > but then do it again a few minutes later.
> >
> > I thought it was most likely something with garbage collection, so I
> > got a stack trace during one of the events. The trace doesn't look
> > like GC was running.
> >
> > http://pastebin.com/ANBiUzX4 - Stack trace when the region server was
> > unreachable
> >
> > http://pastebin.com/naTYQxZR - Region server log during the same
> > interval
> >
> > Versions:
> >
> > HBase 0.92.1, r1298924
> >
> > Hadoop 0.20.2-cdh3u3
> >
> > ----------------------------------------------------------------------
> >
> > Ron Buckley
> >
> > x6365
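The figures quoted in the thread can be sanity-checked with some quick arithmetic: Carlos's 60,000-column row works out to roughly the 4 MB he reports, and the byte counts and durations in Ron's two log lines show the slow flush moving the same amount of data nearly 200x slower than the normal one. A small standalone check (hypothetical helper, not part of the original TestTable program):

```java
// Sanity-check the numbers reported in the thread.
public class FlushMath {
    public static void main(String[] args) {
        // Carlos's row: 60,000 columns, each an 18-byte qualifier + 50-byte value.
        long rowBytes = 60_000L * (18 + 50);
        System.out.printf("row payload ~ %.1f MB%n", rowBytes / (1024.0 * 1024.0));
        // ~3.9 MB, consistent with "about 4 MB each (minus any HBase metadata)".

        // Normal flush from Ron's log: ~162,931,400 bytes in 2,610 ms.
        double normalMBps = 162_931_400 / (1024.0 * 1024.0) / 2.610;
        // Slow flush: ~162,905,016 bytes in 501,337 ms.
        double slowMBps = 162_905_016 / (1024.0 * 1024.0) / 501.337;
        System.out.printf("normal flush ~ %.1f MB/s, slow flush ~ %.2f MB/s%n",
                normalMBps, slowMBps);
        // Roughly 60 MB/s vs. 0.3 MB/s: the slow flush writes the same
        // ~155 MB memstore about 190x slower, which points at something
        // stalling the flush path rather than at the data volume itself.
    }
}
```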
