Re: Some regions never get online after a region server crashes

2016-05-19 Thread Ted Yu
Looks like region d1c7f3f455f2529da82a2f713b5ee067 received a CLOSE request
while it was opening, leading to RegionAlreadyInTransitionException.

Was there any clue in the master log as to why the close request was sent?

Cheers
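
A minimal sketch, assuming the HBase 1.0 client API, of how to list the regions the master still considers in transition (which should include the stuck region above); the class name and output format are illustrative, not from this thread:

    import java.util.Map;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.master.RegionState;

    public class ListRegionsInTransition {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Admin admin = conn.getAdmin()) {
                // Regions the master currently tracks as in transition,
                // keyed by encoded region name (e.g. d1c7f3f455f2...).
                Map<String, RegionState> rit =
                    admin.getClusterStatus().getRegionsInTransition();
                for (Map.Entry<String, RegionState> e : rit.entrySet()) {
                    System.out.println(e.getKey() + " => " + e.getValue());
                }
            }
        }
    }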

On Wed, May 4, 2016 at 8:02 PM, Shuai Lin  wrote:

> Hi Ted,
>
> The HBase version is 1.0.0-cdh5.4.8, shipped with Cloudera CDH 5.4.8. The
> RS logs on node6 can be found here.
>
> Thanks!
>
> Shuai
>
> On Thu, May 5, 2016 at 9:15 AM, Ted Yu  wrote:
>
> > Can you pastebin the related server log w.r.t.
> > d1c7f3f455f2529da82a2f713b5ee067 from rs-node6?
> >
> > Which release of HBase are you using?
> >
> > Cheers
> >
> > On Wed, May 4, 2016 at 6:07 PM, Shuai Lin 
> wrote:
> >
> > > Hi list,
> > >
> > > Last weekend a region server crashed, but some regions never came
> > > online again on other RSes. I've gone through the logs, and here is a
> > > timeline of some of the events:
> > >
> > > * 13:03:50 one of the region servers, rs-node7, died because of a disk
> > > failure. The master started to split rs-node7's WALs
> > >
> > >
> > > 2016-04-30 13:03:50,953 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Splitting logs for rs-node7.example.com,60020,1458724695128 before assignment; region count=133
> > > 2016-04-30 13:03:50,966 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: Scheduling batch of logs to split
> > > 2016-04-30 13:03:50,966 INFO org.apache.hadoop.hbase.master.SplitLogManager: started splitting 33 logs in [hdfs://nameservice1/hbase/WALs/rs-node7.example.com,60020,1458724695128-splitting] for [rs-node7.example.com,60020,1458724695128]
> > >
> > > * 13:10:47 WAL splits done; the master began to re-assign regions
> > >
> > > 2016-04-30 13:10:47,655 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Reassigning 133 region(s) that rs-node7.example.com,60020,1458724695128 was carrying (and 0 regions(s) that were opening on this server)
> > > 2016-04-30 13:10:47,665 INFO org.apache.hadoop.hbase.master.AssignmentManager: Bulk assigning 133 region(s) across 6 server(s), round-robin=true
> > > 2016-04-30 13:10:47,667 INFO org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22 region(s) to rs-node1.example.com,60020,1458720625688
> > > 2016-04-30 13:10:47,667 INFO org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22 region(s) to rs-node2.example.com,60020,1458721110988
> > > 2016-04-30 13:10:47,667 INFO org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22 region(s) to rs-node3.example.com,60020,1458721713906
> > > 2016-04-30 13:10:47,679 INFO org.apache.hadoop.hbase.master.AssignmentManager: Assigning 23 region(s) to rs-node4.example.com,60020,1458722335527
> > > 2016-04-30 13:10:47,691 INFO org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22 region(s) to rs-node5.example.com,60020,1458722992296
> > > 2016-04-30 13:10:47,702 INFO org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22 region(s) to rs-node6.example.com,60020,1458723856883
> > >
> > > * 13:11:47 the open-region RPCs sent by the master to the region
> > > servers timed out after 1 minute
> > >
> > > 2016-04-30 13:11:47,780 INFO org.apache.hadoop.hbase.master.AssignmentManager: Unable to communicate with rs-node3.example.com,60020,1458721713906 in order to assign regions
> > > java.io.IOException: Call to rs-node3.example.com/172.16.6.3:60020 failed on local exception: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=4, waitTime=60001, operationTimeout=6 expired.
> > > 2016-04-30 13:11:47,782 INFO org.apache.hadoop.hbase.master.GeneralBulkAssigner: Failed assigning 22 regions to server rs-node6.example.com,60020,1458723856883, reassigning them
> > > 2016-04-30 13:11:47,782 INFO org.apache.hadoop.hbase.master.GeneralBulkAssigner: Failed assigning 23 regions to server rs-node4.example.com,60020,1458722335527, reassigning them
> > > 2016-04-30 13:11:47,782 INFO org.apache.hadoop.hbase.master.GeneralBulkAssigner: Failed assigning 22 regions to server rs-node3.example.com,60020,1458721713906, reassigning them
> > > 2016-04-30 13:11:47,783 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Force region state offline {a65660e421f114e93862194f7cc35644 state=OPENING, ts=1462021907753, server=rs-node6.example.com,60020,1458723856883}
> > >
> > >
> > > * After that, part of the regions (40 out of 130) never got online,
> > > and the following lines were logged repeatedly in the master log:
> > >
> > > 2016-04-30 13:12:37,188 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: update {d1c7f3f455f2529da82a2f713b5ee067 state=PENDING_OPEN, ts=1462021957087, server=rs-node6.example.com,60020,1458723856883} the timestamp.
> > > 2016-04-30 13:12:37,188 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Region is already in
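
The waitTime=60001 in the CallTimeoutException above lines up with the default one-minute HBase RPC timeout. A minimal sketch of raising it, assuming the standard hbase.rpc.timeout key; in a real deployment the override would normally live in hbase-site.xml on the master rather than in code, and whether this key alone governs the master's open-region call is an assumption:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class RpcTimeoutSketch {
        public static void main(String[] args) {
            Configuration conf = HBaseConfiguration.create();
            // Default is 60000 ms, matching waitTime=60001 in the log above.
            // 180000 ms is purely illustrative, not a recommendation.
            conf.setInt("hbase.rpc.timeout", 180000);
            System.out.println("hbase.rpc.timeout = " + conf.get("hbase.rpc.timeout"));
        }
    }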

Re: Some regions never get online after a region server crashes

2016-05-19 Thread Shuai Lin
Hi Ted, could you please take a look at the log I've pasted above? Thanks!

On Thu, May 5, 2016 at 11:02 AM, Shuai Lin  wrote:

> Hi Ted,
>
> The HBase version is 1.0.0-cdh5.4.8, shipped with Cloudera CDH 5.4.8. The
> RS logs on node6 can be found here.
>
> Thanks!
>
> Shuai

Re: [ANNOUNCE] PhoenixCon the day after HBaseCon

2016-05-19 Thread Enis Söztutar
This is great. Thanks James for organizing!

Enis

On Thu, May 19, 2016 at 2:50 PM, James Taylor 
wrote:

> The inaugural PhoenixCon will take place 9am-1pm on Wed, May 25th (at
> Salesforce @ 1 Market St, SF), the day after HBaseCon. We'll have two
> tracks: one for Apache Phoenix use cases and one for Apache Phoenix
> internals.
>
> To RSVP and for more details see here[1].
>
> We hope you can make it!
>
> James
>
> [1]
> http://www.meetup.com/SF-Bay-Area-Apache-Phoenix-Meetup/events/230545182/
>


[ANNOUNCE] PhoenixCon the day after HBaseCon

2016-05-19 Thread James Taylor
The inaugural PhoenixCon will take place 9am-1pm on Wed, May 25th (at
Salesforce @ 1 Market St, SF), the day after HBaseCon. We'll have two
tracks: one for Apache Phoenix use cases and one for Apache Phoenix
internals.

To RSVP and for more details see here[1].

We hope you can make it!

James

[1]
http://www.meetup.com/SF-Bay-Area-Apache-Phoenix-Meetup/events/230545182/


Re: Rows per second for RegionScanner

2016-05-19 Thread hongbin ma
hi Vladimir

Thanks for your reply. I'll try.

a quick question: suppose there's only one region on a 4-core server. When
the region is being scanned, will all the cores be utilized to speed up
scanning? Can you kindly point me to some evidence in the code or an
official document?

Last time I observed a 4-core server scanning a single region, the CPU
usage was only 25%. That's why I'm asking.

thanks!
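
A hedged sketch, not from this thread: if a single scan of one region is served by a single server-side thread at a time, which would be consistent with the 25% CPU figure above, the usual way to engage more cores is to split the key range into several sub-scans run from parallel client threads. The table name, split points, and pool size below are illustrative assumptions:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ParallelRangeScan {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            // Illustrative split points; real ones would come from sampling row keys.
            final byte[][] bounds = {Bytes.toBytes("0"), Bytes.toBytes("8"),
                                     Bytes.toBytes("g"), Bytes.toBytes("~")};
            ExecutorService pool = Executors.newFixedThreadPool(bounds.length - 1);
            try (final Connection conn = ConnectionFactory.createConnection(conf)) {
                List<Future<Long>> parts = new ArrayList<Future<Long>>();
                for (int i = 0; i < bounds.length - 1; i++) {
                    final Scan scan = new Scan(bounds[i], bounds[i + 1]);
                    scan.setCaching(1024);
                    parts.add(pool.submit(new Callable<Long>() {
                        @Override
                        public Long call() throws Exception {
                            long rows = 0;
                            // Table instances are not thread-safe; one per sub-scan.
                            try (Table t = conn.getTable(TableName.valueOf("KYLIN_YMSGYYXO12"));
                                 ResultScanner rs = t.getScanner(scan)) {
                                for (Result r : rs) {
                                    rows++;
                                }
                            }
                            return rows;
                        }
                    }));
                }
                long total = 0;
                for (Future<Long> f : parts) {
                    total += f.get();
                }
                System.out.println("rows scanned: " + total);
            } finally {
                pool.shutdown();
            }
        }
    }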


On Fri, Apr 22, 2016 at 12:10 PM, Vladimir Rodionov 
wrote:

> Try disabling block encoding - you will get better numbers.
>
> >>  I mean per region scan speed,
>
> Scan performance depends on the number of CPU cores; the more cores you
> have, the more performance you will get. Your servers are pretty low-end
> (4 virtual CPU cores is just 2 hardware cores). With 32 cores per node you
> will get close to an 8x speedup.
>
> -Vlad
>
>
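
A hedged sketch of the block-encoding change suggested above, using the 1.x admin API; the table and family names are borrowed from the HFile path quoted at the end of this thread and may not match the actual schema, and the change only takes full effect once compaction rewrites the HFiles:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.io.encoding.DataBlockEncoding;
    import org.apache.hadoop.hbase.util.Bytes;

    public class DisableBlockEncoding {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Admin admin = conn.getAdmin()) {
                TableName table = TableName.valueOf("KYLIN_YMSGYYXO12");
                // Modify the existing family descriptor so other settings are kept.
                HTableDescriptor desc = admin.getTableDescriptor(table);
                HColumnDescriptor family = desc.getFamily(Bytes.toBytes("F1"));
                family.setDataBlockEncoding(DataBlockEncoding.NONE); // was FAST_DIFF
                admin.modifyColumn(table, family);
                admin.majorCompact(table); // rewrite HFiles without the encoding
            }
        }
    }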
> On Thu, Apr 21, 2016 at 7:22 PM, hongbin ma  wrote:
>
> > hi Thakrar
> >
> > Thanks for your reply.
> >
> > My settings for the RegionScanner Scan are:
> >
> > scan.setCaching(1024);
> > scan.setMaxResultSize(5 * 1024 * 1024); // 5M
> >
> > Even if I change the caching to 10, I'm still not getting any
> > improvements. I guess the caching works for remote scans through RPC,
> > but doesn't help much for region-side scans?
> >
> > I also tried PREFETCH_BLOCKS_ON_OPEN for the whole table, but no
> > improvement was observed.
> >
> > I'm pursuing pure scan-read performance optimization because our
> > application is mostly read-only. I observed that even if I did nothing
> > else (only scanning) in my coprocessor, the scan speed was not
> > satisfying. The CPU seems to be fully utilized. Maybe the process of
> > decoding FAST_DIFF rows is too heavy for the CPU? What rows-per-second
> > scan speed would you expect on a normal setup? I mean per-region scan
> > speed, not the overall scan speed across all regions.
> >
> > thanks
> >
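
For context, a hedged sketch of the kind of region-side scan loop being measured here, assuming an HBase 1.0 endpoint coprocessor whose RegionCoprocessorEnvironment exposes the region; the class and method names are illustrative and the protobuf service plumbing is omitted:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.hbase.Cell;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
    import org.apache.hadoop.hbase.regionserver.RegionScanner;

    public final class RegionSideCount {
        // Called from inside an endpoint coprocessor method.
        static long countRows(RegionCoprocessorEnvironment env) throws Exception {
            // Client-side scanner caching is an RPC concept; it shouldn't matter here.
            Scan scan = new Scan();
            RegionScanner scanner = env.getRegion().getScanner(scan);
            List<Cell> cells = new ArrayList<Cell>();
            long rows = 0;
            try {
                boolean hasMore;
                do {
                    cells.clear();
                    hasMore = scanner.next(cells); // one row's cells per call, no RPC involved
                    if (!cells.isEmpty()) {
                        rows++;
                    }
                } while (hasMore);
            } finally {
                scanner.close();
            }
            return rows;
        }
    }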
> > On Thu, Apr 21, 2016 at 10:24 PM, Thakrar, Jayesh <
> > jthak...@conversantmedia.com> wrote:
> >
> > > Just curious - have you set the scanner caching to some high value -
> > > say 1000 (or even higher in your small-value case)?
> > >
> > > The parameter is hbase.client.scanner.caching
> > >
> > > You can read up on it - https://hbase.apache.org/book.html
> > >
> > > Another thing, are you just looking for pure scan-read performance
> > > optimization? Depending upon the table size you can also look into
> > > caching the table or not caching at all.
> > >
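
A hedged sketch of the knobs mentioned above, in the 1.x API; the family name and values are illustrative, and hbase.client.scanner.caching affects only client RPC scans, not region-side scanners:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;

    public class ScanCachingKnobs {
        public static void main(String[] args) {
            Configuration conf = HBaseConfiguration.create();
            // Rows fetched per scanner RPC; a Scan's setCaching() overrides
            // this per scan. 1000 echoes the suggestion above.
            conf.setInt("hbase.client.scanner.caching", 1000);

            // Per-family cache hints for "caching the table or not at all".
            HColumnDescriptor family = new HColumnDescriptor("F1"); // illustrative family
            family.setBlockCacheEnabled(true);    // false would mean not caching at all
            family.setInMemory(true);             // prefer keeping this family's blocks cached
            family.setPrefetchBlocksOnOpen(true); // the PREFETCH_BLOCKS_ON_OPEN knob mentioned earlier
            System.out.println(family);
        }
    }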
> > > -Original Message-
> > > From: hongbin ma [mailto:mahong...@apache.org]
> > > Sent: Thursday, April 21, 2016 5:04 AM
> > > To: user@hbase.apache.org
> > > Subject: Rows per second for RegionScanner
> > >
> > > Hi, experts,
> > >
> > > I'm trying to figure out how fast HBase can scan. I'm setting up the
> > > RegionScanner in an endpoint coprocessor so that no network overhead is
> > > included. My average key length is 35 and average value length is 5.
> > >
> > > My test result is that if I warm all the blocks I'm interested in into
> > > the block cache, I'm only able to scan around 300,000 rows per second
> > > per region (with an endpoint I guess it's one thread per region), so at
> > > roughly 40-50 bytes per row it's like reading 15M of data per second.
> > > I'm not sure if this is already an acceptable number for HBase. Your
> > > answers might help me decide whether it's worth digging further into
> > > tuning it.
> > >
> > > thanks!
> > >
> > >
> > >
> > >
> > >
> > >
> > > other info:
> > >
> > > My HBase cluster is on 8 AWS m1.xlarge instances, each with 4 CPU cores
> > > and 16G RAM. Each region server is configured with a 10G heap. The test
> > > HTable has 23 regions, one HFile per region (just major compacted).
> > > There was no other resource contention when I ran the tests.
> > >
> > > Attached is the HFile tool output for one of the region's HFiles:
> > > =
> > > hbase org.apache.hadoop.hbase.io.hfile.HFile -m -s -v -f /apps/hbase/data/data/default/KYLIN_YMSGYYXO12/d42b9faf43eafcc9640aa256143d5be3/F1/30b8a8ff5a82458481846e364974bf06
> > > 2016-04-21 09:16:04,091 INFO  [main] Configuration.deprecation: hadoop.native.lib is deprecated. Instead, use io.native.lib.available
> > > 2016-04-21 09:16:04,292 INFO  [main] util.ChecksumType: Checksum using org.apache.hadoop.util.PureJavaCrc32
> > > 2016-04-21 09:16:04,294 INFO  [main] util.ChecksumType: Checksum can use org.apache.hadoop.util.PureJavaCrc32C
> > > SLF4J: Class path contains multiple SLF4J bindings.
> > > SLF4J: Found binding in [jar:file:/usr/hdp/2.2.9.0-3393/hadoop/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> > > SLF4J: Found binding in [jar:file:/usr/hdp/2.2.9.0-3393/zookeeper/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> > > SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.