Re: Parallel Scanner

2017-02-20 Thread Anil
Hi Ted,

Thanks. I will go through the Phoenix code.

Thanks.

On 20 February 2017 at 21:50, Ted Yu  wrote:

> Please read https://phoenix.apache.org/update_statistics.html
>
> FYI
>
> On Mon, Feb 20, 2017 at 8:14 AM, Anil  wrote:
>
> > Hi Ted,
> >
> > its very difficult to predict the data distribution. we store parent to
> > child relationships in the table. (Note : A parent record is child for
> > itself )
> >
> > we set the max hregion file size as 10gb. I don't think we have any
> control
> > on region size :(
> >
> > Thanks
> >
> >
> > On 20 February 2017 at 21:24, Ted Yu  wrote:
> >
> > > Among the 5 columns, do you know roughly the data distribution ?
> > >
> > > You should put the columns whose data distribution is relatively even
> > > first. Of course, there may be business requirement which you take into
> > > consideration w.r.t. the composite key.
> > >
> > > If you cannot change the schema, do you have control over the region
> > size ?
> > > Smaller region may lower the variance in data distribution per region.
> > >
> > > On Mon, Feb 20, 2017 at 7:47 AM, Anil  wrote:
> > >
> > > > Hi Ted,
> > > >
> > > > Current region size is 10 GB.
> > > >
> > > > Hbase row key designed like a phoenix primary key. I can say it is
> > like 5
> > > > column composite key. Prefix for a common set of data would have same
> > > first
> > > > prefix. I am not sure how to convey the data distribution.
> > > >
> > > > Thanks.
> > > >
> > > > On 20 February 2017 at 20:48, Ted Yu  wrote:
> > > >
> > > > > Anil:
> > > > > What's the current region size you use ?
> > > > >
> > > > > Given a region, do you have some idea how the data is distributed
> > > within
> > > > > the region ?
> > > > >
> > > > > Cheers
> > > > >
> > > > > On Mon, Feb 20, 2017 at 7:14 AM, Anil  wrote:
> > > > >
> > > > > > i understand my original post now :)  Sorry about that.
> > > > > >
> > > > > > now the challenge is to split a start key and end key at client
> > side
> > > to
> > > > > > allow parallel scans on table with no buckets, pre-salting.
> > > > > >
> > > > > > Thanks.
> > > > > >
> > > > > > On 20 February 2017 at 20:21, ramkrishna vasudevan <
> > > > > > ramkrishna.s.vasude...@gmail.com> wrote:
> > > > > >
> > > > > > > You are trying to scan one region itself in parallel, then
> even I
> > > got
> > > > > you
> > > > > > > wrong. Richard's suggestion is the right choice for client only
> > > soln.
> > > > > > >
> > > > > > > On Mon, Feb 20, 2017 at 7:40 PM, Anil 
> > wrote:
> > > > > > >
> > > > > > > > Thanks Richard :)
> > > > > > > >
> > > > > > > > On 20 February 2017 at 18:56, Richard Startin <
> > > > > > > richardstar...@outlook.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > RegionLocator is not deprecated, hence the suggestion to
> use
> > it
> > > > if
> > > > > > it's
> > > > > > > > > available in place of whatever is still available on HTable
> > for
> > > > > your
> > > > > > > > > version of HBase - it will make upgrades easier. For
> instance
> > > > > > > > > HTable::getRegionsInRange no longer exists on the current
> > > master
> > > > > > > branch.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > "I am trying to scan a region in parallel :)"
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > I thought you were asking about scanning many regions at
> the
> > > same
> > > > > > time,
> > > > > > > > > not scanning a single region in parallel? HBASE-1935 is
> about
> > > > > > > > parallelising
> > > > > > > > > scans over regions, not within regions.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > If you want to parallelise within a region, you could
> write a
> > > > > little
> > > > > > > > > method to split the first and last key of the region into
> > > several
> > > > > > > > disjoint
> > > > > > > > > lexicographic buckets and create a scan for each bucket,
> then
> > > > > execute
> > > > > > > > those
> > > > > > > > > scans in parallel. Your data probably doesn't distribute
> > > > uniformly
> > > > > > over
> > > > > > > > > lexicographic buckets though so the scans are unlikely to
> > > execute
> > > > > at
> > > > > > a
> > > > > > > > > constant rate and you'll get results in time proportional
> to
> > > the
> > > > > > > > > lexicographic bucket with the highest cardinality in the
> > > region.
> > > > > I'd
> > > > > > be
> > > > > > > > > interested to know if anyone on the list has ever tried
> this
> > > and
> > > > > what
> > > > > > > the
> > > > > > > > > results were?
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Using the much simpler approach of parallelising over
> regions
> > > by
> > > > > > > creating
> > > > > > > > > multiple disjoint scans client side, as suggested, your
> > > > performance
> > > > > > now
> > > > > > > > > depends on your regions which you have some control over.
> You
> > > can
> > > > > 

Re: Don't Settle for Eventual Consistency

2017-02-20 Thread Edward Capriolo
On Mon, Feb 20, 2017 at 9:29 PM, Edward Capriolo 
wrote:

> AP systems are not available in practice.
>
> CP systems can be made highly available.
>
> Sounds like they are arguing AP is not AP, but somehow CP can be AP.
>
> Then Google can label failures as 'incidents' and CP and AP are unaffected.
>
> I swear FoundationDB claimed it solved CAP; too bad FoundationDB is
> UnavailableException.
>
> On Friday, February 17, 2017, Ted Yu  wrote:
>
>> Reference #8 at the end of the post is interesting.
>>
>> On Fri, Feb 17, 2017 at 9:23 AM, Robert Yokota 
>> wrote:
>>
>> > Hi,
>> >
>> > This may be helpful to those who are considering the use of HBase.
>> >
>> > https://yokota.blog/2017/02/17/dont-settle-for-eventual-consistency/
>> >
>>
>
>
> --
> Sorry this was sent from mobile. Will do less grammar and spell check than
> usual.
>


Totally fair comparison, by the way: call out figures from Google and
Facebook, companies with huge development budgets and data centers, teams
with tens or hundreds of developers building in-house software, then
compare those deployments to a deployment of Cassandra or Riak at
Yammer.

Random thought: compare the availability of Amazon S3, built to leverage
eventual consistency on top of other systems with eventual consistency,
with, say, Google Cloud Storage...

https://cloud.google.com/products/storage/
Available: All storage classes offer very high availability. Your data is
accessible when you need it. Multi-Regional offers 99.95% and Regional
storage offers 99.9% monthly availability in their Service Level Agreement.
Nearline and Coldline storage offer 99% monthly availability.

What happened to those 5 9s?


Don't Settle for Eventual Consistency

2017-02-20 Thread Edward Capriolo
AP systems are not available in practice.

CP systems can be made highly available.

Sounds like they are arguing AP is not AP, but somehow CP can be AP.

Then Google can label failures as 'incidents' and CP and AP are unaffected.

I swear FoundationDB claimed it solved CAP; too bad FoundationDB is
UnavailableException.

On Friday, February 17, 2017, Ted Yu  wrote:

> Reference #8 at the end of the post is interesting.
>
> On Fri, Feb 17, 2017 at 9:23 AM, Robert Yokota  wrote:
>
> > Hi,
> >
> > This may be helpful to those who are considering the use of HBase.
> >
> > https://yokota.blog/2017/02/17/dont-settle-for-eventual-consistency/
> >
>


-- 
Sorry this was sent from mobile. Will do less grammar and spell check than
usual.


Re: Parallel Scanner

2017-02-20 Thread Ted Yu
Please read https://phoenix.apache.org/update_statistics.html

FYI

On Mon, Feb 20, 2017 at 8:14 AM, Anil  wrote:

> Hi Ted,
>
> its very difficult to predict the data distribution. we store parent to
> child relationships in the table. (Note : A parent record is child for
> itself )
>
> we set the max hregion file size as 10gb. I don't think we have any control
> on region size :(
>
> Thanks
>
>
> On 20 February 2017 at 21:24, Ted Yu  wrote:
>
> > Among the 5 columns, do you know roughly the data distribution ?
> >
> > You should put the columns whose data distribution is relatively even
> > first. Of course, there may be business requirement which you take into
> > consideration w.r.t. the composite key.
> >
> > If you cannot change the schema, do you have control over the region
> size ?
> > Smaller region may lower the variance in data distribution per region.
> >
> > On Mon, Feb 20, 2017 at 7:47 AM, Anil  wrote:
> >
> > > Hi Ted,
> > >
> > > Current region size is 10 GB.
> > >
> > > Hbase row key designed like a phoenix primary key. I can say it is
> like 5
> > > column composite key. Prefix for a common set of data would have same
> > first
> > > prefix. I am not sure how to convey the data distribution.
> > >
> > > Thanks.
> > >
> > > On 20 February 2017 at 20:48, Ted Yu  wrote:
> > >
> > > > Anil:
> > > > What's the current region size you use ?
> > > >
> > > > Given a region, do you have some idea how the data is distributed
> > within
> > > > the region ?
> > > >
> > > > Cheers
> > > >
> > > > On Mon, Feb 20, 2017 at 7:14 AM, Anil  wrote:
> > > >
> > > > > i understand my original post now :)  Sorry about that.
> > > > >
> > > > > now the challenge is to split a start key and end key at client
> side
> > to
> > > > > allow parallel scans on table with no buckets, pre-salting.
> > > > >
> > > > > Thanks.
> > > > >
> > > > > On 20 February 2017 at 20:21, ramkrishna vasudevan <
> > > > > ramkrishna.s.vasude...@gmail.com> wrote:
> > > > >
> > > > > > You are trying to scan one region itself in parallel, then even I
> > got
> > > > you
> > > > > > wrong. Richard's suggestion is the right choice for client only
> > soln.
> > > > > >
> > > > > > On Mon, Feb 20, 2017 at 7:40 PM, Anil 
> wrote:
> > > > > >
> > > > > > > Thanks Richard :)
> > > > > > >
> > > > > > > On 20 February 2017 at 18:56, Richard Startin <
> > > > > > richardstar...@outlook.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > RegionLocator is not deprecated, hence the suggestion to use
> it
> > > if
> > > > > it's
> > > > > > > > available in place of whatever is still available on HTable
> for
> > > > your
> > > > > > > > version of HBase - it will make upgrades easier. For instance
> > > > > > > > HTable::getRegionsInRange no longer exists on the current
> > master
> > > > > > branch.
> > > > > > > >
> > > > > > > >
> > > > > > > > "I am trying to scan a region in parallel :)"
> > > > > > > >
> > > > > > > >
> > > > > > > > I thought you were asking about scanning many regions at the
> > same
> > > > > time,
> > > > > > > > not scanning a single region in parallel? HBASE-1935 is about
> > > > > > > parallelising
> > > > > > > > scans over regions, not within regions.
> > > > > > > >
> > > > > > > >
> > > > > > > > If you want to parallelise within a region, you could write a
> > > > little
> > > > > > > > method to split the first and last key of the region into
> > several
> > > > > > > disjoint
> > > > > > > > lexicographic buckets and create a scan for each bucket, then
> > > > execute
> > > > > > > those
> > > > > > > > scans in parallel. Your data probably doesn't distribute
> > > uniformly
> > > > > over
> > > > > > > > lexicographic buckets though so the scans are unlikely to
> > execute
> > > > at
> > > > > a
> > > > > > > > constant rate and you'll get results in time proportional to
> > the
> > > > > > > > lexicographic bucket with the highest cardinality in the
> > region.
> > > > I'd
> > > > > be
> > > > > > > > interested to know if anyone on the list has ever tried this
> > and
> > > > what
> > > > > > the
> > > > > > > > results were?
> > > > > > > >
> > > > > > > >
> > > > > > > > Using the much simpler approach of parallelising over regions
> > by
> > > > > > creating
> > > > > > > > multiple disjoint scans client side, as suggested, your
> > > performance
> > > > > now
> > > > > > > > depends on your regions which you have some control over. You
> > can
> > > > > > achieve
> > > > > > > > the same effect by pre-splitting your table such that you
> > > > empirically
> > > > > > > > optimise read performance for the dataset you store.
> > > > > > > >
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > >
> > > > > > > > Richard
> > > > > > > >
> > > > > > > >
> > > > > > > > 
> > > > > > > > From: Anil 
> > > > > > > > Sent: 20 February 

Re: Parallel Scanner

2017-02-20 Thread Anil
Hi Ted,

It's very difficult to predict the data distribution. We store
parent-to-child relationships in the table. (Note: a parent record is a
child of itself.)

We set the max HRegion file size to 10 GB. I don't think we have any
control over region size :(

Thanks


On 20 February 2017 at 21:24, Ted Yu  wrote:

> Among the 5 columns, do you know roughly the data distribution ?
>
> You should put the columns whose data distribution is relatively even
> first. Of course, there may be business requirement which you take into
> consideration w.r.t. the composite key.
>
> If you cannot change the schema, do you have control over the region size ?
> Smaller region may lower the variance in data distribution per region.
>
> On Mon, Feb 20, 2017 at 7:47 AM, Anil  wrote:
>
> > Hi Ted,
> >
> > Current region size is 10 GB.
> >
> > Hbase row key designed like a phoenix primary key. I can say it is like 5
> > column composite key. Prefix for a common set of data would have same
> first
> > prefix. I am not sure how to convey the data distribution.
> >
> > Thanks.
> >
> > On 20 February 2017 at 20:48, Ted Yu  wrote:
> >
> > > Anil:
> > > What's the current region size you use ?
> > >
> > > Given a region, do you have some idea how the data is distributed
> within
> > > the region ?
> > >
> > > Cheers
> > >
> > > On Mon, Feb 20, 2017 at 7:14 AM, Anil  wrote:
> > >
> > > > i understand my original post now :)  Sorry about that.
> > > >
> > > > now the challenge is to split a start key and end key at client side
> to
> > > > allow parallel scans on table with no buckets, pre-salting.
> > > >
> > > > Thanks.
> > > >
> > > > On 20 February 2017 at 20:21, ramkrishna vasudevan <
> > > > ramkrishna.s.vasude...@gmail.com> wrote:
> > > >
> > > > > You are trying to scan one region itself in parallel, then even I
> got
> > > you
> > > > > wrong. Richard's suggestion is the right choice for client only
> soln.
> > > > >
> > > > > On Mon, Feb 20, 2017 at 7:40 PM, Anil  wrote:
> > > > >
> > > > > > Thanks Richard :)
> > > > > >
> > > > > > On 20 February 2017 at 18:56, Richard Startin <
> > > > > richardstar...@outlook.com>
> > > > > > wrote:
> > > > > >
> > > > > > > RegionLocator is not deprecated, hence the suggestion to use it
> > if
> > > > it's
> > > > > > > available in place of whatever is still available on HTable for
> > > your
> > > > > > > version of HBase - it will make upgrades easier. For instance
> > > > > > > HTable::getRegionsInRange no longer exists on the current
> master
> > > > > branch.
> > > > > > >
> > > > > > >
> > > > > > > "I am trying to scan a region in parallel :)"
> > > > > > >
> > > > > > >
> > > > > > > I thought you were asking about scanning many regions at the
> same
> > > > time,
> > > > > > > not scanning a single region in parallel? HBASE-1935 is about
> > > > > > parallelising
> > > > > > > scans over regions, not within regions.
> > > > > > >
> > > > > > >
> > > > > > > If you want to parallelise within a region, you could write a
> > > little
> > > > > > > method to split the first and last key of the region into
> several
> > > > > > disjoint
> > > > > > > lexicographic buckets and create a scan for each bucket, then
> > > execute
> > > > > > those
> > > > > > > scans in parallel. Your data probably doesn't distribute
> > uniformly
> > > > over
> > > > > > > lexicographic buckets though so the scans are unlikely to
> execute
> > > at
> > > > a
> > > > > > > constant rate and you'll get results in time proportional to
> the
> > > > > > > lexicographic bucket with the highest cardinality in the
> region.
> > > I'd
> > > > be
> > > > > > > interested to know if anyone on the list has ever tried this
> and
> > > what
> > > > > the
> > > > > > > results were?
> > > > > > >
> > > > > > >
> > > > > > > Using the much simpler approach of parallelising over regions
> by
> > > > > creating
> > > > > > > multiple disjoint scans client side, as suggested, your
> > performance
> > > > now
> > > > > > > depends on your regions which you have some control over. You
> can
> > > > > achieve
> > > > > > > the same effect by pre-splitting your table such that you
> > > empirically
> > > > > > > optimise read performance for the dataset you store.
> > > > > > >
> > > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > > > Richard
> > > > > > >
> > > > > > >
> > > > > > > 
> > > > > > > From: Anil 
> > > > > > > Sent: 20 February 2017 12:35
> > > > > > > To: user@hbase.apache.org
> > > > > > > Subject: Re: Parallel Scanner
> > > > > > >
> > > > > > > Thanks Richard.
> > > > > > >
> > > > > > > I am able to get the regions for data to be loaded from table.
> I
> > am
> > > > > > trying
> > > > > > > to scan a region in parallel :)
> > > > > > >
> > > > > > > Thanks
> > > > > > >
> > > > > > > On 20 February 2017 at 16:44, Richard Startin <
> > > > > > 

Re: Parallel Scanner

2017-02-20 Thread Ted Yu
Among the 5 columns, do you know roughly the data distribution ?

You should put the columns whose data distribution is relatively even
first. Of course, there may be business requirements which you take into
consideration w.r.t. the composite key.

If you cannot change the schema, do you have control over the region size ?
A smaller region may lower the variance in data distribution per region.

On Mon, Feb 20, 2017 at 7:47 AM, Anil  wrote:

> Hi Ted,
>
> Current region size is 10 GB.
>
> Hbase row key designed like a phoenix primary key. I can say it is like 5
> column composite key. Prefix for a common set of data would have same first
> prefix. I am not sure how to convey the data distribution.
>
> Thanks.
>
> On 20 February 2017 at 20:48, Ted Yu  wrote:
>
> > Anil:
> > What's the current region size you use ?
> >
> > Given a region, do you have some idea how the data is distributed within
> > the region ?
> >
> > Cheers
> >
> > On Mon, Feb 20, 2017 at 7:14 AM, Anil  wrote:
> >
> > > i understand my original post now :)  Sorry about that.
> > >
> > > now the challenge is to split a start key and end key at client side to
> > > allow parallel scans on table with no buckets, pre-salting.
> > >
> > > Thanks.
> > >
> > > On 20 February 2017 at 20:21, ramkrishna vasudevan <
> > > ramkrishna.s.vasude...@gmail.com> wrote:
> > >
> > > > You are trying to scan one region itself in parallel, then even I got
> > you
> > > > wrong. Richard's suggestion is the right choice for client only soln.
> > > >
> > > > On Mon, Feb 20, 2017 at 7:40 PM, Anil  wrote:
> > > >
> > > > > Thanks Richard :)
> > > > >
> > > > > On 20 February 2017 at 18:56, Richard Startin <
> > > > richardstar...@outlook.com>
> > > > > wrote:
> > > > >
> > > > > > RegionLocator is not deprecated, hence the suggestion to use it
> if
> > > it's
> > > > > > available in place of whatever is still available on HTable for
> > your
> > > > > > version of HBase - it will make upgrades easier. For instance
> > > > > > HTable::getRegionsInRange no longer exists on the current master
> > > > branch.
> > > > > >
> > > > > >
> > > > > > "I am trying to scan a region in parallel :)"
> > > > > >
> > > > > >
> > > > > > I thought you were asking about scanning many regions at the same
> > > time,
> > > > > > not scanning a single region in parallel? HBASE-1935 is about
> > > > > parallelising
> > > > > > scans over regions, not within regions.
> > > > > >
> > > > > >
> > > > > > If you want to parallelise within a region, you could write a
> > little
> > > > > > method to split the first and last key of the region into several
> > > > > disjoint
> > > > > > lexicographic buckets and create a scan for each bucket, then
> > execute
> > > > > those
> > > > > > scans in parallel. Your data probably doesn't distribute
> uniformly
> > > over
> > > > > > lexicographic buckets though so the scans are unlikely to execute
> > at
> > > a
> > > > > > constant rate and you'll get results in time proportional to the
> > > > > > lexicographic bucket with the highest cardinality in the region.
> > I'd
> > > be
> > > > > > interested to know if anyone on the list has ever tried this and
> > what
> > > > the
> > > > > > results were?
> > > > > >
> > > > > >
> > > > > > Using the much simpler approach of parallelising over regions by
> > > > creating
> > > > > > multiple disjoint scans client side, as suggested, your
> performance
> > > now
> > > > > > depends on your regions which you have some control over. You can
> > > > achieve
> > > > > > the same effect by pre-splitting your table such that you
> > empirically
> > > > > > optimise read performance for the dataset you store.
> > > > > >
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Richard
> > > > > >
> > > > > >
> > > > > > 
> > > > > > From: Anil 
> > > > > > Sent: 20 February 2017 12:35
> > > > > > To: user@hbase.apache.org
> > > > > > Subject: Re: Parallel Scanner
> > > > > >
> > > > > > Thanks Richard.
> > > > > >
> > > > > > I am able to get the regions for data to be loaded from table. I
> am
> > > > > trying
> > > > > > to scan a region in parallel :)
> > > > > >
> > > > > > Thanks
> > > > > >
> > > > > > On 20 February 2017 at 16:44, Richard Startin <
> > > > > richardstar...@outlook.com>
> > > > > > wrote:
> > > > > >
> > > > > > > For a client only solution, have you looked at the
> RegionLocator
> > > > > > > interface? It gives you a list of pairs of byte[] (the start
> and
> > > stop
> > > > > > keys
> > > > > > > for each region). You can easily use a ForkJoinPool recursive
> > task
> > > or
> > > > > > java
> > > > > > > 8 parallel stream over that list. I implemented a spark RDD to
> do
> > > > that
> > > > > > and
> > > > > > > wrote about it with code samples here:
> > > > > > >
> > > > > > > https://richardstartin.com/2016/11/07/co-locating-spark-
> > > > > >
> > > > 

Re: Parallel Scanner

2017-02-20 Thread Anil
Hi Ted,

Current region size is 10 GB.

The HBase row key is designed like a Phoenix primary key; it is a 5-column
composite key. Records in a common set of data share the same first
prefix. I am not sure how else to convey the data distribution.

Thanks.

On 20 February 2017 at 20:48, Ted Yu  wrote:

> Anil:
> What's the current region size you use ?
>
> Given a region, do you have some idea how the data is distributed within
> the region ?
>
> Cheers
>
> On Mon, Feb 20, 2017 at 7:14 AM, Anil  wrote:
>
> > i understand my original post now :)  Sorry about that.
> >
> > now the challenge is to split a start key and end key at client side to
> > allow parallel scans on table with no buckets, pre-salting.
> >
> > Thanks.
> >
> > On 20 February 2017 at 20:21, ramkrishna vasudevan <
> > ramkrishna.s.vasude...@gmail.com> wrote:
> >
> > > You are trying to scan one region itself in parallel, then even I got
> you
> > > wrong. Richard's suggestion is the right choice for client only soln.
> > >
> > > On Mon, Feb 20, 2017 at 7:40 PM, Anil  wrote:
> > >
> > > > Thanks Richard :)
> > > >
> > > > On 20 February 2017 at 18:56, Richard Startin <
> > > richardstar...@outlook.com>
> > > > wrote:
> > > >
> > > > > RegionLocator is not deprecated, hence the suggestion to use it if
> > it's
> > > > > available in place of whatever is still available on HTable for
> your
> > > > > version of HBase - it will make upgrades easier. For instance
> > > > > HTable::getRegionsInRange no longer exists on the current master
> > > branch.
> > > > >
> > > > >
> > > > > "I am trying to scan a region in parallel :)"
> > > > >
> > > > >
> > > > > I thought you were asking about scanning many regions at the same
> > time,
> > > > > not scanning a single region in parallel? HBASE-1935 is about
> > > > parallelising
> > > > > scans over regions, not within regions.
> > > > >
> > > > >
> > > > > If you want to parallelise within a region, you could write a
> little
> > > > > method to split the first and last key of the region into several
> > > > disjoint
> > > > > lexicographic buckets and create a scan for each bucket, then
> execute
> > > > those
> > > > > scans in parallel. Your data probably doesn't distribute uniformly
> > over
> > > > > lexicographic buckets though so the scans are unlikely to execute
> at
> > a
> > > > > constant rate and you'll get results in time proportional to the
> > > > > lexicographic bucket with the highest cardinality in the region.
> I'd
> > be
> > > > > interested to know if anyone on the list has ever tried this and
> what
> > > the
> > > > > results were?
> > > > >
> > > > >
> > > > > Using the much simpler approach of parallelising over regions by
> > > creating
> > > > > multiple disjoint scans client side, as suggested, your performance
> > now
> > > > > depends on your regions which you have some control over. You can
> > > achieve
> > > > > the same effect by pre-splitting your table such that you
> empirically
> > > > > optimise read performance for the dataset you store.
> > > > >
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Richard
> > > > >
> > > > >
> > > > > 
> > > > > From: Anil 
> > > > > Sent: 20 February 2017 12:35
> > > > > To: user@hbase.apache.org
> > > > > Subject: Re: Parallel Scanner
> > > > >
> > > > > Thanks Richard.
> > > > >
> > > > > I am able to get the regions for data to be loaded from table. I am
> > > > trying
> > > > > to scan a region in parallel :)
> > > > >
> > > > > Thanks
> > > > >
> > > > > On 20 February 2017 at 16:44, Richard Startin <
> > > > richardstar...@outlook.com>
> > > > > wrote:
> > > > >
> > > > > > For a client only solution, have you looked at the RegionLocator
> > > > > > interface? It gives you a list of pairs of byte[] (the start and
> > stop
> > > > > keys
> > > > > > for each region). You can easily use a ForkJoinPool recursive
> task
> > or
> > > > > java
> > > > > > 8 parallel stream over that list. I implemented a spark RDD to do
> > > that
> > > > > and
> > > > > > wrote about it with code samples here:
> > > > > >
> > > > > > https://richardstartin.com/2016/11/07/co-locating-spark-
> > > > >
> > > > > > partitions-with-hbase-regions/
> > > > > >
> > > > > > Forget about the spark details in the post (and forget that
> > > Hortonworks
> > > > > > have a library to do the same thing :)) the idea of creating one
> > scan
> > > > per
> > > > > > region and setting scan starts and stops from the region locator
> > > would
> > > > > give
> > > > > > you a parallel scan. Note you can also group the scans by region
> > > > server.
> > > > > >
> > > > > > Cheers,
> > > > > > Richard
> > > > > > On 20 Feb 2017, at 07:33, Anil  > > > > > lk...@gmail.com>> wrote:
> > > > > >
> > > > > > Thanks Ram. I will look into EndPoints.
> > > > > >
> > > > > > On 20 February 2017 at 12:29, ramkrishna vasudevan <
> > > 

Re: Parallel Scanner

2017-02-20 Thread Ted Yu
Anil:
What's the current region size you use ?

Given a region, do you have some idea how the data is distributed within
the region ?

Cheers

On Mon, Feb 20, 2017 at 7:14 AM, Anil  wrote:

> i understand my original post now :)  Sorry about that.
>
> now the challenge is to split a start key and end key at client side to
> allow parallel scans on table with no buckets, pre-salting.
>
> Thanks.
>
> On 20 February 2017 at 20:21, ramkrishna vasudevan <
> ramkrishna.s.vasude...@gmail.com> wrote:
>
> > You are trying to scan one region itself in parallel, then even I got you
> > wrong. Richard's suggestion is the right choice for client only soln.
> >
> > On Mon, Feb 20, 2017 at 7:40 PM, Anil  wrote:
> >
> > > Thanks Richard :)
> > >
> > > On 20 February 2017 at 18:56, Richard Startin <
> > richardstar...@outlook.com>
> > > wrote:
> > >
> > > > RegionLocator is not deprecated, hence the suggestion to use it if
> it's
> > > > available in place of whatever is still available on HTable for your
> > > > version of HBase - it will make upgrades easier. For instance
> > > > HTable::getRegionsInRange no longer exists on the current master
> > branch.
> > > >
> > > >
> > > > "I am trying to scan a region in parallel :)"
> > > >
> > > >
> > > > I thought you were asking about scanning many regions at the same
> time,
> > > > not scanning a single region in parallel? HBASE-1935 is about
> > > parallelising
> > > > scans over regions, not within regions.
> > > >
> > > >
> > > > If you want to parallelise within a region, you could write a little
> > > > method to split the first and last key of the region into several
> > > disjoint
> > > > lexicographic buckets and create a scan for each bucket, then execute
> > > those
> > > > scans in parallel. Your data probably doesn't distribute uniformly
> over
> > > > lexicographic buckets though so the scans are unlikely to execute at
> a
> > > > constant rate and you'll get results in time proportional to the
> > > > lexicographic bucket with the highest cardinality in the region. I'd
> be
> > > > interested to know if anyone on the list has ever tried this and what
> > the
> > > > results were?
> > > >
> > > >
> > > > Using the much simpler approach of parallelising over regions by
> > creating
> > > > multiple disjoint scans client side, as suggested, your performance
> now
> > > > depends on your regions which you have some control over. You can
> > achieve
> > > > the same effect by pre-splitting your table such that you empirically
> > > > optimise read performance for the dataset you store.
> > > >
> > > >
> > > > Thanks,
> > > >
> > > > Richard
> > > >
> > > >
> > > > 
> > > > From: Anil 
> > > > Sent: 20 February 2017 12:35
> > > > To: user@hbase.apache.org
> > > > Subject: Re: Parallel Scanner
> > > >
> > > > Thanks Richard.
> > > >
> > > > I am able to get the regions for data to be loaded from table. I am
> > > trying
> > > > to scan a region in parallel :)
> > > >
> > > > Thanks
> > > >
> > > > On 20 February 2017 at 16:44, Richard Startin <
> > > richardstar...@outlook.com>
> > > > wrote:
> > > >
> > > > > For a client only solution, have you looked at the RegionLocator
> > > > > interface? It gives you a list of pairs of byte[] (the start and
> stop
> > > > keys
> > > > > for each region). You can easily use a ForkJoinPool recursive task
> or
> > > > java
> > > > > 8 parallel stream over that list. I implemented a spark RDD to do
> > that
> > > > and
> > > > > wrote about it with code samples here:
> > > > >
> > > > > https://richardstartin.com/2016/11/07/co-locating-spark-
> > > >
> > > > > partitions-with-hbase-regions/
> > > > >
> > > > > Forget about the spark details in the post (and forget that
> > Hortonworks
> > > > > have a library to do the same thing :)) the idea of creating one
> scan
> > > per
> > > > > region and setting scan starts and stops from the region locator
> > would
> > > > give
> > > > > you a parallel scan. Note you can also group the scans by region
> > > server.
> > > > >
> > > > > Cheers,
> > > > > Richard
> > > > > On 20 Feb 2017, at 07:33, Anil  > > > > lk...@gmail.com>> wrote:
> > > > >
> > > > > Thanks Ram. I will look into EndPoints.
> > > > >
> > > > > On 20 February 2017 at 12:29, ramkrishna vasudevan <
> > > > > ramkrishna.s.vasude...@gmail.com > > vasude...@gmail.com
> > > > >>
> > > > > wrote:
> > > > >
> > > > > Yes. There is way.
> > > > >
> > > > > Have you seen Endpoints? Endpoints are triggers like points that
> > allows
> > > > > your client to trigger them parallely in one ore more regions using
> > the
> > > > > start and end key of the region. This executes parallely and then
> you
> > > may
> > > > > have to sort out the results as per your need.
> > > > >
> > > > > But these endpoints have to running on your region servers and it
> is
> > > not
> > > > a
> > > > > 
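[Editor's note: Richard's client-only suggestion above — one disjoint scan per region, executed in parallel — boils down to a pattern that can be sketched with just the JDK. In this sketch, the `startStopKeys` list stands in for what `RegionLocator#getStartEndKeys()` would return, and `scanFn` is a hypothetical placeholder for running an HBase `Scan` bounded by the pair; only the parallel-execution shape is shown, not the actual hbase-client calls.]

```java
import java.util.List;
import java.util.function.BiFunction;
import java.util.stream.Collectors;

public class ParallelRegionScan {

    /**
     * Run one "scan" per region in parallel and concatenate the results
     * in region order. In a real client, startStopKeys would come from
     * RegionLocator#getStartEndKeys() and scanFn would execute a Scan
     * with setStartRow/setStopRow taken from the pair.
     */
    public static <R> List<R> scanAll(List<byte[][]> startStopKeys,
                                      BiFunction<byte[], byte[], List<R>> scanFn) {
        return startStopKeys.parallelStream()          // fork-join pool under the hood
                .map(range -> scanFn.apply(range[0], range[1]))
                .flatMap(List::stream)                 // concatenate per-region results
                .collect(Collectors.toList());         // encounter order is preserved
    }
}
```

Because an ordered `parallelStream()` preserves encounter order through `collect`, the results come back concatenated in region order. Grouping the ranges by region server first, as Richard notes, would be a further refinement.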

Re: Parallel Scanner

2017-02-20 Thread Anil
I understand my original post now :)  Sorry about that.

Now the challenge is to split a start key and end key on the client side to
allow parallel scans on a table with no buckets or pre-salting.

Thanks.
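Splitting a region's [startKey, stopKey) range into disjoint lexicographic buckets at the client, as discussed in this thread, can be sketched as below. This is a hypothetical helper, not an HBase API: the class and method names are invented for illustration. It pads both endpoints to a common width, interpolates between them with BigInteger, and returns bucket boundaries; on real data the buckets are lexicographically disjoint but rarely equal in row count.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RangeSplitter {

    // Split [start, stop) into n disjoint lexicographic sub-ranges.
    // Keys are right-padded with 0x00 to a common width before interpolation,
    // mirroring how HBase compares row keys byte by byte.
    static List<byte[][]> split(byte[] start, byte[] stop, int n) {
        int width = Math.max(start.length, stop.length) + 1; // spare byte for finer cuts
        BigInteger lo = new BigInteger(1, pad(start, width));
        BigInteger hi = new BigInteger(1, pad(stop, width));
        List<byte[][]> buckets = new ArrayList<>();
        byte[] prev = start;
        for (int i = 1; i <= n; i++) {
            byte[] next = (i == n)
                ? stop
                : toKey(lo.add(hi.subtract(lo).multiply(BigInteger.valueOf(i))
                                 .divide(BigInteger.valueOf(n))), width);
            buckets.add(new byte[][] { prev, next });
            prev = next;
        }
        return buckets;
    }

    static byte[] pad(byte[] key, int width) {
        return Arrays.copyOf(key, width); // zero-fills to the right
    }

    // Render a BigInteger back into a fixed-width, right-aligned key.
    static byte[] toKey(BigInteger v, int width) {
        byte[] raw = v.toByteArray(); // may carry a leading sign byte
        byte[] out = new byte[width];
        int srcPos = Math.max(0, raw.length - width);
        int dstPos = Math.max(0, width - raw.length);
        System.arraycopy(raw, srcPos, out, dstPos, Math.min(raw.length, width));
        return out;
    }
}
```

A scan could then be created per bucket and executed in parallel, with Richard's caveat below in mind: skewed key distributions make the buckets finish at very different rates.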


Re: Parallel Scanner

2017-02-20 Thread ramkrishna vasudevan
If you are trying to scan one region itself in parallel, then even I got you
wrong. Richard's suggestion is the right choice for a client-only solution.

On Mon, Feb 20, 2017 at 7:40 PM, Anil  wrote:

> Thanks Richard :)
>
> On 20 February 2017 at 18:56, Richard Startin 
> wrote:
>
> > RegionLocator is not deprecated, hence the suggestion to use it if it's
> > available in place of whatever is still available on HTable for your
> > version of HBase - it will make upgrades easier. For instance
> > HTable::getRegionsInRange no longer exists on the current master branch.
> >
> >
> > "I am trying to scan a region in parallel :)"
> >
> >
> > I thought you were asking about scanning many regions at the same time,
> > not scanning a single region in parallel? HBASE-1935 is about
> parallelising
> > scans over regions, not within regions.
> >
> >
> > If you want to parallelise within a region, you could write a little
> > method to split the first and last key of the region into several
> disjoint
> > lexicographic buckets and create a scan for each bucket, then execute
> those
> > scans in parallel. Your data probably doesn't distribute uniformly over
> > lexicographic buckets though so the scans are unlikely to execute at a
> > constant rate and you'll get results in time proportional to the
> > lexicographic bucket with the highest cardinality in the region. I'd be
> > interested to know if anyone on the list has ever tried this and what the
> > results were?
> >
> >
> > Using the much simpler approach of parallelising over regions by creating
> > multiple disjoint scans client side, as suggested, your performance now
> > depends on your regions which you have some control over. You can achieve
> > the same effect by pre-splitting your table such that you empirically
> > optimise read performance for the dataset you store.
> >
> >
> > Thanks,
> >
> > Richard

Re: Parallel Scanner

2017-02-20 Thread Anil
Thanks Richard :)

Re: Parallel Scanner

2017-02-20 Thread Anil
Thanks Richard.

I am able to get the regions for the data to be loaded from the table. I am
trying to scan a region in parallel :)

Thanks



Re: Parallel Scanner

2017-02-20 Thread Richard Startin
For a client-only solution, have you looked at the RegionLocator interface? It
gives you a list of pairs of byte[] (the start and stop keys for each region).
You can easily use a ForkJoinPool recursive task or a Java 8 parallel stream
over that list. I implemented a Spark RDD to do that and wrote about it with
code samples here:

https://richardstartin.com/2016/11/07/co-locating-spark-partitions-with-hbase-regions/

Forget about the Spark details in the post (and forget that Hortonworks have a
library to do the same thing :)): the idea of creating one scan per region and
setting scan starts and stops from the region locator would give you a parallel
scan. Note you can also group the scans by region server.

Cheers,
Richard
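The idea above — one scan per region, clipped to the requested range — can be sketched without a cluster. In real code the boundary pairs would come from RegionLocator#getStartEndKeys() and each pair would become a Scan with its start and stop rows set; the class below (a made-up name, with the HBase calls stubbed out) shows only the client-side bookkeeping, where an empty key stands for the start or end of the table, as in HBase.

```java
import java.util.ArrayList;
import java.util.List;

public class RegionRanges {

    // Unsigned lexicographic comparison, as HBase orders row keys.
    static int cmp(byte[] a, byte[] b) {
        for (int i = 0; i < Math.min(a.length, b.length); i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }

    // regionStarts/regionStops model RegionLocator#getStartEndKeys().
    // Returns one [start, stop) pair per region overlapping [reqStart, reqStop);
    // an empty byte[] means "start of table" / "end of table".
    static List<byte[][]> scanRanges(byte[][] regionStarts, byte[][] regionStops,
                                     byte[] reqStart, byte[] reqStop) {
        List<byte[][]> ranges = new ArrayList<>();
        for (int i = 0; i < regionStarts.length; i++) {
            byte[] rs = regionStarts[i], re = regionStops[i];
            // Skip regions entirely before or after the requested range.
            if (re.length > 0 && cmp(re, reqStart) <= 0) continue;
            if (reqStop.length > 0 && rs.length > 0 && cmp(rs, reqStop) >= 0) continue;
            // Clip the region's boundaries to the requested range.
            byte[] start = (rs.length == 0 || cmp(reqStart, rs) > 0) ? reqStart : rs;
            byte[] stop = (re.length == 0 || (reqStop.length > 0 && cmp(reqStop, re) < 0))
                    ? reqStop : re;
            ranges.add(new byte[][] { start, stop });
        }
        return ranges;
    }
}
```

Each resulting range can then become one scan task, for example submitted to an ExecutorService or iterated with a parallel stream, and grouped by region server if desired.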
On 20 Feb 2017, at 07:33, Anil <lk...@gmail.com> wrote:

Thanks Ram. I will look into EndPoints.

On 20 February 2017 at 12:29, ramkrishna vasudevan <
ramkrishna.s.vasude...@gmail.com> 
wrote:

Yes, there is a way.

Have you seen Endpoints? Endpoints are trigger-like hooks that allow your
client to invoke them in parallel on one or more regions using the start and
end key of each region. They execute in parallel, and then you may have to
sort out the results as per your need.

But these endpoints have to be running on your region servers, so it is not a
client-only solution.
https://blogs.apache.org/hbase/entry/coprocessor_introduction.

Be careful when you use them. Since these endpoints run on the server, ensure
that they are not heavy and do not consume too much memory, which can have
adverse effects on the server.


Regards
Ram

On Mon, Feb 20, 2017 at 12:18 PM, Anil <lk...@gmail.com> wrote:

Thanks Ram.

So, you mean that there is no harm in using HTable#getRegionsInRange in the
application code.

HTable#getRegionsInRange returned a single entry for all my region start and
end keys. I need to explore more on this.

"If you know the table region's start and end keys you could create parallel
scans in your application code." - is there any way to scan a region in the
application code other than the one I put in the original email?

"One thing to watch out is that if there is a split in the region then this
start and end row may change so in that case it is better you try to get the
regions every time before you issue a scan"
- Agree. I am dynamically determining the region start key and end key before
initiating scan operations for every initial load.

Thanks.




On 20 February 2017 at 10:59, ramkrishna vasudevan <
ramkrishna.s.vasude...@gmail.com> 
wrote:

Hi Anil,

HBase directly does not provide parallel scans. If you know the table
region's start and end keys you could create parallel scans in your
application code.

In the above code snippet, the intent is right - you get the required
regions and can issue parallel scans from your app.

One thing to watch out for is that if there is a split in the region, the
start and end rows may change, so it is better to fetch the regions again
every time before you issue a scan. Does that make sense to you?

Regards
Ram

On Sat, Feb 18, 2017 at 1:44 PM, Anil <lk...@gmail.com> wrote:

Hi,

I am building a use case where I have to load HBase data into an in-memory
database (IMDB). I am scanning each region and loading its data into the IMDB.

I am looking at the parallel scanner work
(https://issues.apache.org/jira/browse/HBASE-8504, HBASE-1935) to reduce the
load time, but HTable#getRegionsInRange(byte[] startKey, byte[] endKey,
boolean reload) is deprecated and HBASE-1935 is still open.

I see that the Connection from ConnectionFactory is HConnectionImplementation
by default, and it creates HTable instances.

Do you see any issues in using the HTable from the Table instance?

    int i = 0;
    List<HRegionLocation> regions =
        hTable.getRegionsInRange(scans.getStartRow(), scans.getStopRow(), true);
    for (HRegionLocation region : regions) {
        byte[] startRow = i == 0 ? scans.getStartRow()
                                 : region.getRegionInfo().getStartKey();
        i++;
        byte[] endRow = i == regions.size() ? scans.getStopRow()
                                            : region.getRegionInfo().getEndKey();
        // issue one scan per [startRow, endRow) range here
    }

Are there any alternatives to achieve a parallel scan? Thanks.

Thanks