Re: Parallel Scanner

2017-02-20 Thread Anil
Hi Ted,

It's very difficult to predict the data distribution. We store parent-to-child
relationships in the table. (Note: a parent record is a child of itself.)

We set the max HRegion file size to 10 GB. I don't think we have any control
over the region size :(

Thanks


On 20 February 2017 at 21:24, Ted Yu <yuzhih...@gmail.com> wrote:

> Among the 5 columns, do you know roughly the data distribution ?
>
> You should put the columns whose data distribution is relatively even
> first. Of course, there may be business requirement which you take into
> consideration w.r.t. the composite key.
>
> If you cannot change the schema, do you have control over the region size ?
> Smaller region may lower the variance in data distribution per region.
>
> On Mon, Feb 20, 2017 at 7:47 AM, Anil <anilk...@gmail.com> wrote:
>
> > Hi Ted,
> >
> > Current region size is 10 GB.
> >
> > Hbase row key designed like a phoenix primary key. I can say it is like 5
> > column composite key. Prefix for a common set of data would have same
> first
> > prefix. I am not sure how to convey the data distribution.
> >
> > Thanks.
> >
> > On 20 February 2017 at 20:48, Ted Yu <yuzhih...@gmail.com> wrote:
> >
> > > Anil:
> > > What's the current region size you use ?
> > >
> > > Given a region, do you have some idea how the data is distributed
> within
> > > the region ?
> > >
> > > Cheers
> > >
> > > On Mon, Feb 20, 2017 at 7:14 AM, Anil <anilk...@gmail.com> wrote:
> > >
> > > > i understand my original post now :)  Sorry about that.
> > > >
> > > > now the challenge is to split a start key and end key at client side
> to
> > > > allow parallel scans on table with no buckets, pre-salting.
> > > >
> > > > Thanks.
> > > >
> > > > On 20 February 2017 at 20:21, ramkrishna vasudevan <
> > > > ramkrishna.s.vasude...@gmail.com> wrote:
> > > >
> > > > > You are trying to scan one region itself in parallel, then even I
> got
> > > you
> > > > > wrong. Richard's suggestion is the right choice for client only
> soln.
> > > > >
> > > > > On Mon, Feb 20, 2017 at 7:40 PM, Anil <anilk...@gmail.com> wrote:
> > > > >
> > > > > > Thanks Richard :)
> > > > > >
> > > > > > On 20 February 2017 at 18:56, Richard Startin <
> > > > > richardstar...@outlook.com>
> > > > > > wrote:
> > > > > >
> > > > > > > RegionLocator is not deprecated, hence the suggestion to use it
> > if
> > > > it's
> > > > > > > available in place of whatever is still available on HTable for
> > > your
> > > > > > > version of HBase - it will make upgrades easier. For instance
> > > > > > > HTable::getRegionsInRange no longer exists on the current
> master
> > > > > branch.
> > > > > > >
> > > > > > >
> > > > > > > "I am trying to scan a region in parallel :)"
> > > > > > >
> > > > > > >
> > > > > > > I thought you were asking about scanning many regions at the
> same
> > > > time,
> > > > > > > not scanning a single region in parallel? HBASE-1935 is about
> > > > > > parallelising
> > > > > > > scans over regions, not within regions.
> > > > > > >
> > > > > > >
> > > > > > > If you want to parallelise within a region, you could write a
> > > little
> > > > > > > method to split the first and last key of the region into
> several
> > > > > > disjoint
> > > > > > > lexicographic buckets and create a scan for each bucket, then
> > > execute
> > > > > > those
> > > > > > > scans in parallel. Your data probably doesn't distribute
> > uniformly
> > > > over
> > > > > > > lexicographic buckets though so the scans are unlikely to
> execute
> > > at
> > > > a
> > > > > > > constant rate and you'll get results in time proportional to
> the
> > > > > > > lexicographic bucket with the highest cardinality in the
> region.
> > > I'd
> > > > be
> > > > > > > interested to know 

Re: Parallel Scanner

2017-02-20 Thread Ted Yu
Among the 5 columns, do you know roughly the data distribution ?

You should put the columns whose data distribution is relatively even first.
Of course, there may be business requirements which you take into
consideration w.r.t. the composite key.

If you cannot change the schema, do you have control over the region size?
Smaller regions may lower the variance in data distribution per region.
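If changing the table is an option, pre-splitting it at creation time is the usual
way to control how data spreads over regions. A minimal sketch using the 1.x-era
admin API, with a hypothetical table name, column family, and split points (real
split points should be derived from the observed key distribution):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitTableSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("relationships"));
            desc.addFamily(new HColumnDescriptor("d"));
            // Illustrative split points only -- pick them so each region holds a
            // similar share of the real keys, which is what evens out parallel scans.
            byte[][] splits = {
                Bytes.toBytes("2"), Bytes.toBytes("4"),
                Bytes.toBytes("6"), Bytes.toBytes("8")
            };
            admin.createTable(desc, splits);
        }
    }
}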

On Mon, Feb 20, 2017 at 7:47 AM, Anil <anilk...@gmail.com> wrote:

> Hi Ted,
>
> Current region size is 10 GB.
>
> Hbase row key designed like a phoenix primary key. I can say it is like 5
> column composite key. Prefix for a common set of data would have same first
> prefix. I am not sure how to convey the data distribution.
>
> Thanks.
>
> On 20 February 2017 at 20:48, Ted Yu <yuzhih...@gmail.com> wrote:
>
> > Anil:
> > What's the current region size you use ?
> >
> > Given a region, do you have some idea how the data is distributed within
> > the region ?
> >
> > Cheers
> >
> > On Mon, Feb 20, 2017 at 7:14 AM, Anil <anilk...@gmail.com> wrote:
> >
> > > i understand my original post now :)  Sorry about that.
> > >
> > > now the challenge is to split a start key and end key at client side to
> > > allow parallel scans on table with no buckets, pre-salting.
> > >
> > > Thanks.
> > >
> > > On 20 February 2017 at 20:21, ramkrishna vasudevan <
> > > ramkrishna.s.vasude...@gmail.com> wrote:
> > >
> > > > You are trying to scan one region itself in parallel, then even I got
> > you
> > > > wrong. Richard's suggestion is the right choice for client only soln.
> > > >
> > > > On Mon, Feb 20, 2017 at 7:40 PM, Anil <anilk...@gmail.com> wrote:
> > > >
> > > > > Thanks Richard :)
> > > > >
> > > > > On 20 February 2017 at 18:56, Richard Startin <
> > > > richardstar...@outlook.com>
> > > > > wrote:
> > > > >
> > > > > > RegionLocator is not deprecated, hence the suggestion to use it
> if
> > > it's
> > > > > > available in place of whatever is still available on HTable for
> > your
> > > > > > version of HBase - it will make upgrades easier. For instance
> > > > > > HTable::getRegionsInRange no longer exists on the current master
> > > > branch.
> > > > > >
> > > > > >
> > > > > > "I am trying to scan a region in parallel :)"
> > > > > >
> > > > > >
> > > > > > I thought you were asking about scanning many regions at the same
> > > time,
> > > > > > not scanning a single region in parallel? HBASE-1935 is about
> > > > > parallelising
> > > > > > scans over regions, not within regions.
> > > > > >
> > > > > >
> > > > > > If you want to parallelise within a region, you could write a
> > little
> > > > > > method to split the first and last key of the region into several
> > > > > disjoint
> > > > > > lexicographic buckets and create a scan for each bucket, then
> > execute
> > > > > those
> > > > > > scans in parallel. Your data probably doesn't distribute
> uniformly
> > > over
> > > > > > lexicographic buckets though so the scans are unlikely to execute
> > at
> > > a
> > > > > > constant rate and you'll get results in time proportional to the
> > > > > > lexicographic bucket with the highest cardinality in the region.
> > I'd
> > > be
> > > > > > interested to know if anyone on the list has ever tried this and
> > what
> > > > the
> > > > > > results were?
> > > > > >
> > > > > >
> > > > > > Using the much simpler approach of parallelising over regions by
> > > > creating
> > > > > > multiple disjoint scans client side, as suggested, your
> performance
> > > now
> > > > > > depends on your regions which you have some control over. You can
> > > > achieve
> > > > > > the same effect by pre-splitting your table such that you
> > empirically
> > > > > > optimise read performance for the dataset you store.
> > > > > >
> > > > > >
> > > > > > Thanks,
>

Re: Parallel Scanner

2017-02-20 Thread Anil
Hi Ted,

Current region size is 10 GB.

The HBase row key is designed like a Phoenix primary key; it is roughly a
5-column composite key. A common set of data shares the same first prefix.
I am not sure how to convey the data distribution.

Thanks.

On 20 February 2017 at 20:48, Ted Yu <yuzhih...@gmail.com> wrote:

> Anil:
> What's the current region size you use ?
>
> Given a region, do you have some idea how the data is distributed within
> the region ?
>
> Cheers
>
> On Mon, Feb 20, 2017 at 7:14 AM, Anil <anilk...@gmail.com> wrote:
>
> > i understand my original post now :)  Sorry about that.
> >
> > now the challenge is to split a start key and end key at client side to
> > allow parallel scans on table with no buckets, pre-salting.
> >
> > Thanks.
> >
> > On 20 February 2017 at 20:21, ramkrishna vasudevan <
> > ramkrishna.s.vasude...@gmail.com> wrote:
> >
> > > You are trying to scan one region itself in parallel, then even I got
> you
> > > wrong. Richard's suggestion is the right choice for client only soln.
> > >
> > > On Mon, Feb 20, 2017 at 7:40 PM, Anil <anilk...@gmail.com> wrote:
> > >
> > > > Thanks Richard :)
> > > >
> > > > On 20 February 2017 at 18:56, Richard Startin <
> > > richardstar...@outlook.com>
> > > > wrote:
> > > >
> > > > > RegionLocator is not deprecated, hence the suggestion to use it if
> > it's
> > > > > available in place of whatever is still available on HTable for
> your
> > > > > version of HBase - it will make upgrades easier. For instance
> > > > > HTable::getRegionsInRange no longer exists on the current master
> > > branch.
> > > > >
> > > > >
> > > > > "I am trying to scan a region in parallel :)"
> > > > >
> > > > >
> > > > > I thought you were asking about scanning many regions at the same
> > time,
> > > > > not scanning a single region in parallel? HBASE-1935 is about
> > > > parallelising
> > > > > scans over regions, not within regions.
> > > > >
> > > > >
> > > > > If you want to parallelise within a region, you could write a
> little
> > > > > method to split the first and last key of the region into several
> > > > disjoint
> > > > > lexicographic buckets and create a scan for each bucket, then
> execute
> > > > those
> > > > > scans in parallel. Your data probably doesn't distribute uniformly
> > over
> > > > > lexicographic buckets though so the scans are unlikely to execute
> at
> > a
> > > > > constant rate and you'll get results in time proportional to the
> > > > > lexicographic bucket with the highest cardinality in the region.
> I'd
> > be
> > > > > interested to know if anyone on the list has ever tried this and
> what
> > > the
> > > > > results were?
> > > > >
> > > > >
> > > > > Using the much simpler approach of parallelising over regions by
> > > creating
> > > > > multiple disjoint scans client side, as suggested, your performance
> > now
> > > > > depends on your regions which you have some control over. You can
> > > achieve
> > > > > the same effect by pre-splitting your table such that you
> empirically
> > > > > optimise read performance for the dataset you store.
> > > > >
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Richard
> > > > >
> > > > >
> > > > > 
> > > > > From: Anil <anilk...@gmail.com>
> > > > > Sent: 20 February 2017 12:35
> > > > > To: user@hbase.apache.org
> > > > > Subject: Re: Parallel Scanner
> > > > >
> > > > > Thanks Richard.
> > > > >
> > > > > I am able to get the regions for data to be loaded from table. I am
> > > > trying
> > > > > to scan a region in parallel :)
> > > > >
> > > > > Thanks
> > > > >
> > > > > On 20 February 2017 at 16:44, Richard Startin <
> > > > richardstar...@outlook.com>
> > > > > wrote:
> > > > >
> > > > > > For a client only solution, ha

Re: Parallel Scanner

2017-02-20 Thread Ted Yu
Anil:
What's the current region size you use ?

Given a region, do you have some idea how the data is distributed within
the region ?

Cheers

On Mon, Feb 20, 2017 at 7:14 AM, Anil <anilk...@gmail.com> wrote:

> i understand my original post now :)  Sorry about that.
>
> now the challenge is to split a start key and end key at client side to
> allow parallel scans on table with no buckets, pre-salting.
>
> Thanks.
>
> On 20 February 2017 at 20:21, ramkrishna vasudevan <
> ramkrishna.s.vasude...@gmail.com> wrote:
>
> > You are trying to scan one region itself in parallel, then even I got you
> > wrong. Richard's suggestion is the right choice for client only soln.
> >
> > On Mon, Feb 20, 2017 at 7:40 PM, Anil <anilk...@gmail.com> wrote:
> >
> > > Thanks Richard :)
> > >
> > > On 20 February 2017 at 18:56, Richard Startin <
> > richardstar...@outlook.com>
> > > wrote:
> > >
> > > > RegionLocator is not deprecated, hence the suggestion to use it if
> it's
> > > > available in place of whatever is still available on HTable for your
> > > > version of HBase - it will make upgrades easier. For instance
> > > > HTable::getRegionsInRange no longer exists on the current master
> > branch.
> > > >
> > > >
> > > > "I am trying to scan a region in parallel :)"
> > > >
> > > >
> > > > I thought you were asking about scanning many regions at the same
> time,
> > > > not scanning a single region in parallel? HBASE-1935 is about
> > > parallelising
> > > > scans over regions, not within regions.
> > > >
> > > >
> > > > If you want to parallelise within a region, you could write a little
> > > > method to split the first and last key of the region into several
> > > disjoint
> > > > lexicographic buckets and create a scan for each bucket, then execute
> > > those
> > > > scans in parallel. Your data probably doesn't distribute uniformly
> over
> > > > lexicographic buckets though so the scans are unlikely to execute at
> a
> > > > constant rate and you'll get results in time proportional to the
> > > > lexicographic bucket with the highest cardinality in the region. I'd
> be
> > > > interested to know if anyone on the list has ever tried this and what
> > the
> > > > results were?
> > > >
> > > >
> > > > Using the much simpler approach of parallelising over regions by
> > creating
> > > > multiple disjoint scans client side, as suggested, your performance
> now
> > > > depends on your regions which you have some control over. You can
> > achieve
> > > > the same effect by pre-splitting your table such that you empirically
> > > > optimise read performance for the dataset you store.
> > > >
> > > >
> > > > Thanks,
> > > >
> > > > Richard
> > > >
> > > >
> > > > 
> > > > From: Anil <anilk...@gmail.com>
> > > > Sent: 20 February 2017 12:35
> > > > To: user@hbase.apache.org
> > > > Subject: Re: Parallel Scanner
> > > >
> > > > Thanks Richard.
> > > >
> > > > I am able to get the regions for data to be loaded from table. I am
> > > trying
> > > > to scan a region in parallel :)
> > > >
> > > > Thanks
> > > >
> > > > On 20 February 2017 at 16:44, Richard Startin <
> > > richardstar...@outlook.com>
> > > > wrote:
> > > >
> > > > > For a client only solution, have you looked at the RegionLocator
> > > > > interface? It gives you a list of pairs of byte[] (the start and
> stop
> > > > keys
> > > > > for each region). You can easily use a ForkJoinPool recursive task
> or
> > > > java
> > > > > 8 parallel stream over that list. I implemented a spark RDD to do
> > that
> > > > and
> > > > > wrote about it with code samples here:
> > > > >
> > > > > https://richardstartin.com/2016/11/07/co-locating-spark-
> > > >
> > > > > partitions-with-hbase-regions/
> > > > >
> > > > > Forget about the spark details in the post (and forget that
> > Hortonworks
> > > > > have a lib

Re: Parallel Scanner

2017-02-20 Thread Anil
I understand my original post now :)  Sorry about that.

Now the challenge is to split a start key and end key on the client side to
allow parallel scans on a table with no buckets or pre-salting.
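
A rough sketch of one way to do that split client side (an illustration of the
"disjoint lexicographic buckets" idea Richard describes in the quoted mail
below; it interpolates on a fixed-width prefix of the keys, so the buckets are
approximate and only as balanced as the key distribution):

import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

public class KeyRangeSplitter {

    // Splits [startKey, stopKey) into at most n contiguous, disjoint sub-ranges by
    // interpolating on the first `width` bytes of the keys. Empty startKey/stopKey
    // mean "start"/"end" of the region, as HBase reports them.
    public static List<byte[][]> split(byte[] startKey, byte[] stopKey, int n) {
        final int width = 8;
        BigInteger lo = new BigInteger(1, pad(startKey, width));
        BigInteger hi = stopKey.length == 0
                ? BigInteger.ONE.shiftLeft(8 * width)
                : new BigInteger(1, pad(stopKey, width));
        List<byte[][]> ranges = new ArrayList<>();
        BigInteger step = hi.subtract(lo).divide(BigInteger.valueOf(n));
        if (step.signum() <= 0) {            // range too narrow to split further
            ranges.add(new byte[][] { startKey, stopKey });
            return ranges;
        }
        byte[] prev = startKey;
        for (int i = 1; i < n; i++) {
            byte[] boundary = toKey(lo.add(step.multiply(BigInteger.valueOf(i))), width);
            ranges.add(new byte[][] { prev, boundary });
            prev = boundary;
        }
        ranges.add(new byte[][] { prev, stopKey });
        return ranges;
    }

    private static byte[] pad(byte[] key, int width) {
        byte[] padded = new byte[width];                 // zero-filled by default
        System.arraycopy(key, 0, padded, 0, Math.min(key.length, width));
        return padded;
    }

    private static byte[] toKey(BigInteger value, int width) {
        byte[] raw = value.toByteArray();                // may include a sign byte
        byte[] key = new byte[width];
        int copy = Math.min(raw.length, width);
        System.arraycopy(raw, raw.length - copy, key, width - copy, copy);
        return key;
    }
}

Each resulting [start, stop) pair then becomes one Scan (setStartRow/setStopRow)
executed in its own thread; as noted below, skewed data will still make some
buckets much slower than others.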

Thanks.

On 20 February 2017 at 20:21, ramkrishna vasudevan <
ramkrishna.s.vasude...@gmail.com> wrote:

> You are trying to scan one region itself in parallel, then even I got you
> wrong. Richard's suggestion is the right choice for client only soln.
>
> On Mon, Feb 20, 2017 at 7:40 PM, Anil <anilk...@gmail.com> wrote:
>
> > Thanks Richard :)
> >
> > On 20 February 2017 at 18:56, Richard Startin <
> richardstar...@outlook.com>
> > wrote:
> >
> > > RegionLocator is not deprecated, hence the suggestion to use it if it's
> > > available in place of whatever is still available on HTable for your
> > > version of HBase - it will make upgrades easier. For instance
> > > HTable::getRegionsInRange no longer exists on the current master
> branch.
> > >
> > >
> > > "I am trying to scan a region in parallel :)"
> > >
> > >
> > > I thought you were asking about scanning many regions at the same time,
> > > not scanning a single region in parallel? HBASE-1935 is about
> > parallelising
> > > scans over regions, not within regions.
> > >
> > >
> > > If you want to parallelise within a region, you could write a little
> > > method to split the first and last key of the region into several
> > disjoint
> > > lexicographic buckets and create a scan for each bucket, then execute
> > those
> > > scans in parallel. Your data probably doesn't distribute uniformly over
> > > lexicographic buckets though so the scans are unlikely to execute at a
> > > constant rate and you'll get results in time proportional to the
> > > lexicographic bucket with the highest cardinality in the region. I'd be
> > > interested to know if anyone on the list has ever tried this and what
> the
> > > results were?
> > >
> > >
> > > Using the much simpler approach of parallelising over regions by
> creating
> > > multiple disjoint scans client side, as suggested, your performance now
> > > depends on your regions which you have some control over. You can
> achieve
> > > the same effect by pre-splitting your table such that you empirically
> > > optimise read performance for the dataset you store.
> > >
> > >
> > > Thanks,
> > >
> > > Richard
> > >
> > >
> > > 
> > > From: Anil <anilk...@gmail.com>
> > > Sent: 20 February 2017 12:35
> > > To: user@hbase.apache.org
> > > Subject: Re: Parallel Scanner
> > >
> > > Thanks Richard.
> > >
> > > I am able to get the regions for data to be loaded from table. I am
> > trying
> > > to scan a region in parallel :)
> > >
> > > Thanks
> > >
> > > On 20 February 2017 at 16:44, Richard Startin <
> > richardstar...@outlook.com>
> > > wrote:
> > >
> > > > For a client only solution, have you looked at the RegionLocator
> > > > interface? It gives you a list of pairs of byte[] (the start and stop
> > > keys
> > > > for each region). You can easily use a ForkJoinPool recursive task or
> > > java
> > > > 8 parallel stream over that list. I implemented a spark RDD to do
> that
> > > and
> > > > wrote about it with code samples here:
> > > >
> > > > https://richardstartin.com/2016/11/07/co-locating-spark-
> > >
> > > > partitions-with-hbase-regions/
> > > >
> > > > Forget about the spark details in the post (and forget that
> Hortonworks
> > > > have a library to do the same thing :)) the idea of creating one scan
> > per
> > > > region and setting scan starts and stops from the region locator
> would
> > > give
> > > > you a parallel scan. Note you can also group the scans by region
> > server.
> > > >
> > > > Cheers,
> > > > Richard
> > > > On 20 Feb 2017, at 07:33, Anil <anilk...@gmail.com<mailto:ani
> > > > lk...@gmail.com>> wrote:
> > > >
> > > > Thanks Ram. I will look into EndPoints.
> > > >
> > > > On 20 February 2017 at 12:29, ramkrishna vasudevan <
> > > > ramkrishna.s.vasude...@gmail.com<mailto:ramkrishna.s.
> > vasude...@gmail.com
>

Re: Parallel Scanner

2017-02-20 Thread ramkrishna vasudevan
If you are trying to scan one region itself in parallel, then I got you wrong
as well. Richard's suggestion is the right choice for a client-only solution.

On Mon, Feb 20, 2017 at 7:40 PM, Anil <anilk...@gmail.com> wrote:

> Thanks Richard :)
>
> On 20 February 2017 at 18:56, Richard Startin <richardstar...@outlook.com>
> wrote:
>
> > RegionLocator is not deprecated, hence the suggestion to use it if it's
> > available in place of whatever is still available on HTable for your
> > version of HBase - it will make upgrades easier. For instance
> > HTable::getRegionsInRange no longer exists on the current master branch.
> >
> >
> > "I am trying to scan a region in parallel :)"
> >
> >
> > I thought you were asking about scanning many regions at the same time,
> > not scanning a single region in parallel? HBASE-1935 is about
> parallelising
> > scans over regions, not within regions.
> >
> >
> > If you want to parallelise within a region, you could write a little
> > method to split the first and last key of the region into several
> disjoint
> > lexicographic buckets and create a scan for each bucket, then execute
> those
> > scans in parallel. Your data probably doesn't distribute uniformly over
> > lexicographic buckets though so the scans are unlikely to execute at a
> > constant rate and you'll get results in time proportional to the
> > lexicographic bucket with the highest cardinality in the region. I'd be
> > interested to know if anyone on the list has ever tried this and what the
> > results were?
> >
> >
> > Using the much simpler approach of parallelising over regions by creating
> > multiple disjoint scans client side, as suggested, your performance now
> > depends on your regions which you have some control over. You can achieve
> > the same effect by pre-splitting your table such that you empirically
> > optimise read performance for the dataset you store.
> >
> >
> > Thanks,
> >
> > Richard
> >
> >
> > 
> > From: Anil <anilk...@gmail.com>
> > Sent: 20 February 2017 12:35
> > To: user@hbase.apache.org
> > Subject: Re: Parallel Scanner
> >
> > Thanks Richard.
> >
> > I am able to get the regions for data to be loaded from table. I am
> trying
> > to scan a region in parallel :)
> >
> > Thanks
> >
> > On 20 February 2017 at 16:44, Richard Startin <
> richardstar...@outlook.com>
> > wrote:
> >
> > > For a client only solution, have you looked at the RegionLocator
> > > interface? It gives you a list of pairs of byte[] (the start and stop
> > keys
> > > for each region). You can easily use a ForkJoinPool recursive task or
> > java
> > > 8 parallel stream over that list. I implemented a spark RDD to do that
> > and
> > > wrote about it with code samples here:
> > >
> > > https://richardstartin.com/2016/11/07/co-locating-spark-
> >
> > > partitions-with-hbase-regions/
> > >
> > > Forget about the spark details in the post (and forget that Hortonworks
> > > have a library to do the same thing :)) the idea of creating one scan
> per
> > > region and setting scan starts and stops from the region locator would
> > give
> > > you a parallel scan. Note you can also group the scans by region
> server.
> > >
> > > Cheers,
> > > Richard
> > > On 20 Feb 2017, at 07:33, Anil <anilk...@gmail.com<mailto:ani
> > > lk...@gmail.com>> wrote:
> > >
> > > Thanks Ram. I will look into EndPoints.
> > >
> > > On 20 February 2017 at 12:29, ramkrishna vasudevan <
> > > ramkrishna.s.vasude...@gmail.com<mailto:ramkrishna.s.
> vasude...@gmail.com
> > >>
> > > wrote:
> > >
> > > Yes. There is way.
> > >
> > > Have you seen Endpoints? Endpoints are triggers like points that allows
> > > your client to trigger them parallely in one ore more regions using the
> > > start and end key of the region. This executes parallely and then you
> may
> > > have to sort out the results as per your need.
> > >
> > > But these endpoints have to running on your region servers and it is
> not
> > a
> > > client only soln.
> > > https://blogs.apache.org/hbase/entry/coprocessor_introduction.

Re: Parallel Scanner

2017-02-20 Thread Anil
Thanks Richard :)

On 20 February 2017 at 18:56, Richard Startin <richardstar...@outlook.com>
wrote:

> RegionLocator is not deprecated, hence the suggestion to use it if it's
> available in place of whatever is still available on HTable for your
> version of HBase - it will make upgrades easier. For instance
> HTable::getRegionsInRange no longer exists on the current master branch.
>
>
> "I am trying to scan a region in parallel :)"
>
>
> I thought you were asking about scanning many regions at the same time,
> not scanning a single region in parallel? HBASE-1935 is about parallelising
> scans over regions, not within regions.
>
>
> If you want to parallelise within a region, you could write a little
> method to split the first and last key of the region into several disjoint
> lexicographic buckets and create a scan for each bucket, then execute those
> scans in parallel. Your data probably doesn't distribute uniformly over
> lexicographic buckets though so the scans are unlikely to execute at a
> constant rate and you'll get results in time proportional to the
> lexicographic bucket with the highest cardinality in the region. I'd be
> interested to know if anyone on the list has ever tried this and what the
> results were?
>
>
> Using the much simpler approach of parallelising over regions by creating
> multiple disjoint scans client side, as suggested, your performance now
> depends on your regions which you have some control over. You can achieve
> the same effect by pre-splitting your table such that you empirically
> optimise read performance for the dataset you store.
>
>
> Thanks,
>
> Richard
>
>
> ________
> From: Anil <anilk...@gmail.com>
> Sent: 20 February 2017 12:35
> To: user@hbase.apache.org
> Subject: Re: Parallel Scanner
>
> Thanks Richard.
>
> I am able to get the regions for data to be loaded from table. I am trying
> to scan a region in parallel :)
>
> Thanks
>
> On 20 February 2017 at 16:44, Richard Startin <richardstar...@outlook.com>
> wrote:
>
> > For a client only solution, have you looked at the RegionLocator
> > interface? It gives you a list of pairs of byte[] (the start and stop
> keys
> > for each region). You can easily use a ForkJoinPool recursive task or
> java
> > 8 parallel stream over that list. I implemented a spark RDD to do that
> and
> > wrote about it with code samples here:
> >
> > https://richardstartin.com/2016/11/07/co-locating-spark-
>
> > partitions-with-hbase-regions/
> >
> > Forget about the spark details in the post (and forget that Hortonworks
> > have a library to do the same thing :)) the idea of creating one scan per
> > region and setting scan starts and stops from the region locator would
> give
> > you a parallel scan. Note you can also group the scans by region server.
> >
> > Cheers,
> > Richard
> > On 20 Feb 2017, at 07:33, Anil <anilk...@gmail.com<mailto:ani
> > lk...@gmail.com>> wrote:
> >
> > Thanks Ram. I will look into EndPoints.
> >
> > On 20 February 2017 at 12:29, ramkrishna vasudevan <
> > ramkrishna.s.vasude...@gmail.com<mailto:ramkrishna.s.vasude...@gmail.com
> >>
> > wrote:
> >
> > Yes. There is way.
> >
> > Have you seen Endpoints? Endpoints are triggers like points that allows
> > your client to trigger them parallely in one ore more regions using the
> > start and end key of the region. This executes parallely and then you may
> > have to sort out the results as per your need.
> >
> > But these endpoints have to running on your region servers and it is not
> a
> > client only soln.
> > https://blogs.apache.org/hbase/entry/coprocessor_introduction.
>
>
>
> >
> > Be careful when you use them. Since these endpoints run on server ensure
> > that these are not heavy or things that consume more memory which can
> have
> > adverse effects on the server.
> >
> >
> > Regards
> > Ram
> >
> > On Mon, Feb 20, 2017 at 12:18 PM, Anil <anilk...@gmail.com<mailto:ani
> > lk...@gmail.com>> wrote:
> >
> > Th

Re: Parallel Scanner

2017-02-20 Thread Anil
Thanks Richard.

I am able to get the regions for the data to be loaded from the table. I am
trying to scan a region in parallel :)

Thanks

On 20 February 2017 at 16:44, Richard Startin <richardstar...@outlook.com>
wrote:

> For a client only solution, have you looked at the RegionLocator
> interface? It gives you a list of pairs of byte[] (the start and stop keys
> for each region). You can easily use a ForkJoinPool recursive task or java
> 8 parallel stream over that list. I implemented a spark RDD to do that and
> wrote about it with code samples here:
>
> https://richardstartin.com/2016/11/07/co-locating-spark-
> partitions-with-hbase-regions/
>
> Forget about the spark details in the post (and forget that Hortonworks
> have a library to do the same thing :)) the idea of creating one scan per
> region and setting scan starts and stops from the region locator would give
> you a parallel scan. Note you can also group the scans by region server.
>
> Cheers,
> Richard
> On 20 Feb 2017, at 07:33, Anil <anilk...@gmail.com<mailto:ani
> lk...@gmail.com>> wrote:
>
> Thanks Ram. I will look into EndPoints.
>
> On 20 February 2017 at 12:29, ramkrishna vasudevan <
> ramkrishna.s.vasude...@gmail.com<mailto:ramkrishna.s.vasude...@gmail.com>>
> wrote:
>
> Yes. There is way.
>
> Have you seen Endpoints? Endpoints are triggers like points that allows
> your client to trigger them parallely in one ore more regions using the
> start and end key of the region. This executes parallely and then you may
> have to sort out the results as per your need.
>
> But these endpoints have to running on your region servers and it is not a
> client only soln.
> https://blogs.apache.org/hbase/entry/coprocessor_introduction.
>
> Be careful when you use them. Since these endpoints run on server ensure
> that these are not heavy or things that consume more memory which can have
> adverse effects on the server.
>
>
> Regards
> Ram
>
> On Mon, Feb 20, 2017 at 12:18 PM, Anil <anilk...@gmail.com<mailto:ani
> lk...@gmail.com>> wrote:
>
> Thanks Ram.
>
> So, you mean that there is no harm in using  HTable#getRegionsInRange in
> the application code.
>
> HTable#getRegionsInRange returned single entry for all my region start
> key
> and end key. i need to explore more on this.
>
> "If you know the table region's start and end keys you could create
> parallel scans in your application code."  - is there any way to scan a
> region in the application code other than the one i put in the original
> email ?
>
> "One thing to watch out is that if there is a split in the region then
> this start
> and end row may change so in that case it is better you try to get
> the regions every time before you issue a scan"
> - Agree. i am dynamically determining the region start key and end key
> before initiating scan operations for every initial load.
>
> Thanks.
>
>
>
>
> On 20 February 2017 at 10:59, ramkrishna vasudevan <
> ramkrishna.s.vasude...@gmail.com<mailto:ramkrishna.s.vasude...@gmail.com>>
> wrote:
>
> Hi Anil,
>
> HBase directly does not provide parallel scans. If you know the table
> region's start and end keys you could create parallel scans in your
> application code.
>
> In the above code snippet, the intent is right - you get the required
> regions and can issue parallel scans from your app.
>
> One thing to watch out is that if there is a split in the region then
> this
> start and end row may change so in that case it is better you try to
> get
> the regions every time before you issue a scan. Does that make sense to
> you?
>
> Regards
> Ram
>
> On Sat, Feb 18, 2017 at 1:44 PM, Anil <anilk...@gmail.com<mailto:ani
> lk...@gmail.com>> wrote:
>
> Hi ,
>
> I am building an usecase where i have to load the hbase data into
> In-memory
> database (IMDB). I am scanning the each region and loading data into
> IMDB.
>
> i am looking at parallel scanner ( https://issues.apache.org/
> jira/browse/HBASE-8504, HBASE-1935 ) to reduce the load time and
> HTable#
> getRegionsInRange(byte[] startKey, byte[] endKey, boolean reload) is
> deprecated, HBASE-1935 is still open.
>
> I see Connection from ConnectionFactory is HConnectionImplementation
> by
> default and creates HTable instance.
>
> Do you see any issues in using HTable from Table instance ?
>for each region {
>int i = 0;
>List regions =
> hTable.getRegionsInRange(scans.getStartRow(), scans.getStopRow(),
> true);
>
>for (HRegionLocation region : regions){
>startRow = i == 0 ? scans.getStartRow() :
> region.getRegionInfo().getStartKey();
>i++;
>endRow = i == regions.size()? scans.getStopRow()
> :
> region.getRegionInfo().getEndKey();
> }
>   }
>
> are there any alternatives to achieve parallel scan? Thanks.
>
> Thanks
>
>
>
>
>


Re: Parallel Scanner

2017-02-20 Thread Richard Startin
For a client only solution, have you looked at the RegionLocator interface? It 
gives you a list of pairs of byte[] (the start and stop keys for each region). 
You can easily use a ForkJoinPool recursive task or java 8 parallel stream over 
that list. I implemented a spark RDD to do that and wrote about it with code 
samples here:

https://richardstartin.com/2016/11/07/co-locating-spark-partitions-with-hbase-regions/

Forget about the spark details in the post (and forget that Hortonworks have a 
library to do the same thing :)) the idea of creating one scan per region and 
setting scan starts and stops from the region locator would give you a parallel 
scan. Note you can also group the scans by region server.
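
A minimal sketch of that client-side approach, assuming a hypothetical table
name and the 1.x client API (the per-row work and error handling are stubbed
out):

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Pair;

public class ParallelRegionScanSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        final TableName tableName = TableName.valueOf("relationships"); // hypothetical table
        try (Connection connection = ConnectionFactory.createConnection(conf);
             RegionLocator locator = connection.getRegionLocator(tableName)) {

            // One (startKey, stopKey) pair per region, fetched fresh so recent splits are seen.
            Pair<byte[][], byte[][]> keys = locator.getStartEndKeys();
            List<Scan> scans = new ArrayList<>();
            for (int i = 0; i < keys.getFirst().length; i++) {
                Scan scan = new Scan();
                scan.setStartRow(keys.getFirst()[i]);
                scan.setStopRow(keys.getSecond()[i]);
                scan.setCaching(500);
                scans.add(scan);
            }

            // One scan per worker thread; Connection is thread safe, Table is not,
            // so each task gets its own Table instance.
            long total = scans.parallelStream().mapToLong(scan -> {
                long rows = 0;
                try (Table table = connection.getTable(tableName);
                     ResultScanner scanner = table.getScanner(scan)) {
                    for (Result r : scanner) {
                        rows++; // replace with the real per-row work (e.g. the IMDB load)
                    }
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
                return rows;
            }).sum();

            System.out.println("rows scanned: " + total);
        }
    }
}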

Cheers,
Richard
On 20 Feb 2017, at 07:33, Anil <anilk...@gmail.com<mailto:anilk...@gmail.com>> 
wrote:

Thanks Ram. I will look into EndPoints.

On 20 February 2017 at 12:29, ramkrishna vasudevan <
ramkrishna.s.vasude...@gmail.com<mailto:ramkrishna.s.vasude...@gmail.com>> 
wrote:

Yes. There is way.

Have you seen Endpoints? Endpoints are triggers like points that allows
your client to trigger them parallely in one ore more regions using the
start and end key of the region. This executes parallely and then you may
have to sort out the results as per your need.

But these endpoints have to running on your region servers and it is not a
client only soln.
https://blogs.apache.org/hbase/entry/coprocessor_introduction.

Be careful when you use them. Since these endpoints run on server ensure
that these are not heavy or things that consume more memory which can have
adverse effects on the server.


Regards
Ram

On Mon, Feb 20, 2017 at 12:18 PM, Anil 
<anilk...@gmail.com<mailto:anilk...@gmail.com>> wrote:

Thanks Ram.

So, you mean that there is no harm in using  HTable#getRegionsInRange in
the application code.

HTable#getRegionsInRange returned single entry for all my region start
key
and end key. i need to explore more on this.

"If you know the table region's start and end keys you could create
parallel scans in your application code."  - is there any way to scan a
region in the application code other than the one i put in the original
email ?

"One thing to watch out is that if there is a split in the region then
this start
and end row may change so in that case it is better you try to get
the regions every time before you issue a scan"
- Agree. i am dynamically determining the region start key and end key
before initiating scan operations for every initial load.

Thanks.




On 20 February 2017 at 10:59, ramkrishna vasudevan <
ramkrishna.s.vasude...@gmail.com<mailto:ramkrishna.s.vasude...@gmail.com>> 
wrote:

Hi Anil,

HBase directly does not provide parallel scans. If you know the table
region's start and end keys you could create parallel scans in your
application code.

In the above code snippet, the intent is right - you get the required
regions and can issue parallel scans from your app.

One thing to watch out is that if there is a split in the region then
this
start and end row may change so in that case it is better you try to
get
the regions every time before you issue a scan. Does that make sense to
you?

Regards
Ram

On Sat, Feb 18, 2017 at 1:44 PM, Anil 
<anilk...@gmail.com<mailto:anilk...@gmail.com>> wrote:

Hi ,

I am building an usecase where i have to load the hbase data into
In-memory
database (IMDB). I am scanning the each region and loading data into
IMDB.

i am looking at parallel scanner ( https://issues.apache.org/
jira/browse/HBASE-8504, HBASE-1935 ) to reduce the load time and
HTable#
getRegionsInRange(byte[] startKey, byte[] endKey, boolean reload) is
deprecated, HBASE-1935 is still open.

I see Connection from ConnectionFactory is HConnectionImplementation
by
default and creates HTable instance.

Do you see any issues in using HTable from Table instance ?
   for each region {
   int i = 0;
   List regions =
hTable.getRegionsInRange(scans.getStartRow(), scans.getStopRow(),
true);

   for (HRegionLocation region : regions){
   startRow = i == 0 ? scans.getStartRow() :
region.getRegionInfo().getStartKey();
   i++;
   endRow = i == regions.size()? scans.getStopRow()
:
region.getRegionInfo().getEndKey();
}
  }

are there any alternatives to achieve parallel scan? Thanks.

Thanks






Re: Parallel Scanner

2017-02-19 Thread ramkrishna vasudevan
Yes. There is a way.

Have you seen Endpoints? Endpoints are trigger-like points that allow your
client to invoke them in parallel on one or more regions using the start and
end key of the region. They execute in parallel and then you may have to sort
the results as per your need.

But these endpoints have to be running on your region servers, so it is not a
client-only solution.
https://blogs.apache.org/hbase/entry/coprocessor_introduction.

Be careful when you use them. Since these endpoints run on the server, ensure
that they are not heavy and do not consume a lot of memory, which can have
adverse effects on the server.
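
For reference, the client-side invocation of an endpoint looks roughly like
this. The sketch assumes the RowCountEndpoint example that ships with
hbase-examples is deployed on the region servers and uses the 1.x client
classes (some of these moved packages in 2.x):

import java.io.IOException;
import java.util.Map;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.client.coprocessor.Batch;
import org.apache.hadoop.hbase.coprocessor.example.generated.ExampleProtos;
import org.apache.hadoop.hbase.ipc.BlockingRpcCallback;
import org.apache.hadoop.hbase.ipc.ServerRpcController;

public class EndpointClientSketch {
    public static void main(String[] args) throws Throwable {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("relationships"))) {
            final ExampleProtos.CountRequest request = ExampleProtos.CountRequest.getDefaultInstance();
            // One RPC per region between the given start and end keys (null = whole table);
            // results come back keyed by region and are merged client side.
            Map<byte[], Long> perRegion = table.coprocessorService(
                ExampleProtos.RowCountService.class, null, null,
                new Batch.Call<ExampleProtos.RowCountService, Long>() {
                    public Long call(ExampleProtos.RowCountService counter) throws IOException {
                        ServerRpcController controller = new ServerRpcController();
                        BlockingRpcCallback<ExampleProtos.CountResponse> callback =
                            new BlockingRpcCallback<ExampleProtos.CountResponse>();
                        counter.getRowCount(controller, request, callback);
                        ExampleProtos.CountResponse response = callback.get();
                        if (controller.failedOnException()) {
                            throw controller.getFailedOn();
                        }
                        return response != null ? response.getCount() : 0L;
                    }
                });
            long total = 0;
            for (Long count : perRegion.values()) {
                total += count;
            }
            System.out.println("row count = " + total);
        }
    }
}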


Regards
Ram

On Mon, Feb 20, 2017 at 12:18 PM, Anil <anilk...@gmail.com> wrote:

> Thanks Ram.
>
> So, you mean that there is no harm in using  HTable#getRegionsInRange in
> the application code.
>
> HTable#getRegionsInRange returned single entry for all my region start key
> and end key. i need to explore more on this.
>
> "If you know the table region's start and end keys you could create
> parallel scans in your application code."  - is there any way to scan a
> region in the application code other than the one i put in the original
> email ?
>
> "One thing to watch out is that if there is a split in the region then
> this start
> and end row may change so in that case it is better you try to get
> the regions every time before you issue a scan"
>  - Agree. i am dynamically determining the region start key and end key
> before initiating scan operations for every initial load.
>
> Thanks.
>
>
>
>
> On 20 February 2017 at 10:59, ramkrishna vasudevan <
> ramkrishna.s.vasude...@gmail.com> wrote:
>
> > Hi Anil,
> >
> > HBase directly does not provide parallel scans. If you know the table
> > region's start and end keys you could create parallel scans in your
> > application code.
> >
> > In the above code snippet, the intent is right - you get the required
> > regions and can issue parallel scans from your app.
> >
> > One thing to watch out is that if there is a split in the region then
> this
> > start and end row may change so in that case it is better you try to get
> > the regions every time before you issue a scan. Does that make sense to
> > you?
> >
> > Regards
> > Ram
> >
> > On Sat, Feb 18, 2017 at 1:44 PM, Anil <anilk...@gmail.com> wrote:
> >
> > > Hi ,
> > >
> > > I am building an usecase where i have to load the hbase data into
> > In-memory
> > > database (IMDB). I am scanning the each region and loading data into
> > IMDB.
> > >
> > > i am looking at parallel scanner ( https://issues.apache.org/
> > > jira/browse/HBASE-8504, HBASE-1935 ) to reduce the load time and
> HTable#
> > > getRegionsInRange(byte[] startKey, byte[] endKey, boolean reload) is
> > > deprecated, HBASE-1935 is still open.
> > >
> > > I see Connection from ConnectionFactory is HConnectionImplementation by
> > > default and creates HTable instance.
> > >
> > > Do you see any issues in using HTable from Table instance ?
> > > for each region {
> > > int i = 0;
> > > List regions =
> > > hTable.getRegionsInRange(scans.getStartRow(), scans.getStopRow(),
> true);
> > >
> > > for (HRegionLocation region : regions){
> > > startRow = i == 0 ? scans.getStartRow() :
> > > region.getRegionInfo().getStartKey();
> > > i++;
> > > endRow = i == regions.size()? scans.getStopRow() :
> > > region.getRegionInfo().getEndKey();
> > >  }
> > >}
> > >
> > > are there any alternatives to achieve parallel scan? Thanks.
> > >
> > > Thanks
> > >
> >
>


Re: Parallel Scanner

2017-02-19 Thread Anil
Thanks Ram.

So, you mean that there is no harm in using HTable#getRegionsInRange in
the application code.

HTable#getRegionsInRange returned a single entry for my region start key
and end key. I need to explore this more.

"If you know the table region's start and end keys you could create
parallel scans in your application code."  - Is there any way to scan a
region in the application code other than the one I put in the original
email?

"One thing to watch out is that if there is a split in the region then
this start
and end row may change so in that case it is better you try to get
the regions every time before you issue a scan"
 - Agree. I am dynamically determining the region start key and end key
before initiating the scan operations for every initial load.

Thanks.




On 20 February 2017 at 10:59, ramkrishna vasudevan <
ramkrishna.s.vasude...@gmail.com> wrote:

> Hi Anil,
>
> HBase directly does not provide parallel scans. If you know the table
> region's start and end keys you could create parallel scans in your
> application code.
>
> In the above code snippet, the intent is right - you get the required
> regions and can issue parallel scans from your app.
>
> One thing to watch out is that if there is a split in the region then this
> start and end row may change so in that case it is better you try to get
> the regions every time before you issue a scan. Does that make sense to
> you?
>
> Regards
> Ram
>
> On Sat, Feb 18, 2017 at 1:44 PM, Anil <anilk...@gmail.com> wrote:
>
> > Hi ,
> >
> > I am building an usecase where i have to load the hbase data into
> In-memory
> > database (IMDB). I am scanning the each region and loading data into
> IMDB.
> >
> > i am looking at parallel scanner ( https://issues.apache.org/
> > jira/browse/HBASE-8504, HBASE-1935 ) to reduce the load time and HTable#
> > getRegionsInRange(byte[] startKey, byte[] endKey, boolean reload) is
> > deprecated, HBASE-1935 is still open.
> >
> > I see Connection from ConnectionFactory is HConnectionImplementation by
> > default and creates HTable instance.
> >
> > Do you see any issues in using HTable from Table instance ?
> > for each region {
> > int i = 0;
> > List regions =
> > hTable.getRegionsInRange(scans.getStartRow(), scans.getStopRow(), true);
> >
> > for (HRegionLocation region : regions){
> > startRow = i == 0 ? scans.getStartRow() :
> > region.getRegionInfo().getStartKey();
> > i++;
> > endRow = i == regions.size()? scans.getStopRow() :
> > region.getRegionInfo().getEndKey();
> >  }
> >}
> >
> > are there any alternatives to achieve parallel scan? Thanks.
> >
> > Thanks
> >
>


Re: Parallel Scanner

2017-02-19 Thread ramkrishna vasudevan
Hi Anil,

HBase directly does not provide parallel scans. If you know the table
region's start and end keys you could create parallel scans in your
application code.

In the above code snippet, the intent is right - you get the required
regions and can issue parallel scans from your app.

One thing to watch out is that if there is a split in the region then this
start and end row may change so in that case it is better you try to get
the regions every time before you issue a scan. Does that make sense to you?

Regards
Ram

On Sat, Feb 18, 2017 at 1:44 PM, Anil <anilk...@gmail.com> wrote:

> Hi ,
>
> I am building an usecase where i have to load the hbase data into In-memory
> database (IMDB). I am scanning the each region and loading data into IMDB.
>
> i am looking at parallel scanner ( https://issues.apache.org/
> jira/browse/HBASE-8504, HBASE-1935 ) to reduce the load time and HTable#
> getRegionsInRange(byte[] startKey, byte[] endKey, boolean reload) is
> deprecated, HBASE-1935 is still open.
>
> I see Connection from ConnectionFactory is HConnectionImplementation by
> default and creates HTable instance.
>
> Do you see any issues in using HTable from Table instance ?
> for each region {
> int i = 0;
> List regions =
> hTable.getRegionsInRange(scans.getStartRow(), scans.getStopRow(), true);
>
> for (HRegionLocation region : regions){
> startRow = i == 0 ? scans.getStartRow() :
> region.getRegionInfo().getStartKey();
> i++;
> endRow = i == regions.size()? scans.getStopRow() :
> region.getRegionInfo().getEndKey();
>  }
>}
>
> are there any alternatives to achieve parallel scan? Thanks.
>
> Thanks
>


Parallel Scanner

2017-02-18 Thread Anil
Hi ,

I am building a use case where I have to load the HBase data into an in-memory
database (IMDB). I am scanning each region and loading the data into the IMDB.

I am looking at parallel scanners ( https://issues.apache.org/jira/browse/HBASE-8504,
HBASE-1935 ) to reduce the load time; HTable#getRegionsInRange(byte[] startKey,
byte[] endKey, boolean reload) is deprecated and HBASE-1935 is still open.

I see the Connection from ConnectionFactory is HConnectionImplementation by
default and creates HTable instances.

Do you see any issues in using the HTable obtained from the Table instance?
    // for each requested scan range:
    int i = 0;
    List<HRegionLocation> regions =
        hTable.getRegionsInRange(scans.getStartRow(), scans.getStopRow(), true);

    for (HRegionLocation region : regions) {
        byte[] startRow = i == 0 ? scans.getStartRow()
                                 : region.getRegionInfo().getStartKey();
        i++;
        byte[] endRow = i == regions.size() ? scans.getStopRow()
                                            : region.getRegionInfo().getEndKey();
        // build one Scan per region from startRow/endRow here
    }

are there any alternatives to achieve parallel scan? Thanks.

Thanks


Parallel Scanner

2017-02-18 Thread Anil
Hi ,

I am building a use case where I have to load the HBase data into an in-memory
database (IMDB). I am scanning each region and loading the data into the IMDB.

I am looking at parallel scanners (
https://issues.apache.org/jira/browse/HBASE-8504 ) and HTable#
getRegionsInRange(byte[] startKey, byte[] endKey, boolean reload) is
deprecated.


Re: HBase parallel scanner performance

2012-05-19 Thread S Ahmed
great thread for a real world problem.

Michael, it sounds like the initial design was more of a traditional db
solution, whereas with HBase (and NoSQL in general) the design is to
denormalize and build your row/CF structure to fit the use case. Disks are
cheap and writes are fast, so build your index in order to scan for the
results you need.

On Thu, Apr 19, 2012 at 2:33 PM, Michael Segel michael_se...@hotmail.comwrote:

 No problem.

 One of the hardest things to do is to try to be open to other design ideas
 and not become wedded to one.

 I think once you get that working you can start to look at your cluster.

 On Apr 19, 2012, at 1:26 PM, Narendra yadala wrote:

  Michael,
 
  I will do the redesign and build the index. Thanks a lot for the
 insights.
 
  Narendra
 
  On Thu, Apr 19, 2012 at 9:56 PM, Michael Segel 
 michael_se...@hotmail.comwrote:
 
  Narendra,
 
  I think you are still missing the point.
  130 seconds to scan the table per iteration.
  Even if you have 10K rows
  130 * 10^4 or 1.3*10^6 seconds.  ~361 hours
 
  Compare that to 10K rows where you then select a single row in your sub
  select that has a list of all of the associated rows.
  You can then do  n number of get()s based on the data in the index. (If
  the data wasn't in the index itself)
 
  Assuming that the data was in the index, that's one get(). This is sub
  second.
  Just to keep things simple assume 1 second.
  That's 10K seconds vs 1.3 million seconds (2 hours vs 361 hours).
  Actually it's more like 10 ms per get, so it's about 100 seconds to run your
  code (so it's like 2 minutes or so).
 
  Also since you're doing less work, you put less strain on the system.
 
  Look, you're asking for help. You're fighting to maintain a bad design.
  Building the index table shouldn't take you more than a day to think,
  design and implement.
 
  So you tell me, 2 minutes vs 361 hours. Which would you choose?
 
  HTH
 
  -Mike
 
 
  On Apr 19, 2012, at 10:04 AM, Narendra yadala wrote:
 
  Michael,
 
  Thanks for the response. This is a real problem and not a class
 project.
  Boxes itself costed 9k ;)
 
  I think there is some difference in understanding of the problem. The
  table
  has 2m rows but I am looking at the latest 10k rows only in the outer
 for
  loop. Only in the inner for loop i am trying to get all rows that
 contain
  the url that is given by the row in the outer for loop. So pseudo code
 is
  like this
 
  All scanners have a caching of 128.
 
  ResultScanner outerScanner = tweetTable.getScanner(new Scan()); // this gets the entire row
  for (int index = 0; index < 10000; index++) {
      Result tweet = outerScanner.next();
      NavigableMap<byte[], byte[]> linkFamilyMap =
          tweet.getFamilyMap(Bytes.toBytes("link"));
      String url = Bytes.toString(linkFamilyMap.firstKey()); // assuming only one link is there in the tweet
      Scan linkScan = new Scan();
      linkScan.addColumn(Bytes.toBytes("link"), Bytes.toBytes(url)); // get only the link column family
      ResultScanner linkScanner = tweetTable.getScanner(linkScan); // this inner scan is taking ~2 sec
      for (Result linkResult = linkScanner.next(); linkResult != null;
           linkResult = linkScanner.next()) {
          // do something with the link
      }
      linkScanner.close();

      // do a similar for loop for hashtags
  }
 
  Each of my inner for loop is taking around 20 seconds (or more
 depending
  on
  number of rows returned by that particular scanner) for each of the 10k
  rows that I am processing and this is also triggering a lot of GC in
  turn.
  So it is 10,000 * 40 seconds (about 4 days) for each thread. But the problem is
  that
  the batch process crashes before completion throwing IOException and
  SocketTimeoutException and sometimes GC OutOfMemory exceptions.
 
  I will definitely take the much elegant approach that you mentioned
  eventually. I just wanted to get to the core of the issue before
 choosing
  the solution.
 
  Thanks again.
  Narendra
 
  On Thu, Apr 19, 2012 at 7:42 PM, Michel Segel 
 michael_se...@hotmail.com
  wrote:
 
  Narendra,
 
  Are you trying to solve a real problem, or is this a class project?
 
  Your solution doesn't scale. It's a non-starter. 130 seconds for each
  iteration times 1 million iterations is how long? 130 million seconds, which
  is ~36000 hours or over 4 years to complete.
  (the numbers are rough but you get the idea...)
 
  That's assuming that your table is static and doesn't change.
 
  I didn't even ask if you were attempting any sort of server side
  filtering
  which would reduce the amount of data you send back to the client
  because
  it a moot point.
 
  Finer tuning is also moot.
 
  So you insert a row in one table. You then do n^2 operations to pull
 out
  data.
  The better solution is to insert data into 2 tables where you then
 have
  to
  do 2n operations to get the same results. Thats per thread btw.  So if
  you
  were running 10 threads, you would have 10n^2  operations versus 20n
  operations to get the same result set.
 
 

HBase parallel scanner performance

2012-04-19 Thread Narendra yadala
I have an issue with my HBase cluster. We have a 4 node HBase/Hadoop (4*32
GB RAM and 4*6 TB disk space) cluster. We are using Cloudera distribution
for maintaining our cluster. I have a single tweets table in which we store
the tweets, one tweet per row (it has millions of rows currently).

Now I try to run a Java batch (not a map reduce) which does the following :

   1. Open a scanner over the tweet table and read the tweets one after
   another. I set scanner caching to 128 rows as higher scanner caching is
   leading to ScannerTimeoutExceptions. I scan over the first 10k rows only.
   2. For each tweet, extract URLs (linkcolfamily:urlvalue) that are there
   in that tweet and open another scanner over the tweets table to see who
   else shared that link. This involves getting rows having that URL from the
   entire table (not first 10k rows).
   3. Do similar stuff as in step 2 for hashtags
   (hashtagcolfamily:hashtagvalue).
   4. Do steps 1-3 in parallel for approximately 7-8 threads. This number
   can be higher (thousands also) later.


When I run this batch I got the GC issue which is specified here
http://www.cloudera.com/blog/2011/02/avoiding-full-gcs-in-hbase-with-memstore-local-allocation-buffers-part-1/
Then I tried to turn on the MSLAB feature and changed the GC settings by
specifying  -XX:+UseParNewGC  and  -XX:+UseConcMarkSweepGC JVM flags.
Even after doing this, I am running into all kinds of IOExceptions
and SocketTimeoutExceptions.

This Java batch keeps approximately 7*2 (14) scanners open at any point in
time, and still I am running into all kinds of trouble. I am wondering
whether I can have thousands of parallel scanners with HBase when I need to
scale.

It would be great to know whether I can open thousands/millions of scanners
in parallel with HBase efficiently.

Thanks
Narendra
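
A minimal sketch of the scan-with-caching setup described above, written against the 0.90-era client API; the table name "tweets" is an assumption, and the 10k-row cap mirrors the description:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class TweetScanSketch {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    HTable tweetTable = new HTable(conf, "tweets"); // table name is an assumption
    Scan scan = new Scan();
    scan.setCaching(128); // rows fetched per RPC; see the note on lease timeouts below
    ResultScanner scanner = tweetTable.getScanner(scan);
    try {
      int rows = 0;
      for (Result tweet = scanner.next(); tweet != null && rows < 10000; tweet = scanner.next()) {
        rows++; // only the first 10k rows returned are processed, as in the batch above
      }
    } finally {
      scanner.close(); // releases the server-side scanner lease promptly
      tweetTable.close();
    }
  }
}

Larger setCaching() values lengthen the gap between successive scanner RPCs while the client works through the cached rows, which is the window in which the server-side scanner lease can expire.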


Re: HBase parallel scanner performance

2012-04-19 Thread Michel Segel
So in your step 2 you have the following:
FOREACH row IN TABLE alpha:
 SELECT something
 FROM TABLE alpha 
 WHERE alpha.url = row.url

Right?
And you are wondering why you are getting timeouts?
...
...
And how long does it take to do a full table scan? ;-)
(there's more, but that's the first thing you should see...)

Try creating a second table where you invert the URL and key pair such that for 
each URL, you have a set of your alpha table's keys?

Then you have the following...
FOREACH row IN TABLE alpha:
   FETCH key-set FROM beta 
   WHERE beta.rowkey = alpha.url

Note I use FETCH to signify that you should get a single row in response.

Does this make sense?
(your second table is actually an index of the URL column in your first table)
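
A concrete sketch of that suggestion (not code from this thread; the index table name "tweets_by_url", its "t" column family, and the choice of storing tweet row keys as column qualifiers are assumptions made purely for illustration):

import java.io.IOException;
import java.util.NavigableMap;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class UrlIndexSketch {
  private static final byte[] FAMILY = Bytes.toBytes("t");

  // Maintain the index: whenever a tweet is written to the main table, also write
  // one cell into the index table, keyed by the URL the tweet contains.
  static void indexTweet(HTable urlIndex, String url, byte[] tweetRowKey) throws IOException {
    Put put = new Put(Bytes.toBytes(url));
    put.add(FAMILY, tweetRowKey, new byte[0]); // qualifier = tweet row key, value unused
    urlIndex.put(put);
  }

  // Read the index: a single Get replaces a scan of the whole tweets table and returns
  // the row keys of every tweet that shared this URL.
  static NavigableMap<byte[], byte[]> tweetsSharing(HTable urlIndex, String url) throws IOException {
    Get get = new Get(Bytes.toBytes(url));
    get.addFamily(FAMILY);
    Result result = urlIndex.get(get);
    return result.getFamilyMap(FAMILY); // null if the URL has never been indexed
  }
}

The hashtag lookup works the same way with a second index table keyed by hashtag.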

HTH 

Sent from a remote device. Please excuse any typos...

Mike Segel

On Apr 19, 2012, at 5:43 AM, Narendra yadala narendra.yad...@gmail.com wrote:

 I have an issue with my HBase cluster. We have a 4 node HBase/Hadoop (4*32
 GB RAM and 4*6 TB disk space) cluster. We are using Cloudera distribution
 for maintaining our cluster. I have a single tweets table in which we store
 the tweets, one tweet per row (it has millions of rows currently).
 
 Now I try to run a Java batch (not a map reduce) which does the following :
 
   1. Open a scanner over the tweet table and read the tweets one after
   another. I set scanner caching to 128 rows as higher scanner caching is
   leading to ScannerTimeoutExceptions. I scan over the first 10k rows only.
   2. For each tweet, extract URLs (linkcolfamily:urlvalue) that are there
   in that tweet and open another scanner over the tweets table to see who
   else shared that link. This involves getting rows having that URL from the
   entire table (not first 10k rows).
   3. Do similar stuff as in step 2 for hashtags
   (hashtagcolfamily:hashtagvalue).
   4. Do steps 1-3 in parallel for approximately 7-8 threads. This number
   can be higher (thousands also) later.
 
 
 When I run this batch I got the GC issue which is specified here
 http://www.cloudera.com/blog/2011/02/avoiding-full-gcs-in-hbase-with-memstore-local-allocation-buffers-part-1/
 Then I tried to turn on the MSLAB feature and changed the GC settings by
 specifying  -XX:+UseParNewGC  and  -XX:+UseConcMarkSweepGC JVM flags.
 Even after doing this, I am running into all kinds of IOExceptions
 and SocketTimeoutExceptions.
 
 This Java batch opens approximately 7*2 (14) scanners open at a point in
 time and still I am running into all kinds of troubles. I am wondering
 whether I can have thousands of parallel scanners with HBase when I need to
 scale.
 
 It would be great to know whether I can open thousands/millions of scanners
 in parallel with HBase efficiently.
 
 Thanks
 Narendra


Re: HBase parallel scanner performance

2012-04-19 Thread Narendra yadala
Hi Michel

Yes, that is exactly what I do in step 2. I am aware of the reason for the
scanner timeout exceptions: the time between two consecutive
invocations of the next call on a specific scanner object exceeds the scanner lease. I increased the
scanner timeout to 10 min on the region server and still I keep seeing the
timeouts. So I reduced my scanner caching to 128.
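
For reference, in 0.90.x the server-side scanner lease being described here is governed by hbase.regionserver.lease.period (in milliseconds) on the region servers; a 10-minute lease would look roughly like the following in hbase-site.xml (a sketch, not a verified configuration):

  <property>
    <name>hbase.regionserver.lease.period</name>
    <value>600000</value>
  </property>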

Full table scan takes 130 seconds and there are 2.2 million rows in the
table as of now. Each row is around 2 KB in size. I measured time for the
full table scan by issuing `count` command from the hbase shell.

I kind of understood the fix that you are specifying, but do I need to
change the table structure to fix this problem? All I do is an n^2 operation
and even that fails with 10 different types of exceptions. It is mildly
annoying that I need to know all the low level storage details of HBase to
do such a simple operation. And this is happening for just 14 parallel
scanners. I am wondering what would happen when there are thousands of
parallel scanners.

Please let me know if there is any configuration param change which would
fix this issue.

Thanks a lot
Narendra

On Thu, Apr 19, 2012 at 4:40 PM, Michel Segel michael_se...@hotmail.comwrote:

 So in your step 2 you have the following:
 FOREACH row IN TABLE alpha:
 SELECT something
 FROM TABLE alpha
 WHERE alpha.url = row.url

 Right?
 And you are wondering why you are getting timeouts?
 ...
 ...
 And how long does it take to do a full table scan? ;-)
 (there's more, but that's the first thing you should see...)

 Try creating a second table where you invert the URL and key pair such
 that for each URL, you have a set of your alpha table's keys?

 Then you have the following...
 FOREACH row IN TABLE alpha:
   FETCH key-set FROM beta
   WHERE beta.rowkey = alpha.url

 Note I use FETCH to signify that you should get a single row in response.

 Does this make sense?
 ( your second table is actually and index of the URL column in your first
 table)

 HTH

 Sent from a remote device. Please excuse any typos...

 Mike Segel

 On Apr 19, 2012, at 5:43 AM, Narendra yadala narendra.yad...@gmail.com
 wrote:

  I have an issue with my HBase cluster. We have a 4 node HBase/Hadoop
 (4*32
  GB RAM and 4*6 TB disk space) cluster. We are using Cloudera distribution
  for maintaining our cluster. I have a single tweets table in which we
 store
  the tweets, one tweet per row (it has millions of rows currently).
 
  Now I try to run a Java batch (not a map reduce) which does the
 following :
 
1. Open a scanner over the tweet table and read the tweets one after
another. I set scanner caching to 128 rows as higher scanner caching is
leading to ScannerTimeoutExceptions. I scan over the first 10k rows
 only.
2. For each tweet, extract URLs (linkcolfamily:urlvalue) that are there
in that tweet and open another scanner over the tweets table to see who
else shared that link. This involves getting rows having that URL from
 the
entire table (not first 10k rows).
3. Do similar stuff as in step 2 for hashtags
(hashtagcolfamily:hashtagvalue).
4. Do steps 1-3 in parallel for approximately 7-8 threads. This number
can be higher (thousands also) later.
 
 
  When I run this batch I got the GC issue which is specified here
 
 http://www.cloudera.com/blog/2011/02/avoiding-full-gcs-in-hbase-with-memstore-local-allocation-buffers-part-1/
  Then I tried to turn on the MSLAB feature and changed the GC settings by
  specifying  -XX:+UseParNewGC  and  -XX:+UseConcMarkSweepGC JVM flags.
  Even after doing this, I am running into all kinds of IOExceptions
  and SocketTimeoutExceptions.
 
  This Java batch opens approximately 7*2 (14) scanners open at a point in
  time and still I am running into all kinds of troubles. I am wondering
  whether I can have thousands of parallel scanners with HBase when I need
 to
  scale.
 
  It would be great to know whether I can open thousands/millions of
 scanners
  in parallel with HBase efficiently.
 
  Thanks
  Narendra



Re: HBase parallel scanner performance

2012-04-19 Thread Michel Segel
Narendra, 

Are you trying to solve a real problem, or is this a class project?

Your solution doesn't scale. It's a non-starter. 130 seconds for each iteration 
times 1 million iterations is how long? 130 million seconds, which is ~36000 hours 
or over 4 years to complete.
(the numbers are rough but you get the idea...)

That's assuming that your table is static and doesn't change.

I didn't even ask if you were attempting any sort of server-side filtering, 
which would reduce the amount of data you send back to the client, because it's a 
moot point. 

Finer tuning is also moot.

So you insert a row in one table. You then do n^2 operations to pull out data.
The better solution is to insert data into 2 tables, where you then have to do 
2n operations to get the same results. That's per thread, btw.  So if you were 
running 10 threads, you would have 10n^2 operations versus 20n operations to 
get the same result set.

A million row table... with 10 threads that is roughly 1*10^13 operations vs. 2*10^7.

I don't believe I mentioned anything about HBase's internals and this solution 
works for any NoSQL database.


Sent from a remote device. Please excuse any typos...

Mike Segel

On Apr 19, 2012, at 7:03 AM, Narendra yadala narendra.yad...@gmail.com wrote:

 Hi Michel
 
 Yes, that is exactly what I do in step 2. I am aware of the reason for the
 scanner timeout exceptions. It is the time between two consecutive
 invocations of the next call on a specific scanner object. I increased the
 scanner timeout to 10 min on the region server and still I keep seeing the
 timeouts. So I reduced my scanner cache to 128.
 
 Full table scan takes 130 seconds and there are 2.2 million rows in the
 table as of now. Each row is around 2 KB in size. I measured time for the
 full table scan by issuing `count` command from the hbase shell.
 
 I kind of understood the fix that you are specifying, but do I need to
 change the table structure to fix this problem? All I do is a n^2 operation
 and even that fails with 10 different types of exceptions. It is mildly
 annoying that I need to know all the low level storage details of HBase to
 do such a simple operation. And this is happening for just 14 parallel
 scanners. I am wondering what would happen when there are thousands of
 parallel scanners.
 
 Please let me know if there is any configuration param change which would
 fix this issue.
 
 Thanks a lot
 Narendra
 
 On Thu, Apr 19, 2012 at 4:40 PM, Michel Segel 
 michael_se...@hotmail.comwrote:
 
 So in your step 2 you have the following:
 FOREACH row IN TABLE alpha:
SELECT something
FROM TABLE alpha
WHERE alpha.url = row.url
 
 Right?
 And you are wondering why you are getting timeouts?
 ...
 ...
 And how long does it take to do a full table scan? ;-)
 (there's more, but that's the first thing you should see...)
 
 Try creating a second table where you invert the URL and key pair such
 that for each URL, you have a set of your alpha table's keys?
 
 Then you have the following...
 FOREACH row IN TABLE alpha:
  FETCH key-set FROM beta
  WHERE beta.rowkey = alpha.url
 
 Note I use FETCH to signify that you should get a single row in response.
 
 Does this make sense?
 ( your second table is actually and index of the URL column in your first
 table)
 
 HTH
 
 Sent from a remote device. Please excuse any typos...
 
 Mike Segel
 
 On Apr 19, 2012, at 5:43 AM, Narendra yadala narendra.yad...@gmail.com
 wrote:
 
 I have an issue with my HBase cluster. We have a 4 node HBase/Hadoop
 (4*32
 GB RAM and 4*6 TB disk space) cluster. We are using Cloudera distribution
 for maintaining our cluster. I have a single tweets table in which we
 store
 the tweets, one tweet per row (it has millions of rows currently).
 
 Now I try to run a Java batch (not a map reduce) which does the
 following :
 
  1. Open a scanner over the tweet table and read the tweets one after
  another. I set scanner caching to 128 rows as higher scanner caching is
  leading to ScannerTimeoutExceptions. I scan over the first 10k rows
 only.
  2. For each tweet, extract URLs (linkcolfamily:urlvalue) that are there
  in that tweet and open another scanner over the tweets table to see who
  else shared that link. This involves getting rows having that URL from
 the
  entire table (not first 10k rows).
  3. Do similar stuff as in step 2 for hashtags
  (hashtagcolfamily:hashtagvalue).
  4. Do steps 1-3 in parallel for approximately 7-8 threads. This number
  can be higher (thousands also) later.
 
 
 When I run this batch I got the GC issue which is specified here
 
 http://www.cloudera.com/blog/2011/02/avoiding-full-gcs-in-hbase-with-memstore-local-allocation-buffers-part-1/
 Then I tried to turn on the MSLAB feature and changed the GC settings by
 specifying  -XX:+UseParNewGC  and  -XX:+UseConcMarkSweepGC JVM flags.
 Even after doing this, I am running into all kinds of IOExceptions
 and SocketTimeoutExceptions.
 
 This Java batch opens approximately 7*2 (14) scanners open at a point in
 time and still I am 

RE: HBase parallel scanner performance

2012-04-19 Thread Bijieshan
Hi Narendra,

I have a few doubts:

1. Which version are you using?
2. What's the size of each KeyValue?
3. Did you change the GC parameters on the client side or the server side? After 
changing the GC parameters, did you keep an eye on the GC logs? 

Thank you.

Regards,
Jieshan

-Original Message-
From: Narendra yadala [mailto:narendra.yad...@gmail.com] 
Sent: Thursday, April 19, 2012 8:04 PM
To: user@hbase.apache.org
Subject: Re: HBase parallel scanner performance

Hi Michel

Yes, that is exactly what I do in step 2. I am aware of the reason for the
scanner timeout exceptions. It is the time between two consecutive
invocations of the next call on a specific scanner object. I increased the
scanner timeout to 10 min on the region server and still I keep seeing the
timeouts. So I reduced my scanner cache to 128.

Full table scan takes 130 seconds and there are 2.2 million rows in the
table as of now. Each row is around 2 KB in size. I measured time for the
full table scan by issuing `count` command from the hbase shell.

I kind of understood the fix that you are specifying, but do I need to
change the table structure to fix this problem? All I do is a n^2 operation
and even that fails with 10 different types of exceptions. It is mildly
annoying that I need to know all the low level storage details of HBase to
do such a simple operation. And this is happening for just 14 parallel
scanners. I am wondering what would happen when there are thousands of
parallel scanners.

Please let me know if there is any configuration param change which would
fix this issue.

Thanks a lot
Narendra

On Thu, Apr 19, 2012 at 4:40 PM, Michel Segel michael_se...@hotmail.comwrote:

 So in your step 2 you have the following:
 FOREACH row IN TABLE alpha:
 SELECT something
 FROM TABLE alpha
 WHERE alpha.url = row.url

 Right?
 And you are wondering why you are getting timeouts?
 ...
 ...
 And how long does it take to do a full table scan? ;-)
 (there's more, but that's the first thing you should see...)

 Try creating a second table where you invert the URL and key pair such
 that for each URL, you have a set of your alpha table's keys?

 Then you have the following...
 FOREACH row IN TABLE alpha:
   FETCH key-set FROM beta
   WHERE beta.rowkey = alpha.url

 Note I use FETCH to signify that you should get a single row in response.

 Does this make sense?
 ( your second table is actually and index of the URL column in your first
 table)

 HTH

 Sent from a remote device. Please excuse any typos...

 Mike Segel

 On Apr 19, 2012, at 5:43 AM, Narendra yadala narendra.yad...@gmail.com
 wrote:

  I have an issue with my HBase cluster. We have a 4 node HBase/Hadoop
 (4*32
  GB RAM and 4*6 TB disk space) cluster. We are using Cloudera distribution
  for maintaining our cluster. I have a single tweets table in which we
 store
  the tweets, one tweet per row (it has millions of rows currently).
 
  Now I try to run a Java batch (not a map reduce) which does the
 following :
 
1. Open a scanner over the tweet table and read the tweets one after
another. I set scanner caching to 128 rows as higher scanner caching is
leading to ScannerTimeoutExceptions. I scan over the first 10k rows
 only.
2. For each tweet, extract URLs (linkcolfamily:urlvalue) that are there
in that tweet and open another scanner over the tweets table to see who
else shared that link. This involves getting rows having that URL from
 the
entire table (not first 10k rows).
3. Do similar stuff as in step 2 for hashtags
(hashtagcolfamily:hashtagvalue).
4. Do steps 1-3 in parallel for approximately 7-8 threads. This number
can be higher (thousands also) later.
 
 
  When I run this batch I got the GC issue which is specified here
 
 http://www.cloudera.com/blog/2011/02/avoiding-full-gcs-in-hbase-with-memstore-local-allocation-buffers-part-1/
  Then I tried to turn on the MSLAB feature and changed the GC settings by
  specifying  -XX:+UseParNewGC  and  -XX:+UseConcMarkSweepGC JVM flags.
  Even after doing this, I am running into all kinds of IOExceptions
  and SocketTimeoutExceptions.
 
  This Java batch opens approximately 7*2 (14) scanners open at a point in
  time and still I am running into all kinds of troubles. I am wondering
  whether I can have thousands of parallel scanners with HBase when I need
 to
  scale.
 
  It would be great to know whether I can open thousands/millions of
 scanners
  in parallel with HBase efficiently.
 
  Thanks
  Narendra



Re: HBase parallel scanner performance

2012-04-19 Thread Narendra yadala
Michael,

Thanks for the response. This is a real problem and not a class project.
The boxes themselves cost 9k ;)

I think there is some difference in understanding of the problem. The table
has 2m rows, but I am looking at the latest 10k rows only in the outer for
loop. Only in the inner for loop am I trying to get all rows that contain
the URL that is given by the row in the outer for loop. So the pseudo code is
like this:

All scanners have a caching of 128.

ResultScanner outerScanner = tweetTable.getScanner(new Scan()); // this gets the entire row
for (int index = 0; index < 10000; index++) {
  Result tweet = outerScanner.next();
  NavigableMap<byte[], byte[]> linkFamilyMap =
      tweet.getFamilyMap(Bytes.toBytes("link"));
  String url = Bytes.toString(linkFamilyMap.firstKey()); // assuming only one link is there in the tweet
  Scan linkScan = new Scan();
  linkScan.addColumn(Bytes.toBytes("link"), Bytes.toBytes(url)); // get only the link column family
  ResultScanner linkScanner = tweetTable.getScanner(linkScan); // ideally this for loop is taking 2 sec per scan
  for (Result linkResult = linkScanner.next(); linkResult != null;
       linkResult = linkScanner.next()) {
    // do something with the link
  }
  linkScanner.close();

  // do a similar for loop for hashtags
}

Each of my inner for loops is taking around 20 seconds (or more, depending on
the number of rows returned by that particular scanner) for each of the 10k
rows that I am processing, and this is also triggering a lot of GC in turn.
So it is 10000*40 seconds (~4 days) for each thread. But the problem is that
the batch process crashes before completion, throwing IOException and
SocketTimeoutException and sometimes GC OutOfMemory exceptions.

I will definitely take the much more elegant approach that you mentioned
eventually. I just wanted to get to the core of the issue before choosing
the solution.

Thanks again.
Narendra
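
For comparison with the inner-scan loop above, here is a sketch of the same outer loop once an index table exists, reusing the hypothetical "tweets_by_url" table and "t" family from the earlier sketch in this thread; it is an illustration of the "2n operations" shape, not the eventual redesign:

// Sketch only: tweetTable is the handle from the snippet above; "tweets_by_url" is a
// hypothetical index table whose row key is the URL and whose "t" family holds one
// qualifier per tweet row key.
HTable urlIndex = new HTable(tweetTable.getConfiguration(), "tweets_by_url");
ResultScanner outerScanner = tweetTable.getScanner(new Scan());
int rows = 0;
for (Result tweet = outerScanner.next(); tweet != null && rows < 10000; tweet = outerScanner.next()) {
  rows++;
  NavigableMap<byte[], byte[]> linkFamilyMap = tweet.getFamilyMap(Bytes.toBytes("link"));
  if (linkFamilyMap == null || linkFamilyMap.isEmpty()) {
    continue; // tweet without a link
  }
  String url = Bytes.toString(linkFamilyMap.firstKey());
  // One point lookup on the index instead of a scan over the whole tweets table.
  Get lookup = new Get(Bytes.toBytes(url));
  lookup.addFamily(Bytes.toBytes("t"));
  Result indexRow = urlIndex.get(lookup);
  if (!indexRow.isEmpty()) {
    for (byte[] tweetRowKey : indexRow.getFamilyMap(Bytes.toBytes("t")).keySet()) {
      // do something with each tweet that shared this URL, e.g. a Get against tweetTable
    }
  }
  // a second index table keyed by hashtag would replace the hashtag scan the same way
}
outerScanner.close();
urlIndex.close();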

On Thu, Apr 19, 2012 at 7:42 PM, Michel Segel michael_se...@hotmail.comwrote:

 Narendra,

 Are you trying to solve a real problem, or is this a class project?

 Your solution doesn't scale. It's a non starter. 130 seconds for each
 iteration times 1 million seconds is how long? 130 million seconds, which
 is ~36000 hours or over 4 years to complete.
 (the numbers are rough but you get the idea...)

 That's assuming that your table is static and doesn't change.

 I didn't even ask if you were attempting any sort of server side filtering
 which would reduce the amount of data you send back to the client because
 it a moot point.

 Finer tuning is also moot.

 So you insert a row in one table. You then do n^2 operations to pull out
 data.
 The better solution is to insert data into 2 tables where you then have to
 do 2n operations to get the same results. Thats per thread btw.  So if you
 were running 10 threads, you would have 10n^2  operations versus 20n
 operations to get the same result set.

 A million row table... 1*10^13. Vs 2*10^6

 I don't believe I mentioned anything about HBase's internals and this
 solution works for any NoSQL database.


 Sent from a remote device. Please excuse any typos...

 Mike Segel

 On Apr 19, 2012, at 7:03 AM, Narendra yadala narendra.yad...@gmail.com
 wrote:

  Hi Michel
 
  Yes, that is exactly what I do in step 2. I am aware of the reason for
 the
  scanner timeout exceptions. It is the time between two consecutive
  invocations of the next call on a specific scanner object. I increased
 the
  scanner timeout to 10 min on the region server and still I keep seeing
 the
  timeouts. So I reduced my scanner cache to 128.
 
  Full table scan takes 130 seconds and there are 2.2 million rows in the
  table as of now. Each row is around 2 KB in size. I measured time for the
  full table scan by issuing `count` command from the hbase shell.
 
  I kind of understood the fix that you are specifying, but do I need to
  change the table structure to fix this problem? All I do is a n^2
 operation
  and even that fails with 10 different types of exceptions. It is mildly
  annoying that I need to know all the low level storage details of HBase
 to
  do such a simple operation. And this is happening for just 14 parallel
  scanners. I am wondering what would happen when there are thousands of
  parallel scanners.
 
  Please let me know if there is any configuration param change which would
  fix this issue.
 
  Thanks a lot
  Narendra
 
  On Thu, Apr 19, 2012 at 4:40 PM, Michel Segel michael_se...@hotmail.com
 wrote:
 
  So in your step 2 you have the following:
  FOREACH row IN TABLE alpha:
 SELECT something
 FROM TABLE alpha
 WHERE alpha.url = row.url
 
  Right?
  And you are wondering why you are getting timeouts?
  ...
  ...
  And how long does it take to do a full table scan? ;-)
  (there's more, but that's the first thing you should see...)
 
  Try creating a second table where you invert the URL and key pair such
  that for each URL, you have a set of your alpha table's keys?
 
  Then you have the following...
  FOREACH row IN TABLE alpha:
 

Re: HBase parallel scanner performance

2012-04-19 Thread Narendra yadala
Hi Jieshan

HBase version: 0.90.4-cdh3u3
The size of each KeyValue pair should not be more than 2 KB.
I changed the GC parameters on the server side. I have not looked into the GC
logs yet, but I have noticed that it is pausing the batch process every now and
then. How do I look at the server GC logs?

Thanks
Narendra

On Thu, Apr 19, 2012 at 7:46 PM, Bijieshan bijies...@huawei.com wrote:

 Hi Narendra,

 I have a few doubts:

 1. Which version you are using?
 2. What's the size of each KeyValue?
 3. Did you change the GC parameters in client side or server side? After
 changing the GC parameters, did you keep an eye on the GC logs?

 Thank you.

 Regards,
 Jieshan

 -Original Message-
 From: Narendra yadala [mailto:narendra.yad...@gmail.com]
 Sent: Thursday, April 19, 2012 8:04 PM
 To: user@hbase.apache.org
 Subject: Re: HBase parallel scanner performance

 Hi Michel

 Yes, that is exactly what I do in step 2. I am aware of the reason for the
 scanner timeout exceptions. It is the time between two consecutive
 invocations of the next call on a specific scanner object. I increased the
 scanner timeout to 10 min on the region server and still I keep seeing the
 timeouts. So I reduced my scanner cache to 128.

 Full table scan takes 130 seconds and there are 2.2 million rows in the
 table as of now. Each row is around 2 KB in size. I measured time for the
 full table scan by issuing `count` command from the hbase shell.

 I kind of understood the fix that you are specifying, but do I need to
 change the table structure to fix this problem? All I do is a n^2 operation
 and even that fails with 10 different types of exceptions. It is mildly
 annoying that I need to know all the low level storage details of HBase to
 do such a simple operation. And this is happening for just 14 parallel
 scanners. I am wondering what would happen when there are thousands of
 parallel scanners.

 Please let me know if there is any configuration param change which would
 fix this issue.

 Thanks a lot
 Narendra

 On Thu, Apr 19, 2012 at 4:40 PM, Michel Segel michael_se...@hotmail.com
 wrote:

  So in your step 2 you have the following:
  FOREACH row IN TABLE alpha:
  SELECT something
  FROM TABLE alpha
  WHERE alpha.url = row.url
 
  Right?
  And you are wondering why you are getting timeouts?
  ...
  ...
  And how long does it take to do a full table scan? ;-)
  (there's more, but that's the first thing you should see...)
 
  Try creating a second table where you invert the URL and key pair such
  that for each URL, you have a set of your alpha table's keys?
 
  Then you have the following...
  FOREACH row IN TABLE alpha:
FETCH key-set FROM beta
WHERE beta.rowkey = alpha.url
 
  Note I use FETCH to signify that you should get a single row in response.
 
  Does this make sense?
  ( your second table is actually and index of the URL column in your first
  table)
 
  HTH
 
  Sent from a remote device. Please excuse any typos...
 
  Mike Segel
 
  On Apr 19, 2012, at 5:43 AM, Narendra yadala narendra.yad...@gmail.com
  wrote:
 
   I have an issue with my HBase cluster. We have a 4 node HBase/Hadoop
  (4*32
   GB RAM and 4*6 TB disk space) cluster. We are using Cloudera
 distribution
   for maintaining our cluster. I have a single tweets table in which we
  store
   the tweets, one tweet per row (it has millions of rows currently).
  
   Now I try to run a Java batch (not a map reduce) which does the
  following :
  
 1. Open a scanner over the tweet table and read the tweets one after
 another. I set scanner caching to 128 rows as higher scanner caching
 is
 leading to ScannerTimeoutExceptions. I scan over the first 10k rows
  only.
 2. For each tweet, extract URLs (linkcolfamily:urlvalue) that are
 there
 in that tweet and open another scanner over the tweets table to see
 who
 else shared that link. This involves getting rows having that URL
 from
  the
 entire table (not first 10k rows).
 3. Do similar stuff as in step 2 for hashtags
 (hashtagcolfamily:hashtagvalue).
 4. Do steps 1-3 in parallel for approximately 7-8 threads. This
 number
 can be higher (thousands also) later.
  
  
   When I run this batch I got the GC issue which is specified here
  
 
 http://www.cloudera.com/blog/2011/02/avoiding-full-gcs-in-hbase-with-memstore-local-allocation-buffers-part-1/
   Then I tried to turn on the MSLAB feature and changed the GC settings
 by
   specifying  -XX:+UseParNewGC  and  -XX:+UseConcMarkSweepGC JVM flags.
   Even after doing this, I am running into all kinds of IOExceptions
   and SocketTimeoutExceptions.
  
   This Java batch opens approximately 7*2 (14) scanners open at a point
 in
   time and still I am running into all kinds of troubles. I am wondering
   whether I can have thousands of parallel scanners with HBase when I
 need
  to
   scale.
  
   It would be great to know whether I can open thousands/millions of
  scanners
   in parallel with HBase

Re: HBase parallel scanner performance

2012-04-19 Thread Michael Segel
Narendra, 

I think you are still missing the point. 
It takes 130 seconds to scan the table per iteration. 
Even if you have only 10K rows, that is 
130 * 10^4 = 1.3*10^6 seconds, i.e. ~361 hours.

Compare that to 10K rows where you then select a single row in your sub-select 
that has a list of all of the associated rows. 
You can then do n get()s based on the data in the index (if the 
data wasn't in the index itself).

Assuming that the data was in the index, that's one get(). This is sub-second. 
Just to keep things simple, assume 1 second. 
That's 10K seconds vs 1.3 million seconds (2 hours vs 361 hours). 
Actually it's more like 10ms, so it's 100 seconds to run your code (so it's like 
2 minutes or so). 

Also since you're doing less work, you put less strain on the system.

Look, you're asking for help. You're fighting to maintain a bad design. 
Building the index table shouldn't take you more than a day to think, design 
and implement. 

So you tell me, 2 minutes vs 361 hours. Which would you choose?

HTH

-Mike


On Apr 19, 2012, at 10:04 AM, Narendra yadala wrote:

 Michael,
 
 Thanks for the response. This is a real problem and not a class project.
 Boxes itself costed 9k ;)
 
 I think there is some difference in understanding of the problem. The table
 has 2m rows but I am looking at the latest 10k rows only in the outer for
 loop. Only in the inner for loop i am trying to get all rows that contain
 the url that is given by the row in the outer for loop. So pseudo code is
 like this
 
 All scanners have a caching of 128.
 
 ResultScanner outerScanner = tweetTable.getScanner(new Scan()); // this gets the entire row
 for (int index = 0; index < 10000; index++) {
   Result tweet = outerScanner.next();
   NavigableMap<byte[], byte[]> linkFamilyMap =
       tweet.getFamilyMap(Bytes.toBytes("link"));
   String url = Bytes.toString(linkFamilyMap.firstKey()); // assuming only one link is there in the tweet
   Scan linkScan = new Scan();
   linkScan.addColumn(Bytes.toBytes("link"), Bytes.toBytes(url)); // get only the link column family
   ResultScanner linkScanner = tweetTable.getScanner(linkScan); // ideally this for loop is taking 2 sec per scan
   for (Result linkResult = linkScanner.next(); linkResult != null;
        linkResult = linkScanner.next()) {
     // do something with the link
   }
   linkScanner.close();

   // do a similar for loop for hashtags
 }
 
 Each of my inner for loop is taking around 20 seconds (or more depending on
 number of rows returned by that particular scanner) for each of the 10k
 rows that I am processing and this is also triggering a lot of GC in turn.
 So it is 1*40 seconds (4 days) for each thread. But the problem is that
 the batch process crashes before completion throwing IOException and
 SocketTimeoutException and sometimes GC OutOfMemory exceptions.
 
 I will definitely take the much elegant approach that you mentioned
 eventually. I just wanted to get to the core of the issue before choosing
 the solution.
 
 Thanks again.
 Narendra
 
 On Thu, Apr 19, 2012 at 7:42 PM, Michel Segel 
 michael_se...@hotmail.comwrote:
 
 Narendra,
 
 Are you trying to solve a real problem, or is this a class project?
 
 Your solution doesn't scale. It's a non starter. 130 seconds for each
 iteration times 1 million seconds is how long? 130 million seconds, which
 is ~36000 hours or over 4 years to complete.
 (the numbers are rough but you get the idea...)
 
 That's assuming that your table is static and doesn't change.
 
 I didn't even ask if you were attempting any sort of server side filtering
 which would reduce the amount of data you send back to the client because
 it a moot point.
 
 Finer tuning is also moot.
 
 So you insert a row in one table. You then do n^2 operations to pull out
 data.
 The better solution is to insert data into 2 tables where you then have to
 do 2n operations to get the same results. Thats per thread btw.  So if you
 were running 10 threads, you would have 10n^2  operations versus 20n
 operations to get the same result set.
 
 A million row table... 1*10^13. Vs 2*10^6
 
 I don't believe I mentioned anything about HBase's internals and this
 solution works for any NoSQL database.
 
 
 Sent from a remote device. Please excuse any typos...
 
 Mike Segel
 
 On Apr 19, 2012, at 7:03 AM, Narendra yadala narendra.yad...@gmail.com
 wrote:
 
 Hi Michel
 
 Yes, that is exactly what I do in step 2. I am aware of the reason for
 the
 scanner timeout exceptions. It is the time between two consecutive
 invocations of the next call on a specific scanner object. I increased
 the
 scanner timeout to 10 min on the region server and still I keep seeing
 the
 timeouts. So I reduced my scanner cache to 128.
 
 Full table scan takes 130 seconds and there are 2.2 million rows in the
 table as of now. Each row is around 2 KB in size. I measured time for the
 full table scan by issuing `count` command from the hbase shell.
 
 I kind of understood the fix that you are specifying, but do I need to
 change the table structure to fix 

RE: HBase parallel scanner performance

2012-04-19 Thread Bijieshan
Narendra,

Since I didn't see the client logs, full GC is one probable cause I suspect, 
no matter whether it happens on the client side or the server side. So I suggest checking the 
GC logs (enable GC logging on both the server and the client side) to see whether 
full GC happens with a high frequency, and check the stop-the-world pause times.

I don't think parallel scanners are the problem. 

Jieshan
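
On the earlier question about how to look at the GC logs: GC logging is normally switched on with standard HotSpot flags in the JVM options, for example in hbase-env.sh for the region servers (the variable name and log path below are the usual ones, but treat the exact line as a sketch):

  export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/var/log/hbase/gc-regionserver.log"

Adding the same flags to the client JVM produces the client-side log referred to above.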

-Original Message-
From: Narendra yadala [mailto:narendra.yad...@gmail.com] 
Sent: Thursday, April 19, 2012 11:24 PM
To: user@hbase.apache.org
Subject: Re: HBase parallel scanner performance

Hi Jieshan

HBase version : Version 0.90.4-cdh3u3
Size of Key Value pair should not be more than 2KB
I changed the GC parameters at the server side. I have not looked into GC
logs yet but I have noticed that it pausing the batch process every now and
then. How do I look at the server GC logs?

Thanks
Narendra

On Thu, Apr 19, 2012 at 7:46 PM, Bijieshan bijies...@huawei.com wrote:

 Hi Narendra,

 I have a few doubts:

 1. Which version you are using?
 2. What's the size of each KeyValue?
 3. Did you change the GC parameters in client side or server side? After
 changing the GC parameters, did you keep an eye on the GC logs?

 Thank you.

 Regards,
 Jieshan

 -Original Message-
 From: Narendra yadala [mailto:narendra.yad...@gmail.com]
 Sent: Thursday, April 19, 2012 8:04 PM
 To: user@hbase.apache.org
 Subject: Re: HBase parallel scanner performance

 Hi Michel

 Yes, that is exactly what I do in step 2. I am aware of the reason for the
 scanner timeout exceptions. It is the time between two consecutive
 invocations of the next call on a specific scanner object. I increased the
 scanner timeout to 10 min on the region server and still I keep seeing the
 timeouts. So I reduced my scanner cache to 128.

 Full table scan takes 130 seconds and there are 2.2 million rows in the
 table as of now. Each row is around 2 KB in size. I measured time for the
 full table scan by issuing `count` command from the hbase shell.

 I kind of understood the fix that you are specifying, but do I need to
 change the table structure to fix this problem? All I do is a n^2 operation
 and even that fails with 10 different types of exceptions. It is mildly
 annoying that I need to know all the low level storage details of HBase to
 do such a simple operation. And this is happening for just 14 parallel
 scanners. I am wondering what would happen when there are thousands of
 parallel scanners.

 Please let me know if there is any configuration param change which would
 fix this issue.

 Thanks a lot
 Narendra

 On Thu, Apr 19, 2012 at 4:40 PM, Michel Segel michael_se...@hotmail.com
 wrote:

  So in your step 2 you have the following:
  FOREACH row IN TABLE alpha:
  SELECT something
  FROM TABLE alpha
  WHERE alpha.url = row.url
 
  Right?
  And you are wondering why you are getting timeouts?
  ...
  ...
  And how long does it take to do a full table scan? ;-)
  (there's more, but that's the first thing you should see...)
 
  Try creating a second table where you invert the URL and key pair such
  that for each URL, you have a set of your alpha table's keys?
 
  Then you have the following...
  FOREACH row IN TABLE alpha:
FETCH key-set FROM beta
WHERE beta.rowkey = alpha.url
 
  Note I use FETCH to signify that you should get a single row in response.
 
  Does this make sense?
  ( your second table is actually and index of the URL column in your first
  table)
 
  HTH
 
  Sent from a remote device. Please excuse any typos...
 
  Mike Segel
 
  On Apr 19, 2012, at 5:43 AM, Narendra yadala narendra.yad...@gmail.com
  wrote:
 
   I have an issue with my HBase cluster. We have a 4 node HBase/Hadoop
  (4*32
   GB RAM and 4*6 TB disk space) cluster. We are using Cloudera
 distribution
   for maintaining our cluster. I have a single tweets table in which we
  store
   the tweets, one tweet per row (it has millions of rows currently).
  
   Now I try to run a Java batch (not a map reduce) which does the
  following :
  
 1. Open a scanner over the tweet table and read the tweets one after
 another. I set scanner caching to 128 rows as higher scanner caching
 is
 leading to ScannerTimeoutExceptions. I scan over the first 10k rows
  only.
 2. For each tweet, extract URLs (linkcolfamily:urlvalue) that are
 there
 in that tweet and open another scanner over the tweets table to see
 who
 else shared that link. This involves getting rows having that URL
 from
  the
 entire table (not first 10k rows).
 3. Do similar stuff as in step 2 for hashtags
 (hashtagcolfamily:hashtagvalue).
 4. Do steps 1-3 in parallel for approximately 7-8 threads. This
 number
 can be higher (thousands also) later.
  
  
   When I run this batch I got the GC issue which is specified here
  
 
 http://www.cloudera.com/blog/2011/02/avoiding-full-gcs-in-hbase-with-memstore-local-allocation-buffers-part-1/
   Then I tried to turn

Re: HBase parallel scanner performance

2012-04-19 Thread Narendra yadala
Michael,

I will do the redesign and build the index. Thanks a lot for the insights.

Narendra

On Thu, Apr 19, 2012 at 9:56 PM, Michael Segel michael_se...@hotmail.comwrote:

 Narendra,

 I think you are still missing the point.
 130 seconds to scan the table per iteration.
 Even if you have 10K rows
 130 * 10^4 or 1.3*10^6 seconds.  ~361 hours

 Compare that to 10K rows where you then select a single row in your sub
 select that has a list of all of the associated rows.
 You can then do  n number of get()s based on the data in the index. (If
 the data wasn't in the index itself)

 Assuming that the data was in the index, that's one get(). This is sub
 second.
 Just to keep things simple assume 1 second.
 That's 10K seconds vs 1.3 million seconds.  (2 hours vs 361hours)
 Actually its more like 10ms  so its 100 seconds to run your code.  (So its
 like 2 minutes or so)

 Also since you're doing less work, you put less strain on the system.

 Look, you're asking for help. You're fighting to maintain a bad design.
 Building the index table shouldn't take you more than a day to think,
 design and implement.

 So you tell me, 2 minutes vs 361 hours. Which would you choose?

 HTH

 -Mike


 On Apr 19, 2012, at 10:04 AM, Narendra yadala wrote:

  Michael,
 
  Thanks for the response. This is a real problem and not a class project.
  Boxes itself costed 9k ;)
 
  I think there is some difference in understanding of the problem. The
 table
  has 2m rows but I am looking at the latest 10k rows only in the outer for
  loop. Only in the inner for loop i am trying to get all rows that contain
  the url that is given by the row in the outer for loop. So pseudo code is
  like this
 
  All scanners have a caching of 128.
 
  ResultScanner outerScanner = tweetTable.getScanner(new Scan()); // this gets the entire row
  for (int index = 0; index < 10000; index++) {
    Result tweet = outerScanner.next();
    NavigableMap<byte[], byte[]> linkFamilyMap =
        tweet.getFamilyMap(Bytes.toBytes("link"));
    String url = Bytes.toString(linkFamilyMap.firstKey()); // assuming only one link is there in the tweet
    Scan linkScan = new Scan();
    linkScan.addColumn(Bytes.toBytes("link"), Bytes.toBytes(url)); // get only the link column family
    ResultScanner linkScanner = tweetTable.getScanner(linkScan); // ideally this for loop is taking 2 sec per scan
    for (Result linkResult = linkScanner.next(); linkResult != null;
         linkResult = linkScanner.next()) {
      // do something with the link
    }
    linkScanner.close();

    // do a similar for loop for hashtags
  }
 
  Each of my inner for loop is taking around 20 seconds (or more depending
 on
  number of rows returned by that particular scanner) for each of the 10k
  rows that I am processing and this is also triggering a lot of GC in
 turn.
  So it is 1*40 seconds (4 days) for each thread. But the problem is
 that
  the batch process crashes before completion throwing IOException and
  SocketTimeoutException and sometimes GC OutOfMemory exceptions.
 
  I will definitely take the much elegant approach that you mentioned
  eventually. I just wanted to get to the core of the issue before choosing
  the solution.
 
  Thanks again.
  Narendra
 
  On Thu, Apr 19, 2012 at 7:42 PM, Michel Segel michael_se...@hotmail.com
 wrote:
 
  Narendra,
 
  Are you trying to solve a real problem, or is this a class project?
 
  Your solution doesn't scale. It's a non starter. 130 seconds for each
  iteration times 1 million seconds is how long? 130 million seconds,
 which
  is ~36000 hours or over 4 years to complete.
  (the numbers are rough but you get the idea...)
 
  That's assuming that your table is static and doesn't change.
 
  I didn't even ask if you were attempting any sort of server side
 filtering
  which would reduce the amount of data you send back to the client
 because
  it a moot point.
 
  Finer tuning is also moot.
 
  So you insert a row in one table. You then do n^2 operations to pull out
  data.
  The better solution is to insert data into 2 tables where you then have
 to
  do 2n operations to get the same results. Thats per thread btw.  So if
 you
  were running 10 threads, you would have 10n^2  operations versus 20n
  operations to get the same result set.
 
  A million row table... 1*10^13. Vs 2*10^6
 
  I don't believe I mentioned anything about HBase's internals and this
  solution works for any NoSQL database.
 
 
  Sent from a remote device. Please excuse any typos...
 
  Mike Segel
 
  On Apr 19, 2012, at 7:03 AM, Narendra yadala narendra.yad...@gmail.com
 
  wrote:
 
  Hi Michel
 
  Yes, that is exactly what I do in step 2. I am aware of the reason for
  the
  scanner timeout exceptions. It is the time between two consecutive
  invocations of the next call on a specific scanner object. I increased
  the
  scanner timeout to 10 min on the region server and still I keep seeing
  the
  timeouts. So I reduced my scanner cache to 128.
 
  Full table scan takes 130 seconds and there are 2.2 

Re: HBase parallel scanner performance

2012-04-19 Thread Michael Segel
No problem. 

One of the hardest things to do is to try to be open to other design ideas and 
not become wedded to one.

I think once you get that working you can start to look at your cluster. 

On Apr 19, 2012, at 1:26 PM, Narendra yadala wrote:

 Michael,
 
 I will do the redesign and build the index. Thanks a lot for the insights.
 
 Narendra
 
 On Thu, Apr 19, 2012 at 9:56 PM, Michael Segel 
 michael_se...@hotmail.comwrote:
 
 Narendra,
 
 I think you are still missing the point.
 130 seconds to scan the table per iteration.
 Even if you have 10K rows
 130 * 10^4 or 1.3*10^6 seconds.  ~361 hours
 
 Compare that to 10K rows where you then select a single row in your sub
 select that has a list of all of the associated rows.
 You can then do  n number of get()s based on the data in the index. (If
 the data wasn't in the index itself)
 
 Assuming that the data was in the index, that's one get(). This is sub
 second.
 Just to keep things simple assume 1 second.
 That's 10K seconds vs 1.3 million seconds.  (2 hours vs 361hours)
 Actually its more like 10ms  so its 100 seconds to run your code.  (So its
 like 2 minutes or so)
 
 Also since you're doing less work, you put less strain on the system.
 
 Look, you're asking for help. You're fighting to maintain a bad design.
 Building the index table shouldn't take you more than a day to think,
 design and implement.
 
 So you tell me, 2 minutes vs 361 hours. Which would you choose?
 
 HTH
 
 -Mike
 
 
 On Apr 19, 2012, at 10:04 AM, Narendra yadala wrote:
 
 Michael,
 
 Thanks for the response. This is a real problem and not a class project.
 Boxes itself costed 9k ;)
 
 I think there is some difference in understanding of the problem. The
 table
 has 2m rows but I am looking at the latest 10k rows only in the outer for
 loop. Only in the inner for loop i am trying to get all rows that contain
 the url that is given by the row in the outer for loop. So pseudo code is
 like this
 
 All scanners have a caching of 128.
 
 ResultScanner outerScanner = tweetTable.getScanner(new Scan()); // this gets the entire row
 for (int index = 0; index < 10000; index++) {
   Result tweet = outerScanner.next();
   NavigableMap<byte[], byte[]> linkFamilyMap =
       tweet.getFamilyMap(Bytes.toBytes("link"));
   String url = Bytes.toString(linkFamilyMap.firstKey()); // assuming only one link is there in the tweet
   Scan linkScan = new Scan();
   linkScan.addColumn(Bytes.toBytes("link"), Bytes.toBytes(url)); // get only the link column family
   ResultScanner linkScanner = tweetTable.getScanner(linkScan); // ideally this for loop is taking 2 sec per scan
   for (Result linkResult = linkScanner.next(); linkResult != null;
        linkResult = linkScanner.next()) {
     // do something with the link
   }
   linkScanner.close();

   // do a similar for loop for hashtags
 }
 
 Each of my inner for loop is taking around 20 seconds (or more depending
 on
 number of rows returned by that particular scanner) for each of the 10k
 rows that I am processing and this is also triggering a lot of GC in
 turn.
 So it is 1*40 seconds (4 days) for each thread. But the problem is
 that
 the batch process crashes before completion throwing IOException and
 SocketTimeoutException and sometimes GC OutOfMemory exceptions.
 
 I will definitely take the much elegant approach that you mentioned
 eventually. I just wanted to get to the core of the issue before choosing
 the solution.
 
 Thanks again.
 Narendra
 
 On Thu, Apr 19, 2012 at 7:42 PM, Michel Segel michael_se...@hotmail.com
 wrote:
 
 Narendra,
 
 Are you trying to solve a real problem, or is this a class project?
 
 Your solution doesn't scale. It's a non starter. 130 seconds for each
 iteration times 1 million seconds is how long? 130 million seconds,
 which
 is ~36000 hours or over 4 years to complete.
 (the numbers are rough but you get the idea...)
 
 That's assuming that your table is static and doesn't change.
 
 I didn't even ask if you were attempting any sort of server side
 filtering
 which would reduce the amount of data you send back to the client
 because
 it a moot point.
 
 Finer tuning is also moot.
 
 So you insert a row in one table. You then do n^2 operations to pull out
 data.
 The better solution is to insert data into 2 tables where you then have
 to
 do 2n operations to get the same results. Thats per thread btw.  So if
 you
 were running 10 threads, you would have 10n^2  operations versus 20n
 operations to get the same result set.
 
 A million row table... 1*10^13. Vs 2*10^6
 
 I don't believe I mentioned anything about HBase's internals and this
 solution works for any NoSQL database.
 
 
 Sent from a remote device. Please excuse any typos...
 
 Mike Segel
 
 On Apr 19, 2012, at 7:03 AM, Narendra yadala narendra.yad...@gmail.com
 
 wrote:
 
 Hi Michel
 
 Yes, that is exactly what I do in step 2. I am aware of the reason for
 the
 scanner timeout exceptions. It is the time between two consecutive
 invocations of the next call on a specific scanner object. I