Thanks Richard :)

On 20 February 2017 at 18:56, Richard Startin <richardstar...@outlook.com> wrote:
> RegionLocator is not deprecated, hence the suggestion to use it if it's
> available in place of whatever is still available on HTable for your
> version of HBase - it will make upgrades easier. For instance,
> HTable::getRegionsInRange no longer exists on the current master branch.
>
> "I am trying to scan a region in parallel :)"
>
> I thought you were asking about scanning many regions at the same time,
> not scanning a single region in parallel? HBASE-1935 is about
> parallelising scans over regions, not within regions.
>
> If you want to parallelise within a region, you could write a little
> method to split the first and last key of the region into several
> disjoint lexicographic buckets and create a scan for each bucket, then
> execute those scans in parallel. Your data probably doesn't distribute
> uniformly over lexicographic buckets, though, so the scans are unlikely
> to execute at a constant rate, and you'll get results in time
> proportional to the lexicographic bucket with the highest cardinality in
> the region. I'd be interested to know if anyone on the list has ever
> tried this and what the results were.
>
> Using the much simpler approach of parallelising over regions by creating
> multiple disjoint scans client side, as suggested, your performance now
> depends on your regions, which you have some control over. You can
> achieve the same effect by pre-splitting your table such that you
> empirically optimise read performance for the dataset you store.
>
> Thanks,
> Richard
>
> ________________________________
> From: Anil <anilk...@gmail.com>
> Sent: 20 February 2017 12:35
> To: user@hbase.apache.org
> Subject: Re: Parallel Scanner
>
> Thanks Richard.
>
> I am able to get the regions for the data to be loaded from the table.
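Richard's idea of splitting a region's key range into disjoint lexicographic buckets can be sketched without any HBase dependency by treating fixed-width row keys as unsigned integers and interpolating split points. This is only an illustration under the assumption of fixed-width (here 8-byte) keys; the class and method names are made up for the sketch and are not HBase API:

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch: split a region's [startKey, stopKey) range into n disjoint
// lexicographic buckets, each of which could back its own Scan.
public class KeySplitter {

    static final int PAD = 8; // compare keys as fixed-width 8-byte values

    // Interpret a row key, right-padded with zero bytes (the lexicographic
    // minimum extension), as an unsigned integer. Keys longer than PAD
    // bytes are truncated in this sketch.
    static BigInteger toNumber(byte[] key) {
        return new BigInteger(1, Arrays.copyOf(key, PAD));
    }

    // Convert back to a fixed-width big-endian key; fixed width preserves
    // numeric order as lexicographic byte order.
    static byte[] toKey(BigInteger n) {
        byte[] raw = n.toByteArray();
        byte[] key = new byte[PAD];
        int src = Math.max(0, raw.length - PAD); // drop a possible leading zero byte
        System.arraycopy(raw, src, key, PAD - (raw.length - src), raw.length - src);
        return key;
    }

    // Returns n+1 boundaries b0..bn with b0 = start and bn = stop;
    // bucket i is the half-open range [b_i, b_i+1).
    public static List<byte[]> split(byte[] start, byte[] stop, int n) {
        BigInteger lo = toNumber(start);
        BigInteger width = toNumber(stop).subtract(lo);
        List<byte[]> bounds = new ArrayList<>();
        for (int i = 0; i <= n; i++) {
            bounds.add(toKey(lo.add(width.multiply(BigInteger.valueOf(i))
                                         .divide(BigInteger.valueOf(n)))));
        }
        return bounds;
    }
}
```

Each consecutive pair of boundaries would then back one scan running on its own thread. As Richard notes above, real keys rarely distribute uniformly, so the buckets will not finish at the same time.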
> I am trying to scan a region in parallel :)
>
> Thanks
>
> On 20 February 2017 at 16:44, Richard Startin <richardstar...@outlook.com> wrote:
>
> > For a client-only solution, have you looked at the RegionLocator
> > interface? It gives you a list of pairs of byte[] (the start and stop
> > keys for each region). You can easily use a ForkJoinPool recursive task
> > or a Java 8 parallel stream over that list. I implemented a Spark RDD to
> > do that and wrote about it with code samples here:
> >
> > https://richardstartin.com/2016/11/07/co-locating-spark-partitions-with-hbase-regions/
> >
> > Forget about the Spark details in the post (and forget that Hortonworks
> > have a library to do the same thing :)) - the idea of creating one scan
> > per region and setting scan starts and stops from the region locator
> > would give you a parallel scan. Note you can also group the scans by
> > region server.
> >
> > Cheers,
> > Richard
> >
> > On 20 Feb 2017, at 07:33, Anil <anilk...@gmail.com> wrote:
> >
> > Thanks Ram. I will look into Endpoints.
> >
> > On 20 February 2017 at 12:29, ramkrishna vasudevan
> > <ramkrishna.s.vasude...@gmail.com> wrote:
> >
> > Yes, there is a way.
> >
> > Have you seen Endpoints? Endpoints are trigger-like points that allow
> > your client to invoke them in parallel in one or more regions using the
> > start and end key of the region. They execute in parallel, and then you
> > may have to sort the results as your use case requires.
> >
> > But these endpoints have to be running on your region servers, so this
> > is not a client-only solution. See
> > https://blogs.apache.org/hbase/entry/coprocessor_introduction
> >
> > Be careful when you use them. Since these endpoints run on the server,
> > ensure that they are not heavy and do not consume too much memory, which
> > could have adverse effects on the server.
> >
> > Regards
> > Ram
> >
> > On Mon, Feb 20, 2017 at 12:18 PM, Anil <anilk...@gmail.com> wrote:
> >
> > Thanks Ram.
> >
> > So, you mean that there is no harm in using HTable#getRegionsInRange in
> > the application code.
> >
> > HTable#getRegionsInRange returned a single entry for all my region start
> > keys and end keys. I need to explore this further.
> >
> > "If you know the table region's start and end keys you could create
> > parallel scans in your application code." - is there any way to scan a
> > region in the application code other than the one I put in the original
> > email?
> >
> > "One thing to watch out is that if there is a split in the region then
> > this start and end row may change so in that case it is better you try
> > to get the regions every time before you issue a scan"
> > - Agreed. I am dynamically determining the region start key and end key
> > before initiating scan operations for every initial load.
> >
> > Thanks.
> >
> > On 20 February 2017 at 10:59, ramkrishna vasudevan
> > <ramkrishna.s.vasude...@gmail.com> wrote:
> >
> > Hi Anil,
> >
> > HBase directly does not provide parallel scans. If you know the table
> > region's start and end keys, you could create parallel scans in your
> > application code.
> >
> > In the above code snippet, the intent is right - you get the required
> > regions and can issue parallel scans from your app.
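The client-side pattern Richard and Ram both describe (one disjoint scan per region, executed in parallel, results merged) can be skeletonised with plain java.util.concurrent. In this sketch the actual per-region work, which in real code would build a Scan with the region's start/stop rows and drain a ResultScanner, is abstracted as a function so that the skeleton runs without a cluster; the class and parameter names are illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Sketch: execute one task per region boundary pair on a fixed-size pool
// and collect the results in region order.
public class ParallelRegionScan {

    public static <R> List<R> scanAll(List<byte[][]> regionRanges,
                                      Function<byte[][], R> scanOneRegion,
                                      int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<R>> futures = new ArrayList<>();
            for (byte[][] range : regionRanges) {
                // range[0] is the region start key, range[1] the stop key
                futures.add(pool.submit(() -> scanOneRegion.apply(range)));
            }
            List<R> results = new ArrayList<>();
            for (Future<R> f : futures) {
                results.add(f.get()); // blocks; preserves region order
            }
            return results;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

As Richard notes, the same shape falls out of a Java 8 parallel stream over the region list; an explicit pool just makes the degree of parallelism and per-region task boundaries visible.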
> > One thing to watch out for is that if there is a split in the region,
> > the start and end rows may change, so it is better to fetch the regions
> > again every time before you issue a scan. Does that make sense to you?
> >
> > Regards
> > Ram
> >
> > On Sat, Feb 18, 2017 at 1:44 PM, Anil <anilk...@gmail.com> wrote:
> >
> > Hi,
> >
> > I am building a use case where I have to load HBase data into an
> > in-memory database (IMDB). I am scanning each region and loading its
> > data into the IMDB.
> >
> > I am looking at parallel scanners
> > (https://issues.apache.org/jira/browse/HBASE-8504, HBASE-1935) to reduce
> > the load time, but HTable#getRegionsInRange(byte[] startKey, byte[]
> > endKey, boolean reload) is deprecated and HBASE-1935 is still open.
> >
> > I see that the Connection from ConnectionFactory is
> > HConnectionImplementation by default and creates HTable instances.
> >
> > Do you see any issues in using HTable from the Table instance?
> >
> > for each region {
> >     int i = 0;
> >     List<HRegionLocation> regions =
> >         hTable.getRegionsInRange(scans.getStartRow(), scans.getStopRow(), true);
> >
> >     for (HRegionLocation region : regions) {
> >         startRow = i == 0 ? scans.getStartRow()
> >                           : region.getRegionInfo().getStartKey();
> >         i++;
> >         endRow = i == regions.size() ? scans.getStopRow()
> >                                      : region.getRegionInfo().getEndKey();
> >     }
> > }
> >
> > Are there any alternatives to achieve a parallel scan? Thanks.
> >
> > Thanks
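For what it's worth, the inner loop in the snippet above overwrites startRow and endRow on every iteration (so only the last pair survives) and increments i before the size check. A corrected sketch of what the loop appears to intend, producing one clamped (startRow, stopRow) pair per region, with region boundaries as plain byte[] pairs rather than HRegionLocation so it runs without HBase on the classpath:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: given the regions overlapping a requested scan range (in key
// order), derive one scan range per region, clamping the first region's
// start and the last region's stop to the caller's range.
public class RegionRanges {

    // regions: list of {regionStartKey, regionEndKey} pairs, in key order.
    public static List<byte[][]> scanRanges(List<byte[][]> regions,
                                            byte[] scanStart, byte[] scanStop) {
        List<byte[][]> ranges = new ArrayList<>();
        for (int i = 0; i < regions.size(); i++) {
            byte[] start = (i == 0) ? scanStart : regions.get(i)[0];
            byte[] stop = (i == regions.size() - 1) ? scanStop : regions.get(i)[1];
            ranges.add(new byte[][]{start, stop});
        }
        return ranges;
    }
}
```

Each resulting pair would feed one Scan, and, per Ram's caveat, the region list should be re-fetched before each load in case a region has split in the meantime.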