Thanks Richard :)

On 20 February 2017 at 18:56, Richard Startin <richardstar...@outlook.com> wrote:
> RegionLocator is not deprecated, hence the suggestion to use it if it's
> available in place of whatever is still available on HTable for your
> version of HBase - it will make upgrades easier. For instance,
> HTable::getRegionsInRange no longer exists on the current master branch.
>
> "I am trying to scan a region in parallel :)"
>
> I thought you were asking about scanning many regions at the same time,
> not scanning a single region in parallel? HBASE-1935 is about
> parallelising scans over regions, not within regions.
>
> If you want to parallelise within a region, you could write a little
> method to split the first and last key of the region into several
> disjoint lexicographic buckets and create a scan for each bucket, then
> execute those scans in parallel. Your data probably doesn't distribute
> uniformly over lexicographic buckets, though, so the scans are unlikely
> to execute at a constant rate, and you'll get results in time
> proportional to the lexicographic bucket with the highest cardinality in
> the region. I'd be interested to know if anyone on the list has ever
> tried this and what the results were.
>
> Using the much simpler approach of parallelising over regions by creating
> multiple disjoint scans client side, as suggested, your performance now
> depends on your regions, which you have some control over. You can
> achieve the same effect by pre-splitting your table such that you
> empirically optimise read performance for the dataset you store.
>
> Thanks,
> Richard
>
> ________________________________
> From: Anil <anilk...@gmail.com>
> Sent: 20 February 2017 12:35
> To: user@hbase.apache.org
> Subject: Re: Parallel Scanner
>
> Thanks Richard.
>
> I am able to get the regions for the data to be loaded from the table.
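Richard's idea of splitting a region's key range into disjoint lexicographic buckets can be sketched without any HBase dependency by treating fixed-width row keys as unsigned integers and interpolating split points. This is only an illustration under the assumption of fixed-width (here 8-byte) keys; the class and method names are made up for the sketch and are not HBase API:

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch: split a region's [startKey, stopKey) range into n disjoint
// lexicographic buckets, each of which could back its own Scan.
public class KeySplitter {

    static final int PAD = 8; // compare keys as fixed-width 8-byte values

    // Interpret a row key, right-padded with zero bytes (the lexicographic
    // minimum extension), as an unsigned integer. Keys longer than PAD
    // bytes are truncated in this sketch.
    static BigInteger toNumber(byte[] key) {
        return new BigInteger(1, Arrays.copyOf(key, PAD));
    }

    // Convert back to a fixed-width big-endian key; fixed width preserves
    // numeric order as lexicographic byte order.
    static byte[] toKey(BigInteger n) {
        byte[] raw = n.toByteArray();
        byte[] key = new byte[PAD];
        int src = Math.max(0, raw.length - PAD); // drop a possible leading zero byte
        System.arraycopy(raw, src, key, PAD - (raw.length - src), raw.length - src);
        return key;
    }

    // Returns n+1 boundaries b0..bn with b0 = start and bn = stop;
    // bucket i is the half-open range [b_i, b_i+1).
    public static List<byte[]> split(byte[] start, byte[] stop, int n) {
        BigInteger lo = toNumber(start);
        BigInteger width = toNumber(stop).subtract(lo);
        List<byte[]> bounds = new ArrayList<>();
        for (int i = 0; i <= n; i++) {
            bounds.add(toKey(lo.add(width.multiply(BigInteger.valueOf(i))
                                         .divide(BigInteger.valueOf(n)))));
        }
        return bounds;
    }
}
```

Each consecutive pair of boundaries would then back one scan running on its own thread. As Richard notes above, real keys rarely distribute uniformly, so the buckets will not finish at the same time.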
> I am trying to scan a region in parallel :)
>
> Thanks
>
> On 20 February 2017 at 16:44, Richard Startin <richardstar...@outlook.com> wrote:
>
> > For a client-only solution, have you looked at the RegionLocator
> > interface? It gives you a list of pairs of byte[] (the start and stop
> > keys for each region). You can easily use a ForkJoinPool recursive task
> > or a Java 8 parallel stream over that list. I implemented a Spark RDD to
> > do that and wrote about it with code samples here:
> >
> > https://richardstartin.com/2016/11/07/co-locating-spark-partitions-with-hbase-regions/
> >
> > Forget about the Spark details in the post (and forget that Hortonworks
> > have a library to do the same thing :)) - the idea of creating one scan
> > per region and setting scan starts and stops from the region locator
> > would give you a parallel scan. Note you can also group the scans by
> > region server.
> >
> > Cheers,
> > Richard
> >
> > On 20 Feb 2017, at 07:33, Anil <anilk...@gmail.com> wrote:
> >
> > Thanks Ram. I will look into Endpoints.
> >
> > On 20 February 2017 at 12:29, ramkrishna vasudevan
> > <ramkrishna.s.vasude...@gmail.com> wrote:
> >
> > Yes, there is a way.
> >
> > Have you seen Endpoints? Endpoints are trigger-like points that allow
> > your client to invoke them in parallel in one or more regions using the
> > start and end key of the region. They execute in parallel, and then you
> > may have to sort the results as your use case requires.
> >
> > But these endpoints have to be running on your region servers, so this
> > is not a client-only solution. See
> > https://blogs.apache.org/hbase/entry/coprocessor_introduction
> >
> > Be careful when you use them. Since these endpoints run on the server,
> > ensure that they are not heavy and do not consume too much memory, which
> > could have adverse effects on the server.
> >
> > Regards
> > Ram
> >
> > On Mon, Feb 20, 2017 at 12:18 PM, Anil <anilk...@gmail.com> wrote:
> >
> > Thanks Ram.
> >
> > So, you mean that there is no harm in using HTable#getRegionsInRange in
> > the application code.
> >
> > HTable#getRegionsInRange returned a single entry for all my region start
> > keys and end keys. I need to explore this further.
> >
> > "If you know the table region's start and end keys you could create
> > parallel scans in your application code." - is there any way to scan a
> > region in the application code other than the one I put in the original
> > email?
> >
> > "One thing to watch out is that if there is a split in the region then
> > this start and end row may change so in that case it is better you try
> > to get the regions every time before you issue a scan"
> > - Agreed. I am dynamically determining the region start key and end key
> > before initiating scan operations for every initial load.
> >
> > Thanks.
> >
> > On 20 February 2017 at 10:59, ramkrishna vasudevan
> > <ramkrishna.s.vasude...@gmail.com> wrote:
> >
> > Hi Anil,
> >
> > HBase directly does not provide parallel scans. If you know the table
> > region's start and end keys, you could create parallel scans in your
> > application code.
> >
> > In the above code snippet, the intent is right - you get the required
> > regions and can issue parallel scans from your app.
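The client-side pattern Richard and Ram both describe (one disjoint scan per region, executed in parallel, results merged) can be skeletonised with plain java.util.concurrent. In this sketch the actual per-region work, which in real code would build a Scan with the region's start/stop rows and drain a ResultScanner, is abstracted as a function so that the skeleton runs without a cluster; the class and parameter names are illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Sketch: execute one task per region boundary pair on a fixed-size pool
// and collect the results in region order.
public class ParallelRegionScan {

    public static <R> List<R> scanAll(List<byte[][]> regionRanges,
                                      Function<byte[][], R> scanOneRegion,
                                      int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<R>> futures = new ArrayList<>();
            for (byte[][] range : regionRanges) {
                // range[0] is the region start key, range[1] the stop key
                futures.add(pool.submit(() -> scanOneRegion.apply(range)));
            }
            List<R> results = new ArrayList<>();
            for (Future<R> f : futures) {
                results.add(f.get()); // blocks; preserves region order
            }
            return results;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

As Richard notes, the same shape falls out of a Java 8 parallel stream over the region list; an explicit pool just makes the degree of parallelism and per-region task boundaries visible.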
> > One thing to watch out for is that if there is a split in the region,
> > the start and end rows may change, so it is better to fetch the regions
> > again every time before you issue a scan. Does that make sense to you?
> >
> > Regards
> > Ram
> >
> > On Sat, Feb 18, 2017 at 1:44 PM, Anil <anilk...@gmail.com> wrote:
> >
> > Hi,
> >
> > I am building a use case where I have to load HBase data into an
> > in-memory database (IMDB). I am scanning each region and loading its
> > data into the IMDB.
> >
> > I am looking at parallel scanners
> > (https://issues.apache.org/jira/browse/HBASE-8504, HBASE-1935) to reduce
> > the load time, but HTable#getRegionsInRange(byte[] startKey, byte[]
> > endKey, boolean reload) is deprecated and HBASE-1935 is still open.
> >
> > I see that the Connection from ConnectionFactory is
> > HConnectionImplementation by default and creates HTable instances.
> >
> > Do you see any issues in using HTable from the Table instance?
> >
> > for each region {
> >     int i = 0;
> >     List<HRegionLocation> regions =
> >         hTable.getRegionsInRange(scans.getStartRow(), scans.getStopRow(), true);
> >
> >     for (HRegionLocation region : regions) {
> >         startRow = i == 0 ? scans.getStartRow()
> >                           : region.getRegionInfo().getStartKey();
> >         i++;
> >         endRow = i == regions.size() ? scans.getStopRow()
> >                                      : region.getRegionInfo().getEndKey();
> >     }
> > }
> >
> > Are there any alternatives to achieve a parallel scan? Thanks.
> >
> > Thanks
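For what it's worth, the inner loop in the snippet above overwrites startRow and endRow on every iteration (so only the last pair survives) and increments i before the size check. A corrected sketch of what the loop appears to intend, producing one clamped (startRow, stopRow) pair per region, with region boundaries as plain byte[] pairs rather than HRegionLocation so it runs without HBase on the classpath:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: given the regions overlapping a requested scan range (in key
// order), derive one scan range per region, clamping the first region's
// start and the last region's stop to the caller's range.
public class RegionRanges {

    // regions: list of {regionStartKey, regionEndKey} pairs, in key order.
    public static List<byte[][]> scanRanges(List<byte[][]> regions,
                                            byte[] scanStart, byte[] scanStop) {
        List<byte[][]> ranges = new ArrayList<>();
        for (int i = 0; i < regions.size(); i++) {
            byte[] start = (i == 0) ? scanStart : regions.get(i)[0];
            byte[] stop = (i == regions.size() - 1) ? scanStop : regions.get(i)[1];
            ranges.add(new byte[][]{start, stop});
        }
        return ranges;
    }
}
```

Each resulting pair would feed one Scan, and, per Ram's caveat, the region list should be re-fetched before each load in case a region has split in the meantime.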