Inline...

On Sun, Feb 3, 2013 at 9:25 AM, Toby Lazar <[email protected]> wrote:

> Quick question - if you perform the pagination client-side and just
> call scanner.iterator().next() to get to the necessary results,
> doesn't this add unnecessary network traffic of the unused results?

Anil: It depends on the solution. If 95% of your scans are limited to a
single region then there won't be unnecessary network I/O.

> If you want results 100-120, does the client need to first read
> results 1-100 over the network?

Anil: If you do a simple scan and you want results 100-120 then I would
say yes. You might be able to fetch only 100-120 by using a pagination
filter or by writing a custom filter or coprocessor. As I mentioned
earlier in this thread, we won't be allowing the user to jump to 100-120
directly. The user first needs to go through results 1-100. Hence, I
will know the rowkey of the 100th result, and that rowkey will become my
startKey for results 100-120. So, no unnecessary network I/O.

> Couldn't a filter help prevent some of that unneeded traffic? Or, is
> the data only transferred when inspecting the result object?

Anil: Filters might help reduce unnecessary traffic. It all depends on
your use case.

> Thanks,
>
> Toby
>
> On Sun, Feb 3, 2013 at 11:07 AM, Anoop John <[email protected]> wrote:
>
> > > lets say for a scan setCaching is 10 and scan is done across two
> > > regions. 9 Results (satisfying the filter) are in Region1 and 10
> > > Results (satisfying the filter) are in Region2. Then will this
> > > scan return 19 (9+10) results?
> >
> > @Anil.
> > No, it will return 10 results only, not 19. The client here takes
> > into account the no. of results it got from the previous region. But
> > a filter is different. The filter has no logic to run at the client
> > side. It is fully executed at the server side. This is the way it is
> > designed. Personally I would prefer to do the pagination in the app
> > alone, using a plain scan with caching (to avoid so many RPCs) and
> > app level logic.
> >
> > -Anoop-
> >
> > On Sat, Feb 2, 2013 at 1:32 PM, anil gupta <[email protected]> wrote:
> >
> > > Hi Anoop,
> > >
> > > Please find my reply inline.
> > >
> > > Thanks,
> > > Anil
> > >
> > > On Wed, Jan 30, 2013 at 3:31 AM, Anoop Sam John <[email protected]>
> > > wrote:
> > >
> > > > @Anil
> > > >
> > > > > I could not understand why it goes to multiple regionservers
> > > > > in parallel. Why can it not guarantee results <= page size
> > > > > (my guess: due to multiple RS scans)? If you have used it then
> > > > > maybe you can explain the behaviour?
> > > >
> > > > A scan from the client side never goes to multiple RS in
> > > > parallel. A scan from the HTable API will be sequential, one
> > > > region after the other. For every region it will open up a
> > > > scanner in the RS and do next() calls. The filter will be
> > > > instantiated at the server side, per region ...
> > > >
> > > > When you need 100 rows in the page and you create a Scan at the
> > > > client side with the filter, and suppose there are 2 regions:
> > > > first the scanner is opened for region1 and the scan happens. It
> > > > will ensure that max 100 rows will be retrieved from that
> > > > region. But when the region boundary is crossed and the client
> > > > automatically opens up a scanner for region2, there also it will
> > > > pass the filter with max 100 rows, and so from there also max
> > > > 100 rows can come. So overall, at the client side we cannot
> > > > guarantee that the scan created will only scan 100 rows as a
> > > > whole from the table.
> > >
> > > I agree with other people on this email chain that the 2nd region
> > > should only return (100 - no. of rows returned by Region1), if
> > > possible.
> > >
> > > When the region boundary is crossed and the client automatically
> > > opens up a scanner for region2, why doesn't the scanner in Region2
> > > know that some of the rows are already fetched by region1?
> > > Do you mean to say that by default, for a scan spanning multiple
> > > regions, every region has its own count of the no. of rows that
> > > it's going to return? i.e. let's say for a scan setCaching is 10
> > > and the scan is done across two regions. 9 Results (satisfying the
> > > filter) are in Region1 and 10 Results (satisfying the filter) are
> > > in Region2. Then will this scan return 19 (9+10) results?
> > >
> > > > I think I am making it clear. I have not used PageFilter at
> > > > all.. I am just explaining as per my knowledge of the scan flow
> > > > and general filter usage.
> > > >
> > > > "This is because the filter is applied separately on different
> > > > region servers. It does however optimize the scan of individual
> > > > HRegions by making sure that the page size is never exceeded
> > > > locally. "
> > > >
> > > > I guess it needs to say "This is because the filter is applied
> > > > separately on different regions".
> > > >
> > > > -Anoop-
> > > >
> > > > ________________________________________
> > > > From: anil gupta [[email protected]]
> > > > Sent: Wednesday, January 30, 2013 1:33 PM
> > > > To: [email protected]
> > > > Subject: Re: Pagination with HBase - getting previous page of data
> > > >
> > > > Hi Mohammad,
> > > >
> > > > You are most welcome to join the discussion. I have never used
> > > > PageFilter so I don't really have concrete input.
> > > > I had a look at
> > > > http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/PageFilter.html
> > > > I could not understand why it goes to multiple regionservers in
> > > > parallel. Why can it not guarantee results <= page size (my
> > > > guess: due to multiple RS scans)? If you have used it then maybe
> > > > you can explain the behaviour?
> > > >
> > > > Thanks,
> > > > Anil
> > > >
> > > > On Tue, Jan 29, 2013 at 7:32 PM, Mohammad Tariq <[email protected]>
> > > > wrote:
> > > >
> > > > > I'm kinda hesitant to put my leg in between the pros ;) But,
> > > > > does it sound sane to use PageFilter for both rows and columns
> > > > > and have some additional logic to handle the 'nth' page
> > > > > logic? It'll help us in both kinds of paging.
> > > > >
> > > > > On Wednesday, January 30, 2013, Jean-Marc Spaggiari <
> > > > > [email protected]> wrote:
> > > > > > Hi Anil,
> > > > > >
> > > > > > I think it really depends on the way you want to use the
> > > > > > pagination.
> > > > > >
> > > > > > Do you need to be able to jump to page X? Are you ok if you
> > > > > > miss a line or 2? Is your data growing fast? Or slowly? Is
> > > > > > it ok if your page indexes are a day old? Do you need to
> > > > > > paginate over 300 columns? Or just 1? Do you need to always
> > > > > > have the exact same number of entries in each page?
> > > > > >
> > > > > > For my use case I need to be able to jump to page X and I
> > > > > > don't have any content. I have hundreds of millions of
> > > > > > lines. Only the rowkey matters for me and I'm fine if
> > > > > > sometimes I have 50 entries displayed, and sometimes only
> > > > > > 45. So I'm thinking about calculating which row is the first
> > > > > > one for each page, and storing that separately. Then I just
> > > > > > need to run the MR daily.
> > > > > >
> > > > > > It's not a perfect solution I agree, but this might do the
> > > > > > job for me. I'm totally open to any other ideas which might
> > > > > > do the job too.
> > > > > >
> > > > > > JM
> > > > > >
> > > > > > 2013/1/29, anil gupta <[email protected]>:
> > > > > >> Yes, your suggested solution only works on RowKey based
> > > > > >> pagination. It will fail when you start filtering on the
> > > > > >> basis of columns.
> > > > > >>
> > > > > >> Still, I would say it's comparatively easier to maintain
> > > > > >> this at Application level rather than creating tables for
> > > > > >> pagination.
> > > > > >>
> > > > > >> What if you have 300 columns in your schema? Will you
> > > > > >> create 300 tables? What about handling of pagination when
> > > > > >> filtering is done based on multiple columns ("and" and "or"
> > > > > >> conditions)?
> > > > > >>
> > > > > >> On Tue, Jan 29, 2013 at 1:08 PM, Jean-Marc Spaggiari <
> > > > > >> [email protected]> wrote:
> > > > > >>
> > > > > >>> No, no killer solution here ;)
> > > > > >>>
> > > > > >>> But I'm still thinking about that because I might have to
> > > > > >>> implement some pagination options soon...
> > > > > >>>
> > > > > >>> As you are saying, it's only working on the row-key, but
> > > > > >>> if you want to do the same thing on non-rowkey, you might
> > > > > >>> have to create a secondary index table...
> > > > > >>>
> > > > > >>> JM
> > > > > >>>
> > > > > >>> 2013/1/27, anil gupta <[email protected]>:
> > > > > >>> > That's alright.. I thought that you had come up with a
> > > > > >>> > killer solution. So, got curious to hear your ideas. ;)
> > > > > >>> > It seems like your below mentioned solution will not
> > > > > >>> > work on filtering on non row-key columns since when you
> > > > > >>> > are deciding the page numbers you are only considering
> > > > > >>> > rowkey.
> > > > > >>> >
> > > > > >>> > Thanks,
> > > > > >>> > Anil
> > > > > >>> >
> > > > > >>> > On Fri, Jan 25, 2013 at 6:58 PM, Jean-Marc Spaggiari <
> > > > > >>> > [email protected]> wrote:
> > > > > >>> >
> > > > > >>> >> Hi Anil,
> > > > > >>> >>
> > > > > >>> >> I don't have a solution. I never thought about that ;)
> > > > > >>> >> But I was thinking about something like you create a
> > > > > >>> >> 2nd table where you place the row number (4 bytes) then
> > > > > >>> >> the row key.
> > > > > >>> >> You go directly to a specific page, you query by the
> > > > > >>> >> number, find the key, and you know where to start your
> > > > > >>> >> scan in the main table.
> > > > > >>> >>
> > > > > >>> >> The issue is probably the numbering of each line, since
> > > > > >>> >> with a MR you don't know where you are from the
> > > > > >>> >> beginning. But you can build something where you store
> > > > > >>> >> the line number from the beginning of the region, then
> > > > > >>> >> when all regions are parsed you can reconstruct the
> > > > > >>> >> total numbering... That should work...
> > > > > >>> >>
> > > > > >>> >> JM
> > > > > >>> >>
> > > > > >>> >> 2013/1/25, anil gupta <[email protected]>:
> > > > > >>> >> > Inline...
> > > > > >>> >> >
> > > > > >>> >> > On Fri, Jan 25, 2013 at 9:17 AM, Jean-Marc Spaggiari <
> > > > > >>> >> > [email protected]> wrote:
> > > > > >>> >> >
> > > > > >>> >> >> Hi Anil,
> > > > > >>> >> >>
> > > > > >>> >> >> The issue is that all the other subsequent page
> > > > > >>> >> >> starts should be moved too...
> > > > > >>> >> >>
> > > > > >>> >> > Yes, this is a possibility. Hence the Developer has
> > > > > >>> >> > to take care of this case. It might also be possible
> > > > > >>> >> > that the pageSize is not a hard limit on the number
> > > > > >>> >> > of results (more like a hint or suggestion on size).
> > > > > >>> >> > I would say it varies by use case.
> > > > > >>> >> >
> > > > > >>> >> >>
> > > > > >>> >> >> so if you want to jump directly to page n, you might
> > > > > >>> >> >> be totally shifted because of all the data inserted
> > > > > >>> >> >> in the meantime...
> > > > > >>> >> >>
> > > > > >>> >> >> If you want a real complete pagination feature, you
> > > > > >>> >> >> might want to have a coprocessor or a MR updating
> > > > > >>> >> >> another table referring to the pages....
> > > > > >>> >> >>
> > > > > >>> >> > Well, the solution depends on the use case. I will be
> > > > > >>> >> > doing pagination
> > > > >
> > > > > --
> > > > > Warm Regards,
> > > > > Tariq
> > > > > https://mtariq.jux.com/
> > > > > cloudfront.blogspot.com
> > > >
> > > > --
> > > > Thanks & Regards,
> > > > Anil Gupta
> > >
> > > --
> > > Thanks & Regards,
> > > Anil Gupta

--
Thanks & Regards,
Anil Gupta
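
For anyone reading this thread later, here is a minimal sketch of the two
behaviours discussed above: paging by remembering the last rowkey of the
previous page, and a page filter that is applied per region so the client
can receive more than one page of rows. It is a toy in-memory model in
Python, not the HBase client API. REGIONS, scan() and page_filter_scan()
are hypothetical names invented for this illustration, and the two-region
layout is an assumption.

```python
from bisect import bisect_left, bisect_right

# Toy stand-in for a table (assumption): sorted row keys split into two "regions".
REGIONS = [
    [f"row{i:03d}" for i in range(0, 150)],    # region 1
    [f"row{i:03d}" for i in range(150, 300)],  # region 2
]


def scan(start_row=None, limit=20, inclusive=True):
    """Plain scan with app-level paging.

    Regions are visited one after the other (never in parallel), and the
    caller-side counter stops the scan once `limit` rows were collected.
    """
    out = []
    for region in REGIONS:
        if start_row is None:
            i = 0
        elif inclusive:
            i = bisect_left(region, start_row)
        else:
            i = bisect_right(region, start_row)  # skip start_row itself
        for key in region[i:]:
            if len(out) == limit:
                return out
            out.append(key)
    return out


def page_filter_scan(page_size, start_row=None):
    """Model of a page limit enforced separately in each region.

    Every region may contribute up to `page_size` rows, so the total can
    exceed `page_size` and the caller has to trim the result.
    """
    out = []
    for region in REGIONS:
        i = 0 if start_row is None else bisect_left(region, start_row)
        out.extend(region[i:i + page_size])
    return out


# The user first pages through results 1-100 ...
page1 = scan(limit=100)
# ... so the last rowkey seen is known and becomes the start of the next page.
page2 = scan(start_row=page1[-1], limit=20, inclusive=False)

# A page that crosses the region boundary still has exactly `limit` rows.
crossing = scan(start_row="row139", limit=20, inclusive=False)

# The per-region limit can hand back more rows than one page.
filtered = page_filter_scan(100)
trimmed = filtered[:100]
```

In this model page2 starts right after the last key of page1, so nothing
from results 1-100 is read again. As far as I know the start row of a real
HBase scan is inclusive, so with the real client you would either drop the
first returned row or start from the last key with a zero byte appended;
treat that detail as an assumption and check it against the client docs.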
