Sounds like if you had 1000 regions, each with 99 rows, and you asked for 100, you'd get back 99,000. My guess is that a Filter is serialized once, sent successively to each region, and not updated between regions. I don't think updating it between regions would be too easy.
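To make the arithmetic concrete, here is a minimal sketch in plain Python. It is a simulation only, not HBase client code: `scan_region`, `scan_table` and `first_page` are hypothetical names, and the per-region limit is modelled as a simple slice, assuming (as guessed above) that each region gets a fresh filter with the same page size. It also shows the usual workaround of capping the row count on the client side.

```python
from typing import Iterator, List


def scan_region(rows: List[str], page_size: int) -> List[str]:
    """Stand-in for a region-level scan with a fresh page filter:
    the limit is enforced per region, with no memory of earlier regions."""
    return rows[:page_size]


def scan_table(regions: List[List[str]], page_size: int) -> Iterator[str]:
    """Visit regions one after another, handing every region the same,
    unchanged page size."""
    for region in regions:
        yield from scan_region(region, page_size)


def first_page(regions: List[List[str]], page_size: int) -> List[str]:
    """Client-side guard: stop consuming once page_size rows have arrived."""
    page: List[str] = []
    for row in scan_table(regions, page_size):
        page.append(row)
        if len(page) == page_size:
            break
    return page


# 1000 regions of 99 rows each, page size 100 (the case described above).
regions = [[f"r{r:04d}-{i:02d}" for i in range(99)] for r in range(1000)]
unbounded = list(scan_table(regions, 100))
page = first_page(regions, 100)
print(len(unbounded))  # 99000
print(len(page))       # 100
```

Without the client-side cap every region contributes up to its own limit, so the total grows with the number of regions; with the cap the caller simply stops reading after the 100th row.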
Toby

On 1/30/13, Jean-Marc Spaggiari <[email protected]> wrote:
> Hi Anoop,
>
> So does it mean the scanner can send back LIMIT*2-1 lines max? Reading
> 100 rows from the 2nd region is using extra time and resources. Why
> not ask for only the number of missing lines?
>
> JM
>
> 2013/1/30, Anoop Sam John <[email protected]>:
>> @Anil
>>
>> > I could not understand why it goes to multiple regionservers in
>> > parallel. Why can it not guarantee results <= page size (my guess: due to
>> > multiple RS scans)? If you have used it then maybe you can explain the
>> > behaviour?
>>
>> A scan from the client side never goes to multiple RS in parallel. A scan
>> from the HTable API is sequential, one region after the other. For every
>> region it opens a scanner in the RS and does next() calls. The filter is
>> instantiated at server side per region ...
>>
>> Say you need 100 rows in the page, you created a Scan at client side with
>> the filter, and there are 2 regions. First the scanner is opened for
>> region1 and the scan happens. It will ensure that at most 100 rows are
>> retrieved from that region. But when the region boundary is crossed and the
>> client automatically opens a scanner for region2, it passes the filter
>> with max 100 rows there too, so at most 100 rows can come from there as well.
>> So overall, at the client side we cannot guarantee that the scan created
>> will only scan 100 rows as a whole from the table.
>>
>> I hope I am making it clear. I have not used PageFilter at all. I am just
>> explaining as per my knowledge of the scan flow and general filter
>> usage.
>>
>> "This is because the filter is applied separately on different region
>> servers. It does however optimize the scan of individual HRegions by making
>> sure that the page size is never exceeded locally. "
>>
>> I guess it needs to say "This is because the filter is applied
>> separately on different regions".
>>
>> -Anoop-
>>
>> ________________________________________
>> From: anil gupta [[email protected]]
>> Sent: Wednesday, January 30, 2013 1:33 PM
>> To: [email protected]
>> Subject: Re: Pagination with HBase - getting previous page of data
>>
>> Hi Mohammad,
>>
>> You are most welcome to join the discussion. I have never used PageFilter
>> so I don't really have concrete input.
>> I had a look at
>> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/PageFilter.html
>> I could not understand why it goes to multiple regionservers in
>> parallel. Why can it not guarantee results <= page size (my guess: due to
>> multiple RS scans)? If you have used it then maybe you can explain the
>> behaviour?
>>
>> Thanks,
>> Anil
>>
>> On Tue, Jan 29, 2013 at 7:32 PM, Mohammad Tariq <[email protected]>
>> wrote:
>>
>>> I'm kinda hesitant to put my leg in between the pros ;) But does it sound
>>> sane to use PageFilter for both rows and columns, with some additional
>>> logic to handle the 'nth' page? It'll help us in both kinds of
>>> paging.
>>>
>>> On Wednesday, January 30, 2013, Jean-Marc Spaggiari <[email protected]>
>>> wrote:
>>> > Hi Anil,
>>> >
>>> > I think it really depends on the way you want to use the pagination.
>>> >
>>> > Do you need to be able to jump to page X? Are you ok if you miss a
>>> > line or 2? Is your data growing fast? Or slowly? Is it ok if your
>>> > page indexes are a day old? Do you need to paginate over 300 columns?
>>> > Or just 1? Do you need to always have the exact same number of entries
>>> > in each page?
>>> >
>>> > For my use case I need to be able to jump to page X and I don't
>>> > have any content. I have hundreds of millions of lines. Only the rowkey
>>> > matters for me and I'm fine if sometimes I have 50 entries displayed,
>>> > and sometimes only 45. So I'm thinking about calculating which row is
>>> > the first one for each page, and storing that separately. Then I just
>>> > need to run the MR daily.
>>> >
>>> > It's not a perfect solution I agree, but this might do the job for me.
>>> > I'm totally open to any other ideas which might do the job too.
>>> >
>>> > JM
>>> >
>>> > 2013/1/29, anil gupta <[email protected]>:
>>> >> Yes, your suggested solution only works on RowKey based pagination.
>>> >> It will fail when you start filtering on the basis of columns.
>>> >>
>>> >> Still, I would say it's comparatively easier to maintain this at
>>> >> application level rather than creating tables for pagination.
>>> >>
>>> >> What if you have 300 columns in your schema? Will you create 300
>>> >> tables?
>>> >> What about handling of pagination when filtering is done based on
>>> >> multiple columns ("and" and "or" conditions)?
>>> >>
>>> >> On Tue, Jan 29, 2013 at 1:08 PM, Jean-Marc Spaggiari <
>>> >> [email protected]> wrote:
>>> >>
>>> >>> No, no killer solution here ;)
>>> >>>
>>> >>> But I'm still thinking about that because I might have to implement
>>> >>> some pagination options soon...
>>> >>>
>>> >>> As you are saying, it only works on the row key, but if you want
>>> >>> to do the same thing on non-rowkey columns, you might have to create a
>>> >>> secondary index table...
>>> >>>
>>> >>> JM
>>> >>>
>>> >>> 2013/1/27, anil gupta <[email protected]>:
>>> >>> > That's alright. I thought that you had come up with a killer
>>> >>> > solution. So I got curious to hear your ideas. ;)
>>> >>> > It seems like your solution mentioned below will not work when
>>> >>> > filtering on non-row-key columns, since when you are deciding the
>>> >>> > page numbers you are only considering the rowkey.
>>> >>> >
>>> >>> > Thanks,
>>> >>> > Anil
>>> >>> >
>>> >>> > On Fri, Jan 25, 2013 at 6:58 PM, Jean-Marc Spaggiari <
>>> >>> > [email protected]> wrote:
>>> >>> >
>>> >>> >> Hi Anil,
>>> >>> >>
>>> >>> >> I don't have a solution. I never thought about that ;) But I was
>>> >>> >> thinking about something like this: you create a 2nd table where
>>> >>> >> you place the row number (4 bytes) then the row key. To go directly
>>> >>> >> to a specific page, you query by the number, find the key, and you
>>> >>> >> know where to start your scan in the main table.
>>> >>> >>
>>> >>> >> The issue is probably the number for each line, since with an MR
>>> >>> >> you don't know where you are from the beginning. But you can build
>>> >>> >> something where you store the line number from the beginning of
>>> >>> >> the region, then when all regions are parsed you can reconstruct
>>> >>> >> the total numbering... That should work...
>>> >>> >>
>>> >>> >> JM
>>> >>> >>
>>> >>> >> 2013/1/25, anil gupta <[email protected]>:
>>> >>> >> > Inline...
>>> >>> >> >
>>> >>> >> > On Fri, Jan 25, 2013 at 9:17 AM, Jean-Marc Spaggiari <
>>> >>> >> > [email protected]> wrote:
>>> >>> >> >
>>> >>> >> >> Hi Anil,
>>> >>> >> >>
>>> >>> >> >> The issue is that all the other subsequent page starts should
>>> >>> >> >> be moved too...
>>> >>> >> >>
>>> >>> >> > Yes, this is a possibility. Hence the developer has to take
>>> >>> >> > care of this case. It might also be possible that the pageSize
>>> >>> >> > is not a hard limit on the number of results (more like a hint
>>> >>> >> > or suggestion on size). I would say it varies by use case.
>>> >>> >> >
>>> >>> >> >>
>>> >>> >> >> so if you want to jump directly to page n, you might be
>>> >>> >> >> totally shifted because of all the data inserted in the
>>> >>> >> >> meantime...
>>> >>> >> >>
>>> >>> >> >> If you want a real complete pagination feature, you might want
>>> >>> >> >> to have a coprocessor or an MR updating another table referring
>>> >>> >> >> to the pages....
>>> >>> >> >>
>>> >>> >> > Well, the solution depends on the use case. I will be doing
>>> >>> >> > pagination
>>> >
>>>
>>> --
>>> Warm Regards,
>>> Tariq
>>> https://mtariq.jux.com/
>>> cloudfront.blogspot.com
>>>
>>
>>
>>
>> --
>> Thanks & Regards,
>> Anil Gupta
>
--
Sent from my mobile device
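
The page-index idea quoted above (number the rows per region, then combine the per-region counts into a global numbering and remember the first row key of each page) can be sketched in a few lines of plain Python. This is only an illustration under assumptions: `build_page_index` is a hypothetical helper, regions are modelled as sorted lists of row keys, and the MR job and the second HBase table are replaced by an in-memory loop and a dict.

```python
from itertools import accumulate
from typing import Dict, List


def build_page_index(regions: List[List[str]], page_size: int) -> Dict[int, str]:
    """Map page number -> first row key of that page.

    Step 1 (per region, can run in parallel): count the rows in each region.
    Step 2: prefix-sum the counts to get each region's global offset.
    Step 3: a row starts a page when its global number is a multiple
    of page_size.
    """
    counts = [len(rows) for rows in regions]
    offsets = [0] + list(accumulate(counts))[:-1]
    index: Dict[int, str] = {}
    for offset, rows in zip(offsets, regions):
        for local_no, key in enumerate(rows):
            global_no = offset + local_no
            if global_no % page_size == 0:
                index[global_no // page_size] = key
    return index


# Three small regions, pages of 4 rows.
regions = [["a1", "a2", "a3"], ["b1", "b2", "b3", "b4"], ["c1", "c2"]]
index = build_page_index(regions, page_size=4)
print(index)  # {0: 'a1', 1: 'b2', 2: 'c2'}
```

To jump to page n, look up `index[n]` and start the scan of the main table at that row key. As noted in the thread, the index only reflects the data as of the last rebuild, so rows inserted since then shift the real page boundaries until the next run.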
