Hi Anoop,

So does this mean the scanner can send back up to LIMIT*2-1 lines? Reading 100 rows from the 2nd region uses extra time and resources. Why not ask only for the number of missing lines?
JM

2013/1/30, Anoop Sam John <[email protected]>:
> @Anil
> >>I could not understand that why it goes to multiple regionservers in
> parallel. Why it cannot guarantee results <= page size( my guess: due to
> multiple RS scans)? If you have used it then maybe you can explain the
> behaviour?
>
> A scan from the client side never goes to multiple RS in parallel. A scan from
> the HTable API is sequential, one region after the other. For every region it
> will open a scanner in the RS and do next() calls. The filter will be
> instantiated at the server side per region ...
>
> When you need 100 rows in the page and you create a Scan at the client side
> with the filter, and suppose there are 2 regions: first the scanner is opened
> for region1 and the scan happens. It will ensure that at most 100 rows are
> retrieved from that region. But when the region boundary is crossed and the
> client automatically opens a scanner for region2, it will again pass the
> filter with max 100 rows, and so from there also max 100 rows can come. So
> overall, at the client side we cannot guarantee that the scan created will
> only scan 100 rows as a whole from the table.
>
> I think I am making it clear. I have not used PageFilter at all. I am just
> explaining as per my knowledge of the scan flow and general filter usage.
>
> "This is because the filter is applied separately on different region
> servers. It does however optimize the scan of individual HRegions by making
> sure that the page size is never exceeded locally. "
>
> I guess it needs to say "This is because the filter is applied
> separately on different regions".
>
> -Anoop-
>
> ________________________________________
> From: anil gupta [[email protected]]
> Sent: Wednesday, January 30, 2013 1:33 PM
> To: [email protected]
> Subject: Re: Pagination with HBase - getting previous page of data
>
> Hi Mohammad,
>
> You are most welcome to join the discussion. I have never used PageFilter
> so I don't really have concrete input.
> I had a look at
> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/PageFilter.html
> I could not understand that why it goes to multiple regionservers in
> parallel. Why it cannot guarantee results <= page size( my guess: due to
> multiple RS scans)? If you have used it then maybe you can explain the
> behaviour?
>
> Thanks,
> Anil
>
> On Tue, Jan 29, 2013 at 7:32 PM, Mohammad Tariq <[email protected]> wrote:
>
>> I'm kinda hesitant to put my leg in between the pros ;) But, does it sound
>> sane to use PageFilter for both rows and columns and have some additional
>> logic to handle the 'nth' page logic? It'll help us in both kinds of paging.
>>
>> On Wednesday, January 30, 2013, Jean-Marc Spaggiari <[email protected]>
>> wrote:
>> > Hi Anil,
>> >
>> > I think it really depends on the way you want to use the pagination.
>> >
>> > Do you need to be able to jump to page X? Are you ok if you miss a
>> > line or 2? Is your data growing fast? Or slowly? Is it ok if your
>> > page indexes are a day old? Do you need to paginate over 300 columns?
>> > Or just 1? Do you need to always have the exact same number of entries
>> > in each page?
>> >
>> > For my use case I need to be able to jump to page X and I don't
>> > have any content. I have hundreds of millions of lines. Only the rowkey
>> > matters for me and I'm fine if sometimes I have 50 entries displayed,
>> > and sometimes only 45. So I'm thinking about calculating which row is
>> > the first one for each page, and storing that separately. Then I just
>> > need to run the MR daily.
>> >
>> > It's not a perfect solution I agree, but this might do the job for me.
>> > I'm totally open to any other idea which might do the job too.
>> >
>> > JM
>> >
>> > 2013/1/29, anil gupta <[email protected]>:
>> >> Yes, your suggested solution only works on RowKey based pagination. It will
>> >> fail when you start filtering on the basis of columns.
>> >>
>> >> Still, I would say it's comparatively easier to maintain this at
>> >> Application level rather than creating tables for pagination.
>> >>
>> >> What if you have 300 columns in your schema? Will you create 300 tables?
>> >> What about handling of pagination when filtering is done based on multiple
>> >> columns ("and" and "or" conditions)?
>> >>
>> >> On Tue, Jan 29, 2013 at 1:08 PM, Jean-Marc Spaggiari <
>> >> [email protected]> wrote:
>> >>
>> >>> No, no killer solution here ;)
>> >>>
>> >>> But I'm still thinking about that because I might have to implement
>> >>> some pagination options soon...
>> >>>
>> >>> As you are saying, it's only working on the row-key, but if you want
>> >>> to do the same thing on non-rowkey, you might have to create a
>> >>> secondary index table...
>> >>>
>> >>> JM
>> >>>
>> >>> 2013/1/27, anil gupta <[email protected]>:
>> >>> > That's alright.. I thought that you had come up with a killer solution.
>> >>> > So, got curious to hear your ideas. ;)
>> >>> > It seems like your below mentioned solution will not work on filtering on
>> >>> > non row-key columns since when you are deciding the page numbers you are
>> >>> > only considering rowkey.
>> >>> >
>> >>> > Thanks,
>> >>> > Anil
>> >>> >
>> >>> > On Fri, Jan 25, 2013 at 6:58 PM, Jean-Marc Spaggiari <
>> >>> > [email protected]> wrote:
>> >>> >
>> >>> >> Hi Anil,
>> >>> >>
>> >>> >> I don't have a solution. I never thought about that ;) But I was
>> >>> >> thinking about something like you create a 2nd table where you place
>> >>> >> the row number (4 bytes) then the row key. You go directly to a
>> >>> >> specific page, you query by the number, find the key, and you know
>> >>> >> where to start your scan in the main table.
>> >>> >>
>> >>> >> The issue is probably the number for each line since with a MR you
>> >>> >> don't know where you are from the beginning.
>> >>> >> But you can build
>> >>> >> something where you store the line number from the beginning of the
>> >>> >> region, then when all regions are parsed you can reconstruct the total
>> >>> >> numbering... That should work...
>> >>> >>
>> >>> >> JM
>> >>> >>
>> >>> >> 2013/1/25, anil gupta <[email protected]>:
>> >>> >> > Inline...
>> >>> >> >
>> >>> >> > On Fri, Jan 25, 2013 at 9:17 AM, Jean-Marc Spaggiari <
>> >>> >> > [email protected]> wrote:
>> >>> >> >
>> >>> >> >> Hi Anil,
>> >>> >> >>
>> >>> >> >> The issue is that all the other subsequent page starts should be
>> >>> >> >> moved too...
>> >>> >> >>
>> >>> >> > Yes, this is a possibility. Hence the Developer has to take care of this
>> >>> >> > case. It might also be possible that the pageSize is not a hard limit on
>> >>> >> > number of results (more like a hint or suggestion on size). I would say
>> >>> >> > it varies by use case.
>> >>> >> >
>> >>> >> >>
>> >>> >> >> so if you want to jump directly to page n, you might be totally
>> >>> >> >> shifted because of all the data inserted in the meantime...
>> >>> >> >>
>> >>> >> >> If you want a real complete pagination feature, you might want to have
>> >>> >> >> a coprocessor or a MR updating another table referring to the
>> >>> >> >> pages....
>> >>> >> >>
>> >>> >> > Well, the solution depends on the use case. I will be doing pagination
>> >
>>
>> --
>> Warm Regards,
>> Tariq
>> https://mtariq.jux.com/
>> cloudfront.blogspot.com
>>
>
>
> --
> Thanks & Regards,
> Anil Gupta
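
To make the per-region behaviour discussed in this thread concrete, here is a minimal Python simulation. It does not use the HBase client API: a region is modelled as a plain list of row keys, and scan_region, scan_table and scan_page are hypothetical names invented for this sketch. It assumes, as described above, that the page limit is applied to each region separately and that the client moves on to the next region once the current scanner is exhausted. scan_page shows the "ask only for the missing lines" idea from the top of the thread as a client-side cap; whether the real client can do this is not something the thread establishes.

```python
from typing import Iterator, List


def scan_region(rows: List[str], page_size: int) -> Iterator[str]:
    """Yield at most page_size rows from one region (the per-region limit)."""
    for count, row in enumerate(rows):
        if count >= page_size:
            break
        yield row


def scan_table(regions: List[List[str]], page_size: int) -> List[str]:
    """Scan regions sequentially; the limit starts over in every region."""
    result: List[str] = []
    for region in regions:
        result.extend(scan_region(region, page_size))
    return result


def scan_page(regions: List[List[str]], page_size: int) -> List[str]:
    """Client-side cap: ask each region only for the rows still missing."""
    page: List[str] = []
    for region in regions:
        missing = page_size - len(page)
        if missing <= 0:
            break  # page is full, no need to open the next region
        page.extend(scan_region(region, missing))
    return page


# Two regions: 60 rows in the first, 140 rows in the second.
regions = [
    [f"r{i:03d}" for i in range(60)],
    [f"r{i:03d}" for i in range(60, 200)],
]

print(len(scan_table(regions, 100)))  # 160: 60 from region 1, 100 from region 2
print(len(scan_page(regions, 100)))   # 100: 60 from region 1, 40 from region 2
```

In this model the uncapped scan returns more than one page as soon as a page spans a region boundary, while the capped variant stops at exactly the page size and reads fewer rows from the second region.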
