Re: Pagination with HBase - getting previous page of data

Toby Lazar Wed, 30 Jan 2013 04:43:23 -0800

Sounds like if you had 1000 regions, each with 99 rows, and you asked
for 100, you'd get back 99,000 rows. My guess is that a Filter is
serialized once and sent successively to each region, and that it
isn't updated between regions. I don't think changing that would be
too easy.
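
To make the arithmetic concrete, here is a rough sketch in plain Python.
It is not HBase code: PageLimit and scan are made-up names that only
mimic the behaviour described in this thread (regions visited one after
the other, a fresh filter for each region). The client_cap argument is
my own assumption about how a client could stop once it has a full page.

```python
class PageLimit:
    """Toy per-region page filter: lets at most page_size rows through."""

    def __init__(self, page_size):
        self.page_size = page_size
        self.seen = 0

    def accept(self):
        if self.seen >= self.page_size:
            return False
        self.seen += 1
        return True


def scan(regions, page_size, client_cap=None):
    """Visit regions sequentially; each region gets a fresh filter."""
    results = []
    for region in regions:
        page_filter = PageLimit(page_size)  # counter restarts for every region
        for row in region:
            if not page_filter.accept():
                break  # this region has hit its local page size
            results.append(row)
            # Hypothetical client-side cap: stop as soon as the page is full.
            if client_cap is not None and len(results) >= client_cap:
                return results
    return results


# 1000 regions with 99 rows each, page size 100
regions = [[f"r{r:04d}-{i:02d}" for i in range(99)] for r in range(1000)]

uncapped = scan(regions, page_size=100)
capped = scan(regions, page_size=100, client_cap=100)

print(len(uncapped))  # 99000: no region ever reaches its own limit
print(len(capped))    # 100: the client stops counting across regions
```

With the cap, the 100th row comes from the second region, so the client
still has to cross a region boundary to fill the page.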


Toby

On 1/30/13, Jean-Marc Spaggiari <[email protected]> wrote:
> Hi Anoop,
>
> So does it mean the scanner can send back LIMIT*2-1 lines max? Reading
> 100 rows from the 2nd region is using extra time and resources. Why
> not ask for only the number of missing lines?
>
> JM
>
> 2013/1/30, Anoop Sam John <[email protected]>:
>> @Anil
>>
>>>I could not understand why it goes to multiple regionservers in
>> parallel. Why can it not guarantee results <= page size (my guess: due to
>> multiple RS scans)? If you have used it then maybe you can explain the
>> behaviour?
>>
>> A scan from the client side never goes to multiple RSs in parallel. A scan
>> through the HTable API is sequential, one region after the other. For every
>> region it opens a scanner in the RS and makes next() calls. The filter is
>> instantiated on the server side, per region ...
>>
>> Say you need 100 rows in the page, you created a Scan on the client side
>> with the filter, and there are 2 regions. First the scanner is opened for
>> region1 and the scan runs there. The filter ensures that at most 100 rows
>> are retrieved from that region. But when the scan crosses the region
>> boundary and the client automatically opens a scanner for region2, it
>> passes the same filter with max 100 rows, so at most 100 rows can come from
>> there as well. So overall, on the client side, we cannot guarantee that the
>> scan will return only 100 rows in total from the table.
>>
>> I hope that makes it clear. I have not used PageFilter at all; I am just
>> explaining from my knowledge of the scan flow and general filter usage.
>>
>> "This is because the filter is applied separately on different region
>> servers. It does however optimize the scan of individual HRegions by making
>> sure that the page size is never exceeded locally. "
>>
>> I guess it should say "This is because the filter is applied separately
>> on different regions".
>>
>> -Anoop-
>>
>> ________________________________________
>> From: anil gupta [[email protected]]
>> Sent: Wednesday, January 30, 2013 1:33 PM
>> To: [email protected]
>> Subject: Re: Pagination with HBase - getting previous page of data
>>
>> Hi Mohammad,
>>
>> You are most welcome to join the discussion. I have never used PageFilter
>> so I don't really have concrete input.
>> I had a look at
>> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/PageFilter.html
>> I could not understand why it goes to multiple regionservers in
>> parallel. Why can it not guarantee results <= page size (my guess: due to
>> multiple RS scans)? If you have used it then maybe you can explain the
>> behaviour?
>>
>> Thanks,
>> Anil
>>
>> On Tue, Jan 29, 2013 at 7:32 PM, Mohammad Tariq <[email protected]>
>> wrote:
>>
>>> I'm kinda hesitant to put my leg in between the pros ;) But does it sound
>>> sane to use PageFilter for both rows and columns, with some additional
>>> logic to handle the 'nth' page? It'll help us with both kinds of paging.
>>>
>>> On Wednesday, January 30, 2013, Jean-Marc Spaggiari <
>>> [email protected]>
>>> wrote:
>>> > Hi Anil,
>>> >
>>> > I think it really depends on the way you want to use the pagination.
>>> >
>>> > Do you need to be able to jump to page X? Are you ok if you miss a
>>> > line or 2? Is your data growing fast? Or slowly? Is it ok if your
>>> > page indexes are a day old? Do you need to paginate over 300 columns?
>>> > Or just 1? Do you need to always have the exact same number of entries
>>> > in each page?
>>> >
>>> > For my use case I need to be able to jump to page X, and I don't
>>> > have any content. I have hundreds of millions of lines. Only the rowkey
>>> > matters for me, and I'm fine if sometimes I have 50 entries displayed
>>> > and sometimes only 45. So I'm thinking about calculating which row is
>>> > the first one for each page, and storing that separately. Then I just
>>> > need to run the MR daily.
>>> >
>>> > It's not a perfect solution, I agree, but this might do the job for me.
>>> > I'm totally open to any other ideas which might do the job too.
>>> >
>>> > JM
>>> >
>>> > 2013/1/29, anil gupta <[email protected]>:
>>> >> Yes, your suggested solution only works for RowKey-based pagination.
>>> >> It will fail when you start filtering on the basis of columns.
>>> >>
>>> >> Still, I would say it's comparatively easier to maintain this at the
>>> >> application level rather than creating tables for pagination.
>>> >>
>>> >> What if you have 300 columns in your schema? Will you create 300
>>> >> tables? What about handling pagination when filtering is done based on
>>> >> multiple columns ("and" and "or" conditions)?
>>> >>
>>> >> On Tue, Jan 29, 2013 at 1:08 PM, Jean-Marc Spaggiari <
>>> >> [email protected]> wrote:
>>> >>
>>> >>> No, no killer solution here ;)
>>> >>>
>>> >>> But I'm still thinking about that because I might have to implement
>>> >>> some pagination options soon...
>>> >>>
>>> >>> As you are saying, it only works on the row key, but if you want
>>> >>> to do the same thing on a non-rowkey column, you might have to create a
>>> >>> secondary index table...
>>> >>>
>>> >>> JM
>>> >>>
>>> >>> 2013/1/27, anil gupta <[email protected]>:
>>> >>> > That's alright.. I thought that you had come up with a killer
>>> >>> > solution, so I got curious to hear your ideas. ;)
>>> >>> > It seems like your solution mentioned below will not work for
>>> >>> > filtering on non-rowkey columns, since when you are deciding the page
>>> >>> > numbers you are only considering the rowkey.
>>> >>> >
>>> >>> > Thanks,
>>> >>> > Anil
>>> >>> >
>>> >>> > On Fri, Jan 25, 2013 at 6:58 PM, Jean-Marc Spaggiari <
>>> >>> > [email protected]> wrote:
>>> >>> >
>>> >>> >> Hi Anil,
>>> >>> >>
>>> >>> >> I don't have a solution. I never thought about that ;) But I was
>>> >>> >> thinking about something like this: you create a 2nd table where you
>>> >>> >> place the row number (4 bytes) then the row key. To go directly to a
>>> >>> >> specific page, you query by the number, find the key, and you know
>>> >>> >> where to start your scan in the main table.
>>> >>> >>
>>> >>> >> The issue is probably the numbering of the lines, since with an MR
>>> >>> >> you don't know where you are from the beginning. But you can build
>>> >>> >> something where you store the line number from the beginning of the
>>> >>> >> region, then when all regions are parsed you can reconstruct the
>>> >>> >> total numbering... That should work...
>>> >>> >>
>>> >>> >> JM
>>> >>> >>
>>> >>> >> 2013/1/25, anil gupta <[email protected]>:
>>> >>> >> > Inline...
>>> >>> >> >
>>> >>> >> > On Fri, Jan 25, 2013 at 9:17 AM, Jean-Marc Spaggiari <
>>> >>> >> > [email protected]> wrote:
>>> >>> >> >
>>> >>> >> >> Hi Anil,
>>> >>> >> >>
>>> >>> >> >> The issue is that all the other subsequent page starts should be
>>> >>> >> >> moved too...
>>> >>> >> >>
>>> >>> >> > Yes, this is a possibility. Hence the developer has to take care
>>> >>> >> > of this case. It might also be possible that the pageSize is not a
>>> >>> >> > hard limit on the number of results (more like a hint or suggestion
>>> >>> >> > on size). I would say it varies by use case.
>>> >>> >> >
>>> >>> >> >>
>>> >>> >> >> so if you want to jump directly to page n, you might be totally
>>> >>> >> >> shifted because of all the data inserted in the meantime...
>>> >>> >> >>
>>> >>> >> >> If you want a real complete pagination feature, you might want to
>>> >>> >> >> have a coprocessor or an MR updating another table referring to the
>>> >>> >> >> pages....
>>> >>> >> >>
>>> >>> >> > Well, the solution depends on the use case. I will be doing
>>> >>> >> > pagination
>>> >
>>>
>>> --
>>> Warm Regards,
>>> Tariq
>>> https://mtariq.jux.com/
>>> cloudfront.blogspot.com
>>>
>>
>>
>>
>> --
>> Thanks & Regards,
>> Anil Gupta
>

-- 
Sent from my mobile device
