Don't forget that a Get is just a one-row scan; they share the same
code path internally.  The only difference, of course, is that a Get
returns just that one row and is therefore fairly fast (unless your
row is huge, think hundreds of MBs).
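
For reference, a Get and the equivalent one-row Scan look like this (a
minimal sketch against the current client API; the table name and row
key are made up):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.*;
  import org.apache.hadoop.hbase.util.Bytes;

  Configuration conf = HBaseConfiguration.create();
  HTable table = new HTable(conf, "activities");

  // Point lookup by row key
  Result viaGet = table.get(new Get(Bytes.toBytes("row-1")));

  // The same lookup expressed as a scan bounded to that single row
  // (the stop row is exclusive, so append a zero byte)
  Scan scan = new Scan(Bytes.toBytes("row-1"), Bytes.toBytes("row-1\0"));
  ResultScanner scanner = table.getScanner(scan);
  Result viaScan = scanner.next();
  scanner.close();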

-ryan

On Thu, May 12, 2011 at 1:31 PM, Himanish Kushary <[email protected]> wrote:
> Thanks for your help. We are implementing our own secondary index
> table to get rid of the scans and replace those calls with Gets.
>
> One pattern we are following, to ensure the frontend web application
> performs as we expect, is to always try to use Gets from the UI
> instead of Scans.
>
> Thanks
> Himanish
>
> On Thu, May 12, 2011 at 2:21 AM, Ryan Rawson <[email protected]> wrote:
>
>> Scans are serial.
>>
>> To use DB parlance, a Scan + filter is the moral equivalent of
>> "SELECT * FROM <> WHERE col='val'" with no index: a full table scan
>> is engaged.
>>
>> The typical ways to help with performance issues are these:
>> - Arrange your data using the primary key so you can scan the
>> smallest possible portion of the table.
>> - Use another table as an index; a sketch follows below.
>> Unfortunately HBase doesn't help you here, so you maintain the
>> index yourself.
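>>
>> A minimal sketch of the index-table pattern (the table, family, and
>> variable names are made up; note the two puts are not atomic, so the
>> index can briefly disagree with the data):
>>
>>   // imports: org.apache.hadoop.hbase.client.*, org.apache.hadoop.hbase.util.Bytes
>>   HTable activities = new HTable(conf, "activities");
>>   HTable itemIndex  = new HTable(conf, "item_index");
>>
>>   // On write: put the data row, then an index row keyed by item
>>   Put data = new Put(Bytes.toBytes(revTs + "/" + itemTs + "/" + custId + "/" + itemId));
>>   data.add(Bytes.toBytes("f"), Bytes.toBytes("activity"), activityBytes);
>>   activities.put(data);
>>
>>   Put index = new Put(Bytes.toBytes(itemId + "/" + itemTs));
>>   index.add(Bytes.toBytes("f"), Bytes.toBytes("ref"), data.getRow());
>>   itemIndex.put(index);
>>
>>   // On read: short scan over "itemId/" on the index table, then
>>   // Gets on the main table using the stored refs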
>>
>> -ryan
>>
>> On Wed, May 11, 2011 at 11:12 PM, Connolly Juhani <[email protected]> wrote:
>> > By naming rows from the timestamp, the rowids will all be
>> > sequential when inserting, so all new inserts go into the same
>> > region. When checking the last 30 days you will also be reading
>> > from the same region where all the writing is happening, i.e. the
>> > one that is already busy writing the edit log for all those
>> > entries. You might want to consider an alternative way of naming
>> > your rows that would distribute the reading/writing better, as in
>> > the sketch below.
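>> > A common approach (a sketch; the bucket count and variable names
>> > are made up) is to salt the key with a hash prefix so writes spread
>> > across regions:
>> >
>> >   int BUCKETS = 16;
>> >   // mask off the sign bit so the modulo can't go negative
>> >   int salt = (itemId.hashCode() & 0x7fffffff) % BUCKETS;
>> >   byte[] row = Bytes.toBytes(String.format("%02d/%s/%s", salt, itemId, ts));
>> >   // readers then issue one scan per bucket and merge the results
>> >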
>> > However, since you are naming rows by timestamps, you should be
>> > able to restrict the scan with a start and end date. You are doing
>> > this, right? If you're not, you are scanning every row in the table
>> > when you only need the rows between start and end.
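>> >
>> > For example (a sketch; reverseTs() is a hypothetical helper that
>> > returns Long.MAX_VALUE - ts as a fixed-width string, matching the
>> > key schema described below):
>> >
>> >   long DAY = 24L * 60 * 60 * 1000;
>> >   long now = System.currentTimeMillis();
>> >   // reverse timestamps sort newest-first, so "now" is the start row
>> >   Scan scan = new Scan(Bytes.toBytes(reverseTs(now)),
>> >       Bytes.toBytes(reverseTs(now - 30 * DAY)));
>> >   scan.setCaching(100);
>> >   ResultScanner rs = table.getScanner(scan);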
>> >
>> > Someone may need to correct me, but based on my memory of the
>> > implementation, scans are entirely sequential: region a gets
>> > scanned, then b, then c. You could speed this up by scanning
>> > multiple regions in parallel processes and merging the results.
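>> >
>> > A rough sketch of that (getStartEndKeys() gives the region
>> > boundaries; the pool size and table name are illustrative):
>> >
>> >   // imports: org.apache.hadoop.hbase.client.*,
>> >   // org.apache.hadoop.hbase.util.*, java.util.*, java.util.concurrent.*
>> >   final Configuration conf = HBaseConfiguration.create();
>> >   HTable table = new HTable(conf, "activities");
>> >   Pair<byte[][], byte[][]> keys = table.getStartEndKeys();
>> >   ExecutorService pool = Executors.newFixedThreadPool(8);
>> >   List<Future<List<Result>>> futures = new ArrayList<Future<List<Result>>>();
>> >   for (int i = 0; i < keys.getFirst().length; i++) {
>> >     final byte[] start = keys.getFirst()[i];
>> >     final byte[] stop  = keys.getSecond()[i];
>> >     futures.add(pool.submit(new Callable<List<Result>>() {
>> >       public List<Result> call() throws Exception {
>> >         HTable t = new HTable(conf, "activities"); // HTable is not thread-safe
>> >         try {
>> >           List<Result> out = new ArrayList<Result>();
>> >           ResultScanner rs = t.getScanner(new Scan(start, stop));
>> >           for (Result r : rs) out.add(r);
>> >           rs.close();
>> >           return out;
>> >         } finally { t.close(); }
>> >       }
>> >     }));
>> >   }
>> >   // collect futures.get(i).get() and merge in key order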
>> >
>> > On 12 May 2011 14:36, Himanish Kushary <[email protected]> wrote:
>> >
>> >> Hi,
>> >>
>> >> We have a table split across multiple regions (approx. 50-60
>> >> regions at a 64 MB split size) with the rowid schema
>> >> [ReverseTimestamp/itemtimestamp/customerid/itemid]. This stores the
>> >> activities for an item for a customer. We have lots of data for
>> >> lots of items per customer in this table.
>> >>
>> >> When we try to look up the activities for an item over the last 30
>> >> days from this table, we use a Scan with a RowFilter and a
>> >> RegexStringComparator. The scan takes a long time (almost 15-20
>> >> secs) to return the activities for an item.
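>> >>
>> >> Roughly like this (a sketch of the pattern just described; the
>> >> regex and variable names are made up):
>> >>
>> >>   Scan scan = new Scan();
>> >>   scan.setFilter(new RowFilter(CompareFilter.CompareOp.EQUAL,
>> >>       new RegexStringComparator(".*/" + customerId + "/" + itemId + "$")));
>> >>   // with no start/stop row, this visits every row in the table
>> >>   ResultScanner rs = table.getScanner(scan);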
>> >>
>> >> We are hooked up to the HBase tables directly from a web
>> >> application, so this response time of around 20 secs is
>> >> unacceptable. We have also noticed that whenever we do any
>> >> scan-type operation, the latency is never in an acceptable range
>> >> for a web application.
>> >>
>> >> Are we doing something wrong? If HBase scans are this slow, it
>> >> would be really hard to hook HBase up directly to any web
>> >> application.
>> >>
>> >> Could somebody please suggest how to improve this, or some other
>> >> options (design, architectural) to remedy this kind of issue when
>> >> dealing with a lot of data?
>> >>
>> >> Note: We have tried setCaching and SingleColumnValueFilter, with
>> >> no significant effect.
>> >>
>> >> ---------------------------
>> >> Thanks & Regards
>> >> Himanish
>> >>
>> >
>>
>
>
>
> --
> Thanks & Regards
> Himanish
>
