Don't forget that a Get is just a one-row Scan; they share the same code path internally. The only difference, of course, is that a Get returns just that one row and is therefore fairly fast (unless your row is huge, think hundreds of MBs).
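For example, a point lookup with the Java client looks like this (a minimal sketch; the table name "activities" and the row key are made up):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class GetExample {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(HBaseConfiguration.create(), "activities");
    // A Get seeks straight to one row key; internally it runs as a
    // one-row scan, so there is no cheaper read path to find.
    Get get = new Get(Bytes.toBytes("someRowKey"));
    Result result = table.get(get);
    System.out.println(result);
    table.close();
  }
}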
-ryan

On Thu, May 12, 2011 at 1:31 PM, Himanish Kushary <[email protected]> wrote:
> Thanks for your help. We are implementing our own secondary index table to
> get rid of the scan and replace those calls with Get.
>
> One common trend that we are following, to ensure the frontend web
> application is performant as per our expectation, is to always try to use
> Gets from the UI instead of Scans.
>
> Thanks
> Himanish
>
> On Thu, May 12, 2011 at 2:21 AM, Ryan Rawson <[email protected]> wrote:
>
>> Scans are in serial.
>>
>> To use DB parlance, consider a Scan + filter the moral equivalent of a
>> "SELECT * FROM <> WHERE col='val'" with no index, and a full table
>> scan is engaged.
>>
>> The typical ways to help solve performance issues are:
>> - arrange your data using the primary key so you can scan the smallest
>> portion of the table possible.
>> - use another table as an index. Unfortunately HBase doesn't help you
>> here.
>>
>> -ryan
>>
>> On Wed, May 11, 2011 at 11:12 PM, Connolly Juhani <[email protected]>
>> wrote:
>> > By naming rows from the timestamp, the row ids are all going to be
>> > sequential when inserting, so all new inserts will be going into the
>> > same region. When checking the last 30 days you will also be reading
>> > from the same region where all the writing is happening, i.e. the one
>> > that is already busy writing the edit log for all those entries. You
>> > might want to consider an alternative method of naming your rows that
>> > would result in more distributed reading/writing.
>> > However, since you are naming rows by timestamps, you should be able
>> > to restrict the scan by a start and end date. You are doing this,
>> > right? If you're not, you are scanning every row in the table when you
>> > only need the rows from end-start.
>> >
>> > Someone may need to correct me, but based on my memory of the
>> > implementation, scans are entirely sequential, so region a gets
>> > scanned, then b, then c. You could speed this up by scanning multiple
>> > regions in parallel processes and merging the results.
>> >
>> > On 12 May 2011 14:36, Himanish Kushary <[email protected]> wrote:
>> >
>> >> Hi,
>> >>
>> >> We have a table split across multiple regions (approx 50-60 regions
>> >> for a 64 MB split size) with a rowid schema of
>> >> [ReverseTimestamp/itemtimestamp/customerid/itemid]. This stores the
>> >> activities for an item for a customer. We have lots of data for lots
>> >> of items for a customer in this table.
>> >>
>> >> When we try to look up activities for an item for the last 30 days
>> >> from this table, we are using a Scan with RowFilter and
>> >> RegexComparator. The scan takes a lot of time (almost 15-20 secs) to
>> >> get us the activities for an item.
>> >>
>> >> We are hooked up to HBase tables directly from a web application, so
>> >> this response time of around 20 secs is unacceptable. We also noticed
>> >> that whenever we do any scan kind of operation it is never in
>> >> acceptable ranges for a web application.
>> >>
>> >> Are we doing something wrong? If HBase scans are so slow then it
>> >> would be really hard to hook them up directly with any web
>> >> application.
>> >>
>> >> Could somebody please suggest how to improve this, or some other
>> >> options (design, architectural) to remedy this kind of issue when
>> >> dealing with lots of data.
>> >>
>> >> Note: We have tried setCaching and SingleColumnValueFilter to no
>> >> significant effect.
>> >>
>> >> ---------------------------
>> >> Thanks & Regards
>> >> Himanish
>> >
>> >
>
>
> --
> Thanks & Regards
> Himanish
>
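To make the start/stop-row advice above concrete: instead of a RowFilter with a RegexComparator (which touches every row), bound the scan to the key range covering the window. A minimal sketch, assuming the reverse timestamp is the leading, fixed-width 8-byte component of the row key; the exact key encoding in the schema above isn't shown, so treat the key construction as illustrative:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class BoundedScanExample {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(HBaseConfiguration.create(), "activities");

    long now = System.currentTimeMillis();
    long thirtyDaysAgo = now - 30L * 24 * 60 * 60 * 1000;

    // Reverse timestamps sort newest-first, so "now" reverses to the
    // smallest key in the window and "thirtyDaysAgo" to the largest.
    byte[] startRow = Bytes.toBytes(Long.MAX_VALUE - now);
    byte[] stopRow = Bytes.toBytes(Long.MAX_VALUE - thirtyDaysAgo);

    Scan scan = new Scan(startRow, stopRow);
    scan.setCaching(100); // fetch rows in batches to cut down on RPCs

    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result r : scanner) {
        // process r
      }
    } finally {
      scanner.close();
      table.close();
    }
  }
}

This reads only the slice of the table covering the 30-day window rather than regex-matching every row. Note the stop row of a Scan is exclusive.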
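Since scans run serially, the parallel-regions idea from the thread has to be done client-side. A rough sketch under the same assumed key layout, splitting the window into slices and scanning each in its own thread (HTable is not thread-safe, so each worker opens its own):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ParallelScan {
  public static void main(String[] args) throws Exception {
    long now = System.currentTimeMillis();
    long windowStart = now - 30L * 24 * 60 * 60 * 1000;
    int slices = 4; // number of parallel scanners
    long sliceMs = (now - windowStart) / slices;

    ExecutorService pool = Executors.newFixedThreadPool(slices);
    List<Future<List<Result>>> futures = new ArrayList<Future<List<Result>>>();

    for (int i = 0; i < slices; i++) {
      final long from = windowStart + i * sliceMs;
      final long to = (i == slices - 1) ? now : from + sliceMs;
      futures.add(pool.submit(new Callable<List<Result>>() {
        public List<Result> call() throws Exception {
          // Each thread gets its own HTable instance.
          HTable table = new HTable(HBaseConfiguration.create(), "activities");
          // Newer times reverse to smaller keys, so "to" is the start row.
          Scan scan = new Scan(Bytes.toBytes(Long.MAX_VALUE - to),
                               Bytes.toBytes(Long.MAX_VALUE - from));
          scan.setCaching(100);
          List<Result> out = new ArrayList<Result>();
          ResultScanner scanner = table.getScanner(scan);
          try {
            for (Result r : scanner) out.add(r);
          } finally {
            scanner.close();
            table.close();
          }
          return out;
        }
      }));
    }

    // Merge the slice results; order across slices is newest-first.
    List<Result> merged = new ArrayList<Result>();
    for (Future<List<Result>> f : futures) merged.addAll(f.get());
    pool.shutdown();
  }
}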
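As for "use another table as an index": HBase leaves the bookkeeping to you, so every write to the main table has to be paired with a write to the index table, and the two puts are not atomic. A hypothetical sketch, with made-up table, column family, and value names, keying the index by item id so one item's activity is contiguous and cheap to fetch:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class IndexedWrite {
  public static void main(String[] args) throws Exception {
    HTable activities = new HTable(HBaseConfiguration.create(), "activities");
    HTable index = new HTable(HBaseConfiguration.create(), "activities_by_item");

    long now = System.currentTimeMillis();
    String reverseTs = Long.toString(Long.MAX_VALUE - now);
    String itemTs = Long.toString(now);
    String customerId = "cust42"; // illustrative values
    String itemId = "item7";

    // Main-table key following the schema in the thread.
    byte[] mainKey = Bytes.toBytes(
        reverseTs + "/" + itemTs + "/" + customerId + "/" + itemId);

    Put main = new Put(mainKey);
    main.add(Bytes.toBytes("d"), Bytes.toBytes("activity"),
             Bytes.toBytes("..."));
    activities.put(main);

    // Index key leads with the item id, so a 30-day lookup for one item
    // becomes a short, bounded scan (or a handful of Gets) on this table.
    byte[] indexKey = Bytes.toBytes(itemId + "/" + reverseTs);

    // Second, non-atomic write: if it fails, the index is out of sync and
    // the application has to repair or tolerate that.
    Put idx = new Put(indexKey);
    idx.add(Bytes.toBytes("ref"), Bytes.toBytes("row"), mainKey);
    index.put(idx);

    activities.close();
    index.close();
  }
}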
