In my case, I am not feeding the HBase results to MapReduce; it's just a pure HBase scan, and returning all columns vs. two columns makes a huge difference for me.
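
To put rough numbers on the "data transferred" part of the explanation quoted below, here is a minimal back-of-the-envelope sketch in Python. It relies on the fact that HBase stores each cell as a KeyValue that repeats the row key, family and qualifier next to the value. The row key format, the 32-byte value size and the field04..field15 qualifier names are made-up assumptions for illustration, not figures from my table, and the model ignores RPC framing, compression and server-side work, so it only hints at the size of the gap rather than explaining the full timing difference.

```python
# Rough per-row payload estimate based on the HBase KeyValue layout.
# Fixed part of each KeyValue: key length (4) + value length (4) +
# row length (2) + family length (1) + timestamp (8) + key type (1).
KV_FIXED_OVERHEAD = 4 + 4 + 2 + 1 + 8 + 1


def cell_size(row_key: bytes, family: bytes, qualifier: bytes, value: bytes) -> int:
    """Approximate serialized size of one cell (one KeyValue) in bytes."""
    return KV_FIXED_OVERHEAD + len(row_key) + len(family) + len(qualifier) + len(value)


def row_size(row_key: bytes, family: bytes, columns: dict) -> int:
    """Approximate bytes returned for one row: one KeyValue per selected column."""
    return sum(cell_size(row_key, family, q, v) for q, v in columns.items())


# Hypothetical user row: 15 qualifiers in one family, 32-byte values (assumed).
ROW_KEY = b"user0000001"
FAMILY = b"cf"
qualifiers = [b"userid", b"username", b"email"] + [
    f"field{i:02d}".encode() for i in range(4, 16)  # placeholder names
]
all_columns = {q: b"x" * 32 for q in qualifiers}
two_columns = {q: all_columns[q] for q in (b"userid", b"username")}

full = row_size(ROW_KEY, FAMILY, all_columns)    # like: scan 'users'
narrow = row_size(ROW_KEY, FAMILY, two_columns)  # like: COLUMNS => [userid, username]

rows = 1_000_000
print(f"all 15 columns: {full} bytes/row, about {full * rows / 1e6:.0f} MB per scan")
print(f"two columns:    {narrow} bytes/row, about {narrow * rows / 1e6:.0f} MB per scan")
```

Under these assumptions the full scan returns about 7.5 times as many bytes and 7.5 times as many KeyValues as the two-column scan, and each of those KeyValues also has to be evaluated on the region server and serialized to the client.
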

On 13 September 2012 15:29, Doug Meil <[email protected]> wrote:

>
> Hi there, I don't know the specifics of your environment, but ...
>
> http://hbase.apache.org/book.html#perf.reading
> 11.8.2. Scan Attribute Selection
>
> ... describes paying attention to the number of columns you are returning,
> particularly when using HBase as a MR source. In short, returning only
> the columns you need means you are reducing the data transferred between
> the RS and the client and the number of KV's evaluated in the RS, etc.
>
>
> On 9/13/12 10:12 AM, "Shengjie Min" <[email protected]> wrote:
>
> >Hi,
> >
> >I found an interesting difference between hbase scan query.
> >
> >I have a hbase table which has a lot of columns in a single column family.
> >eg. let's say I have a users table, then userid, username, email .... etc
> >etc 15 fields all together are in the single columnFamily.
> >
> >if you are familiar with RDBMS,
> >
> >query 1: select * from users
> >vs
> >query 2: select userid, username from users
> >
> >in mysql, these two has a difference, the query 2 will be obviously faster,
> >but two queries won't give you a huge difference from performance
> >perspective.
> >
> >In Hbase, I noticed that:
> >
> >query 3: scan 'users', // this is basically return me all 15 fields
> >vs
> >query 4: scan 'users', {COLUMNS=>['cf:userid','cf:username']} // this is
> >return me only two fields: userid , username
> >
> >query 3 here takes way longer than query 4, Given a big data set. In my
> >test, I have around 1,000,000 user records. You are talking about query 3 -
> >100 secs VS query 4 - a few secs.
> >
> >
> >Can anybody explain to me, why the width of the resultset in HBASE can
> >impact the performance that much?
> >
> >
> >Shengjie Min
>

--
All the best,
Shengjie Min
