Are you using the HBase Shell to test performance? In my experience, this may not be a good idea if you run it from one of the nodes of your cluster. The shell's speed wasn't very representative.
Other than that:

> I have a hbase table which has a lot of columns in a single column family.
> eg. let's say I have a users table, then userid, username, email .... etc
> etc 15 fields all together are in the single columnFamily.

15 fields is not really "a lot of columns". Selecting several vs. all should not make a big difference if they are in the same column family, unless some of them have large values, so that simply transferring those values over the network takes longer (is your network fast, btw?).

Alex Baranau
------
Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr

On Thu, Sep 13, 2012 at 11:02 AM, Jacques <[email protected]> wrote:

> Not sure of your schema...
>
> Each column family is in a separate collection of StoreFiles. Scan all will
> read all these files whereas your second scan will only read the StoreFiles
> associated with column family cf (difference if you have multiple column
> families). Additionally, pushing a large amount of data from region
> servers to wherever you're running the shell will slow things down.
>
> It is difficult to respond to this unless you reveal your entire data
> structure and nature as well as your deployment scenario.
>
> Jacques
>
> On Thu, Sep 13, 2012 at 7:35 AM, Shengjie Min <[email protected]> wrote:
>
> > In my case, I am not feeding hbase result to mapred, it's just pure hbase
> > scan, returning all columns vs two columns makes huge difference to me.
> >
> > On 13 September 2012 15:29, Doug Meil <[email protected]> wrote:
> >
> > > Hi there, I don't know the specifics of your environment, but ...
> > >
> > > http://hbase.apache.org/book.html#perf.reading
> > > 11.8.2. Scan Attribute Selection
> > >
> > > ... describes paying attention to the number of columns you are returning,
> > > particularly when using HBase as a MR source. In short, returning only
> > > the columns you need means you are reducing the data transferred between
> > > the RS and the client and the number of KV's evaluated in the RS, etc.
> > >
> > > On 9/13/12 10:12 AM, "Shengjie Min" <[email protected]> wrote:
> > >
> > > >Hi,
> > > >
> > > >I found an interesting difference between hbase scan query.
> > > >
> > > >I have a hbase table which has a lot of columns in a single column family.
> > > >eg. let's say I have a users table, then userid, username, email .... etc
> > > >etc 15 fields all together are in the single columnFamily.
> > > >
> > > >if you are familiar with RDBMS,
> > > >
> > > >query 1: select * from users
> > > >vs
> > > >query 2: select userid, username from users
> > > >
> > > >in mysql, these two has a difference, the query 2 will be obviously faster,
> > > >but two queries won't give you a huge difference from performance
> > > >perspective.
> > > >
> > > >In Hbase, I noticed that:
> > > >
> > > >query 3: scan 'users', // this is basically return me all 15 fields
> > > >vs
> > > >query 4: scan 'users', {COLUMNS=>['cf:userid','cf:username']} // this is
> > > >return me only two fields: userid , username
> > > >
> > > >query 3 here takes way longer than query 4, Given a big data set. In my
> > > >test, I have around 1,000,000 user records. You are talking about query 3 -
> > > >100 secs VS query 4 - a few secs.
> > > >
> > > >Can anybody explain to me, why the width of the resultset in HBASE can
> > > >impact the performance that much?
> > > >
> > > >Shengjie Min
> >
> > --
> > All the best,
> > Shengjie Min
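
To put rough numbers on the transfer-size argument in this thread, here is a minimal Python sketch that estimates how many bytes a full scan returns compared with a two-column scan. It relies on the fact that every HBase cell (KeyValue) repeats the row key, family, qualifier, timestamp and type alongside the value. Everything specific is an assumption made for illustration: the 12-byte row key, the 32-byte values, and the `field00`..`field12` qualifier names are hypothetical, and the estimate ignores tags, compression, block encoding and RPC framing, which vary by HBase version.

```python
# Rough estimate of scan payload size. All field sizes are assumed, not measured.

# Fixed bytes per KeyValue: key length (4) + value length (4) + row length (2)
# + family length (1) + timestamp (8) + key type (1).
KV_FIXED_OVERHEAD = 4 + 4 + 2 + 1 + 8 + 1


def keyvalue_size(row_key: bytes, family: bytes, qualifier: bytes, value_len: int) -> int:
    """Approximate serialized size of one KeyValue (ignores tags and compression)."""
    return KV_FIXED_OVERHEAD + len(row_key) + len(family) + len(qualifier) + value_len


def scan_payload_bytes(rows: int, row_key: bytes, family: bytes, columns: dict) -> int:
    """Total bytes for `rows` rows; `columns` maps qualifier -> assumed value length."""
    per_row = sum(
        keyvalue_size(row_key, family, qualifier.encode(), value_len)
        for qualifier, value_len in columns.items()
    )
    return rows * per_row


if __name__ == "__main__":
    rows = 1_000_000            # record count mentioned in the thread
    row_key = b"user00000001"   # hypothetical 12-byte row key
    family = b"cf"

    # Hypothetical schema: 15 qualifiers, each holding a 32-byte value.
    two_columns = {"userid": 32, "username": 32}
    all_columns = {f"field{i:02d}": 32 for i in range(13)}
    all_columns.update(two_columns)

    full = scan_payload_bytes(rows, row_key, family, all_columns)
    narrow = scan_payload_bytes(rows, row_key, family, two_columns)
    print(f"all 15 columns: {full / 1e6:.0f} MB")
    print(f"2 columns:      {narrow / 1e6:.0f} MB")
    print(f"ratio:          {full / narrow:.1f}x")
```

With uniformly small values the payload ratio stays close to the column-count ratio (15 to 2), which is much smaller than the reported gap of 100 secs vs. a few secs. Changing one of the assumed value lengths to something large shows how a single wide column can dominate the transfer, which is the case the reply above singles out.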
