On Sun, May 23, 2010 at 10:36 AM, Michael Segel <[email protected]> wrote:
> J-D,
>
> Here's the problem... you go to any relational database and do a select
> count(*) and you get a response back fairly quickly.
> The difference is that in HBase you're doing a physical count, while the
> relational engine is pulling it from metadata.
>
> I have a couple of ideas on how we could do this...
>
> -Mike
>
> > Date: Sat, 22 May 2010 09:25:51 -0700
> > Subject: Re: RowCounter example run time
> > From: [email protected]
> > To: [email protected]
> >
> > My first question would be, what do you expect exactly? Would 5 min be
> > enough? Or are you expecting something more like 1-2 secs (which is
> > impossible since this is mapreduce)?
> >
> > Then there's also Jon's questions.
> >
> > Finally, did you set a higher scanner caching on that job?
> > hbase.client.scanner.caching is the name of the config, which defaults
> > to 1. When mapping an HBase table, if you don't set it higher you're
> > basically benchmarking the RPC layer since it does 1 call per next()
> > invocation. Setting the right value depends on the size of your rows,
> > e.g. are you storing 60 bytes or something high like 100KB? On our 13B
> > rows table (each row is a few bytes), we set it to 10k.
> >
> > J-D
> >
> > On Sat, May 22, 2010 at 8:40 AM, Andrew Nguyen
> > <[email protected]> wrote:
> > > Hello,
> > >
> > > I finally got some decent hardware to put together a 1 master, 4 slave
> > > Hadoop/HBase cluster. However, I'm still waiting for space in the
> > > datacenter to clear out and only have 3 of the nodes deployed (master
> > > + 2 slaves). Each node is a quad-core AMD with 8G of RAM, running on a
> > > GigE network. HDFS is configured to run on a separate (from the OS
> > > drive) U320 drive. The master has RAID1 mirrored drives only.
> > >
> > > I've installed HBase with slave1 and slave2 as regionservers and
> > > master, slave1, slave2 as the ZK quorum. The master serves as the NN
> > > and JT, and the slaves as DN and TT.
> > >
> > > Now my question:
> > >
> > > I've imported 22.5M rows into HBase, into a single table. Each row has
> > > 8 or so columns. I just ran the RowCounter MR example and it takes
> > > about 25 minutes to complete. Is a 3-node setup too underpowered to
> > > combat the overhead of Hadoop and HBase? Or could it be something with
> > > my configuration? I've been playing around with Hadoop some, but this
> > > is my first attempt at anything HBase.
> > >
> > > Thanks!
> > >
> > > --Andrew

Every system has its tradeoff. In the example above:

>> select count(*) and you get a response back fairly quickly.

Try this with MyISAM: very fast. Try it with InnoDB: it takes a very long
time. Some systems maintain a row count and some do not. Now, if you are
using InnoDB, there is a quick way to get an approximate row count:

explain select count(*)

This causes the InnoDB engine to use its indexes to estimate the table
size. HBase does not maintain a row count, so counting rows is an
intensive process: it scans every row. Such is life.
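[A minimal sketch of J-D's scanner-caching suggestion, for anyone hitting the same wall. This is not the shipped RowCounter source; the class name, table argument, and the caching value of 1000 are placeholders, and the API calls (Scan.setCaching, TableMapReduceUtil.initTableMapperJob) reflect the HBase MapReduce API of that era. The point is simply that the Scan handed to the job carries the caching setting, instead of the default of 1 row per next() RPC.]

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class CachedRowCount {

  // Map-only job: bump a counter for every row the scan hands us.
  static class CountMapper extends TableMapper<ImmutableBytesWritable, Result> {
    enum Counters { ROWS }

    @Override
    protected void map(ImmutableBytesWritable key, Result value, Context context) {
      context.getCounter(Counters.ROWS).increment(1);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();

    Scan scan = new Scan();
    // Fetch many rows per RPC instead of the default of 1.
    // 1000 is an assumed starting point for small rows; use a much lower
    // value if your rows are large (e.g. 100KB each).
    scan.setCaching(1000);
    // Don't pollute the block cache with a one-off full-table scan.
    scan.setCacheBlocks(false);

    Job job = new Job(conf, "cached-row-count");
    job.setJarByClass(CachedRowCount.class);
    TableMapReduceUtil.initTableMapperJob(
        args[0],                       // table name, e.g. the 22.5M-row table
        scan,                          // the Scan carrying the caching setting
        CountMapper.class,
        ImmutableBytesWritable.class,
        Result.class,
        job);
    job.setOutputFormatClass(NullOutputFormat.class);
    job.setNumReduceTasks(0);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

[Equivalently, one could set hbase.client.scanner.caching in the job Configuration (conf.setInt("hbase.client.scanner.caching", 1000)) before building the job; the per-Scan setter just makes the intent explicit for this one scan.]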
