On Sun, May 23, 2010 at 10:36 AM, Michael Segel <[email protected]> wrote:
> J-D,
>
> Here's the problem... you go to any relational database and do a select
> count(*) and you get a response back fairly quickly.
> The difference is that in HBase you're doing a physical count, while the
> relational engine is pulling it from metadata.
>
> I have a couple of ideas on how we could do this...
>
> -Mike
>
> > Date: Sat, 22 May 2010 09:25:51 -0700
> > Subject: Re: RowCounter example run time
> > From: [email protected]
> > To: [email protected]
> >
> > My first question would be, what do you expect exactly? Would 5 min be
> > enough? Or are you expecting something more like 1-2 secs (which is
> > impossible since this is mapreduce)?
> >
> > Then there's also Jon's questions.
> >
> > Finally, did you set a higher scanner caching on that job?
> > hbase.client.scanner.caching is the name of the config, which defaults
> > to 1. When mapping an HBase table, if you don't set it higher you're
> > basically benchmarking the RPC layer since it does 1 call per next()
> > invocation. Setting the right value depends on the size of your rows,
> > e.g. are you storing 60 bytes or something high like 100KB? On our 13B
> > rows table (each row is a few bytes), we set it to 10k.
> >
> > J-D
> >
> > On Sat, May 22, 2010 at 8:40 AM, Andrew Nguyen
> > <[email protected]> wrote:
> > > Hello,
> > >
> > > I finally got some decent hardware to put together a 1 master, 4 slave
> > > Hadoop/HBase cluster. However, I'm still waiting for space in the
> > > datacenter to clear out and only have 3 of the nodes deployed (master
> > > + 2 slaves). Each node is a quad-core AMD with 8G of RAM, running on a
> > > GigE network. HDFS is configured to run on a separate (from the OS
> > > drive) U320 drive. The master has RAID1 mirrored drives only.
> > >
> > > I've installed HBase with slave1 and slave2 as regionservers and
> > > master, slave1, slave2 as the ZK quorum. The master serves as the NN
> > > and JT, and the slaves as DN and TT.
> > >
> > > Now my question:
> > >
> > > I've imported 22.5M rows into HBase, into a single table. Each row has
> > > 8 or so columns. I just ran the RowCounter MR example and it takes
> > > about 25 minutes to complete. Is a 3-node setup too underpowered to
> > > combat the overhead of Hadoop and HBase? Or could it be something with
> > > my configuration? I've been playing around with Hadoop some, but this
> > > is my first attempt at anything HBase.
> > >
> > > Thanks!
> > >
> > > --Andrew

Every system has its tradeoff. In the example above:

>> select count(*) and you get a response back fairly quickly.

Try this with MyISAM: very fast. Try it with InnoDB: it takes a very long
time. Some systems maintain a row count and some do not. Now, if you are
using InnoDB, there is a quick way to get an approximate row count:

explain select count(*)

This causes the InnoDB engine to use its indexes to estimate the table
size. HBase does not maintain a row count, so counting rows is an
intensive process: it scans every row. Such is life.
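[A minimal sketch of J-D's scanner-caching suggestion, for anyone hitting the same wall. This is not the shipped RowCounter source; the class name, table argument, and the caching value of 1000 are placeholders, and the API calls (Scan.setCaching, TableMapReduceUtil.initTableMapperJob) reflect the HBase MapReduce API of that era. The point is simply that the Scan handed to the job carries the caching setting, instead of the default of 1 row per next() RPC.]

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class CachedRowCount {

  // Map-only job: bump a counter for every row the scan hands us.
  static class CountMapper extends TableMapper<ImmutableBytesWritable, Result> {
    enum Counters { ROWS }

    @Override
    protected void map(ImmutableBytesWritable key, Result value, Context context) {
      context.getCounter(Counters.ROWS).increment(1);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();

    Scan scan = new Scan();
    // Fetch many rows per RPC instead of the default of 1.
    // 1000 is an assumed starting point for small rows; use a much lower
    // value if your rows are large (e.g. 100KB each).
    scan.setCaching(1000);
    // Don't pollute the block cache with a one-off full-table scan.
    scan.setCacheBlocks(false);

    Job job = new Job(conf, "cached-row-count");
    job.setJarByClass(CachedRowCount.class);
    TableMapReduceUtil.initTableMapperJob(
        args[0],                       // table name, e.g. the 22.5M-row table
        scan,                          // the Scan carrying the caching setting
        CountMapper.class,
        ImmutableBytesWritable.class,
        Result.class,
        job);
    job.setOutputFormatClass(NullOutputFormat.class);
    job.setNumReduceTasks(0);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

[Equivalently, one could set hbase.client.scanner.caching in the job Configuration (conf.setInt("hbase.client.scanner.caching", 1000)) before building the job; the per-Scan setter just makes the intent explicit for this one scan.]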
