The select count(*) optimization is a classic in databases - some
people argue that it's really important and should be optimized for
(MyISAM, for example) and others note that it's a trick and real DB
loads rarely run it on a sizable table.  Note that MyISAM locks the
entire table for each update (only 1 update at a time), so comparing
HBase to it is odd.  InnoDB doesn't maintain a count (maintaining
global stats under concurrency can be difficult).  Oracle doesn't
either (but may be able to use a primary index to reduce the blocks
read).
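To make the "global stats under concurrency" point concrete, here's a
toy sketch (Python, purely illustrative - not how any of these engines
actually work): keeping an exact global row count means every writer
serializes on the same counter, which is roughly why a one-writer-at-a-time
engine can afford it while concurrent engines avoid it.

```python
import threading

class ExactCountTable:
    """Toy table that keeps an exact global row count.

    Every insert must take the same lock to bump the counter, so all
    writers serialize on it: cheap count(*), contended concurrent writes.
    """
    def __init__(self):
        self._lock = threading.Lock()
        self._rows = {}
        self._count = 0

    def insert(self, key, value):
        with self._lock:            # global serialization point
            if key not in self._rows:
                self._count += 1    # only new keys raise the count
            self._rows[key] = value

    def count(self):
        return self._count          # O(1): the count is pre-maintained

table = ExactCountTable()
threads = [threading.Thread(target=table.insert, args=(i % 50, "v"))
           for i in range(200)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(table.count())  # 50 distinct keys
```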

Implementing this in HBase might be difficult - when a new column is
inserted into a table, the regionserver doesn't know whether that row
already exists - to know that, it would have to read some data,
potentially from disk, first.  Any scheme that requires the
regionserver to increment a "rowsForRegion" counter during certain
inserts would therefore be problematic.
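A toy sketch of why (Python; this illustrates the shape of the write
path only - ToyRegion and its methods are made up, not HBase code):

```python
class ToyRegion:
    """Illustrative LSM-style region: writes land in an in-memory
    buffer, while older data lives in immutable store files on disk."""

    def __init__(self, store_files):
        self.store_files = store_files  # list of dicts, oldest first
        self.memstore = {}

    def put(self, row, value):
        # A put is blind: it never consults the store files, so the
        # region cannot tell a brand-new row from an overwrite here,
        # and thus cannot safely bump a "rowsForRegion" counter.
        self.memstore[row] = value

    def row_exists(self, row):
        # Answering "did this row already exist?" means checking the
        # memstore AND every store file - potentially disk reads.
        if row in self.memstore:
            return True
        return any(row in sf for sf in self.store_files)

region = ToyRegion(store_files=[{"row1": "a"}, {"row2": "b"}])
region.put("row1", "a2")  # overwrite - but put() couldn't know that
region.put("row3", "c")   # genuinely new row - put() couldn't know either
```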

As JD noted, the likely cause here is scanner pre-fetch caching.  We
ship with very conservative scanner pre-fetch values because if a
client takes too long between calls it will get a fatal exception.
RowCounter MR jobs shouldn't be affected by that, however.
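The effect is easy to ballpark.  A back-of-the-envelope sketch
(Python; client-side view only, ignoring server-side work and row
sizes) of how the scanner caching value changes the number of next()
round trips for a full scan:

```python
import math

def scan_rpcs(total_rows, caching):
    """Approximate number of next() RPC round trips for a full table
    scan when each RPC fetches `caching` rows (the client-side effect
    of hbase.client.scanner.caching)."""
    return math.ceil(total_rows / caching)

rows = 22_500_000                 # the 22.5M-row table from this thread
print(scan_rpcs(rows, 1))         # default caching of 1 -> 22500000 RPCs
print(scan_rpcs(rows, 1000))      # caching of 1000     -> 22500 RPCs
```

With the default of 1, the job is mostly measuring RPC latency rather
than scan throughput.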

As for cluster sizing - 6-10 nodes is really the minimum.  With 3
nodes you are replicating data to every node, and you aren't getting
the benefits of a clustered solution.  At higher node counts you get
some disjoint parallelism underway and things really pick up on the
larger datasets (I can do MapReduces at 7-8M rows/sec for 20+ minutes
on end).

-ryan


On Sun, May 23, 2010 at 7:58 AM, Edward Capriolo <[email protected]> wrote:
> On Sun, May 23, 2010 at 10:36 AM, Michael Segel
> <[email protected]>wrote:
>
>>
>> J-D,
>>
>> Here's the problem.. you go to any relational database and do a select
>> count(*) and you get a response back fairly quickly.
>> The difference is that in HBase you're doing a physical count, while with
>> the relational engine you're pulling it from metadata.
>>
>> I have a couple of ideas on how we could do this...
>>
>> -Mike
>>
>> > Date: Sat, 22 May 2010 09:25:51 -0700
>> > Subject: Re: RowCounter example run time
>> > From: [email protected]
>> > To: [email protected]
>> >
>> > My first question would be, what do you expect exactly? Would 5 min be
>> > enough? Or are you expecting something more like 1-2 secs (which is
>> > impossible since this is mapreduce)?
>> >
>> > Then there's also Jon's questions.
>> >
>> > Finally, did you set a higher scanner caching on that job?
>> > hbase.client.scanner.caching is the name of the config, which defaults
>> > to 1. When mapping an HBase table, if you don't set it higher you're
>> > basically benchmarking the RPC layer since it does 1 call per next()
>> > invocation. Setting the right value depends on the size of your rows
>> > eg are you storing 60 bytes or something high like 100KB? On our 13B
>> > rows table (each row is a few bytes), we set it to 10k.
>> >
>> > J-D
>> >
>> > On Sat, May 22, 2010 at 8:40 AM, Andrew Nguyen
>> > <[email protected]> wrote:
>> > > Hello,
>> > >
>> > > I finally got some decent hardware to put together a 1 master, 4 slave
>> Hadoop/HBase cluster.  However, I'm still waiting for space in the
>> datacenter to clear out and only have 3 of the nodes deployed (master + 2
>> slaves).  Each node is a quad-core AMD with 8G of RAM, running on a GigE
>> network.  HDFS is configured to run on a separate (from the OS drive) U320
>> drive.  The master has RAID1 mirrored drives only.
>> > >
>> > > I've installed HBase with slave1 and slave2 as regionservers and
>> master, slave1, slave2 as the ZK quorum.  The master serves as the NN and JT
>> and the slaves as DN and TT.
>> > >
>> > > Now my question:
>> > >
>> > > I've imported 22.5M rows into HBase, into a single table.  Each row has
>> 8 or so columns.  I just ran the RowCounter MR example and it takes about 25
>> minutes to complete.  Is a 3 node setup too underpowered to combat the
>> overhead of Hadoop and HBase?  Or, could it be something with my
>> configuration?  I've been playing around with Hadoop some but this is my
>> first attempt at anything HBase.
>> > >
>> > > Thanks!
>> > >
>> > > --Andrew
>>
>
> Every system has its tradeoff. In the example above:
>
>>> select count(*) and you get a response back fairly quickly.
>
> Try this with MyISAM: very fast.  Try it with InnoDB: it takes a very
> long time.  Some systems maintain a row count and some do not.
>
> Now if you are using InnoDB there is a quick way to get an approximate row
> count:
>
> explain select count(*)
>
> This causes the InnoDB engine to use indexes for an approximate table size.
>
> HBase does not maintain a row count.  Counting rows is an intensive
> process, as it scans every row.  Such is life.
>
