Michael, HBase is modeled after Bigtable, so I'll refer to its literature: http://turing.cs.washington.edu/papers/dataprojects-google-sigmodrecord08.pdf
In particular from section 6: "Since Bigtables are sparse structures, a row may or may not exist for a given query, depending on which columns that query requested. Data is maintained in lexicographical order but different columns may or may not be stored apart. Because of such semantics and storing scheme, skipping N rows is not feasible without actually reading them. Even finding the count of rows in a Bigtable at any point in time can be done only probabilistically."

The rest of that section, in HBase's world, is https://issues.apache.org/jira/browse/HBASE-2571

J-D

On Sun, May 23, 2010 at 7:36 AM, Michael Segel <[email protected]> wrote:
>
> J-D,
>
> Here's the problem: you go to any relational database, do a select
> count(*), and you get a response back fairly quickly.
> The difference is that in HBase you're doing a physical count, while the
> relational engine is pulling it from metadata.
>
> I have a couple of ideas on how we could do this...
>
> -Mike
>
>> Date: Sat, 22 May 2010 09:25:51 -0700
>> Subject: Re: RowCounter example run time
>> From: [email protected]
>> To: [email protected]
>>
>> My first question would be: what do you expect exactly? Would 5 min be
>> enough? Or are you expecting something more like 1-2 secs (which is
>> impossible since this is mapreduce)?
>>
>> Then there's also Jon's questions.
>>
>> Finally, did you set a higher scanner caching on that job?
>> hbase.client.scanner.caching is the name of the config, which defaults
>> to 1. When mapping an HBase table, if you don't set it higher you're
>> basically benchmarking the RPC layer, since it does 1 call per next()
>> invocation. Setting the right value depends on the size of your rows,
>> e.g. are you storing 60 bytes or something big like 100KB? On our 13B-row
>> table (each row is a few bytes), we set it to 10k.
>>
>> J-D
>>
>> On Sat, May 22, 2010 at 8:40 AM, Andrew Nguyen
>> <[email protected]> wrote:
>> > Hello,
>> >
>> > I finally got some decent hardware to put together a 1 master, 4 slave
>> > Hadoop/HBase cluster. However, I'm still waiting for space in the
>> > datacenter to clear out and only have 3 of the nodes deployed (master + 2
>> > slaves). Each node is a quad-core AMD with 8G of RAM, running on a GigE
>> > network. HDFS is configured to run on a separate (from the OS drive) U320
>> > drive. The master has RAID1 mirrored drives only.
>> >
>> > I've installed HBase with slave1 and slave2 as regionservers, and master,
>> > slave1, and slave2 as the ZK quorum. The master serves as the NN and JT,
>> > and the slaves as DN and TT.
>> >
>> > Now my question:
>> >
>> > I've imported 22.5M rows into HBase, into a single table. Each row has 8
>> > or so columns. I just ran the RowCounter MR example and it takes about 25
>> > minutes to complete. Is a 3-node setup too underpowered to combat the
>> > overhead of Hadoop and HBase? Or could it be something with my
>> > configuration? I've been playing around with Hadoop some, but this is my
>> > first attempt at anything HBase.
>> >
>> > Thanks!
>> >
>> > --Andrew
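[Editor's note: the hbase.client.scanner.caching setting J-D describes can be raised cluster-wide in hbase-site.xml. A minimal sketch follows; the value 1000 is purely illustrative, since the right number depends on row size, as J-D notes.]

```xml
<!-- hbase-site.xml: rows fetched per scanner next() RPC; default is 1.
     1000 is an illustrative value, not a recommendation. -->
<property>
  <name>hbase.client.scanner.caching</name>
  <value>1000</value>
</property>
```

It can also be overridden per job rather than cluster-wide, e.g. on the scan object in the job's setup code or via a -D generic option on the command line.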
