Hi Vivek,
Take a look at the SQL skin for HBase called Phoenix (https://github.com/forcedotcom/phoenix). Instead of using the native HBase client, you use regular JDBC and Phoenix takes care of making the native HBase calls for you.
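For example, hooking up through JDBC is just a few lines (a minimal sketch; it assumes the Phoenix client jar is on your classpath and "localhost" stands in for your ZooKeeper quorum):

import java.sql.Connection;
import java.sql.DriverManager;

public class PhoenixConnect {
    public static void main(String[] args) throws Exception {
        // Phoenix connection URLs are "jdbc:phoenix:" plus your ZooKeeper quorum.
        // With JDBC 4 the driver registers itself from the jar automatically.
        Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
        System.out.println("connected: " + !conn.isClosed());
        conn.close();
    }
}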

We support composite row keys, so you could form your row key like this:
CREATE TABLE TimeSeries (
    host VARCHAR NOT NULL,
    date DATE NOT NULL,
    value1 BIGINT,
    value2 DECIMAL(10,4)
    CONSTRAINT pk PRIMARY KEY (host, date)); -- composite row key of host + date
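To populate the table from Java, you'd use Phoenix's UPSERT statement through a PreparedStatement (again a sketch; the URL and values are placeholders). Phoenix buffers mutations on the client and sends them when you commit:

import java.math.BigDecimal;
import java.sql.Connection;
import java.sql.Date;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class LoadTimeSeries {
    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
        conn.setAutoCommit(false); // buffer mutations; they're sent on commit()
        PreparedStatement stmt = conn.prepareStatement(
            "UPSERT INTO TimeSeries(host, date, value1, value2) VALUES (?, ?, ?, ?)");
        stmt.setString(1, "host1");
        stmt.setDate(2, new Date(System.currentTimeMillis()));
        stmt.setLong(3, 42L);
        stmt.setBigDecimal(4, new BigDecimal("123.4567"));
        stmt.executeUpdate();
        conn.commit(); // flush the batched mutations to HBase
        conn.close();
    }
}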

Then to do aggregate queries, you use our built-in AVG, SUM, COUNT, MIN, and MAX functions:
SELECT AVG(value1), SUM(value2) * 123.45 / 678.9 FROM TimeSeries
WHERE host IN ('host1','host2')
GROUP BY TRUNC(date, 'DAY') -- group into day-sized buckets
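From Java that runs like any other JDBC query (sketch, same placeholder URL as above); the AVG and the DECIMAL arithmetic should come back as DECIMAL, readable via getBigDecimal:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class QueryTimeSeries {
    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
        Statement stmt = conn.createStatement();
        ResultSet rs = stmt.executeQuery(
            "SELECT AVG(value1), SUM(value2) * 123.45 / 678.9 FROM TimeSeries"
            + " WHERE host IN ('host1','host2')"
            + " GROUP BY TRUNC(date, 'DAY')");
        while (rs.next()) {
            // one row per day-sized bucket
            System.out.println(rs.getBigDecimal(1) + "\t" + rs.getBigDecimal(2));
        }
        conn.close();
    }
}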

For debugging, you can either use the command-line terminal we bundle (https://github.com/forcedotcom/phoenix#command-line) or install a SQL client like SQuirreL (https://github.com/forcedotcom/phoenix#sql-client).
You'll see your integer, date, and decimal types as you'd expect.

We have integration with MapReduce and Pig, so you could use those tools in conjunction with Phoenix.

We also support TopN queries, SELECT DISTINCT, and transparent salting for when your row key leads with a monotonically increasing value like time. Our performance (https://github.com/forcedotcom/phoenix/wiki/Performance) can't be beat. See our recent announcement for more detail: http://phoenix-hbase.blogspot.com/2013/05/announcing-phoenix-12.html

HTH.

Regards,

James
@JamesPlusPlus


On 05/19/2013 08:41 AM, Ted Yu wrote:
For #b, take a look at
src/main/java/org/apache/hadoop/hbase/client/coprocessor/AggregationClient.java
in 0.94. It supports avg, max, min, and sum operations by calling coprocessors.
Here is a snippet from its javadoc:

  * This client class is for invoking the aggregate functions deployed on the
  * Region Server side via the AggregateProtocol. This class will implement the
  * supporting functionality for summing/processing the individual results
  * obtained from the AggregateProtocol for each region.
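A rough sketch of driving it from client code (this assumes 0.94 with the AggregateImplementation coprocessor loaded on the region servers, long-encoded cell values, and placeholder family/qualifier names "cf"/"value1"):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.coprocessor.AggregationClient;
import org.apache.hadoop.hbase.client.coprocessor.LongColumnInterpreter;
import org.apache.hadoop.hbase.util.Bytes;

public class AvgOverRegions {
    public static void main(String[] args) throws Throwable {
        Configuration conf = HBaseConfiguration.create();
        AggregationClient client = new AggregationClient(conf);
        Scan scan = new Scan();
        // The scan must name the column the interpreter will decode.
        scan.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("value1"));
        // LongColumnInterpreter decodes 8-byte long cell values.
        double avg = client.avg(Bytes.toBytes("TimeSeries"),
                                new LongColumnInterpreter(), scan);
        System.out.println("avg = " + avg);
    }
}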

For #c, running HBase and MR on the same cluster is acceptable. If you have
additional hardware, you can run MapReduce jobs on separate machines where the
region server is not running.

Cheers

On Sun, May 19, 2013 at 8:29 AM, Vivek Padmanabhan
<vpadmanab...@aryaka.com>wrote:

Hi,
   I am pretty new to HBase, so it would be great if someone could help me
out with the queries below:

(Ours is time-series data, and all queries will be range scans on
 composite row keys.)

a) What is the usual practice for storing data types?

    We have noticed that converting data types to bytes renders the data
unreadable while debugging. For ids or int values we only see the byte
representation, so for some important columns we convert
datatype -> characters -> bytes rather than datatype -> bytes.
    (Maybe we could write a wrapper over the hbase shell to solve this,
but is there a simpler way?)
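    Here is the difference we see (a small sketch using HBase's Bytes
utility; toStringBinary prints roughly what the shell shows):

import org.apache.hadoop.hbase.util.Bytes;

public class EncodingDemo {
    public static void main(String[] args) {
        long id = 12345L;
        byte[] raw = Bytes.toBytes(id);                     // datatype -> bytes
        byte[] readable = Bytes.toBytes(Long.toString(id)); // datatype -> characters -> bytes
        System.out.println(Bytes.toStringBinary(raw));      // \x00\x00\x00\x00\x00\x0009
        System.out.println(Bytes.toStringBinary(readable)); // 12345
    }
}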


b) What is the best way to achieve operations like AVG, SUM, or a custom
formula for real-time queries: coprocessors, or in-memory computation over the
query results?
    (The formula we apply might change at any time, so storing the results
is not an option.)


c) We are planning to start off with a four-node cluster running both
HBase and MR jobs.
    I have heard that it is not recommended to have both HBase and MR on
the same cluster, but I would like to understand the possible bottlenecks.

   (We plan to run MR on HDFS and MR on HBase. Most of our MR jobs are
IO-bound rather than CPU-bound.)


Thanks
Vivek

