Re: Row Key Design in time based aplication

James Taylor Sun, 17 Feb 2013 17:19:46 -0800

Michael is right - Phoenix wouldn't automatically solve these issues foryou - it would just a) decrease the amount of code you need to writewhile still giving you coprocessor-speed performance, and b) give you anindustry standard API to read/write your data.

However, since the date is not the leading part of the key, it wouldn'tbe a problem for it to be monotonically increasing. If project_id is,then you could reverse the bytes on the way in and on the way out toprevent hot spotting on writes (basically taking the same approach aswhen you'd use the HBase native APIs). If you wanted to do it in SQL,you could add your own built-in function to Phoenix. I'll blog about howto do this soon.


James
http://phoenix-hbase.blogspot.com/

On 02/17/2013 03:18 PM, Michael Segel wrote:

I'm not sure how a SQL interface above HBase will solve some of the issues with 
regional hot spotting when using time as the key. Or the problem with always 
adding data to the right of the last row.

The same would apply with the project id, assuming that it too is a number that 
grows incrementally with each project.
On Feb 17, 2013, at 4:50 PM, James Taylor <[email protected]> wrote:

Hello,
Have you considered using Phoenix (https://github.com/forcedotcom/phoenix) for 
this use case? Phoenix is a SQL layer on top of HBase. For this use case, you'd 
connect to your cluster like this:

Class.forName("com.salesforce.phoenix.jdbc.PhoenixDriver"); // register driver
Connection conn = DriverManager..getConnection("jdbc:phoenix:localhost"); // 
connect to local HBase

Create a table like this (adding additional columns that you want to measure, 
like txn_count below):

conn.createStatement().execute(
    "CREATE TABLE event_log (\n" +
    "     project_id INTEGER NOT NULL, \n" +
    "    time DATE NOT NULL,\n" +
    "txn_count LONG\n" +
    "CONSTRAINT pk PRIMARY KEY (project_id, time))");

Then to insert data you'd do this:

PreparedStatement preparedStmt = conn.prepareStatement(
    "UPSERT INTO event_log VALUES(?,?,0)");

and you'd bind the values in JDBC like this:

preparedStmt.setInt(1, projectId);
preparedStmt.setDate(2, time);
preparedStmt.execute();

conn.commit(); // If upserting many values, you'd want to commit after 
upserting maybe 1000-10000 rows

Then at query data time, assuming you want to report on this data by grouping into 
different "time buckets", you could do as show below. Phoenix stores your date 
values at the millisecond granularity and you can decide a query time how you'd like to 
roll it up:

// Query with time bucket at the hour granularity
conn.createStatement().execute(
   "SELECT\n" +
   "    project_id, TRUNC(time,'HOUR') as time_bucket, \n" +
   "    MIN(txnCount), MAX(txnCount), AVG(txnCount) FROM event_log\n" +
   "GROUP BY project_id, TRUNC(time,'HOUR')");

// Query with time bucket at the day granularity
conn.createStatement().execute(
    "SELECT\n" +
    "    project_id, TRUNC(time,'DAY') as time_bucket,\n" +
    "    MIN(txnCount), MAX(txnCount), AVG(txnCount) FROM event_log\n" +
    "GROUP BY project_id, TRUNC(time,'DAY')");

You could, of course include a WHERE clause in the query to filter based on the 
range of dates, particular projectIds, etc. like this:

conn.prepareStatement(
    "SELECT\n" +
    "    project_id, TRUNC(time,'DAY') as time_bucket,\n" +
    "    MIN(txnCount), MAX(txnCount), AVG(txnCount) FROM event_log\n" +
    "WHERE project_id IN (?, ?, ?) AND date >= ? AND date < ?\n" +
    "GROUP BY project_id, TRUNC(time,'DAY')");
preparedStmt.setInt(1, projectId1);
preparedStmt.setInt(2, projectId2);
preparedStmt.setInt(3, projectId3);
preparedStmt.setDate(4, beginDate);
preparedStmt.setDate(5, endDate);
preparedStmt.execute();


HTH.

Regards,

    James

On 02/17/2013 11:33 AM, Mehmet Simsek wrote:

Hi,

I want to hold event log data in hbase but I couldn't decide row key. I must 
hold project id and time,I will use project ld and time combination while 
searching.

Row key can be below

ProjectId+timeInMs

In similiar application(open source TSDB) time is divided 1000 to round in this 
project.I can use this strategy but I don't know how we decide what divider 
must be?  1000 or 10000.

Why time is divided 1000 in this application? why didn't be hold without 
division?

Can you explain this strategy?

Michael Segel  | (m) 312.755.9623

Segel and Associates

Re: Row Key Design in time based aplication

Reply via email to