So it's a *shameless* plug?  :-)

Depending on the project id, it could be a good key, but it would have to be 
something more meaningful than just a number. 

To answer the question about time... Timestamps are longs which hold the 
number of ms since a fixed epoch (in Java, midnight on January 1, 1970 UTC - 
see the API docs). By dropping the last three digits (integer division by 
1000), you're truncating the value to whole seconds.
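
For example (a quick Java sketch; the sample values are just illustrative):

long ms = System.currentTimeMillis(); // e.g. 1361150399123 (ms since the epoch)
long seconds = ms / 1000L;            // e.g. 1361150399 - last 3 digits dropped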

Wasn't sure if that was answered before or not...


On Feb 17, 2013, at 7:19 PM, James Taylor <[email protected]> wrote:

> Michael is right - Phoenix wouldn't automatically solve these issues for 
> you - it would just (a) decrease the amount of code you need to write while 
> still giving you coprocessor-speed performance, and (b) give you an 
> industry-standard API to read/write your data.
> 
> However, since the date is not the leading part of the key, it wouldn't be a 
> problem for it to be monotonically increasing. If project_id is, then you 
> could reverse the bytes on the way in and on the way out to prevent hot 
> spotting on writes (basically taking the same approach as when you'd use the 
> HBase native APIs). If you wanted to do it in SQL, you could add your own 
> built-in function to Phoenix. I'll blog about how to do this soon.
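> 
> For illustration, here's a minimal sketch of that byte-reversal idea using 
> the plain HBase utilities (just a sketch, not an existing Phoenix API):
> 
> // Serialize the monotonically increasing project_id, then reverse the
> // bytes so consecutive ids scatter across regions instead of piling up
> // on the last one.
> byte[] key = org.apache.hadoop.hbase.util.Bytes.toBytes(projectId);
> for (int i = 0, j = key.length - 1; i < j; i++, j--) {
>     byte tmp = key[i]; key[i] = key[j]; key[j] = tmp;
> }
> // Apply the same reversal on the way out to recover the original id.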
> 
> James
> http://phoenix-hbase.blogspot.com/
> 
> On 02/17/2013 03:18 PM, Michael Segel wrote:
>> I'm not sure how a SQL interface on top of HBase will solve some of the 
>> issues with regional hot spotting when using time as the key, or the 
>> problem of always adding data to the right of the last row.
>> 
>> The same would apply with the project id, assuming that it too is a number 
>> that grows incrementally with each project.
>> On Feb 17, 2013, at 4:50 PM, James Taylor <[email protected]> wrote:
>> 
>>> Hello,
>>> Have you considered using Phoenix (https://github.com/forcedotcom/phoenix) 
>>> for this use case? Phoenix is a SQL layer on top of HBase. For this use 
>>> case, you'd connect to your cluster like this:
>>> 
>>> Class.forName("com.salesforce.phoenix.jdbc.PhoenixDriver"); // register 
>>> the driver
>>> Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost"); 
>>> // connect to local HBase
>>> 
>>> Create a table like this (adding additional columns that you want to 
>>> measure, like txn_count below):
>>> 
>>> conn.createStatement().execute(
>>>    "CREATE TABLE event_log (\n" +
>>>    "    project_id INTEGER NOT NULL,\n" +
>>>    "    time DATE NOT NULL,\n" +
>>>    "    txn_count LONG,\n" +
>>>    "    CONSTRAINT pk PRIMARY KEY (project_id, time))");
>>> 
>>> Then to insert data you'd do this:
>>> 
>>> PreparedStatement preparedStmt = conn.prepareStatement(
>>>    "UPSERT INTO event_log VALUES(?,?,0)");
>>> 
>>> and you'd bind the values in JDBC like this:
>>> 
>>> preparedStmt.setInt(1, projectId);
>>> preparedStmt.setDate(2, time);
>>> preparedStmt.execute();
>>> 
>>> conn.commit(); // If upserting many values, you'd want to commit after 
>>> upserting maybe 1000-10000 rows
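>>> 
>>> For example, a batched load might look like this (Event here is just a 
>>> hypothetical holder for a projectId and time):
>>> 
>>> for (int i = 0; i < events.size(); i++) {
>>>     Event e = events.get(i);
>>>     preparedStmt.setInt(1, e.projectId);
>>>     preparedStmt.setDate(2, e.time);
>>>     preparedStmt.execute();
>>>     if ((i + 1) % 1000 == 0) conn.commit(); // commit every 1000 rows
>>> }
>>> conn.commit(); // commit any remaining rows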
>>> 
>>> Then at query time, assuming you want to report on this data by grouping 
>>> it into different "time buckets", you could do as shown below. Phoenix 
>>> stores your date values at millisecond granularity, and you can decide at 
>>> query time how you'd like to roll them up:
>>> 
>>> // Query with time bucket at the hour granularity
>>> ResultSet rs = conn.createStatement().executeQuery(
>>>    "SELECT\n" +
>>>    "    project_id, TRUNC(time,'HOUR') as time_bucket,\n" +
>>>    "    MIN(txn_count), MAX(txn_count), AVG(txn_count) FROM event_log\n" +
>>>    "GROUP BY project_id, TRUNC(time,'HOUR')");
>>> 
>>> // Query with time bucket at the day granularity
>>> ResultSet rs = conn.createStatement().executeQuery(
>>>    "SELECT\n" +
>>>    "    project_id, TRUNC(time,'DAY') as time_bucket,\n" +
>>>    "    MIN(txn_count), MAX(txn_count), AVG(txn_count) FROM event_log\n" +
>>>    "GROUP BY project_id, TRUNC(time,'DAY')");
>>> 
>>> You could, of course, include a WHERE clause in the query to filter based 
>>> on a range of dates, particular projectIds, etc., like this:
>>> 
>>> PreparedStatement preparedStmt = conn.prepareStatement(
>>>    "SELECT\n" +
>>>    "    project_id, TRUNC(time,'DAY') as time_bucket,\n" +
>>>    "    MIN(txn_count), MAX(txn_count), AVG(txn_count) FROM event_log\n" +
>>>    "WHERE project_id IN (?, ?, ?) AND time >= ? AND time < ?\n" +
>>>    "GROUP BY project_id, TRUNC(time,'DAY')");
>>> preparedStmt.setInt(1, projectId1);
>>> preparedStmt.setInt(2, projectId2);
>>> preparedStmt.setInt(3, projectId3);
>>> preparedStmt.setDate(4, beginDate);
>>> preparedStmt.setDate(5, endDate);
>>> ResultSet rs = preparedStmt.executeQuery();
>>> 
>>> 
>>> HTH.
>>> 
>>> Regards,
>>> 
>>>    James
>>> 
>>> On 02/17/2013 11:33 AM, Mehmet Simsek wrote:
>>>> Hi,
>>>> 
>>>> I want to hold event log data in HBase, but I can't decide on a row key. 
>>>> I must hold the project id and the time, and I will search on the project 
>>>> id and time combination.
>>>> 
>>>> The row key could be as below:
>>>> 
>>>> ProjectId+timeInMs
>>>> 
>>>> In a similar application (the open source TSDB), the time is divided by 
>>>> 1000 to round it. I could use this strategy, but I don't know how to 
>>>> decide what the divisor should be: 1000 or 10000?
>>>> 
>>>> Why is the time divided by 1000 in that application? Why isn't it stored 
>>>> without the division?
>>>> 
>>>> Can you explain this strategy?
>>>> 
>>> 
>> Michael Segel  | (m) 312.755.9623
>> 
>> Segel and Associates
>> 
>> 
> 
> 

Michael Segel  | (m) 312.755.9623

Segel and Associates

