So its a *shameless* plug? :-) Depending on the project id, it could be a good key, but it would have to be something more meaningful than just a number.
To answer the question about time... Time Stamps are Longs which hold the number of ms since the a set time. (I forget the date and time but you can look it up in the API docs) By losing precision of the last 3 decimal places, you're rounding to the nearest second. Wasn't sure if that was answered before or not... On Feb 17, 2013, at 7:19 PM, James Taylor <[email protected]> wrote: > Michael is right - Phoenix wouldn't automatically solve these issues for you > - it would just a) decrease the amount of code you need to write while still > giving you coprocessor-speed performance, and b) give you an industry > standard API to read/write your data. > > However, since the date is not the leading part of the key, it wouldn't be a > problem for it to be monotonically increasing. If project_id is, then you > could reverse the bytes on the way in and on the way out to prevent hot > spotting on writes (basically taking the same approach as when you'd use the > HBase native APIs). If you wanted to do it in SQL, you could add your own > built-in function to Phoenix. I'll blog about how to do this soon. > > James > http://phoenix-hbase.blogspot.com/ > > On 02/17/2013 03:18 PM, Michael Segel wrote: >> I'm not sure how a SQL interface above HBase will solve some of the issues >> with regional hot spotting when using time as the key. Or the problem with >> always adding data to the right of the last row. >> >> The same would apply with the project id, assuming that it too is a number >> that grows incrementally with each project. >> On Feb 17, 2013, at 4:50 PM, James Taylor <[email protected]> wrote: >> >>> Hello, >>> Have you considered using Phoenix (https://github.com/forcedotcom/phoenix) >>> for this use case? Phoenix is a SQL layer on top of HBase. For this use >>> case, you'd connect to your cluster like this: >>> >>> Class.forName("com.salesforce.phoenix.jdbc.PhoenixDriver"); // register >>> driver >>> Connection conn = DriverManager..getConnection("jdbc:phoenix:localhost"); >>> // connect to local HBase >>> >>> Create a table like this (adding additional columns that you want to >>> measure, like txn_count below): >>> >>> conn.createStatement().execute( >>> "CREATE TABLE event_log (\n" + >>> " project_id INTEGER NOT NULL, \n" + >>> " time DATE NOT NULL,\n" + >>> "txn_count LONG\n" + >>> "CONSTRAINT pk PRIMARY KEY (project_id, time))"); >>> >>> Then to insert data you'd do this: >>> >>> PreparedStatement preparedStmt = conn.prepareStatement( >>> "UPSERT INTO event_log VALUES(?,?,0)"); >>> >>> and you'd bind the values in JDBC like this: >>> >>> preparedStmt.setInt(1, projectId); >>> preparedStmt.setDate(2, time); >>> preparedStmt.execute(); >>> >>> conn.commit(); // If upserting many values, you'd want to commit after >>> upserting maybe 1000-10000 rows >>> >>> Then at query data time, assuming you want to report on this data by >>> grouping into different "time buckets", you could do as show below. Phoenix >>> stores your date values at the millisecond granularity and you can decide a >>> query time how you'd like to roll it up: >>> >>> // Query with time bucket at the hour granularity >>> conn.createStatement().execute( >>> "SELECT\n" + >>> " project_id, TRUNC(time,'HOUR') as time_bucket, \n" + >>> " MIN(txnCount), MAX(txnCount), AVG(txnCount) FROM event_log\n" + >>> "GROUP BY project_id, TRUNC(time,'HOUR')"); >>> >>> // Query with time bucket at the day granularity >>> conn.createStatement().execute( >>> "SELECT\n" + >>> " project_id, TRUNC(time,'DAY') as time_bucket,\n" + >>> " MIN(txnCount), MAX(txnCount), AVG(txnCount) FROM event_log\n" + >>> "GROUP BY project_id, TRUNC(time,'DAY')"); >>> >>> You could, of course include a WHERE clause in the query to filter based on >>> the range of dates, particular projectIds, etc. like this: >>> >>> conn.prepareStatement( >>> "SELECT\n" + >>> " project_id, TRUNC(time,'DAY') as time_bucket,\n" + >>> " MIN(txnCount), MAX(txnCount), AVG(txnCount) FROM event_log\n" + >>> "WHERE project_id IN (?, ?, ?) AND date >= ? AND date < ?\n" + >>> "GROUP BY project_id, TRUNC(time,'DAY')"); >>> preparedStmt.setInt(1, projectId1); >>> preparedStmt.setInt(2, projectId2); >>> preparedStmt.setInt(3, projectId3); >>> preparedStmt.setDate(4, beginDate); >>> preparedStmt.setDate(5, endDate); >>> preparedStmt.execute(); >>> >>> >>> HTH. >>> >>> Regards, >>> >>> James >>> >>> On 02/17/2013 11:33 AM, Mehmet Simsek wrote: >>>> Hi, >>>> >>>> I want to hold event log data in hbase but I couldn't decide row key. I >>>> must hold project id and time,I will use project ld and time combination >>>> while searching. >>>> >>>> Row key can be below >>>> >>>> ProjectId+timeInMs >>>> >>>> In similiar application(open source TSDB) time is divided 1000 to round in >>>> this project.I can use this strategy but I don't know how we decide what >>>> divider must be? 1000 or 10000. >>>> >>>> Why time is divided 1000 in this application? why didn't be hold without >>>> division? >>>> >>>> Can you explain this strategy? >>>> >>> >> Michael Segel | (m) 312.755.9623 >> >> Segel and Associates >> >> > > Michael Segel | (m) 312.755.9623 Segel and Associates
