Hi James,

Great to hear that Phoenix supports this kind of table schema. Seems like
I'll go ahead with bucketing for time-series data.
Thanks,
Anil

On Tue, Sep 24, 2013 at 9:36 AM, James Taylor <[email protected]> wrote:

> Hey Anil,
> The solution you've described is the best we've found for Phoenix
> (inspired by the work of Alex at Sematext). You can do all of this in a
> few lines of SQL:
>
> CREATE TABLE event_data(
>     who VARCHAR, type SMALLINT, id BIGINT, when DATE, payload VARBINARY
>     CONSTRAINT pk PRIMARY KEY (who, type, id))
>     IMMUTABLE_ROWS=true; // Declare event table as having immutable rows
>
> CREATE INDEX event_data_index ON event_data(when, type, who)
>     INCLUDE(payload)
>     SALT_BUCKETS=10; // Salt the index, since it'll create write
>                      // hotspots otherwise
>
> The following query would display the event count per type across all
> users over the last week. It would automatically use the index:
>
> SELECT type, count(*) FROM event_data
> WHERE when > CURRENT_DATE() - 7 GROUP BY type;
>
> The following query would display the event count by type for a
> particular user. It would automatically use the data table:
>
> SELECT who, type, count(*) FROM event_data
> WHERE who = ? GROUP BY who, type;
>
> As far as the read cost associated with reading from a salted table,
> we've found that in most cases it actually performs better, because you
> get better parallelization. The case where it performs worse is a
> selective query that returns a smallish set of rows that would normally
> be in the same block. In that case you're reading an entire block for
> each row, whereas on an unsalted table those rows would, in the worst
> case, be neighbors in the same block.
>
> HTH,
>
> James
>
> On Tue, Sep 24, 2013 at 8:12 AM, anil gupta <[email protected]> wrote:
>
> > Inline
> >
> > On Mon, Sep 23, 2013 at 6:15 PM, Shahab Yunus <[email protected]> wrote:
> >
> > > Yeah, I saw that. In fact, that is why I recommended it to you, as I
> > > couldn't infer from your email whether you had already gone through
> > > that source or not.
> >
> > Yes, I was aware of that article.
> > But my read pattern is slightly different from that article. We are
> > using HBase as the data source for a RESTful service. Even if my range
> > scan finds 400 rows within the specified time range, I only return the
> > top 20 for one REST request. So if I do bucketing (let's say
> > buckets=10), then I will need to fetch 20 results from each bucket and
> > then do a merge sort on the client side to return the final 20. You can
> > assume that I need to return the 20 rows sorted by timestamp.
> >
> > > A source who did the exact same thing and discusses it in much more
> > > detail, with concerns aligning with yours (in fact, I think some of
> > > the authors/creators of that link/group are members of this community
> > > as well.)
> >
> > Do you know what the outcome of their experiment was? Do you have any
> > link for that? Thanks for your time and help.
> >
> > > Regards,
> > > Shahab
> > >
> > > On Mon, Sep 23, 2013 at 8:41 PM, anil gupta <[email protected]> wrote:
> > >
> > > > Hi Shahab,
> > > >
> > > > If you read my solution carefully, I am already doing that.
> > > >
> > > > Thanks,
> > > > Anil Gupta
> > > >
> > > > On Mon, Sep 23, 2013 at 3:51 PM, Shahab Yunus <[email protected]> wrote:
> > > >
> > > > > http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
> > > > >
> > > > > Here you can find the discussion, trade-offs, and working
> > > > > code/API (even for M/R) about this and the approach you are
> > > > > trying out.
> > > > >
> > > > > Regards,
> > > > > Shahab
> > > > >
> > > > > On Mon, Sep 23, 2013 at 5:41 PM, anil gupta <[email protected]> wrote:
> > > > >
> > > > > > Hi All,
> > > > > >
> > > > > > I have a secondary-index (inverted-index) table with a rowkey
> > > > > > based on the timestamp of an event.
> > > > > > Assume the rowkey is <TimeStamp in Epoch>. I also store some
> > > > > > extra columns (apart from the main_table rowkey) in that table
> > > > > > for filtering.
> > > > > >
> > > > > > The requirement is to do range-based scans on the time of an
> > > > > > event; hence the index with this rowkey. I cannot use a hashing
> > > > > > or MD5-digest solution, because then I cannot do range-based
> > > > > > scans. And I already have an OpenTSDB-like index in another
> > > > > > table for the same dataset. (I have many secondary indexes for
> > > > > > the same data set.)
> > > > > >
> > > > > > Problem: When we increase the write workload during a stress
> > > > > > test, the time secondary index becomes a bottleneck due to the
> > > > > > famous region-hotspotting problem.
> > > > > >
> > > > > > Solution: I am thinking of adding a prefix of
> > > > > > bucket = (<TimeStamp in Epoch> % 10) to the rowkey. Then my
> > > > > > rowkey will become: <Bucket><TimeStamp in Epoch>
> > > > > >
> > > > > > By using the above rowkey I can at least alleviate the *WRITE*
> > > > > > problem. (I don't think the problem can be fixed permanently,
> > > > > > because of the use-case requirements; I would love to be proven
> > > > > > wrong.)
> > > > > > However, with the above rowkey, when I want to *READ* data,
> > > > > > every single range scan has to read data from 10 different
> > > > > > regions. This extra load on reads is scaring me a bit.
> > > > > >
> > > > > > I am wondering if anyone has a better suggestion/approach to
> > > > > > solve this problem, given the constraints I have. Looking for
> > > > > > feedback from the community.
> > > > > >
> > > > > > --
> > > > > > Thanks & Regards,
> > > > > > Anil Gupta
> > > >
> > > > --
> > > > Thanks & Regards,
> > > > Anil Gupta
> >
> > --
> > Thanks & Regards,
> > Anil Gupta

--
Thanks & Regards,
Anil Gupta
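For readers following along: the bucketing scheme Anil describes above
(prefix the rowkey with `timestamp % 10`, fan the range scan out to every
bucket, then merge-sort on the client side) can be sketched in a few lines.
This is a minimal illustrative model only, not Phoenix or HBase code;
`BucketedIndex`, `NUM_BUCKETS`, and the in-memory sorted lists standing in
for each salt bucket's regions are all assumptions of the sketch.

```python
import heapq
from bisect import bisect_left, bisect_right

NUM_BUCKETS = 10

def rowkey(ts):
    """Bucket-prefixed rowkey: (<bucket>, <timestamp in epoch>)."""
    return (ts % NUM_BUCKETS, ts)

class BucketedIndex:
    def __init__(self):
        # One sorted list of timestamps per bucket, standing in for the
        # sorted rows held by each salt bucket's regions.
        self.buckets = [[] for _ in range(NUM_BUCKETS)]

    def put(self, ts):
        # Writes spread across NUM_BUCKETS buckets instead of hotspotting
        # on the tail of a single time-ordered region.
        bucket, _ = rowkey(ts)
        rows = self.buckets[bucket]
        rows.insert(bisect_left(rows, ts), ts)

    def range_scan_top_n(self, start_ts, end_ts, n):
        # Reads pay the fan-out cost Anil worries about: the
        # [start_ts, end_ts) slice of EVERY bucket is scanned (10 region
        # scans instead of 1), then the per-bucket results are merge-sorted
        # and only the first n are returned to the REST caller.
        per_bucket = []
        for rows in self.buckets:
            lo = bisect_left(rows, start_ts)
            hi = bisect_right(rows, end_ts - 1)
            # The globally smallest n rows must come from the union of each
            # bucket's smallest n, so truncating per bucket is safe.
            per_bucket.append(rows[lo:hi][:n])
        return list(heapq.merge(*per_bucket))[:n]
```

Even if a scan matches 400 rows, each bucket contributes at most `n` rows to
the client-side merge, which mirrors the "fetch 20 per bucket, merge, return
the final 20" read path described in the thread.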
