Re: What's the best modeling approach for ordering events by date?
Hi. The OPP will direct all activity for a range of keys to a particular node (or set of nodes, according to your replication factor). Depending on the volume of writes, this could be fine; depending on the distribution of key values you write at any given time, it can also be fine. But if you're using the OPP, your keys align with the time of receiving the data, and your application writes that data as it receives it, you're going to place write activity on effectively one node at a time, for the range of time allocated to that node.

If you use RP, and can divide time into slices fine enough that you have multiple tweets in a row, you trade a more complex read for better distribution of load throughout your cluster. Whether this is necessary depends on your particulars.

In your TweetsBySecond example, you're using a deterministic set of keys (the keys correspond to seconds since epoch). Querying for ranges of time is nice with OPP, but if the ranges of time you're interested in are constrained, you don't specifically need OPP. You could use RP and request all the keys for the seconds contained within the time range of interest. That way, you balance writes across the cluster more effectively than you would with OPP, while still getting a workable data set. Again, the degree to which you need this depends on your situation.

Others on the list will no doubt have more informed opinions on this than me. :)

On Thu, Apr 14, 2011 at 8:00 PM, Guillermo Winkler gwink...@inconcertcc.com wrote:
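The read Ethan suggests works because the second keys are deterministic: under RP you can enumerate every key in the window and fetch them directly instead of range-scanning. A minimal sketch in plain Python, with an in-memory dict standing in for a multiget against the per-second column family (all names here are illustrative, not a real client API):

```python
def second_keys(start_ts, end_ts):
    # The row keys are deterministic: one key per second in the window.
    return [str(s) for s in range(start_ts, end_ts + 1)]

def read_time_range(store, start_ts, end_ts):
    # `store` stands in for a multiget against the per-second column
    # family: row key -> ordered list of event ids for that second.
    # Under RP the keys hash to different nodes, so both the reads and
    # the earlier writes spread across the cluster.
    events = []
    for key in second_keys(start_ts, end_ts):
        events.extend(store.get(key, []))
    return events
```

Missing seconds simply contribute nothing, so sparse windows cost only the empty lookups.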
What's the best modeling approach for ordering events by date?
I have a huge number of events I need to consume later, ordered by the date each event occurred. My first approach was to use seconds since epoch as the row key and event ids as column names (with empty values), like this:

EventsByDate : {
  SecondsSinceEpoch : {
    evid:, evid:, evid:
  }
}

with OPP as the partitioner, using GetRangeSlices to retrieve the events sequentially in order.

Now I have two problems to solve:
1) The system is real-time, so all the events in a given moment are hitting the same box.
2) After migrating from Cassandra 0.6 to Cassandra 0.7, OPP doesn't seem to like LongType for row keys; was this deliberately deprecated?

I was thinking about secondary indexes, but they don't guarantee the order the rows come out of Cassandra in. Does anyone have a better approach to modeling events by date given those restrictions?

Thanks,
Guille
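The sequential read described above can be sketched like this, a plain-Python stand-in for a GetRangeSlices scan under OPP, where rows come back in key order (the paging cursor and all names are illustrative):

```python
def page_events(store, start_key, page_size):
    # Stand-in for a range slice under OPP: rows come back in key
    # order, so iterating the keys numerically mimics the scan.
    out = []
    for key in sorted(store, key=int):
        if int(key) < int(start_key):
            continue
        for evid in store[key]:
            out.append(evid)
            if len(out) == page_size:
                # Resume from this key on the next page; the caller must
                # skip event ids already seen in this row.
                return out, key
    return out, None
```

This in-order scan is exactly what makes OPP attractive for the read side, and also what concentrates the real-time writes on one node.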
Re: What's the best modeling approach for ordering events by date?
How do you plan to read the data? Entire histories, or relatively confined slices of time? Do the events have any attributes by which you might segregate them, apart from time?

If you can divide time into a fixed series of intervals, you can insert the members of a given interval as columns (or supercolumns) in a row. But it depends on how you want to use the data on the read side.

On Thu, Apr 14, 2011 at 12:25 PM, Guillermo Winkler gwink...@inconcertcc.com wrote:
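The fixed-interval idea above can be sketched as a key function; any bucket width works, with one second being the special case used later in the thread (the function name is illustrative):

```python
def interval_key(event_ts, bucket_seconds=1):
    # Truncate the event timestamp to the start of its containing
    # interval; the result is the row key for that bucket. All events
    # falling inside the bucket share the row and become its columns.
    return str(event_ts - (event_ts % bucket_seconds))
```

Wider buckets mean fewer rows to fetch per time range, at the cost of larger rows and coarser write distribution.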
Re: What's the best modeling approach for ordering events by date?
Hi Ethan,

I want to present the events ordered by time, always in pages of 20/40 events. If the events are tweets, you can have 1000 tweets in the same second, or 30 tweets over a 10 minute range, but I always want to be able to page through the results in an orderly fashion.

I think using seconds since epoch is what I'm doing already; that is, dividing time into a fixed series of intervals. Each second is an interval, and all of the events for that particular second are columns of that row. Again with tweets for easier visualization:

TweetsBySecond : {
  12121121212 : {  - seconds since epoch
    id1, id2, id3  - all the tweet ids that occurred in that particular second
  },
  12121212123 : {
    id4, id5
  },
  12121212124 : {
    id6
  }
}

The problem is that you can't do that using OPP in Cassandra 0.7, or am I just missing something?

Thanks for your answer,
Guille

On Thu, Apr 14, 2011 at 4:49 PM, Ethan Rowe et...@the-rowes.com wrote:
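The TweetsBySecond layout sketched as a write path, in plain Python with a dict standing in for the column family; each event id becomes a column name with an empty value, matching the model in the thread (the function name is illustrative):

```python
def record_event(store, event_ts, event_id):
    # Row key = seconds since epoch; the column name is the event id
    # and the value is empty, matching the TweetsBySecond layout.
    store.setdefault(str(event_ts), {})[event_id] = b""
```

Because the row key is derived purely from the timestamp, writers need no coordination: concurrent events in the same second land as distinct columns of the same row.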