Re: What's the best modeling approach for ordering events by date?

2011-04-15 Thread Ethan Rowe
Hi.

So, the OPP will direct all activity for a range of keys to a particular
node (or set of nodes, in accordance with your replication factor).
 Depending on the volume of writes and on the distribution of key values you
write at any given time, this could be fine.  But if you're using the OPP,
your keys align with the time of receiving the data, and your application
writes that data as it receives it, you're going to be concentrating write
activity on effectively one node at a time, for the range of time allocated
to that node.

If you use RP, and can divide time into finer slices such that you have
multiple tweets in a row, you trade off a more complex read in exchange for
better distribution of load throughout your cluster.  The necessity of this
depends on your particulars.

In your TweetsBySecond example, you're using a deterministic set of keys
(the keys correspond to seconds since epoch).  Querying for ranges of time
is nice with OPP, but if the ranges of time you're interested in are
constrained, you don't specifically need OPP.  You could use RP and request
all the keys for the seconds contained within the time range of interest.
 In this way, you balance writes across the cluster more effectively than
you would with OPP, while still getting a workable data set.  Again, the
degree to which you need this is dependent on your situation.  Others on the
list will no doubt have more informed opinions on this than me.  :)
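
For illustration, a rough sketch of that read path with a pycassa-style
client (the keyspace, column family name, and string key encoding are
placeholders, not taken from your setup):

# Sketch: read a time window under RandomPartitioner by enumerating the
# per-second row keys in the window, multiget-ing them, and merging the
# results client-side. Keep the window modest so the key list stays small.
import pycassa

pool = pycassa.ConnectionPool('Events', ['localhost:9160'])
tweets_by_second = pycassa.ColumnFamily(pool, 'TweetsBySecond')

def tweet_ids_between(start_epoch, end_epoch):
    # One row key per second in the window; RP spreads these rows across the ring.
    keys = [str(sec) for sec in range(start_epoch, end_epoch + 1)]
    rows = tweets_by_second.multiget(keys)  # {row key: {tweet id: value, ...}}
    ids = []
    for sec in keys:  # walk the seconds in order to keep the result chronological
        for tweet_id in rows.get(sec, {}):
            ids.append(tweet_id)
    return ids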

On Thu, Apr 14, 2011 at 8:00 PM, Guillermo Winkler gwink...@inconcertcc.com
 wrote:

 Hi Ethan,

 I want to present the events ordered by time, always in pages of 20/40
 events. If the events are tweets, you can have 1000 tweets from the same
 second or you can have 30 tweets in a 10-minute range. But I always want to
 be able to page through the results in an orderly fashion.

 I think using seconds since epoch is what I'm doing, that is, dividing
 time into a fixed series of intervals. Each second is an interval, and all of
 the events for that particular second are columns of that row.

 Again, with tweets for easier visualization:

 TweetsBySecond : {
  12121121212 : { - seconds since epoch
  id1, id2, id3 - all the tweet ids that occurred in that particular second
 },
 12121212123 : {
 id4,id5
 },
 12121212124 : {
 id6
 }
 }

 The problem is you can't do that using OPP in Cassandra 0.7, or is it just
 me missing something?

 Thanks for your answer,
 Guille

 On Thu, Apr 14, 2011 at 4:49 PM, Ethan Rowe et...@the-rowes.com wrote:

 How do you plan to read the data?  Entire histories, or in relatively
 confined slices of time?  Do the events have any attributes by which you
 might segregate them, apart from time?

 If you can divide time into a fixed series of intervals, you can insert
 members of a given interval as columns (or supercolumns) in a row.  But it
 depends how you want to use the data on the read side.


 On Thu, Apr 14, 2011 at 12:25 PM, Guillermo Winkler 
 gwink...@inconcertcc.com wrote:

 I have a huge number of events I need to consume later, ordered by the
 date the event occurred.

 My first approach to this problem was to use seconds since epoch as row
 key, and event ids as column names (empty value), this way:

 EventsByDate : {
 SecondsSinceEpoch: {
 evid:, evid:, evid:
 }
 }

 And use OPP as the partitioner, using GetRangeSlices to retrieve ordered
 events sequentially.

 Now I have two problems to solve:

 1) The system is realtime, so all the events in a given moment are
 hitting the same box
 2) Migrating from Cassandra 0.6 to Cassandra 0.7, OPP doesn't seem to like
 LongType for row keys; was this deliberately deprecated?

 I was thinking about secondary indexes, but they don't guarantee the order
 in which rows come out of Cassandra.

 Does anyone have a better approach to model events by date given those
 restrictions?

 Thanks,
 Guille







What's the best modeling approach for ordering events by date?

2011-04-14 Thread Guillermo Winkler
I have a huge number of events I need to consume later, ordered by the date
the event occurred.

My first approach to this problem was to use seconds since epoch as row key,
and event ids as column names (empty value), this way:

EventsByDate : {
SecondsSinceEpoch: {
evid:, evid:, evid:
}
}

And use OPP as the partitioner, using GetRangeSlices to retrieve ordered
events sequentially.
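
Sketched with a pycassa-style client (keyspace/CF names and the string key
encoding are placeholders):

# Sketch of the approach above: one row per second, event ids as empty-valued
# column names, and ordered range reads that only make sense under OPP.
import time
import pycassa

pool = pycassa.ConnectionPool('Events', ['localhost:9160'])
events_by_date = pycassa.ColumnFamily(pool, 'EventsByDate')

def record_event(event_id):
    second = str(int(time.time()))
    events_by_date.insert(second, {event_id: ''})  # the id is the column name

def events_between(start_epoch, end_epoch):
    # get_range returns rows in key order only with OrderPreservingPartitioner.
    # Epoch seconds are all 10 digits, so string order matches numeric order.
    for second, columns in events_by_date.get_range(start=str(start_epoch),
                                                    finish=str(end_epoch)):
        for event_id in columns:
            yield second, event_id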

Now I have two problems to solve:

1) The system is realtime, so all the events in a given moment are hitting
the same box
2) Migrating from Cassandra 0.6 to Cassandra 0.7, OPP doesn't seem to like
LongType for row keys; was this deliberately deprecated?

I was thinking about secondary indexes, but they don't guarantee the order in
which rows come out of Cassandra.

Does anyone have a better approach to model events by date given those
restrictions?

Thanks,
Guille




Re: What's the best modeling approach for ordering events by date?

2011-04-14 Thread Ethan Rowe
How do you plan to read the data?  Entire histories, or in relatively
confined slices of time?  Do the events have any attributes by which you
might segregate them, apart from time?

If you can divide time into a fixed series of intervals, you can insert
members of a given interval as columns (or supercolumns) in a row.  But it
depends how you want to use the data on the read side.
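
As a rough sketch of the bucketing idea (pycassa-style; the interval length
and all names are made up for illustration):

# Sketch: one row per fixed time bucket, the bucket's event ids as empty-valued
# column names, and paged reads of a single bucket on the read side.
import pycassa

pool = pycassa.ConnectionPool('Events', ['localhost:9160'])
buckets = pycassa.ColumnFamily(pool, 'EventsByInterval')

INTERVAL = 60  # seconds per bucket; tune to the write rate

def bucket_key(epoch_seconds):
    return str(epoch_seconds - (epoch_seconds % INTERVAL))  # start of the bucket

def add_event(event_id, epoch_seconds):
    buckets.insert(bucket_key(epoch_seconds), {event_id: ''})

def read_bucket_page(epoch_seconds, page_size=40, start_column=''):
    # Columns come back in comparator order; to fetch the next page, pass the
    # last column seen (it is returned again, so skip the first item).
    return buckets.get(bucket_key(epoch_seconds),
                       column_start=start_column,
                       column_count=page_size)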

On Thu, Apr 14, 2011 at 12:25 PM, Guillermo Winkler 
gwink...@inconcertcc.com wrote:

 I have a huge number of events I need to consume later, ordered by the date
 the event occurred.

 My first approach to this problem was to use seconds since epoch as row
 key, and event ids as column names (empty value), this way:

 EventsByDate : {
 SecondsSinceEpoch: {
 evid:, evid:, evid:
 }
 }

 And use OPP as the partitioner, using GetRangeSlices to retrieve ordered
 events sequentially.

 Now I have two problems to solve:

 1) The system is realtime, so all the events in a given moment are hitting
 the same box
 2) Migrating from Cassandra 0.6 to Cassandra 0.7, OPP doesn't seem to like
 LongType for row keys; was this deliberately deprecated?

 I was thinking about secondary indexes, but they don't guarantee the order
 in which rows come out of Cassandra.

 Does anyone have a better approach to model events by date given those
 restrictions?

 Thanks,
 Guille





Re: What's the best modeling approach for ordering events by date?

2011-04-14 Thread Guillermo Winkler
Hi Ethan,

I want to present the events ordered by time, always in pages of 20/40
events. If the events are tweets, you can have 1000 tweets from the same
second or you can have 30 tweets in a 10-minute range. But I always want to be
able to page through the results in an orderly fashion.

I think using seconds since epoch is what I'm doing, that is, dividing
time into a fixed series of intervals. Each second is an interval, and all of
the events for that particular second are columns of that row.

Again, with tweets for easier visualization:

TweetsBySecond : {
 12121121212 : { - seconds since epoch
 id1, id2, id3 - all the tweet ids that occurred in that particular second
},
12121212123 : {
id4,id5
},
12121212124 : {
id6
}
}
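
A rough sketch of that paging (pycassa-style; names and the string key
encoding are placeholders, and it doesn't page within a single very dense
second):

# Sketch: collect one page of tweet ids in time order by walking the
# per-second rows; each row is fetched by exact key, so it works with
# either partitioner.
import pycassa

pool = pycassa.ConnectionPool('Events', ['localhost:9160'])
tweets_by_second = pycassa.ColumnFamily(pool, 'TweetsBySecond')

def page_of_tweets(start_epoch, end_epoch, page_size=40):
    page = []
    for sec in range(start_epoch, end_epoch + 1):
        try:
            columns = tweets_by_second.get(str(sec), column_count=page_size)
        except pycassa.NotFoundException:
            continue  # no tweets in that second
        for tweet_id in columns:
            page.append((sec, tweet_id))
            if len(page) == page_size:
                return page
    return page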

The problem is you can't do that using OPP in Cassandra 0.7, or is it just me
missing something?

Thanks for your answer,
Guille

On Thu, Apr 14, 2011 at 4:49 PM, Ethan Rowe et...@the-rowes.com wrote:

 How do you plan to read the data?  Entire histories, or in relatively
 confined slices of time?  Do the events have any attributes by which you
 might segregate them, apart from time?

 If you can divide time into a fixed series of intervals, you can insert
 members of a given interval as columns (or supercolumns) in a row.  But it
 depends how you want to use the data on the read side.


 On Thu, Apr 14, 2011 at 12:25 PM, Guillermo Winkler 
 gwink...@inconcertcc.com wrote:

 I have a huge number of events I need to consume later, ordered by the
 date the event occurred.

 My first approach to this problem was to use seconds since epoch as row
 key, and event ids as column names (empty value), this way:

 EventsByDate : {
 SecondsSinceEpoch: {
 evid:, evid:, evid:
 }
 }

 And use OPP as the partitioner, using GetRangeSlices to retrieve ordered
 events sequentially.

 Now I have two problems to solve:

 1) The system is realtime, so all the events in a given moment are hitting
 the same box
 2) Migrating from Cassandra 0.6 to Cassandra 0.7, OPP doesn't seem to like
 LongType for row keys; was this deliberately deprecated?

 I was thinking about secondary indexes, but they don't guarantee the order
 in which rows come out of Cassandra.

 Does anyone have a better approach to model events by date given those
 restrictions?

 Thanks,
 Guille