Re: question about deleting from cassandra
Since they are separate changes, it's much easier to review if they are submitted separately.

On 3/13/10, Weijun Li wrote:
> Sure. I'm making another change for cross multiple DC replication, once this
> one is done (probably in next week) I'll submit them together to Jira. All
> based on 0.6 beta2.
>
> -Weijun
RE: question about deleting from cassandra
Sure. I'm making another change for replication across multiple DCs; once that one is done (probably next week) I'll submit them together to Jira. All based on 0.6 beta2.

-Weijun

-----Original Message-----
From: Jonathan Ellis [mailto:jbel...@gmail.com]
Sent: Saturday, March 13, 2010 5:36 AM
To: cassandra-user@incubator.apache.org
Subject: Re: question about deleting from cassandra

You should submit your minor change to jira for others who might want to try it.
Re: question about deleting from cassandra
You should submit your minor change to jira for others who might want to try it.

On Sat, Mar 13, 2010 at 3:18 AM, Weijun Li wrote:
> Tried Sylvain's feature in 0.6 beta2 (need minor change) and it worked
> perfectly. Without this feature, as far as you have high volume new and
> expired columns your life will be miserable :-)
>
> Thanks for great job Sylvain!!
>
> -Weijun
Re: question about deleting from cassandra
Tried Sylvain's feature in 0.6 beta2 (it needs a minor change) and it worked perfectly. Without this feature, as long as you have a high volume of new and expired columns, your life will be miserable :-)

Thanks for the great job, Sylvain!!

-Weijun

On Fri, Mar 12, 2010 at 12:27 AM, Sylvain Lebresne wrote:
> I guess you can also vote for this ticket :
> https://issues.apache.org/jira/browse/CASSANDRA-699 :)
>
> --
> Sylvain
Re: question about deleting from cassandra
I guess you can also vote for this ticket :
https://issues.apache.org/jira/browse/CASSANDRA-699 :)

--
Sylvain

On Fri, Mar 12, 2010 at 8:28 AM, Mark Robson wrote:
> This is the "expiry" problem that has been discussed on this list before. As
> far as I can see there are no easy ways to do it with 0.5
Re: question about deleting from cassandra
On 12 March 2010 03:34, Bill Au wrote:
> Let's take Twitter as an example. All the tweets are timestamped. I want to
> keep only a month's worth of tweets for each user. The number of tweets
> that fit within this one month window varies from user to user. What is the
> best way to accomplish this?

This is the "expiry" problem that has been discussed on this list before. As far as I can see, there are no easy ways to do it with 0.5.

If you use the ordered partitioner and make the first part of the keys a timestamp (or part of it), then you can get the keys and delete them.

However, these deletes will be quite inefficient: currently each row must be deleted individually (there was a patch to range delete kicking around; I don't know if it's accepted yet).

But even if range delete is implemented, it's still quite inefficient and not really what you want, and it doesn't work with the RandomPartitioner.

If you have some metadata to say who tweeted within a given period (say 10 days or 30 days) and you store the tweets all in the same key per user per period (say with one column per tweet, or use supercolumns), then you can just delete one key per user per period.

One of the problems with using a time-based key with the ordered partitioner is that you're always going to have a data imbalance, so you may want to try hashing *part* of the key (the first part) so you can still range scan the next part. This may fix load balancing while still enabling you to use range scans to do data expiry.

e.g. your key is

Hash of day number + user id + timestamp

Then you can range scan the entire day's tweets to expire them, and range scan a given user's tweets for a given day efficiently (and doing this for 30 days is just 30 range scans).

Putting a hash in there fixes load balancing with OPP.

Mark
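
The two key layouts described in this message can be sketched in plain Python, with no Cassandra client involved. This is only an illustration under assumptions: the 8-character MD5 prefix, the ":" separator, the zero-padded widths and every function name are made up here, and it assumes the partitioner orders keys the same way Python compares these ASCII strings. The day number is repeated after the hash so that two days whose short hashes happen to collide still sort apart.

```python
import hashlib
from typing import List, Tuple

SECONDS_PER_DAY = 86400
SEP = ":"        # assumed separator; user ids must not contain it
END = "\uffff"   # sorts after every character used in the keys


def day_number(ts: float) -> int:
    """Whole days since the Unix epoch (UTC) for a timestamp in seconds."""
    return int(ts // SECONDS_PER_DAY)


def day_prefix(day: int) -> str:
    """Short hash of the day number, then the day itself, zero-padded."""
    digest = hashlib.md5(str(day).encode("ascii")).hexdigest()[:8]
    return f"{digest}{SEP}{day:06d}{SEP}"


def tweet_key(user_id: str, ts: float) -> str:
    """Row key laid out as hash(day) + day + user id + timestamp."""
    return f"{day_prefix(day_number(ts))}{user_id}{SEP}{int(ts):012d}"


def day_range(day: int) -> Tuple[str, str]:
    """Half-open [start, end) key range covering every tweet of one day."""
    start = day_prefix(day)
    return start, start + END


def user_day_range(user_id: str, day: int) -> Tuple[str, str]:
    """Half-open key range covering one user's tweets on one day."""
    start = f"{day_prefix(day)}{user_id}{SEP}"
    return start, start + END


def period_row_key(user_id: str, ts: float, period_days: int = 30) -> str:
    """Alternative layout: one row per user per period, one column per tweet."""
    return f"{user_id}{SEP}{day_number(ts) // period_days:06d}"


def expired_day_ranges(now_ts: float, keep_days: int = 30,
                       sweep_days: int = 1) -> List[Tuple[str, str]]:
    """Key ranges for the days that just fell out of the retention window."""
    oldest_kept = day_number(now_ts) - keep_days + 1
    return [day_range(d) for d in range(oldest_kept - sweep_days, oldest_kept)]
```

A periodic job would pass each range from `expired_day_ranges` to whatever range-scan call the client offers and delete the rows that come back. With the per-period layout, expiry is instead a single row delete per user per period.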
Re: question about deleting from cassandra
I've been thinking more about a similar sort of problem.

The major difference between normal relational databases and big hashtables is that in the former you can sort and retrieve on any column. In big hashtables (or at least in Cassandra), you only have one field to sort on, and the sort type is predetermined.

From a theoretical perspective, your traditional DBMS typically allows you to create arbitrary indexes in order to speed up access. I'm thinking the same approach can be applied to something like this. Ergo, I imagine that for different kinds of entities, you can have a separate supercolumn family that basically serves as an index table. From what I've heard, this is somewhat indicated.

In a broader perspective, you can also use tables that serve as metadata. Ergo, you could store keys of all posts bucketed by some time period (e.g. month).

Peter

On Thu, Mar 11, 2010 at 7:34 PM, Bill Au wrote:
> Let's take Twitter as an example. All the tweets are timestamped. I want to
> keep only a month's worth of tweets for each user. The number of tweets
> that fit within this one month window varies from user to user. What is the
> best way to accomplish this? There are millions of users. Do I need to
> loop through all of them and handle the delete one user at a time? Or is
> there a better way to do this? If a user has not posted a new tweet in more
> than a month, I also want to remove the user itself. Do I also need to
> loop through all the users one at a time?
>
> Bill
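
Peter's idea of a metadata table holding the keys of all posts bucketed by month might look roughly like the sketch below. Plain dictionaries stand in for the two column families, so this is a model of the bookkeeping and not client code; the names `tweets`, `tweets_by_month`, `insert_tweet` and `expire_before` are hypothetical, as are the "YYYY-MM" bucket format and the `user:timestamp` row key.

```python
from collections import defaultdict
from datetime import datetime, timezone

# In-memory stand-ins for two column families (hypothetical names).
tweets = {}                           # row key -> tweet body
tweets_by_month = defaultdict(dict)   # "YYYY-MM" -> {tweet row key: ""}


def month_bucket(ts: float) -> str:
    """Bucket name for a Unix timestamp, e.g. '2010-03' (UTC)."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m")


def insert_tweet(user_id: str, ts: float, body: str) -> str:
    """Store the tweet and record its key in the month's index row."""
    key = f"{user_id}:{int(ts)}"
    tweets[key] = body
    tweets_by_month[month_bucket(ts)][key] = ""
    return key


def expire_before(cutoff_bucket: str) -> int:
    """Delete every tweet indexed under a bucket older than cutoff_bucket."""
    removed = 0
    # Collect the bucket names first so the dict is not mutated mid-iteration.
    for bucket in sorted(b for b in tweets_by_month if b < cutoff_bucket):
        for key in tweets_by_month.pop(bucket):
            if tweets.pop(key, None) is not None:
                removed += 1
    return removed
```

Expiry then reads only the index rows for the old buckets and never has to loop over every user. The cost is one extra write per tweet to keep the index row up to date.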