Re: question about deleting from cassandra

2010-03-13 Thread Jonathan Ellis
Since they are separate changes, it's much easier to review them if they
are submitted separately.

On 3/13/10, Weijun Li  wrote:
> Sure. I'm making another change for replication across multiple DCs. Once this
> one is done (probably next week) I'll submit them together to Jira. Both are
> based on 0.6 beta2.
>
> -Weijun


RE: question about deleting from cassandra

2010-03-13 Thread Weijun Li
Sure. I'm making another change for replication across multiple DCs. Once this
one is done (probably next week) I'll submit them together to Jira. Both are
based on 0.6 beta2.

-Weijun

-Original Message-
From: Jonathan Ellis [mailto:jbel...@gmail.com] 
Sent: Saturday, March 13, 2010 5:36 AM
To: cassandra-user@incubator.apache.org
Subject: Re: question about deleting from cassandra

You should submit your minor change to jira for others who might want to try
it.




Re: question about deleting from cassandra

2010-03-13 Thread Jonathan Ellis
You should submit your minor change to jira for others who might want to try it.

On Sat, Mar 13, 2010 at 3:18 AM, Weijun Li  wrote:
> I tried Sylvain's feature in 0.6 beta2 (it needed a minor change) and it worked
> perfectly. Without this feature, as long as you have a high volume of new and
> expired columns, your life will be miserable :-)
>
> Thanks for the great job, Sylvain!!
>
> -Weijun


Re: question about deleting from cassandra

2010-03-13 Thread Weijun Li
I tried Sylvain's feature in 0.6 beta2 (it needed a minor change) and it worked
perfectly. Without this feature, as long as you have a high volume of new and
expired columns, your life will be miserable :-)

Thanks for the great job, Sylvain!!

-Weijun

On Fri, Mar 12, 2010 at 12:27 AM, Sylvain Lebresne wrote:

> I guess you can also vote for this ticket:
> https://issues.apache.org/jira/browse/CASSANDRA-699 :)
>
> 
>
> --
> Sylvain


Re: question about deleting from cassandra

2010-03-12 Thread Sylvain Lebresne
I guess you can also vote for this ticket:
https://issues.apache.org/jira/browse/CASSANDRA-699 :)



--
Sylvain




Re: question about deleting from cassandra

2010-03-11 Thread Mark Robson
On 12 March 2010 03:34, Bill Au  wrote:

> Let's take Twitter as an example.  All the tweets are timestamped.  I want to
> keep only a month's worth of tweets for each user.  The number of tweets
> that fit within this one month window varies from user to user.  What is the
> best way to accomplish this?


This is the "expiry" problem that has been discussed on this list before. As
far as I can see, there is no easy way to do it with 0.5.

If you use the ordered partitioner and make the first part of the keys a
timestamp (or part of it) then you can get the keys and delete them.

However, these deletes will be quite inefficient: currently each row must be
deleted individually (there was a patch for range deletes kicking around; I
don't know if it has been accepted yet).

But even if range delete is implemented, it's still quite inefficient and
not really what you want, and it doesn't work with the RandomPartitioner.

If you have some metadata to say who tweeted within a given period (say 10
days or 30 days) and you store the tweets all in the same key per user per
period (say with one column per tweet, or use supercolumns), then you can
just delete one key per user per period.
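
A minimal sketch of the one-key-per-user-per-period idea, in Python. The
"user:period" key layout, the 30-day bucket width, and all function names
below are assumptions made for illustration; they are not part of any
Cassandra API.

```python
from datetime import datetime, timezone

SECONDS_PER_DAY = 86400
PERIOD_DAYS = 30  # assumed bucket width; the message suggests 10 or 30 days


def period_number(ts: datetime, period_days: int = PERIOD_DAYS) -> int:
    """Index of the fixed-width period containing ts, counted from the Unix epoch.

    ts should be timezone-aware so the result does not depend on the local zone.
    """
    day_number = int(ts.timestamp()) // SECONDS_PER_DAY
    return day_number // period_days


def bucket_key(user_id: str, period: int) -> str:
    """Row key holding all of one user's tweets for one period (hypothetical layout)."""
    return f"{user_id}:{period:06d}"


def key_for_tweet(user_id: str, ts: datetime) -> str:
    """Row key that a tweet written at ts would be stored under."""
    return bucket_key(user_id, period_number(ts))


def expired_keys(user_id: str, now: datetime, keep_previous: int = 1, lookback: int = 3):
    """Row keys to delete: the `lookback` periods older than the retained window.

    The current period and `keep_previous` earlier periods are retained.
    """
    oldest_kept = period_number(now) - keep_previous
    return [bucket_key(user_id, p) for p in range(oldest_kept - lookback, oldest_kept)]
```

With 30-day buckets, retaining the current and the previous period keeps
between 30 and 60 days of tweets, so expiry is only as precise as the bucket
width. Each expired bucket is then a single row delete per user.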

One of the problems with using a time-based key with the ordered partitioner is
that you're always going to have a data imbalance, so you may want to try
hashing *part* of the key (the first part) so you can still range scan the
next part. This may fix load balancing while still enabling you to use range
scans to do data expiry.

e.g. your key is

Hash of day number + user id + timestamp

Then you can range scan the entire day's tweets to expire them, and range
scan a given user's tweets for a given day efficiently (and doing this for
30 days is just 30 range scans).

Putting a hash in there fixes load balancing with OPP.
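
The key layout above could be sketched as follows, in Python. The choice of
MD5, the 8-character prefix, the ":" separator, and the half-open range
bounds are all assumptions for illustration; the message does not name a hash
function, and the actual range-scan call depends on the client and version.
The sketch assumes keys are compared as plain strings and that user ids
contain no ":".

```python
import hashlib


def day_prefix(day_number: int) -> str:
    """Hash of the day number, so consecutive days land in unrelated key ranges."""
    digest = hashlib.md5(str(day_number).encode("ascii")).hexdigest()
    return digest[:8]  # assumed prefix length


def tweet_key(day_number: int, user_id: str, timestamp: int) -> str:
    """Key = hash(day number) + user id + timestamp, zero-padded so it sorts by time."""
    return f"{day_prefix(day_number)}:{user_id}:{timestamp:010d}"


def day_range(day_number: int):
    """(start, end) bounds covering every key of one day; end is exclusive."""
    prefix = day_prefix(day_number)
    return prefix + ":", prefix + ";"  # ';' is the character right after ':'


def user_day_range(day_number: int, user_id: str):
    """(start, end) bounds covering one user's keys for one day; end is exclusive."""
    base = f"{day_prefix(day_number)}:{user_id}"
    return base + ":", base + ";"


def scan(sorted_keys, start: str, end: str):
    """Stand-in for a range scan over an ordered key space."""
    return [k for k in sorted_keys if start <= k < end]
```

Expiring a day is then one scan over day_range(day) followed by deletes, and
reading a user's last 30 days is 30 calls with user_day_range.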

Mark


Re: question about deleting from cassandra

2010-03-11 Thread Peter Chang
I've been thinking more about a similar sort of problem.

The major difference between normal relational databases and big hashtables
is that in the former you can sort and retrieve on any column. In big
hashtables (or at least in Cassandra), you only have 1 field to sort on
and the sort type is predetermined.

From a theoretical perspective, your traditional DBMS typically allows you
to create arbitrary indexes in order to speed up access. I'm thinking the
same can be thought of for something like this.

Ergo, I imagine that for different kinds of entities, you can have a
separate supercolumn family that basically serves as an index table. From
what I've heard, this is somewhat indicated.

In a broader perspective, you can also use tables that serve as metadata.
Ergo, you could store keys of all posts bucketed by some time period (e.g.
month).
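
As a rough illustration of such a metadata table, here is a small in-memory
model in Python. The class and method names are hypothetical, and the dict
stands in for an index column family with one row per month and one column
per post key; it is a sketch of the bookkeeping, not of a Cassandra client.

```python
from collections import defaultdict
from datetime import datetime, timezone


class MonthIndex:
    """In-memory stand-in for an index table: month bucket -> set of post keys."""

    def __init__(self):
        self._rows = defaultdict(set)

    @staticmethod
    def bucket(ts: datetime) -> str:
        """Row key of the index: the month of the post, e.g. "2010-03"."""
        return ts.strftime("%Y-%m")

    def add(self, post_key: str, ts: datetime) -> None:
        """Record a post key under the month it was written in."""
        self._rows[self.bucket(ts)].add(post_key)

    def pop_expired(self, oldest_kept: str) -> dict:
        """Remove and return every bucket older than oldest_kept.

        The returned post keys are the rows that would then be deleted.
        """
        expired = sorted(b for b in self._rows if b < oldest_kept)
        return {b: sorted(self._rows.pop(b)) for b in expired}
```

The expiry job then reads a handful of index rows instead of looping over
every user.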

Peter


On Thu, Mar 11, 2010 at 7:34 PM, Bill Au  wrote:

> Let's take Twitter as an example.  All the tweets are timestamped.  I want to
> keep only a month's worth of tweets for each user.  The number of tweets
> that fit within this one month window varies from user to user.  What is the
> best way to accomplish this?  There are millions of users.  Do I need to
> loop through all of them and handle the delete one user at a time?  Or is
> there a better way to do this?  If a user has not posted a new tweet in more
> than a month, I also want to remove the user itself.  Do I also need to
> loop through all the users one at a time?
>
> Bill
>