Re: need some help with counters

2011-06-16 Thread Ian Holsman

On Jun 13, 2011, at 5:10 AM, aaron morton wrote:

 I am wondering how to index on the most recent hour as well (i.e. a "show me 
 the top 5 URLs" type query). 
 
 AFAIK that's not a great application for counters. You would need range 
 support in the secondary indexes so you could get the first X rows ordered by 
 a column value. 
 
 To be honest, depending on scale, I'd consider a sorted set in redis for 
 that. 

It does.
Thanks Aaron.

 
 Hope that helps. 
 
 -
 Aaron Morton
 Freelance Cassandra Developer
 @aaronmorton
 http://www.thelastpickle.com
 



Re: need some help with counters

2011-06-12 Thread aaron morton
 I am wondering how to index on the most recent hour as well (i.e. a "show me 
 the top 5 URLs" type query). 

AFAIK that's not a great application for counters. You would need range support 
in the secondary indexes so you could get the first X rows ordered by a column 
value. 

To be honest, depending on scale, I'd consider a sorted set in redis for that. 
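A sorted set keeps each member ordered by its score, so a "top 5 URLs this hour" query becomes a reverse range. Here is a dict-based sketch of the idea in plain Python (illustrative names, no Redis dependency; with a Redis client this would be ZINCRBY plus ZREVRANGE):

```python
import heapq

# Stand-in for one sorted set per hour bucket: url -> score (page views).
hour_scores = {}

def incr(url, amount=1):
    """Like ZINCRBY: bump a member's score, creating it if absent."""
    hour_scores[url] = hour_scores.get(url, 0) + amount

def top_n(n):
    """Highest-scored URLs first, like ZREVRANGE ... WITHSCORES."""
    return heapq.nlargest(n, hour_scores.items(), key=lambda kv: kv[1])

for url, hits in [("/a", 5), ("/b", 12), ("/c", 7), ("/d", 1)]:
    incr(url, hits)

print(top_n(2))  # [('/b', 12), ('/c', 7)]
```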

Hope that helps. 
  
-
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 11 Jun 2011, at 00:36, Ian Holsman wrote:

 
 On Jun 9, 2011, at 10:04 PM, aaron morton wrote:
 
 I may be missing something, but could you use a column for each of the last 
 48 hours, all in the same row for a URL? 
 
 e.g. 
 {
  /url.com/hourly : {
  20110609T01:00:00 : 456,
  20110609T02:00:00 : 4567,
  }
 }
 
 yes.. that would work better... I was storing all the different times in the 
 same row.
 {
   /url.com : {
     H-20110609T01:00:00 : 456,
     H-20110609T02:00:00 : 4567,
     D-20110609 : 5678,
   }
 }
 
 I am wondering how to index on the most recent hour as well (i.e. a "show me 
 the top 5 URLs" type query). 
 
 
 Increment the current hour only. Delete the older columns either when a read 
 detects there are old values or as a maintenance job. Or as part of writing 
 values for the first 5 minutes of any hour. 
 
 yes.. I thought of that. The problem with doing it on read is there may be a 
 case where an old URL never gets read.. so it will just sit there taking up 
 space.. the maintenance job is the route I went down.
 
 
 The row will get spread out over a lot of sstables which may reduce read 
 speed. If this is a problem consider a separate CF with more aggressive GC 
 and compaction settings. 
 
 Thanks!
 
 Cheers
 
 
 -
 Aaron Morton
 Freelance Cassandra Developer
 @aaronmorton
 http://www.thelastpickle.com
 



Re: need some help with counters

2011-06-10 Thread Ian Holsman

On Jun 9, 2011, at 10:04 PM, aaron morton wrote:

 I may be missing something, but could you use a column for each of the last 48 
 hours, all in the same row for a URL? 
 
 e.g. 
 {
   /url.com/hourly : {
   20110609T01:00:00 : 456,
   20110609T02:00:00 : 4567,
   }
 }

yes.. that would work better... I was storing all the different times in the 
same row.
{
  /url.com : {
    H-20110609T01:00:00 : 456,
    H-20110609T02:00:00 : 4567,
    D-20110609 : 5678,
  }
}

I am wondering how to index on the most recent hour as well (i.e. a "show me 
the top 5 URLs" type query). 

 
 Increment the current hour only. Delete the older columns either when a read 
 detects there are old values or as a maintenance job. Or as part of writing 
 values for the first 5 minutes of any hour. 

yes.. I thought of that. The problem with doing it on read is there may be a 
case where an old URL never gets read.. so it will just sit there taking up 
space.. the maintenance job is the route I went down.

 
 The row will get spread out over a lot of sstables which may reduce read 
 speed. If this is a problem consider a separate CF with more aggressive GC 
 and compaction settings. 

Thanks!
 
 Cheers
 
 
 -
 Aaron Morton
 Freelance Cassandra Developer
 @aaronmorton
 http://www.thelastpickle.com
 



Re: need some help with counters

2011-06-09 Thread Ryan King
On Thu, Jun 9, 2011 at 12:41 PM, Ian Holsman had...@holsman.net wrote:
 Hi.

 I had a brief look at CASSANDRA-2103 (expiring counter columns), and I was 
 wondering if anyone can help me with my problem.

 I want to keep some page-view stats on a URL at different levels of 
 granularity (page views per hour, page views per day, page views per year, 
 etc.).


 so my thinking was to create a counter with a key based on 
 Year-Month-Day-Hour, and simply increment the counter as I go along.

 this works well and I'm getting my metrics beautifully put into the right 
 places.

 the only problem is that I only need the last 48 hours' worth of metrics at 
 the hour level.

 how do I get rid of the old counters?
 do I need to write an archiver that will go through each URL (could be 
 millions) and just delete them?

 I'm sure other people have encountered this, and was wondering how they 
 approached it.

Here's how we are going to do it at twitter:
https://issues.apache.org/jira/browse/CASSANDRA-2735

-ryan
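Ian's scheme, one counter per Year-Month-Day-Hour bucket, can be sketched in plain Python. The dict below is only an illustrative stand-in for the counter column family; a real deployment would increment Cassandra counter columns instead:

```python
from collections import defaultdict
from datetime import datetime, timedelta

# In-memory stand-in for the counter column family:
# row key = URL, column name = hour bucket, value = page views.
counters = defaultdict(lambda: defaultdict(int))

def hour_bucket(ts):
    """Bucket a timestamp to the hour, e.g. '20110609T01:00:00'."""
    return ts.strftime("%Y%m%dT%H:00:00")

def record_view(url, ts):
    counters[url][hour_bucket(ts)] += 1

now = datetime(2011, 6, 9, 1, 30)
record_view("/url.com", now)
record_view("/url.com", now)
record_view("/url.com", now + timedelta(hours=1))

print(dict(counters["/url.com"]))
# {'20110609T01:00:00': 2, '20110609T02:00:00': 1}
```

Because the bucket string is zero-padded and fixed-width, the column names sort chronologically, which is what makes range reads and deletes of old hours straightforward.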


Re: need some help with counters

2011-06-09 Thread Colin
Hey guy, have you tried amazon turk?

--
Colin Clark
+1 315 886 3422 cell
+1 701 212 4314 office
http://cloudeventprocessing.com
http://blog.cloudeventprocessing.com
@EventCloudPro

*Sent from Star Trek like flat panel device, which although larger than my Star 
Trek like communicator device, may have typo's and exhibit improper grammar due 
to haste and less than perfect use of the virtual keyboard*
 



Re: need some help with counters

2011-06-09 Thread Yang
something like this:
https://issues.apache.org/jira/browse/CASSANDRA-2103

but this turns out not to be feasible.



Re: need some help with counters

2011-06-09 Thread Ian Holsman
Hi Ryan.
you wouldn't have your version of cassandra up on github would you??

Colin.. always a pleasure.

On Jun 9, 2011, at 3:44 PM, Ryan King wrote:

 
 Here's how we are going to do it at twitter:
 https://issues.apache.org/jira/browse/CASSANDRA-2735
 
 -ryan



Re: need some help with counters

2011-06-09 Thread Ryan King
On Thu, Jun 9, 2011 at 1:06 PM, Ian Holsman had...@holsman.net wrote:
 Hi Ryan.
 you wouldn't have your version of cassandra up on github would you??

No, and the patch isn't in our version yet either. We're still working on it.

-ryan


Re: need some help with counters

2011-06-09 Thread Ian Holsman
So would doing something like storing it in reverse (so I know what to delete) 
work? Or is storing a million columns in a supercolumn impossible? 

I could always use a logfile and run the archiver off that as a worst case I 
guess. 
Would doing so many deletes screw up the db/cause other problems?

---
Ian Holsman - 703 879-3128

I saw the angel in the marble and carved until I set him free -- Michelangelo

On 09/06/2011, at 4:22 PM, Ryan King r...@twitter.com wrote:

 On Thu, Jun 9, 2011 at 1:06 PM, Ian Holsman had...@holsman.net wrote:
 Hi Ryan.
 you wouldn't have your version of cassandra up on github would you??
 
 No, and the patch isn't in our version yet either. We're still working on it.
 
 -ryan


Re: need some help with counters

2011-06-09 Thread aaron morton
I may be missing something, but could you use a column for each of the last 48 
hours, all in the same row for a URL?

e.g. 
{
  /url.com/hourly : {
    20110609T01:00:00 : 456,
    20110609T02:00:00 : 4567,
  }
}

Increment the current hour only. Delete the older columns either when a read 
detects there are old values or as a maintenance job. Or as part of writing 
values for the first 5 minutes of any hour. 
 
The row will get spread out over a lot of sstables which may reduce read speed. 
If this is a problem consider a separate CF with more aggressive GC and 
compaction settings. 
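The maintenance-job variant described above can be sketched as follows, with an in-memory dict standing in for the Cassandra row (the 48-hour window and all names are illustrative; against Cassandra this loop would issue column deletes instead):

```python
from datetime import datetime, timedelta

# One row: column name = hour bucket, value = counter.
row = {
    "20110609T01:00:00": 456,
    "20110609T02:00:00": 4567,
    "20110607T01:00:00": 99,   # older than 48 hours, should be pruned
}

def prune_old_hours(row, now, keep_hours=48):
    """Drop hour-bucket columns older than the retention window."""
    cutoff = (now - timedelta(hours=keep_hours)).strftime("%Y%m%dT%H:00:00")
    # Lexicographic comparison equals chronological order for this format.
    for col in [c for c in row if c < cutoff]:
        del row[col]

prune_old_hours(row, now=datetime(2011, 6, 9, 3, 0))
print(sorted(row))  # ['20110609T01:00:00', '20110609T02:00:00']
```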
 
Cheers


-
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 10 Jun 2011, at 09:28, Ian Holsman wrote:

 So would doing something like storing it in reverse (so I know what to 
 delete) work? Or is storing a million columns in a supercolumn impossible. 
 
 I could always use a logfile and run the archiver off that as a worst case I 
 guess. 
 Would doing so many deletes screw up the db/cause other problems?
 
 ---
 Ian Holsman - 703 879-3128
 
 I saw the angel in the marble and carved until I set him free -- Michelangelo
 