Re: need some help with counters

2011-06-16 Thread Ian Holsman

On Jun 13, 2011, at 5:10 AM, aaron morton wrote:

 I am wondering how to index on the most recent hour as well (i.e. a "show me 
 the top 5 URLs" type query). 
 
 AFAIK that's not a great application for counters. You would need range 
 support in the secondary indexes so you could get the first X rows ordered by 
 a column value. 
 
 To be honest, depending on scale, I'd consider a sorted set in redis for 
 that. 

It does.
Thanks Aaron.

 
 Hope that helps. 
 
 -
 Aaron Morton
 Freelance Cassandra Developer
 @aaronmorton
 http://www.thelastpickle.com
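For reference, Aaron's sorted-set idea maps onto Redis's ZINCRBY and ZREVRANGE ... WITHSCORES commands. Here is a minimal pure-Python stand-in for the pattern (no Redis required; the class name and bucket-key format are illustrative):

```python
from collections import defaultdict

# Model of the Redis sorted-set pattern: one sorted set per hour bucket,
# ZINCRBY on each page view, ZREVRANGE ... WITHSCORES for the top N.
# With a redis-py 3.x client this would be r.zincrby(bucket, 1, url) and
# r.zrevrange(bucket, 0, n - 1, withscores=True).
class HourlyTopUrls:
    def __init__(self):
        # bucket key (e.g. "views:2011060901") -> {url: view count}
        self.buckets = defaultdict(lambda: defaultdict(int))

    def record_view(self, bucket, url, n=1):
        self.buckets[bucket][url] += n          # ZINCRBY bucket n url

    def top(self, bucket, n=5):
        scores = self.buckets[bucket]           # ZREVRANGE bucket 0 n-1 WITHSCORES
        return sorted(scores.items(), key=lambda kv: -kv[1])[:n]
```

The per-hour bucket key keeps each "top 5" query a single sorted-set read, and old buckets can simply be expired.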
 



Re: need some help with counters

2011-06-10 Thread Ian Holsman

On Jun 9, 2011, at 10:04 PM, aaron morton wrote:

 I may be missing something, but could you use a column for each of the last 
 48 hours, all in the same row for a URL?
 
 e.g. 
 {
   /url.com/hourly : {
   20110609T01:00:00 : 456,
   20110609T02:00:00 : 4567,
   }
 }

yes.. that would work better... I was storing all the different times in the 
same row.
{
/url.com : {
 H-20110609T01:00:00 : 456,
 H-20110609T02:00:00 : 4567,
 D-20110609 : 5678,
}
}
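The prefixed column names in the row above can all be derived from one timestamp. A small sketch of that naming scheme (the function name is mine; the H-/D- format is taken from the row layout):

```python
from datetime import datetime

def bucket_columns(ts):
    """Return the hourly and daily counter column names for one page view,
    using the H-/D- prefix scheme from the row layout above."""
    hour = ts.strftime("H-%Y%m%dT%H:00:00")   # e.g. H-20110609T01:00:00
    day = ts.strftime("D-%Y%m%d")             # e.g. D-20110609
    return hour, day
```

Each page view then becomes one counter increment per returned column name.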

I am wondering how to index on the most recent hour as well (i.e. a "show me 
the top 5 URLs" type query). 

 
 Increment the current hour only. Delete the older columns either when a read 
 detects there are old values or as a maintenance job. Or as part of writing 
 values for the first 5 minutes of any hour. 

yes.. I thought of that. The problem with doing it on read is that an old URL 
may never get read, so it will just sit there taking up space. The maintenance 
job is the route I went down.
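A maintenance job along these lines just scans each row's hourly columns and collects the buckets older than the retention window. A hedged sketch (pure Python; the H- column format follows the row layout earlier in the thread, and the actual delete would go through your Cassandra client):

```python
from datetime import datetime, timedelta

def expired_hour_columns(columns, now, retention_hours=48):
    """Given the column names of one URL row, return the hourly counter
    columns ('H-YYYYmmddTHH:00:00') whose bucket is older than the
    retention window. Daily/yearly columns are left alone."""
    cutoff = now - timedelta(hours=retention_hours)
    stale = []
    for name in columns:
        if not name.startswith("H-"):
            continue
        bucket = datetime.strptime(name[2:], "%Y%m%dT%H:%M:%S")
        if bucket < cutoff:
            stale.append(name)
    return stale
```

The job would then issue one counter-column delete per returned name and move on to the next row.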

 
 The row will get spread out over a lot of sstables which may reduce read 
 speed. If this is a problem consider a separate CF with more aggressive GC 
 and compaction settings. 

Thanks!
 
 Cheers
 
 
 -
 Aaron Morton
 Freelance Cassandra Developer
 @aaronmorton
 http://www.thelastpickle.com
 
 On 10 Jun 2011, at 09:28, Ian Holsman wrote:
 
 So would doing something like storing it in reverse (so I know what to 
 delete) work? Or is storing a million columns in a supercolumn impossible? 
 
 I could always use a logfile and run the archiver off that as a worst case I 
 guess. 
 Would doing so many deletes screw up the db/cause other problems?
 
 ---
 Ian Holsman - 703 879-3128
 
 I saw the angel in the marble and carved until I set him free -- Michelangelo
 
 On 09/06/2011, at 4:22 PM, Ryan King r...@twitter.com wrote:
 
 On Thu, Jun 9, 2011 at 1:06 PM, Ian Holsman had...@holsman.net wrote:
 Hi Ryan.
 you wouldn't have your version of cassandra up on github would you??
 
 No, and the patch isn't in our version yet either. We're still working on 
 it.
 
 -ryan
 



Re: Where is the Overview Documentation on Counters?

2011-06-10 Thread Ian Holsman
Hi AJ.

Counters are really cool for certain things.. 

The main benefit (from a high-level perspective) is that you don't have to read 
the record in to find the old value (or take a lock on the record to prevent it 
from changing underneath you).

What I use them for is incrementing page-views. I just read a log line in and 
can update the rows without having to read them beforehand, which is nice 
performance-wise.
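As a toy illustration of that benefit (this models the idea only, not Cassandra's internals): a counter is a pile of commutative deltas, so an increment needs neither a prior read nor a lock, while a regular column forces a read-modify-write cycle:

```python
class RegularColumn:
    """Read-modify-write: the caller must read (and effectively lock)
    the current value before it can store the new one."""
    def __init__(self, value=0):
        self.value = value

    def set_views(self, new_value):
        self.value = new_value

class CounterColumn:
    """Commutative increments: deltas can be applied (and merged across
    replicas) in any order, with no read and no lock."""
    def __init__(self):
        self.deltas = []

    def add(self, n=1):
        self.deltas.append(n)   # no read of the old value needed

    @property
    def value(self):
        return sum(self.deltas)
```

Because addition commutes, two clients incrementing concurrently can never clobber each other's updates the way two set_views calls can.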

On Jun 10, 2011, at 2:18 PM, AJ wrote:

 I can't find anything that gives an overview of their purpose/benefits/etc., 
 only how to code them. I can only guess that they are more efficient for some 
 reason, but I don't know exactly why or under exactly what conditions I would 
 choose to use them over a regular column.
 
 Thanks!



need some help with counters

2011-06-09 Thread Ian Holsman
Hi.

I had a brief look at CASSANDRA-2103 (expiring counter columns), and I was 
wondering if anyone can help me with my problem.

I want to keep some page-view stats on a URL at different levels of granularity 
(page views per hour, page views per day, page views per year etc etc).


so my thinking was to create a counter with a key based on 
Year-Month-Day-Hour, and simply increment the counter as I go along. 

This works well, and I'm getting my metrics beautifully put into the right 
places.
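The scheme above can be sketched as one increment per granularity per log line (illustrative only; the in-memory Counter stands in for the Cassandra counter column family, and the key formats are mine):

```python
from collections import Counter
from datetime import datetime

# Stand-in for the Cassandra counter CF; a real client would issue one
# counter increment per (url, bucket) pair instead of updating a dict.
counters = Counter()

def record_page_view(url, ts):
    """Bump one counter per granularity (year, month, day, hour)."""
    for fmt in ("%Y", "%Y-%m", "%Y-%m-%d", "%Y-%m-%d-%H"):
        counters[(url, ts.strftime(fmt))] += 1
```

Reading back "views per day" is then a single lookup on the day-granularity key, with no aggregation at query time.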

the only problem I have is that I only need the last 48 hours' worth of 
metrics at the hour level.

how do I get rid of the old counters? 
do I need to write an archiver that will go through each URL (could be 
millions) and just delete them?

I'm sure other people have encountered this, and I was wondering how they 
approached it.

TIA
Ian

Re: need some help with counters

2011-06-09 Thread Ian Holsman
Hi Ryan.
you wouldn't have your version of cassandra up on github would you??

Colin.. always a pleasure.

On Jun 9, 2011, at 3:44 PM, Ryan King wrote:

 On Thu, Jun 9, 2011 at 12:41 PM, Ian Holsman had...@holsman.net wrote:
 Here's how we are going to do it at twitter:
 https://issues.apache.org/jira/browse/CASSANDRA-2735
 
 -ryan



Re: need some help with counters

2011-06-09 Thread Ian Holsman
So would doing something like storing it in reverse (so I know what to delete) 
work? Or is storing a million columns in a supercolumn impossible? 

I could always use a logfile and run the archiver off that as a worst case I 
guess. 
Would doing so many deletes screw up the DB or cause other problems?

---
Ian Holsman - 703 879-3128

I saw the angel in the marble and carved until I set him free -- Michelangelo

On 09/06/2011, at 4:22 PM, Ryan King r...@twitter.com wrote:

 On Thu, Jun 9, 2011 at 1:06 PM, Ian Holsman had...@holsman.net wrote:
 Hi Ryan.
 you wouldn't have your version of cassandra up on github would you??
 
 No, and the patch isn't in our version yet either. We're still working on it.
 
 -ryan


[OT] Real Time Open source solutions for aggregation and stream processing

2010-06-15 Thread Ian Holsman

Firstly, my apologies for the off-topic message, but I thought most people on 
this list would be knowledgeable about, and interested in, this kind of thing.


We are looking to find an open-source, scalable solution for real-time 
aggregation and stream processing (similar to what the 'hop' project 
http://code.google.com/p/hop/ set out to do) for large(ish) click-stream 
logs.


My first thought was something like Esper, but in our testing it kind of hits 
a wall at around 10,000 rules per JVM.


I was wondering if any of you have experience in this area, and what your 
favorite toolsets are.


Currently we are using Cassandra and Redis with home-grown software to do the 
aggregation, but I'd love to use a common package if there is one.
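The kind of rolling aggregation being discussed can be sketched as a sliding-window counter (pure Python, all names illustrative; an engine like Esper would express this declaratively instead):

```python
from collections import deque, Counter

class SlidingWindowCounts:
    """Count events per key over the last `window` seconds of a stream --
    the basic aggregation primitive behind the click-stream use case."""

    def __init__(self, window=60):
        self.window = window
        self.events = deque()        # (timestamp, key), oldest first
        self.counts = Counter()

    def add(self, ts, key):
        self.events.append((ts, key))
        self.counts[key] += 1
        self._evict(ts)

    def _evict(self, now):
        # Drop events that have fallen out of the window, keeping counts exact.
        while self.events and self.events[0][0] <= now - self.window:
            _, old_key = self.events.popleft()
            self.counts[old_key] -= 1
            if self.counts[old_key] == 0:
                del self.counts[old_key]
```

The interesting scaling questions (partitioning the stream, tens of thousands of concurrent rules) are exactly what the dedicated engines are for; this just shows the core primitive.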


and again.. apologies for the off-topic message and the x-posting.

regards
Ian