Re: data model for unique users in a time period

2011-11-02 Thread David Jeske
I understand what you're thinking, Daniel, but this approach has at least
one big wrinkle: you would be introducing dependencies between compaction
and replication.

The 'unique' idempotent records are required for Cassandra to read repair
properly. Therefore, if a compaction (or even a memtable flush) occurred,
the system could no longer read repair the counters. Your strategy is
closer to how Bigtable/HBase handles accumulators, but it works there
because that system has a single consistent write log.

Here is a different approach to doing this with cassandra...

- use timestamps as part of the column names, so every increment is unique

- Don't try to use custom compaction. Instead, layer counter summarization
on top as a periodic summarization job.

- make sure the summarizer doesn't try to do its job for a batch of counters
until they are fully replicated and 'static' (no new increments will appear)

- write the 'summary' of a batch of unique timestamps in such a way that
anyone summing the values knows that the existence of a summary means they
should ignore the individual values for that range (because it will take
time for them to be deleted); a sketch of the whole scheme follows below
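
Here is a minimal pure-Python sketch of that layered scheme. The CounterRow
class, its method names, and the (start, end) summary encoding are all
hypothetical; in a real system the increments and summaries would be columns
in a single Cassandra row.

import time
import uuid

class CounterRow:
    def __init__(self):
        self.increments = {}  # (timestamp, unique_id) -> delta
        self.summaries = {}   # (range_start, range_end) -> summed total

    def increment(self, delta=1):
        # A timestamp plus a client-generated UUID keeps every increment
        # column unique, so replays and read repair stay idempotent.
        self.increments[(time.time(), uuid.uuid4().hex)] = delta

    def summarize(self, start, end):
        # Run only once [start, end) is fully replicated and 'static',
        # i.e. no new increments can still appear inside the range.
        total = sum(d for (ts, _), d in self.increments.items()
                    if start <= ts < end)
        self.summaries[(start, end)] = total
        # The individual increment columns can now be deleted lazily.

    def total(self):
        # The existence of a summary tells readers to ignore individual
        # values in its range, since deleting them takes time.
        def covered(ts):
            return any(s <= ts < e for (s, e) in self.summaries)
        return (sum(self.summaries.values()) +
                sum(d for (ts, _), d in self.increments.items()
                    if not covered(ts)))

row = CounterRow()
for _ in range(3):
    row.increment()
row.summarize(0, time.time() + 1)  # pretend the range has gone static
assert row.total() == 3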


Re: data model for unique users in a time period

2011-11-01 Thread Daniel Doubleday
Hm - kind of hijacking this, but since we have a similar problem I might
throw in my idea:

We need consistent, idempotent counters. On the client side we can create 
unique (replayable) keys - like your user ids.

What we want to do is:

- add increment commands as columns, such as [prefixByte.uniqueKey -> +1]
- use a custom compactor that sums up the commands and writes a single column
[prefixByte.sstid -> +6647] (making sure the keys don't clash)
- to read, do a range query with the prefixByte

So you can have multiple counters in one row, but at most one column per
counter per SSTable; the sketch below shows the merge and read arithmetic.
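
A quick in-memory model of the idea, where each dict stands in for one
SSTable's command columns and compact() plays the custom compactor. The
sst_id naming and the "prefix.key" encoding are illustrative only, not real
Cassandra code.

def compact(sstables, sst_id):
    # Columns with identical names reconcile to a single value first;
    # that is what makes a replayed command idempotent.
    reconciled = {}
    for sst in sstables:
        reconciled.update(sst)
    # Then sum every command that shares a counter prefix into one
    # column, named with the new sstable's id so keys can't clash.
    totals = {}
    for column, delta in reconciled.items():
        prefix = column.split(".", 1)[0]
        totals[prefix] = totals.get(prefix, 0) + delta
    return {f"{p}.{sst_id}": t for p, t in totals.items()}

def read_counter(sstables, prefix):
    # The prefix range query: sum every column for this counter
    # across all live sstables.
    return sum(d for sst in sstables for c, d in sst.items()
               if c.startswith(prefix + "."))

# Two flushed sstables; "hits.u1" was replayed and must count once.
ssts = [{"hits.u1": 1, "hits.u2": 1}, {"hits.u1": 1, "hits.u3": 1}]
ssts = [compact(ssts, "sst42")]
assert read_counter(ssts, "hits") == 3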

With leveled compaction this should work pretty nicely.

If you need fast access and want to use the row cache, you will need to do
some further patching, though.

This is at the early brainstorming stage, so any comments would be welcome.

Cheers,

Daniel Doubleday
smeet.com


Re: data model for unique users in a time period

2011-10-31 Thread Ed Anuff
Thanks, good point, splitting wide rows via sharding is a good
optimization for the get_count approach.


Re: data model for unique users in a time period

2011-10-31 Thread Zach Richardson
Ed,

I could be completely wrong about this working--I haven't specifically
looked at how the counts are executed, but I think this makes sense.

You could potentially shard across several rows, based on a hash of
the username combined with the time period as the row key.  Run a
count across each row and then add them up.  If your cluster is large
enough this could spread the computation enough to make each query for
the count a bit faster.
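
Something like the following, as a rough pure-Python sketch: NUM_SHARDS,
the key format, and the dict of sets are stand-ins for real rows and
per-row get_count calls.

import hashlib

NUM_SHARDS = 16  # assumption; tune to cluster size

def shard_row_key(period, user_id):
    # Hash the user id so the same user always lands in the same shard.
    h = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
    return f"{period}:{h % NUM_SHARDS}"

def record_visit(rows, period, user_id):
    # Per-shard sets still deduplicate repeat visits by the same user.
    rows.setdefault(shard_row_key(period, user_id), set()).add(user_id)

def unique_users(rows, period):
    # One count per shard row (a get_count each, in the real system),
    # then add them up; the per-row counts can run in parallel.
    return sum(len(rows.get(f"{period}:{s}", set()))
               for s in range(NUM_SHARDS))

rows = {}
for user in ["alice", "bob", "alice"]:
    record_visit(rows, "2011-10", user)
assert unique_users(rows, "2011-10") == 2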

Depending on how often this query would be hit, I would still
recommend caching, but you could recompute the real count a little more
often.

Zach


data model for unique users in a time period

2011-10-31 Thread Ed Anuff
I'm looking at the scenario of how to keep track of the number of
unique visitors within a given time period.  Inserting user ids into a
wide row would allow me to have a list of every user within the time
period that the row represented.  My experience in the past was that
using get_count on a row to get the column count got slow pretty quickly,
but that might still be the easiest way to get the count of unique
users, with some sort of caching of the count so that it isn't
expensive on subsequent reads.  Using Hadoop is overkill for this scenario.
Any other approaches?
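
For reference, a toy version of the wide-row model being described, with
hypothetical names; the column count stands in for what get_count returns.

from collections import defaultdict

rows = defaultdict(set)  # period row key -> column names (user ids)

def record_visit(period, user_id):
    # Re-inserting the same user id just overwrites the column,
    # so repeat visits are deduplicated for free.
    rows[period].add(user_id)

def unique_users(period):
    # Equivalent to get_count on the row; this is the call that gets
    # slow as the row grows, motivating caching or sharding.
    return len(rows[period])

record_visit("2011-10-31", "alice")
record_visit("2011-10-31", "alice")
record_visit("2011-10-31", "bob")
assert unique_users("2011-10-31") == 2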

Ed