Thanks Janne, Alain and Eric.

Now, say I go with counters (hourly, daily, monthly) and also store the UUIDs
as below:

user id : yyyy/mm/dd as the row key, with one dynamic column per click
(column key = timestamp, empty value). Periodically count the columns
and rows and correct the counters. In this case, there will be one row
per user per day, with as many columns as that user's clicks.
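
To make the "periodically count and correct" step concrete, here is a minimal sketch in Python. It uses plain dicts as in-memory stand-ins for the raw click rows and the counters, so it does not talk to Cassandra at all; the function name and the data shapes are assumptions for illustration only. The correction is expressed as a delta because Cassandra counters can only be incremented or decremented, not set.

```python
from typing import Dict, List


def true_up_counters(raw_rows: Dict[str, List[int]],
                     counters: Dict[str, int]) -> Dict[str, int]:
    """Compare each counter with the number of raw click columns.

    raw_rows maps a row key to its click timestamps (one per column).
    counters maps the same row key to the possibly drifted counter value.
    Returns the delta applied per row key (only for rows that drifted).
    """
    deltas: Dict[str, int] = {}
    for row_key, click_columns in raw_rows.items():
        true_count = len(click_columns)        # slow but accurate count
        stored = counters.get(row_key, 0)      # fast but possibly drifted
        delta = true_count - stored
        if delta != 0:
            # Stand-in for a counter increment/decrement by delta.
            counters[row_key] = stored + delta
            deltas[row_key] = delta
    return deltas
```

A background job would run something like this per bucket once the bucket is no longer receiving writes, so that the raw count is stable when it is compared.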

The other way is to store one row per hour:
user id : yyyy/mm/dd/hh as the row key, with one dynamic column per click
(column key = timestamp, empty value).
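
To compare the two layouts, here is a small Python sketch that builds both row keys and groups a list of click timestamps into rows. The helper names and the exact key format (no spaces around the colon) are assumptions for illustration; nothing here touches Cassandra.

```python
from collections import defaultdict
from datetime import datetime
from typing import Callable, Dict, Iterable, List


def daily_row_key(user_id: str, ts: datetime) -> str:
    # One row per user per day: "<user id>:yyyy/mm/dd"
    return user_id + ":" + ts.strftime("%Y/%m/%d")


def hourly_row_key(user_id: str, ts: datetime) -> str:
    # One row per user per hour: "<user id>:yyyy/mm/dd/hh"
    return user_id + ":" + ts.strftime("%Y/%m/%d/%H")


def bucket_clicks(user_id: str,
                  clicks: Iterable[datetime],
                  key_fn: Callable[[str, datetime], str]
                  ) -> Dict[str, List[datetime]]:
    """Group clicks into rows; each click becomes one column in its row."""
    rows: Dict[str, List[datetime]] = defaultdict(list)
    for ts in clicks:
        rows[key_fn(user_id, ts)].append(ts)
    return dict(rows)
```

With the same clicks, the daily key yields one wide row per day, while the hourly key yields up to 24 narrower rows per day. The total number of columns is the same in both cases.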

Is there any difference (in performance, or any known issues) between more
rows vs. more columns, given that Cassandra deletes them through tombstones
(which are kept for gc_grace_seconds, 10 days by default)?

Thanks
Ajay

On Mon, Dec 29, 2014 at 7:47 PM, Eric Stevens <migh...@gmail.com> wrote:

> > If the counters get incorrect, it couldn't be corrected
>
> You'd have to store something that allowed you to correct it.  For
> example, the TimeUUID approach to keep true counts, which are slow to read
> but accurate, and a background process that trues up your counter columns
> periodically.
>
> On Mon, Dec 29, 2014 at 7:05 AM, Ajay <ajay.ga...@gmail.com> wrote:
>
>> Thanks for the clarification.
>>
>> In my case, Cassandra is the only storage. If the counters get incorrect,
>> it couldn't be corrected. If we store raw data for that, we may as well go
>> with that approach. But the granularity has to be at the seconds level, as
>> more than one user can click the same link. So the data will be huge, with
>> more writes and more rows to count for reads, right?
>>
>> Thanks
>> Ajay
>>
>>
>> On Mon, Dec 29, 2014 at 7:10 PM, Alain RODRIGUEZ <arodr...@gmail.com>
>> wrote:
>>
>>> Hi Ajay,
>>>
>>> Here is a good explanation you might want to read.
>>>
>>>
>>> http://www.datastax.com/dev/blog/whats-new-in-cassandra-2-1-a-better-implementation-of-counters
>>>
>>> We have been using counters for 3 years now, starting from C* 0.8, and
>>> we are happy with them. The limits I can see in both approaches are:
>>>
>>> Counters:
>>>
>>> - Accuracy, indeed (the error tends to be small in our use case, < 5%, when
>>> the business allows 10%, so fair enough for us). We also recount them
>>> through a batch processing tool (Spark / Hadoop, a kind of lambda
>>> architecture). So our real-time stats are inaccurate, and after a few
>>> minutes or hours we have the real value.
>>> - Read-before-write model, which is an anti-pattern. It makes you use more
>>> machines due to the pressure involved; affordable for us too.
>>>
>>> Raw data (counted):
>>>
>>> - Space used (can become quite impressive very fast, depending on your
>>> business)!
>>> - Time to answer a request (we expose the data to customers; they don't
>>> want to wait 10 sec for Cassandra to read 1 000 000+ columns).
>>> - Performance in O(n) (linear) instead of O(1) (constant). Customers
>>> won't always understand that for you it is harder to read 1 000 000 columns
>>> than 1, since from their side it should be reading 1 number in both cases,
>>> and your interface will have very unstable read times.
>>>
>>> Pick the best solution (or combination) for your use case. These lists of
>>> disadvantages are not exhaustive, just things that came to my mind right
>>> now.
>>>
>>> C*heers
>>>
>>> Alain
>>>
>>> 2014-12-29 13:33 GMT+01:00 Ajay <ajay.ga...@gmail.com>:
>>>
>>>> Hi,
>>>>
>>>> So you mean to say counters are not accurate? (It is highly likely that
>>>> multiple parallel threads will try to increment the counter as users
>>>> click the links.)
>>>>
>>>> Thanks
>>>> Ajay
>>>>
>>>>
>>>> On Mon, Dec 29, 2014 at 4:49 PM, Janne Jalkanen <
>>>> janne.jalka...@ecyrd.com> wrote:
>>>>
>>>>>
>>>>> Hi!
>>>>>
>>>>> It’s really a tradeoff between accurate and fast, and your read access
>>>>> patterns; if you need it to be fairly fast, use counters by all means, but
>>>>> accept the fact that they will (especially in older versions of Cassandra
>>>>> or under adverse network conditions) drift off from the true click count.
>>>>> If you need accuracy, use a timeuuid and count the rows (this is fairly
>>>>> safe for replays too).  However, if using timeuuids, your storage will need
>>>>> lots of space, and your reads will be slow if the click counts are huge
>>>>> (because Cassandra will need to read every item).  Using counters makes it
>>>>> easy to just grab a slice of the time series data and shove it to a client
>>>>> for visualization.
>>>>>
>>>>> You could of course do a hybrid system; use timeuuids and then
>>>>> periodically count and add the result to a regular column, and then remove
>>>>> the columns.  Note that you might want to optimize this so that you don’t
>>>>> end up with a lot of tombstones, e.g. by bucketing the writes so that you
>>>>> can delete everything with just a single partition delete.
>>>>>
>>>>> At Thinglink some of the more important counters that we use are
>>>>> backed up by the actual data. So for speed purposes we always use counters
>>>>> for reads, but there’s a repair process that fixes the counter value if we
>>>>> suspect it starts drifting off the real data too much.  (You might be able
>>>>> to tell that we’ve been using counters for quite some time :-P)
>>>>>
>>>>> /Janne
>>>>>
>>>>> On 29 Dec 2014, at 13:00, Ajay <ajay.ga...@gmail.com> wrote:
>>>>>
>>>>> > Hi,
>>>>> >
>>>>> > Is it better to use a counter for the user click count than to create a
>>>>> > new row per click, keyed as user id : timestamp, and count the rows?
>>>>> >
>>>>> > Basically, we want to track user clicks and use them for
>>>>> > hourly/daily/monthly reports.
>>>>> >
>>>>> > Thanks
>>>>> > Ajay
>>>>>
>>>>>
>>>>
>>>
>>
>
