Some relevant information here: https://www.cloudkick.com/blog/2010/mar/02/4_months_with_cassandra/
On Tue, Jan 4, 2011 at 10:09 PM, Dave Viner <davevi...@gmail.com> wrote:
> Hi Peter,
>
> Thanks. These are great ideas. One comment, though: I'm actually not as
> worried about the "logging into the system" performance; I'm more
> speculating/imagining about the querying out of the system.
>
> Most traditional data warehouses have a cube or a star schema or
> something similar. I'm trying to imagine how one might use Cassandra in
> situations where that sort of design has historically been applied.
>
> But I want to make sure I understand your suggestion A.
>
> Is it something like this?
>
> "a Column Family with the row key being the Unix time divided by 60x60
> and a column key of... pretty much anything unique"
>
>     LogCF[hour-day-in-epoch-seconds][timeuuid] = 1
>
> where 'hour-day-in-epoch-seconds' is something like the first second of
> the given hour of the day, e.g. 01/04/2011 19:00:00 GMT (in epoch
> seconds: 1294167600); 'timeuuid' is a TimeUUID from Cassandra; and '1'
> is the value of the entry.
>
> Then "look at the current row every hour to actually compile the
> numbers, and store the count in the same Column Family":
>
>     LogCF[hour-day-in-epoch-seconds][total] = x
>
> where 'x' is the number of timeuuid columns in the row?
>
> Is that what you're envisioning in Option A?
>
> Thanks,
> Dave Viner
>
> On Tue, Jan 4, 2011 at 6:38 PM, Peter Harrison <cheetah...@gmail.com> wrote:
>> Okay, here are two ways to handle this; they are quite different from
>> each other.
>>
>> A)
>>
>> This approach does not depend on counters. You simply have a Column
>> Family with the row key being the Unix time divided by 60x60 and a
>> column key of... pretty much anything unique. Then have another process
>> look at the current row every hour to actually compile the numbers, and
>> store the count in the same Column Family. This will solve the first
>> and third use cases, as it is just a matter of looking at the right
>> rows.
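[The Option A layout restated above can be sketched in plain Python, with a dict
standing in for the Column Family. Names like `log_cf`, `hour_bucket`, and
`record_view` are illustrative only; real writes would go through a Cassandra
client rather than a dict. The hour bucket 1294167600 is the 01/04/2011 19:00:00
GMT example from the thread.]

```python
import uuid

HOUR = 60 * 60

def hour_bucket(epoch_seconds):
    """Row key: the first second of the hour containing the event."""
    return epoch_seconds - (epoch_seconds % HOUR)

# A dict stands in for LogCF; keys are hour buckets, values are the row's columns.
log_cf = {}

def record_view(epoch_seconds):
    """LogCF[hour-day-in-epoch-seconds][timeuuid] = 1"""
    row = log_cf.setdefault(hour_bucket(epoch_seconds), {})
    row[uuid.uuid1()] = 1  # uuid1 is time-based, analogous to Cassandra's TimeUUID

def compile_hour(bucket):
    """The hourly rollup: LogCF[hour-day-in-epoch-seconds][total] = x"""
    row = log_cf.setdefault(bucket, {})
    row['total'] = sum(v for k, v in row.items() if k != 'total')

record_view(1294167600)          # 01/04/2011 19:00:00 GMT
record_view(1294167600 + 1800)   # 19:30 lands in the same row
compile_hour(1294167600)
print(log_cf[1294167600]['total'])  # 2
```

[The rollup pass is idempotent here because the 'total' column is excluded from
its own sum, so re-running `compile_hour` on a live row just refreshes the count.]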
>> The second case will require a similar index, but one with a country
>> code appended to the row key.
>>
>> The downside here is that you are storing lots of data on individual
>> requests and retaining it. If you don't want the detailed data, you
>> might add a second process to purge the detail every hour.
>>
>> B)
>>
>> There is a "counter" feature added to the latest versions of Cassandra.
>> I have not used them, but they should be able to achieve the same
>> effect without a second process cleaning up every hour. It also makes
>> the system more real-time, so you can see how many requests have
>> arrived in the hour you are currently in.
>>
>> Basically, you have to design your approach based on the query you will
>> be doing. Don't get too hung up on traditional data structures and
>> queries, as they have little relationship to a Cassandra approach.
>>
>> On Wed, Jan 5, 2011 at 2:34 PM, Dave Viner <davevi...@gmail.com> wrote:
>>> Does anyone use Cassandra to power an analytics or data warehouse
>>> implementation?
>>>
>>> As a concrete example, one could imagine Cassandra storing data for
>>> something that reports on page views on a website. The basic notions
>>> might be simple (url as row key and columns as timeuuids of viewers).
>>> But how would one store things like ip-geolocation to set of pages
>>> viewed? Or hour-of-day to pages viewed?
>>>
>>> Also, how would one do a query like:
>>> - "tell me how many page views occurred between 12/01/2010 and
>>>   12/31/2010"?
>>> - "tell me how many page views occurred between 12/01/2010 and
>>>   12/31/2010 from the US"?
>>> - "tell me how many page views occurred between 12/01/2010 and
>>>   12/31/2010 from the US in the 9th hour of the day (in GMT)"?
>>>
>>> Time slicing and dimension slicing seem like they might be very
>>> challenging (especially since the windows of time would not be known
>>> in advance).
>>>
>>> Thanks,
>>> Dave Viner
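[Given hourly 'total' columns, the date-range queries in the original question
reduce to summing totals over every hour bucket in the window; the country
dimension uses the country-suffixed row keys Peter describes. A minimal sketch,
again with a dict standing in for the Column Family; the '<bucket>:US' key
format and the `count_views` helper are illustrative assumptions, not from the
thread. 1291161600 and 1293753600 are 12/01/2010 and 12/31/2010 00:00:00 GMT.]

```python
HOUR = 60 * 60

# A dict stands in for the Column Family; integer keys are plain hour buckets,
# string keys are the per-country index rows (bucket with a country code appended).
log_cf = {
    1291161600: {'total': 5},        # an hour on 12/01/2010 GMT
    '1291161600:US': {'total': 3},
    1293753600: {'total': 7},        # an hour on 12/31/2010 GMT
    '1293753600:US': {'total': 2},
}

def hour_buckets(start, end):
    """Yield every hour-bucket row key whose hour overlaps [start, end]."""
    bucket = start - (start % HOUR)
    while bucket <= end:
        yield bucket
        bucket += HOUR

def count_views(start, end, country=None):
    """Sum the precomputed hourly totals across the requested window."""
    total = 0
    for bucket in hour_buckets(start, end):
        key = bucket if country is None else '%d:%s' % (bucket, country)
        total += log_cf.get(key, {}).get('total', 0)
    return total

# "how many page views occurred between 12/01/2010 and 12/31/2010"
print(count_views(1291161600, 1293839999))        # 12
# "... from the US"
print(count_views(1291161600, 1293839999, 'US'))  # 5
```

[Note the trade-off the thread points at: a month-long window touches ~744 hourly
rows, so windows not known in advance cost one read per bucket per dimension;
coarser rollup rows (daily, monthly) would shrink that at the price of more
precompute processes.]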