Some relevant information here: https://www.cloudkick.com/blog/2010/mar/02/4_months_with_cassandra/
On Tue, Jan 4, 2011 at 10:09 PM, Dave Viner <davevi...@gmail.com> wrote:
> Hi Peter,
>
> Thanks. These are great ideas. One comment, though: I'm actually not as
> worried about the "logging into the system" performance; I'm more
> speculating/imagining about the querying out of the system.
>
> Most traditional data warehouses have a cube or a star schema or
> something similar. I'm trying to imagine how one might use Cassandra in
> situations where that sort of design has historically been applied.
>
> But I want to make sure I understand your suggestion A.
>
> Is it something like this?
>
> "a Column Family with the row key being the Unix time divided by 60x60
> and a column key of... pretty much anything unique"
>
>     LogCF[hour-day-in-epoch-seconds][timeuuid] = 1
>
> where 'hour-day-in-epoch-seconds' is something like the first second of
> the given hour of the day, e.g. 01/04/2011 19:00:00 GMT (in epoch
> seconds: 1294167600); 'timeuuid' is a TimeUUID from Cassandra; and '1'
> is the value of the entry.
>
> Then "look at the current row every hour to actually compile the
> numbers, and store the count in the same Column Family":
>
>     LogCF[hour-day-in-epoch-seconds][total] = x
>
> where 'x' is the number of timeuuid columns in the row?
>
> Is that what you're envisioning in Option A?
>
> Thanks,
> Dave Viner
>
> On Tue, Jan 4, 2011 at 6:38 PM, Peter Harrison <cheetah...@gmail.com> wrote:
>> Okay, here are two ways to handle this; they are quite different from
>> each other.
>>
>> A)
>>
>> This approach does not depend on counters. You simply have a Column
>> Family with the row key being the Unix time divided by 60x60 and a
>> column key of... pretty much anything unique. Then have another process
>> look at the current row every hour to actually compile the numbers, and
>> store the count in the same Column Family. This will solve the first
>> and third use cases, as it is just a matter of looking at the right
>> rows.
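[The Option A layout restated above can be sketched in plain Python, with a dict
standing in for the Column Family. Names like `log_cf`, `hour_bucket`, and
`record_view` are illustrative only; real writes would go through a Cassandra
client rather than a dict. The hour bucket 1294167600 is the 01/04/2011 19:00:00
GMT example from the thread.]

```python
import uuid

HOUR = 60 * 60

def hour_bucket(epoch_seconds):
    """Row key: the first second of the hour containing the event."""
    return epoch_seconds - (epoch_seconds % HOUR)

# A dict stands in for LogCF; keys are hour buckets, values are the row's columns.
log_cf = {}

def record_view(epoch_seconds):
    """LogCF[hour-day-in-epoch-seconds][timeuuid] = 1"""
    row = log_cf.setdefault(hour_bucket(epoch_seconds), {})
    row[uuid.uuid1()] = 1  # uuid1 is time-based, analogous to Cassandra's TimeUUID

def compile_hour(bucket):
    """The hourly rollup: LogCF[hour-day-in-epoch-seconds][total] = x"""
    row = log_cf.setdefault(bucket, {})
    row['total'] = sum(v for k, v in row.items() if k != 'total')

record_view(1294167600)          # 01/04/2011 19:00:00 GMT
record_view(1294167600 + 1800)   # 19:30 lands in the same row
compile_hour(1294167600)
print(log_cf[1294167600]['total'])  # 2
```

[The rollup pass is idempotent here because the 'total' column is excluded from
its own sum, so re-running `compile_hour` on a live row just refreshes the count.]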
>> The second case will require a similar index, but one with a country
>> code appended to the row key.
>>
>> The downside here is that you are storing lots of data on individual
>> requests and retaining it. If you don't want the detailed data, you
>> might add a second process to purge the detail every hour.
>>
>> B)
>>
>> There is a "counter" feature added to the latest versions of Cassandra.
>> I have not used them, but they should be able to achieve the same
>> effect without a second process cleaning up every hour. It also makes
>> the system more real-time, so you can see how many requests have
>> arrived in the hour you are currently in.
>>
>> Basically, you have to design your approach based on the query you will
>> be doing. Don't get too hung up on traditional data structures and
>> queries, as they have little relationship to a Cassandra approach.
>>
>> On Wed, Jan 5, 2011 at 2:34 PM, Dave Viner <davevi...@gmail.com> wrote:
>>> Does anyone use Cassandra to power an analytics or data warehouse
>>> implementation?
>>>
>>> As a concrete example, one could imagine Cassandra storing data for
>>> something that reports on page views on a website. The basic notions
>>> might be simple (url as row key and columns as timeuuids of viewers).
>>> But how would one store things like ip-geolocation to set of pages
>>> viewed? Or hour-of-day to pages viewed?
>>>
>>> Also, how would one do a query like:
>>> - "tell me how many page views occurred between 12/01/2010 and
>>>   12/31/2010"?
>>> - "tell me how many page views occurred between 12/01/2010 and
>>>   12/31/2010 from the US"?
>>> - "tell me how many page views occurred between 12/01/2010 and
>>>   12/31/2010 from the US in the 9th hour of the day (in GMT)"?
>>>
>>> Time slicing and dimension slicing seem like they might be very
>>> challenging (especially since the windows of time would not be known
>>> in advance).
>>>
>>> Thanks,
>>> Dave Viner
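[Given hourly 'total' columns, the date-range queries in the original question
reduce to summing totals over every hour bucket in the window; the country
dimension uses the country-suffixed row keys Peter describes. A minimal sketch,
again with a dict standing in for the Column Family; the '<bucket>:US' key
format and the `count_views` helper are illustrative assumptions, not from the
thread. 1291161600 and 1293753600 are 12/01/2010 and 12/31/2010 00:00:00 GMT.]

```python
HOUR = 60 * 60

# A dict stands in for the Column Family; integer keys are plain hour buckets,
# string keys are the per-country index rows (bucket with a country code appended).
log_cf = {
    1291161600: {'total': 5},        # an hour on 12/01/2010 GMT
    '1291161600:US': {'total': 3},
    1293753600: {'total': 7},        # an hour on 12/31/2010 GMT
    '1293753600:US': {'total': 2},
}

def hour_buckets(start, end):
    """Yield every hour-bucket row key whose hour overlaps [start, end]."""
    bucket = start - (start % HOUR)
    while bucket <= end:
        yield bucket
        bucket += HOUR

def count_views(start, end, country=None):
    """Sum the precomputed hourly totals across the requested window."""
    total = 0
    for bucket in hour_buckets(start, end):
        key = bucket if country is None else '%d:%s' % (bucket, country)
        total += log_cf.get(key, {}).get('total', 0)
    return total

# "how many page views occurred between 12/01/2010 and 12/31/2010"
print(count_views(1291161600, 1293839999))        # 12
# "... from the US"
print(count_views(1291161600, 1293839999, 'US'))  # 5
```

[Note the trade-off the thread points at: a month-long window touches ~744 hourly
rows, so windows not known in advance cost one read per bucket per dimension;
coarser rollup rows (daily, monthly) would shrink that at the price of more
precompute processes.]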