Dean,

We have the same question...

We have thousands of separate feeds of data as well (20,000+).  To
date, we've been using a CF per feed strategy, but as we scale this
thing out to accommodate all of those feeds, we're trying to figure
out if we're going to blow out the memory.

The initial documentation for heap sizing had column families in the equation:
http://www.datastax.com/docs/0.7/operations/tuning#heap-sizing

But in the more recent documentation, it looks like they removed the
column family variable with the introduction of the universal
key_cache_size.
http://www.datastax.com/docs/1.0/operations/tuning#tuning-java-heap-size
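
For a rough sense of why the older formula is a worry at this CF count, here is a minimal sketch. It assumes the 0.7-era rule of thumb was roughly memtable_throughput_in_mb * 3 * (number of hot CFs) + 1 GB + internal caches; that is recalled from the older docs, not quoted from them, so check it against the first link above. The function and parameter names are made up for illustration.

```python
def estimate_heap_mb(memtable_throughput_mb, hot_cfs, cache_mb=0):
    """Rough heap estimate in MB under the assumed 0.7-era rule of thumb.

    memtable_throughput_mb: per-CF memtable throughput setting (MB)
    hot_cfs: number of column families actively taking writes
    cache_mb: allowance for internal caches (hypothetical parameter)
    """
    base_overhead_mb = 1024  # the "+ 1 GB" term in the assumed formula
    return memtable_throughput_mb * 3 * hot_cfs + base_overhead_mb + cache_mb


if __name__ == "__main__":
    # Compare a handful of hot CFs against thousands of them.
    for cfs in (10, 1000, 20000):
        print(cfs, "hot CFs ->", estimate_heap_mb(1, cfs), "MB at 1 MB per memtable")
```

The term that matters is the count of hot CFs: under this assumption the estimate grows linearly with the number of CFs taking writes at the same time, so the answer depends on how many of the 20,000+ feeds are active at once.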

We haven't committed either way yet, but given Ed Anuff's presentation
on virtual keyspaces, we were leaning towards a single column family
approach:
http://blog.apigee.com/detail/building_a_mobile_data_platform_with_cassandra_-_apigee_under_the_hood/?
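
A minimal sketch of what a single column family approach could look like for per-feed data, assuming the idea is to prefix every row key with a feed identifier so that all feeds share one CF. The separator and helper names are hypothetical, and this is not taken from the presentation itself.

```python
from typing import Tuple

# Hypothetical separator; feed ids must never contain this byte.
SEPARATOR = b"\x00"


def make_row_key(feed_id: str, row_key: str) -> bytes:
    """Build a composite row key of the form <feed_id> NUL <row_key>."""
    if "\x00" in feed_id:
        raise ValueError("feed_id must not contain the separator byte")
    return feed_id.encode("utf-8") + SEPARATOR + row_key.encode("utf-8")


def split_row_key(composite: bytes) -> Tuple[str, str]:
    """Split a composite key back into (feed_id, row_key)."""
    feed, sep, row = composite.partition(SEPARATOR)
    if not sep:
        raise ValueError("not a composite key")
    return feed.decode("utf-8"), row.decode("utf-8")


if __name__ == "__main__":
    key = make_row_key("feed-42", "2012-09-28")
    print(key, split_row_key(key))
```

With keys built this way, two feeds can use the same logical row key without colliding, and nothing has to change in the schema when a new feed shows up. The trade-off is that per-feed settings and per-CF tooling no longer apply to an individual feed.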

Definitely let us know what you decide.

-brian

On Fri, Sep 28, 2012 at 11:48 AM, Flavio Baronti
<f.baro...@list-group.com> wrote:
> We had some serious trouble with dynamically adding CFs, although last time
> we tried we were using version 0.7, so maybe
> that's not an issue any more.
> Our problems were two:
> - You are (were?) not supposed to add CFs concurrently. Since we had more
> servers talking to the same Cassandra cluster,
> we had to use distributed locks (Hazelcast) to avoid concurrency.
> - You must be very careful to add new CFs to different Cassandra nodes. If
> you do that fast enough, and the clocks of
> the two servers are skewed, you will severely compromise your schema
> (Cassandra will not understand in which order the
> updates must be applied).
>
> As I said, this applied to version 0.7, maybe current versions solved these
> problems.
>
> Flavio
>
>
> On 2012/09/27 16:11 PM, Hiller, Dean wrote:
>> We have 1000's of different building devices and we stream data from
>> these devices.  The format and data from each one varies, so one device
>> has temperature at timeX with some other variables, and another device
>> has CO2 percentage and other variables.  Every device is unique and
>> streams its own data.  We dynamically discover devices and register
>> them.  Basically, one CF or table per thing really makes sense in this
>> environment.  While we could try to find out which devices "are"
>> similar, this would really be a pain, and some devices add some new
>> variable into the equation.  NOT only that, but researchers can register
>> new datasets and upload them as well, and they do NOT necessarily want
>> to share each dataset with other researchers, so we have security groups
>> and each CF belongs to security groups.  We dynamically create CFs on
>> the fly as people register new datasets.
>>
>> On top of that, when the data sets get too large, we probably want to
>> partition a single CF into time partitions.  We could create one CF, put
>> all the data in it, and have a partition per device, but then a time
>> partition would contain data from "multiple" devices, meaning we would
>> need to shrink our time partition size.  With a CF per device, the time
>> partition can be larger, as it is only for that one device.
>>
>> THEN, on top of that, we have a meta CF for these devices, so some
>> people want to query for streams that match criteria, which returns a CF
>> name, and then they query that CF name.  So we almost need a query with
>> variables, like "select cfName from Meta where x = y" and then
>> "select * from cfName where xxxxx".  Which we can do today.
>>
>> Dean
>>
>> From: Marcelo Elias Del Valle <mvall...@gmail.com>
>> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
>> Date: Thursday, September 27, 2012 8:01 AM
>> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
>> Subject: Re: 1000's of column families
>>
>> Out of curiosity, is it really necessary to have that amount of CFs?
>> I am probably still used to relational databases, where you would use a
>> new table only when you need to store different kinds of data.  As
>> Cassandra stores anything in each CF, it might make sense to have a lot
>> of CFs to store your data...
>> But why wouldn't you use a single CF with partitions in this case?
>> Wouldn't it be the same thing?  I am asking because I might learn a new
>> modeling technique with the answer.
>>
>> []s
>>
>> 2012/9/26 Hiller, Dean <dean.hil...@nrel.gov>
>> We are streaming data with 1 stream per 1 CF and we have 1000's of CFs.
>> The tools are all geared to analyzing ONE column family at a time :(.
>> If I remember correctly, Cassandra supports as many CFs as you want,
>> correct?  Even though I am going to have tons of fun with limitations on
>> the tools, correct?
>>
>> (I may end up wrapping the node tool with my own aggregate calls if
>> needed to sum up multiple column families and such).
>>
>> Thanks,
>> Dean
>>
>>
>>
>> --
>> Marcelo Elias Del Valle
>> http://mvalle.com - @mvallebr
>>
>
>



-- 
Brian ONeill
Lead Architect, Health Market Science (http://healthmarketscience.com)
Apache Cassandra MVP
mobile:215.588.6024
blog: http://brianoneill.blogspot.com/
twitter: @boneill42
