Thanks for the answers, Dan, Aaron. 

...
Ok, so one question is: if I haven't made any writes at all, can I decommission 
without delay? (Is there a "force drop" option or something, or will the 
cluster recognize the lack of writes?)

I may be able to segregate writes to the "reference collection" so that they 
occur late at night and/or on weekends, when I don't otherwise have much load. 
(NB: it would be nice to be able to control replication strategy by keyspace; as 
it is, I can probably put the reference data in its own cluster.)

But thanks for the suggestions about a caching layer -- I had already thought 
of memcache (problematic, as noted, due to the amount of data), but hadn't 
considered some of the other options you've mentioned. I didn't know, for 
instance, that you could use the queueing services this way.
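
As a rough illustration of the smoothing idea, here is a minimal sketch in 
Python; the standard library's queue module stands in for SQS/RabbitMQ, and the 
names (throttled_drain, write_fn) are illustrative, not any real API:

```python
import queue
import time

def throttled_drain(q, writes_per_sec, write_fn):
    """Drain queued writes at a capped rate, so the backing
    cluster never sees more than writes_per_sec inserts."""
    interval = 1.0 / writes_per_sec
    written = 0
    while True:
        try:
            item = q.get_nowait()
        except queue.Empty:
            break  # queue drained; the burst has been absorbed
        write_fn(item)  # e.g. the actual Cassandra insert
        written += 1
        time.sleep(interval)  # crude fixed-rate throttle
    return written
```

During a burst, producers enqueue as fast as they like; the consumer only ever 
issues writes at the configured rate, so the cluster can be sized for that 
ceiling rather than for the peak.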

As for S3, etc.: I guess it's possible, but the costs seem to mount quickly as 
well. Typically I have one sporadic writer and many readers, but I do write 
sometimes.

Another use case is to have expanded capacity for writes & reads of 
intermediate results while running Hadoop. Should I perhaps just start a whole 
other cluster for these?

Gratefully,

-- Shaun



On Mar 5, 2011, at 10:52 PM, aaron morton wrote:

> Agree. Cassandra generally assumes a reasonably static cluster membership. 
> There are some tricks that can be done with copying SSTables, but they will 
> only reduce the need to stream data around, not eliminate it.
> 
> This may not suit your problem domain but, speaking of the AWS infrastructure, 
> how about using the SQS messaging service (or similar, e.g. RabbitMQ) to 
> smooth out your throughput? You could then throttle the inserts into the 
> Cassandra cluster to a maximum level and spec your HW against that. During 
> peaks the message queue can soak up the overflow. 
> 
> Hope that helps. 
> Aaron
> 
> On 4/03/2011, at 2:07 PM, Dan Hendry wrote:
> 
>> To some extent, the boot-strapping problem will be an issue with most
>> solutions: the data has to be duplicated from somewhere. Bootstrapping
>> should not cause much performance degradation unless you are already pushing
>> capacity limits. It's the decommissioning problem which makes Cassandra
>> somewhat problematic in your case. You grow your cluster 5x, then write to
>> it. You have to perform a proper decommission when shrinking the cluster
>> again, which involves validating and streaming data to the remaining
>> replicas: a fairly serious operation with TBs of data. For most realistic
>> situations, unless the cluster is completely read-only, you can't just kill
>> most of the nodes in the cluster.
>> 
>> I can't really think of a good, general way to do this with just Cassandra,
>> although there may be some hacktastical possibilities. I think a more
>> statically sized Cassandra cluster fronted by a variable cache layer
>> (memcached or similar) is probably a better solution. This option kind of
>> falls apart at the terabytes-of-data range. 
>> 
>> Have you considered using S3, Amazon CloudFront, or some other CDN instead
>> of rolling your own solution? For immutable data, it's what they excel at.
>> Cassandra has amazing write capacity and its design focus is on scaling
>> writes. I would not really consider it a good tool for the job of serving
>> massive amounts of static content.
>> 
>> Dan
>> 
>> -----Original Message-----
>> From: Shaun Cutts [mailto:sh...@cuttshome.net] 
>> Sent: March-03-11 13:00
>> To: user@cassandra.apache.org
>> Subject: question about replicas & dynamic response to load
>> 
>> Hello,
>> 
>> In our project our usage pattern is likely to be quite variable -- high for
>> a few days, then lower, etc. It could vary as much as 10x (or more) from peak
>> to "non-peak". Also, much of our data is immutable -- but there is a
>> considerable amount of it -- perhaps in the single-digit TBs. Finally, we
>> are hosting with Amazon.
>> 
>> I'm looking for advice on how to vary the number of nodes dynamically, in
>> order to reduce our hosting costs at non-peak times. I worry that just
>> adding "new" nodes in response to demand will make things worse -- at least
>> temporarily -- as the new node copies data to itself; then bringing it down
>> will also cause a degradation.
>> 
>> I'm wondering if it is possible to bring up exact copies of other nodes? Or
>> alternately to take down a populated node containing (only?) immutable data,
>> then bring it up again when the need arises?
>> 
>> Are there reference/reading materials(/blogs) concerning dynamically varying
>> number of nodes in response to demand?
>> 
>> Thanks!
>> 
>> -- Shaun
>> 
>> 
> 
