Thanks for the answers, Dan, Aaron. ...

Ok, so one question is: if I haven't made any writes at all, can I decommission without delay? (Is there a "force drop" option or something, or will the cluster recognize the lack of writes?)
I may be able to segregate writes to the "reference collection" so that they occur late at night and/or on weekends, when I don't otherwise have much load. (NB: it would be nice to be able to control replication strategy by keyspace; as it is, I can probably put the reference data in its own cluster.)

But thanks for the suggestions about a caching layer -- I had already thought of memcache (problematic, as noted, due to the amount of data), but hadn't considered some of the other options you've mentioned. I didn't know, for instance, that you could use the queueing services this way. As for S3, etc. ... I guess it's possible, but the costs seem to mount quickly as well. Typically I have one sporadic writer and many readers, but I do write sometimes.

Another use case is to have expanded capacity for writes & reads of intermediate results while running Hadoop. Should I perhaps just start a whole other cluster for these?

Gratefully,

-- Shaun

On Mar 5, 2011, at 10:52 PM, aaron morton wrote:

> Agree. Cassandra generally assumes reasonably static cluster membership.
> There are some tricks that can be done with copying SSTables, but they will
> only reduce the need to stream data around, not eliminate it.
>
> This may not suit your problem domain but, speaking of the AWS infrastructure:
> how about using the SQS messaging service (or similar, e.g. RabbitMQ) to
> smooth out your throughput? You could then throttle the inserts into the
> Cassandra cluster to a maximum level and spec your HW against that. During
> peak the message queue can soak up the overflow.
>
> Hope that helps.
> Aaron
>
> On 4/03/2011, at 2:07 PM, Dan Hendry wrote:
>
>> To some extent, the bootstrapping problem will be an issue with most
>> solutions: the data has to be duplicated from somewhere. Bootstrapping
>> should not cause much performance degradation unless you are already pushing
>> capacity limits. It's the decommissioning problem which makes Cassandra
>> somewhat problematic in your case.
>> Say you grow your cluster 5x, then write to
>> it. You have to perform a proper decommission when shrinking the cluster
>> again, which involves validating and streaming data to the remaining
>> replicas: a fairly serious operation with TBs of data. For most realistic
>> situations, unless the cluster is completely read-only, you can't just kill
>> most of the nodes in the cluster.
>>
>> I can't really think of a good, general way to do this with just Cassandra,
>> although there may be some hacktastical possibilities. I think a more
>> statically sized Cassandra cluster plus a variable cache layer (memcached
>> or similar) is probably a better solution. That option kind of falls apart
>> at the terabytes-of-data range.
>>
>> Have you considered using S3, Amazon CloudFront, or some other CDN instead
>> of rolling your own solution? For immutable data, it's what they excel at.
>> Cassandra has amazing write capacity and its design focus is on scaling
>> writes. I would not really consider it a good tool for the job of serving
>> massive amounts of static content.
>>
>> Dan
>>
>> -----Original Message-----
>> From: Shaun Cutts [mailto:sh...@cuttshome.net]
>> Sent: March-03-11 13:00
>> To: user@cassandra.apache.org
>> Subject: question about replicas & dynamic response to load
>>
>> Hello,
>>
>> In our project, our usage pattern is likely to be quite variable -- high
>> for a few days, then lower, etc.; it could vary by as much as 10x (or more)
>> from peak to "non-peak". Also, much of our data is immutable -- but there
>> is a considerable amount of it -- perhaps in the single-digit TBs. Finally,
>> we are hosting with Amazon.
>>
>> I'm looking for advice on how to vary the number of nodes dynamically, in
>> order to reduce our hosting costs at non-peak times. I worry that just
>> adding "new" nodes in response to demand will make things worse -- at least
>> temporarily -- as the new node copies data to itself; then bringing it down
>> will also cause a degradation.
>>
>> I'm wondering if it is possible to bring up exact copies of other nodes?
>> Or, alternatively, to take down a populated node containing (only?)
>> immutable data, then bring it up again when the need arises?
>>
>> Are there reference/reading materials(/blogs) concerning dynamically
>> varying the number of nodes in response to demand?
>>
>> Thanks!
>>
>> -- Shaun
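Aaron's suggestion above -- buffer bursts in a queue (SQS, RabbitMQ, etc.) and throttle inserts into the cluster to a fixed maximum rate -- could be sketched roughly as below. This is only an illustrative, in-process sketch, not a real integration: the `deque` stands in for the external queue service, and the `sink` callable stands in for whatever does the actual Cassandra insert; `ThrottledWriter` is a made-up name, not an API from any library.

```python
import time
from collections import deque


class ThrottledWriter:
    """Drain a message buffer into a datastore at a bounded rate.

    Producers can enqueue at any burst rate; the consumer side writes at
    no more than max_writes_per_sec, so the datastore sees a smoothed,
    capped load. The buffer absorbs the peak-time overflow.
    """

    def __init__(self, max_writes_per_sec, sink):
        self.interval = 1.0 / max_writes_per_sec  # minimum spacing per write
        self.sink = sink                          # stand-in for the DB insert
        self.queue = deque()                      # stand-in for SQS/RabbitMQ

    def enqueue(self, msg):
        # Producer side: accept writes unconditionally, at any rate.
        self.queue.append(msg)

    def drain(self, budget_seconds):
        # Consumer side: write evenly-spaced messages until the queue is
        # empty or the time budget runs out. Returns how many were written.
        deadline = time.monotonic() + budget_seconds
        written = 0
        while self.queue and time.monotonic() < deadline:
            self.sink(self.queue.popleft())
            written += 1
            time.sleep(self.interval)  # enforce the maximum write rate
        return written
```

In a real deployment the drain loop would poll the queue service and issue batched Cassandra inserts, with the rate cap chosen to match the statically sized cluster's provisioned capacity.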