I was hoping to transition my "simple" Cassandra cluster (where each node runs 
Cassandra plus a Hadoop tasktracker) to a cluster with two virtual datacenters 
(vanilla Cassandra vs. Cassandra + Hadoop tasktracker), based on this:
http://wiki.apache.org/cassandra/HadoopSupport#ClusterConfig
The problem I'm having is that my Hadoop jobs are getting heavy enough that 
they're affecting user-facing performance on my cluster.

Right now I'm in AWS, and I have 4 nodes in us-east split over two availability 
zones ("us-east-1c" that I'll call "c" and "us-east-1d" that I'll call "d"), 
set up with this keyspace:
create keyspace civicscience with replication_factor=3
  and strategy_options = [{us-east:3}]
  and placement_strategy='org.apache.cassandra.locator.NetworkTopologyStrategy';
And I'm using the Ec2Snitch.

I'm wondering whether I could write my own snitch that extends Ec2Snitch, with 
overrides as follows:
getDC = if (AZ == c || AZ == d) return us-east; (to keep current nodes the same) 
        else return us-east-hadoop;
getRack = return super(); (returning a,b,c,d seems ok)
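
To make the datacenter rule concrete, here is a minimal Java sketch of just the 
mapping, written as a plain helper with no Cassandra dependency. The class and 
method names (HadoopDcMapping, datacenterFor) are made up for illustration, and 
I'm assuming the availability zone is available as a full string such as 
"us-east-1c". How this would be wired into an actual Ec2Snitch subclass is not 
shown.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of Cassandra. It only models the DC rule
// described above so the mapping can be checked in isolation.
final class HadoopDcMapping {
    // Zones whose nodes must keep reporting the existing datacenter name.
    private static final Set<String> NORMAL_ZONES =
            new HashSet<String>(Arrays.asList("us-east-1c", "us-east-1d"));

    private HadoopDcMapping() {}

    // Returns "us-east" for the current c/d nodes and "us-east-hadoop"
    // for a node in any other availability zone.
    static String datacenterFor(String availabilityZone) {
        return NORMAL_ZONES.contains(availabilityZone) ? "us-east" : "us-east-hadoop";
    }
}
```

With this rule, a node in us-east-1a or us-east-1b would report us-east-hadoop, 
while the four existing nodes keep reporting us-east.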

Then, if I boot N new nodes into us-east-1[a,b], they will be "hadoop" nodes 
because of the snitch.  I'll obviously have to change my home-brew Cassandra + 
Hadoop instances to selectively run tasktrackers or not (a/b = yes, c/d = no).
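
The tasktracker decision could be as simple as the following Java sketch. The 
name TaskTrackerPolicy is hypothetical, and I'm assuming the instance already 
knows its own availability zone as a string such as "us-east-1a".

```java
// Hypothetical helper: decides whether an instance should start a Hadoop
// tasktracker, based only on the letter suffix of its availability zone.
final class TaskTrackerPolicy {
    private TaskTrackerPolicy() {}

    // a/b zones run a tasktracker; c/d (and anything else) do not.
    static boolean shouldRunTaskTracker(String availabilityZone) {
        return availabilityZone.endsWith("a") || availabilityZone.endsWith("b");
    }
}
```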

But:
-Is the overall RF=3 still ok?
-What is the recommended split between "normal" and "hadoop" in terms of 
strategy_options (assuming RF=3)?  2/1?  
-Can I (and how do I safely) change the keyspace strategy_options from 
[{us-east:3}] to [{us-east:2, us-east-hadoop:1}]?  This seems like the 
riskiest/most complicated step of everything I've proposed...
-After I change the options, what (if anything) would I have to do to migrate 
data around?  

One final question: should I add the new nodes as Brisk instances instead of my 
home-brew Cassandra + Hadoop nodes?  I've obviously already put in the 
pain/effort of learning how to run Hadoop + Cassandra...

Thanks for any help/advice!

will
