Re: Schema clone ...

2012-01-11 Thread cbert...@libero.it
I got it :-) Thanks for your patience Aaron ... the problem was that the
cluster_name in the yaml was different. Now it works, I've cloned the schema.
Regards
Carlo
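
(For anyone hitting the same wall: the copied system sstables carry the source
cluster's name, and Cassandra refuses to start when the saved cluster name and
the cluster_name in cassandra.yaml disagree. A quick hedged check, paths being
the usual package/tarball defaults:

    grep '^cluster_name' /etc/cassandra/cassandra.yaml
    # or, for a tarball install:
    grep '^cluster_name' "$CASSANDRA_HOME"/conf/cassandra.yaml

Make it match the cluster the sstables came from before starting the node.)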


Original Message

From: aa...@thelastpickle.com

Date: 10/01/2012 20.13

To: user@cassandra.apache.org

Subject: Re: Schema clone ...




* Grab the system sstables from one of the 0.7 nodes and spin up a temp
1.0 machine with them, then use the show schema command. Grab the *system* tables Migrations,
Schema etc. in cassandra/data/system
Cheers

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com


On 10/01/2012, at 10:20 PM, cbert...@libero.it wrote:
* Grab the system sstables from one of the 0.7 nodes and spin up a temp
1.0 machine with them, then use the show schema command.

Probably I'm still sleeping but I can't get what I want! :-(
I've copied the SSTables of a node to my own computer where I installed a 
Cassandra 1.0 just for the purpose.
I've copied it in the data folder under the keyspace name 


carlo@ubpc:/store/cassandra/data/social$


now here I have lots of files like this ...


User-f-74-Data.db
User-f-74-Filter.db
User-f-74-Index.db
User-f-74-Statistics.db
but now how to tell cassandra "hey, load the content of social"? Did I miss
something?
Cheers,
Carlo




Original Message
Subject: Re: Schema clone ...



ah, sorry brain not good work.
It's only in 0.8.
You could either:
* write the CLI script by hand, or
* grab the system sstables from one of the 0.7 nodes and spin up a temp 1.0
machine with them, then use the show schema command, or
* see if your cassandra client software can help.
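
Option two in shell form, as a hedged sketch (host names and paths are
illustrative, and the exact system sstable names vary by version):

    # copy the schema-bearing system sstables from a 0.7 node to the temp box
    scp oldnode:/var/lib/cassandra/data/system/Migrations-* /var/lib/cassandra/data/system/
    scp oldnode:/var/lib/cassandra/data/system/Schema-* /var/lib/cassandra/data/system/

    # start the temp 1.0 node (cluster_name in cassandra.yaml must match the
    # source cluster), then dump the schema as a replayable CLI script
    echo 'show schema;' | cassandra-cli -h localhost -p 9160 > schema.cli

Replaying schema.cli against the new cluster (cassandra-cli -h newnode -f
schema.cli) should recreate the column families without any data.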

Hope that helps. 

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com


On 9/01/2012, at 11:41 PM, cbert...@libero.it wrote:
I was just trying it but ... in 0.7 CLI there is no show schema command.
When I connect with 1.0 CLI to my 0.7 cluster ...

[default@social] show schema;
null
I always get a null as answer! :-| Any tip for this?
ty, Cheers 
Carlo


Original Message

From: aa...@thelastpickle.com

Date: 09/01/2012 11.33

To: user@cassandra.apache.org, cbert...@libero.it

Subject: Re: Schema clone ...



Try show schema in the CLI. 
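
Interactively that looks something like this (host and port are the usual
defaults; show schema only exists from 0.8 on, which is why it comes back
null against a 0.7 cluster, as seen later in the thread):

    cassandra-cli -h localhost -p 9160
    [default@unknown] use social;
    [default@social] show schema;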
Cheers

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com



On 9/01/2012, at 11:12 PM, cbert...@libero.it wrote:
Hi,
I have created a new dev-cluster with cassandra 1.0 -- I would like to have the
same CFs that I have in the 0.7 one, but I don't need the data to be there, just the
schema. What is the fastest way to do it without making 30 create column
family ...

Best regards,

Carlo

multiple datastax opscenter?

2012-01-11 Thread Jeesoo Shin
Hi,
I know OpsCenter doesn't support multiple clusters yet.
I tried to install and run multiple OpsCenters on one server with different ports:
different ports for web and agent, same cassandra JMX and thrift ports.
But the second OpsCenter doesn't show any cluster node number. (The web UI is working.)

Am I missing a config? Or is it not possible to run multiple OpsCenters?


thanks.


Re: multiple datastax opscenter?

2012-01-11 Thread Jeesoo Shin
never mind.
it works.

my second cluster didn't have any keyspace, that's why.

On 1/11/12, Jeesoo Shin bsh...@gmail.com wrote:
 Hi,
 I know OpsCenter doesn't support multiple clusters yet.
 I tried to install and run multiple OpsCenters on one server with different
 ports: different ports for web and agent, same cassandra JMX and thrift ports.
 But the second OpsCenter doesn't show any cluster node number. (The web UI is working.)

 Am I missing a config? Or is it not possible to run multiple OpsCenters?


 thanks.



Re: How to control location of data?

2012-01-11 Thread Andreas Rudolph
Hi!

 ... 
 So actually if you use Cassandra – for the application the actual storage 
 location of the data should not matter. It will be available anywhere in the 
 cluster if it is stored on any reachable node.
I suspected as much: Cassandra does not provide a mechanism to strictly
constrain which nodes in a cluster hold the data for a specific key space,
because Cassandra is not designed for that purpose.

Thank you very much for your effort and detailed explanation.

  
 From: Andreas Rudolph [mailto:andreas.rudo...@spontech-spine.com]
 Sent: Tuesday, 10 January 2012 15:06
 To: user@cassandra.apache.org
 Subject: Re: How to control location of data?
  
 Hi!
  
 Thank you for your last reply. I'm still wondering if I got you right...
  
 ... 
 A partitioner decides into which partition a piece of data belongs
 Does your statement imply that the partitioner does not take any decisions at
 all on the (physical) storage location? Or put another way: what do you mean
 by partition?
  
 To quote http://wiki.apache.org/cassandra/ArchitectureInternals: ... 
 AbstractReplicationStrategy controls what nodes get secondary, tertiary, etc. 
 replicas of each key range. Primary replica is always determined by the token 
 ring (...)
 
 
 ... 
 You can select different placement strategies and partitioners for different 
 keyspaces, thereby choosing known data to be stored on known hosts.
 This is however discouraged for various reasons – e.g. you need a lot of
 knowledge about your data to keep the cluster balanced. What is your use case
 for this requirement? There is probably a more suitable solution.
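
 For reference, that per-keyspace placement choice is made at keyspace
 creation time. A hedged CLI sketch (keyspace and datacenter names are
 invented, and the strategy_options syntax shifted slightly between 0.8
 and 1.0):

     create keyspace Pinned
       with placement_strategy = 'org.apache.cassandra.locator.NetworkTopologyStrategy'
       and strategy_options = [{DC1:2, DC2:1}];

 Note this pins replicas to datacenters, not to individual nodes, which is
 as close as the stock strategies come to choosing known hosts.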
  
 What we want is to partition the cluster with respect to key spaces.
 That is, we want to establish an association between nodes and key spaces so
 that a node of the cluster holds data from a key space if and only if that 
 node is a *member* of that key space.
  
 To our knowledge Cassandra has no built-in way to specify such a 
 membership-relation. Therefore we thought of implementing our own replica 
 placement strategy until we began to suspect that the partitioner had to be
 replaced, too, to accomplish the task.
  
 Do you have any ideas?
  
 
 
 From: Andreas Rudolph [mailto:andreas.rudo...@spontech-spine.com]
 Sent: Tuesday, 10 January 2012 09:53
 To: user@cassandra.apache.org
 Subject: How to control location of data?
  
 Hi!
  
 We're evaluating Cassandra for our storage needs. One of the key benefits we 
 see is the online replication of the data, that is an easy way to share data 
 across nodes. But we have the need to precisely control on what node group 
 specific parts of a key space (columns/column families) are stored on. Now 
 we're having trouble understanding the documentation. Could anyone help us 
 with to find some answers to our questions?
 
 ·  What does the term replica mean: If a key is stored on exactly three 
 nodes in a cluster, is it correct then to say that there are three replicas 
 of that key or are there just two replicas (copies) and one original?
 ·  What is the relation between the Cassandra concepts Partitioner and 
 Replica Placement Strategy? According to documentation found on DataStax 
 web site and architecture internals from the Cassandra Wiki the first storage 
 location of a key (and its associated data) is determined by the 
 Partitioner whereas additional storage locations are defined by Replica 
 Placement Strategy. I'm wondering if I could completely redefine the way
 nodes are selected to store a key by just implementing my own subclass of
 AbstractReplicationStrategy and configuring that subclass for the key space.
 ·  How can I prevent the Partitioner from being consulted at all to determine
 which node stores a key first?
 ·  Is a key space always distributed across the whole cluster? Is it possible 
 to configure Cassandra in such a way that more or less freely chosen parts of 
 a key space (columns) are stored on arbitrarily chosen nodes?
  
 Any tips would be very appreciated :-)
  
 




Re: How to control location of data?

2012-01-11 Thread Andreas Rudolph
Hi!

 ...
 Again, it's probably a bad idea. 
I agree on that, now.

Thank you.

 
 Cheers
 
 -
 Aaron Morton
 Freelance Developer
 @aaronmorton
 http://www.thelastpickle.com
 
 On 11/01/2012, at 4:56 AM, Roland Gude wrote:
 
  
 Each node in the cluster is assigned a token (can be done automatically – 
 but usually should not be)
 The token of a node is the start token of the partition it is responsible 
 for (and the token of the next node is the end token of the current tokens 
 partition)
  
 Assume you have the following nodes/tokens (which are usually numbers but 
 for the example I will use letters)
  
 N1/A
 N2/D
 N3/M
 N4/X
  
 This means that N1 is responsible (primary) for [A-D)
N2 for [D-M)
N3 for [M-X)
 And N4 for [X-A)
  
 If you have a replication factor of 1 data will go on the nodes like this:
  
 B -> N1
 E -> N2
 X -> N4
  
 And so on
 If you have a higher replication factor, the placement strategy decides 
 which node will take replicas of which partition (becoming secondary node 
 for that partition)
 Simple strategy will just put the replica on the next node in the ring
 So same example as above but RF of 2 and simple strategy:
  
 B -> N1 and N2
 E -> N2 and N3
 X -> N4 and N1
  
 Other strategies can factor in things like “put  data in another datacenter” 
 or “put data in another rack” or such things.
  
 Even though the terms primary and secondary imply some means of quality or 
 consistency, this is not the case. If a node is responsible for a piece of 
 data, it will store it.
  
  
 But placement of the replicas is usually only relevant for availability 
 reasons (i.e. disaster recovery etc.)
 Actual location should mean nothing to most applications as you can ask any 
 node for the data you want and it will provide it to you (fetching it from 
 the responsible nodes).
 This should be sufficient in almost all cases.
  
 So in the above example again, you can ask N3 “what data is available” and 
 it will tell you: B, E and X, or you could ask it “give me X” and it will 
 fetch it from N4 or N1 or both of them depending on consistency 
 configuration and return the data to you.
  
  
 So actually if you use Cassandra – for the application the actual storage 
 location of the data should not matter. It will be available anywhere in the 
 cluster if it is stored on any reachable node.
  
 From: Andreas Rudolph [mailto:andreas.rudo...@spontech-spine.com]
 Sent: Tuesday, 10 January 2012 15:06
 To: user@cassandra.apache.org
 Subject: Re: How to control location of data?
  
 Hi!
  
 Thank you for your last reply. I'm still wondering if I got you right...
  
 ... 
 A partitioner decides into which partition a piece of data belongs
 Does your statement imply that the partitioner does not take any decisions
 at all on the (physical) storage location? Or put another way: what do you
 mean by partition?
  
 To quote http://wiki.apache.org/cassandra/ArchitectureInternals: ... 
 AbstractReplicationStrategy controls what nodes get secondary, tertiary, 
 etc. replicas of each key range. Primary replica is always determined by the 
 token ring (...)
 
 
 ... 
 You can select different placement strategies and partitioners for different 
 keyspaces, thereby choosing known data to be stored on known hosts.
 This is however discouraged for various reasons – e.g. you need a lot of
 knowledge about your data to keep the cluster balanced. What is your use case
 for this requirement? There is probably a more suitable solution.
  
 What we want is to partition the cluster with respect to key spaces.
 That is, we want to establish an association between nodes and key spaces so
 that a node of the cluster holds data from a key space if and only if that 
 node is a *member* of that key space.
  
 To our knowledge Cassandra has no built-in way to specify such a 
 membership-relation. Therefore we thought of implementing our own replica 
 placement strategy until we began to suspect that the partitioner had to be
 replaced, too, to accomplish the task.
  
 Do you have any ideas?
  
 
 
 From: Andreas Rudolph [mailto:andreas.rudo...@spontech-spine.com]
 Sent: Tuesday, 10 January 2012 09:53
 To: user@cassandra.apache.org
 Subject: How to control location of data?
  
 Hi!
  
 We're evaluating Cassandra for our storage needs. One of the key benefits we 
 see is the online replication of the data, that is an easy way to share data 
 across nodes. But we have the need to precisely control on what node group 
 specific parts of a key space (columns/column families) are stored on. Now 
 we're having trouble understanding the documentation. Could anyone help us 
 with to find some answers to our questions?
 
 ·  What does the term replica mean: If a key is stored on exactly three 
 nodes in a cluster, is it correct then to say that there are three replicas 
 of that key or are there just two replicas (copies) and one original?
 ·  What is the relation between the Cassandra 

Rebalance cluster

2012-01-11 Thread Daning Wang
Hi All,

We have a 5 node cluster (on 0.8.6), but two machines are slower and have
less memory, so the performance was not good on those two machines for
large volume traffic. I want to move some data from the slower machines to the
faster machines to ease some load; the token ring will not be equally balanced.

I am thinking the following steps,

1. modify cassandra.yaml to change the initial token.
2. restart cassandra (don't need to auto-bootstrap, right?)
3. then run nodetool repair (or nodetool move? not sure which one to use)


Is there any doc that has detailed steps about how to do this?

Thanks in advance,

Daning


Re: Syncing across environments

2012-01-11 Thread aaron morton
Nothing Cassandra specific that I am aware of. Do you have any ops automation
such as chef or puppet?

DataStax make their chef cookbooks available here
https://github.com/riptano/chef

Cheers

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 11/01/2012, at 9:53 AM, David McNelis wrote:

 Is anyone familiar with any tools that are already available to allow for 
 configurable synchronization of different clusters?
 
 Specifically for purposes of development, i.e. Dev, staging, test, and 
 production cassandra environments, so that you can easily plug in the 
 information that you want to filter back down to your 'lower level' 
 environments...
 
 If not, I'm interested in starting working on something like that, so if you 
 have specific thoughts about features/requirements for something extendable 
 that you'd like to share I'm all ears.
 
 In general the main pieces that I know I would like to have on a column 
 family basis:
 
 1) Synchronize the schema
 2) Specify keys or a range of keys to sync for that CF
 3) Support full CF sync
 4) Entirely configurable by either maven properties, basic properties, or xml 
 file
 5) Basic reporting about what was synchronized
 6) Allow plugin development for mutating keys as you move to different
 environments (in case your keys in one environment need to be a different
 value in another environment; for example, you have a client_id based on an
 account number. The account number exists on dev and prod, but the client_id
 is different. Want to let a dev write a mutator plugin to update the key
 prior to it being written to the destination.)
 7) Support multiple destinations
 
 Any thoughts on this, folks?  I'd wager this is an issue just about all of  
 us deal with, and we're probably all doing it in a little different way.
 
 David



Re: Rebalance cluster

2012-01-11 Thread David McNelis
Daning,

You can see how to do this basic sort of thing on the Wiki's operations
page ( http://wiki.apache.org/cassandra/Operations )

In short, you'll want to run:
nodetool -h hostname move newtoken

Then, once you've updated each of the tokens that you want to move, you'll
want to run
nodetool -h hostname cleanup

That will remove the no-longer necessary tokens from your smaller machines.
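
A concrete sketch of working out the target layout first (illustrative only;
these are the evenly spaced RandomPartitioner tokens you would then skew
toward the faster machines):

    # baseline tokens for a 5 node ring: i * 2^127 / 5
    for i in 0 1 2 3 4; do echo "$i * (2^127 / 5)" | bc; done

    # then, one node at a time:
    #   nodetool -h <host> move <newtoken>
    # and once all moves are done, on each node:
    #   nodetool -h <host> cleanup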

Please note that someone else may have some better insights than I into
whether or not  your strategy is going to be effective.  On the surface I
think what you are doing is logical, but I'm unsure of the  actual
performance gains you'll see.

David

On Wed, Jan 11, 2012 at 1:32 PM, Daning Wang dan...@netseer.com wrote:

 Hi All,

 We have a 5 node cluster (on 0.8.6), but two machines are slower and have
 less memory, so the performance was not good on those two machines for
 large volume traffic. I want to move some data from the slower machines to the
 faster machines to ease some load; the token ring will not be equally balanced.

 I am thinking the following steps,

 1. modify cassandra.yaml to change the initial token.
 2. restart cassandra (don't need to auto-bootstrap, right?)
 3. then run nodetool repair (or nodetool move? not sure which one to use)


 Is there any doc that has detailed steps about how to do this?

 Thanks in advance,

 Daning




Re: Syncing across environments

2012-01-11 Thread David McNelis
Not currently using any of those tools (though certainly an option, just
never looked into them).

Those tools seem more based around configuration of your
environments...where I'm more concerned with backfilling data from
production to early-stage environments to help facilitate development.
 Still DevOps at the end of the day...just don't know if those would be the
appropriate tools for the job.

David

On Wed, Jan 11, 2012 at 1:37 PM, aaron morton aa...@thelastpickle.comwrote:

 Nothing Cassandra specific that I am aware of. Do you have any ops
 automation such as chef or puppet?

 DataStax make their chef cookbooks available here
 https://github.com/riptano/chef

 Cheers

 -
 Aaron Morton
 Freelance Developer
 @aaronmorton
 http://www.thelastpickle.com

 On 11/01/2012, at 9:53 AM, David McNelis wrote:

 Is anyone familiar with any tools that are already available to allow for
 configurable synchronization of different clusters?

 Specifically for purposes of development, i.e. Dev, staging, test, and
 production cassandra environments, so that you can easily plug in the
 information that you want to filter back down to your 'lower level'
 environments...

 If not, I'm interested in starting working on something like that, so if
 you have specific thoughts about features/requirements for something
 extendable that you'd like to share I'm all ears.

 In general the main pieces that I know I would like to have on a column
 family basis:

 1) Synchronize the schema
 2) Specify keys or a range of keys to sync for that CF
 3) Support full CF sync
 4) Entirely configurable by either maven properties, basic properties, or
 xml file
 5) Basic reporting about what was synchronized
 6) Allow plugin development for mutating keys as you move to different
 environments (in case your keys in one environment need to be a different
 value in another environment; for example, you have a client_id based on an
 account number. The account number exists on dev and prod, but the
 client_id is different. Want to let a dev write a mutator plugin to
 update the key prior to it being written to the destination.)
 7) Support multiple destinations

 Any thoughts on this, folks?  I'd wager this is an issue just about all of
  us deal with, and we're probably all doing it in a little different way.

 David





Re: Rebalance cluster

2012-01-11 Thread aaron morton
I have good news and bad. 

The good news is I have a nice coffee. The bad news is it's pretty difficult to 
have some nodes with less load. 

In a cluster with 5 nodes and RF 3 each node holds the following token ranges. 

node 1: ranges 1, 5 and 4
node 2: ranges 2, 1 and 5
node 3: ranges 3, 2 and 1
node 4: ranges 4, 3 and 2
node 5: ranges 5, 4 and 3

The load on each node is its token range, and those of the preceding RF-1
nodes. e.g. in a balanced ring of 5 nodes with RF 3 each node has 20% of the
token ring and 60% of the total load.

If the token ring is split like this below, each node has the total
load shown after the /

node 1: 12.5% / 50%
node 2: 25% / 62.5%
node 3: 25% / 62.5%
node 4: 12.5% / 62.5%
node 5: 25% / 62.5%

Only node 1 gets a small amount less. Try a different approach…

node 1: 12.5% / 62.5%
node 2: 12.5% / 50%
node 3: 25% / 50%
node 4: 25% / 62.5%
node 5: 25% / 75%

That's even worse. 
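
To sanity check splits like these, the load rule above is easy to script; a
hedged bash sketch (RF 3, ownership percentages listed in ring order, nothing
Cassandra specific about it):

    own=(12.5 25 25 12.5 25)   # per-node ownership, ring order
    n=${#own[@]}
    for ((i=0; i<n; i++)); do
      load=$(echo "${own[i]} + ${own[(i+n-1)%n]} + ${own[(i+n-2)%n]}" | bc)
      echo "node $((i+1)): owns ${own[i]}%, carries ${load}%"
    done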

David is right to use nodetool move. It's a good idea to update the initial
tokens in the yaml (or your ops config) after the fact even though they are not
used.

Hope that helps.

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 12/01/2012, at 8:41 AM, David McNelis wrote:

 Daning,
 
 You can see how to do this basic sort of thing on the Wiki's operations page 
 ( http://wiki.apache.org/cassandra/Operations )
 
 In short, you'll want to run:
 nodetool -h hostname move newtoken
 
 Then, once you've updated each of the tokens that you want to move, you'll
 want to run
 nodetool -h hostname cleanup
 
 That will remove the no-longer necessary tokens from your smaller machines.
 
 Please note that someone else may have some better insights than I into 
 whether or not  your strategy is going to be effective.  On the surface I 
 think what you are doing is logical, but I'm unsure of the  actual 
 performance gains you'll see.
 
 David
 
 On Wed, Jan 11, 2012 at 1:32 PM, Daning Wang dan...@netseer.com wrote:
 Hi All,
 
 We have a 5 node cluster (on 0.8.6), but two machines are slower and have less
 memory, so the performance was not good on those two machines for large
 volume traffic. I want to move some data from the slower machines to the faster
 machines to ease some load; the token ring will not be equally balanced.
 
 I am thinking the following steps,
 
 1. modify cassandra.yaml to change the initial token.
 2. restart cassandra (don't need to auto-bootstrap, right?)
 3. then run nodetool repair (or nodetool move? not sure which one to use)
 
 
 Is there any doc that has detailed steps about how to do this?
 
 Thanks in advance,
 
 Daning
 
 



Re: Syncing across environments

2012-01-11 Thread aaron morton
You can use chef for setting up the cluster and pull snapshots down for the 
data. That will require a 1 to 1 mapping between the prod and dev / QA 
clusters. 

That way you can also test the DevOps processes for deployment and disaster 
recovery. 
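
A rough shape of that, hedged (host names and data paths are placeholders, and
the snapshot directory layout differs between versions):

    # on each prod node: take a snapshot (hard links under the data directory)
    nodetool -h prod-node1 snapshot

    # pull the snapshotted sstables for a keyspace down to the matching dev
    # node, then drop them into its data directory and restart it
    rsync -a prod-node1:/var/lib/cassandra/data/MyKeyspace/snapshots/ /tmp/snap/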

Cheers
 
-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 12/01/2012, at 8:47 AM, David McNelis wrote:

 Not currently using any of those tools (though certainly an option, just 
 never looked into them).
 
 Those tools seem more based around configuration of your environments...where 
 I'm more concerned with backfilling data from production to early-stage 
 environments to help facilitate development.  Still DevOps at the end of the 
 day...just don't know if those would be the appropriate tools for the job.
 
 David
 
 On Wed, Jan 11, 2012 at 1:37 PM, aaron morton aa...@thelastpickle.com wrote:
 Nothing Cassandra specific that I am aware of. Do you have any ops automation
 such as chef or puppet?

 DataStax make their chef cookbooks available here
 https://github.com/riptano/chef
 
 Cheers
 
 -
 Aaron Morton
 Freelance Developer
 @aaronmorton
 http://www.thelastpickle.com
 
 On 11/01/2012, at 9:53 AM, David McNelis wrote:
 
 Is anyone familiar with any tools that are already available to allow for 
 configurable synchronization of different clusters?
 
 Specifically for purposes of development, i.e. Dev, staging, test, and 
 production cassandra environments, so that you can easily plug in the 
 information that you want to filter back down to your 'lower level' 
 environments...
 
 If not, I'm interested in starting working on something like that, so if you 
 have specific thoughts about features/requirements for something extendable 
 that you'd like to share I'm all ears.
 
 In general the main pieces that I know I would like to have on a column 
 family basis:
 
 1) Synchronize the schema
 2) Specify keys or a range of keys to sync for that CF
 3) Support full CF sync
 4) Entirely configurable by either maven properties, basic properties, or 
 xml file
 5) Basic reporting about what was synchronized
 6) Allow plugin development for mutating keys as you move to different
 environments (in case your keys in one environment need to be a different
 value in another environment; for example, you have a client_id based on an
 account number. The account number exists on dev and prod, but the
 client_id is different. Want to let a dev write a mutator plugin to update
 the key prior to it being written to the destination.)
 7) Support multiple destinations
 
 Any thoughts on this, folks?  I'd wager this is an issue just about all of  
 us deal with, and we're probably all doing it in a little different way.
 
 David
 
 



Re: Syncing across environments

2012-01-11 Thread David McNelis
Right.

One of the challenges though is if the keys wouldn't necessarily directly
translate. Say I've got a MySQL instance in my Dev VM and my keys are
based on PKs in that database...but my MySQL instance by nature of what
I'm working on has some different IDs...so in some way I'd need to mutate
the keys as they come into my local environment. Additionally, I wouldn't
want my entire Prod cluster in dev either...since generally speaking I'd
only need a limited set of data in Dev compared to prod.

Does that make sense?  Kind of rambling response.

David

On Wed, Jan 11, 2012 at 2:03 PM, aaron morton aa...@thelastpickle.comwrote:

 You can use chef for setting up the cluster and pull snapshots down for
 the data. That will require a 1 to 1 mapping between the prod and dev / QA
 clusters.

 That way you can also test the DevOps processes for deployment and
 disaster recovery.

 Cheers

 -
 Aaron Morton
 Freelance Developer
 @aaronmorton
 http://www.thelastpickle.com

 On 12/01/2012, at 8:47 AM, David McNelis wrote:

 Not currently using any of those tools (though certainly an option, just
 never looked into them).

 Those tools seem more based around configuration of your
 environments...where I'm more concerned with backfilling data from
 production to early-stage environments to help facilitate development.
  Still DevOps at the end of the day...just don't know if those would be the
 appropriate tools for the job.

 David

 On Wed, Jan 11, 2012 at 1:37 PM, aaron morton aa...@thelastpickle.comwrote:

 Nothing Cassandra specific that I am aware of. Do you have any ops
 automation such as chef or puppet?

 DataStax make their chef cookbooks available here
 https://github.com/riptano/chef

 Cheers

   -
 Aaron Morton
 Freelance Developer
 @aaronmorton
 http://www.thelastpickle.com

 On 11/01/2012, at 9:53 AM, David McNelis wrote:

 Is anyone familiar with any tools that are already available to allow for
 configurable synchronization of different clusters?

 Specifically for purposes of development, i.e. Dev, staging, test, and
 production cassandra environments, so that you can easily plug in the
 information that you want to filter back down to your 'lower level'
 environments...

 If not, I'm interested in starting working on something like that, so if
 you have specific thoughts about features/requirements for something
 extendable that you'd like to share I'm all ears.

 In general the main pieces that I know I would like to have on a column
 family basis:

 1) Synchronize the schema
 2) Specify keys or a range of keys to sync for that CF
 3) Support full CF sync
 4) Entirely configurable by either maven properties, basic properties, or
 xml file
 5) Basic reporting about what was synchronized
 6) Allow plugin development for mutating keys as you move to different
 environments (in case your keys in one environment need to be a different
 value in another environment; for example, you have a client_id based on an
 account number. The account number exists on dev and prod, but the
 client_id is different. Want to let a dev write a mutator plugin to
 update the key prior to it being written to the destination.)
 7) Support multiple destinations

 Any thoughts on this, folks?  I'd wager this is an issue just about all
 of  us deal with, and we're probably all doing it in a little different way.

 David







Exception thrown during repair, contains jmx classes -- why?

2012-01-11 Thread Maxim Potekhin
As per the trace below, there is jmx.mbeanserver involved. What I ran was a
common repair.

Is that right? What does this failure indicate?

        at org.apache.cassandra.service.StorageService.forceTableRepair(StorageService.java:1613)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:93)
        at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:27)
        at com.sun.jmx.mbeanserver.MBeanIntrospector.invokeM(MBeanIntrospector.java:208)
        at com.sun.jmx.mbeanserver.PerInterface.invoke(PerInterface.java:120)
        at com.sun.jmx.mbeanserver.MBeanSupport.invoke(MBeanSupport.java:262)
        at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(DefaultMBeanServerInterceptor.java:836)
        at com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(JmxMBeanServer.java:761)
        at javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1427)
        at javax.management.remote.rmi.RMIConnectionImpl.access$200(RMIConnectionImpl.java:72)
        at javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1265)
        at javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1360)
        at javax.management.remote.rmi.RMIConnectionImpl.invoke(RMIConnectionImpl.java:788)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:305)
        at sun.rmi.transport.Transport$1.run(Transport.java:159)
        at java.security.AccessController.doPrivileged(Native Method)
        at sun.rmi.transport.Transport.serviceCall(Transport.java:155)
        at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:535)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:790)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:649)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)





Re: Rebalance cluster

2012-01-11 Thread Antonio Martinez
There is another possible approach that I reference from the original
Dynamo paper. Instead of trying to manage a heterogeneous cluster at the
cassandra level, it might be possible to take the approach Amazon took.
Find the smallest common denominator of resources for your nodes (most likely
your smallest node) and virtualize the others to that level. For example,
say you have 3 physical computers, one with one processor and 2GB of
memory, one with 2 processors and 4GB, and one with 4 and 8GB. You could
make the smallest one your basic block and then put two one-processor 2GB
VMs on the second machine and 4 of those on the third and largest machine.
Then instead of managing the three of them separately and worrying about
them being different you instead manage a ring of 7 equal nodes with equal
portions of the ring. This allows you to give smaller machines a lesser
load compared to the more powerful ones. The amazon paper on dynamo has
more information on how they did it and some of the tricks they use for
reliability.
http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf

Hope this helps somewhat

On Wed, Jan 11, 2012 at 2:00 PM, aaron morton aa...@thelastpickle.comwrote:

 I have good news and bad.

 The good news is I have a nice coffee. The bad news is it's pretty
 difficult to have some nodes with less load.

 In a cluster with 5 nodes and RF 3 each node holds the following token
 ranges.

 node 1: ranges 1, 5 and 4
 node 2: ranges 2, 1 and 5
 node 3: ranges 3, 2 and 1
 node 4: ranges 4, 3 and 2
 node 5: ranges 5, 4 and 3

 The load on each node is its token range, and those of the preceding RF-1
 nodes. e.g. in a balanced ring of 5 nodes with RF 3 each node has 20% of
 the token ring and 60% of the total load.

 If the token ring is split like this below, each node has the
 total load shown after the /

 node 1: 12.5% / 50%
 node 2: 25% / 62.5%
 node 3: 25% / 62.5%
 node 4: 12.5% / 62.5%
 node 5: 25% / 62.5%

 Only node 1 gets a small amount less. Try a different approach…

 node 1: 12.5% / 62.5%
 node 2: 12.5% / 50%
 node 3: 25% / 50%
 node 4: 25% / 62.5%
 node 5: 25% / 75%

 That's even worse.

 David is right to use nodetool move. It's a good idea to update the
 initial tokens in the yaml (or your ops config) after the fact even though
 they are not used.

 Hope that helps.

 -
 Aaron Morton
 Freelance Developer
 @aaronmorton
 http://www.thelastpickle.com

 On 12/01/2012, at 8:41 AM, David McNelis wrote:

 Daning,

 You can see how to do this basic sort of thing on the Wiki's operations
 page ( http://wiki.apache.org/cassandra/Operations )

 In short, you'll want to run:
 nodetool -h hostname move newtoken

 Then, once you've updated each of the tokens that you want to move, you'll
 want to run
 nodetool -h hostname cleanup

 That will remove the no-longer necessary tokens from your smaller machines.

 Please note that someone else may have some better insights than I into
 whether or not  your strategy is going to be effective.  On the surface I
 think what you are doing is logical, but I'm unsure of the  actual
 performance gains you'll see.

 David

 On Wed, Jan 11, 2012 at 1:32 PM, Daning Wang dan...@netseer.com wrote:

 Hi All,

 We have a 5 node cluster (on 0.8.6), but two machines are slower and have
 less memory, so the performance was not good on those two machines for
 large volume traffic. I want to move some data from the slower machines to the
 faster machines to ease some load; the token ring will not be equally balanced.

 I am thinking the following steps,

 1. modify cassandra.yaml to change the initial token.
 2. restart cassandra (don't need to auto-bootstrap, right?)
 3. then run nodetool repair (or nodetool move? not sure which one to use)


 Is there any doc that has detailed steps about how to do this?

 Thanks in advance,

 Daning






-- 
Antonio Perez de Tejada Martinez


Re: Exception thrown during repair, contains jmx classes -- why?

2012-01-11 Thread aaron morton
You are missing the name of the exception from the top of the stack trace. 
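
If it has scrolled away client side, the node usually logged it too; a hedged
way to pull it back (the log path is the packaged-install default, adjust for
your layout):

    grep -B2 -A5 'Exception' /var/log/cassandra/system.log | tail -40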

Cheers

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 12/01/2012, at 10:12 AM, Maxim Potekhin wrote:

 As per the trace below, there is jmx.mbeanserver involved. What I ran was a
 common repair.
 Is that right? What does this failure indicate?
 
        at org.apache.cassandra.service.StorageService.forceTableRepair(StorageService.java:1613)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:93)
        at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:27)
        at com.sun.jmx.mbeanserver.MBeanIntrospector.invokeM(MBeanIntrospector.java:208)
        at com.sun.jmx.mbeanserver.PerInterface.invoke(PerInterface.java:120)
        at com.sun.jmx.mbeanserver.MBeanSupport.invoke(MBeanSupport.java:262)
        at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(DefaultMBeanServerInterceptor.java:836)
        at com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(JmxMBeanServer.java:761)
        at javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1427)
        at javax.management.remote.rmi.RMIConnectionImpl.access$200(RMIConnectionImpl.java:72)
        at javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1265)
        at javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1360)
        at javax.management.remote.rmi.RMIConnectionImpl.invoke(RMIConnectionImpl.java:788)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:305)
        at sun.rmi.transport.Transport$1.run(Transport.java:159)
        at java.security.AccessController.doPrivileged(Native Method)
        at sun.rmi.transport.Transport.serviceCall(Transport.java:155)
        at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:535)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:790)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:649)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
 
 
 



Re: multiple datastax opscenter?

2012-01-11 Thread Jeesoo Shin
hmmm.. just an update.
sorry, it doesn't work.
I did get multiple web ports listening,
but OpsCenter seems to be using some shared resource and cannot
have multiple instances running.


On 1/11/12, Jeesoo Shin bsh...@gmail.com wrote:
 never mind.
 it works.

 my second cluster didn't have any keyspace, that's why.

 On 1/11/12, Jeesoo Shin bsh...@gmail.com wrote:
 Hi,
 I know OpsCenter doesn't support multiple clusters yet.
 I tried to install and run multiple OpsCenters on one server with different
 ports.
 different ports for web and agent, same cassandra JMX and thrift ports.
 But the second OpsCenter doesn't show any cluster node number. (The web UI
 is working.)

 Am I missing a config? Or is it not possible to run multiple OpsCenters?


 thanks.




Re: How to control location of data?

2012-01-11 Thread Viktor Jevdokimov
The idea behind a client that controls the location of data is performance: to
avoid unnecessary network round-trips between nodes and unnecessary caching
of backup ranges. All of this is mostly true for reads at CL.ONE and RF 1.

How it works (in our case):

Our client uses describe_ring that returns ring for specified Keyspace with
token ranges and replica endpoints for each range. First node in the list
for a token range is a kind of a primary, others are backup replicas.

The client for most requests for a single key calculates a token and
connects to node that is a primary node for this token. If primary is down,
next endpoint from the list of endpoints for that token range is used.

This way the network load between nodes is much lower. In our case, when
load balancing just rotated across all nodes we saw a 100Mbps load on a node,
versus only 25Mbps with the approach above.

The cache on a single node fills up with data from that node's primary range,
avoiding caching of the replica ranges that also belong to this node via
RF.

The downside is that when the primary node is not accessible, the backup node
has no cache for the range we switch to.
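
As a back-of-the-envelope illustration of the routing step (hedged:
RandomPartitioner actually takes the absolute value of the signed MD5
interpretation, which this one-liner glosses over):

    key="user123"
    # token ~ md5(key) as a big decimal integer; compare it against the
    # start/end tokens from describe_ring to pick the primary endpoint
    printf '%s' "$key" | md5sum | cut -d' ' -f1 | tr 'a-f' 'A-F' \
      | { read hex; echo "ibase=16; $hex" | bc; }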


2012/1/11 Andreas Rudolph andreas.rudo...@spontech-spine.com

 Hi!

 ...
 Again, it's probably a bad idea.

 I agree on that, now.

 Thank you.


 Cheers

 -
 Aaron Morton
 Freelance Developer
 @aaronmorton
 http://www.thelastpickle.com

 On 11/01/2012, at 4:56 AM, Roland Gude wrote:

 Each node in the cluster is assigned a token (can be done automatically –
 but usually should not be)
 The token of a node is the start token of the partition it is responsible
 for (and the token of the next node is the end token of the current tokens
 partition)

 Assume you have the following nodes/tokens (which are usually numbers but
 for the example I will use letters)

 N1/A
 N2/D
 N3/M
 N4/X

 This means that N1 is responsible (primary) for [A-D)
N2 for [D-M)
N3 for [M-X)
 And N4 for [X-A)

 If you have a replication factor of 1 data will go on the nodes like this:

 B -> N1
 E -> N2
 X -> N4

 And so on
 If you have a higher replication factor, the placement strategy decides
 which node will take replicas of which partition (becoming secondary node
 for that partition)
 Simple strategy will just put the replica on the next node in the ring
 So same example as above but RF of 2 and simple strategy:

 B -> N1 and N2
 E -> N2 and N3
 X -> N4 and N1

 Other strategies can factor in things like “put data in another
 datacenter” or “put data in another rack” or such things.

 Even though the terms primary and secondary imply some means of quality or
 consistency, this is not the case. If a node is responsible for a piece of
 data, it will store it.

 But placement of the replicas is usually only relevant for availability
 reasons (i.e. disaster recovery etc.)
 Actual location should mean nothing to most applications as you can ask
 any node for the data you want and it will provide it to you (fetching it
 from the responsible nodes).
 This should be sufficient in almost all cases.

 So in the above example again, you can ask N3 “what data is available” and
 it will tell you: B, E and X, or you could ask it “give me X” and it will
 fetch it from N4 or N1 or both of them depending on consistency
 configuration and return the data to you.

 So actually if you use Cassandra – for the application the actual storage
 location of the data should not matter. It will be available anywhere in
 the cluster if it is stored on any reachable node.

 From: Andreas Rudolph [mailto:andreas.rudo...@spontech-spine.com]
 Sent: Tuesday, 10 January 2012 15:06
 To: user@cassandra.apache.org
 Subject: Re: How to control location of data?

 Hi!

 Thank you for your last reply. I'm still wondering if I got you right...

 ...
 A partitioner decides into which partition a piece of data belongs

 Does your statement imply that the partitioner does not take any decisions
 at all on the (physical) storage location? Or put another way: what do you
 mean by partition?

 To quote http://wiki.apache.org/cassandra/ArchitectureInternals: ...
 AbstractReplicationStrategy controls what nodes get secondary, tertiary,
 etc. replicas of each key range. Primary replica is always determined by
 the token ring (...)

 ...
 You can select different placement strategies and partitioners for
 different keyspaces, thereby choosing known data to be stored on known
 hosts.
 This is however discouraged for various reasons – e.g. you need a lot of
 knowledge about your data to keep the cluster balanced. What is your
 use case for this requirement? There is probably a more suitable solution.
 What we want is to