Re: Schema clone ...
> * Grab the system sstables from one of the 0.7 nodes and spin up a temp 1.0 machine with them, then use the command.

Probably I'm still sleeping, but I can't get what I want! :-( I've copied the SSTables of a node to my own computer, where I installed a Cassandra 1.0 just for the purpose. I've copied them into the data folder under the keyspace name:

carlo@ubpc:/store/cassandra/data/social$

Now here I have lots of files like these:

User-f-74-Data.db
User-f-74-Filter.db
User-f-74-Index.db
User-f-74-Statistics.db

But now, how do I tell Cassandra "hey, load the content of social"? Did I miss something?

Cheers,
Carlo

---- Original message ----
Subject: Re: Schema clone ...

ah, sorry brain not good work. It's only in 0.8. You could either:

* write the CLI script by hand, or
* grab the system sstables from one of the 0.7 nodes and spin up a temp 1.0 machine with them, then use the command, or
* see if your Cassandra client software can help.

Hope that helps.

-Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 9/01/2012, at 11:41 PM, cbert...@libero.it wrote:

I was just trying it but ... in the 0.7 CLI there is no "show schema" command. When I connect with a 1.0 CLI to my 0.7 cluster ...

[default@social] show schema;
null

I always get null as the answer! :-| Any tip for this?

ty, Cheers
Carlo

---- Original message ----
From: aa...@thelastpickle.com
Date: 09/01/2012 11.33
To: user@cassandra.apache.org, cbert...@libero.it
Subject: Re: Schema clone ...

Try "show schema" in the CLI.

Cheers
-Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 9/01/2012, at 11:12 PM, cbert...@libero.it wrote:

Hi, I have created a new dev cluster with Cassandra 1.0 -- I would like to have the same CFs that I have in the 0.7 one, but I don't need the data to be there, just the schema. What is the fastest way to do it without running 30 "create column family" commands?

Best regards,
Carlo
Re: How to control location of data?
Hi,

I think everything is called a replica, so if data is on 3 nodes you have 3 replicas. There is no such thing as an original.

A partitioner decides into which partition a piece of data belongs. A replica placement strategy decides which partition goes on which node. You cannot suppress the partitioner.

You can select different placement strategies and partitioners for different keyspaces, thereby choosing known data to be stored on known hosts. This is however discouraged for various reasons, e.g. you need a lot of knowledge about your data to keep the cluster balanced. What is your use case for this requirement? There is probably a more suitable solution.

From: Andreas Rudolph [mailto:andreas.rudo...@spontech-spine.com]
Sent: Tuesday, 10 January 2012 09:53
To: user@cassandra.apache.org
Subject: How to control location of data?

Hi!

We're evaluating Cassandra for our storage needs. One of the key benefits we see is the online replication of the data, that is, an easy way to share data across nodes. But we need to precisely control which node group specific parts of a key space (columns/column families) are stored on. Now we're having trouble understanding the documentation. Could anyone help us find some answers to our questions?

* What does the term replica mean? If a key is stored on exactly three nodes in a cluster, is it correct to say that there are three replicas of that key, or are there just two replicas (copies) and one original?

* What is the relation between the Cassandra concepts Partitioner and Replica Placement Strategy? According to documentation found on the DataStax web site and the architecture internals page on the Cassandra wiki, the first storage location of a key (and its associated data) is determined by the Partitioner, whereas additional storage locations are defined by the Replica Placement Strategy.
I'm wondering if I could completely redefine the way nodes are selected to store a key just by implementing my own subclass of AbstractReplicationStrategy and configuring that subclass into the key space.

* How can I prevent the Partitioner from being consulted at all to determine which node stores a key first?

* Is a key space always distributed across the whole cluster? Is it possible to configure Cassandra in such a way that more or less freely chosen parts of a key space (columns) are stored on arbitrarily chosen nodes?

Any tips would be very appreciated :-)
Re: How to control location of data?
Hi!

Thank you for your last reply. I'm still wondering if I got you right...

> A partitioner decides into which partition a piece of data belongs.

Does your statement imply that the partitioner does not take any decisions at all on the (physical) storage location? Or, put another way: what do you mean by "partition"? To quote http://wiki.apache.org/cassandra/ArchitectureInternals:

> AbstractReplicationStrategy controls what nodes get secondary, tertiary, etc. replicas of each key range. Primary replica is always determined by the token ring (...)

> You can select different placement strategies and partitioners for different keyspaces, thereby choosing known data to be stored on known hosts. This is however discouraged for various reasons, i.e. you need a lot of knowledge about your data to keep the cluster balanced. What is your use case for this requirement? There is probably a more suitable solution.

What we want is to partition the cluster with respect to key spaces. That is, we want to establish an association between nodes and key spaces so that a node of the cluster holds data from a key space if and only if that node is a *member* of that key space. To our knowledge Cassandra has no built-in way to specify such a membership relation. Therefore we thought of implementing our own replica placement strategy, until we started to assume that the partitioner had to be replaced, too, to accomplish the task.

Do you have any ideas?
Re: How to control location of data?
Each node of the ring has a unique token representing the node's logical position in the cluster. When you perform an operation on a row, a token is calculated from the row; the node whose token is closest to the row token will store the data (along with the RF-1 remaining replica nodes). This technique should guarantee that data is balanced among the cluster (if you use the RandomPartitioner).

Regards,
Carlo
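Carlo's description of row tokens can be sketched in Python. This is a rough model for illustration, not Cassandra's actual code: the RandomPartitioner derives a token from the MD5 hash of the row key, so placement is a pure function of the key and every node agrees on which replicas own a row without any central lookup.

```python
import hashlib

TOKEN_SPACE = 2 ** 127  # RandomPartitioner tokens fall in [0, 2**127)

def row_token(key: bytes) -> int:
    # Model of the RandomPartitioner: the MD5 digest of the row key,
    # read as a big-endian integer and folded into the token space.
    return int.from_bytes(hashlib.md5(key).digest(), "big") % TOKEN_SPACE

# The same key always hashes to the same token...
t1 = row_token(b"user:carlo")
t2 = row_token(b"user:carlo")
# ...while different keys scatter roughly uniformly over the ring,
# which is what keeps the cluster balanced.
```

The real implementation works on Java BigIntegers, but the point stands: with the RandomPartitioner the storage location follows from the hash, not from anything you choose.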
DataStax offers two new advanced Apache Cassandra classes (San Mateo)
DataStax is now offering two advanced classes for Apache Cassandra.

Tuesday, February 14 - Wednesday, February 15 - San Mateo
Advanced Modeling and Analytics with Apache Cassandra
Additional details and RSVP at Eventbrite: http://cassandra-modeling-sanmateo.eventbrite.com/

Thursday, February 16 - Friday, February 17 - San Mateo
Advanced Operations with Apache Cassandra
Additional details and RSVP at Eventbrite: http://advancedops-sanmateo.eventbrite.com/

These classes will be led by Ben Coverston.

--
Lynn Bender
Events and Outreach, DataStax
http://www.datastax.com
http://twitter.com/linearb
http://www.linkedin.com/in/lynnbender
Re: How to control location of data?
Each node in the cluster is assigned a token (this can be done automatically, but usually should not be). The token of a node is the start token of the partition it is responsible for (and the token of the next node is the end token of the current token's partition).

Assume you have the following nodes/tokens (tokens are usually numbers, but for the example I will use letters):

N1/A
N2/D
N3/M
N4/X

This means that N1 is responsible (primary) for [A-D), N2 for [D-M), N3 for [M-X), and N4 for [X-A).

If you have a replication factor of 1, data will go on the nodes like this:

B -> N1
E -> N2
X -> N4

And so on. If you have a higher replication factor, the placement strategy decides which node will take replicas of which partition (becoming a secondary node for that partition). Simple strategy will just put the replica on the next node in the ring. So, same example as above but with an RF of 2 and simple strategy:

B -> N1 and N2
E -> N2 and N3
X -> N4 and N1

Other strategies can factor in things like "put data in another datacenter" or "put data in another rack". Even though the terms primary and secondary imply some measure of quality or consistency, this is not the case. If a node is responsible for a piece of data, it will store it. Placement of the replicas is usually only relevant for availability reasons (i.e. disaster recovery etc.).

The actual location should mean nothing to most applications, as you can ask any node for the data you want and it will provide it to you (fetching it from the responsible nodes). This should be sufficient in almost all cases. So, in the above example again, you can ask N3 "what data is available" and it will tell you: B, E and X; or you could ask it "give me X" and it will fetch it from N4 or N1 or both of them, depending on consistency configuration, and return the data to you.

So if you use Cassandra, the actual storage location of the data should not matter to the application. It will be available anywhere in the cluster if it is stored on any reachable node.
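Roland's walk-through maps directly to a few lines of Python. The sketch below follows his convention (a node's token starts the range it is responsible for) and his description of the simple strategy (further replicas go to the next nodes around the ring); it is an illustration of the idea, not Cassandra's implementation.

```python
from bisect import bisect_right

# Nodes and their tokens, as in the example (letters instead of numbers).
RING = {"A": "N1", "D": "N2", "M": "N3", "X": "N4"}

def replicas(key_token: str, rf: int) -> list:
    """Return the rf nodes that store a key, SimpleStrategy style:
    the primary is the node whose range [token, next_token) contains
    the key token; further replicas are the next nodes on the ring."""
    tokens = sorted(RING)
    # Largest node token <= key token; wrap to the last node if the
    # key token sorts before every node token.
    i = bisect_right(tokens, key_token) - 1
    return [RING[tokens[(i + k) % len(tokens)]] for k in range(rf)]
```

Checking it against the example: `replicas("B", 1)` gives `["N1"]`, and with RF 2, `replicas("E", 2)` gives `["N2", "N3"]` and `replicas("X", 2)` gives `["N4", "N1"]`, matching the assignments above.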
Announcing Countandra 0.5
Inspired by Twitter's Rainbird project, Countandra is a hierarchical, distributed counting engine that works at scale. It provides a complete HTTP-based interface for both posting events and running queries. Event postings use a FORMS-compatible syntax, and query results are emitted as JSON to make them directly manipulable by browsers.

Read more about it and try it at www.countandra.org.

Regards
Milind
Re: Schema clone ...
> * Grab the system sstables from one of the 0.7 nodes and spin up a temp 1.0 machine with them, then use the command.

Grab the *system* sstables: Migrations, Schema etc. in cassandra/data/system.

Cheers
- Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com
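For the "write the CLI script by hand" route mentioned earlier in the thread, a schema script can be replayed against the new cluster with `cassandra-cli -f`. A minimal sketch follows; the keyspace and column family definitions are examples, not Carlo's actual schema, and the `strategy_options` syntax varies slightly between CLI versions:

```
-- schema.txt: recreate the 0.7 column families on the 1.0 dev cluster
create keyspace social
    with placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy'
    and strategy_options = {replication_factor:1};
use social;
create column family User with comparator = UTF8Type;
-- ...repeat for the remaining column families...
```

Run it with something like `cassandra-cli -h localhost -f schema.txt`.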
Re: How to control location of data?
> What we want is to partition the cluster with respect to key spaces.

Why do you want to do this? (It's probably a bad idea.) Background on the partitioner, placement strategy and the snitch is here: http://thelastpickle.com/2011/02/07/Introduction-to-Cassandra/

Now here's how to do it. Use the NetworkTopologyStrategy and the PropertyFileSnitch (or the RackInferringSnitch; actually, don't use the RackInferringSnitch, see http://goo.gl/cLPjb).

1) In the PropertyFileSnitch, place the nodes into different data centers; see conf/cassandra-topology.properties for examples. E.g. nodes 1, 2 and 3 in DC1, then nodes 4, 5 and 6 in DC2.

2) In the definition of the keyspace, use the NetworkTopologyStrategy and place replicas in the DCs that contain the nodes you want. E.g. ks1 with strategy_options={DC1:3} and ks2 with strategy_options={DC2:3}.

3) You will still want to use the RandomPartitioner.

4) Rows with the same key in different keyspaces cannot be written to or read from in the same request.

Again, it's probably a bad idea.
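Steps 1 and 2 might look like the following in practice. The IP addresses and replica counts are illustrative only, and the CLI syntax differs slightly across versions:

```
# conf/cassandra-topology.properties: pin nodes to data centers
192.168.1.1=DC1:RAC1
192.168.1.2=DC1:RAC1
192.168.1.3=DC1:RAC1
192.168.2.1=DC2:RAC1
192.168.2.2=DC2:RAC1
192.168.2.3=DC2:RAC1
default=DC1:RAC1
```

```
-- CLI: each keyspace keeps all of its replicas in one data center
create keyspace ks1
    with placement_strategy = 'org.apache.cassandra.locator.NetworkTopologyStrategy'
    and strategy_options = {DC1:3};
create keyspace ks2
    with placement_strategy = 'org.apache.cassandra.locator.NetworkTopologyStrategy'
    and strategy_options = {DC2:3};
```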
Cheers
- Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com
Re: Should I throttle deletes?
Thanks, this makes sense. I'll try that.

Maxim

On 1/6/2012 10:51 AM, Vitalii Tymchyshyn wrote:

Do you mean on writes? Yes, your timeouts must be set so that your write batch can complete before the timeout elapses. But this will lower the write load, so reads should not time out.

Best regards,
Vitalii Tymchyshyn

06.01.12 17:37, Philippe wrote:

But you will then get timeouts.

On 6 Jan 2012 15:17, Vitalii Tymchyshyn <tiv...@gmail.com> wrote:

05.01.12 22:29, Philippe wrote:

> Then I do have a question: what do people generally use as the batch size? I used to do batches from 500 to 2000 like you do. After investigating issues such as the one you've encountered, I've moved to batches of 20 for writes and 256 for reads. Everything is a lot smoother: no more timeouts.

I'd rather reduce the mutation thread pool with the concurrent_writes setting. This will lower the server load no matter how many clients are sending batches, while at the same time you still have good batching.

Best regards,
Vitalii Tymchyshyn
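The "smaller batches" advice from the thread boils down to a simple chunking helper. The batch size below is the one Philippe reports working well for writes, and `send` is a stand-in for whatever client call actually performs the mutation:

```python
def chunked(items, size):
    """Split a list of mutations into fixed-size batches."""
    return [items[i:i + size] for i in range(0, len(items), size)]

# Size reported to work well in the thread: 20 for writes (256 for reads).
WRITE_BATCH_SIZE = 20

def send_in_batches(mutations, send):
    # Rather than one huge batch of 2000 rows (which risks running past
    # the server-side rpc timeout), issue many small batches.
    for batch in chunked(mutations, WRITE_BATCH_SIZE):
        send(batch)
```

As Vitalii notes, throttling on the server side via `concurrent_writes` is an alternative that works regardless of how many clients are sending batches.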
Syncing across environments
Is anyone familiar with any tools already available that allow for configurable synchronization of different clusters? Specifically for purposes of development, i.e. dev, staging, test, and production Cassandra environments, so that you can easily plug in the information that you want to filter back down to your "lower level" environments.

If not, I'm interested in starting work on something like that, so if you have specific thoughts about features/requirements for something extendable that you'd like to share, I'm all ears. In general, the main pieces that I know I would like to have, on a column family basis:

1) Synchronize the schema
2) Specify keys or a range of keys to sync for that CF
3) Support full CF sync
4) Entirely configurable by either Maven properties, basic properties, or an XML file
5) Basic reporting about what was synchronized
6) Allow plugin development for mutating keys as you move to different environments (in case your keys in one environment need to be a different value in another environment; for example, you have a client_id based on an account number. The account number exists on dev and prod, but the client_id is different. I want to let a dev write a mutator plugin to update the key prior to writing it to the destination.)
7) Support multiple destinations

Any thoughts on this, folks? I'd wager this is an issue just about all of us deal with, and we're probably all handling it in a slightly different way.

David
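For item 6 in the list above, the key-mutator plugin could be as simple as a one-method interface. Everything below (the class names, the mapping) is a hypothetical sketch of what such a hook might look like, not an existing tool:

```python
class KeyMutator:
    """Plugin hook: rewrite a row key while syncing a column family
    from one environment (e.g. prod) to another (e.g. dev)."""

    def mutate(self, key: str) -> str:
        raise NotImplementedError

class ClientIdMutator(KeyMutator):
    """Example from the thread: the same account exists in prod and
    dev, but its client_id differs, so keys must be translated
    before writing to the destination cluster."""

    def __init__(self, prod_to_dev: dict):
        self.prod_to_dev = prod_to_dev

    def mutate(self, key: str) -> str:
        # Keys without a known translation pass through unchanged.
        return self.prod_to_dev.get(key, key)
```

The sync tool would call `mutate` on each row key just before the write to the destination environment, so translation logic stays out of the core sync code.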
Example Hadoop JobConf for reading from Cassandra?
I'm trying to port the Hadoop InputFormat to Peregrine (another MapReduce implementation I'm working on) ... http://peregrine_mapreduce.bitbucket.org/

The problem is that I can't get it to work with my config, because the documentation is a bit sparse. I could probably spend a ton of time tracking this down, but I figured I'd be lazy and just ask :) Can someone post an example Hadoop JobConf so I can use it as a template? It's hard to figure out which params are required, optional, etc.

--
Founder/CEO Spinn3r.com
http://spinn3r.com/
Location: San Francisco, CA
Skype: burtonator
Skype-in: (415) 871-0687
Re: How to control location of data?
Small correction: the token range for each node is (previous_token, my_token], where "(" means exclusive and "]" means inclusive. So in the case below, N1 is responsible for X+1 through A.

maki

2012/1/11 Roland Gude <roland.g...@yoochoose.com>:
> Each node in the cluster is assigned a token (can be done automatically, but usually should not be).
> Assume you have the following nodes/tokens (which are usually numbers, but for the example I will use letters): N1/A N2/D N3/M N4/X
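Maki's correction changes which node owns a key: under the (previous_token, my_token] rule, the owner is the first node whose token is greater than or equal to the key's token, wrapping past the highest token back to the lowest. A sketch with the same letter tokens, again for illustration rather than as Cassandra's actual code:

```python
from bisect import bisect_left

RING = {"A": "N1", "D": "N2", "M": "N3", "X": "N4"}

def owner(key_token: str) -> str:
    """Primary replica under the (previous_token, my_token] rule:
    the first node token >= the key token, wrapping around the ring."""
    tokens = sorted(RING)
    i = bisect_left(tokens, key_token) % len(tokens)
    return RING[tokens[i]]
```

Under this rule, key B lands on N2 (it falls in the range (A, D]), a key exactly equal to a node token, such as A, belongs to that node (N1), and a key past X wraps around to N1.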