Re: Replication factor, LOCAL_QUORUM write consistency and materialized views
Hi, On Fri, May 17, 2024 at 6:18 PM Jon Haddad wrote: > I strongly suggest you don't use materialized views at all. There are > edge cases that in my opinion make them unsuitable for production, both in > terms of cluster stability as well as data integrity. > Oh, there is already an open and fresh Jira ticket about it: https://issues.apache.org/jira/browse/CASSANDRA-19383 Bye, Gábor AUTH
Re: Replication factor, LOCAL_QUORUM write consistency and materialized views
Hi, On Fri, May 17, 2024 at 6:18 PM Jon Haddad wrote: > I strongly suggest you don't use materialized views at all. There are > edge cases that in my opinion make them unsuitable for production, both in > terms of cluster stability as well as data integrity. > I totally agree with you. But it looks like a strange and interesting issue... the affected table has only ~1300 rows and less than 200 kB of data. :) Also, I found the same issue: https://dba.stackexchange.com/questions/325140/single-node-failure-in-cassandra-4-0-7-causes-cluster-to-run-into-high-cpu Bye, Gábor AUTH > On Fri, May 17, 2024 at 8:58 AM Gábor Auth wrote: > >> Hi, >> >> I know, I know, the materialized view is experimental... :) >> >> So, I ran into a strange error. Among others, I have a very small 4-node >> cluster with very minimal data (~100 MB in all), the keyspace's >> replication factor is 3, and everything works fine... except: if I restart a >> node, I get a lot of errors with materialized views and consistency level >> ONE, but only for those tables for which there is more than one >> materialized view. >> >> Tables without a materialized view don't have the problem; they work fine. >> Tables that have one, but only one materialized view, also work fine. >> But for a table with more than one materialized view, whoops, the cluster >> crashes temporarily; I can also see on the calling side (Java backend) that >> no nodes are responding: >> >> Caused by: com.datastax.driver.core.exceptions.WriteFailureException: >> Cassandra failure during write query at consistency LOCAL_QUORUM (2 >> responses were required but only 1 replica responded, 2 failed) >> >> I am surprised by this behavior, because there is so little data >> involved, and it occurs only when there is more than one materialized view, >> so it might be a concurrency issue under the hood. >> >> Have you seen an issue like this?
>> >> Here is a stack trace on the Cassandra side:
>>
>> [cassandra-dc03-1] ERROR [MutationStage-1] 2024-05-17 08:51:47,425 Keyspace.java:652 - Unknown exception caught while attempting to update MaterializedView! pope.unit
>> [cassandra-dc03-1] org.apache.cassandra.exceptions.UnavailableException: Cannot achieve consistency level ONE
>> [cassandra-dc03-1] at org.apache.cassandra.exceptions.UnavailableException.create(UnavailableException.java:37)
>> [cassandra-dc03-1] at org.apache.cassandra.locator.ReplicaPlans.assureSufficientLiveReplicas(ReplicaPlans.java:170)
>> [cassandra-dc03-1] at org.apache.cassandra.locator.ReplicaPlans.assureSufficientLiveReplicasForWrite(ReplicaPlans.java:113)
>> [cassandra-dc03-1] at org.apache.cassandra.locator.ReplicaPlans.forWrite(ReplicaPlans.java:354)
>> [cassandra-dc03-1] at org.apache.cassandra.locator.ReplicaPlans.forWrite(ReplicaPlans.java:345)
>> [cassandra-dc03-1] at org.apache.cassandra.locator.ReplicaPlans.forWrite(ReplicaPlans.java:339)
>> [cassandra-dc03-1] at org.apache.cassandra.service.StorageProxy.wrapViewBatchResponseHandler(StorageProxy.java:1312)
>> [cassandra-dc03-1] at org.apache.cassandra.service.StorageProxy.mutateMV(StorageProxy.java:1004)
>> [cassandra-dc03-1] at org.apache.cassandra.db.view.TableViews.pushViewReplicaUpdates(TableViews.java:167)
>> [cassandra-dc03-1] at org.apache.cassandra.db.Keyspace.applyInternal(Keyspace.java:647)
>> [cassandra-dc03-1] at org.apache.cassandra.db.Keyspace.applyFuture(Keyspace.java:477)
>> [cassandra-dc03-1] at org.apache.cassandra.db.Mutation.applyFuture(Mutation.java:210)
>> [cassandra-dc03-1] at org.apache.cassandra.db.MutationVerbHandler.doVerb(MutationVerbHandler.java:58)
>> [cassandra-dc03-1] at org.apache.cassandra.net.InboundSink.lambda$new$0(InboundSink.java:78)
>> [cassandra-dc03-1] at org.apache.cassandra.net.InboundSink.accept(InboundSink.java:97)
>> [cassandra-dc03-1] at org.apache.cassandra.net.InboundSink.accept(InboundSink.java:45)
>> [cassandra-dc03-1] at org.apache.cassandra.net.InboundMessageHandler$ProcessMessage.run(InboundMessageHandler.java:432)
>> [cassandra-dc03-1] at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
>> [cassandra-dc03-1] at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:165)
>> [cassandra-dc03-1] at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$LocalSessionFutureTask.run(AbstractLocalAwareExecutorService.java:137)
>> [cassandra-dc03-1] at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:119)
>> [cassandra-dc03-1] at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>> [cassandra-dc03-1] at java.base/java.lang.Thread.run(Unknown Source)
>>
>> --
>> Bye,
>> Gábor AUTH
>
Re: Replication factor, LOCAL_QUORUM write consistency and materialized views
I strongly suggest you don't use materialized views at all. There are edge cases that in my opinion make them unsuitable for production, both in terms of cluster stability as well as data integrity. Jon On Fri, May 17, 2024 at 8:58 AM Gábor Auth wrote: > Hi, > > I know, I know, the materialized view is experimental... :) > > So, I ran into a strange error. Among others, I have a very small 4-node > cluster with very minimal data (~100 MB in all), the keyspace's > replication factor is 3, and everything works fine... except: if I restart a > node, I get a lot of errors with materialized views and consistency level > ONE, but only for those tables for which there is more than one > materialized view. > > Tables without a materialized view don't have the problem; they work fine. > Tables that have one, but only one materialized view, also work fine. > But for a table with more than one materialized view, whoops, the cluster > crashes temporarily; I can also see on the calling side (Java backend) that > no nodes are responding: > > Caused by: com.datastax.driver.core.exceptions.WriteFailureException: > Cassandra failure during write query at consistency LOCAL_QUORUM (2 > responses were required but only 1 replica responded, 2 failed) > > I am surprised by this behavior, because there is so little data involved, > and it occurs only when there is more than one materialized view, so it > might be a concurrency issue under the hood. > > Have you seen an issue like this? > > Here is a stack trace on the Cassandra side: > > [cassandra-dc03-1] ERROR [MutationStage-1] 2024-05-17 08:51:47,425 > Keyspace.java:652 - Unknown exception caught while attempting to update > MaterializedView!
pope.unit > [stack trace snipped; identical to the trace quoted earlier in the thread] > > -- > Bye, > Gábor AUTH >
Re: Replication Factor Change
Hello, OK, I got it, so I should set CL to ALL for reads; otherwise data may be retrieved from nodes that do not yet have the current record. Thanks for the help. Yulian Oifa On Thu, Nov 5, 2015 at 5:33 PM, Eric Stevens wrote: > If you switch reads to CL=LOCAL_ALL, you should be able to increase RF, > then run repair, and after repair is complete, go back to your old > consistency level. However, while you're operating at ALL consistency, you > have no tolerance for a node failure (but at RF=1 you already have no > tolerance for a node failure, so that doesn't really change your > availability model). > > On Thu, Nov 5, 2015 at 8:01 AM Yulian Oifa wrote: > >> Hello to all. >> I am planning to change replication factor from 1 to 3. >> Will it cause data read errors in time of nodes repair? >> >> Best regards >> Yulian Oifa >> >
Re: Replication Factor Change
If you switch reads to CL=LOCAL_ALL, you should be able to increase RF, then run repair, and after repair is complete, go back to your old consistency level. However, while you're operating at ALL consistency, you have no tolerance for a node failure (but at RF=1 you already have no tolerance for a node failure, so that doesn't really change your availability model). On Thu, Nov 5, 2015 at 8:01 AM Yulian Oifa wrote: > Hello to all. > I am planning to change replication factor from 1 to 3. > Will it cause data read errors in time of nodes repair? > > Best regards > Yulian Oifa >
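The pigeonhole argument behind Eric's advice can be sketched in a few lines of Python. This is a toy model of my own, not driver or server code: before repair finishes, only the replicas that already held the data under the old RF are guaranteed to have it, so a read is only guaranteed correct if it must touch at least one of them.

```python
def quorum(rf):
    """Number of replicas a QUORUM read/write must reach."""
    return rf // 2 + 1

def read_guaranteed_current(rf_old, rf_new, replicas_contacted):
    """After raising RF from rf_old to rf_new (repair not yet run), a read
    is only guaranteed to see the data if, by pigeonhole, the contacted
    replicas must include one of the rf_old original holders."""
    return replicas_contacted > rf_new - rf_old

# Going from RF=1 to RF=3: ONE (1 replica) and even QUORUM (2 replicas)
# can land entirely on the new, still-empty replicas; ALL (3) cannot.
print(read_guaranteed_current(1, 3, 1))          # CL=ONE    -> False
print(read_guaranteed_current(1, 3, quorum(3)))  # CL=QUORUM -> False
print(read_guaranteed_current(1, 3, 3))          # CL=ALL    -> True
```

This is why the safe sequence in the thread is: reads at ALL, raise RF, repair, then drop back to the normal consistency level.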
RE: Replication Factor Change
Hello, If current CL = ONE, be careful in production at the time of the replication factor change: reads will be spread across 3 replica nodes while the data is still being replicated to the new ones ==> so data read errors! From: Yulian Oifa [mailto:oifa.yul...@gmail.com] Sent: Thursday, 5 November 2015 16:02 To: user@cassandra.apache.org Subject: Replication Factor Change Hello to all. I am planning to change replication factor from 1 to 3. Will it cause data read errors in time of nodes repair? Best regards Yulian Oifa
Re: Re : Replication factor for system_auth keyspace
To elaborate on what Robert said, I think with most things technology related, the answer to these sorts of questions (i.e. "ideal settings") is usually "it depends." Remember that technology is a tool that we use to accomplish something we want. It's just a mechanism that we as humans use to exert our wishes on other things. In this case, cassandra allows us to exert our wishes on the data we need to have available. So think for a second about what you want. To be less philosophical and more practical, how many nodes are you comfortable losing or likely to lose? How many copies of your system_auth keyspace do you want to have always available? Also, what do you mean by "really long?" What version of cassandra are you using? If you are on 2.1, look at migrating to incremental repair. That it takes so long for such a small keyspace leads me to believe you're using sequential repair ... -V On Thu, Oct 15, 2015 at 7:46 PM, Robert Coli wrote: > On Thu, Oct 15, 2015 at 10:24 AM, sai krishnam raju potturi < > pskraj...@gmail.com> wrote: > >> we are deploying a new cluster with 2 datacenters, 48 nodes in each DC. >> For the system_auth keyspace, what should be the ideal replication_factor >> set? >> >> We tried setting the replication factor equal to the number of nodes in a >> datacenter, and the repair for the system_auth keyspace took really long. >> Your suggestions would be of great help. >> > > More than 1 and a lot less than 48. > > =Rob > >
Re: Re : Replication factor for system_auth keyspace
Thanks, guys, for the advice. We were running parallel repairs earlier, with cassandra version 2.0.14. As pointed out, setting the replication factor really high for system_auth was causing the repair to take really long. thanks Sai On Fri, Oct 16, 2015 at 9:56 AM, Victor Chen wrote: > To elaborate on what Robert said, I think with most things technology > related, the answer to these sorts of questions (i.e. "ideal settings") > is usually "it depends." Remember that technology is a tool that we use to > accomplish something we want. It's just a mechanism that we as humans use > to exert our wishes on other things. In this case, cassandra allows us to > exert our wishes on the data we need to have available. So think for a > second about what you want. To be less philosophical and more practical, > how many nodes are you comfortable losing or likely to lose? How many > copies of your system_auth keyspace do you want to have always available? > > Also, what do you mean by "really long?" What version of cassandra are you > using? If you are on 2.1, look at migrating to incremental repair. That it > takes so long for such a small keyspace leads me to believe you're using > sequential repair ... > > -V > > On Thu, Oct 15, 2015 at 7:46 PM, Robert Coli wrote: > >> On Thu, Oct 15, 2015 at 10:24 AM, sai krishnam raju potturi < >> pskraj...@gmail.com> wrote: > >> >>> we are deploying a new cluster with 2 datacenters, 48 nodes in each >>> DC. For the system_auth keyspace, what should be the ideal >>> replication_factor set? >>> >>> We tried setting the replication factor equal to the number of nodes in >>> a datacenter, and the repair for the system_auth keyspace took really long. >>> Your suggestions would be of great help. >>> >> >> More than 1 and a lot less than 48. >> >> =Rob >> >> >
Re: Re : Replication factor for system_auth keyspace
On Thu, Oct 15, 2015 at 10:24 AM, sai krishnam raju potturi < pskraj...@gmail.com> wrote: > we are deploying a new cluster with 2 datacenters, 48 nodes in each DC. > For the system_auth keyspace, what should be the ideal replication_factor > set? > > We tried setting the replication factor equal to the number of nodes in a > datacenter, and the repair for the system_auth keyspace took really long. > Your suggestions would be of great help. > More than 1 and a lot less than 48. =Rob
Re : Replication factor for system_auth keyspace
Hi; we are deploying a new cluster with 2 datacenters, 48 nodes in each DC. For the system_auth keyspace, what should the ideal replication_factor be set to? We tried setting the replication factor equal to the number of nodes in a datacenter, and the repair for the system_auth keyspace took really long. Your suggestions would be of great help. thanks Sai
Re: Replication factor 2 with immutable data
On Fri, Jul 25, 2014 at 10:46 AM, Jon Travis jtra...@p00p.org wrote: I have a couple questions regarding the availability of my data in a RF=2 scenario. You have just explained why consistency does not work with Replication Factor of fewer than 3 and Consistency Level of less than QUORUM. Basically, with RF=2, QUORUM is ALL, and you can't be available at ALL because that's impossible. =Rob
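Rob's arithmetic (with RF=2, QUORUM requires 2 of 2 replicas, i.e. it *is* ALL) can be checked with a couple of lines of Python; the helper names here are mine, not Cassandra API:

```python
def quorum(rf):
    """Replicas a QUORUM operation must reach: floor(rf/2) + 1."""
    return rf // 2 + 1

def tolerated_failures(rf):
    """Replicas that can be down while QUORUM still succeeds."""
    return rf - quorum(rf)

print(quorum(2), tolerated_failures(2))  # 2 0 -> QUORUM == ALL, zero headroom
print(quorum(3), tolerated_failures(3))  # 2 1 -> one replica may be down
```

So RF=2 gives you a second copy for durability, but no availability headroom for quorum operations; RF=3 is the smallest RF where QUORUM survives a single replica failure.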
Re: Replication Factor question
On Wed, Apr 16, 2014 at 1:47 AM, Markus Jais markus.j...@yahoo.de wrote: thanks. How many nodes do you have running in those 5 racks and RF 5? Only 5 nodes or more? While I haven't contemplated it too much, I'd think the absolute minimum would be RF=N=5, sure. The real minimum with headroom would depend on workload, but would probably be at least a few nodes greater than 5. =Rob
Re: Replication Factor question
Hi Rob, thanks. How many nodes do you have running in those 5 racks and RF 5? Only 5 nodes or more? Markus Robert Coli rc...@eventbrite.com wrote on Tuesday, 15 April 2014 at 20:36: On Tue, Apr 15, 2014 at 6:14 AM, Ken Hancock ken.hanc...@schange.com wrote: Keep in mind if you lose the wrong two, you can't satisfy quorum. In a 5-node cluster with RF=3, it would be impossible to lose 2 nodes without affecting quorum for at least some of your data. In a 6 node cluster, once you've lost one node, if you were to lose another, you only have a 1-in-5 chance of not affecting quorum for some of your data. This is why the real highly available way to run Cassandra with QUORUM is RF=5, with 5 racks. Briefly, any given node running a JVM based distributed application should be assumed to potentially become transiently unavailable for a short time, for example during long GC pauses or rolling restarts. There is also a chance of non-transient failure (hard down) at any time, and a much smaller chance of two simultaneous non-transient failures. If you have RF=3 and lose two nodes (one transient, the other non-transient) in a range, that range is now unavailable because quorum is 2 and 3-2 is 1, which is less than 2. If you have RF=5 and lose two nodes in the same way, quorum is 3 and 5-2 is 3, which is equal to 3. AFAICT, no one actually runs Cassandra in this way because keeping 5 copies of your already denormalized data seems excessive and is difficult to justify to management. =Rob
Re: Replication Factor question
Hi all, thanks for your answers. Very helpful. We plan to use enough nodes so that the failure of 1 or 2 machines is no problem. E.g. for a workload that can be handled by 3 nodes all the time, we would use at least 5, better 6 nodes to survive the failure of at least 2 nodes, even when the 2 nodes fail at the same time. This should allow the cluster to rebuild the missing nodes and still serve all requests with RF=3 and Quorum reads. All the best, Markus Tupshin Harper tups...@tupshin.com wrote on Monday, 14 April 2014 at 21:23: tl;dr make sure you have enough capacity in the event of node failure. For light workloads, that can be fulfilled with nodes=rf. -Tupshin On Apr 14, 2014 2:35 PM, Robert Coli rc...@eventbrite.com wrote: On Mon, Apr 14, 2014 at 2:25 AM, Markus Jais markus.j...@yahoo.de wrote: It is generally not recommended to set a replication factor of 3 if you have fewer than six nodes in a data center. I have a detailed post about this somewhere in the archives of this list (which I can't seem to find right now..) but briefly, the 6-for-3 advice relates to the percentage of capacity you have remaining when you have a node down. It has become slightly less accurate over time because vnodes reduce bootstrap time and there have been other improvements to node startup time. If you have fewer than 6 nodes with RF=3, you lose 1/6th of capacity when you lose a single node, which is a significant percentage of total cluster capacity. You then lose another meaningful percentage of your capacity when your existing nodes participate in rebuilding the missing node. If you are then unlucky enough to lose another node, you are missing a very significant percentage of your cluster capacity and have to use a relatively small fraction of it to rebuild the now two down nodes. I wouldn't generalize the rule of thumb as don't run under N=RF*2, but rather as probably don't run RF=3 under about 6 nodes. 
IOW, in my view, the most operationally sane initial number of nodes for RF=3 is likely closer to 6 than 3. =Rob
Re: Replication Factor question
Keep in mind if you lose the wrong two, you can't satisfy quorum. In a 5-node cluster with RF=3, it would be impossible to lose 2 nodes without affecting quorum for at least some of your data. In a 6 node cluster, once you've lost one node, if you were to lose another, you only have a 1-in-5 chance of not affecting quorum for some of your data. In much larger clusters, it becomes less probable that you will lose multiple nodes within a RF group. On Tue, Apr 15, 2014 at 4:37 AM, Markus Jais markus.j...@yahoo.de wrote: Hi all, thanks for your answers. Very helpful. We plan to use enough nodes so that the failure of 1 or 2 machines is no problem. E.g. for a workload to can be handled by 3 nodes all the time, we would use at least 5, better 6 nodes to survive the failure of at least 2 nodes, even when the 2 nodes fail at the same time. This should allow the cluster to rebuild the missing nodes and still serve all requests with a RF=3 and Quorum reads. All the best, Markus Tupshin Harper tups...@tupshin.com schrieb am 21:23 Montag, 14.April 2014: tl;dr make sure you have enough capacity in the event of node failure. For light workloads, that can be fulfilled with nodes=rf. -Tupshin On Apr 14, 2014 2:35 PM, Robert Coli rc...@eventbrite.com wrote: On Mon, Apr 14, 2014 at 2:25 AM, Markus Jais markus.j...@yahoo.de wrote: It is generally not recommended to set a replication factor of 3 if you have fewer than six nodes in a data center. I have a detailed post about this somewhere in the archives of this list (which I can't seem to find right now..) but briefly, the 6-for-3 advice relates to the percentage of capacity you have remaining when you have a node down. It has become slightly less accurate over time because vnodes reduce bootstrap time and there have been other improvements to node startup time. If you have fewer than 6 nodes with RF=3, you lose 1/6th of capacity when you lose a single node, which is a significant percentage of total cluster capacity. 
You then lose another meaningful percentage of your capacity when your existing nodes participate in rebuilding the missing node. If you are then unlucky enough to lose another node, you are missing a very significant percentage of your cluster capacity and have to use a relatively small fraction of it to rebuild the now two down nodes. I wouldn't generalize the rule of thumb as don't run under N=RF*2, but rather as probably don't run RF=3 under about 6 nodes. IOW, in my view, the most operationally sane initial number of nodes for RF=3 is likely closer to 6 than 3. =Rob
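Ken's "1-in-5 chance" for the 6-node, RF=3 case can be reproduced with a small simulation. This is my own toy model assuming SimpleStrategy-style placement (each token range replicated on the next RF nodes around the ring, no vnodes or racks), not Cassandra's actual placement code:

```python
N, RF = 6, 3
# SimpleStrategy-style placement: range i lives on the next RF ring nodes.
groups = [{(i + k) % N for k in range(RF)} for i in range(N)]

def second_loss_breaks_quorum(first, second):
    """With RF=3, some range loses quorum iff both dead nodes share a
    replica group (2 of that range's 3 replicas are then gone)."""
    return any(first in g and second in g for g in groups)

first = 0  # one node is already down
survivors = [j for j in range(N) if j != first]
safe = [j for j in survivors if not second_loss_breaks_quorum(first, j)]
print(len(safe), len(survivors))  # 1 5 -> the 1-in-5 chance Ken describes
```

Only the node "opposite" the first casualty on the ring shares no replica group with it, hence 1 safe choice out of 5 possible second failures.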
Re: Replication Factor question
Hi Ken, thanks. Good point. Markus Ken Hancock ken.hanc...@schange.com schrieb am 15:15 Dienstag, 15.April 2014: Keep in mind if you lose the wrong two, you can't satisfy quorum. In a 5-node cluster with RF=3, it would be impossible to lose 2 nodes without affecting quorum for at least some of your data. In a 6 node cluster, once you've lost one node, if you were to lose another, you only have a 1-in-5 chance of not affecting quorum for some of your data. In much larger clusters, it becomes less probable that you will lose multiple nodes within a RF group. On Tue, Apr 15, 2014 at 4:37 AM, Markus Jais markus.j...@yahoo.de wrote: Hi all, thanks for your answers. Very helpful. We plan to use enough nodes so that the failure of 1 or 2 machines is no problem. E.g. for a workload to can be handled by 3 nodes all the time, we would use at least 5, better 6 nodes to survive the failure of at least 2 nodes, even when the 2 nodes fail at the same time. This should allow the cluster to rebuild the missing nodes and still serve all requests with a RF=3 and Quorum reads. All the best, Markus Tupshin Harper tups...@tupshin.com schrieb am 21:23 Montag, 14.April 2014: tl;dr make sure you have enough capacity in the event of node failure. For light workloads, that can be fulfilled with nodes=rf. -Tupshin On Apr 14, 2014 2:35 PM, Robert Coli rc...@eventbrite.com wrote: On Mon, Apr 14, 2014 at 2:25 AM, Markus Jais markus.j...@yahoo.de wrote: It is generally not recommended to set a replication factor of 3 if you have fewer than six nodes in a data center. I have a detailed post about this somewhere in the archives of this list (which I can't seem to find right now..) but briefly, the 6-for-3 advice relates to the percentage of capacity you have remaining when you have a node down. It has become slightly less accurate over time because vnodes reduce bootstrap time and there have been other improvements to node startup time. 
If you have fewer than 6 nodes with RF=3, you lose 1/6th of capacity when you lose a single node, which is a significant percentage of total cluster capacity. You then lose another meaningful percentage of your capacity when your existing nodes participate in rebuilding the missing node. If you are then unlucky enough to lose another node, you are missing a very significant percentage of your cluster capacity and have to use a relatively small fraction of it to rebuild the now two down nodes. I wouldn't generalize the rule of thumb as don't run under N=RF*2, but rather as probably don't run RF=3 under about 6 nodes. IOW, in my view, the most operationally sane initial number of nodes for RF=3 is likely closer to 6 than 3. =Rob
Re: Replication Factor question
On Tue, Apr 15, 2014 at 6:14 AM, Ken Hancock ken.hanc...@schange.comwrote: Keep in mind if you lose the wrong two, you can't satisfy quorum. In a 5-node cluster with RF=3, it would be impossible to lose 2 nodes without affecting quorum for at least some of your data. In a 6 node cluster, once you've lost one node, if you were to lose another, you only have a 1-in-5 chance of not affecting quorum for some of your data. This is why the real highly available way to run Cassandra with QUORUM is RF=5, with 5 racks. Briefly, any given node running a JVM based distributed application should be assumed to potentially become transiently unavailable for a short time, for example during long GC pauses or rolling restarts. There is also a chance of non-transient failure (hard down) at any time, and a much smaller chance of two simultaneous non-transient failures. If you have RF=3 and lose two nodes (one transient, the other non-transient) in a range, that range is now unavailable because quorum is 2 and 3-2 is 1, which is less than 2. If you have RF=5 and lose two nodes in the same way, quorum is 3 and 5-2 is 3, which is equal to 3. AFAICT, no one actually runs Cassandra in this way because keeping 5 copies of your already denormalized data seems excessive and is difficult to justify to management. =Rob
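Rob's RF=3 vs RF=5 availability arithmetic (quorum is 2 of 3, or 3 of 5) is easy to check mechanically; the helpers below are mine, for illustration only:

```python
def quorum(rf):
    return rf // 2 + 1

def quorum_available(rf, losses):
    """True if QUORUM operations still succeed with `losses` replicas down."""
    return rf - losses >= quorum(rf)

# Rob's scenario: one transient loss (GC pause / rolling restart)
# coinciding with one hard failure, i.e. 2 replicas down in a range:
print(quorum_available(3, 2))  # False -> RF=3 range becomes unavailable
print(quorum_available(5, 2))  # True  -> RF=5 still reaches quorum (3 of 5)
```

So RF=5 is the smallest replication factor at which QUORUM survives two concurrent replica failures in one range, which is exactly the "real highly available" configuration Rob describes.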
Re: Replication Factor question
It is not common, but I know of multiple organizations running with RF=5, in at least one DC, for HA reasons. -Tupshin On Apr 15, 2014 2:36 PM, Robert Coli rc...@eventbrite.com wrote: On Tue, Apr 15, 2014 at 6:14 AM, Ken Hancock ken.hanc...@schange.comwrote: Keep in mind if you lose the wrong two, you can't satisfy quorum. In a 5-node cluster with RF=3, it would be impossible to lose 2 nodes without affecting quorum for at least some of your data. In a 6 node cluster, once you've lost one node, if you were to lose another, you only have a 1-in-5 chance of not affecting quorum for some of your data. This is why the real highly available way to run Cassandra with QUORUM is RF=5, with 5 racks. Briefly, any given node running a JVM based distributed application should be assumed to potentially become transiently unavailable for a short time, for example during long GC pauses or rolling restarts. There is also a chance of non-transient failure (hard down) at any time, and a much smaller chance of two simultaneous non-transient failures. If you have RF=3 and lose two nodes (one transient, the other non-transient) in a range, that range is now unavailable because quorum is 2 and 3-2 is 1, which is less than 2. If you have RF=5 and lose two nodes in the same way, quorum is 3 and 5-2 is 3, which is equal to 3. AFAICT, no one actually runs Cassandra in this way because keeping 5 copies of your already denormalized data seems excessive and is difficult to justify to management. =Rob
Re: Replication Factor question
Hi Markus, "It is generally not recommended to set a replication factor of 3 if you have fewer than six nodes in a data center." Actually, you can create a cluster with 3 nodes and replication factor 3. But in this case, if one of them fails, the cluster becomes inconsistent. So the minimum reasonable number of nodes for replication factor 3 is 4: in that case we can tolerate a single node failure. But then each node contains 3/4 of all the data, which is not very good. The number 6 is recommended because then each node contains 1/2 of all the data, which is quite reasonable overhead. Typically Cassandra clusters don't have a big replication factor; typically it is 3 (failure of any single node doesn't break the cluster) or 5 (failure of any two nodes doesn't break the cluster). For more details you should look at the replication calculator: http://www.ecyrd.com/cassandracalculator/. -- Thanks, Sergey On 14/04/14 13:25, Markus Jais wrote: Hello, currently reading "Practical Cassandra". In the section about replication factors the book says: "It is generally not recommended to set a replication factor of 3 if you have fewer than six nodes in a data center." Why is that? What problems would arise if I had a replication factor of 3 and only 5 nodes? Does that mean that for a replication factor of 4 I would need at least 8 nodes, and for a factor of 5 at least 10 nodes? Not saying that I would use factor 5 and 10 nodes, just curious about how this works. All the best, Markus
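Sergey's data-fraction argument is just RF/N: with RF copies spread over N balanced nodes, each node stores RF/N of the total data set. A quick check (helper name is mine):

```python
from fractions import Fraction

def data_per_node(rf, n):
    """Fraction of the total data set each node stores, with RF copies
    spread evenly over n nodes (balanced token ownership assumed)."""
    return Fraction(rf, n)

print(data_per_node(3, 3))  # 1   -> every node holds everything
print(data_per_node(3, 4))  # 3/4 -> Sergey's minimal 4-node case
print(data_per_node(3, 6))  # 1/2 -> the recommended 6-node case
```

The same formula answers Markus's scaling question: RF=4 on 8 nodes and RF=5 on 10 nodes both land at the same comfortable 1/2 per node.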
Re: Replication Factor question
I do not agree with this advice. It can be perfectly reasonable to have #nodes < 2*RF. It is common to deploy a 3 node cluster with RF=3, and it works fine as long as each node can handle 100% of your data and keep up with the workload. -Tupshin On Apr 14, 2014 5:25 AM, Markus Jais markus.j...@yahoo.de wrote: Hello, currently reading "Practical Cassandra". In the section about replication factors the book says: "It is generally not recommended to set a replication factor of 3 if you have fewer than six nodes in a data center." Why is that? What problems would arise if I had a replication factor of 3 and only 5 nodes? Does that mean that for a replication factor of 4 I would need at least 8 nodes, and for a factor of 5 at least 10 nodes? Not saying that I would use factor 5 and 10 nodes, just curious about how this works. All the best, Markus
Re: Replication Factor question
Hi all, thanks. Very helpful. @Tupshin: With a 3 node cluster and RF 3, isn't it a problem if one node fails (due to hardware problems, for example)? According to the C* docs, writes fail if the number of nodes is smaller than the RF. I agree that it will run fine as long as all nodes are up and they can handle the load, but eventually hardware will fail. Markus Tupshin Harper tups...@tupshin.com wrote on Monday, 14 April 2014 at 13:44: I do not agree with this advice. It can be perfectly reasonable to have #nodes < 2*RF. It is common to deploy a 3 node cluster with RF=3 and it works fine as long as each node can handle 100% of your data, and keep up with the workload. -Tupshin On Apr 14, 2014 5:25 AM, Markus Jais markus.j...@yahoo.de wrote: Hello, currently reading "Practical Cassandra". In the section about replication factors the book says: "It is generally not recommended to set a replication factor of 3 if you have fewer than six nodes in a data center." Why is that? What problems would arise if I had a replication factor of 3 and only 5 nodes? Does that mean that for a replication factor of 4 I would need at least 8 nodes, and for a factor of 5 at least 10 nodes? Not saying that I would use factor 5 and 10 nodes, just curious about how this works. All the best, Markus
Re: Replication Factor question
With 3 nodes and RF=3, you can always use CL=ALL if all nodes are up, QUORUM if 1 node is down, and ONE if any two nodes are down. The exact same thing is true if you have more nodes. -Tupshin On Apr 14, 2014 7:51 AM, Markus Jais markus.j...@yahoo.de wrote: > With a 3-node cluster and RF=3, isn't it a problem if one node fails (due to hardware problems, for example)? ...
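Tupshin's rule of thumb can be sketched as a small availability check. This is a hypothetical helper (the names are mine, not driver API) mapping each consistency level to the number of live replicas a write needs, with QUORUM computed as RF/2 + 1 using integer division:

```python
def required_replicas(cl: str, rf: int) -> int:
    """Live replicas a write needs to succeed at a given consistency level.
    CL=ANY can be satisfied by a hint alone, so it needs zero live replicas."""
    return {"ANY": 0, "ONE": 1, "TWO": 2, "THREE": 3,
            "QUORUM": rf // 2 + 1, "ALL": rf}[cl]

def write_succeeds(cl: str, rf: int, live_replicas: int) -> bool:
    return live_replicas >= required_replicas(cl, rf)

# RF=3 on a 3-node cluster: ALL needs every replica up, QUORUM survives
# one replica down, ONE survives two replicas down.
assert write_succeeds("ALL", 3, 3) and not write_succeeds("ALL", 3, 2)
assert write_succeeds("QUORUM", 3, 2) and not write_succeeds("QUORUM", 3, 1)
assert write_succeeds("ONE", 3, 1)
```

Note the counts are per-range live replicas, not total live nodes in the cluster — which is exactly the distinction that trips people up later in this thread.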
Re: Replication Factor question
On Mon, Apr 14, 2014 at 2:25 AM, Markus Jais markus.j...@yahoo.de wrote: > It is generally not recommended to set a replication factor of 3 if you have fewer than six nodes in a data center. I have a detailed post about this somewhere in the archives of this list (which I can't seem to find right now...) but briefly, the 6-for-3 advice relates to the percentage of capacity you have remaining when you have a node down. It has become slightly less accurate over time because vnodes reduce bootstrap time and there have been other improvements to node startup time. With 6 nodes and RF=3, you lose 1/6th of capacity when you lose a single node, which is already a significant percentage of total cluster capacity; with fewer nodes you lose an even larger share. You then lose another meaningful percentage of your capacity when your existing nodes participate in rebuilding the missing node. If you are then unlucky enough to lose another node, you are missing a very significant percentage of your cluster capacity and have to use a relatively small fraction of it to rebuild the now two down nodes. I wouldn't generalize the rule of thumb as "don't run under N=RF*2", but rather as "probably don't run RF=3 under about 6 nodes". IOW, in my view, the most operationally sane initial number of nodes for RF=3 is likely closer to 6 than 3. =Rob
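Rob's capacity argument can be made concrete with a rough sketch, assuming data is evenly distributed and the surviving nodes absorb the lost node's share (a simplification; real rebalancing depends on token assignment and vnodes):

```python
def load_multiplier(n_nodes: int, n_down: int) -> float:
    """How much each surviving node's share of the work grows when
    n_down nodes are out and the rest pick up the slack."""
    return n_nodes / (n_nodes - n_down)

# With 3 nodes, losing one pushes each survivor to 1.5x its normal load;
# with 6 nodes the bump is only 1.2x -- the heart of the 6-for-3 advice.
assert load_multiplier(3, 1) == 1.5
assert load_multiplier(6, 1) == 1.2
```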
Re: Replication Factor question
tl;dr make sure you have enough capacity in the event of node failure. For light workloads, that can be fulfilled with nodes=RF. -Tupshin On Apr 14, 2014 2:35 PM, Robert Coli rc...@eventbrite.com wrote: > the 6-for-3 advice relates to the percentage of capacity you have remaining when you have a node down. ...
Re: replication factor is zero
On Thu, Jun 6, 2013 at 1:28 PM, Daning Wang dan...@netseer.com wrote: > Could we set the replication factor to 0 in the other data center? What is the best way to avoid syncing some data in a cluster? Yes, you can set it to 0, and that's the recommended way to handle this. -- Tyler Hobbs DataStax http://datastax.com/
Re: replication factor is zero
But AFAIK you can set the RF only per keyspace, so you will have to pull those tables apart into a different keyspace. 2013/6/6 Tyler Hobbs ty...@datastax.com: > Yes, you can set it to 0, and that's the recommended way to handle this.
Re: Replication Factor and Consistency Level Confusion
Hello, Thank you very much for your quick responses. Initially we were thinking the same thing, that an explanation would be that the wrong node could be down, but then isn't this something that hinted handoff sorts out? So actually, Consistency Level refers to the number of replicas, not the total number of nodes in a cluster. Keeping that in mind, and assuming that hinted handoff has nothing to do with it as I thought, I can explain some results but not all. Let me explain:

Test 1 (3/3 nodes UP):
CL:    ANY  ONE  TWO  THREE  QUORUM  ALL
RF 3:  OK   OK   OK   OK     OK      OK

Test 2 (2/3 nodes UP):
CL:    ANY  ONE  TWO  THREE  QUORUM  ALL
RF 2:  OK   OK   x    x      OK      x

Test 3 (2/3 nodes UP):
CL:    ANY  ONE  TWO  THREE  QUORUM  ALL
RF 3:  OK   OK   x    x      OK      OK

Test 1: Everything was fine because all nodes were up and the RF does not exceed the total number of nodes, in which case writes would be blocked. Test 2: CL=TWO did not work because we were unlucky and the wrong node, responsible for the key range we were trying to insert, was DOWN (I can accept that for now; however, I do not quite understand why this isn't sorted out by hinted handoff). My explanation might be wrong again, but CL=THREE should fail because we only set RF=2, so there isn't a 3rd replica anywhere anyway. Why did CL=QUORUM not fail then? Since QUORUM=(RF/2)+1=2 in this case, the write operation should try to write to 2 replicas, one of which, the one responsible for that range as we said, is DOWN. I should expect CL=TWO and CL=QUORUM to have the same outcome in this case. Why is that not the case? CL=ALL fails for the same reason as CL=TWO, I presume. Test 3: I was expecting only CL=ANY and CL=ONE to work in this case. CL=TWO does not work because, just like in Test 2, the node responsible for that particular key range is DOWN. If that's the case, why was CL=QUORUM successful???
The only explanation I can think of at the moment is that QUORUM explicitly refers to the total number of nodes in the cluster rather than the number of replicas determined by the RF. CL=THREE seems easy; it fails because one of the three replicas is DOWN. CL=ALL is confusing as well. If my understanding is correct and ALL means all replicas, 3 in this case, then the operation should fail because one replica is DOWN, and I cannot be lucky enough to have the right node DOWN, because with RF=3 every node should have a copy of the data. Furthermore, with regard to being unlucky with the wrong node, if this is actually what is happening, how is it possible to ever have a node-failure resilient Cassandra cluster? My understanding of this implies that even with 100 nodes, every 1/100 writes would fail until the node is replaced/repaired. Thank you very much in advance. Vasilis On Wed, Dec 19, 2012 at 4:18 PM, Roland Gude roland.g...@ez.no wrote: Hi, RF=2 means that 2 nodes are responsible for any given row (no matter how many nodes are in the cluster). For your cluster with three nodes, let's just assume the following responsibilities:

Node:          A      B      C
Primary keys:  0-5    6-10   11-15
Replica keys:  11-15  0-5    6-10

Assume node C is down. Writing any key in range 0-5 with consistency TWO is possible (A and B are up). Writing any key in range 11-15 with consistency TWO will fail (C is down and 11-15 is its primary range). Writing any key in range 6-10 with consistency TWO will fail (C is down and it is the replica for this range). I hope this explains it. - Original Message - From: Vasileios Vlachos [mailto:vasileiosvlac...@gmail.com] Sent: Wednesday, 19 December 2012 17:07 To: user@cassandra.apache.org Subject: Replication Factor and Consistency Level Confusion Hello All, We have a 3-node cluster and we created a keyspace (say Test_1) with Replication Factor set to 3. I know it's not great, but we wanted to test different behaviors.
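Roland's example can be simulated with a toy ring. The helper names here are hypothetical, and the ring is a simplified stand-in for Cassandra's real partitioner plus SimpleStrategy placement (primary owner of the key's range, then the next RF-1 nodes clockwise):

```python
def replicas(key, ring, rf):
    """ring: list of (range_end, node) in token order. The primary is the
    node owning the key's range; the remaining rf-1 replicas are the next
    nodes clockwise (roughly what SimpleStrategy does)."""
    idx = next(i for i, (end, _) in enumerate(ring) if key <= end)
    return [ring[(idx + j) % len(ring)][1] for j in range(rf)]

ring = [(5, "A"), (10, "B"), (15, "C")]   # A owns 0-5, B 6-10, C 11-15
up = {"A", "B"}                           # node C is down

def write_ok(key, needed):
    return sum(node in up for node in replicas(key, ring, 2)) >= needed

assert write_ok(3, 2)        # keys 0-5 live on A and B: CL=TWO succeeds
assert not write_ok(12, 2)   # 11-15: primary C is down, CL=TWO fails
assert not write_ok(8, 2)    # 6-10: C holds the replica, CL=TWO fails
assert write_ok(12, 1)       # CL=ONE still succeeds via the survivor
```

Running the check over many random keys is exactly Henrik's suggestion later in the thread: with one node of three down and RF=2, roughly two thirds of the key ranges cannot satisfy CL=TWO.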
So, we created a Column Family (say cf_1) and we tried writing something with Consistency Level ANY, ONE, TWO, THREE, QUORUM and ALL. We did that while all nodes were in the UP state, so we had no problems at all. No matter what the Consistency Level was, we were able to insert a value. Same cluster, different keyspace (say Test_2) with Replication Factor set to 2 this time and one of the 3 nodes deliberately DOWN. Again, we created a Column Family (say cf_1) and we tried writing something with different Consistency Levels. Here is what we got:

ANY: worked (expected...)
ONE: worked (expected...)
TWO: did not work (WHAT???)
THREE: did not work (expected...)
QUORUM: worked (expected...)
ALL: did not work (expected, I guess...)

Now, we know that QUORUM derives from (RF/2)+1, so we were expecting that to work; after all, only 1 node was DOWN. Why did Consistency Level TWO not work then?
Re: Replication Factor and Consistency Level Confusion
Don't run with a replication factor of 2; use 3 instead, and do all reads and writes using QUORUM consistency. That way, if a single node is down, all your operations will complete. In fact, if every third node is down, you'll still be fine and able to handle all requests. However, if two adjacent nodes are down at the same time, operations against keys that are stored on both those servers will fail because quorum can't be satisfied. To gain a better understanding, repeat your tests, but with multiple random keys, and keep track of how many operations fail in each case. /Henrik On Thu, Dec 20, 2012 at 10:26 AM, Vasileios Vlachos vasileiosvlac...@gmail.com wrote: > Furthermore, with regard to being unlucky with the wrong node, if this is actually what is happening, how is it possible to ever have a node-failure resilient Cassandra cluster? My understanding of this implies that even with 100 nodes, every 1/100 writes would fail until the node is replaced/repaired.
Re: Replication Factor and Consistency Level Confusion
On Thu, Dec 20, 2012 at 11:26 AM, Vasileios Vlachos vasileiosvlac...@gmail.com wrote: > Initially we were thinking the same thing, that an explanation would be that the wrong node could be down, but then isn't this something that hinted handoff sorts out? If a node is partitioned from the rest of the cluster (i.e. the node goes down, but later comes back with the same data it had), it will obviously be out of date with regard to any writes that happened while it was down. Anti-entropy (nodetool repair) and read repair will repair this inconsistency over time, but not right away; hinted handoff is an optimization that will allow the node to become mostly consistent right away on rejoining the cluster, as the other nodes will have stored hints for it while it was down, and will send them to it once the node is back up. However, the important thing to note is that this is an /optimization/. If a replica is down, then it will not be able to satisfy any consistency level requirements, except for the special case of CL=ANY. If you use another CL like TWO, then two actual replica nodes must be up for the ranges you are writing to; a node that is not a replica but will write a hint does not count.

Test 2 (2/3 nodes UP):
CL:    ANY  ONE  TWO  THREE  QUORUM  ALL
RF 2:  OK   OK   x    x      OK      x

For this test, QUORUM = RF/2+1 = 2/2+1 = 2. A write at QUORUM should have succeeded if both of the replicas for the range were up, but if one of the replicas for the range was the downed node, then it would have failed. I think you can use the 'nodetool getendpoints' command to list the nodes that are replicas for the given row key. I am unable to explain how a write at QUORUM could succeed if a write at TWO for the same key failed.

Test 3 (2/3 nodes UP):
CL:    ANY  ONE  TWO  THREE  QUORUM  ALL
RF 3:  OK   OK   x    x      OK      OK

For this test, QUORUM = RF/2+1 = 3/2+1 = 2.
Again, I am unable to explain why a write at QUORUM would succeed if a write at TWO failed, and I am also unable to explain how a write at ALL could succeed, for any key, if one of the nodes is down. I would suggest double-checking your test setup; also, make sure you use the same row keys every time (if this is not already the case) so that you have repeatable results. > Furthermore, with regard to being unlucky with the wrong node, if this is actually what is happening, how is it possible to ever have a node-failure resilient Cassandra cluster? My understanding of this implies that even with 100 nodes, every 1/100 writes would fail until the node is replaced/repaired. RF is the important number when considering fault tolerance in your cluster, not the number of nodes. If RF=3, and you read and write at QUORUM, then you can tolerate one node being down in the range you are operating on. If you need to be able to tolerate two nodes being down, RF=5 and QUORUM would work. In other words, if you need better fault tolerance, RF is what you need to increase; if you need better performance, or you need to store more data, then N (the number of nodes in the cluster) is what you need to increase. Of course, N must be at least as big as RF... -- mithrandi, i Ainil en-Balandor, a faer Ambar
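The point above about RF driving fault tolerance reduces to a one-liner (a sketch with hypothetical helper names, not driver code): at QUORUM you can lose RF minus quorum replicas and still serve every request.

```python
def quorum(rf: int) -> int:
    return rf // 2 + 1

def tolerable_down_at_quorum(rf: int) -> int:
    # Replicas that can be down while QUORUM reads/writes still succeed.
    return rf - quorum(rf)

assert tolerable_down_at_quorum(3) == 1   # RF=3: one replica down is fine
assert tolerable_down_at_quorum(5) == 2   # RF=5: survives two down
assert tolerable_down_at_quorum(2) == 0   # RF=2: QUORUM = 2 = ALL, no slack
```

The RF=2 case is worth noting: quorum of 2 is 2, so QUORUM behaves exactly like ALL, which is why several threads here recommend RF=3 as the minimum for real redundancy.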
Re: Replication Factor and Consistency Level Confusion
> this is actually what is happening, how is it possible to ever have a node-failure resilient Cassandra cluster? Background: http://thelastpickle.com/2011/06/13/Down-For-Me/ > I would suggest double-checking your test setup; also, make sure you use the same row keys every time (if this is not already the case) so that you have repeatable results. Take a look at the nodetool getendpoints command. It will tell you which nodes a key is stored on. Though for RF=3 and N=3 it's all of them :) Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 21/12/2012, at 12:54 AM, Tristan Seligmann mithra...@mithrandi.net wrote: ...
Re: Replication Factor and Consistency Level Confusion
> ANY: worked (expected...) > ONE: worked (expected...) > TWO: did not work (WHAT???) This is expected sometimes and sometimes not. It depends on which 2 of the 3 nodes have the data. Since you have one node down, that might be the one where that data goes ;). > THREE: did not work (expected...) > QUORUM: worked (expected...) > ALL: did not work (expected, I guess...) On 12/19/12 9:07 AM, Vasileios Vlachos vasileiosvlac...@gmail.com wrote: ...
Re: Replication Factor and Consistency Level Confusion
PS, you may be getting a bit confused, by the way. Just think: if you have a 10-node cluster and one node is down and you do CL=TWO... if the node that is down is where your data goes, yes, you will fail. If you do CL=QUORUM and RF=3 you can tolerate one node being down... If you use Astyanax, I think they have an implementation that will switch the CL to a lower level when a node is out so the writes will still work, which is quite nice (i.e. write fails, switch to a lower CL and write again, or the same with reads). Dean On 12/19/12 9:07 AM, Vasileios Vlachos vasileiosvlac...@gmail.com wrote: Hello All, We have a 3-node cluster and we created a keyspace (say Test_1) with Replication Factor set to 3. ... Third test... Same cluster again, different keyspace (say Test_3) with Replication Factor set to 3 this time and 1 of the 3 nodes deliberately DOWN again. Same approach again; we created a different Column Family (say cf_1), and the different Consistency Level settings resulted in the following:

ANY: worked (what???)
ONE: worked (what???)
TWO: did not work (what???)
THREE: did not work (expected...)
QUORUM: worked (what???)
ALL: worked (what???)

We thought that if the Replication Factor is greater than the number of nodes in the cluster, writes are blocked. Apparently we are completely missing a level of understanding here, so we would appreciate any help! Thank you in advance! Vasilis
Re: Replication factor and performance questions
@Oleg, to answer your last question: a Cassandra node should never ask another node for information it doesn't have. It uses the key and the partitioner to determine where the data is located before ever contacting another node. On Mon, Nov 5, 2012 at 9:45 AM, Andrey Ilinykh ailin...@gmail.com wrote: > You will have one extra hop. Not a big deal, actually. And many client libraries (Astyanax for example) are token aware, so they are smart enough to call the right node.
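The "no extra hop" point works because a token-aware client hashes the key with the partitioner and sends the request straight to a replica. A toy illustration, with a hypothetical ord-sum "partitioner" standing in for Murmur3 and a made-up three-node ring:

```python
def token(key: str) -> int:
    # Toy stand-in for the real partitioner (Murmur3/RandomPartitioner).
    return sum(ord(c) for c in key) % 100

def owner(key: str, ring) -> str:
    """ring: sorted (token_end, node) pairs; the first range containing
    the token owns the key. A token-aware client contacts this node
    directly, so no coordinator-to-replica hop is needed."""
    t = token(key)
    for end, node in ring:
        if t <= end:
            return node
    return ring[0][1]  # wrap around the ring

ring = [(33, "A"), (66, "B"), (99, "C")]
# The client computes the owner locally before sending any request.
assert owner("user1", ring) == "C"
```

A non-token-aware client would instead send the request to an arbitrary coordinator node, which does this same lookup server-side and forwards the request — the one extra hop Andrey describes.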
Re: Replication factor and performance questions
Rule of thumb is to try to keep nodes under 400GB. Compactions/Repairs/Move operations etc become a nightmare otherwise. How much data do you expect to have on each node? Also depends on caches, bloom filters etc On 11/5/12 8:57 AM, Oleg Dulin oleg.du...@gmail.com wrote: I have 4 nodes at my disposal. I can configure them like this: 1) RF=1, each node has 25% of the data. On random-reads, how big is the performance penalty if a node needs to look for data on another replica ? 2) RF=2, each node has 50% of the data. Same question ? -- Regards, Oleg Dulin NYC Java Big Data Engineer http://www.olegdulin.com/ 'Like' us on Facebook for exclusive content and other resources on all Barracuda Networks solutions. Visit http://barracudanetworks.com/facebook
Re: Replication factor and performance questions
Our compactions/repairs have already become nightmares and we have not approached the levels of data you describe here (~200 GB). Have any pointers/case studies for optimizing this? On Nov 5, 2012, at 12:00 PM, Michael Kjellman wrote: > Rule of thumb is to try to keep nodes under 400GB. Compactions/Repairs/Move operations etc become a nightmare otherwise.
Re: Replication factor and performance questions
They should all be under 400 GB each. My question is -- is there additional overhead with replicas making requests to one another for keys they don't have? How much of an overhead is that? On 2012-11-05 17:00:37, Michael Kjellman said: > Rule of thumb is to try to keep nodes under 400GB. ... -- Regards, Oleg Dulin NYC Java Big Data Engineer http://www.olegdulin.com/
Re: Replication factor and performance questions
You will have one extra hop. Not big deal, actually. And many client libraries (astyanax for example) are token aware, so they are smart enough to call the right node. On Mon, Nov 5, 2012 at 9:12 AM, Oleg Dulin oleg.du...@gmail.com wrote: > Is there additional overhead with replicas making requests to one another for keys they don't have? How much of an overhead is that?
Re: Replication factor 2, consistency and failover
Aaron, thank you! Your message was exactly what we wanted to see: that we didn't miss something critical. We'll share our Astyanax patch in the future. On 10 September 2012 03:44, aaron morton aa...@thelastpickle.com wrote: > You need to have R + W > N. LOCAL_QUORUM writes and ONE reads give you 2 + 1 > 2. When you drop back to ONE / ONE you no longer have strong consistency. ...
Re: Replication factor 2, consistency and failover
> In general we want to achieve strong consistency. You need to have R + W > N. LOCAL_QUORUM writes and reads with ONE give you 2 + 1 > 2 when you use them. When you drop back to ONE / ONE you no longer have strong consistency. > maybe advice on how to improve it. Sounds like you know how to improve it :) Things you could play with: * hinted_handoff_throttle_delay_in_ms in the YAML, to reduce the time it takes for hinted handoff to deliver the messages. * increase the read_repair_chance for the CFs. This will increase the chance of RR repairing an inconsistency behind the scenes, so the next read is consistent. This will also increase the IO load on the system. With the RF=2 restriction you are probably doing the best you can. You are giving up consistency for availability and partition tolerance. The best thing to do is to get peeps to agree that "we will accept reduced consistency for high availability" rather than say "in general we want to achieve strong consistency". Hope that helps. - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 9/09/2012, at 9:09 PM, Sergey Tryuber stryu...@gmail.com wrote: Hi, We have to use Cassandra with RF=2 (don't ask why...). There are two datacenters (RF=2 in each datacenter). Also we use Astyanax as a client library. In general we want to achieve strong consistency. Read performance is important for us, which is why we perform writes with LOCAL_QUORUM and reads with ONE. If one server is down, we automatically switch to Writes.ONE, Reads.ONE, but only for the replica set containing the failed node (we modified Astyanax to achieve that). When the server comes back, we switch back to Writes.LOCAL_QUORUM and Reads.ONE, but, of course, we see some inconsistencies during the switching process and for some time after (while hinted handoff works). Basically I don't have any questions, just want to share our ugly failover algorithm, to hear your criticism and maybe advice on how to improve it.
Unfortunately we can't change the replication factor, and most of the time we have to read with consistency level ONE (because we have strict requirements on read performance). Thank you!
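The R + W > N rule from this thread can be written as a quick check (a hypothetical helper using numeric consistency levels, where N is the replication factor):

```python
def strongly_consistent(r: int, w: int, rf: int) -> bool:
    # Reads and writes overlap on at least one replica when R + W > RF,
    # so every read sees the latest acknowledged write.
    return r + w > rf

# Sergey's setup, RF=2 per DC: LOCAL_QUORUM writes (2) + ONE reads (1).
assert strongly_consistent(r=1, w=2, rf=2)       # 1 + 2 > 2: consistent
assert not strongly_consistent(r=1, w=1, rf=2)   # ONE/ONE fallback: not
```

This makes Aaron's point precise: the ONE/ONE failover mode is exactly where the read and write quorums stop overlapping, which is why inconsistencies appear during the switch and until hinted handoff catches up.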
Re: Replication factor - Consistency Questions
> But isn't QUORUM on a 2-node cluster still 2 nodes? Yes. 3 is where you start to get some redundancy - http://thelastpickle.com/2011/06/13/Down-For-Me/ Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 20/07/2012, at 10:24 AM, Kirk True wrote: But isn't QUORUM on a 2-node cluster still 2 nodes? On 07/17/2012 11:50 PM, Jason Tang wrote: Yes, ALL is not good for HA, and because we met a problem when using QUORUM, our current solution is to switch Write:QUORUM / Read:QUORUM when we get an UnavailableException. 2012/7/18 Jay Parashar jparas...@itscape.com: Thanks... but write ALL will fail for any downed nodes. I am thinking of QUORUM. From: Jason Tang [mailto:ares.t...@gmail.com] Sent: Tuesday, July 17, 2012 8:24 PM To: user@cassandra.apache.org Subject: Re: Replication factor - Consistency Questions Hi, I started using Cassandra not long ago, and also have problems with consistency. Here is some thinking. If you have Write:ANY / Read:ONE, you will have consistency problems, and if you want to repair, check your schema and the parameter "read repair chance": http://wiki.apache.org/cassandra/StorageConfiguration And if you want consistent results, my suggestion is to use Write:ALL / Read:ONE, since for Cassandra, writes are faster than reads. For the performance impact, you need to test your traffic; if your memory cannot cache all your data, or your network is not fast enough, then yes, writing one more node will have an impact. BRs 2012/7/18 Jay Parashar jparas...@itscape.com: Hello all, There is a lot of material on Replication Factor and Consistency Level, but I am a little confused by what is happening on my setup (Cassandra 1.1.2). I would appreciate any answers. My setup: a cluster of 2 nodes, evenly balanced.
My RF =2, Consistency Level; Write = ANY and Read = 1 I know that my consistency is Weak but since my RF = 2, I thought data would be just duplicated in both the nodes but sometimes, querying does not give me the correct (or gives partial) results. In other times, it gives me the right results Is the Read Repair going on after the first query? But as RF = 2, data is duplicated then why the repair? Note: My query is done a while after the Writes so data should have been in both the nodes. Or is this not the case (flushing not happening etc)? I am thinking of making the Write as 1 and Read as QUORAM so R + W RF (1 + 2 2) to give strong consistency. Will that affect performance a lot (generally speaking)? Thanks in advance Regards Jay
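Jay's proposed change (Write = ONE, Read = QUORUM with RF = 2) can be checked against the usual overlap rule R + W > RF. A small sketch of the arithmetic; the function names are illustrative, not driver code:

```python
# Strong consistency requires the read and write replica sets to overlap:
# R + W > RF. Quick check of the combinations discussed in this thread.

def quorum(rf):
    """Number of replicas a QUORUM request must reach."""
    return rf // 2 + 1

def is_strongly_consistent(write_replicas, read_replicas, rf):
    """True if every read is guaranteed to see the latest acknowledged write."""
    return write_replicas + read_replicas > rf

RF = 2
# Write = ANY can be satisfied by a hint alone, i.e. 0 durable replicas:
print(is_strongly_consistent(0, 1, RF))            # False - Jay's current setup
# Write = ONE, Read = QUORUM (Jay's proposal): 1 + 2 > 2
print(is_strongly_consistent(1, quorum(RF), RF))   # True
# And Kirk's point: on a 2-node cluster, QUORUM is still both nodes
print(quorum(2))                                   # 2
```

So the proposal is sound on paper, but on a 2-node RF=2 cluster the QUORUM read still touches every node, so there is no redundancy to lose.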
Re: Replication factor - Consistency Questions
But isn't QUORUM on a 2-node cluster still 2 nodes?

On 07/17/2012 11:50 PM, Jason Tang wrote:
> Yes, for ALL, it is not good for HA. We met problems when using QUORUM, and our current workaround is to switch to Write:QUORUM / Read:QUORUM when we get an UnavailableException.
Re: Replication factor - Consistency Questions
Yes, for ALL, it is not good for HA. We met problems when using QUORUM, and our current workaround is to switch to Write:QUORUM / Read:QUORUM when we get an UnavailableException.

BRs

2012/7/18 Jay Parashar jparas...@itscape.com
> Thanks..but write ALL will fail for any downed nodes. I am thinking of QUORUM.
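The switch-on-UnavailableException workaround Jason describes amounts to a downgrading retry. A minimal sketch of the idea; `Unavailable` and the `execute` callable are hypothetical stand-ins for whatever client driver is in use, not a real driver API:

```python
# Sketch of the "downgrade consistency on UnavailableException" workaround
# described above. Unavailable and execute are hypothetical stand-ins for
# the client driver, not a real API.

class Unavailable(Exception):
    """Raised when too few replicas are alive for the requested CL."""

def write_with_fallback(execute, statement):
    """Try CL ALL first; fall back to QUORUM if not enough replicas are up."""
    try:
        return execute(statement, "ALL")
    except Unavailable:
        # QUORUM writes still overlap QUORUM reads (strong consistency),
        # but tolerate a minority of replicas being down.
        return execute(statement, "QUORUM")

# Demo with a fake executor standing in for the driver session:
def fake_execute(statement, cl):
    if cl == "ALL":
        raise Unavailable()   # pretend one replica is down
    return f"applied at {cl}"

print(write_with_fallback(fake_execute, "INSERT ..."))  # applied at QUORUM
```

The trade-off Jay and Jason are circling is exactly this: ALL gives the cheapest consistent reads but zero write availability under failure, so the fallback buys availability at the cost of a retry.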
Re: Replication factor - Consistency Questions
Hi

I have not been using Cassandra for long, and have also had consistency problems. Here is some thinking.

If you have Write:ANY / Read:ONE, you will have consistency problems. If you want read repair, check your schema and the "read repair chance" parameter: http://wiki.apache.org/cassandra/StorageConfiguration

If you want consistent results, my suggestion is Write:ALL / Read:ONE, since in Cassandra writes are faster than reads. For the performance impact, you need to test with your own traffic: if memory cannot cache all your data, or your network is not fast enough, then yes, writing to one more node will have a cost.

BRs

2012/7/18 Jay Parashar jparas...@itscape.com
> Hello all, there is a lot of material on replication factor and consistency level, but I am a little confused by what is happening on my setup (Cassandra 1.1.2). I would appreciate any answers.
> My setup: a cluster of 2 nodes, evenly balanced. My RF = 2; consistency level: Write = ANY and Read = ONE.
> I know my consistency is weak, but since my RF = 2 I thought data would simply be duplicated on both nodes. Yet sometimes querying does not give me the correct (or gives partial) results; at other times it gives the right results. Is read repair going on after the first query? But as RF = 2 the data is duplicated, so why the repair?
> Note: my query is done a while after the writes, so data should have been on both nodes. Or is this not the case (flushing not happening, etc.)?
> I am thinking of making the write ONE and the read QUORUM, so R + W > RF (1 + 2 > 2), to give strong consistency. Will that affect performance a lot (generally speaking)?
> Thanks in advance. Regards, Jay
RE: Replication factor - Consistency Questions
Thanks..but write ALL will fail for any downed nodes. I am thinking of QUORUM.

From: Jason Tang [mailto:ares.t...@gmail.com]
Sent: Tuesday, July 17, 2012 8:24 PM
To: user@cassandra.apache.org
Subject: Re: Replication factor - Consistency Questions
Re: Replication factor
Ah. The lack of page cache hits after compaction makes sense. But I don't think the drastic effect it appears to have is expected. Do you have an idea of how much slower local reads get?

If you are selecting coordinators based on token ranges, the DS is not as much of a factor. It still has some utility, as the digest reads will be happening on other nodes and it should help with selecting them.

Thanks for the extra info.

Aaron

- Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 25/05/2012, at 1:24 AM, Viktor Jevdokimov wrote:
> All data is in the page cache. No repairs. Compactions not hitting disk for reads. CPU 50%. ParNew GC 100 ms on average.
> After one compaction completes, the new sstable is not in the page cache; there may be a disk usage spike before the data is cached, so local reads get slower for a moment compared with other nodes. Redirecting almost all requests to other nodes then ends in a huge latency spike on almost all nodes, especially when ParNew GC spikes on one node (200 ms). We call it a "cluster hiccup": incoming and outgoing network traffic drops for a moment, and such hiccups happen several times an hour, a few seconds long.
> Playing with the badness threshold did not give much better results, but turning the DS off completely fixed all problems with latencies, node spikes, cluster hiccups and network traffic drops. In our case our client selects endpoints for a key by calculating the token, so we always hit a replica.
Re: Replication factor
Read Repair means including all UP replicas in the request, waiting asynchronously after the read has completed, and resolving and repairing differences. If you read at QUORUM with RR running, ALL (replica) nodes will perform a read.

At any CL > ONE, the responses from CL nodes are reconciled and differences are repaired using a mechanism similar to RR. This has to happen before the response can be sent to the client.

The naming does not help, but they are different. RR is a background process designed to reduce the chance of an inconsistent read. It's needed less since the changes to Hinted Handoff in 1.0.

Cheers

- Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 24/05/2012, at 5:47 AM, Daning Wang wrote:
> Thanks guys. Aaron, I am confused about this: from the wiki http://wiki.apache.org/cassandra/ReadRepair it looks like, for any consistency level, Read Repair will be done either before or after responding with data.
>> Read Repair does not run at CL ONE.
> Daning
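The foreground reconciliation Aaron describes (CL replicas respond, the newest write timestamp wins, and stale replicas are repaired before the client sees the result) can be sketched roughly like this. Purely illustrative; not Cassandra's actual internals:

```python
# Rough sketch of foreground read reconciliation at CL > ONE, as described
# above: collect responses from CL replicas, newest write timestamp wins,
# and stale replicas are repaired before the coordinator answers the client.
# Illustrative only; not Cassandra's actual internals.

def reconcile(responses):
    """responses: {replica: (value, write_timestamp)} from CL replicas.

    Returns the winning value and the replicas that need repair.
    """
    winner_value, winner_ts = max(responses.values(), key=lambda vt: vt[1])
    stale = [r for r, (v, ts) in responses.items() if ts < winner_ts]
    # In the real system the stale replicas would now be sent the winning
    # (value, timestamp) before the coordinator replies to the client.
    return winner_value, stale

value, needs_repair = reconcile({
    "node1": ("old", 100),
    "node2": ("new", 200),
})
print(value, needs_repair)  # new ['node1']
```

This is the "mechanism similar to RR" that runs in the request path; background Read Repair does the same resolution but after the response has already been sent.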
Re: Replication factor
Your experience is that, when using CL ONE, the Dynamic Snitch is moving local reads off to other nodes, and this is causing spikes in read latency?

Did you notice what was happening on the node for the DS to think it was so slow? Was compaction or repair going on?

Have you played with the badness threshold https://github.com/apache/cassandra/blob/trunk/conf/cassandra.yaml#L472 ?

Cheers

- Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 24/05/2012, at 5:28 PM, Viktor Jevdokimov wrote:
> Depends on use case. For ours we have different experience and statistics: turning the dynamic snitch off makes overall latency and spikes much, much lower.
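The badness threshold Aaron links to controls how much worse a replica's latency score must be before reads are routed away from it. A toy model of the decision, to show why a brief post-compaction cold-cache blip can shift traffic; this is an assumption-laden simplification, not the real snitch code:

```python
# Toy model of the dynamic snitch "badness threshold" decision discussed
# above: the otherwise-preferred (e.g. local) replica is skipped only when
# its latency score exceeds the best replica's score by more than the
# threshold fraction. Simplified; not Cassandra's actual implementation.

def prefers_remote(local_score, best_remote_score, badness_threshold=0.1):
    """Return True if reads should be routed away from the local replica."""
    return local_score > best_remote_score * (1 + badness_threshold)

# A brief post-compaction cold-cache blip (score 1.5 vs 1.0) already
# exceeds a 0.1 threshold, so traffic shifts to other nodes:
print(prefers_remote(1.5, 1.0, 0.1))  # True
# A larger threshold tolerates the blip and keeps reads local:
print(prefers_remote(1.5, 1.0, 1.0))  # False
```

Under this model, Viktor's "cluster hiccup" is the threshold being crossed transiently on every node that finishes a compaction, while Brandon's point is that crossing it usually means the local replica really is the slowest choice.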
RE: Replication factor
All data is in the page cache. No repairs. Compactions not hitting disk for reads. CPU 50%. ParNew GC 100 ms on average.

After one compaction completes, the new sstable is not in the page cache; there may be a disk usage spike before the data is cached, so local reads get slower for a moment compared with other nodes. Redirecting almost all requests to other nodes then ends in a huge latency spike on almost all nodes, especially when ParNew GC spikes on one node (200 ms). We call it a "cluster hiccup": incoming and outgoing network traffic drops for a moment, and such hiccups happen several times an hour, a few seconds long.

Playing with the badness threshold did not give much better results, but turning the DS off completely fixed all problems with latencies, node spikes, cluster hiccups and network traffic drops. In our case our client selects endpoints for a key by calculating the token, so we always hit a replica.

Best regards / Pagarbiai
Viktor Jevdokimov
Senior Developer
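Viktor's client-side trick (computing the token for a key and contacting a replica directly) can be sketched as follows. The hash and ring layout are simplified stand-ins for Cassandra's real partitioner and replica placement, not the actual implementation:

```python
import hashlib

# Sketch of client-side token-aware routing as described above: hash the
# partition key to a token, walk the ring to the owning node, and take the
# next RF nodes as replicas (SimpleStrategy-style placement). The hash and
# ring here are simplified stand-ins for Cassandra's real partitioner.

def token_for(key: bytes) -> int:
    return int.from_bytes(hashlib.md5(key).digest()[:8], "big")

def replicas_for(key: bytes, ring, rf):
    """ring: sorted list of (token, node). Returns the rf replica nodes."""
    t = token_for(key)
    # First node whose token is >= t owns the key (wrap around if needed):
    start = next((i for i, (tok, _) in enumerate(ring) if tok >= t), 0)
    return [ring[(start + i) % len(ring)][1] for i in range(rf)]

ring = [(0, "n1"), (2**61, "n2"), (2**62, "n3"), (2**63, "n4")]
print(replicas_for(b"user:42", ring, rf=3))
```

Because the client always lands on a replica, every coordinator hop the dynamic snitch adds is pure overhead in Viktor's setup, which is why turning it off helped him.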
Re: Replication factor
RF is normally adjusted to modify availability (see http://thelastpickle.com/2011/06/13/Down-For-Me/).

> for example, if I have a 4-node cluster in one data center, how does RF=2 vs RF=4 affect read performance? If the consistency level is ONE, it looks like a read does not need to go another hop to get data if RF=4, but it would do more work on read repair in the background.

Read Repair does not run at CL ONE. When RF == number of nodes and you read at CL ONE, you will always be reading locally, but with low consistency. If you read with QUORUM when RF == number of nodes, you will still get some performance benefit from the data being read locally.

Cheers

- Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 23/05/2012, at 9:34 AM, Daning Wang wrote:
> Hello, what are the pros and cons of different replication factors in terms of performance, if space is not a concern? For example, if I have a 4-node cluster in one data center, how does RF=2 vs RF=4 affect read performance? If the consistency level is ONE, it looks like a read does not need to go another hop to get data if RF=4, but it would do more work on read repair in the background. Can you share some insights about this? Thanks in advance, Daning
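Daning's RF=2 vs RF=4 question can be put in rough numbers: with N nodes and replication factor RF, about RF/N of keys have a replica on whichever node coordinates the read, and a read needs at least one of the RF replicas alive. Back-of-the-envelope arithmetic only, assuming independent node failures:

```python
# Back-of-the-envelope numbers for the RF=2 vs RF=4 question above, for a
# 4-node cluster. Illustrative arithmetic; assumes uniform key distribution
# and independent node failures.

def local_read_fraction(rf, nodes):
    """Chance a random key has a replica on the coordinating node."""
    return rf / nodes

def read_available(rf, p_node_down):
    """Probability at least one of the rf replicas is up."""
    return 1 - p_node_down ** rf

N = 4
for rf in (2, 4):
    print(rf, local_read_fraction(rf, N), read_available(rf, 0.01))
# RF=2: half of CL ONE reads are local; RF=4: all of them are.
```

This is the quantitative version of Aaron's point: raising RF mostly buys availability and (at CL ONE) read locality, at the cost of more write work per mutation.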
RE: Replication factor
> When RF == number of nodes, and you read at CL ONE you will always be reading locally.

"Always be reading locally" - only if the Dynamic Snitch is off. With the dynamic snitch on, a request may be redirected to another node, which may introduce latency spikes.

Best regards / Pagarbiai
Viktor Jevdokimov
Senior Developer
Re: Replication factor
Thanks guys.

Aaron, I am confused about this: from the wiki http://wiki.apache.org/cassandra/ReadRepair it looks like, for any consistency level, Read Repair will be done either before or after responding with data.

> Read Repair does not run at CL ONE.

Daning

On Wed, May 23, 2012 at 3:51 AM, Viktor Jevdokimov viktor.jevdoki...@adform.com wrote:
> "Always be reading locally" - only if the Dynamic Snitch is off. With the dynamic snitch on, a request may be redirected to another node, which may introduce latency spikes.
Re: Replication factor
On Wed, May 23, 2012 at 5:51 AM, Viktor Jevdokimov viktor.jevdoki...@adform.com wrote:
>> When RF == number of nodes, and you read at CL ONE you will always be reading locally.
> "Always be reading locally" - only if the Dynamic Snitch is off. With the dynamic snitch on, a request may be redirected to another node, which may introduce latency spikes.

Actually it's preventing spikes, since if it won't read locally that means the local replica is in worse shape than the rest (compacting, repairing, etc.)

-Brandon
RE: Replication factor
Depends on use case. For ours we have different experience and statistics: turning the dynamic snitch off makes overall latency and spikes much, much lower.

Best regards / Pagarbiai
Viktor Jevdokimov
Senior Developer
Re: Replication factor per column family
Ok, that's clear, thank you for your time!

2012/2/16 aaron morton <aa...@thelastpickle.com>:
> yes.
>
> - Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
Re: Replication factor per column family
Multiple CF mutations for a row are treated atomically in the commit log, and they are sent together to the replicas. Replication occurs at the row level, not the row+CF level.

If each CF had its own RF, odd things could happen, like a batch mutation for one row and two CFs failing because there are not enough nodes for one of the CFs. There would be other reasons as well. In short, it's baked in.

Cheers

- Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 16/02/2012, at 9:54 PM, R. Verlangen wrote:
> Hi there,
>
> As the subject states: is it possible to set a replication factor per column family? I could not find anything in recent releases. I'm running Cassandra 1.0.7 and I think it should be possible on a per-CF basis instead of for the whole keyspace.
>
> With kind regards,
> Robin
Re: Replication factor per column family
Hmm ok. This means if I want to have a CF with RF = 3 and another CF with RF = 1 (e.g. some debug logging), I will have to create 2 keyspaces?

2012/2/16 aaron morton <aa...@thelastpickle.com>:
> Multiple CF mutations for a row are treated atomically in the commit log, and they are sent together to the replicas. Replication occurs at the row level, not the row+CF level. [...]
> In short it's baked in.
Re: Replication factor per column family
Yes.

- Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 16/02/2012, at 10:15 PM, R. Verlangen wrote:
> Hmm ok. This means if I want to have a CF with RF = 3 and another CF with RF = 1 (e.g. some debug logging), I will have to create 2 keyspaces?
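Since RF is a keyspace-level property, the split Robin describes would look roughly like this. The keyspace names are hypothetical, and the syntax shown is the later CQL form (in the 1.0.x era the same thing was done via cassandra-cli or thrift):

```sql
-- Main application data, replicated 3 ways
CREATE KEYSPACE app_data
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

-- Debug logging, where a single copy is acceptable
CREATE KEYSPACE debug_logs
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
```

The cost of the split is that row-level atomicity guarantees described above only hold within a keyspace's column families, so data that must mutate together should stay in one keyspace.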
Re: Replication factor and other schema changes in >= 0.7
It is coming. In fact, I started working on this ticket yesterday. Most of the settings that you could change before will be modifiable. Unfortunately, you must still manually perform the repair operations, etc., afterward.

https://issues.apache.org/jira/browse/CASSANDRA-1285

Gary.

On Thu, Aug 19, 2010 at 18:00, Andres March <ama...@qualcomm.com> wrote:
> How should we go about changing the replication factor and other keyspace settings now that they and other KSMetaData are no longer managed in cassandra.yaml? I found makeDefinitionMutation() in the Migration class and see that it is called for the other schema migrations. There just seems to be a big gap in the management API for the KSMetaData we might want to change.
>
> --
> Andres March
> ama...@qualcomm.com
> Qualcomm Internet Services
Re: Replication factor and other schema changes in >= 0.7
Cool, thanks. I suspected the same, including the repair.

On 08/20/2010 06:05 AM, Gary Dusbabek wrote:
> It is coming. In fact, I started working on this ticket yesterday. Most of the settings that you could change before will be modifiable. Unfortunately, you must still manually perform the repair operations, etc., afterward.
>
> https://issues.apache.org/jira/browse/CASSANDRA-1285

--
Andres March
ama...@qualcomm.com
Qualcomm Internet Services
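The workflow being discussed — change the keyspace metadata, then repair by hand — can be sketched as follows. This is the later CQL/nodetool form of the operation; the keyspace name "my_ks" is hypothetical, and in the 0.7 era the metadata change went through the schema-migration API that CASSANDRA-1285 describes:

```shell
# Change the replication factor on an existing keyspace
# (hypothetical keyspace "my_ks").
cqlsh -e "ALTER KEYSPACE my_ks
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};"

# The schema change does not move any data by itself. As Gary notes, each
# node must be repaired afterward so the new replicas receive their copies.
nodetool repair my_ks
```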
Re: Replication Factor and Data Centers
(moving to user@)

On Mon, Jun 14, 2010 at 10:43 PM, Masood Mortazavi <masoodmortaz...@gmail.com> wrote:
> Is a clearer interpretation of this statement (in conf/datacenters.properties) given anywhere else?
>
> # The sum of all the datacenter replication factor values should equal
> # the replication factor of the keyspace (i.e. sum(dc_rf) = RF)
> # keyspace\:datacenter=replication factor
> Keyspace1\:DC1=3
> Keyspace1\:DC2=2
> Keyspace1\:DC3=1
>
> Does the above example configuration imply that Keyspace1 has a RF of 6, and that of these, 3 will go to DC1, 2 to DC2 and 1 to DC3?

Yes.

> What will happen if datacenters.properties and cassandra-rack.properties are simply empty?

You have an illegal configuration. https://issues.apache.org/jira/browse/CASSANDRA-1191 is open to have Cassandra raise an error under this condition and others.

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com
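The invariant in the properties-file comment — the keyspace's total RF is the sum of the per-datacenter factors — can be sanity-checked in a few lines. The datacenter names and values mirror the Keyspace1 example above:

```python
# Per-datacenter replication factors from the datacenters.properties example:
# Keyspace1:DC1=3, Keyspace1:DC2=2, Keyspace1:DC3=1
dc_rf = {"DC1": 3, "DC2": 2, "DC3": 1}

# The keyspace's total replication factor is the sum over datacenters,
# i.e. sum(dc_rf) = RF, so Keyspace1 here has RF = 6.
total_rf = sum(dc_rf.values())
print(total_rf)  # 6
```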