Re: Practical limitations of too many columns/cells ?
No problem. Is there a JIRA ticket already for this?

On Mon, Aug 24, 2015 at 6:06 AM, Jonathan Haddad j...@jonhaddad.com wrote: Can you post your findings to JIRA as well? It would be good to see some real numbers from production. The refactor of the storage engine (8099) may completely change this, but it's good to have it on the radar.

On Sun, Aug 23, 2015 at 10:31 PM Kevin Burton bur...@spinn3r.com wrote: Agreed. We’re going to run a benchmark. Just realized we grew to 144 columns. Fun. Kind of disappointing that Cassandra is so slow in this regard. Kind of defeats the whole point of flexible schema if actually using that feature is slow as hell.

On Sun, Aug 23, 2015 at 4:54 PM, Jeff Jirsa jeff.ji...@crowdstrike.com wrote: The key is to benchmark it with your real data. Modern cassandra-stress lets you get very close to your actual read/write behavior, and the real differentiator will depend on your use case (how often do you write the whole row vs updating just one column/field). My gist shows a ton of different examples, but they’re not scientific, and at this point they’re old versions (and performance varies version to version). - Jeff

From: burtonator2...@gmail.com on behalf of Kevin Burton
Reply-To: user@cassandra.apache.org
Date: Sunday, August 23, 2015 at 2:58 PM
To: user@cassandra.apache.org
Subject: Re: Practical limitations of too many columns/cells ?

Ah, yes. Great benchmarks. If I’m interpreting them correctly, it was ~15x slower for 22 columns vs 2 columns? Guess we have to refactor again :-P Not the end of the world of course.

On Sun, Aug 23, 2015 at 1:53 PM, Jeff Jirsa jeff.ji...@crowdstrike.com wrote: A few months back, a user in #cassandra on freenode mentioned that when they transitioned from thrift to cql, their overall performance decreased significantly. They had 66 columns per table, so I ran some benchmarks with various versions of Cassandra and thrift/cql combinations.
It shouldn’t really surprise you that more columns = more work = slower operations. It’s not necessarily the size of the writes, but the amount of work that needs to be done with the extra cells (2 large columns totaling 2k perform better than 66 small columns totaling 0.66k, even though it’s three times as much raw data being written to disk). https://gist.github.com/jeffjirsa/6e481b132334dfb6d42c

2.0.13, 2 tokens per node, 66 columns, 10 bytes per column, thrift (660 bytes per op):
cassandra-stress --operation INSERT --num-keys 100 --columns 66 --column-size=10 --replication-factor 2 --nodesfile=nodes
Averages from the middle 80% of values: interval_op_rate : 10720

2.0.13, 2 tokens per node, 20 columns, 10 bytes per column, thrift (200 bytes per op):
cassandra-stress --operation INSERT --num-keys 100 --columns 20 --column-size=10 --replication-factor 2 --nodesfile=nodes
Averages from the middle 80% of values: interval_op_rate : 28667

2.0.13, 2 tokens per node, 2 large columns, thrift (2048 bytes per op):
cassandra-stress --operation INSERT --num-keys 100 --columns 2 --column-size=1024 --replication-factor 2 --nodesfile=nodes
Averages from the middle 80% of values: interval_op_rate : 23489

From: burtonator2...@gmail.com on behalf of Kevin Burton
Reply-To: user@cassandra.apache.org
Date: Sunday, August 23, 2015 at 1:02 PM
To: user@cassandra.apache.org
Subject: Practical limitations of too many columns/cells ?

Is there any advantage to using, say, 40 columns per row vs using 2 columns (one for the pk and the other for data) and shoving the data into a BLOB as a JSON object? To date, we’ve just been adding new columns. I profiled Cassandra and about 50% of the CPU time is spent doing compactions. Seeing that Cassandra is CPU-bottlenecked, maybe this is a way I can optimize it. Any thoughts?
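A quick back-of-the-envelope from Jeff’s cassandra-stress numbers above makes the "cell count, not payload size" point concrete. This is just arithmetic on the quoted rates and per-op sizes, nothing more:

```python
# Back-of-the-envelope from the cassandra-stress runs quoted above.
# Format: run name -> (payload bytes per op, interval_op_rate)
runs = {
    "66 x 10B columns": (660, 10720),
    "20 x 10B columns": (200, 28667),
    "2 x 1024B columns": (2048, 23489),
}

# Raw payload throughput in MB/s for each run
throughput = {name: bytes_per_op * rate / 1e6
              for name, (bytes_per_op, rate) in runs.items()}

for name, mb_s in throughput.items():
    print(f"{name}: ~{mb_s:.1f} MB/s of payload")

# The 2-column run moves ~48 MB/s of payload vs ~7 MB/s for 66 columns:
# roughly 7x the raw bytes at more than twice the op rate. Per-cell
# overhead, not raw data volume, dominates the cost.
```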
--
Founder/CEO Spinn3r.com
Location: San Francisco, CA
blog: http://burtonator.wordpress.com
… or check out my Google+ profile https://plus.google.com/102718274791889610666/posts
Re: How can I specify the file_data_directories for a keyspace
At this point, it is only/automatically managed by Cassandra, but if you’re clever with mount points you can probably work around the limitation.

From: Ahmed Eljami
Reply-To: user@cassandra.apache.org
Date: Tuesday, August 25, 2015 at 2:09 AM
To: user@cassandra.apache.org
Subject: How can I specify the file_data_directories for a keyspace

When I define several data_file_directories in cassandra.yaml, is it possible to specify the location of keyspaces and tables, or is it only and automatically managed by Cassandra? Thx. -- Ahmed ELJAMI
Re: lightweight transactions with potential problem?
What an excellent explanation! Thank you a lot. By the way, I do not understand why lightweight transactions in Cassandra have a commit/acknowledgment round-trip. I think we could commit the value within the propose/accept phase. Do you agree? If you do not agree, can you explain why we need commit/acknowledgment? Regards, ibrahim
Re: lightweight transactions with potential problem?
> So you meant that the older ballot will not only be rejected in round-trip 1 (prepare/promise), it can also be rejected in round-trip 2 (propose/accept). Is that correct?

Yes.

> You said: "Or more precisely, you got step 8 wrong: when a replica PROMISEs, the promise is not that it won't promise a ballot older than 2, it's that it won't accept a ballot older than 2." Why is step 8 wrong? I think replicas can accept any highest ballot, so ballot 2 is the highest in step 8? What do you think? Do you also mean a replica can promise an older ballot?

I shouldn't have said "wrong". What I meant is that your description of what a PROMISE means was incomplete. It's true that in practice replicas won't promise older ballots, but that's not the important property in this case; the important property is that they also promise not to accept any older ballot.

> I wish you could make it more clear. Thank you a lot, Sylvain. Ibrahim
>
> On Tue, Aug 25, 2015 at 1:40 PM, Sylvain Lebresne sylv...@datastax.com wrote: That scenario cannot happen. More specifically, your step 12 cannot happen if step 8 has happened. Or more precisely, you got step 8 wrong: when a replica PROMISEs, the promise is not that it won't promise a ballot older than 2, it's that it won't accept a ballot older than 2. Therefore, after step 8, the accept from N1 will be rejected in step 12, and the insert from N1 will be rejected (that is, N1 will restart the whole algorithm with a new ballot). [... quoted 18-step scenario snipped ...]
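Sylvain's point (a PROMISE on ballot 2 is a promise not to *accept* anything older) can be illustrated with a toy Paxos acceptor. This is an illustrative sketch, not Cassandra's actual implementation, and all names are made up:

```python
class Acceptor:
    """Toy Paxos acceptor modelling one replica (N3/N4/N5)."""

    def __init__(self):
        self.promised = 0      # highest ballot promised so far
        self.accepted = None   # (ballot, value) accepted, if any

    def prepare(self, ballot):
        # PROMISE only if this ballot is newer than any promised so far.
        if ballot > self.promised:
            self.promised = ballot
            return True, self.accepted
        return False, None

    def propose(self, ballot, value):
        # The key property: ACCEPT is refused for any ballot older
        # than the highest promise, not just the PREPARE.
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted = (ballot, value)
            return True
        return False

a = Acceptor()
assert a.prepare(1)[0]           # steps 2-3: promise ballot 1 to N1
assert a.prepare(2)[0]           # steps 7-8: promise ballot 2 to N2
assert not a.propose(1, "V1")    # steps 11-12: N1's proposal is REJECTED
assert a.propose(2, "V1")        # steps 13-14: N2's proposal is accepted
```

So the scenario's step 12 cannot happen: once a replica has promised ballot 2, the only value that can be committed is the one proposed under ballot 2, and N1 must restart with a newer ballot.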
Re: lightweight transactions with potential problem?
The rationale for the last commit/ack phase is to store the chosen value (here, the mutation) in durable storage (here, into Cassandra) and to reset the Paxos state, allowing another round of Paxos. More explanation in this blog post: http://www.datastax.com/dev/blog/lightweight-transactions-in-cassandra-2-0 For a detailed explanation of the different Paxos phases, look at these slides: http://www.slideshare.net/doanduyhai/distributed-algorithms-for-big-data-geecon/53

On Tue, Aug 25, 2015 at 6:07 PM, ibrahim El-sanosi ibrahimsaba...@gmail.com wrote: What an excellent explanation! Thank you a lot. By the way, I do not understand why lightweight transactions in Cassandra have a commit/acknowledgment round-trip. I think we could commit the value within the propose/accept phase. Do you agree? If you do not agree, can you explain why we need commit/acknowledgment? Regards, ibrahim
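A toy illustration of that last phase, under the reading above (the blog post's description): an accepted Paxos value lives in a slot separate from the regular table, and commit is what moves it into the normal storage path and clears the slot for the next round. All names here are invented for illustration:

```python
class Replica:
    """Toy model of a replica's LWT state."""

    def __init__(self):
        self.accepted = None   # in-flight Paxos value: (ballot, mutation)
        self.table = []        # durable storage for committed mutations

    def accept(self, ballot, mutation):
        # Propose/accept phase: the value is agreed on but not yet
        # applied as a regular write.
        self.accepted = (ballot, mutation)

    def commit(self, ballot):
        # Commit applies the accepted mutation as a normal write and
        # resets the Paxos slot so a new round can start cleanly.
        if self.accepted and self.accepted[0] == ballot:
            self.table.append(self.accepted[1])
            self.accepted = None
            return True   # the "ack"
        return False

r = Replica()
r.accept(1, "INSERT username=V1")
assert r.table == []                  # accepted, but not yet readable data
assert r.commit(1)                    # commit/ack moves it into the table
assert r.table == ["INSERT username=V1"] and r.accepted is None
```

This is why accept alone is not enough: without the commit phase the chosen value would never land in the regular read path, and the Paxos state could not be reset for subsequent rounds.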
Re: PrepareStatement BUG
Hi, does anybody know how to resolve this problem?

2015-08-23 1:35 GMT+08:00 joseph gao gaojf.bok...@gmail.com: I'm using cassandra 2.1.7 and the DataStax Java driver 2.1.6. Here is the problem: I use a PreparedStatement for a query like: SELECT * FROM somespace.sometable WHERE id = ?, and I cache the PreparedStatement in my JVM. When the table metadata changes (e.g. a column was added) and I use the cached PreparedStatement, the data and the metadata (column definitions) don't match. So I re-prepare the query using session.prepare(sql) again, but I see this code in the async-prepare callback part of SessionManager.java: stmt = cluster.manager.addPrepare(stmt); — this returns the previous PreparedStatement. So it neither re-prepares automatically nor allows the user to re-prepare! Is this a bug, or am I using it like a fool?
--
Joseph Gao
PhoneNum:15210513582
QQ: 409343351
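The usual client-side pattern is to evict the cached statement on a schema change so the next use prepares afresh. Here is a minimal, driver-agnostic sketch of such a cache; the Session/prepare names are stand-ins, not the actual DataStax driver API:

```python
class PreparedCache:
    """Toy prepared-statement cache that can be invalidated on schema change."""

    def __init__(self, session):
        self.session = session
        self.cache = {}

    def get(self, query):
        # Prepare on first use, then reuse the cached statement.
        if query not in self.cache:
            self.cache[query] = self.session.prepare(query)
        return self.cache[query]

    def invalidate(self, query=None):
        # On a schema-change event, drop one entry (or everything)
        # so the next get() re-prepares against the new metadata.
        if query is None:
            self.cache.clear()
        else:
            self.cache.pop(query, None)

class FakeSession:
    """Stand-in for a driver session; counts how often prepare() runs."""
    def __init__(self):
        self.prepared = 0
    def prepare(self, query):
        self.prepared += 1
        return (query, self.prepared)   # stand-in for a PreparedStatement

s = FakeSession()
c = PreparedCache(s)
c.get("SELECT * FROM t WHERE id = ?")
c.get("SELECT * FROM t WHERE id = ?")
assert s.prepared == 1     # cached: prepared only once
c.invalidate()             # schema changed, e.g. ALTER TABLE t ADD col
c.get("SELECT * FROM t WHERE id = ?")
assert s.prepared == 2     # re-prepared after invalidation
```

Note that whether this helps depends on the driver: the code path quoted above suggests the driver itself dedupes re-prepares of the same query string, so a driver upgrade or driver-level fix may still be required.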
Re: Written data is lost and no exception thrown back to the client
I have the same problem. When I bulk load my data, I have a problem with the Cassandra DataStax driver.

<dependency>
  <groupId>com.datastax.cassandra</groupId>
  <artifactId>cassandra-driver-core</artifactId>
  <version>2.1.4</version>
  <!-- Driver 2.1.6, 2.1.7.1 gives problems. Some data is lost. -->
</dependency>

With version 2.1.6 and also with version 2.1.7.1 I have lost records with no error message whatsoever. With version 2.1.4 I have no missing records. I use CL.ONE to write my records. I use RF 3.

On 21 Aug 2015, at 13:06, Robert Wille rwi...@fold3.com wrote: But it shouldn’t matter. I have missing data and no errors, which shouldn’t be possible except with CL=ANY. FWIW, I’m working on some sample code so I can post a Jira. Robert

On Aug 21, 2015, at 5:04 AM, Robert Wille rwi...@fold3.com wrote: RF=1 with QUORUM consistency. I know QUORUM is weird with RF=1, but it should be the same as ONE. It’s QUORUM instead of ONE because production has RF=3, and I was running this against my test cluster with RF=1.

On Aug 20, 2015, at 7:28 PM, Jason jkushm...@rocketfuelinc.com wrote: What consistency level were the writes?

From: Robert Wille rwi...@fold3.com
Sent: 8/20/2015 18:25
To: user@cassandra.apache.org
Subject: Written data is lost and no exception thrown back to the client

I wrote a data migration application which I was testing, and I pushed it too hard; the FlushWriter thread pool blocked, and I ended up with dropped mutation messages. I compared the source data against what is in my cluster, and as expected I have missing records. The strange thing is that my application didn’t error out. I’ve been doing some forensics, and there’s a lot about this that makes no sense and makes me feel very uneasy. I use a lot of asynchronous queries, and I thought it was possible that I had bad error handling, so I checked for errors in other, independent ways.
I have a retry policy that on the first failure logs the error and then requests a retry. On the second failure it logs the error and then rethrows. A few retryable errors appeared in my logs, but no fatal errors. In theory, I should have a fatal error in my logs for any error that gets reported back to the client.

I wrap my Session object, and all queries go through this wrapper. This wrapper logs all query errors. Synchronous queries are wrapped in a try/catch which logs and rethrows. Asynchronous queries use a FutureCallback to log any onFailure invocations. My logs indicate that no errors whatsoever were reported back to me. I do not understand how I can get dropped mutation messages and not know about it.

I am running 2.0.16 with DataStax Java driver 2.0.8. Three-node cluster with RF=1. If someone could help me understand how this can occur, I would greatly appreciate it. A database that errors out is one thing. A database that errors out and makes you think everything was fine is quite another. Thanks, Robert
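One way to double-check that "no errors reported" really means every outcome was observed is to count submissions against completions, so a write whose callback never fired shows up as a gap. Here is a small Python stand-in for the Java FutureCallback wrapper described above; all names are illustrative, not any driver's API:

```python
import concurrent.futures
import threading

class WriteTracker:
    """Counts submitted/succeeded/failed async writes so silent loss shows up."""

    def __init__(self):
        self.lock = threading.Lock()
        self.submitted = 0
        self.succeeded = 0
        self.failed = []

    def track(self, future):
        with self.lock:
            self.submitted += 1
        future.add_done_callback(self._done)
        return future

    def _done(self, future):
        err = future.exception()
        with self.lock:
            if err is None:
                self.succeeded += 1
            else:
                self.failed.append(err)   # log and keep for post-mortem

    def check(self):
        # After draining, submitted must equal succeeded + failed;
        # any gap is a write whose outcome was never observed.
        with self.lock:
            assert self.submitted == self.succeeded + len(self.failed)
            return self.succeeded, len(self.failed)

pool = concurrent.futures.ThreadPoolExecutor(max_workers=2)
tracker = WriteTracker()
for i in range(5):
    # One task (i == 3) deliberately fails to simulate a failed write.
    tracker.track(pool.submit(lambda i=i: i if i != 3 else 1 // 0))
pool.shutdown(wait=True)   # drain so all callbacks have run
ok, bad = tracker.check()
assert (ok, bad) == (4, 1)
```

If a count like this balances while rows are still missing, the loss happened server-side (e.g. dropped mutations acknowledged at the coordinator), which narrows down where to look.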
'no such object in table'
I'm trying to run nodetool from one node, connecting to another. I can successfully connect to the majority of nodes in my ring, but two nodes throw the following error. nodetool: Failed to connect to 'IP:7199' NoSuchObjectException: 'no such object in table'. Any idea why this is happening? Misconfiguration? jas
How can I specify the file_data_directories for a keyspace
When I define several data_file_directories in cassandra.yaml, is it possible to specify the location of keyspaces and tables, or is it *only* and *automatically* managed by Cassandra? Thx. -- Ahmed ELJAMI
Re: abnormal log after remove a node
Hi, I am facing the same issue on 2.0.16. Did you solve this ? How ? I plan to try a rolling restart and see if gossip state recover from this. C*heers, Alain 2015-06-19 11:40 GMT+02:00 曹志富 cao.zh...@gmail.com: I have a C* 2.1.5 with 24 nodes.A few days ago ,I have remove a node from this cluster using nodetool decommission. But tody I find some log like this: INFO [GossipStage:1] 2015-06-19 17:38:05,616 Gossiper.java:968 - InetAddress /172.19.105.41 is now DOWN INFO [GossipStage:1] 2015-06-19 17:38:05,617 StorageService.java:1885 - Removing tokens [-1014432261309809702, -1055322450438958612, -1120728727235087395, -1191392141261832305, -1203676771883970142, -1215563040745505837, -1215648909329054362, -1269531760567530381, -1278047879489577908, -1313427877031136549, -1342822572958042617, -1350792764922315814, -1383390744017639599, -139000372807970456, -140827955201469664, -1631551789771606023, -1633789813430312609, -1795528665156349205, -1836619444785023397, -1879127294549041822, -1962337787208890426, -2022309807234530256, -2033402140526360327, -2089413865145942100, -210961549458416802, -2148530352195763113, -2184481573787758786, -610790268720205, -2340762266634834427, -2513416003567685694, -2520971378752190013, -2596695976621541808, -2620636796023437199, -2640378596436678113, -2679143017361311011, -2721176590519112233, -2749213392354746126, -279267896827516626, -2872377759991294853, -2904711688111888325, -290489381926812623, -3000574339499272616, -301428600802598523, -3019280155316984595, -3024451041907074275, -3056898917375012425, -3161300347260716852, -3166392383659271772, -3327634380871627036, -3530685865340274372, -3563112657791369745, -366930313427781469, -3729582520450700795, -3901838244986519991, -4065326606010524312, -4174346928341550117, -4184239233207315432, -4204369933734181327, -4206479093137814808, -421410317165821100, -4311166118017934135, -4407123461118340117, -4466364858622123151, -4466939645485100087, -448955147512581975, -4587780638857304626, 
-4649897584350376674, -4674234125365755024 , -4833801201210885896, -4857586579802212277, -4868896650650107463, -4980063310159547694, -4983471821416248610, -4992846054037653676, -5026994389965137674, -514302500353679181 0, -5198414516309928594, -5245363745777287346, -5346838390293957674, -5374413419545696184, -5427881744040857637, -5453876964430787287, -5491923669475601173, -55219734138599212 6, -5523011502670737422, -5537121117160410549, -5557015938925208697, -5572489682738121748, -5745899409803353484, -5771239101488682535, -5893479791287484099, -59766730414807540 44, -6014643892406938367, -6086002438656595783, -6129360679394503700, -6224240257573911174, -6290393495130499466, -6378712056928268929, -6430306056990093461, -6800188263839065 013, -6912720411187525051, -7160327814305587432, -7175004328733776324, -7272070430660252577, -7307945744786025148, -742448651973108101, -7539255117639002578, -7657460716997978 94, -7846698077070579798, -7870621904906244395, -7900841391761900719, -7918145426423910061, -7936795453892692473, -8070255024778921411, -8086888710627677669, -8124855925323654 631, -8175270408138820500, -8271197636596881168, -8336685710406477123, -8466220397076441627, -8534337908154758270, -8550484400487603561, -862246738021989870, -8727219287242892 185, -8895705475282612927, -8921801772904834063, -9057266752652143883, -9059183540698454288, -9067986437682229598, -9148183367896132028, -962208188860606543, 10859447725819218 30, 1189775396643491793, 1253728955879686947, 1389982523380382228, 1429632314664544045, 143610053770130548, 150118120072602242, 1575692041584712198, 1624575905722628764, 17894 76212785155173, 1995296121962835019, 2041217364870030239, 2120277336231792146, 2124445736743406711, 2154979704292433983, 2340726755918680765, 23481654796845972, 23620268084352 24407, 2366144489007464626, 2381492708106933027, 2398868971489617398, 2427315953339163528, 2433999003913998534, 2633074510238705620, 266659839023809792, 2677817641360639089, 2 719725410894526151, 
2751925111749406683, 2815703589803785617, 3041515796379693113, 3044903149214270978, 3094954503756703989, 3243933267690865263, 3246086646486800371, 33270068 97333869434, 3393657685587750192, 3395065499228709345, 3426126123948029459, 3500469615600510698, 3644011364716880512, 3693249207133187620, 3776164494954636918, 38780676797 8035, 3872151295451662867, 3937077827707223414, 4041082935346014761, 4060208918173638435, 4086747843759164940, 4165638694482690057, 4203996339238989224, 4220155275330961826, 4 366784953339236686, 4390116924352514616, 4391225331964772681, 4392419346255765958, 4448400054980766409, 4463335839328115373, 4547306976104362915, 4588174843388248100, 48438580 67983993745, 4912719175808770608, 499628843707992459, 5004392861473086088, 5021047773702107258, 510226752691159107, 5109551630357971118, 5157669927051121583, 51627694176199618 24, 5238710860488961530, 5245958115092331518,
Re: Incremental, Sequential repair?
On Tue, Aug 25, 2015 at 2:44 PM, Bryan Cheng br...@blockcypher.com wrote: [2015-08-25 21:36:43,433] It is not possible to mix sequential repair and incremental repairs. Is this a limitation around a specific configuration? Or is it generally true that incremental and sequential repairs are not compatible?

There's a migration process to incremental repairs. http://www.datastax.com/dev/blog/more-efficient-repairs etc. =Rob
Re: 'no such object in table'
On 08/25/2015 02:19 PM, Jason Lewis wrote: I'm trying to run nodetool from one node, connecting to another. I can successfully connect to the majority of nodes in my ring, but two nodes throw the following error. nodetool: Failed to connect to 'IP:7199' NoSuchObjectException: 'no such object in table'. Any idea why this is happening? Misconfiguration?

Possibly. Check those nodes to see if 7199 is listening only on localhost or on some private IP your client node cannot reach (failed to connect). The default is to listen only on localhost, as seen on my machine:

$ netstat -ln | grep 7199
tcp        0      0 127.0.0.1:7199          0.0.0.0:*               LISTEN

JMX configuration is set in conf/cassandra-env.sh - please configure JMX security as documented in that file and/or firewall JMX. Check all your nodes' JMX security configs! :) -- Kind regards, Michael
Commit/acknowledgment phase in CAS?
Hi folks, To achieve linearizable consistency in Cassandra, four round-trips must be performed:

1. Prepare/promise
2. Read/result
3. Propose/accept
4. Commit/acknowledgment

In the last phase of the Paxos protocol (in the white paper) there is only a DECIDE phase, no commit/acknowledgment. DECIDE means telling the learners to apply the accepted value. If the commit/acknowledgment phase in CAS has a similar purpose to DECIDE, then why do we have an acknowledgment round? In fact, I want to know the purpose of the commit/acknowledgment phase in linearizable consistency in Cassandra. I have read http://www.datastax.com/dev/blog/lightweight-transactions-in-cassandra-2-0, but it does not explain the whole picture. I look forward to hearing from you. Ibrahim
Re: Incremental, Sequential repair?
Thanks Robert! To clarify, you're referring to the process of using sstablerepairedset to mark sstables as repaired after a full repair with autocompaction off? We're in the process of doing that throughout our cluster now.

On Tue, Aug 25, 2015 at 3:30 PM, Robert Coli rc...@eventbrite.com wrote: On Tue, Aug 25, 2015 at 2:44 PM, Bryan Cheng br...@blockcypher.com wrote: [2015-08-25 21:36:43,433] It is not possible to mix sequential repair and incremental repairs. Is this a limitation around a specific configuration? Or is it generally true that incremental and sequential repairs are not compatible? There's a migration process to incremental repairs. http://www.datastax.com/dev/blog/more-efficient-repairs etc. =Rob
Re: Incremental, Sequential repair?
On Tue, Aug 25, 2015 at 4:05 PM, Bryan Cheng br...@blockcypher.com wrote: Thanks Robert! To clarify, you're referring to the process using sstablerepairedset to mark sstables as repaired after a full repair with autocompaction off? We're in the process of doing that throughout our cluster now. Yep. As an aside, incremental repair currently doesn't handle some (edge) cases that non-incremental repair does. FWIW, which is not too much! https://issues.apache.org/jira/browse/CASSANDRA-5791 and https://issues.apache.org/jira/browse/CASSANDRA-9947 =Rob
Incremental, Sequential repair?
Hey all, Got a question about incremental repairs; a quick Google search turned up nothing conclusive. In the docs, in a few places, sequential, incremental repairs are mentioned.

From http://docs.datastax.com/en/cassandra/2.1/cassandra/operations/ops_repair_nodes_c.html (indirectly): You can combine repair options, such as parallel and incremental repair.

From http://www.datastax.com/dev/blog/more-efficient-repairs: Incremental repairs can be opted into via the -inc option to nodetool repair. This is compatible with both sequential and parallel (-par) repair.

However, when I try to run an incremental, sequential repair (nodetool repair -inc), I get: [2015-08-25 21:36:43,433] It is not possible to mix sequential repair and incremental repairs.

Is this a limitation around a specific configuration? Or is it generally true that incremental and sequential repairs are not compatible? The cluster is mixed 2.1.8/2.1.7, replication is NetworkTopology, with LeveledCompaction (if it's relevant). Thanks in advance!
lightweight transactions with potential problem?
Hi folks, Cassandra provides *linearizable consistency (CAS, Compare-and-Set)* by using Paxos with 4 round-trips, as follows:

1. Prepare/promise
2. Read/result
3. Propose/accept
4. Commit/acknowledgment

Assume we have an application for registering new accounts, and I want to make sure exactly one user can claim a given account; for example, we do not allow two users to have the same username.

Assume a cluster consisting of 5 nodes N1, N2, N3, N4, and N5, and two concurrent clients C1 and C2. We have replication factor 3, and the partitioner has determined that the replicas for the INSERT example are N3, N4, and N5. The scenario happens in the following order:

1. C1 connects to coordinator N1 and sends INSERT V1 (assume V1 is a username not registered before).
2. N1 sends a PREPARE message with ballot 1 (the highest ballot seen) to N3, N4 and N5. Note that this prepare is for C1 and V1.
3. N3, N4 and N5 send a PROMISE message to N1, promising not to promise any ballot older than ballot 1.
4. N1 sends a READ message to N3, N4 and N5 to read V1.
5. N3, N4 and N5 send a RESULT message to N1, informing it that V1 does not exist, so N1 goes forward to the next round.
6. Now C2 connects to coordinator N2 and sends INSERT V1.
7. N2 sends a PREPARE message with ballot 2 to N3, N4 and N5 (the highest ballot after re-preparing: at first N2 does not know about ballot 1, but eventually it learns of it and uses ballot 2). Note that this prepare is for C2 and V1.
8. N3, N4 and N5 send a PROMISE message to N2, promising not to promise any ballot older than ballot 2.
9. N2 sends a READ message to N3, N4 and N5 to read V1.
10. N3, N4 and N5 send a RESULT message to N2, informing it that V1 does not exist, so N2 goes forward to the next round.
11. Now N1 sends a PROPOSE message (ballot 1, V1) to N3, N4 and N5.
12. N3, N4 and N5 send an ACCEPT message to N1.
13. N2 sends a PROPOSE message (ballot 2, V1) to N3, N4 and N5.
14. N3, N4 and N5 send an ACCEPT message to N2.
15. N1 sends a COMMIT message (ballot 1) to N3, N4 and N5.
16. N3, N4 and N5 send an ACK message to N1.
17. N2 sends a COMMIT message (ballot 2) to N3, N4 and N5.
18. N3, N4 and N5 send an ACK message to N2.

As a result, both V1 from client C1 and V1 from client C2 have been written to replicas N3, N4, and N5, which I think does not achieve the goal of *linearizable consistency and CAS*.

*Is that true, and could such a scenario occur?* I look forward to hearing from you. Regards,
Re: lightweight transactions with potential problem?
That scenario cannot happen. More specifically, your step 12 cannot happen if step 8 has happened. Or more precisely, you got step 8 wrong: when a replica PROMISEs, the promise is not that it won't promise a ballot older than 2, it's that it won't accept a ballot older than 2. Therefore, after step 8, the accept from N1 will be rejected in step 12, and the insert from N1 will be rejected (that is, N1 will restart the whole algorithm with a new ballot).

On Tue, Aug 25, 2015 at 1:54 PM, ibrahim El-sanosi ibrahimsaba...@gmail.com wrote: Hi folks, Cassandra provides *linearizable consistency (CAS, Compare-and-Set)* by using Paxos with 4 round-trips: prepare/promise, read/result, propose/accept, commit/acknowledgment. [... quoted 18-step scenario snipped ...] As a result, both V1 from client C1 and V1 from client C2 have been written to replicas N3, N4, and N5, which I think does not achieve the goal of *linearizable consistency and CAS*. *Is that true, and could such a scenario occur?* I look forward to hearing from you. Regards,
Re: lightweight transactions with potential problem?
OK, I see. So you mean that an older ballot will not only be rejected in round-trip 1 (prepare/promise), it can also be rejected in round-trip 2 (propose/accept). Is that correct?

You said: "Or more precisely, you got step 8 wrong: when a replica PROMISEs, the promise is not that it won't promise a ballot older than 2, it's that it won't accept a ballot older than 2."

Why is step 8 wrong? I think replicas can accept any highest ballot, so ballot 2 is the highest in step 8? What do you think? Do you also mean a replica can promise an older ballot? I wish you could make it more clear. Thank you a lot, Sylvain. Ibrahim

On Tue, Aug 25, 2015 at 1:40 PM, Sylvain Lebresne sylv...@datastax.com wrote: That scenario cannot happen. More specifically, your step 12 cannot happen if step 8 has happened. Or more precisely, you got step 8 wrong: when a replica PROMISEs, the promise is not that it won't promise a ballot older than 2, it's that it won't accept a ballot older than 2. Therefore, after step 8, the accept from N1 will be rejected in step 12, and the insert from N1 will be rejected (that is, N1 will restart the whole algorithm with a new ballot). [... quoted 18-step scenario snipped ...]