Re: Node Dead/Up
On Wed, Oct 24, 2012 at 2:32 PM, aaron morton aa...@thelastpickle.com wrote: I don't see errors in the logs, but I do see a lot of dropped mutations and reads. Any correlation? Yes. The dropped messages mean the server is overloaded. +1. Been there; an overloaded system normally produces frequent dropped mutations and/or reads. Running nodetool tpstats will reveal many indicators. Look for log messages from the GCInspector in /var/log/cassandra/system.log and/or an overloaded IO system; see http://spyced.blogspot.co.nz/2010/01/linux-performance-basics.html Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 24/10/2012, at 1:27 PM, Jason Hill jasonhill...@gmail.com wrote: thanks for the replies. I'll check the load on the node that is reported as DOWN/UP. At first glance it does not appear to be overloaded, but I will dig in deeper; is there a specific indicator on an Ubuntu server that would be useful to me? Also, I didn't make it clear in my original post, but there are logs from 2 different nodes: 10.21 and 10.25. They are each reporting that the other is DOWN/UP at the same time. Would that still point me to the suggestions you made? I don't see errors in the logs, but I do see a lot of dropped mutations and reads. Any correlation? thanks again, Jason On Tue, Oct 23, 2012 at 12:49 AM, aaron morton aa...@thelastpickle.com wrote: check 10.50.10.21 for what is the system load. +1 And take a look in the logs on 10.21. 10.21 is being seen as down by the other nodes. It could be:
* 10.21 failing to gossip fast enough, say by being overloaded into long ParNew GC pauses.
* This node failing to process gossip fast enough, say by being overloaded into long ParNew GC pauses.
* Problems with the tubes used to connect the nodes.
(It's probably the first one.) Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 23/10/2012, at 8:19 PM, Jason Wee peich...@gmail.com wrote: check 10.50.10.21 for what is the system load.
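The nodetool tpstats check suggested above can be scripted; here is a minimal sketch that flags non-zero drop counts. The sample output text is illustrative, and the exact column layout of the "Dropped" section varies between Cassandra versions:

```python
# Parse the dropped-message section of `nodetool tpstats` output and
# flag message types with a non-zero drop count. The sample text is
# illustrative, not captured from a real node.
SAMPLE = """\
Message type           Dropped
RANGE_SLICE                  0
READ_REPAIR                 12
MUTATION                  4817
READ                       933
"""

def dropped_counts(text):
    counts = {}
    in_section = False
    for line in text.splitlines():
        if line.startswith("Message type"):
            in_section = True
            continue
        if in_section and line.strip():
            name, count = line.split()
            counts[name] = int(count)
    return counts

counts = dropped_counts(SAMPLE)
# Non-zero drops suggest the node cannot keep up with its load.
overloaded = {k: v for k, v in counts.items() if v > 0}
print(overloaded)  # {'READ_REPAIR': 12, 'MUTATION': 4817, 'READ': 933}
```

Run periodically (e.g. from cron), a check like this makes the "dropped mutations mean overload" signal visible before nodes start flapping.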
On Tue, Oct 23, 2012 at 10:41 AM, Jason Hill jasonhill...@gmail.com wrote: Hello, I'm on version 1.0.11. I'm seeing this in my system log with some frequency:
INFO [GossipTasks:1] 2012-10-23 02:26:34,449 Gossiper.java (line 818) InetAddress /10.50.10.21 is now dead.
INFO [GossipStage:1] 2012-10-23 02:26:34,620 Gossiper.java (line 804) InetAddress /10.50.10.21 is now UP
INFO [StreamStage:1] 2012-10-23 02:24:38,763 StreamOutSession.java (line 228) Streaming to /10.50.10.25 --this line included for context
INFO [GossipTasks:1] 2012-10-23 02:26:30,603 Gossiper.java (line 818) InetAddress /10.50.10.25 is now dead.
INFO [GossipStage:1] 2012-10-23 02:26:40,763 Gossiper.java (line 804) InetAddress /10.50.10.25 is now UP
INFO [AntiEntropyStage:1] 2012-10-23 02:27:30,249 AntiEntropyService.java (line 233) [repair #5a3383c0-1cb5-11e2--56b66459adef] Sending completed merkle tree to /10.50.10.25 for (Innovari,TICCompressedLoad) --this line included for context
What is this telling me? Is my network dropping for less than a second? Are my nodes really dead and then up? Can someone shed some light on this for me? cheers, Jason
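One way to quantify the flap in a log excerpt like the one above is to diff the "is now dead" / "is now UP" timestamps; a small sketch using two of the lines:

```python
from datetime import datetime

# Two gossip log lines from the excerpt above: node marked dead, then UP.
LOG = [
    "INFO [GossipTasks:1] 2012-10-23 02:26:34,449 Gossiper.java (line 818) InetAddress /10.50.10.21 is now dead.",
    "INFO [GossipStage:1] 2012-10-23 02:26:34,620 Gossiper.java (line 804) InetAddress /10.50.10.21 is now UP",
]

def ts(line):
    # The timestamp is the 3rd and 4th whitespace-separated fields.
    parts = line.split()
    return datetime.strptime(parts[2] + " " + parts[3], "%Y-%m-%d %H:%M:%S,%f")

gap = (ts(LOG[1]) - ts(LOG[0])).total_seconds()
print(gap)  # 0.171 -- this node was marked dead for well under a second
```

A sub-second dead/UP gap like this points at the failure detector reacting to delayed gossip (GC pauses, overload) rather than a real outage.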
Re: What does ReadRepair exactly do?
Keep in mind, returning the older version is usually fine. Just imagine if your user clicked write 1 ms before; then the new version might be returned. If he gets the older version and refreshes the page, he gets the newer version. Same with an automated program as well... in general it is okay to get the older or newer value. If you are reading 2 rows however instead of one, that may change. Dean On 10/23/12 7:04 PM, shankarpnsn shankarp...@gmail.com wrote: manuzhang wrote: why repair again? We block until the consistency constraint is met. Then the latest version is returned and repair is done asynchronously if any mismatch. We may retry the read if fewer columns than required are returned. Just to make sure I understand you correctly, consider the case when a read repair is in flight and a subsequent write affects one or more of the replicas that were scheduled to receive the repair mutations. In this case, are you saying that we return the older version to the user rather than the latest version that was affected by the write? -- View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/What-does-ReadRepair-exactly-do-tp7583261p7583355.html Sent from the cassandra-u...@incubator.apache.org mailing list archive at Nabble.com.
Re: Hinted Handoff runs every ten minutes
On Sun, Oct 21, 2012 at 6:44 PM, aaron morton aa...@thelastpickle.com wrote: I *think* this may be ghost rows which have not been compacted. You would be correct in the case of 1.0.8: https://issues.apache.org/jira/browse/CASSANDRA-3955 -Brandon
Re: Hinted Handoff runs every ten minutes
Is there a workaround other than upgrading? Thanks, Tamar Fraenkel, Senior Software Engineer, TOK Media, ta...@tok-media.com Tel: +972 2 6409736 Mob: +972 54 8356490 Fax: +972 2 5612956 On Wed, Oct 24, 2012 at 1:56 PM, Brandon Williams dri...@gmail.com wrote: On Sun, Oct 21, 2012 at 6:44 PM, aaron morton aa...@thelastpickle.com wrote: I *think* this may be ghost rows which have not been compacted. You would be correct in the case of 1.0.8: https://issues.apache.org/jira/browse/CASSANDRA-3955 -Brandon
Re: What does ReadRepair exactly do?
Hiller, Dean wrote: in general it is okay to get the older or newer value. If you are reading 2 rows however instead of one, that may change. This is certainly interesting, as it could mean that the user could see a value that never met the required consistency. For instance, with 3 replicas R1, R2, R3 and quorum consistency, assume that R1 initiates a read (becomes the coordinator), notices a conflict with R2 (assume R1 has the more recent value) and initiates a read repair with its value. Meanwhile R2 and R3 have seen two different writes with newer values than what was computed by the read repair. If R1 were to respond back to the user with the value that was computed at the time of read repair, wouldn't it be a value that never met the consistency constraint? I was thinking this should trigger another round of repair that tries to reach the consistency constraint with a newer value, or time out, which is the expected case when you don't meet the required consistency. Please let me know if I'm missing something here. -- View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/What-does-ReadRepair-exactly-do-tp7583261p7583366.html Sent from the cassandra-u...@incubator.apache.org mailing list archive at Nabble.com.
Re: Strange row expiration behavior
That worked perfectly: inserting another row after the first compaction, then flushing and compacting again, triggered the empty rows to be removed. Thanks for your help and for clarifying the gcBefore point Aaron. Stephen On Tue, Oct 23, 2012 at 4:47 PM, aaron morton aa...@thelastpickle.com wrote: "In the first example, I am running compaction at step 7 through nodetool." Sorry, missed that. "insert a couple rows with ttl=5 (again, just a small number)": ExpiringColumns are only purged if their TTL has expired AND their absolute (node-local) expiry time occurred before the current gcBefore time. This may explain why the columns were not purged in the first compaction. Can you try your first steps again, and then for the second set of steps add a new row, flush, and compact. The expired rows should be removed. "I don't have to manually delete empty rows after the columns expire": Rows are automatically purged when all columns are purged. Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 24/10/2012, at 3:05 AM, Stephen Mullins smull...@thebrighttag.com wrote: Thanks Aaron, my reply is inline below: On Tue, Oct 23, 2012 at 2:38 AM, aaron morton aa...@thelastpickle.com wrote: "Performing these steps results in the rows still being present using *cassandra-cli list*." I assume you are saying the row key is listed without any columns, aka a ghost row. Correct. "What gets really odd is if I add these steps it works" That's working as designed. gc_grace_seconds does not specify when tombstones must be purged; rather, it specifies the minimum duration the tombstone must be stored. It's really saying: if you compact this column X seconds after the delete, you can purge the tombstone. Minor / automatic compaction will kick in if there are (by default) 4 SSTables of the same size, and will only purge tombstones if all fragments of the row exist in the SSTables being compacted.
Major / manual compaction compacts all the sstables, and so purges the tombstones IF gc_grace_seconds has expired. In your first example compaction had not run, so the tombstones stayed on disk. In the second, the major compaction purged the expired tombstones. In the first example, I am running compaction at step 7 through nodetool, after gc_grace_seconds has expired. Additionally, if I do not perform the manual delete of the row in the second example, the ghost rows are not cleaned up. I want to know that in our production environment I don't have to manually delete empty rows after the columns expire, but I can't get an example working to that effect. Hope that helps. - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 23/10/2012, at 2:49 PM, Stephen Mullins smull...@thebrighttag.com wrote: Hello, I'm seeing Cassandra behavior that I can't explain, on v1.0.12. I'm trying to test removing rows after all columns have expired. I've read the following:
http://wiki.apache.org/cassandra/DistributedDeletes
http://wiki.apache.org/cassandra/MemtableSSTable
https://issues.apache.org/jira/browse/CASSANDRA-2795
And came up with a test to demonstrate the empty row removal that does the following:
1. create a keyspace
2. create a column family with gc_grace_seconds=10 (arbitrary small number)
3. insert a couple rows with ttl=5 (again, just a small number)
4. use nodetool to flush the column family
5. sleep 10 seconds
6. ensure the columns are removed with *cassandra-cli list*
7. use nodetool to compact the keyspace
Performing these steps results in the rows still being present using *cassandra-cli list*. What gets really odd is if I add these steps it works:
1. sleep 5 seconds
2. use cassandra-cli to *del mycf[arow]*
3. use nodetool to flush the column family
4. use nodetool to compact the keyspace
I don't understand why the first set of steps (1-7) don't work to remove the empty row, nor do I understand why the explicit row delete somehow makes this work.
I have all this in a script that I could attach if that's appropriate. Is there something wrong with the steps that I have? Thanks, Stephen
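The purge rule that resolves this thread (an expired column is only dropped when its node-local expiry time falls before gcBefore, i.e. compaction time minus gc_grace_seconds) can be sketched as follows. The function name and the bare-integer timestamps are illustrative, not Cassandra's real internals:

```python
# Sketch of the purge rule Aaron describes: an expired column is only
# dropped by a compaction whose gcBefore (= compaction_time minus
# gc_grace_seconds) is later than the column's local expiration time.
def purgeable(local_expiration, gc_grace_seconds, compaction_time):
    gc_before = compaction_time - gc_grace_seconds
    return local_expiration < gc_before

now = 1000
# A ttl=5 column that expired 100s ago, gc_grace_seconds=10: purged.
print(purgeable(now - 100, 10, now))  # True
# A column that expired only 3s ago: kept by this compaction, which is
# why compacting too soon leaves ghost rows until a later compaction.
print(purgeable(now - 3, 10, now))    # False
```

This is why adding a new row, flushing, and compacting again worked: the second compaction ran late enough that gcBefore had moved past the columns' expiry times.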
Re: What does ReadRepair exactly do?
The user will meet the required consistency unless you encounter some kind of bug in cassandra. You will either get the older value or the newer value. If you read quorum, and maybe a write at CL=1 just happened, you may get the older or newer value depending on whether the node that received the write was involved. If you read quorum and you wrote at CL=QUORUM, then you may get the newer value or the older value depending on who gets there first, so to speak. In your scenario, if the read repair read from R2 just before the write was applied, you get the old value. If it read from R2 just after the write was applied, it gets the new value. BOTH of these met the consistency constraint. A better example to clear this up may be the following... If you read a value at CL=QUORUM, and you have a write 20ms later, you get the old value, right? And it met the consistency level, right? NOW, what about if the write is 1ms later? What if the write is .1ms later? It still met the consistency level, right? If it is .1ms before, you get the new value, as it repairs first with the new node. It is just that when programming, your read may get the newer value or older value, and generally if you write the code in a way that works, this concept works out great in most cases (in some cases, you need to think a bit differently and solve it other ways). I hope that clears it up. Later, Dean On 10/24/12 8:02 AM, shankarpnsn shankarp...@gmail.com wrote: Hiller, Dean wrote: in general it is okay to get the older or newer value. If you are reading 2 rows however instead of one, that may change. This is certainly interesting, as it could mean that the user could see a value that never met the required consistency. For instance with 3 replicas R1, R2, R3 and quorum consistency, assume that R1 initiates a read (becomes the coordinator), notices a conflict with R2 (assume R1 has the more recent value) and initiates a read repair with its value.
Meanwhile R2 and R3 have seen two different writes with newer values than what was computed by the read repair. If R1 were to respond back to the user with the value that was computed at the time of read repair, wouldn't it be a value that never met the consistency constraint? I was thinking this should trigger another round of repair that tries to reach the consistency constraint with a newer value, or time out, which is the expected case when you don't meet the required consistency. Please let me know if I'm missing something here. -- View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/What-does-ReadRepair-exactly-do-tp7583261p7583366.html Sent from the cassandra-u...@incubator.apache.org mailing list archive at Nabble.com.
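Dean's old-vs-new point can be modeled as a toy: a QUORUM read contacts 2 of 3 replicas, returns the value with the highest timestamp among those contacted, and repairs the stale contacted replica. This is a deliberate simplification for illustration, not Cassandra's real read path:

```python
# Toy model of a QUORUM read with RF=3: whether you see the "old" or
# "new" value depends only on which replicas the coordinator contacts.
# Both answers satisfy the consistency contract.
replicas = {
    "R1": ("val1", 100),  # (value, write timestamp)
    "R2": ("val2", 200),  # only R2 has seen the in-flight quorum write
    "R3": ("val1", 100),
}

def quorum_read(contacted):
    value, ts = max((replicas[r] for r in contacted), key=lambda v: v[1])
    # Read repair: push the winning value back to stale contacted replicas.
    for r in contacted:
        if replicas[r][1] < ts:
            replicas[r] = (value, ts)
    return value

print(quorum_read(["R1", "R3"]))  # val1 -- the older value, still valid
print(quorum_read(["R1", "R2"]))  # val2 -- and the repair updates R1
```

After the second read, R1 holds val2, so a subsequent quorum read from any pair returns the newer value, which is the monotonic behavior the thread describes.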
Re: What does ReadRepair exactly do?
And we don't send the read request to all three replicas (R1, R2, R3) if CL=QUORUM; just 2 of them, depending on proximity. On Wed, Oct 24, 2012 at 10:20 PM, Hiller, Dean dean.hil...@nrel.gov wrote: The user will meet the required consistency unless you encounter some kind of bug in cassandra. You will either get the older value or the newer value. [...]
Re: What does ReadRepair exactly do?
I guess one more thing: I completely ignore your second write, mainly because I assume it comes after we already read. So let's say your current state is:
node1 = val1
node2 = val1
node3 = val1
You do a write at quorum of val=2 which is IN the middle!!!
node1 = val1
node2 = val2
node3 = val1 (NOTICE the write is not complete yet)
If you read from node1 and node3, you get val1. If you read from node1 and node2, you get val2 as a read repair will happen. I.e. you always get the older value or the newer value. If you have two writes come in like so:
node1 = val1
node2 = val2
node3 = val3
Well, I think you can figure it out when you do a read ;). If your read quorum reads from node1 and node3, you get val3, etc. etc. This is basically how it works... If your scenario is a web page, a user simply hits the refresh button and sees the values changing. Later, Dean From: Manu Zhang owenzhang1...@gmail.com Date: Wednesday, October 24, 2012 8:26 AM To: user@cassandra.apache.org Subject: Re: What does ReadRepair exactly do? And we don't send the read request to all three replicas (R1, R2, R3) if CL=QUORUM; just 2 of them, depending on proximity. [...]
Re: What does ReadRepair exactly do?
Oh, it would clarify a lot if you go read the source code; the method is o.a.c.service.StorageProxy.fetchRows, if I remember it correctly. On Wed, Oct 24, 2012 at 10:26 PM, Manu Zhang owenzhang1...@gmail.com wrote: And we don't send the read request to all three replicas (R1, R2, R3) if CL=QUORUM; just 2 of them, depending on proximity. [...]
Re: What does ReadRepair exactly do?
Hiller, Dean wrote: I guess one more thing is I completely ignore your second write mainly because I assume it comes after we already read [...] If your scenario is a web page, a user simply hits the refresh button and sees the values changing. Later, Dean
Thanks for the example Dean. This definitely clears things up when you have an overlap between the read and the write, and one comes after the other. I'm still missing how read repairs behave, so I'm extending your example for the following case:
1. node1 = val1; node2 = val1; node3 = val1
2. You do a write operation (W1) with quorum of val=2: node1 = val1; node2 = val2; node3 = val1 (write val2 is not complete yet)
3. Now with a read (R1) from node1 and node2, a read repair will be initiated that needs to write val2 on node1: node1 = val1; node2 = val2; node3 = val1 (read repair val2 is not complete yet)
4. Say, in the meanwhile node1 receives a write val4; the read repair for R1 now arrives at node1 but sees the newer value val4: node1 = val4; node2 = val2; node3 = val1 (write val4 is not complete, read repair val2 not complete)
In this case, for read R1, the value val2 does not have a quorum. Would read R1 return val2 or val4?
Zhang, Manu wrote: And we don't send the read request to all three replicas (R1, R2, R3) if CL=QUORUM; just 2 of them, depending on proximity.
Thanks Zhang. But this again seems a strange thing to do, since one (say R2) of the 2 close replicas (say R1, R2) might be down, resulting in a read failure while there are still enough replicas (R1 and R3) live to satisfy the read. -- View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/What-does-ReadRepair-exactly-do-tp7583261p7583372.html Sent from the cassandra-u...@incubator.apache.org mailing list archive at Nabble.com.
Re: What does ReadRepair exactly do?
Thanks Zhang. But, this again seems a little strange thing to do, since one (say R2) of the 2 close replicas (say R1, R2) might be down, resulting in a read failure while there are still enough replicas (R1 and R3) live to satisfy a read.
He means the case where all 3 nodes are live... If a node is down, it naturally redirects to the other node and still succeeds, because it found 2 nodes even with one node down (feel free to test this live though!).
Thanks for the example Dean. This definitely clears things up when you have an overlap between the read and the write, and one comes after the other. I'm still missing how read repairs behave. [...] In this case, for read R1, the value val2 does not have a quorum. Would read R1 return val2 or val4?
At this point, as Manu suggests, you need to look at the code, but most likely what happens is they lock that row, receive the write in memory (i.e. not losing it) and return to the client, caching it so that as soon as the read repair is over, it will write that next value. I.e. your client would receive val2, and val4 would be the value in the database right after you received val2. I.e. when a client interacts with cassandra and you have tons of writes to a row (val1, val2, val3, val4 in a short time period), just like in a normal database, your client may get one of those 4 values depending on where the read gets inserted in the order of the writes... same as a normal RDBMS. The only thing you don't have is the atomic nature with other rows. NOTICE: they would not have to cache val4 very long, and if a newer write came in, they would just replace it with that newer value and cache that one instead, so it would not be a queue... but this is all just a guess... read the code if you really want to know.
Zhang, Manu wrote: And we don't send the read request to all three replicas (R1, R2, R3) if CL=QUORUM; just 2 of them, depending on proximity. -- View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/What-does-ReadRepair-exactly-do-tp7583261p7583372.html Sent from the cassandra-u...@incubator.apache.org mailing list archive at Nabble.com.
get_paged_slice with SlicePredicate
Hello all, I'm playing around with the get_paged_slice thrift call, and I noticed that it was always returning everything in the row--there's no mechanism for specifying a SlicePredicate. Was that intentional? If so, is there a different way that I can limit what I get back? I'd like to page over many rows, but only have data that is contained in a SlicePredicate be returned. Thanks for your help, Scott Fines
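In the absence of predicate support on that call, one workaround is to filter columns client-side after each page. A sketch, where the page structure and the WANTED set are made-up illustrations rather than real thrift types:

```python
# Client-side filtering after a paged fetch: keep only the columns a
# SlicePredicate would have selected. 'page' stands in for whatever the
# real thrift call returns; the names here are hypothetical.
WANTED = {"name", "status"}

def filter_page(rows):
    # rows: {row_key: [(column_name, value), ...]}
    return {
        key: [(c, v) for c, v in cols if c in WANTED]
        for key, cols in rows.items()
    }

page = {"row1": [("name", "a"), ("junk", "x"), ("status", "ok")]}
print(filter_page(page))  # {'row1': [('name', 'a'), ('status', 'ok')]}
```

The obvious cost is that the full row still crosses the wire; this only saves work in the application, not on the cluster.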
Re: Java 7 support?
And if you want a competitive edge, use it, tune it, take full advantage of the better version (7) and DON'T share. See the problem with not assigning this as a first class task for ASF team? Sent from my iPad On Oct 23, 2012, at 11:12 PM, Eric Evans eev...@acunu.com wrote: On Tue, Oct 16, 2012 at 7:54 PM, Rob Coli rc...@palominodb.com wrote: On Tue, Oct 16, 2012 at 4:45 PM, Edward Sargisson edward.sargis...@globalrelay.net wrote: The Datastax documentation says that Java 7 is not recommended[1]. However, Java 6 is due to EOL in Feb 2013 so what is the reasoning behind that comment? I've asked this approximate question here a few times, with no official response. The reason I ask is that in addition to Java 7 not being recommended, in Java 7 OpenJDK becomes the reference JVM, and OpenJDK is also not recommended. From other channels, I have conjectured that the current advice on Java 7 is it 'works' but is not as extensively tested (and definitely not as commonly deployed) as Java 6. That sounds about right. The best way to change the status quo would be to use Java 7, report any bugs you find, and share your experiences. -- Eric Evans Acunu | http://www.acunu.com | @acunu
Java 7 support?
We have been using cassandra and java7 for months. No problems. A key concept of java is portable binaries. There are sometimes wrinkles with upgrades; if you hit one, undo the upgrade and restart. On Tuesday, October 23, 2012, Eric Evans eev...@acunu.com wrote: [...]
Re: constant CMS GC using CPU time
On Mon, Oct 22, 2012 at 8:38 AM, Bryan Talbot btal...@aeriagames.com wrote:

The nodes with the most data used the most memory. All nodes are affected eventually, not just one. The GC was on-going even when the nodes were not compacting or running a heavy application load -- even when the main app was paused, the constant GC continued.

This sounds very much like "my heap is so consumed by (mostly) bloom filters that I am in steady-state GC thrash." Do you have heap graphs which show a healthy sawtooth GC cycle which then more or less flatlines?

=Rob

--
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb
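As a rough sanity check on the bloom-filter theory, the standard formula bits/key = -ln(p)/ln(2)^2 gives the per-row-key heap cost. The key count and false-positive chance below are assumed example values, not figures from this thread:

```shell
# Rough bloom filter heap estimate using the standard formula
# bits_per_key = -ln(p) / ln(2)^2, with assumed example values:
# 100M row keys on a node and a 1% false-positive chance.
awk 'BEGIN {
  keys = 100000000          # assumed row count (not from the thread)
  p    = 0.01               # assumed false-positive chance
  bits = -log(p) / (log(2)^2)
  printf "%.1f bits/key, ~%.0f MB total\n", bits, keys * bits / 8 / 1048576
}'
# → 9.6 bits/key, ~114 MB total
```

With hundreds of millions of keys per node, this resident memory never goes away between collections, which is consistent with a sawtooth that flatlines.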
Hinted Handoff storage inflation
I'm testing various scenarios in a multi data center configuration. The setup is 10 Cassandra 1.1.5 nodes configured into two data centers, 5 nodes in each DC (RF DC1:3,DC2:3, write consistency LOCAL_QUORUM). I have a synthetic random data generator that I can run, and each run adds roughly 1GiB of data to each node:

DC    Rack   Status  State    Load        Effective-Ownership
DC1   RAC1   Up      Normal   1010.71 MB  60.00%
DC2   RAC1   Up      Normal   1009.08 MB  60.00%
DC1   RAC1   Up      Normal   1.01 GB     60.00%
DC2   RAC1   Up      Normal   1 GB        60.00%
DC1   RAC1   Up      Normal   1.01 GB     60.00%
DC2   RAC1   Up      Normal   1014.45 MB  60.00%
DC1   RAC1   Up      Normal   1.01 GB     60.00%
DC2   RAC1   Up      Normal   1.01 GB     60.00%
DC1   RAC1   Up      Normal   1.01 GB     60.00%
DC2   RAC1   Up      Normal   1.01 GB     60.00%

Now, if I kill all the nodes in DC2 and run the data generator again, I would expect roughly 2GiB to be added to each node in DC1 (local replicas + hints to the other data center); instead I get this:

DC    Rack   Status  State    Load        Effective-Ownership
DC1   RAC1   Up      Normal   17.56 GB    60.00%
DC2   RAC1   Down    Normal   1009.08 MB  60.00%
DC1   RAC1   Up      Normal   17.47 GB    60.00%
DC2   RAC1   Down    Normal   1 GB        60.00%
DC1   RAC1   Up      Normal   17.22 GB    60.00%
DC2   RAC1   Down    Normal   1014.45 MB  60.00%
DC1   RAC1   Up      Normal   16.94 GB    60.00%
DC2   RAC1   Down    Normal   1.01 GB     60.00%
DC1   RAC1   Up      Normal   17.26 GB    60.00%
DC2   RAC1   Down    Normal   1.01 GB     60.00%

Checking the sstables on a node reveals this:

-bash-3.2$ du -hs HintsColumnFamily/
16G     HintsColumnFamily/
-bash-3.2$

So it seems that what I would have expected to be 1GiB of hints is much larger in reality, a 15x-16x inflation. This has a huge impact on write performance as well. If I bring DC2 up again, eventually the load will drop down and even out to 2GiB across the entire cluster. I'm wondering if this inflation is intended, or if it is possibly a bug or something I'm doing wrong? Assuming this inflation is correct, what is the best way to deal with temporary connectivity issues with a second data center? Write performance is paramount in my use case. A 2x-3x overhead is doable, but not 15x-16x.
Thanks, /dml
Re: constant CMS GC using CPU time
On Wed, Oct 24, 2012 at 2:38 PM, Rob Coli rc...@palominodb.com wrote:

On Mon, Oct 22, 2012 at 8:38 AM, Bryan Talbot btal...@aeriagames.com wrote:

The nodes with the most data used the most memory. All nodes are affected eventually, not just one. The GC was on-going even when the nodes were not compacting or running a heavy application load -- even when the main app was paused, the constant GC continued.

This sounds very much like "my heap is so consumed by (mostly) bloom filters that I am in steady state GC thrash."

Yes, I think that was at least part of the issue.

Do you have heap graphs which show a healthy sawtooth GC cycle which then more or less flatlines?

I didn't save any graphs, but that is what they would look like. I was using jstat to monitor gc activity.

-Bryan
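For reference, a minimal sketch of watching GC activity with jstat as Bryan describes; the pid lookup via jps is an assumption about the deployment, adjust the grep pattern to your environment:

```shell
# Find the Cassandra JVM's pid (hypothetical lookup; adjust the
# pattern to however the daemon shows up in jps on your hosts).
CASS_PID=$(jps | grep -i cassandra | awk '{print $1}')

# -gcutil prints heap occupancy percentages plus young/full GC counts
# and accumulated GC time, sampled here once per second. A healthy node
# shows a sawtooth in the old-gen (O) column; an old gen pinned near
# 100% with the FGC count climbing is the "flatline" symptom discussed
# in this thread.
jstat -gcutil "$CASS_PID" 1000
```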
Re: compression
Can you try restarting the node? That would reload the CF meta data and reset the compaction settings. Sorry, that's not very helpful but it's all I can think of for now.

Cheers
- Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 24/10/2012, at 11:41 PM, Tamar Fraenkel ta...@tok-media.com wrote:

Hi!
I tried again, and I see the scrub action in the cassandra logs:

INFO [CompactionExecutor:4029] 2012-10-24 10:36:54,108 CompactionManager.java (line 476) Scrubbing SSTableReader(path='/raid0/cassandra/data/tok/tk_usus_user-hc-339-Data.db')
INFO [CompactionExecutor:4029] 2012-10-24 10:36:54,184 CompactionManager.java (line 658) Scrub of SSTableReader(path='/raid0/cassandra/data/tok/tk_usus_user-hc-339-Data.db') complete: 54 rows in new sstable and 0 empty (tombstoned) rows dropped
INFO [CompactionExecutor:4029] 2012-10-24 10:36:54,185 CompactionManager.java (line 476) Scrubbing SSTableReader(path='/raid0/cassandra/data/tok/tk_usus_user-hc-340-Data.db')
INFO [CompactionExecutor:4029] 2012-10-24 10:36:54,914 CompactionManager.java (line 658) Scrub of SSTableReader(path='/raid0/cassandra/data/tok/tk_usus_user-hc-340-Data.db') complete: 7037 rows in new sstable and 0 empty (tombstoned) rows dropped

I don't see any CompressionInfo.db files, and the compression ratio is still 0.0 on this node only; on the other nodes it is almost 0.5. Any idea?

Thanks,
Tamar Fraenkel
Senior Software Engineer, TOK Media
ta...@tok-media.com
Tel: +972 2 6409736
Mob: +972 54 8356490
Fax: +972 2 5612956

On Wed, Sep 26, 2012 at 3:40 AM, aaron morton aa...@thelastpickle.com wrote:

Check the logs on nodes 2 and 3 to see if the scrub started. The logs on 1 will be a good help with that.

Cheers
- Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 24/09/2012, at 10:31 PM, Tamar Fraenkel ta...@tok-media.com wrote:

Hi!
I ran:

UPDATE COLUMN FAMILY cf_name WITH compression_options={sstable_compression:SnappyCompressor, chunk_length_kb:64};

I then ran on all my nodes (3):

sudo nodetool -h localhost scrub tok cf_name

I have replication factor 3. The size of the data on disk was cut in half on the first node, and in JMX I can see that indeed the compression ratio is 0.46. But on nodes 2 and 3 nothing happened. In JMX I can see that the compression ratio is 0 and the size of the files on disk stayed the same. In cli:

ColumnFamily: cf_name
  Key Validation Class: org.apache.cassandra.db.marshal.UUIDType
  Default column value validator: org.apache.cassandra.db.marshal.UTF8Type
  Columns sorted by: org.apache.cassandra.db.marshal.CompositeType(org.apache.cassandra.db.marshal.UTF8Type,org.apache.cassandra.db.marshal.UTF8Type)
  Row cache size / save period in seconds / keys to save : 0.0/0/all
  Row Cache Provider: org.apache.cassandra.cache.SerializingCacheProvider
  Key cache size / save period in seconds: 20.0/14400
  GC grace seconds: 864000
  Compaction min/max thresholds: 4/32
  Read repair chance: 1.0
  Replicate on write: true
  Bloom Filter FP chance: default
  Built indexes: []
  Compaction Strategy: org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy
  Compression Options:
    chunk_length_kb: 64
    sstable_compression: org.apache.cassandra.io.compress.SnappyCompressor

Can anyone help?

Thanks,
Tamar Fraenkel
Senior Software Engineer, TOK Media
ta...@tok-media.com
Tel: +972 2 6409736
Mob: +972 54 8356490
Fax: +972 2 5612956

On Mon, Sep 24, 2012 at 8:37 AM, Tamar Fraenkel ta...@tok-media.com wrote:

Thanks all, that helps. Will start with one or two CFs and let you know the effect.

Tamar Fraenkel
Senior Software Engineer, TOK Media
ta...@tok-media.com
Tel: +972 2 6409736
Mob: +972 54 8356490
Fax: +972 2 5612956

On Sun, Sep 23, 2012 at 8:21 PM, Hiller, Dean dean.hil...@nrel.gov wrote:

As well, your unlimited column names may all have the same prefix, right?
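One quick per-node check of whether the scrub actually rewrote sstables compressed, based on the thread's own observation that compressed sstables carry a CompressionInfo.db component. The data path and keyspace name ('tok') follow the thread's examples; adjust to your layout:

```shell
# Sketch: check whether this node's sstables were rewritten compressed.
# Compressed sstables have a CompressionInfo.db component next to the
# Data.db file (path and keyspace 'tok' taken from this thread).
ls /raid0/cassandra/data/tok/*CompressionInfo.db 2>/dev/null \
  || echo "no compressed sstables on this node yet"
```

If a node shows no CompressionInfo.db files after the scrub, its sstables were not rewritten, which matches the 0.0 compression ratio seen in JMX on that node.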
Like accounts.rowkey56, accounts.rowkey78, etc., so the accounts prefix gets a ton of compression then.

Later,
Dean

From: Tyler Hobbs ty...@datastax.com
Reply-To: user@cassandra.apache.org
Date: Sunday, September 23, 2012 11:46 AM
To: user@cassandra.apache.org
Subject: Re: compression

column metadata, you're still likely to get a reasonable amount of compression. This is especially true if there is some amount of repetition in the column names, values, or TTLs in wide rows. Compression will almost always be beneficial unless you're already somehow
Re: Hinted Handoff runs every ten minutes
Thanks. I thought it had been addressed before but couldn't find the ticket.

Cheers
- Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 25/10/2012, at 12:56 AM, Brandon Williams dri...@gmail.com wrote:

On Sun, Oct 21, 2012 at 6:44 PM, aaron morton aa...@thelastpickle.com wrote:

I *think* this may be ghost rows which have not been compacted.

You would be correct in the case of 1.0.8: https://issues.apache.org/jira/browse/CASSANDRA-3955

-Brandon
Re: Hinted Handoff storage inflation
Hints store the columns, row key, KS name and CF id(s) for each mutation to each node, whereas an executed mutation will store the most recent columns collated with others under the same row key. So depending on the type of mutation, hints will take up more space. The worst case would be lots of overwrites. After that, writing a small amount of data to many rows would result in a lot of the serialised space being devoted to row keys, KS name and CF id. 16GB is a lot though. What was the write workload like?

You can get an estimate of the number of keys in the Hints CF using nodetool cfstats. Also, some metrics in JMX will tell you how many hints are stored.

This has a huge impact on write performance as well.

Yup. Hints are added to the same Mutation thread pool as normal mutations. They are processed async to the mutation request, but they still take resources to store. You can adjust how long hints are collected for with max_hint_window_in_ms in the yaml file. How long did the test run for?

Cheers
- Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 25/10/2012, at 11:26 AM, Mattias Larsson mlars...@yahoo-inc.com wrote:

I'm testing various scenarios in a multi data center configuration. The setup is 10 Cassandra 1.1.5 nodes configured into two data centers, 5 nodes in each DC (RF DC1:3,DC2:3, write consistency LOCAL_QUORUM).
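A minimal sketch of the checks suggested above, using Cassandra 1.1-era tooling; the data directory path is an assumption (the thread's du output used a relative path):

```shell
# Estimated key count and other stats for the stored hints
# (HintsColumnFamily lives in the system keyspace in Cassandra 1.1):
nodetool -h localhost cfstats | grep -A 12 "Column Family: HintsColumnFamily"

# On-disk size of the hints sstables (path assumed; adjust to your
# data_file_directories setting):
du -hs /var/lib/cassandra/data/system/HintsColumnFamily/
```

To bound hint accumulation during a long DC outage, max_hint_window_in_ms in cassandra.yaml caps how long hints are collected for a dead node, e.g. `max_hint_window_in_ms: 3600000` for one hour (example value, not a recommendation from the thread).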
Re: Java 7 support?
Are you using OpenJDK or the Oracle JDK? I know OpenJDK became the reference implementation as of Java 7, but I'm still not sure.

On 25 October 2012 05:42, Edward Capriolo edlinuxg...@gmail.com wrote:

We have been using cassandra and java7 for months. No problems. A key concept of java is portable binaries. There are sometimes wrinkles with upgrades. If you hit one, undo the upgrade and restart.
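A quick way to answer the vendor question on a given node (a sketch; the exact wording of the version banner varies by JVM build):

```shell
# Print the version banner of the JVM on the PATH; OpenJDK builds
# identify themselves as "OpenJDK Runtime Environment", Oracle builds
# as "Java(TM) SE Runtime Environment". Note java -version writes to
# stderr, hence the redirect.
java -version 2>&1 | head -2

# Resolve which java binary is actually being picked up:
readlink -f "$(command -v java)"
```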
Re: Java 7 support?
FWIW, we're using openjdk7 on most of our clusters. For those where we are still on openjdk6, it's not because of an issue - just haven't gotten to rolling out the upgrade yet. We haven't had any issues that I recall with upgrading the JDK. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)