Re: Completely removing a node from the cluster
I ran into this. I also tried load_ring_state=false which also did not help. The way I got through this was to stop the entire cluster and start the nodes one-by-one. I realize this is not a practical solution for everyone, but if you can afford to stop the cluster for a few minutes, it's worth a try.
On Aug 23, 2011, at 9:26 AM, aaron morton wrote: I'm running low on ideas for this one. Anyone else? If the phantom node is not listed in the ring, other nodes should not be storing hints for it. You can see what nodes they are storing hints for via JConsole. You can try a rolling restart passing the JVM opt -Dcassandra.load_ring_state=false However if the phantom node is being passed around in the gossip state it will probably just come back again. Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com
On 23/08/2011, at 3:49 PM, Bryce Godfrey wrote: Could this ghost node be causing my hints column family to grow to this size? I also crash after about 24 hours due to commit log growth taking up all the drive space. A manual nodetool flush keeps it under control though.
Column Family: HintsColumnFamily
SSTable count: 6
Space used (live): 666480352
Space used (total): 666480352
Number of Keys (estimate): 768
Memtable Columns Count: 1043
Memtable Data Size: 461773
Memtable Switch Count: 3
Read Count: 38
Read Latency: 131.289 ms.
Write Count: 582108
Write Latency: 0.019 ms.
Pending Tasks: 0
Key cache capacity: 7
Key cache size: 6
Key cache hit rate: 0.8334
Row cache: disabled
Compacted row minimum size: 2816160
Compacted row maximum size: 386857368
Compacted row mean size: 120432714
Is there a way for me to manually remove this dead node?
-Original Message- From: Bryce Godfrey [mailto:bryce.godf...@azaleos.com] Sent: Sunday, August 21, 2011 9:09 PM To: user@cassandra.apache.org Subject: RE: Completely removing a node from the cluster It's been at least 4 days now.
-Original Message- From: aaron morton [mailto:aa...@thelastpickle.com] Sent: Sunday, August 21, 2011 3:16 PM To: user@cassandra.apache.org Subject: Re: Completely removing a node from the cluster I see the mistake I made about ring: it gets the endpoint list from the same place but uses the tokens to drive the whole process. I'm guessing here, don't have time to check all the code. But there is a 3 day timeout in the gossip system. Not sure if it applies in this case. Anyone know? Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com
On 22/08/2011, at 6:23 AM, Bryce Godfrey wrote: Both .2 and .3 list the same from the mbean: Unreachable is an empty collection, and Live node lists all 3 nodes still: 192.168.20.2 192.168.20.3 192.168.20.1 The removetoken was done a few days ago, and I believe the remove was done from .2 Here is what the ring output looks like, not sure why I get that token on the empty first line either:
Address         DC          Rack   Status State   Load      Owns    Token
                                                                    85070591730234615865843651857942052864
192.168.20.2    datacenter1 rack1  Up     Normal  79.53 GB  50.00%  0
192.168.20.3    datacenter1 rack1  Up     Normal  42.63 GB  50.00%  85070591730234615865843651857942052864
Yes, both nodes show the same thing when doing a describe cluster, that .1 is unreachable.
-Original Message- From: aaron morton [mailto:aa...@thelastpickle.com] Sent: Sunday, August 21, 2011 4:23 AM To: user@cassandra.apache.org Subject: Re: Completely removing a node from the cluster Unreachable nodes either did not respond to the message or were known to be down and were not sent a message.
The node lists for the ring command and describe cluster are obtained the same way. So it's a bit odd. Can you connect to JMX and have a look at the o.a.c.db.StorageService MBean? What do the LiveNodes and UnreachableNodes attributes say? Also how long ago did you remove the token and on which machine? Do both 20.2 and 20.3 think 20.1 is still around? Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com
On 20/08/2011, at 9:48 AM, Bryce Godfrey wrote: I'm on 0.8.4. I have removed a dead node from the cluster using the nodetool removetoken command, and moved one of the remaining nodes to rebalance the tokens. Everything looks fine when I run nodetool ring now, as
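For reference, a minimal sketch of the rolling-restart suggestion above, assuming a 0.7/0.8-style install where JVM options are set in conf/cassandra-env.sh; host names and the JMX port are placeholders:

  # add to conf/cassandra-env.sh so the node rebuilds its ring view from gossip at startup:
  #   JVM_OPTS="$JVM_OPTS -Dcassandra.load_ring_state=false"
  nodetool -h <host> -p <jmx_port> drain    # flush memtables and stop accepting writes
  # restart Cassandra with your init script / service wrapper, one node at a time
  nodetool -h <host> -p <jmx_port> ring     # confirm the phantom endpoint no longer appears

As Aaron notes, if another node is still gossiping the phantom endpoint it can come back, so re-check the StorageService LiveNodes/UnreachableNodes attributes afterwards.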
Re: Re: Urgent:!! Re: Need to maintenance on a cassandra node, are there problems with this process
Hi - From what I understand, Peter's recommendation should work for you. They have both worked for me. No need to copy anything by hand on the new node. Bootstrap/repair does that for you. From the Wiki: If a node goes down entirely, then you have two options:
(Recommended approach) Bring up the replacement node with a new IP address, set initial token to (failure node's token) - 1 and AutoBootstrap set to true in cassandra.yaml (storage-conf.xml for 0.6 or earlier). This will place the replacement node in front of the failure node. Then the bootstrap process begins. While this process runs, the node will not receive reads until finished. Once this process is finished on the replacement node, run nodetool removetoken once, supplying the token of the dead node, and nodetool cleanup on each node. You can obtain the dead node's token by running nodetool ring on any live node, unless there was some kind of outage and the others came up but not the down one -- in that case, you can retrieve the token from the live nodes' system tables.
(Alternative approach) Bring up a replacement node with the same IP and token as the old, and run nodetool repair. Until the repair process is complete, clients reading only from this node may get no data back. Using a higher ConsistencyLevel on reads will avoid this.
On , Anand Somani meatfor...@gmail.com wrote: Let me be specific on lost data - we lost a replica; the other 2 nodes have replicas, and I am running read/write at quorum. At this point I have turned off my clients from talking to this node. So if that is the case I can potentially just nodetool repair (without changing IP). But would it be better if I copied over the data/mykeyspace from another replica and then run repair?
On Fri, Aug 19, 2011 at 11:20 AM, Peter Schuller peter.schul...@infidyne.com wrote: ok, so we just lost the data on that node. We are building the raid on it, but once it is up what is the best way to bring it back into the cluster? You're saying the raid failed and data is gone? Just let it come up and run nodetool repair, or copy data from another node and then run nodetool repair? Do I still need to run repair immediately if I copy the data? I want to schedule repair for later during non-peak hours. If data is gone, the safe way is to have it re-join the cluster: http://wiki.apache.org/cassandra/Operations#Handling_failure But note that in your case, since you've lost data (if I understand you), it's effectively a completely new node. That means you either want to switch its IP address and go for the recommended approach, or do the other option, but that WILL mean the node is serving reads with incorrect data, violating consistency. Depending on your application, this may or may not be acceptable. Unless it's a major problem for you, I suggest bringing it back in with a new IP address and having it treated like a completely fresh replacement node. Probably decreases the risk of mistakes happening. As for the other stuff about repair in the e-mail you pasted; periodic repairs are part of regular cluster maintenance. See: http://wiki.apache.org/cassandra/Operations#Frequency_of_nodetool_repair -- / Peter Schuller (@scode on twitter)
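As a sketch of the wiki's recommended approach quoted above (token values are placeholders, not real tokens; verify the option names against your own cassandra.yaml):

  # cassandra.yaml on the replacement node (new IP address):
  #   auto_bootstrap: true
  #   initial_token: <dead node's token minus 1>
  # once bootstrap has finished, from any live node:
  nodetool -h <live_host> -p <jmx_port> removetoken <dead node's token>
  # then on each node in the cluster:
  nodetool -h <host> -p <jmx_port> cleanup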
upgrade from 0.7.6 to 0.8.4
Hi - sorry if this was asked before but I couldn't find any answers about it. Is the upgrade path from 0.7.6 to 0.8.4 possible via a simple rolling restart? Are nodes with these different versions compatible - i.e., can one node be upgraded in order to see if we run into any problems before upgrading the others?
Re: Re: Cassandra start/stop scripts
A simple kill without -9 should work. Have you tried that?
On , Jason Pell jasonmp...@gmail.com wrote: Check out the RPM packages for Cassandra; they have init.d scripts that work very nicely. There are debs as well for Ubuntu. Sent from my iPhone
On Jul 27, 2011, at 3:19, Priyanka priya...@gmail.com wrote: I do it the same way...
On Tue, Jul 26, 2011 at 1:07 PM, mcasandra wrote: I need to write a cassandra start/stop script. Currently I run cassandra to start and kill -9 to stop. Is this the best way? kill -9 doesn't sound right :) Wondering how others do it.
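For what it's worth, a gentler stop than kill -9 might look like the following sketch; the pid file path and JMX port depend entirely on your packaging:

  PID=$(cat /var/run/cassandra.pid 2>/dev/null || pgrep -f CassandraDaemon)
  nodetool -h localhost -p <jmx_port> drain   # flush memtables and stop accepting writes first
  kill "$PID"                                 # plain SIGTERM, as suggested above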
eliminate need to repair by using column TTL??
One of the main reasons for regularly running repair is to make sure deletes are propagated in the cluster, ie, data is not resurrected if a node never received the delete call. And repair-on-read takes care of repairing inconsistencies on-the-fly. So if I were to set a universal TTL on all columns - so everything would only live for a certain age, would I be able to get away without having to do regular repairs with nodetool? I realize this scenario would not be applicable for everyone, but our data model would allow us to do this. So could this be an alternative to running the (resource-intensive, long-running) repairs with nodetool? Thanks.
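For illustration, writing a column with a TTL from the 0.8-era CLI looks roughly like the sketch below (2592000 seconds is about 30 days; the keyspace and column family names are made up, and the exact syntax should be checked with `help set` on your version):

  use MyKeyspace;
  set MyCF['some_key']['some_col'] = 'some_value' with ttl = 2592000;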
Re: Re: eliminate need to repair by using column TTL??
Good points Aaron. I realize now how expensive read repairs are. I'm going to keep doing repairs regularly but still have a max TTL on all columns to make sure we don't have really old data we no longer need getting buried in the cluster.
On , aaron morton aa...@thelastpickle.com wrote: Read repair will only repair data that is read on the nodes that are up at that time, and does not guarantee that any changes it detects will be written back to the nodes. The diff mutations are async fire-and-forget messages which may go missing or be dropped or ignored by the recipient just like any other message. Also getting hit with a bunch of read repair operations is pretty painful. The normal read runs, the coordinator detects the digest mismatch, the read runs again from all nodes and they all have to return their full data (no digests this time), the coordinator detects the diffs, mutations are sent back to each node that needs them. All this happens synchronously with the read request, even when the CL is ONE. That's 2 reads with more network IO and up to RF mutations. The delete thing is important but repair also reduces the chance of reads getting hit with RR and gives me confidence when it's necessary to nuke a bad node. Your plan may work but it feels risky to me. You may end up with worse read performance and unpleasant emotions if you ever have to nuke a node. Others may disagree. Not ignoring the fact that repair can take a long time, fail, hurt performance etc. There are plans to improve it though. Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com
On 22 Jul 2011, at 19:55, jonathan.co...@gmail.com wrote: One of the main reasons for regularly running repair is to make sure deletes are propagated in the cluster, ie, data is not resurrected if a node never received the delete call. And repair-on-read takes care of repairing inconsistencies on-the-fly. So if I were to set a universal TTL on all columns - so everything would only live for a certain age, would I be able to get away without having to do regular repairs with nodetool? I realize this scenario would not be applicable for everyone, but our data model would allow us to do this. So could this be an alternative to running the (resource-intensive, long-running) repairs with nodetool? Thanks.
Repair question - why is so much data transferred?
I regularly run repair on my cassandra cluster. However, I often see that during the repair operation very large amounts of data are transferred to other nodes. My question is, if only some data is out of sync, why are entire Data files being transferred?
/var/lib/cassandra/data/DFS/main-f-893-Data.db sections=2602 progress=22942842880/63149903764 - 36%
/var/lib/cassandra/data/DFS/main-f-946-Data.db sections=1437 progress=0/65991601 - 0%
/var/lib/cassandra/data/DFS/main-f-907-Data.db sections=2602 progress=0/1635822909 - 0%
My guess is that since data in the Data files is immutable, it needs to copy the entire file over, then I assume a compaction would take place to consolidate the data. But that's just my wild guess. Can anyone explain this behavior?
Re: Re: Repair question - why is so much data transferred?
From ticket 2818: One (reasonably simple) proposition to fix this would be to have repair schedule validation compactions across nodes one by one (ie, one CF/range at a time), waiting for all nodes to return their tree before submitting the next request. Then on each node, we should make sure that the node will start the validation compaction as soon as requested. For that, we probably want to have a specific executor for validation compaction... This was the way I thought repair worked. Anyway, in our case, we only have one CF, so I'm not sure if both issues apply to my situation. Thanks. Looking forward to the release where these 2 things are fixed.
On , Jonathan Ellis jbel...@gmail.com wrote: On Thu, Jul 21, 2011 at 9:14 AM, Jonathan Colby jonathan.co...@gmail.com wrote: I regularly run repair on my cassandra cluster. However, I often see that during the repair operation very large amounts of data are transferred to other nodes. https://issues.apache.org/jira/browse/CASSANDRA-2280 https://issues.apache.org/jira/browse/CASSANDRA-2816 My question is, if only some data is out of sync, why are entire Data files being transferred? Repair streams ranges of files as a unit (which becomes a new file on the target node) rather than using the normal write path. -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
Re: Decorator Algorithm
Thanks guys. That clears things up.
On Jun 24, 2011, at 4:53 AM, Maki Watanabe wrote: A little addendum:
Key := Your data to identify a row
Token := Index on the ring calculated from Key. The calculation is defined by the partitioner.
You can look up the responsible nodes (endpoints) for a specific key with the JMX getNaturalEndpoints interface. maki
2011/6/24 aaron morton aa...@thelastpickle.com: Various places in the code call IPartitioner.decorateKey() which returns a DecoratedKey<T> which contains both the original key and the Token<T>. The RandomPartitioner uses MD5 to hash the key ByteBuffer and create a BigInteger. OPP converts the key into a UTF-8 encoded String. Using the token to find which endpoints contain replicas is done by the AbstractReplicationStrategy.calculateNaturalEndpoints() implementations. Does that help? - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com
On 23 Jun 2011, at 19:58, Jonathan Colby wrote: Hi - I'd like to understand more how the token is hashed with the key to determine on which node the data is stored - called decorating in cassandra speak. Can anyone share any documentation on this or describe this more in detail? Yes, I could look at the code, but I was hoping to be able to read more about how it works first. thanks.
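As a rough illustration of what the RandomPartitioner's decoration amounts to (not Cassandra's actual code path - compare with FBUtilities in your version), the token is essentially the absolute value of the key's MD5 digest read as a signed big-endian integer:

  python3 -c "import hashlib,sys; d=hashlib.md5(sys.argv[1].encode()).digest(); print(abs(int.from_bytes(d,'big',signed=True)))" mykey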
Decorator Algorithm
Hi - I'd like to understand more how the token is hashed with the key to determine on which node the data is stored - called decorating in cassandra speak. Can anyone share any documentation on this or describe this more in detail? Yes, I could look at the code, but I was hoping to be able to read more about how it works first. thanks.
Re: insufficient space to compact even the two smallest files, aborting
A compaction will be triggered when the min number of similarly sized SSTable files is found. So what's actually the purpose of the max part of the threshold?
On Jun 23, 2011, at 12:55 AM, aaron morton wrote: Setting them to 2 and 2 means compaction can only ever compact 2 files at a time, so it will be worse off. Let's try the following:
- restore the compaction settings to the default 4 and 32
- run `ls -lah` in the data dir and grab the output
- run `nodetool flush`; this will trigger minor compaction once the memtables have been flushed
- check the logs for messages from 'CompactionManager'
- when done grab the output from `ls -lah` again.
Hope that helps. - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com
On 23 Jun 2011, at 02:04, Héctor Izquierdo Seliva wrote: Hi All. I set the compaction threshold at minimum 2, maximum 2 and try to run compact, but it's not doing anything. There are over 69 sstables now, read performance is horrible, and it's taking an insane amount of space. Maybe I don't quite get how the new per-bucket stuff works, but I think this is not normal behaviour.
On Mon, 13-06-2011 at 10:32 -0500, Jonathan Ellis wrote: As Terje already said in this thread, the threshold is per bucket (group of similarly sized sstables) not per CF.
2011/6/13 Héctor Izquierdo Seliva izquie...@strands.com: I was already way over the minimum. There were 12 sstables. Also, is there any reason why scrub got stuck? I did not see anything in the logs. Via jmx I saw that the scrubbed bytes were equal to the size of one of the sstables, and it stuck there for a couple of hours.
On Mon, 13-06-2011 at 22:55 +0900, Terje Marthinussen wrote: That most likely happened just because after scrub you had new files and got over the 4 file minimum limit. https://issues.apache.org/jira/browse/CASSANDRA-2697 is the bug report.
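A sketch of Aaron's suggested sequence using 0.7/0.8-era commands; the keyspace/CF names and the log path are examples, and the thresholds can also be restored per CF from the CLI with `update column family MyCF with min_compaction_threshold = 4 and max_compaction_threshold = 32;` (verify against your version):

  ls -lah /var/lib/cassandra/data/MyKeyspace/
  nodetool -h <host> -p <jmx_port> flush MyKeyspace
  grep CompactionManager /var/log/cassandra/system.log   # or wherever your log4j config writes
  ls -lah /var/lib/cassandra/data/MyKeyspace/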
simple question about merged SSTable sizes
The way compaction works, x same-sized files are merged into a new SSTable. This repeats itself and the SSTables get bigger and bigger. So what is the upper limit?? If you are not deleting stuff fast enough, wouldn't the SSTable sizes grow indefinitely? I ask because we have some rather large SSTable files (80-100 GB) and I'm starting to worry about future compactions. Second, compacting such large files is an IO killer. What can be tuned other than compaction_threshold to help optimize this and prevent the files from getting too big? Thanks!
Re: simple question about merged SSTable sizes
Thanks for the explanation. I'm still a bit skeptical. So if you really needed to control the maximum size of compacted SSTables, you need to delete data at such a rate that the new files created by compaction are less than or equal to the sum of the segments being merged. Is anyone else running into really large compacted SSTables that gave you trouble with hard disk capacity? How did you deal with it? We have 1 TB disks in our nodes, but keeping in mind we need to have at least 50% for the worst case compaction scenario I'm still a bit worried that one day we're going to hit a dead end. On Jun 22, 2011, at 6:50 PM, Eric tamme wrote: On Wed, Jun 22, 2011 at 12:35 PM, Jonathan Colby jonathan.co...@gmail.com wrote: The way compaction works, x same-sized files are merged into a new SSTable. This repeats itself and the SSTable get bigger and bigger. So what is the upper limit?? If you are not deleting stuff fast enough, wouldn't the SSTable sizes grow indefinitely? I ask because we have some rather large SSTable files (80-100 GB) and I'm starting to worry about future compactions. Second, compacting such large files is an IO killer.What can be tuned other than compaction_threshold to help optimize this and prevent the files from getting too big? Thanks! The compaction is an iterative process that first compacts uncompacted SSTables and removes tombstones etc. This compaction takes multiple files and merges them into one SSTable. This process repeats until you have compaction_threshold=X number of similarly sized SSTables, then those will get re-compacted (merged) together. The number and size of SSTables that you have as a result of a flush is tuned by max size, or records, or time. Contrary to what you might believe, having fewer larger SSTables reduces IO compared to compacting many small SSTables. Also the merge operation of previously compacted SSTables is relatively fast. As far as I know, cassandra will continue compacting SSTables into an indefinitely larger sized SSTable. The tunable side of things is for adjusting when to flush memtable to SSTable, and the number of SSTables of similar size that must be present to execute a compaction. -Eric
Re: simple question about merged SSTable sizes
So the take-away is: try to avoid major compactions at all costs! Thanks Ed and Eric.
On Jun 22, 2011, at 7:00 PM, Edward Capriolo wrote: Yes, if you are not deleting fast enough they will grow. This is not specifically a cassandra problem; /var/log/messages has the same issue. There is a JIRA ticket about having a maximum size for SSTables, so they always stay manageable. You fall into a small trap when you force major compaction in that many small tables turn into one big one, and from there it is hard to get back to many smaller ones again. The other side of the coin is that if you do not major compact you can end up with much more disk usage than live data (i.e., a large % of the disk is overwrites and tombstones). You can tune the compaction rate now so compaction does not kill your IO. Generally I think avoiding really large SSTables is the best way to go. Scale out and avoid very large SSTables/node if possible. Edward
On Wed, Jun 22, 2011 at 12:35 PM, Jonathan Colby jonathan.co...@gmail.com wrote: The way compaction works, x same-sized files are merged into a new SSTable. This repeats itself and the SSTables get bigger and bigger. So what is the upper limit?? If you are not deleting stuff fast enough, wouldn't the SSTable sizes grow indefinitely? I ask because we have some rather large SSTable files (80-100 GB) and I'm starting to worry about future compactions. Second, compacting such large files is an IO killer. What can be tuned other than compaction_threshold to help optimize this and prevent the files from getting too big? Thanks!
Re: simple question about merged SSTable sizes
Thanks Ryan. Done that :) 1 TB is the striped size. We might look into bigger disks for our blades.
On Jun 22, 2011, at 7:09 PM, Ryan King wrote: On Wed, Jun 22, 2011 at 10:00 AM, Jonathan Colby jonathan.co...@gmail.com wrote: Thanks for the explanation. I'm still a bit skeptical. So if you really needed to control the maximum size of compacted SSTables, you need to delete data at such a rate that the new files created by compaction are less than or equal to the sum of the segments being merged. Is anyone else running into really large compacted SSTables that gave you trouble with hard disk capacity? How did you deal with it? We have 1 TB disks in our nodes, but keeping in mind we need to have at least 50% free for the worst-case compaction scenario, I'm still a bit worried that one day we're going to hit a dead end. You should stripe those disks together with RAID-0. -ryan
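For completeness, striping two data disks with Linux md might look like the following sketch; the device names and filesystem choice are assumptions, so adjust for your own hardware:

  mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sdb /dev/sdc
  mkfs.ext4 /dev/md0
  mount /dev/md0 /var/lib/cassandra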
Re: simple question about merged SSTable sizes
Awesome tip on TTL. We can really use this as a catch-all to make sure all columns are purged based on time. Fits our use case well. I forgot this feature existed.
On Jun 22, 2011, at 7:11 PM, Eric tamme wrote: Second, compacting such large files is an IO killer. What can be tuned other than compaction_threshold to help optimize this and prevent the files from getting too big? Thanks! Just a personal implementation note - I make heavy use of column TTL, so I have very specifically tuned cassandra to have a pretty constant max disk usage based on my data insertion rate, the TTL, the memtable flush threshold, and min compaction threshold. My data basically lives for 7 days and depending on where it is in the compaction cycle goes from 130 gigs per node up to 160 gigs per node. If setting TTL is an option for you, it is one way to auto-purge data and keep overall size in check. -Eric
Re: New web client future API
I just took a look at the demo. This is really great stuff! I will try this on our cluster as soon as possible. I like this because it gives people not too familiar with the cassandra CLI or Thrift a way to query cassandra data.
On Jun 20, 2011, at 10:56 AM, Markus Wiesenbacher | Codefreun.de wrote: Should work now ... Sent from my iPhone
On 20.06.2011 at 09:28, Andrey V. Panov panov.a...@gmail.com wrote: How do I download it? Your download war-file opens just a blank page :(
On 14/06/2011, Markus Wiesenbacher | Codefreun.de m...@codefreun.de wrote: I just released an early version of my web client (http://www.codefreun.de/apollo) which is Thrift-based, and therefore I would like to know what the future is ...
Re: jsvc hangs shell
jsvc is not very flexible. Check the wrapper software out; we swear by it. http://wrapper.tanukisoftware.com/doc/english/download.jsp
On Jun 17, 2011, at 2:52 AM, Ken Brumer wrote: Anton Belyaev anton.belyaev at gmail.com writes: I guess it is not trivial to modify the package to make it use JSW instead of JSVC. I am still not sure the JSVC itself is the culprit. Maybe something is wrong in my setup. I am seeing similar behavior using the Brisk Debian packages for Maverick: http://www.datastax.com/docs/0.8/brisk/install_brisk_packages#installing-the-brisk-packaged-releases Not sure if it's my configuration, but I verified it on two separate installs. -Ken
Re: Re: minor vs major compaction and purging data
Cleanup removes any data the node is no longer responsible for, according to the node's token range. A node can have data it is no longer responsible for if you do certain maintenance operations like move or loadbalance.
On , Sebastien Coutu sco...@openplaces.org wrote: How about cleanups? What would be the difference between cleanup and compactions?
On Sat, Jun 11, 2011 at 8:14 AM, Jonathan Ellis jbel...@gmail.com wrote: Yes.
On Sat, Jun 11, 2011 at 6:08 AM, Jonathan Colby jonathan.co...@gmail.com wrote: I've been reading inconsistent descriptions of what major and minor compactions do. So my question for clarification: Are tombstones purged (i.e., space reclaimed) for minor AND major compactions? Thanks. -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
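In practice that means something like the following after a move or removetoken, run one node at a time (the keyspace name is an example; older nodetool versions also accept no keyspace argument and then clean all keyspaces):

  nodetool -h <host> -p <jmx_port> cleanup MyKeyspace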
minor vs major compaction and purging data
I've been reading inconsistent descriptions of what major and minor compactions do. So my question for clarification: Are tombstones purged (i.e., space reclaimed) for minor AND major compactions? Thanks.
Compacting Large Row
I'm seeing this in my logs. We are storing emails in cassandra and some of them might be rather large. Is this bad? What exactly is happening when this appears?
INFO [CompactionExecutor:1] 2011-06-11 13:39:19,217 CompactionIterator.java (line 150) Compacting large row 39653235326331302d626530362d346339362d383966302d646338366366353237663565 (67149805 bytes) incrementally
INFO [CompactionExecutor:1] 2011-06-11 13:40:55,215 CompactionIterator.java (line 150) Compacting large row 63343864303464622d336336332d343036322d386130392d343737373766343439643539 (70605320 bytes) incrementally
INFO [CompactionExecutor:1] 2011-06-11 13:43:27,353 CompactionIterator.java (line 150) Compacting large row 39353463363062612d646364612d346137382d613838652d633130613439663664353532 (72450230 bytes) incrementally
INFO [CompactionExecutor:1] 2011-06-11 13:46:04,439 CompactionIterator.java (line 150) Compacting large row 613634392d656135332d343565382d393265662d303336363731666365376439 (72007535 bytes) incrementally
INFO [CompactionExecutor:1] 2011-06-11 13:46:57,517 CompactionIterator.java (line 150) Compacting large row 31636532356365332d323566632d343535382d623232312d363934636538333432323330 (75976735 bytes) incrementally
Thanks Jon
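For context, the "Compacting large row ... incrementally" message generally means the row is bigger than the in-memory compaction limit, so it is compacted in passes instead of being deserialized whole; rows below the limit are compacted entirely in memory. The relevant cassandra.yaml knob in 0.7/0.8 is shown below (64 MB is the usual shipped default, verify on your version):

  # cassandra.yaml
  in_memory_compaction_limit_in_mb: 64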
after a while nothing happening with repair
When I run repair on a node in my 0.7.6-2 cluster, the repair starts to stream data and activity is seen in the logs. However, after a while (a day or so) it seems like everything freezes up. The repair command is still running (the command prompt has not returned) and netstats shows output similar to below. All streams are at 0% and nothing is happening. The logs indicate that things were started but there is no indication if anything is in fact still active. For example, this is the last log entry related to repair, just this morning:
INFO [StreamStage:1] 2011-06-09 07:13:21,423 StreamOut.java (line 173) Stream context metadata [/var/lib/cassandra/data/DFS/main-f-144-Data.db sections=2 progress=0/31947748 - 0%, /var/lib/cassandra/data/DFS/main-f-145-Data.db sections=2 progress=0/25786564 - 0%, /var/lib/cassandra/data/DFS/main-f-143-Data.db sections=2 progress=0/5830103399 - 0%], 9 sstables.
INFO [StreamStage:1] 2011-06-09 07:13:21,423 StreamOutSession.java (line 174) Streaming to /10.46.108.104
However, netstats on all related nodes looks something like this. The nodes continue to handle read/write requests just fine. They are not overloaded at all. Any advice would be greatly appreciated. Because repairs seem like they never finish, I have a feeling we have a lot of garbage data in our cluster.
/opt/cassandra/bin/nodetool -h $HOSTNAME -p 35014 netstats
Mode: Normal
Not sending any streams.
Streaming from: /10.46.108.104
DFS: /var/lib/cassandra/data/DFS/main-f-209-Data.db sections=2 progress=0/276461810 - 0%
DFS: /var/lib/cassandra/data/DFS/main-f-153-Data.db sections=2 progress=0/100340568 - 0%
DFS: /var/lib/cassandra/data/DFS/main-f-40-Data.db sections=2 progress=0/62726190502 - 0%
DFS: /var/lib/cassandra/data/DFS/main-f-180-Data.db sections=1 progress=0/158898493 - 0%
DFS: /var/lib/cassandra/data/DFS/main-f-109-Data.db sections=2 progress=0/87250515569 - 0%
Streaming from: /10.47.108.102
DFS: /var/lib/cassandra/data/DFS/main-f-304-Data.db sections=2 progress=0/13563864214 - 0%
DFS: /var/lib/cassandra/data/DFS/main-f-350-Data.db sections=1 progress=0/2877129955 - 0%
DFS: /var/lib/cassandra/data/DFS/main-f-379-Data.db sections=2 progress=0/143804948 - 0%
DFS: /var/lib/cassandra/data/DFS/main-f-370-Data.db sections=2 progress=0/683716174 - 0%
DFS: /var/lib/cassandra/data/DFS/main-f-371-Data.db sections=2 progress=0/56650 - 0%
DFS: /var/lib/cassandra/data/DFS/main-f-368-Data.db sections=2 progress=0/4005533616 - 0%
DFS: /var/lib/cassandra/data/DFS/main-f-369-Data.db sections=2 progress=0/155515922 - 0%
Streaming from: /10.46.108.103
DFS: /var/lib/cassandra/data/DFS/main-f-888-Data.db sections=2 progress=0/158096259 - 0%
DFS: /var/lib/cassandra/data/DFS/main-f-828-Data.db sections=1 progress=0/29508276 - 0%
DFS: /var/lib/cassandra/data/DFS/main-f-886-Data.db sections=2 progress=0/133704150 - 0%
DFS: /var/lib/cassandra/data/DFS/main-f-759-Data.db sections=2 progress=0/83629797522 - 0%
DFS: /var/lib/cassandra/data/DFS/main-f-889-Data.db sections=2 progress=0/96903803 - 0%
DFS: /var/lib/cassandra/data/DFS/main-f-751-Data.db sections=2 progress=0/17944852950 - 0%
Streaming from: /10.46.108.101
DFS: /var/lib/cassandra/data/DFS/main-f-1318-Data.db sections=2 progress=0/60617216778 - 0%
DFS: /var/lib/cassandra/data/DFS/main-f-1179-Data.db sections=2 progress=0/11870790009 - 0%
DFS: /var/lib/cassandra/data/DFS/main-f-1324-Data.db sections=2 progress=0/710603722 - 0%
DFS: /var/lib/cassandra/data/DFS/main-f-1322-Data.db sections=2 progress=0/5844992187 - 0%
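A few hedged ways to check whether a stalled-looking repair is still doing anything (command names from 0.7-era nodetool; stage names can differ slightly between versions):

  nodetool -h <host> -p <jmx_port> compactionstats   # a running validation compaction means merkle trees are still being built
  nodetool -h <host> -p <jmx_port> tpstats           # look for AntiEntropyStage / StreamStage activity
  nodetool -h <host> -p <jmx_port> netstats          # stream progress, as shown above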
fixing unbalanced cluster !?
I got myself into a situation where one node (10.47.108.100) has a lot more data than the other nodes. In fact, the 1 TB disk on this node is almost full. I added 3 new nodes and let cassandra automatically calculate new tokens by taking the highest loaded nodes. Unfortunately there is still a big token range this node is responsible for (5113... - 85070...). Yes, I know that one option would be to rebalance the entire cluster with move, but this is an extremely time-consuming and error-prone process because of the amount of data involved. Our RF = 3 and we read/write at quorum. The nodes have been repaired so I think the data should be in good shape. Question: Can I get myself out of this mess without installing new nodes? I was thinking of either decommission or removetoken to have the cluster rebalance itself. Then re-bootstrap this node to a new token.
Address         Status State   Load       Owns    Token
                                                   127605887595351923798765477786913079296
10.46.108.100   Up     Normal  218.52 GB  25.00%  0
10.46.108.101   Up     Normal  260.04 GB  12.50%  21267647932558653966460912964485513216
10.46.108.104   Up     Normal  286.79 GB  17.56%  51138582157040063602728874106478613120
10.47.108.100   Up     Normal  874.91 GB  19.94%  85070591730234615865843651857942052863
10.47.108.102   Up     Normal  302.79 GB  4.16%   92156241323118845370666296304459139297
10.47.108.103   Up     Normal  242.02 GB  4.16%   99241191538897700272878550821956884116
10.47.108.101   Up     Normal  439.9 GB   8.34%   113427455640312821154458202477256070484
10.46.108.103   Up     Normal  304 GB     8.33%   127605887595351923798765477786913079296
Re: fixing unbalanced cluster !?
Thanks Ben. That's what I was afraid I had to do. I can see how it's a lot easier if you simply double the cluster when adding capacity. Jon
On Jun 9, 2011, at 4:44 PM, Benjamin Coverston wrote: Because you were able to successfully run repair, you can follow up with a nodetool cleanup which will get rid of some of the extraneous data on that (bigger) node. You're also assured after you run repair that entropy between the nodes is minimal. Assuming you're using the random partitioner: To balance your ring I would start by calculating the new token locations, then moving each of your nodes backwards along their owned range to their new locations. From the script on http://wiki.apache.org/cassandra/Operations your new balanced tokens would be:
0
21267647932558653966460912964485513216
42535295865117307932921825928971026432
63802943797675961899382738893456539648
85070591730234615865843651857942052864
106338239662793269832304564822427566080
127605887595351923798765477786913079296
148873535527910577765226390751398592512
From this you can see that 10.46.108.{100, 101} are already in the right place so you don't have to do anything with those nodes. Proceed with moving 10.46.108.104 to its new token; the safest way to do this would be to use nodetool move. Another way to do it could be to run a removetoken followed by re-adding the node into the ring at its new location. The risk here is that if you do not at least repair after re-joining the ring (and before you move the next node in the ring) then some of the data on that node would be ignored as it would now fall out of the owned range, so it's good practice to immediately run repair on a node that you do a removetoken / re-join on. The rest of your balancing should be an iteration on the above steps, moving through the range.
On 6/9/11 6:21 AM, Jonathan Colby wrote: I got myself into a situation where one node (10.47.108.100) has a lot more data than the other nodes. In fact, the 1 TB disk on this node is almost full. I added 3 new nodes and let cassandra automatically calculate new tokens by taking the highest loaded nodes. Unfortunately there is still a big token range this node is responsible for (5113... - 85070...). Yes, I know that one option would be to rebalance the entire cluster with move, but this is an extremely time-consuming and error-prone process because of the amount of data involved. Our RF = 3 and we read/write at quorum. The nodes have been repaired so I think the data should be in good shape. Question: Can I get myself out of this mess without installing new nodes? I was thinking of either decommission or removetoken to have the cluster rebalance itself. Then re-bootstrap this node to a new token.
Address         Status State   Load       Owns    Token
                                                   127605887595351923798765477786913079296
10.46.108.100   Up     Normal  218.52 GB  25.00%  0
10.46.108.101   Up     Normal  260.04 GB  12.50%  21267647932558653966460912964485513216
10.46.108.104   Up     Normal  286.79 GB  17.56%  51138582157040063602728874106478613120
10.47.108.100   Up     Normal  874.91 GB  19.94%  85070591730234615865843651857942052863
10.47.108.102   Up     Normal  302.79 GB  4.16%   92156241323118845370666296304459139297
10.47.108.103   Up     Normal  242.02 GB  4.16%   99241191538897700272878550821956884116
10.47.108.101   Up     Normal  439.9 GB   8.34%   113427455640312821154458202477256070484
10.46.108.103   Up     Normal  304 GB     8.33%   127605887595351923798765477786913079296
-- Ben Coverston Director of Operations DataStax -- The Apache Cassandra Company http://www.datastax.com/
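For reference, the balanced tokens Ben lists can be reproduced with a one-liner along the lines of the wiki script (RandomPartitioner, 8 nodes):

  python -c "num=8; print('\n'.join(str(i*(2**127//num)) for i in range(num)))"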
no additional log output after running repair
I'm trying to run a repair on a 0.7.6-2 node. After running the repair command, this line shows up in the cassandra.log, but nothing else. It's been hours. Nothing is seen in the logs from other servers or with nodetool commands like netstats or tpstats. How do I know if the repair is actually going on or not? This is incredibly frustrating.
INFO [manual-repair-9629edfc-7ae9-4626-b90a-2aa6eb1e8224] 2011-05-31 14:05:25,625 AntiEntropyService.java (line 786) Waiting for repair requests: [#TreeRequest manual-repair-9629edfc-7ae9-4626-b90a-2aa6eb1e8224, /10.47.108.100, (DFS,main), #TreeRequest manual-repair-9629edfc-7ae9-4626-b90a-2aa6eb1e8224, /10.47.108.103, (DFS,main), #TreeRequest manual-repair-9629edfc-7ae9-4626-b90a-2aa6eb1e8224, /10.46.108.103, (DFS,main), #TreeRequest manual-repair-9629edfc-7ae9-4626-b90a-2aa6eb1e8224, /10.46.108.101, (DFS,main)]
Jon
Re: exception when adding a node replication factor (3) exceeds number of endpoints (1) - SOLVED
OK, it seems a phantom node (one that was removed from the cluster) kept being passed around in gossip as a down endpoint and was messing up the gossip algorithm. I had the luxury of being able to stop the entire cluster and bring the nodes up one by one. That purged the bad node from gossip. Not sure if there was a more elegant way to do that.
On Fri, May 27, 2011 at 9:28 AM, jonathan.co...@gmail.com wrote: Anyone have any idea what this could mean? This is a cluster of 7 nodes, I'm trying to add the 8th node.
INFO [FlushWriter:1] 2011-05-27 09:22:40,495 Memtable.java (line 164) Completed flushing /var/lib/cassandra/data/system/Migrations-f-1-Data.db (6358 bytes)
INFO [FlushWriter:1] 2011-05-27 09:22:40,496 Memtable.java (line 157) Writing Memtable-Schema@60230368(2363 bytes, 3 operations)
INFO [FlushWriter:1] 2011-05-27 09:22:40,562 Memtable.java (line 164) Completed flushing /var/lib/cassandra/data/system/Schema-f-1-Data.db (2513 bytes)
INFO [GossipStage:1] 2011-05-27 09:22:40,829 Gossiper.java (line 610) Node /10.46.108.104 is now part of the cluster
ERROR [GossipStage:1] 2011-05-27 09:22:40,845 DebuggableThreadPoolExecutor.java (line 103) Error in ThreadPoolExecutor
java.lang.IllegalStateException: replication factor (3) exceeds number of endpoints (1)
at org.apache.cassandra.locator.OldNetworkTopologyStrategy.calculateNaturalEndpoints(OldNetworkTopologyStrategy.java:100)
at org.apache.cassandra.locator.AbstractReplicationStrategy.getAddressRanges(AbstractReplicationStrategy.java:196)
at org.apache.cassandra.service.StorageService.calculatePendingRanges(StorageService.java:945)
at org.apache.cassandra.service.StorageService.calculatePendingRanges(StorageService.java:896)
at org.apache.cassandra.service.StorageService.handleStateBootstrap(StorageService.java:707)
at org.apache.cassandra.service.StorageService.onChange(StorageService.java:648)
at org.apache.cassandra.service.StorageService.onJoin(StorageService.java:1124)
at org.apache.cassandra.gms.Gossiper.handleMajorStateChange(Gossiper.java:643)
at org.apache.cassandra.gms.Gossiper.handleNewJoin(Gossiper.java:611)
at org.apache.cassandra.gms.Gossiper.applyStateLocally(Gossiper.java:690)
at org.apache.cassandra.gms.GossipDigestAck2VerbHandler.doVerb(GossipDigestAck2VerbHandler.java:60)
at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:72)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
ERROR [GossipStage:1] 2011-05-27 09:22:40,847 AbstractCassandraDaemon.java (line 112) Fatal exception in thread Thread[GossipStage:1,5,main]
java.lang.IllegalStateException: replication factor (3) exceeds number of endpoints (1)
at org.apache.cassandra.locator.OldNetworkTopologyStrategy.calculateNaturalEndpoints(OldNetworkTopologyStrategy.java:100)
at org.apache.cassandra.locator.AbstractReplicationStrategy.getAddressRanges(AbstractReplicationStrategy.java:196)
at org.apache.cassandra.service.StorageService.calculatePendingRanges(StorageService.java:945)
at org.apache.cassandra.service.StorageService.calculatePendingRanges(StorageService.java:896)
at org.apache.cassandra.service.StorageService.handleStateBootstrap(StorageService.java:707)
at org.apache.cassandra.service.StorageService.onChange(StorageService.java:648)
at org.apache.cassandra.service.StorageService.onJoin(StorageService.java:1124)
at org.apache.cassandra.gms.Gossiper.handleMajorStateChange(Gossiper.java:643)
at org.apache.cassandra.gms.Gossiper.handleNewJoin(Gossiper.java:611)
at org.apache.cassandra.gms.Gossiper.applyStateLocally(Gossiper.java:690)
at org.apache.cassandra.gms.GossipDigestAck2VerbHandler.doVerb(GossipDigestAck2VerbHandler.java:60)
at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:72)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
new thing going on with repair in 0.7.6??
It might just not have occurred to me in the previous 0.7.4 version, but when I do a repair on a node in v0.7.6, it seems like data is also synced with neighboring nodes. My understanding of repair is that the data is reconciled on the node being repaired, i.e., data is removed or added to that node based on reading the data on other nodes. I read another thread about a bug which results in the entire data being streamed over when you don't specify a CF. But in my case, we only have one CF - we're using cassandra as a simple key/value store so I don't think it applies to my setup. This is a netstats on the node being repaired. Note how everything is streaming out to other nodes. Is this a bug or an improvement?
Mode: Normal
Streaming to: /10.47.108.103
/var/lib/cassandra/data/DFS/main-f-1833-Data.db sections=2542 progress=6243767484/48128279825 - 12%
/var/lib/cassandra/data/DFS/main-f-1886-Data.db sections=2146 progress=0/748205318 - 0%
/var/lib/cassandra/data/DFS/main-f-1854-Data.db sections=2542 progress=0/47640938847 - 0%
/var/lib/cassandra/data/DFS/main-f-1851-Data.db sections=2502 progress=0/1587416504 - 0%
/var/lib/cassandra/data/DFS/main-f-1892-Data.db sections=1409 progress=0/175226826 - 0%
/var/lib/cassandra/data/DFS/main-f-1850-Data.db sections=1108 progress=0/107442430 - 0%
/var/lib/cassandra/data/DFS/main-f-1859-Data.db sections=2542 progress=0/81697265819 - 0%
Streaming to: /10.46.108.103
/var/lib/cassandra/data/DFS/main-f-1854-Data.db sections=72 progress=0/303912581 - 0%
/var/lib/cassandra/data/DFS/main-f-1851-Data.db sections=71 progress=0/24604460 - 0%
/var/lib/cassandra/data/DFS/main-f-1892-Data.db sections=26 progress=0/30900263 - 0%
/var/lib/cassandra/data/DFS/main-f-1850-Data.db sections=19 progress=0/150012 - 0%
/var/lib/cassandra/data/DFS/main-f-1859-Data.db sections=72 progress=0/436200262 - 0%
Streaming to: /10.46.108.101
/var/lib/cassandra/data/DFS/main-f-1892-Data.db sections=193 progress=0/54332711 - 0%
/var/lib/cassandra/data/DFS/main-f-1851-Data.db sections=693 progress=0/52937963 - 0%
/var/lib/cassandra/data/DFS/main-f-1850-Data.db sections=135 progress=0/1323107 - 0%
/var/lib/cassandra/data/DFS/main-f-1859-Data.db sections=702 progress=0/4220897850 - 0%
Nothing streaming from /10.47.108.103
average repair/bootstrap durations
Hi - Operations like repair and bootstrap on nodes in our cluster (average load 150GB each) take a very long time. By long I mean 1-2 days. With nodetool netstats I can see the progress % very slowly advancing. I guess there are some throttling mechanisms built into cassandra. And yes there is also production load on these nodes so it is somewhat understandable. Also some of our compacted data files are 50-60 GB each. I was just wondering if these times are similar to what other people are experiencing or if there is a serious configuration problem with our setup. So what have you guys seen with operations like loadbalance, repair, cleanup, bootstrap on nodes with large amounts of data?? I'm not seeing too many full garbage collections. Other minor GCs are well under a second.
Setup info:
0.7.4
5 GB heap
8 GB ram
64 bit linux os
AMD quad core HP blades
CMS Garbage collector with default cassandra settings
1 TB raid 0 sata disks
The cluster spans 2 datacenters, but operations within the same dc take very long too. This is a netstat output of a bootstrap that has been going on for 3+ hours:
Mode: Normal
Streaming to: /10.47.108.103
/var/lib/cassandra/data/DFS/main-f-1541-Data.db/(0,32842490722),(32842490722,139556639427),(139556639427,161075890783) progress=94624588642/161075890783 - 58%
/var/lib/cassandra/data/DFS/main-f-1455-Data.db/(0,660743002) progress=0/660743002 - 0%
/var/lib/cassandra/data/DFS/main-f-1444-Data.db/(0,32816130132),(32816130132,71465138397),(71465138397,90968640033) progress=0/90968640033 - 0%
/var/lib/cassandra/data/DFS/main-f-1540-Data.db/(0,931632934),(931632934,2621052149),(2621052149,3236107041) progress=0/3236107041 - 0%
/var/lib/cassandra/data/DFS/main-f-1488-Data.db/(0,33428780851),(33428780851,110546591227),(110546591227,110851587206) progress=0/110851587206 - 0%
/var/lib/cassandra/data/DFS/main-f-1542-Data.db/(0,24091168),(24091168,97485080),(97485080,108233211) progress=0/108233211 - 0%
/var/lib/cassandra/data/DFS/main-f-1544-Data.db/(0,3646406),(3646406,18065308),(18065308,25776551) progress=0/25776551 - 0%
/var/lib/cassandra/data/DFS/main-f-1452-Data.db/(0,676616940) progress=0/676616940 - 0%
/var/lib/cassandra/data/DFS/main-f-1548-Data.db/(0,6957269),(6957269,48966550),(48966550,51499779) progress=0/51499779 - 0%
/var/lib/cassandra/data/DFS/main-f-1552-Data.db/(0,237153399),(237153399,750466875),(750466875,898056853) progress=0/898056853 - 0%
/var/lib/cassandra/data/DFS/main-f-1554-Data.db/(0,45155582),(45155582,195640768),(195640768,247592141) progress=0/247592141 - 0%
/var/lib/cassandra/data/DFS/main-f-1449-Data.db/(0,2812483216) progress=0/2812483216 - 0%
/var/lib/cassandra/data/DFS/main-f-1545-Data.db/(0,107648943),(107648943,434575065),(434575065,436667186) progress=0/436667186 - 0%
Not receiving any streams.
Pool Name    Active  Pending  Completed
Commands     n/a     0        134283
Responses    n/a     0        192438
Re: average repair/bootstrap durations
Thanks Ed! I was thinking about surrendering more memory to mmap operations. I'm going to try bringing the Xmx down to 4G On Fri, May 27, 2011 at 5:19 PM, Edward Capriolo edlinuxg...@gmail.com wrote: On Fri, May 27, 2011 at 9:08 AM, Jonathan Colby jonathan.co...@gmail.com wrote: Hi - Operations like repair and bootstrap on nodes in our cluster (average load 150GB each) take a very long time. By long I mean 1-2 days. With nodetool netstats I can see the progress % very slowly progressing. I guess there are some throttling mechanisms built into cassandra. And yes there is also production load on these nodes so it is somewhat understandable. Also some of out compacted data files are as 50-60 GB each. I was just wondering if these times are similar to what other people are experiencing or if there is a serious configuration problem with our setup. So what have you guys seen with operations like loadbalance,repair, cleanup, bootstrap on nodes with large amounts of data?? I'm not seeing too many full garbage collections. Other minor GCs are well under a second. Setup info: 0.7.4 5 GB heap 8 GB ram 64 bit linux os AMD quad core HP blades CMS Garbage collector with default cassandra settings 1 TB raid 0 sata disks across 2 datacenters, but operations within the same dc take very long too. This is a netstat output of a bootstrap that has been going on for 3+ hours: Mode: Normal Streaming to: /10.47.108.103 /var/lib/cassandra/data/DFS/main-f-1541-Data.db/(0,32842490722),(32842490722,139556639427),(139556639427,161075890783) progress=94624588642/161075890783 - 58% /var/lib/cassandra/data/DFS/main-f-1455-Data.db/(0,660743002) progress=0/660743002 - 0% /var/lib/cassandra/data/DFS/main-f-1444-Data.db/(0,32816130132),(32816130132,71465138397),(71465138397,90968640033) progress=0/90968640033 - 0% /var/lib/cassandra/data/DFS/main-f-1540-Data.db/(0,931632934),(931632934,2621052149),(2621052149,3236107041) progress=0/3236107041 - 0% /var/lib/cassandra/data/DFS/main-f-1488-Data.db/(0,33428780851),(33428780851,110546591227),(110546591227,110851587206) progress=0/110851587206 - 0% /var/lib/cassandra/data/DFS/main-f-1542-Data.db/(0,24091168),(24091168,97485080),(97485080,108233211) progress=0/108233211 - 0% /var/lib/cassandra/data/DFS/main-f-1544-Data.db/(0,3646406),(3646406,18065308),(18065308,25776551) progress=0/25776551 - 0% /var/lib/cassandra/data/DFS/main-f-1452-Data.db/(0,676616940) progress=0/676616940 - 0% /var/lib/cassandra/data/DFS/main-f-1548-Data.db/(0,6957269),(6957269,48966550),(48966550,51499779) progress=0/51499779 - 0% /var/lib/cassandra/data/DFS/main-f-1552-Data.db/(0,237153399),(237153399,750466875),(750466875,898056853) progress=0/898056853 - 0% /var/lib/cassandra/data/DFS/main-f-1554-Data.db/(0,45155582),(45155582,195640768),(195640768,247592141) progress=0/247592141 - 0% /var/lib/cassandra/data/DFS/main-f-1449-Data.db/(0,2812483216) progress=0/2812483216 - 0% /var/lib/cassandra/data/DFS/main-f-1545-Data.db/(0,107648943),(107648943,434575065),(434575065,436667186) progress=0/436667186 - 0% Not receiving any streams. Pool Name Active Pending Completed Commands n/a 0 134283 Responses n/a 0 192438 That is a little long but every case is diffent par. With low requiest load and some heavy server iron RAID,RAM you can see a compaction move really fast 300 GB in 4-6 hours. With enough load one of these operations compact,cleanup,join can get really bogged down to the point where it almost does not move. 
Sometimes that is just the way it is, based on how fragmented your rows are and how fast your gear is. Not pushing your Cassandra caches up to your JVM limit can help. If your heap is often near full you can have JVM memory fragmentation which slows things down. 0.8 has some more tuning options for compaction: multi-threading and knobs for the effective rate. I notice you are using: 5 GB heap, 8 GB ram. So your RAM/DATA ratio is on the lower side. I think unless you have a good use case for row cache, less Xmx is more, but that is a minor tweak.
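A sketch of the knobs being discussed, with names from the 0.7/0.8 config files (verify against your own installation before changing anything):

  # conf/cassandra-env.sh -- cap the heap so more RAM stays available to the OS page cache / mmap:
  MAX_HEAP_SIZE="4G"
  # conf/cassandra.yaml (0.8) -- throttle compaction IO so it does not starve reads:
  compaction_throughput_mb_per_sec: 16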
Re: Re: nodetool move trying to stream data to node no longer in cluster
Glad to report I fixed this problem.
1. I added the load_ring_state=false flag
2. I was able to arrange a time where I could take down the whole cluster and bring it back up.
After that the phantom node disappeared.
On Fri, May 27, 2011 at 12:48 AM, jonathan.co...@gmail.com wrote: Hi Aaron - Thanks a lot for the great feedback. I'll try your suggestion on removing it as an endpoint with jmx.
On , aaron morton aa...@thelastpickle.com wrote: Off the top of my head the simple way to stop invalid end point state being passed around is a full cluster stop. Obviously that's not an option. The problem is if one node has the IP it will share it around with the others. Out of interest take a look at the o.a.c.db.FailureDetector MBean getAllEndpointStates() function. That returns the end point state held by the Gossiper. I think you should see the phantom IP listed in there. If it's only on some nodes *perhaps* restarting the node with the JVM option -Dcassandra.load_ring_state=false *may* help. That will stop the node from loading its saved ring state and force it to get it via gossip. Again, if there are other nodes with the phantom IP it may just get it again. I'll do some digging and try to get back to you. This pops up from time to time and thinking out loud I wonder if it would be possible to add a new application state that purges an IP from the ring, e.g. VersionedValue.STATUS_PURGED that works with a ttl so it goes through X number of gossip rounds and then disappears. Hope that helps. - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com
On 26 May 2011, at 19:58, Jonathan Colby wrote: @Aaron - Unfortunately I'm still seeing messages like: is down, removing from gossip, although not with the same frequency. And repair/move jobs don't seem to try to stream data to the removed node anymore. Anyone know how to totally purge any stored gossip/endpoint data on nodes that were removed from the cluster? Or what might be happening here otherwise?
On May 26, 2011, at 9:10 AM, aaron morton wrote: cool. I was going to suggest that but as you already had the move running I thought it may be a little drastic. Did it show any progress? If the IP address is not responding there should have been some sort of error. Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com
On 26 May 2011, at 15:28, jonathan.co...@gmail.com wrote: Seems like it had something to do with stale endpoint information. I did a rolling restart of the whole cluster and that seemed to trigger the nodes to remove the node that was decommissioned.
On , aaron morton aa...@thelastpickle.com wrote: Is it showing progress? It may just be a problem with the information printed out. Can you check from the other nodes in the cluster to see if they are receiving the stream? cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com
On 26 May 2011, at 00:42, Jonathan Colby wrote: I recently removed a node (with decommission) from our cluster. I added a couple new nodes and am now trying to rebalance the cluster using nodetool move. However, netstats shows that the node being moved is trying to stream data to the node that I already decommissioned yesterday. The removed node was powered-off, taken out of dns, its IP is not even pingable. It was never a seed either. This is cassandra 0.7.5 on 64bit linux. How do I tell the cluster that this node is gone? Gossip should have detected this. The ring command shows the correct cluster IPs.
Here is a portion of netstats. 10.46.108.102 is the node which was removed. Mode: Leaving: streaming data to other nodes Streaming to: /10.46.108.102 /var/lib/cassandra/data/DFS/main-f-1064-Data.db/(4681027,5195491),(5195491,15308570),(15308570,15891710),(16336750,20558705),(20558705,29112203),(29112203,36279329),(36465942,36623223),(36740457,37227058),(37227058,42206994),(42206994,47380294),(47635053,47709813),(47709813,48353944),(48621287,49406499),(53330048,53571312),(53571312,54153922),(54153922,59857615),(59857615,61029910),(61029910,61871509),(62190800,62498605),(62824281,62964830),(63511604,64353114),(64353114,64760400),(65174702,65919771),(65919771,66435630),(81440029,81725949),(81725949,83313847),(83313847,83908709),(88983863,89237303),(89237303,89934199),(89934199,97 ... 5693491,14795861666),(14795861666,14796105318),(14796105318,14796366886),(14796699825,14803874941),(14803874941,14808898331),(14808898331,14811670699),(14811670699,14815125177),(14815125177,14819765003),(14820229433,14820858266
Re: nodetool move trying to stream data to node no longer in cluster
@Aaron - Unfortunately I'm still seeing message like: ip-of-removed-node is down, removing from gossip, although with not the same frequency. And repair/move jobs don't seem to try to stream data to the removed node anymore. Anyone know how to totally purge any stored gossip/endpoint data on nodes that were removed from the cluster. Or what might be happening here otherwise? On May 26, 2011, at 9:10 AM, aaron morton wrote: cool. I was going to suggest that but as you already had the move running I thought it may be a little drastic. Did it show any progress ? If the IP address is not responding there should have been some sort of error. Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 26 May 2011, at 15:28, jonathan.co...@gmail.com wrote: Seems like it had something to do with stale endpoint information. I did a rolling restart of the whole cluster and that seemed to trigger the nodes to remove the node that was decommissioned. On , aaron morton aa...@thelastpickle.com wrote: Is it showing progress ? It may just be a problem with the information printed out. Can you check from the other nodes in the cluster to see if they are receiving the stream ? cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 26 May 2011, at 00:42, Jonathan Colby wrote: I recently removed a node (with decommission) from our cluster. I added a couple new nodes and am now trying to rebalance the cluster using nodetool move. However, netstats shows that the node being moved is trying to stream data to the node that I already decommissioned yesterday. The removed node was powered-off, taken out of dns, its IP is not even pingable. It was never a seed neither. This is cassandra 0.7.5 on 64bit linux. How do I tell the cluster that this node is gone? Gossip should have detected this. The ring commands shows the correct cluster IPs. Here is a portion of netstats. 10.46.108.102 is the node which was removed. Mode: Leaving: streaming data to other nodes Streaming to: /10.46.108.102 /var/lib/cassandra/data/DFS/main-f-1064-Data.db/(4681027,5195491),(5195491,15308570),(15308570,15891710),(16336750,20558705),(20558705,29112203),(29112203,36279329),(36465942,36623223),(36740457,37227058),(37227058,42206994),(42206994,47380294),(47635053,47709813),(47709813,48353944),(48621287,49406499),(53330048,53571312),(53571312,54153922),(54153922,59857615),(59857615,61029910),(61029910,61871509),(62190800,62498605),(62824281,62964830),(63511604,64353114),(64353114,64760400),(65174702,65919771),(65919771,66435630),(81440029,81725949),(81725949,83313847),(83313847,83908709),(88983863,89237303),(89237303,89934199),(89934199,97 ... 5693491,14795861666),(14795861666,14796105318),(14796105318,14796366886),(14796699825,14803874941),(14803874941,14808898331),(14808898331,14811670699),(14811670699,14815125177),(14815125177,14819765003),(14820229433,14820858266) progress=280574376402/12434049900 - 2256% . Note 10.46.108.102 is NOT part of the ring. 
Address Status State Load Owns Token 148873535527910577765226390751398592512 10.46.108.100 Up Normal 71.73 GB 12.50% 0 10.46.108.101 Up Normal 109.69 GB 12.50% 21267647932558653966460912964485513216 10.47.108.100 Up Leaving 281.95 GB 37.50% 85070591730234615865843651857942052863 10.47.108.102 Up Normal 210.77 GB 0.00% 85070591730234615865843651857942052864 10.47.108.101 Up Normal 289.59 GB 16.67% 113427455640312821154458202477256070484 10.46.108.103 Up Normal 299.87 GB 8.33% 127605887595351923798765477786913079296 10.47.108.103 Up Normal 94.99 GB 12.50% 148873535527910577765226390751398592511 10.46.108.104 Up Normal 103.01 GB 0.00% 148873535527910577765226390751398592512
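For anyone hitting the same thing: the rolling restart that (as noted further down this thread) eventually cleared the stale endpoint state can be sketched roughly like this. The init script and cassandra-env.sh paths are assumptions about a stock 0.7 package install, and the load_ring_state option just tells the node not to reload its saved ring state on startup - adjust to your own layout.
  # one node at a time, so the cluster stays up
  echo 'JVM_OPTS="$JVM_OPTS -Dcassandra.load_ring_state=false"' >> /etc/cassandra/cassandra-env.sh
  /etc/init.d/cassandra restart
  nodetool -h localhost ring      # wait until the node is back and the ring looks right
  # then take the extra JVM option out again before moving on to the next node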
nodetool move trying to stream data to node no longer in cluster
I recently removed a node (with decommission) from our cluster. I added a couple new nodes and am now trying to rebalance the cluster using nodetool move. However, netstats shows that the node being moved is trying to stream data to the node that I already decommissioned yesterday. The removed node was powered-off, taken out of dns, its IP is not even pingable. It was never a seed neither. This is cassandra 0.7.5 on 64bit linux. How do I tell the cluster that this node is gone? Gossip should have detected this. The ring commands shows the correct cluster IPs. Here is a portion of netstats. 10.46.108.102 is the node which was removed. Mode: Leaving: streaming data to other nodes Streaming to: /10.46.108.102 /var/lib/cassandra/data/DFS/main-f-1064-Data.db/(4681027,5195491),(5195491,15308570),(15308570,15891710),(16336750,20558705),(20558705,29112203),(29112203,36279329),(36465942,36623223),(36740457,37227058),(37227058,42206994),(42206994,47380294),(47635053,47709813),(47709813,48353944),(48621287,49406499),(53330048,53571312),(53571312,54153922),(54153922,59857615),(59857615,61029910),(61029910,61871509),(62190800,62498605),(62824281,62964830),(63511604,64353114),(64353114,64760400),(65174702,65919771),(65919771,66435630),(81440029,81725949),(81725949,83313847),(83313847,83908709),(88983863,89237303),(89237303,89934199),(89934199,97 ... 5693491,14795861666),(14795861666,14796105318),(14796105318,14796366886),(14796699825,14803874941),(14803874941,14808898331),(14808898331,14811670699),(14811670699,14815125177),(14815125177,14819765003),(14820229433,14820858266) progress=280574376402/12434049900 - 2256% . Note 10.46.108.102 is NOT part of the ring. Address Status State LoadOwnsToken 148873535527910577765226390751398592512 10.46.108.100 Up Normal 71.73 GB12.50% 0 10.46.108.101 Up Normal 109.69 GB 12.50% 21267647932558653966460912964485513216 10.47.108.100 Up Leaving 281.95 GB 37.50% 85070591730234615865843651857942052863 - currently being moved 10.47.108.102 Up Normal 210.77 GB 0.00% 85070591730234615865843651857942052864 10.47.108.101 Up Normal 289.59 GB 16.67% 113427455640312821154458202477256070484 10.46.108.103 Up Normal 299.87 GB 8.33% 127605887595351923798765477786913079296 10.47.108.103 Up Normal 94.99 GB12.50% 148873535527910577765226390751398592511 10.46.108.104 Up Normal 103.01 GB 0.00% 148873535527910577765226390751398592512
Re: Database grows 10X bigger after running nodetool repair
I'm not sure if this is the absolute best advice, but perhaps running clean on the data will help cleanup any data that isn't assigned to this token - in case you've moved the cluster around before. Any exceptions in the logs, eg EOF ? I experienced this and it caused the repairs to trip up every time. It was fixed with a scrub which rebuilds all the tables. I also turned swap off on my nodes, which is unnecessary overhead since mmap manages the virtual memory pretty good. Be careful about running major compactions. You'll keep fusing all the Data into bigger and bigger files, which are harder to perform maintenance tasks on in my experience. Jon On , Dominic Williams thedwilli...@gmail.com wrote: Hi, I've got a strange problem, where the database on a node has inflated 10X after running repair. This is not the result of receiving missed data. I didn't perform repair within my usual 10 day cycle, so followed recommended practice: http://wiki.apache.org/cassandra/Operations#Dealing_with_the_consequences_of_nodetool_repair_not_running_within_GCGraceSeconds The sequence of events was like this: 1) set GCGraceSeconds to some huge value 2) perform rolling upgrade from 0.7.4 to 0.7.6-2 3) run nodetool repair on the first node in cluster ~10pm. It has a ~30G database 3) 2.30am decide to leave it running all night and wake up 9am to find still running 4) late morning investigation shows that db size has increased to 370G. The snapshot folder accounts for only 30G 5) node starts to run out of disk space http://pastebin.com/Sm0B7nfR 6) decide to bail! Reset GCGraceSeconds to 864000 and restart node to stop repair 7) as node restarts it deletes a bunch of tmp files, reducing db size from 370G to 270G 8) node now constantly performing minor compactions and du rising slightly then falling by a greater amount after minor compaction deletes sstable 9) gradually disk usage is coming down. Currently at 254G (3pm) 10) performance of node obviously not great! Investigation of the database reveals the main problem to have occurred in a single column family, UserFights. This contains millions of fight records from our MMO, but actually exactly the same number as the MonsterFights cf. However, the comparative size is Column Family: MonsterFights SSTable count: 38 Space used (live): 13867454647 Space used (total): 13867454647 (13G) Memtable Columns Count: 516 Memtable Data Size: 598770 Memtable Switch Count: 4 Read Count: 514 Read Latency: 157.649 ms. Write Count: 4059 Write Latency: 0.025 ms. Pending Tasks: 0 Key cache capacity: 20 Key cache size: 183004 Key cache hit rate: 0.0023566218452145135 Row cache: disabled Compacted row minimum size: 771 Compacted row maximum size: 943127 Compacted row mean size: 3208 Column Family: UserFights SSTable count: 549 Space used (live): 185355019679 Space used (total): 219489031691 (219G) Memtable Columns Count: 483 Memtable Data Size: 560569 Memtable Switch Count: 8 Read Count: 2159 Read Latency: 2589.150 ms. Write Count: 4080 Write Latency: 0.018 ms. Pending Tasks: 0 Key cache capacity: 20 Key cache size: 20 Key cache hit rate: 0.03357770764288416 Row cache: disabled Compacted row minimum size: 925 Compacted row maximum size: 12108970 Compacted row mean size: 503069 These stats were taken at 3pm, and at 1pm UserFights was using 224G total, so overall size is gradually coming down. 
Another observation is the following appearing in the logs during the minor compactions: Compacting large row 536c69636b5061756c (121235810 bytes) incrementally The largest number of fights any user has performed on our MMO that I can find is short of 10,000. Each fight record is smaller than 1K... so it looks like these rows have grown +10X somehow. The size of UserFights on another replica node, which actually has a slightly higher proportion of ring is Column Family: UserFights SSTable count: 14 Space used (live): 17844982744 Space used (total): 17936528583 (18G) Memtable Columns Count: 767 Memtable Data Size: 891153 Memtable Switch Count: 6 Read Count: 2298 Read Latency: 61.020 ms. Write Count: 4261 Write Latency: 0.104 ms. Pending Tasks: 0 Key cache capacity: 20 Key cache size: 55172 Key cache hit rate: 0.8079570484581498 Row cache: disabled Compacted row minimum size: 925 Compacted row maximum size: 12108970 Compacted row mean size: 846477 ... All ideas and suggestions greatly appreciated as always! Dominic ria101.wordpress.com
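To make the suggestions above concrete, a rough sketch of the cleanup/scrub/swap steps on one node. The keyspace name is a placeholder (only the column family names appear in this thread), and scrub rewrites every sstable for the column family, so take a snapshot or backup first.
  nodetool -h <host> cleanup <keyspace>              # drop data the node no longer owns
  nodetool -h <host> scrub <keyspace> UserFights     # rebuild the sstables for the problem CF
  swapoff -a                                         # turn swap off now...
  # ...and comment the swap line out of /etc/fstab so it stays off after a reboot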
Re: Re: nodetool move trying to stream data to node no longer in cluster
Seems like it had something to do with stale endpoint information. I did a rolling restart of the whole cluster and that seemed to trigger the nodes to remove the node that was decommissioned. On , aaron morton aa...@thelastpickle.com wrote: Is it showing progress ? It may just be a problem with the information printed out. Can you check from the other nodes in the cluster to see if they are receiving the stream ? cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 26 May 2011, at 00:42, Jonathan Colby wrote: I recently removed a node (with decommission) from our cluster. I added a couple new nodes and am now trying to rebalance the cluster using nodetool move. However, netstats shows that the node being moved is trying to stream data to the node that I already decommissioned yesterday. The removed node was powered-off, taken out of dns, its IP is not even pingable. It was never a seed neither. This is cassandra 0.7.5 on 64bit linux. How do I tell the cluster that this node is gone? Gossip should have detected this. The ring commands shows the correct cluster IPs. Here is a portion of netstats. 10.46.108.102 is the node which was removed. Mode: Leaving: streaming data to other nodes Streaming to: /10.46.108.102 /var/lib/cassandra/data/DFS/main-f-1064-Data.db/(4681027,5195491),(5195491,15308570),(15308570,15891710),(16336750,20558705),(20558705,29112203),(29112203,36279329),(36465942,36623223),(36740457,37227058),(37227058,42206994),(42206994,47380294),(47635053,47709813),(47709813,48353944),(48621287,49406499),(53330048,53571312),(53571312,54153922),(54153922,59857615),(59857615,61029910),(61029910,61871509),(62190800,62498605),(62824281,62964830),(63511604,64353114),(64353114,64760400),(65174702,65919771),(65919771,66435630),(81440029,81725949),(81725949,83313847),(83313847,83908709),(88983863,89237303),(89237303,89934199),(89934199,97 ... 5693491,14795861666),(14795861666,14796105318),(14796105318,14796366886),(14796699825,14803874941),(14803874941,14808898331),(14808898331,14811670699),(14811670699,14815125177),(14815125177,14819765003),(14820229433,14820858266) progress=280574376402/12434049900 - 2256% . Note 10.46.108.102 is NOT part of the ring. Address Status State Load Owns Token 148873535527910577765226390751398592512 10.46.108.100 Up Normal 71.73 GB 12.50% 0 10.46.108.101 Up Normal 109.69 GB 12.50% 21267647932558653966460912964485513216 10.47.108.100 Up Leaving 281.95 GB 37.50% 85070591730234615865843651857942052863 10.47.108.102 Up Normal 210.77 GB 0.00% 85070591730234615865843651857942052864 10.47.108.101 Up Normal 289.59 GB 16.67% 113427455640312821154458202477256070484 10.46.108.103 Up Normal 299.87 GB 8.33% 127605887595351923798765477786913079296 10.47.108.103 Up Normal 94.99 GB 12.50% 148873535527910577765226390751398592511 10.46.108.104 Up Normal 103.01 GB 0.00% 148873535527910577765226390751398592512
extremely high temporary disk utilization 0.7.5
On each of our nodes we have an average of 80 - 100 GB of actual cassandra data on 1 TB disks. There is normally plenty of capacity on the nodes. Swap is OFF. OS is Debian 64 bit. Every once in a while, the disk usage will skyrocket to 500+ GB, once even filling up the 1 TB disk (at least according to Linux df). The thing is, after restarting the cassandra daemon, the disk usage correctly reflects the actual data usage. What could be causing this massive temporary disk allocation? Is it malloc? Is this an indication that something is not configured correctly? Is this a bug? Any help would be appreciated! Jon
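A few things worth checking the next time the disk fills up, before restarting the daemon. These are only guesses at the usual suspects (sstables that were deleted but are still held open, old snapshots, half-written compaction output); the paths assume the default data directory and the pgrep pattern may need adjusting for a jsvc-based install.
  lsof -p $(pgrep -f cassandra) | grep -i deleted       # space held by files already unlinked
  du -sh /var/lib/cassandra/data/*/snapshots 2>/dev/null
  ls -lh /var/lib/cassandra/data/*/*tmp* 2>/dev/null    # in-flight compaction output
  nodetool -h <host> clearsnapshot                      # drops old snapshots, if that is the culprit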
Re: jsvc hangs shell
We use the Java Service Wrapper from Tanuki Software and are very happy with it. It's a lot more robust than jsvc. http://wrapper.tanukisoftware.com/doc/english/download.jsp The free community version will be enough in most cases. Jon On May 11, 2011 10:30pm, Anton Belyaev anton.bely...@gmail.com wrote: Hello, I installed 0.7.5 on my Ubuntu 11.04 64 bit from the package at deb http://www.apache.org/dist/cassandra/debian 07x main and I ran into a really strange problem. Any shell command that reads Cassandra's jsvc command line (for example, ps -ef, or top with cmdline args) just hangs. Using strace I found out that the commands hang while reading /proc//cmdline. I tried to cat the file - the shell hung. I tried both OpenJDK and Sun JDK - the bug remains. I tried 0.6.13 on the same machine - it works fine. I tried 0.7.5 on another machine (with an older Ubuntu) - works fine. I believe this is not a Cassandra bug, but I am not sure where to ask for help with the problem. Could you please advise what I should check to find out where the problem is? Thanks. Anton.
Re: What will be the steps for adding new nodes
Your questions are pretty fundamental. I recommend reading through the documentation to get a better understanding of how Cassandra works. Here's good documentation from DataStax: http://www.datastax.com/docs/0.7/operations/clustering#adding-capacity In a nutshell: you only bootstrap new nodes, all nodes should have the same seed list, and old nodes don't have to be restarted. On Apr 16, 2011, at 7:48 AM, Roni wrote: I have a 0.6.4 Cassandra cluster of two nodes in full replica (replication factor 2). I want to add two more nodes and balance the cluster (replication factor 2). I want all of them to be seeds. What should be the simple steps: 1. add <AutoBootstrap>true</AutoBootstrap> to all the nodes or only the new ones? 2. add <Seed>[new_node]</Seed> to the config file of the old nodes before adding the new ones? 3. do the old nodes need to be restarted (if no change is needed in their config file)? TX,
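For the 0.6-style config Roni is asking about, a minimal sketch of what the new nodes might look like - element names as they appear in the question, IPs are placeholders. Only the new nodes need AutoBootstrap, they should point their seed entries at existing nodes, and they should be started one at a time.
  <AutoBootstrap>true</AutoBootstrap>
  <Seeds>
      <Seed>10.0.0.1</Seed>   <!-- an existing node -->
      <Seed>10.0.0.2</Seed>   <!-- an existing node -->
  </Seeds>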
recurring EOFException exception in 0.7.4
I've been struggling with these kinds of exceptions for some time now. I thought it might have been a one-time thing, so on the 2 nodes where I saw this problem I pulled in fresh data with a repair on an empty data directory. Unfortunately, this problem is now coming up on a new node that has, up until now, not had this problem. What could be causing this? Could it be related to encoding? Why are these rows not readable? This exception prevents cassandra from doing repairs, and even minor compactions. It also messes up memtable management (with a normal load of 25GB, disk goes to almost 100% full on a 500 GB hd). This is incredibly frustrating. This is the only pain-point I have had with cassandra so far. By the way, this node was never upgraded - it was 0.7.4 from the start, so that eliminates format compatibility problems. ERROR [CompactionExecutor:1] 2011-04-15 21:31:23,479 PrecompactedRow.java (line 82) Skipping row DecoratedKey(105452551814086725777389040553659117532, 4d657373616765456e726963686d656e743a313032343937) in /var/lib/cassandra/data/DFS/main-f-91-Data.db java.io.EOFException at java.io.RandomAccessFile.readFully(RandomAccessFile.java:383) at java.io.RandomAccessFile.readFully(RandomAccessFile.java:361) at org.apache.cassandra.io.util.BufferedRandomAccessFile.readBytes(BufferedRandomAccessFile.java:270) at org.apache.cassandra.utils.ByteBufferUtil.read(ByteBufferUtil.java:315) at org.apache.cassandra.utils.ByteBufferUtil.readWithLength(ByteBufferUtil.java:272) at org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:94) at org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:35) at org.apache.cassandra.db.ColumnFamilySerializer.deserializeColumns(ColumnFamilySerializer.java:129) at org.apache.cassandra.io.sstable.SSTableIdentityIterator.getColumnFamilyWithColumns(SSTableIdentityIterator.java:176) at org.apache.cassandra.io.PrecompactedRow.init(PrecompactedRow.java:78) at org.apache.cassandra.io.CompactionIterator.getCompactedRow(CompactionIterator.java:147) at org.apache.cassandra.io.CompactionIterator.getReduced(CompactionIterator.java:108) at org.apache.cassandra.io.CompactionIterator.getReduced(CompactionIterator.java:43) at org.apache.cassandra.utils.ReducingIterator.computeNext(ReducingIterator.java:73) at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:136) at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:131) at org.apache.commons.collections.iterators.FilterIterator.setNextObject(FilterIterator.java:183) at org.apache.commons.collections.iterators.FilterIterator.hasNext(FilterIterator.java:94) at org.apache.cassandra.db.CompactionManager.doCompaction(CompactionManager.java:449) at org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:124) at org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:94) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662)
Re: Questions about the nodetool ring.
This is normal when you just add single nodes. When no token is assigned, the new node takes a portion of the ring from the most heavily loaded node. As a consequence, the nodes will be out of balance. In other words, if you doubled the number of nodes you would not have this problem. The best way to rebalance the cluster is to generate new tokens and use the nodetool move <new-token> command to rebalance the nodes, one at a time. After rebalancing you can run cleanup so the nodes get rid of data they are no longer responsible for. links: http://wiki.apache.org/cassandra/Operations#Range_changes http://wiki.apache.org/cassandra/Operations#Moving_or_Removing_nodes http://www.datastax.com/docs/0.7/operations/clustering#adding-capacity On Apr 12, 2011, at 11:00 AM, Dikang Gu wrote: I have 3 cassandra 0.7.4 nodes in a cluster, and I get the ring stats: [root@yun-phy2 apache-cassandra-0.7.4]# bin/nodetool -h 192.168.1.28 -p 8090 ring Address Status State Load Owns Token 109028275973926493413574716008500203721 192.168.1.25 Up Normal 157.25 MB 69.92% 57856537434773737201679995572503935972 192.168.1.27 Up Normal 201.71 MB 24.28% 99165710459060760249270263771474737125 192.168.1.28 Up Normal 68.12 MB 5.80% 109028275973926493413574716008500203721 The load and owns vary on each node, is this normal? And is there a way to balance the three nodes? Thanks. -- Dikang Gu 0086 - 18611140205
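To put numbers on that for Dikang's 3-node ring: with the RandomPartitioner, evenly spaced tokens are i * 2**127 / 3, which works out to the values below. This is only a sketch - run one move at a time, wait for it to finish before starting the next, and run cleanup only after all moves are done. Hosts and JMX port are taken from the ring output above.
  # balanced tokens for 3 nodes (i * 2**127 / 3)
  bin/nodetool -h 192.168.1.25 -p 8090 move 0
  bin/nodetool -h 192.168.1.27 -p 8090 move 56713727820156410577229101238628035242
  bin/nodetool -h 192.168.1.28 -p 8090 move 113427455640312821154458202477256070485
  # after the last move has finished, drop the data each node no longer owns
  bin/nodetool -h 192.168.1.25 -p 8090 cleanup   # repeat for each node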
Re: Questions about the nodetool ring.
when you do a move, the node is decommissioned and bootstrapped. During the autobootstrap process the node will not receive reads until bootstrapping is complete. I assume during the decommission phase the node will also be unavailable, someone correct me if I'm wrong. the ring distribution looks better now. The ? I get all the time too. And if you run ring against different hosts, the question marks probably appear in different places. I'm not sure if it means there is a problem. I haven't taken those question marks too seriously. On Apr 12, 2011, at 11:57 AM, Dikang Gu wrote: After the nodetool move, I got this: [root@server3 apache-cassandra-0.7.4]# bin/nodetool -h 10.18.101.213 ring Address Status State LoadOwnsToken 113427455640312821154458202477256070485 10.18.101.211 ? Normal 82.31 MB33.33% 0 10.18.101.212 ? Normal 84.24 MB33.33% 56713727820156410577229101238628035242 10.18.101.213 Up Normal 54.44 MB33.33% 113427455640312821154458202477256070485 Is this correct? Why is the status ? ? Thanks. On Tue, Apr 12, 2011 at 5:43 PM, Dikang Gu dikan...@gmail.com wrote: The 3 nodes were added to the cluster at the same time, so I'm not sure whey the data vary. I calculate the tokens and get: node 0: 0 node 1: 56713727820156410577229101238628035242 node 2: 113427455640312821154458202477256070485 So I should set these tokens to the three nodes? And during the time I execute the nodetool move commands, can the cassandra servers serve the front end requests at the same time? Is the data safe? Thanks. On Tue, Apr 12, 2011 at 5:15 PM, Jonathan Colby jonathan.co...@gmail.com wrote: This is normal when you just add single nodes. When no token is assigned, the new node takes a portion of the ring from the most heavily loaded node. As a consequence of this, the nodes will be out of balance. In other words, when you double the amount nodes you would not have this problem. The best way to rebalance the cluster is to generate new tokens and use the nodetool move new-token command to rebalance the nodes, one at a time. After rebalancing you can run cleanup so the nodes get rid of data they no longer are responsible for. links: http://wiki.apache.org/cassandra/Operations#Range_changes http://wiki.apache.org/cassandra/Operations#Moving_or_Removing_nodes http://www.datastax.com/docs/0.7/operations/clustering#adding-capacity On Apr 12, 2011, at 11:00 AM, Dikang Gu wrote: I have 3 cassandra 0.7.4 nodes in a cluster, and I get the ring stats: [root@yun-phy2 apache-cassandra-0.7.4]# bin/nodetool -h 192.168.1.28 -p 8090 ring Address Status State LoadOwnsToken 109028275973926493413574716008500203721 192.168.1.25Up Normal 157.25 MB 69.92% 57856537434773737201679995572503935972 192.168.1.27Up Normal 201.71 MB 24.28% 99165710459060760249270263771474737125 192.168.1.28Up Normal 68.12 MB5.80% 109028275973926493413574716008500203721 The load and owns vary on each node, is this normal? And is there a way to balance the three nodes? Thanks. -- Dikang Gu 0086 - 18611140205 -- Dikang Gu 0086 - 18611140205 -- Dikang Gu 0086 - 18611140205
repair never completes with finished successfully
There are a few other threads related to problems with nodetool repair in 0.7.4. However I'm not seeing any errors, just never getting a message that the repair completed successfully. In my production and test clusters (with just a few MB of data) the nodetool repair prompt never returns and the last entry in cassandra.log is always something like: #TreeRequest manual-repair-f739ca7a-bef8-4683-b249-09105f6719d9, /10.46.108.102, (DFS,main) completed successfully: 1 outstanding But I don't see a message, even hours later, that the 1 outstanding request finished successfully. Anyone else experience this? These are physical server nodes in local data centers, not EC2.
Re: repair never completes with finished successfully
There is no Repair session message either. It just starts with a message like: INFO [manual-repair-2af33a51-f46a-4ba2-b1fb-ead5159dc723] 2011-04-10 14:00:59,051 AntiEntropyService.java (line 770) Waiting for repair requests: [#TreeRequest manual-repair-2af33a51-f46a-4ba2-b1fb-ead5159dc723, /10.46.108.101, (DFS,main), #TreeRequest manual-repair-2af33a51-f46a-4ba2-b1fb-ead5159dc723, /10.47.108.100, (DFS,main), #TreeRequest manual-repair-2af33a51-f46a-4ba2-b1fb-ead5159dc723, /10.47.108.102, (DFS,main), #TreeRequest manual-repair-2af33a51-f46a-4ba2-b1fb-ead5159dc723, /10.47.108.101, (DFS,main)] NETSTATS: Mode: Normal Not sending any streams. Not receiving any streams. Pool NameActive Pending Completed Commandsn/a 0 150846 Responses n/a 0 443183 One node in our cluster still has unreadable rows, where the reads trip up every time for certain sstables (you've probably seen my earlier threads regarding that). My suspicion is that the bloom filter read on the node with the corrupt sstables is never reporting back to the repair, thereby causing it to hang. What would be great is a scrub tool that ignores unreadable/unserializable rows! : ) On Apr 12, 2011, at 2:15 PM, aaron morton wrote: Do you see a message starting Repair session and ending with completed successfully ? Or do you see any streaming activity using nodetool netstats Repair can hang if a neighbour dies and fails to send a requested stream. It will timeout after 24 hours (I think). Aaron On 12 Apr 2011, at 23:39, Karl Hiramoto wrote: On 12/04/2011 13:31, Jonathan Colby wrote: There are a few other threads related to problems with the nodetool repair in 0.7.4. However I'm not seeing any errors, just never getting a message that the repair completed successfully. In my production and test cluster (with just a few MB data) the repair nodetool prompt never returns and the last entry in the cassandra.log is always something like: #TreeRequest manual-repair-f739ca7a-bef8-4683-b249-09105f6719d9, /10.46.108.102, (DFS,main) completed successfully: 1 outstanding But I don't see a message, even hours later, that the 1 outstanding request finished successfully. Anyone else experience this? These are physical server nodes in local data centers and not EC2 I've seen this. To fix it try a nodetool compact then repair. -- Karl
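One way to see whether anything is actually still happening is to poll the neighbours named in the Waiting for repair requests line for streams and running validation compactions. A rough sketch using the IPs from the log line above; the compactionstats subcommand is assumed to be available in this 0.7 build, netstats definitely is.
  for h in 10.46.108.101 10.47.108.100 10.47.108.102 10.47.108.101; do
    echo "== $h =="
    nodetool -h $h netstats           # any streams in or out?
    nodetool -h $h compactionstats    # a validation compaction should show up here while the tree is built
  done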
quick repair tool question
Does a repair just compare the existing data from sstables on the node being repaired, or will it figure out which data this node should have and copy it in? I'm trying to refresh all the data for a given node (without reassigning the token), starting with an emptied-out data directory. I tried nodetool move, but if I give it the same token it was previously assigned, it doesn't seem to trigger a decommission/bootstrap. Thanks.
Re: quick repair tool question
I think I answered the question myself. The data is streaming in from other replicas even though the node's data dir was emptied out (system dir was left alone). I'm not sure if this is the kosher way to rebuild the sstable data, but it seemed to work. /var/lib/cassandra/data # /opt/cassandra/bin/nodetool -h $HOSTNAME -p 35014 netstats Mode: Normal Not sending any streams. Streaming from: /10.46.108.100 DFS: /var/lib/cassandra/data/DFS/main-f-85-Data.db/(101772144,192460041),(192460041,267088244) progress=0/165316100 - 0% DFS: /var/lib/cassandra/data/DFS/main-f-86-Data.db/(118410757,194489915),(194489915,247653739) progress=0/129242982 - 0% DFS: /var/lib/cassandra/data/DFS/main-f-40-Data.db/(4823893695,4850323665),(4850323665,7818579650) progress=0/2994685955 - 0% DFS: /var/lib/cassandra/data/DFS/main-f-89-Data.db/(0,707948),(707948,2011040) progress=0/2011040 - 0% DFS: /var/lib/cassandra/data/DFS/main-f-70-Data.db/(778069440,1015544852),(1015544852,1200443249) progress=0/422373809 - 0% DFS: /var/lib/cassandra/data/DFS/main-f-71-Data.db/(119366025,132069485),(132069485,156787816) progress=0/37421791 - 0% Streaming from: /10.47.108.100 DFS: /var/lib/cassandra/data/DFS/main-f-365-Data.db/(0,24748050),(126473995,170409694) progress=0/68683749 - 0% DFS: /var/lib/cassandra/data/DFS/main-f-367-Data.db/(0,935041),(935041,2238133) progress=0/2238133 - 0% DFS: /var/lib/cassandra/data/DFS/main-f-366-Data.db/(0,4608808),(37713613,46884920) progress=0/13780115 - 0% DFS: /var/lib/cassandra/data/DFS/main-f-242-Data.db/(0,1057203157),(3307900143,4339490352) progress=0/2088793366 - 0% DFS: /var/lib/cassandra/data/DFS/main-f-352-Data.db/(0,19422069),(81246761,122537002) progress=0/60712310 - 0% DFS: /var/lib/cassandra/data/DFS/main-f-225-Data.db/(0,1580865981),(4540941750,6024843721) progress=0/3064767952 - 0% DFS: /var/lib/cassandra/data/DFS/main-f-349-Data.db/(0,21720053),(54115405,71716716) progress=0/39321364 - 0% DFS: /var/lib/cassandra/data/DFS/main-f-364-Data.db/(0,72606213),(175419693,238159626) progress=0/135346146 - 0% DFS: /var/lib/cassandra/data/DFS/main-f-363-Data.db/(0,1184983783),(3458591846,4556646617) progress=0/2283038554 - 0% DFS: /var/lib/cassandra/data/DFS/main-f-368-Data.db/(0,756228),(756228,1626647) progress=0/1626647 - 0% DFS: /var/lib/cassandra/data/DFS/main-f-361-Data.db/(48074007,78009236) progress=0/29935229 - 0% DFS: /var/lib/cassandra/data/DFS/main-f-226-Data.db/(0,3111952321),(8592898278,11484622800) progress=0/6003676843 - 0% Pool NameActive Pending Completed Commandsn/a 0 5765 Responses n/a 0 9811 On Apr 12, 2011, at 4:59 PM, Jonathan Colby wrote: does a repair just compare the existing data from sstables on the node being repaired, or will it figure out which data this node should have and copy it in? I'm trying to refresh all the data for a given node (without reassigning the token) starting with an emptied out data directory. I tried nodetool move, but if I give the same token it previously was assigned it doesn't seem to trigger a decommission/bootstrap. Thanks.
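For anyone wanting to repeat this, the sequence above boils down to roughly the following. Host, JMX port, data path and keyspace name are taken from the netstats output; the key point is to leave the system keyspace in place so the node keeps its token and identity.
  # stop cassandra on the node first
  mv /var/lib/cassandra/data/DFS /var/lib/cassandra/DFS.bak   # keyspace data only, leave system alone
  # start cassandra again, then pull the node's replicas back in
  /opt/cassandra/bin/nodetool -h $HOSTNAME -p 35014 repair
  # delete the .bak copy once the repair has completed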
Re: Cassandra 2 DC deployment
When the down data center comes back up, the Quorum reads will result in a read-repair, so you will get valid data. Besides that, hinted handoff will take care of getting data replicated to a previously down node. Your example is a little unrealistic because you could theoretically have a DC with only one node, so CL.ONE would work every time. But if you have more than 1 node, you have to decide if your application can tolerate getting NULL for a read if the write hasn't propagated from the responsible node to the replica. disclaimer: I'm a cassandra novice. On Apr 12, 2011, at 5:12 PM, Raj N wrote: Hi experts, We are planning to deploy Cassandra in 2 datacenters. Let's assume there are 3 nodes, RF=3, 2 nodes in the 1st DC and 1 node in the 2nd DC. Under normal operations, we would read and write at QUORUM. What we want to do, though, is downgrade our consistency to ONE if we lose the datacenter which has 2 nodes (DC1 in this case). Basically I am saying that whenever there is a partition, prefer availability over consistency. In order to do this we plan to catch UnavailableException and take corrective action: try QUORUM under normal circumstances, and if unavailable try ONE. My questions - Do you guys see any flaws with this approach? What happens when DC1 comes back up and we start reading/writing at QUORUM again? Will we read stale data in this case? Thanks -Raj
Re: Help on decommission
How long has it been in Leaving status? Is the cluster under stress test load while you are doing the decommission? On Apr 12, 2011, at 6:53 PM, Baskar Duraikannu wrote: I have set up a 4-node cluster for testing. When I set up the cluster, I assigned initial tokens in such a way that each node gets 25% of the load, and then started the nodes with autobootstrap=false. After all nodes were up, I loaded data using the stress test tool with a replication factor of 3. As part of my testing, I am trying to remove one of the nodes using nodetool decommission, but the node seems to be stuck in Leaving status. How do I check whether it is doing any work at all? Please help. [root@localhost bin]# ./nodetool -h 10.140.22.25 ring Address Status State Load Owns Token 127605887595351923798765477786913079296 10.140.22.66 Up Leaving 119.41 MB 25.00% 0 10.140.22.42 Up Normal 116.23 MB 25.00% 42535295865117307932921825928971026432 10.140.22.28 Up Normal 119.93 MB 25.00% 85070591730234615865843651857942052864 10.140.22.25 Up Normal 116.21 MB 25.00% 127605887595351923798765477786913079296 [root@localhost bin]# ./nodetool -h 10.140.22.66 netstats Mode: Leaving: streaming data to other nodes Streaming to: /10.140.22.42 /var/lib/cassandra/data/Keyspace1/Standard1-f-1-Data.db/(0,120929157) progress=120929157/120929157 - 100% /var/lib/cassandra/data/Keyspace1/Standard1-f-2-Data.db/(0,3361291) progress=0/3361291 - 0% Not receiving any streams. Pool Name Active Pending Completed Commands n/a 0 17 Responses n/a 0 108109 [root@usnynyc1cass02 bin]# ./nodetool -h 10.140.22.42 netstats Mode: Normal Not sending any streams. Streaming from: /10.140.22.66 Keyspace1: /var/lib/cassandra/data/Keyspace1/Standard1-f-2-Data.db/(0,3361291) progress=0/3361291 - 0% Pool Name Active Pending Completed Commands n/a 0 11 Responses n/a 0 107879 Regards, Baskar
Re: flush_largest_memtables_at messages in 7.4
your jvm heap has reached 78% so cassandra automatically flushes its memtables. you need to explain more about your configuration. 32 or 64 bit OS, what is max heap, how much ram installed? If this happens under stress test conditions its probably understandable. you should look into graphing your memory usage, or use the jconsole to graph heap during your tests. On Apr 12, 2011, at 8:36 PM, mcasandra wrote: I am using cassandra 7.4 and getting these messages. Heap is 0.7802529021498031 full. You may need to reduce memtable and/or cache sizes Cassandra will now flush up to the two largest memtables to free up memory. Adjust flush_largest_memtables_at threshold in cassandra.yaml if you don't want Cassandra to do this automatically How do I verify that I need to adjust any thresholds? And how to calculate correct value? When I got this message only reads were occuring. create keyspace StressKeyspace with replication_factor = 3 and placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy'; use StressKeyspace; drop column family StressStandard; create column family StressStandard with comparator = UTF8Type and keys_cached = 100 and memtable_flush_after = 1440 and memtable_throughput = 128; nodetool -h dsdb4 tpstats Pool NameActive Pending Completed ReadStage32 281 456598 RequestResponseStage 0 0 797237 MutationStage 0 0 499205 ReadRepairStage 0 0 149077 GossipStage 0 0 217227 AntiEntropyStage 0 0 0 MigrationStage0 0201 MemtablePostFlusher 0 0 1842 StreamStage 0 0 0 FlushWriter 0 0 1841 FILEUTILS-DELETE-POOL 0 0 3670 MiscStage 0 0 0 FlushSorter 0 0 0 InternalResponseStage 0 0 0 HintedHandoff 0 0 15 cfstats Keyspace: StressKeyspace Read Count: 460988 Read Latency: 38.07654727454945 ms. Write Count: 499205 Write Latency: 0.007409593253272703 ms. Pending Tasks: 0 Column Family: StressStandard SSTable count: 9 Space used (live): 247408645485 Space used (total): 247408645485 Memtable Columns Count: 0 Memtable Data Size: 0 Memtable Switch Count: 1878 Read Count: 460989 Read Latency: 28.237 ms. Write Count: 499205 Write Latency: NaN ms. Pending Tasks: 0 Key cache capacity: 100 Key cache size: 299862 Key cache hit rate: 0.6031833150384193 Row cache: disabled Compacted row minimum size: 219343 Compacted row maximum size: 5839588 Compacted row mean size: 497474 -- View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/flush-largest-memtables-at-messages-in-7-4-tp6266221p6266221.html Sent from the cassandra-u...@incubator.apache.org mailing list archive at Nabble.com.
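A couple of ways to watch this rather than guessing: nodetool info prints the current and max heap, and jconsole will graph it over the course of a stress run. The yaml key below is the one named in the log message itself; its exact default may differ by release, so check your own cassandra.yaml. The host name dsdb4 is taken from the tpstats command in the original mail.
  nodetool -h dsdb4 info          # shows heap used / max heap, load, uptime
  jconsole <host>:<jmx-port>      # attach and graph heap during the read test
  # the threshold lives in cassandra.yaml:
  #   flush_largest_memtables_at: 0.75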
Re: quick repair tool question
cool! and I thought I made that one up myself : ) On Apr 13, 2011, at 2:13 AM, Chris Burroughs wrote: On 04/12/2011 11:11 AM, Jonathan Colby wrote: I'm not sure if this is the kosher way to rebuild the sstable data, but it seemed to work. http://wiki.apache.org/cassandra/Operations#Handling_failure Option #3.
Re: repair never completes with finished successfully
great tips. I will investigate further with your suggestions in mind. Hopefully the problem has gone away since I pulled in fresh data on the node with problems. On Apr 13, 2011, at 3:54 AM, aaron morton wrote: Ah, unreadable rows and in the validation compaction no less. Makes a little more sense now. Anyone help with the EOF when deserializing columns ? Is the fix to run scrub or drop the sstable ? Here's a a theory, AES is trying to... 1) Create TreeRequest 's that specify a range we want to validate. 2) Send TreeRequest 's to local node and neighbour 3) Process TreeRequest by running a validation compaction (CompactionManager.doValidationCompaction in your prev stacks) 4) When both TreeRequests return back work out the differences and then stream data if needed. Perhaps step 3 is not completing because of errors like http://www.mail-archive.com/user@cassandra.apache.org/msg12196.html If the row is over multiple sstables we can skip the row in one sstable. However if it's in a single sstable PrecompactedRow will raise an IOError if there is a problem. This is not what is in the linked error stack that shows a row been skipped, just a hunch we could checkout. Do you see an IOErrors (not exceptions) in the logs or exceptions with doValidationCompaction in the stack? For a tree request on the node you start compaction on you should see these logs... 1) Waiting for repair requests... 2) One of Stored local tree or Stored remote tree depending on which returns first at DEBUG level 3) Queuing comparison If we do not have the 3rd log then we did not get a replay from either local or remote. Aaron On 13 Apr 2011, at 00:57, Jonathan Colby wrote: There is no Repair session message either. It just starts with a message like: INFO [manual-repair-2af33a51-f46a-4ba2-b1fb-ead5159dc723] 2011-04-10 14:00:59,051 AntiEntropyService.java (line 770) Waiting for repair requests: [#TreeRequest manual-repair-2af33a51-f46a-4ba2-b1fb-ead5159dc723, /10.46.108.101, (DFS,main), #TreeRequest manual-repair-2af33a51-f46a-4ba2-b1fb-ead5159dc723, /10.47.108.100, (DFS,main), #TreeRequest manual-repair-2af33a51-f46a-4ba2-b1fb-ead5159dc723, /10.47.108.102, (DFS,main), #TreeRequest manual-repair-2af33a51-f46a-4ba2-b1fb-ead5159dc723, /10.47.108.101, (DFS,main)] NETSTATS: Mode: Normal Not sending any streams. Not receiving any streams. Pool NameActive Pending Completed Commandsn/a 0 150846 Responses n/a 0 443183 One node in our cluster still has unreadable rows, where the reads trip up every time for certain sstables (you've probably seen my earlier threads regarding that). My suspicion is that the bloom filter read on the node with the corrupt sstables is never reporting back to the repair, thereby causing it to hang. What would be great is a scrub tool that ignores unreadable/unserializable rows! : ) On Apr 12, 2011, at 2:15 PM, aaron morton wrote: Do you see a message starting Repair session and ending with completed successfully ? Or do you see any streaming activity using nodetool netstats Repair can hang if a neighbour dies and fails to send a requested stream. It will timeout after 24 hours (I think). Aaron On 12 Apr 2011, at 23:39, Karl Hiramoto wrote: On 12/04/2011 13:31, Jonathan Colby wrote: There are a few other threads related to problems with the nodetool repair in 0.7.4. However I'm not seeing any errors, just never getting a message that the repair completed successfully. 
In my production and test cluster (with just a few MB data) the repair nodetool prompt never returns and the last entry in the cassandra.log is always something like: #TreeRequest manual-repair-f739ca7a-bef8-4683-b249-09105f6719d9, /10.46.108.102, (DFS,main) completed successfully: 1 outstanding But I don't see a message, even hours later, that the 1 outstanding request finished successfully. Anyone else experience this? These are physical server nodes in local data centers and not EC2 I've seen this. To fix it try a nodetool compact then repair. -- Karl
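To check for the three log lines Aaron describes (and the error cases), something along these lines on each node involved should do it - assuming the default 0.7 log location, and remembering that the Stored local/remote tree lines only appear once logging is at DEBUG.
  grep -E "Waiting for repair requests|Stored local tree|Stored remote tree|Queuing comparison" /var/log/cassandra/system.log
  grep -E "IOError|doValidationCompaction" /var/log/cassandra/system.log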
Re: unrepairable sstable data rows
Thanks for the answer Aaron. There are Data, Index, Filter, and Statistics files associated with SSTables. What files must be physically moved/deleted? I tried just moving the Data file and Cassandra would not start. I see this exception: WARN [WrapperSimpleAppMain] 2011-04-11 12:04:23,239 ColumnFamilyStore.java (line 493) Removing orphans for /var/lib/cassandra/data/DFS/main-f-5: [Data.db] ERROR [WrapperSimpleAppMain] 2011-04-11 12:04:23,240 AbstractCassandraDaemon.java (line 333) Exception encountered during startup. java.lang.AssertionError: attempted to delete non-existing file main-f-5-Data.db at org.apache.cassandra.io.util.FileUtils.deleteWithConfirm(FileUtils.java:46) at org.apache.cassandra.io.util.FileUtils.deleteWithConfirm(FileUtils.java:41) at org.apache.cassandra.db.ColumnFamilyStore.scrubDataDirectories(ColumnFamilyStore.java:498) at org.apache.cassandra.service.AbstractCassandraDaemon.setup(AbstractCassandraDaemon.java:153) On Apr 11, 2011, at 2:14 AM, aaron morton wrote: But if you wanted to get fresh data on the node, a simple approach is to delete/move just the SSTable that is causing problems then run a repair. That should reduce the amount of data that needs to be moved.
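Given the orphan check in that startup error, the whole sstable generation has to move together, not just the Data component. Roughly, with cassandra stopped and using the paths from the log above (the backup directory is just a placeholder):
  mkdir -p /var/lib/cassandra/sstable-backup
  # main-f-5-Data.db, -Index.db, -Filter.db and -Statistics.db all move as one unit
  mv /var/lib/cassandra/data/DFS/main-f-5-* /var/lib/cassandra/sstable-backup/
  # then start cassandra and run a repair so the missing range gets streamed back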
exceptions during bootstrap 0.7.4
Seeing these exceptions on a node during the bootstrap phase of a move . Cassandra 0.7.4. Anyone able to shed more light on what may be causing this? btw - the move was done to assign a new token, decommission phase seemed to have gone ok. bootstrapping is still in progress (i hope) INFO [CompactionExecutor:1] 2011-04-11 16:26:25,583 SSTableReader.java (line 154) Opening /var/lib/cassandra/data/DFS/main-f-249 INFO [CompactionExecutor:1] 2011-04-11 16:27:21,067 SSTableReader.java (line 154) Opening /var/lib/cassandra/data/DFS/main-f-250 INFO [CompactionExecutor:1] 2011-04-11 16:28:01,745 SSTableReader.java (line 154) Opening /var/lib/cassandra/data/DFS/main-f-251 INFO [CompactionExecutor:1] 2011-04-11 16:36:21,320 SSTableReader.java (line 154) Opening /var/lib/cassandra/data/DFS/main-f-252 INFO [CompactionExecutor:1] 2011-04-11 16:36:33,485 SSTableReader.java (line 154) Opening /var/lib/cassandra/data/DFS/main-f-253 ERROR [CompactionExecutor:1] 2011-04-11 16:36:34,368 AbstractCassandraDaemon.java (line 112) Fatal exception in thread Thread[CompactionExecutor:1,1,main] java.io.EOFException at org.apache.cassandra.io.sstable.IndexHelper.skipIndex(IndexHelper.java:65) at org.apache.cassandra.io.sstable.SSTableWriter$Builder.build(SSTableWriter.java:315) at org.apache.cassandra.db.CompactionManager$9.call(CompactionManager.java:942) at org.apache.cassandra.db.CompactionManager$9.call(CompactionManager.java:935) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) ERROR [Thread-329] 2011-04-11 16:36:34,369 AbstractCassandraDaemon.java (line 112) Fatal exception in thread Thread[Thread-329,5,main] java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.io.EOFException at org.apache.cassandra.streaming.StreamInSession.closeIfFinished(StreamInSession.java:151) at org.apache.cassandra.streaming.IncomingStreamReader.read(IncomingStreamReader.java:63) at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:91) Caused by: java.util.concurrent.ExecutionException: java.io.EOFException at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222) at java.util.concurrent.FutureTask.get(FutureTask.java:83) at org.apache.cassandra.streaming.StreamInSession.closeIfFinished(StreamInSession.java:135) ... 
2 more Caused by: java.io.EOFException at org.apache.cassandra.io.sstable.IndexHelper.skipIndex(IndexHelper.java:65) at org.apache.cassandra.io.sstable.SSTableWriter$Builder.build(SSTableWriter.java:315) at org.apache.cassandra.db.CompactionManager$9.call(CompactionManager.java:942) at org.apache.cassandra.db.CompactionManager$9.call(CompactionManager.java:935) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) INFO [CompactionExecutor:1] 2011-04-11 16:36:37,317 SSTableReader.java (line 154) Opening /var/lib/cassandra/data/DFS/main-f-255 INFO [CompactionExecutor:1] 2011-04-11 16:36:37,426 SSTableReader.java (line 154) Opening /var/lib/cassandra/data/DFS/main-f-256 ERROR [CompactionExecutor:1] 2011-04-11 16:36:38,290 AbstractCassandraDaemon.java (line 112) Fatal exception in thread Thread[CompactionExecutor:1,1,main] java.io.EOFException at org.apache.cassandra.io.sstable.IndexHelper.skipIndex(IndexHelper.java:65) at org.apache.cassandra.io.sstable.SSTableWriter$Builder.build(SSTableWriter.java:315) at org.apache.cassandra.db.CompactionManager$9.call(CompactionManager.java:942) at org.apache.cassandra.db.CompactionManager$9.call(CompactionManager.java:935) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662)
help! seed node needs to be replaced
My seed node (1 of 4), which holds the wraparound range (token 0), needs to be replaced. Should I bootstrap the node with a new IP, then add it back as a seed? Should I run removetoken on another node to take over the range?
Re: help! seed node needs to be replaced
I shut down cassandra, deleted (with a backup) the contents of the data directory and did a nodetool move 0. It seems to be populating the node with its range of data. Hope that was a good idea. On Apr 11, 2011, at 10:38 PM, Jonathan Colby wrote: My seed node (1 of 4), which holds the wraparound range (token 0), needs to be replaced. Should I bootstrap the node with a new IP, then add it back as a seed? Should I run removetoken on another node to take over the range?
Re: help! seed node needs to be replaced
Yes. This node has repeatedly given problems while reading various sstables, so I decided to start with a fresh data dir, relying on the fact that with RF=3 the data can be retrieved from the rest of the cluster. Since this is a seed node, I am a little unsure how to proceed. From everything I've read, bootstrapping a seed is not a good idea. One idea I had was to change the IP, bootstrap, and change the IP back. But I just tried nodetool move 0 instead, with the hope that it might work. On Apr 11, 2011, at 11:31 PM, aaron morton wrote: Is this the node that had the earlier EOF error during bootstrap ? Aaron On 12 Apr 2011, at 08:42, Jonathan Colby wrote: I shut down cassandra, deleted (with a backup) the contents of the data directory and did a nodetool move 0. It seems to be populating the node with its range of data. Hope that was a good idea. On Apr 11, 2011, at 10:38 PM, Jonathan Colby wrote: My seed node (1 of 4), which holds the wraparound range (token 0), needs to be replaced. Should I bootstrap the node with a new IP, then add it back as a seed? Should I run removetoken on another node to take over the range?
unrepairable sstable data rows
It appears we have several unserializable or unreadable rows. These were not fixed even after doing a scrub on all nodes - even though the scrub seemed to have completed successfully. I trying to fix these by doing a repair, but these exceptions are thrown exactly when doing a repair. Anyone run into this issue? What's the best way to fix this? I was thinking that flushing and reloading the data with a move (reusing the same token) might be a way to get out of this. Exception seem multiple times for different keys during a repair: ERROR [CompactionExecutor:1] 2011-04-10 14:05:55,528 PrecompactedRow.java (line 82) Skipping row DecoratedKey(58054163627659284217684165071269705317, 64396663313763662d383432622d343439652d623761312d643164663936333738306565) in /var/lib/cassandra/data/DFS/main-f-232-Data.db java.io.EOFException at java.io.RandomAccessFile.readFully(RandomAccessFile.java:383) at java.io.RandomAccessFile.readFully(RandomAccessFile.java:361) at org.apache.cassandra.io.util.BufferedRandomAccessFile.readBytes(BufferedRandomAccessFile.java:268) at org.apache.cassandra.utils.ByteBufferUtil.read(ByteBufferUtil.java:310) at org.apache.cassandra.utils.ByteBufferUtil.readWithLength(ByteBufferUtil.java:267) at org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:94) at org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:35) at org.apache.cassandra.db.ColumnFamilySerializer.deserializeColumns(ColumnFamilySerializer.java:129) at org.apache.cassandra.io.sstable.SSTableIdentityIterator.getColumnFamilyWithColumns(SSTableIdentityIterator.java:176) at org.apache.cassandra.io.PrecompactedRow.init(PrecompactedRow.java:78) at org.apache.cassandra.io.CompactionIterator.getCompactedRow(CompactionIterator.java:139) at org.apache.cassandra.io.CompactionIterator.getReduced(CompactionIterator.java:108) at org.apache.cassandra.io.CompactionIterator.getReduced(CompactionIterator.java:43) at org.apache.cassandra.utils.ReducingIterator.computeNext(ReducingIterator.java:73) at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:136) at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:131) at org.apache.commons.collections.iterators.FilterIterator.setNextObject(FilterIterator.java:183) at org.apache.commons.collections.iterators.FilterIterator.hasNext(FilterIterator.java:94) at org.apache.cassandra.db.CompactionManager.doValidationCompaction(CompactionManager.java:803) at org.apache.cassandra.db.CompactionManager.access$800(CompactionManager.java:56) at org.apache.cassandra.db.CompactionManager$6.call(CompactionManager.java:358) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) This WARN also seems to come up often during a repair. Not sure if it related to this problem: WARN [ScheduledTasks:1] 2011-04-10 14:10:24,991 GCInspector.java (line 149) Heap is 0.8675910480028087 full. You may need to reduce memtable and/or cache sizes. Cassandra will now flush up to the two largest memtables to free up memory. 
Adjust flush_largest_memtables_at threshold in cassandra.yaml if you don't want Cassandra to do this automatically WARN [ScheduledTasks:1] 2011-04-10 14:10:24,992 StorageService.java (line 2206) Flushing ColumnFamilyStore(table='DFS', columnFamily='main') to relieve memory pressure INFO [ScheduledTasks:1] 2011-04-10 14:10:24,992 ColumnFamilyStore.java (line 695) switching in a fresh Memtable for main at CommitLogContext(file='/var/lib/cassandra/commitlog/CommitLog-1302435708131.log', position=28257053)
Re: auto_bootstrap
I can't explain the technical reason why it's not advisable to bootstrap a seed. However, from what I've read you would bootstrap the node as a non-seed first, then add it as seed once it has finished bootstrapping. On Apr 8, 2011, at 9:30 PM, mcasandra wrote: in yaml: # Set to true to make new [non-seed] nodes automatically migrate data # to themselves from the pre-existing nodes in the cluster. Why only non-seed nodes? What if seed nodes need to bootstrap? -- View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/auto-bootstrap-tp6254993p6254993.html Sent from the cassandra-u...@incubator.apache.org mailing list archive at Nabble.com.
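A rough sketch of bootstrapping a rebuilt seed as a non-seed first, assuming a 0.7-era cassandra.yaml with a plain seeds list (key names and layout may differ slightly between releases). The important part is that the node does not appear in its own seed list while it bootstraps.
  # cassandra.yaml on the node being rebuilt, during the bootstrap only:
  auto_bootstrap: true
  seeds:
      - 10.0.0.2        # some OTHER live node, not this node's own address
  # once bootstrap finishes: set auto_bootstrap back to false, add this node
  # back to the seed list on every node (including its own), and restart it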
Re: nodetool move hammers the next node in the ring
thanks! I'll be watching this issue closely. On Apr 9, 2011, at 5:41 AM, Chris Goffinet wrote: We also have a ticket open at https://issues.apache.org/jira/browse/CASSANDRA-2399 We have observed in production the impact of streaming data to new nodes being added. We actually have our entire dataset in page cache in one of our clusters, our 99th percentiles go from 20ms to 1 second on streaming nodes when bootstrapping in new nodes because of blowing out the page cache during the process. We are hoping to have this addressed soon. I think throttling of streaming would be good too, to help minimize saturating the network card on the streaming node. Dynamic snitch should help with this, we'll try to report back our results very soon on what it looks like for that case. -Chris On Apr 8, 2011, at 7:35 PM, aaron morton wrote: My brain just started working. The streaming for the move may need to be throttled, but once the file has been received the bloom filters, row indexes and secondary indexes are built. That will also take some effort, do you have any secondary indexes? If you are doing a move again could you try turing up logging to DEBUG on one of the neighbour nodes. Once the file has been received you will see a message saying Finished {file_name}. Sending ack to {remote_ip}. After this log message the rebuilds will start, would be interesting to see what is more heavy weight I'm guessing the rebuilds. This is similar to https://issues.apache.org/jira/browse/CASSANDRA-2156 but that ticket will not cover this case. I've added this use case to the comments, please check there if you want to follow along. Cheers Aaron On 6 Apr 2011, at 16:26, Jonathan Colby wrote: thanks for the response Aaron. Our cluster has 6 nodes with 10 GB load on each. RF=3.AMD 64 bit Blades, Quad Core, 8 GB ram, running Debian Linux. Swap off. Cassandra 0.7.4 On Apr 6, 2011, at 2:40 AM, aaron morton wrote: Not that I know of, may be useful to be able to throttle things. But if the receiving node has little head room it may still be overwhelmed. Currently there is a single thread for streaming. If we were to throttle it may be best to make it multi threaded with a single concurrent stream per end point. Out of interest how many nodes do you have and whats the RF? Aaron On 6 Apr 2011, at 01:16, Jonathan Colby wrote: When doing a move, decommission, loadbalance, etc. data is streamed to the next node in such a way that it really strains the receiving node - to the point where it has a problem serving requests. Any way to throttle the streaming of data?
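For the DEBUG logging Aaron asks for, on a stock 0.7 install that should be a one-line change on the neighbour node (followed by a restart so log4j picks it up), then watch for the hand-off point he describes. Paths assume the default layout.
  # conf/log4j-server.properties on the neighbour:
  #   log4j.rootLogger=DEBUG,stdout,R
  grep "Sending ack to" /var/log/cassandra/system.log   # marks the point where rebuilds begin after a stream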
Is the repair still going on or did it fail because of exceptions?
It seems on my cluster there are a few unserializable Rows. I'm trying to run a repair on the nodes, but it also seems that the replica nodes have unreadable or unserializable rows.The problem is, I cannot determine if the repair is still going on, or if was interrupted because of these errors. It is unclear because nothing else related to the repair show up in the logs. It's been about 5 hours and I also don't see anything happening when I perform a nodetool netstats on the nodes. The nodetool repair command is still blocking from the console. On the node I'm trying to repair, I see this after launching a repair: ... INFO [manual-repair-6160b400-2c82-4ccb-9451-79caafd7d3cc] 2011-04-08 11:41:55,520 AntiEntropyService.java (line 770) Waiting for repair requests: [#TreeRequest manual-repair-6160b400-2c82-4ccb-9451-7 9caafd7d3cc, /10.46.108.102, (DFS,main), #TreeRequest manual-repair-6160b400-2c82-4ccb-9451-79caafd7d3cc, /10.46.108.101, (DFS,main), #TreeRequest manual-repair-6160b400-2c82-4ccb-9451-79caafd7d3cc , /10.46.108.100, (DFS,main), #TreeRequest manual-repair-6160b400-2c82-4ccb-9451-79caafd7d3cc, /10.47.108.101, (DFS,main)] ... In the log of the node 10.46.108.102 where the repair tries to compare the replica data, I see a couple of the below exceptions a few minutes later. Are the exceptions bad enough to cause the repair to fail? ERROR [CompactionExecutor:1] 2011-04-08 11:43:01,177 PrecompactedRow.java (line 82) Skipping row DecoratedKey(1782314446006375058060694305099335169, 4d657373616765456e726963686d656e743a31343236) in /va r/lib/cassandra/data/DFS/main-f-177-Data.db java.io.EOFException at java.io.RandomAccessFile.readFully(RandomAccessFile.java:383) at java.io.RandomAccessFile.readFully(RandomAccessFile.java:361) at org.apache.cassandra.io.util.BufferedRandomAccessFile.readBytes(BufferedRandomAccessFile.java:268) at org.apache.cassandra.utils.ByteBufferUtil.read(ByteBufferUtil.java:310) at org.apache.cassandra.utils.ByteBufferUtil.readWithLength(ByteBufferUtil.java:267) at org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:94) at org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:35) at org.apache.cassandra.db.ColumnFamilySerializer.deserializeColumns(ColumnFamilySerializer.java:129) at org.apache.cassandra.io.sstable.SSTableIdentityIterator.getColumnFamilyWithColumns(SSTableIdentityIterator.java:176) at org.apache.cassandra.io.PrecompactedRow.init(PrecompactedRow.java:78) at org.apache.cassandra.io.CompactionIterator.getCompactedRow(CompactionIterator.java:139) at org.apache.cassandra.io.CompactionIterator.getReduced(CompactionIterator.java:108) at org.apache.cassandra.io.CompactionIterator.getReduced(CompactionIterator.java:43) at org.apache.cassandra.utils.ReducingIterator.computeNext(ReducingIterator.java:73) at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:136) at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:131) at org.apache.commons.collections.iterators.FilterIterator.setNextObject(FilterIterator.java:183) at org.apache.commons.collections.iterators.FilterIterator.hasNext(FilterIterator.java:94) at org.apache.cassandra.db.CompactionManager.doValidationCompaction(CompactionManager.java:803) at org.apache.cassandra.db.CompactionManager.access$800(CompactionManager.java:56) at org.apache.cassandra.db.CompactionManager$6.call(CompactionManager.java:358) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at 
java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) ERROR [CompactionExecutor:1] 2011-04-08 11:43:53,762 PrecompactedRow.java (line 82) Skipping row DecoratedKey(8073554114801607394928746621229606383, 34393734663734382d316330302d346164372d61372d3162 3430386661393832) in /var/lib/cassandra/data/DFS/main-f-177-Data.db java.io.EOFException at java.io.RandomAccessFile.readFully(RandomAccessFile.java:383) at java.io.RandomAccessFile.readFully(RandomAccessFile.java:361) at org.apache.cassandra.io.util.BufferedRandomAccessFile.readBytes(BufferedRandomAccessFile.java:268) at org.apache.cassandra.utils.ByteBufferUtil.read(ByteBufferUtil.java:310) at org.apache.cassandra.utils.ByteBufferUtil.readWithLength(ByteBufferUtil.java:267) : nodetool netstats reports: Mode: Normal Not sending any streams. Not receiving any streams. Pool NameActive Pending Completed Commandsn/a 0 526207
Re: consistency ONE and null
that makes sense. thanks! On Apr 7, 2011, at 8:36 AM, Stephen Connolly wrote: also there is a configuration parameter that controls the probability of any read request triggering a read repair - Stephen --- Sent from my Android phone, so random spelling mistakes, random nonsense words and other nonsense are a direct result of using swype to type on the screen On 7 Apr 2011 07:35, Stephen Connolly stephen.alan.conno...@gmail.com wrote: as I understand, the read repair is a background task triggered by the read request, but once the consistency requirement has been met you will be given a response. the coordinator at CL.ONE is allowed to return your response once it has one response (empty or not) from any replica. if the first response is empty, you get null - Stephen --- Sent from my Android phone, so random spelling mistakes, random nonsense words and other nonsense are a direct result of using swype to type on the screen On 7 Apr 2011 00:10, Jonathan Colby jonathan.co...@gmail.com wrote: Let's say you have RF of 3 and a write was written to 2 nodes. 1 was not written because the node had a network hiccup (but came back online again). My question is, if you are reading a key with a CL of ONE, and you happen to land on that node that didn't get the write, will the read fail immediately? Or, would read repair check the other replicas and fetch the correct data from the other node(s)? Secondly, is read repair done according to the consistency level, or is read repair an independent configuration setting that can be turned on/off? There was a recent thread about a different variation of my question, but went into very technical details, so I didn't want to hijack that thread.
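A toy illustration of the behaviour Stephen describes (this is not Cassandra's actual code, and READ_REPAIR_CHANCE is just an assumed value standing in for the per-column-family read repair probability): the coordinator hands back the first replica response it gets, empty or not, and repairing the other replicas only ever happens in the background.

    import java.util.List;
    import java.util.Random;
    import java.util.concurrent.Callable;
    import java.util.concurrent.CompletionService;
    import java.util.concurrent.ExecutorCompletionService;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class OneConsistencyToy {
        static final double READ_REPAIR_CHANCE = 0.1;   // assumed config value
        static final ExecutorService pool = Executors.newCachedThreadPool();
        static final Random random = new Random();

        static String readAtOne(List<Callable<String>> replicas) throws Exception {
            CompletionService<String> cs = new ExecutorCompletionService<String>(pool);
            for (Callable<String> replica : replicas) {
                cs.submit(replica);
            }
            // First response wins; if that replica missed the write, the caller sees null.
            String first = cs.take().get();
            if (random.nextDouble() < READ_REPAIR_CHANCE) {
                pool.submit(new Runnable() {
                    public void run() {
                        // background read repair: compare all replicas and push the
                        // newest value to the stale ones (omitted in this sketch)
                    }
                });
            }
            return first;
        }
    }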
reoccurring exceptions seen
These types of exceptions are seen sporadically in our cassandra logs. They occur especially after running a repair with nodetool. I assume there are a few corrupt rows. Is this cause for panic? Will a repair fix this, or is it best to do a decommission + bootstrap (via a move, for example)? Or would a scrub help here? ERROR [CompactionExecutor:1] 2011-04-07 15:51:12,093 PrecompactedRow.java (line 82) Skipping row DecoratedKey(36813508603227779893025154359070714012, 32326437643439642d623566332d346433392d613334622d343738643433633130383633) in /var/lib/cassandra/data/DFS/main-f-164-Data.db java.io.EOFException at java.io.RandomAccessFile.readFully(RandomAccessFile.java:383) at java.io.RandomAccessFile.readFully(RandomAccessFile.java:361) at org.apache.cassandra.io.util.BufferedRandomAccessFile.readBytes(BufferedRandomAccessFile.java:268) at org.apache.cassandra.utils.ByteBufferUtil.read(ByteBufferUtil.java:310) at org.apache.cassandra.utils.ByteBufferUtil.readWithLength(ByteBufferUtil.java:267) at org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:76) at org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:35) at org.apache.cassandra.db.ColumnFamilySerializer.deserializeColumns(ColumnFamilySerializer.java:129) at org.apache.cassandra.io.sstable.SSTableIdentityIterator.getColumnFamilyWithColumns(SSTableIdentityIterator.java:176) at org.apache.cassandra.io.PrecompactedRow.init(PrecompactedRow.java:78) at org.apache.cassandra.io.CompactionIterator.getCompactedRow(CompactionIterator.java:139) at org.apache.cassandra.io.CompactionIterator.getReduced(CompactionIterator.java:108) at org.apache.cassandra.io.CompactionIterator.getReduced(CompactionIterator.java:43) at org.apache.cassandra.utils.ReducingIterator.computeNext(ReducingIterator.java:73) at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:136) at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:131) at org.apache.commons.collections.iterators.FilterIterator.setNextObject(FilterIterator.java:183) at org.apache.commons.collections.iterators.FilterIterator.hasNext(FilterIterator.java:94) at org.apache.cassandra.db.CompactionManager.doValidationCompaction(CompactionManager.java:803) at org.apache.cassandra.db.CompactionManager.access$800(CompactionManager.java:56) at org.apache.cassandra.db.CompactionManager$6.call(CompactionManager.java:358) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) ERROR [CompactionExecutor:1] 2011-04-07 15:51:26,356 INFO [MigrationStage:1] 2011-03-11 17:20:10,900 Migration.java (line 136) Applying migration 6f6e2a6c-4bfb-11e0-a3ae-87e4c47e8541 Add keyspace: DFS rep factor:2 rep strategy:NetworkTopologyStrategy{org.apache.cassandra.config.CFMetaData@2a4bd173[cfId=1000,tableName=DFS,cfName=main,cfType=Standard,comparator=org.apache.cassandra.db.marshal.BytesType@c16c2c0,subcolumncomparator=null,c...skipping...
at org.apache.cassandra.db.ColumnFamilySerializer.deserializeColumns(ColumnFamilySerializer.java:129) at org.apache.cassandra.io.sstable.SSTableIdentityIterator.getColumnFamilyWithColumns(SSTableIdentityIterator.java:176) at org.apache.cassandra.io.PrecompactedRow.init(PrecompactedRow.java:78) at org.apache.cassandra.io.CompactionIterator.getCompactedRow(CompactionIterator.java:139) at org.apache.cassandra.io.CompactionIterator.getReduced(CompactionIterator.java:108) at org.apache.cassandra.io.CompactionIterator.getReduced(CompactionIterator.java:43) at org.apache.cassandra.utils.ReducingIterator.computeNext(ReducingIterator.java:73) at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:136) at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:131) at org.apache.commons.collections.iterators.FilterIterator.setNextObject(FilterIterator.java:183) at org.apache.commons.collections.iterators.FilterIterator.hasNext(FilterIterator.java:94) at org.apache.cassandra.db.CompactionManager.doCompaction(CompactionManager.java:449) at org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:124) at org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:94) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138)
Re: nodetool move hammers the next node in the ring
thanks for the response Aaron. Our cluster has 6 nodes with 10 GB load on each. RF=3. AMD 64 bit Blades, Quad Core, 8 GB ram, running Debian Linux. Swap off. Cassandra 0.7.4 On Apr 6, 2011, at 2:40 AM, aaron morton wrote: Not that I know of, may be useful to be able to throttle things. But if the receiving node has little head room it may still be overwhelmed. Currently there is a single thread for streaming. If we were to throttle it may be best to make it multi threaded with a single concurrent stream per end point. Out of interest how many nodes do you have and what's the RF? Aaron On 6 Apr 2011, at 01:16, Jonathan Colby wrote: When doing a move, decommission, loadbalance, etc. data is streamed to the next node in such a way that it really strains the receiving node - to the point where it has a problem serving requests. Any way to throttle the streaming of data?
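For what it's worth, the kind of throttle being discussed could look roughly like the sketch below: cap the bytes per second a single outgoing stream is allowed to push so the receiving node keeps some headroom. This is purely illustrative; Cassandra 0.7 has no such knob, and the buffer size and one-second window are arbitrary choices.

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;

    public class ThrottledCopy {
        public static void copy(InputStream in, OutputStream out, long maxBytesPerSecond)
                throws IOException, InterruptedException {
            byte[] buffer = new byte[64 * 1024];
            long windowStart = System.currentTimeMillis();
            long sentInWindow = 0;
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
                sentInWindow += read;
                if (sentInWindow >= maxBytesPerSecond) {
                    long elapsed = System.currentTimeMillis() - windowStart;
                    if (elapsed < 1000) {
                        Thread.sleep(1000 - elapsed);   // ahead of budget, back off
                    }
                    windowStart = System.currentTimeMillis();
                    sentInWindow = 0;
                }
            }
            out.flush();
        }
    }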
Re: Location-aware replication based on objects' access pattern
good to see a discussion on this. This also has practical use for business continuity where you can control that the clients in a given data center first write replicas to its own data center, then to the other data center for backup. If I understand correctly, a write takes the token into account first, then the replication strategy decides where the replicas go. I would like to see the first writes be based on location instead of token - whether that is accomplished by manipulating the key or some other mechanism. That way, if you do suffer the loss of a data center, the clients are guaranteed to meet quorum on the nodes in its own data center (given a mirrored architecture across 2 data centers). We have 2 data centers. If one goes down we have the problem that quorum cannot be satisfied for half of the reads. On Apr 6, 2011, at 6:00 AM, Jonathan Ellis wrote: On Tue, Apr 5, 2011 at 10:45 PM, Yudong Gao st...@umich.edu wrote: A better solution would be to just push the DecoratedKey into the ReplicationStrategy so it can make its decision before information is thrown away. I agree. So in this case, I guess the hash-based token ring is still preserved to avoid hot spots, but we further use the DecoratedKey to guide the replication strategy. For example, replica 2 is placed on the first node along the ring that belongs to the desirable data center (based on the location hint embedded in the DecoratedKey). But we may not be able to control the primary replica. Do you think this will be a reasonable design? calculateNaturalEndpoints has complete freedom to generate all replicas any way it likes. Thinking of an endpoint as primary because it was generated first by one algorithm is dangerous. As one of the docstrings explains, replica destinations (endpoints) should be considered a Set even though we use a List for efficiency. None of them are special at the ReplicationStrategy level. Just curious, are they happy with the current solution with keyspace, and are there some requests for per-row placement control? Enough people want to try it that we have the ticket open. :) -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
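To make the idea concrete, here is a rough, self-contained sketch of placement driven by a location hint in the key while the first replica still follows the hash-based ring. The Node type, the ring map and the hint extraction are stand-ins for illustration only; this is not the real calculateNaturalEndpoints signature.

    import java.util.ArrayList;
    import java.util.LinkedHashSet;
    import java.util.List;
    import java.util.Set;
    import java.util.SortedMap;

    class Node {
        final String datacenter;
        final String address;
        Node(String datacenter, String address) { this.datacenter = datacenter; this.address = address; }
    }

    class LocationHintedPlacement {
        // ring: token -> node, sorted by token
        static List<Node> endpointsFor(long token, String hintedDc, int replicas,
                                       SortedMap<Long, Node> ring) {
            Set<Node> result = new LinkedHashSet<Node>();
            Iterable<Node> clockwise = walkFrom(token, ring);
            // replica 1: the token owner, wherever it lives (keeps the hash-based balance)
            for (Node n : clockwise) { result.add(n); break; }
            // replica 2: first node in the data center the key asked for
            for (Node n : clockwise) {
                if (n.datacenter.equals(hintedDc)) { result.add(n); break; }
            }
            // remaining replicas: keep walking the ring, skipping duplicates
            for (Node n : clockwise) {
                if (result.size() >= replicas) break;
                result.add(n);
            }
            return new ArrayList<Node>(result);
        }

        static Iterable<Node> walkFrom(long token, SortedMap<Long, Node> ring) {
            List<Node> order = new ArrayList<Node>();
            order.addAll(ring.tailMap(token).values());  // from the key's token to the end
            order.addAll(ring.headMap(token).values());  // then wrap around
            return order;
        }
    }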
consistency ONE and null
Let's say you have RF of 3 and a write was written to 2 nodes. 1 was not written because the node had a network hiccup (but came back online again). My question is, if you are reading a key with a CL of ONE, and you happen to land on that node that didn't get the write, will the read fail immediately? Or, would read repair check the other replicas and fetch the correct data from the other node(s)? Secondly, is read repair done according to the consistency level, or is read repair an independent configuration setting that can be turned on/off. There was a recent thread about a different variation of my question, but went into very technical details, so I didn't want to hijack that thread.
Re: Re: nodetool cleanup - results in more disk use?
I think the key thing to remember is that compaction is performed on *similar* sized sstables. So it makes sense that over time this will have a cascading effect. I think by default it starts out with compacting 4 flushed sstables, then the cycle begins. On Apr 4, 2011 3:42pm, shimi shim...@gmail.com wrote: The bigger the file the longer it will take for it to be part of a compaction again. Compacting a bucket of large files takes longer than compacting a bucket of small files. Shimi On Mon, Apr 4, 2011 at 3:58 PM, aaron morton aa...@thelastpickle.com wrote: mmm, interesting. My theory was: t0 - major compaction runs, there is now one sstable; t1 - x new sstables have been created; t2 - minor compaction runs and determines there are two buckets, one with the x new sstables and one with the single big file. The bucket of many files is compacted into one, the bucket of one file is ignored. I can see that it takes longer for the big file to be involved in compaction again, and when it finally was it would take more time. But that minor compactions of new SSTables would still happen at the same rate, especially if they are created at the same rate as previously. Am I missing something or am I just reading the docs wrong? Cheers Aaron On 4 Apr 2011, at 22:20, Jonathan Colby wrote: hi Aaron - The Datastax documentation brought to light the fact that over time, major compactions will be performed on bigger and bigger SSTables. They actually recommend against performing too many major compactions. Which is why I am wary to trigger too many major compactions ... http://www.datastax.com/docs/0.7/operations/scheduled_tasks Performing Major Compaction: A major compaction process merges all SSTables for all column families in a keyspace – not just similar sized ones, as in minor compaction. Note that this may create extremely large SSTables that result in long intervals before the next minor compaction (and a resulting increase in CPU usage for each minor compaction). Though a major compaction ultimately frees disk space used by accumulated SSTables, during runtime it can temporarily double disk space usage. It is best to run major compactions, if at all, at times of low demand on the cluster. On Apr 4, 2011, at 1:57 PM, aaron morton wrote: cleanup reads each SSTable on disk and writes a new file that contains the same data with the exception of rows that are no longer in a token range the node is a replica for. It's not compacting the files into fewer files or purging tombstones. But it is re-writing all the data for the CF. Part of the process will trigger GC if needed to free up disk space from SSTables no longer needed. AFAIK having fewer bigger files will not cause longer minor compactions. Compaction thresholds are applied per bucket of files that share a similar size, there are normally more smaller files and fewer larger files. Aaron On 2 Apr 2011, at 01:45, Jonathan Colby wrote: I discovered that a Garbage collection cleans up the unused old SSTables. But I still wonder whether cleanup really does a full compaction. This would be undesirable if so. On Apr 1, 2011, at 4:08 PM, Jonathan Colby wrote: I ran node cleanup on a node in my cluster and discovered the disk usage went from 3.3 GB to 5.4 GB. Why is this? I thought cleanup just removed hinted handoff information. I read that *during* cleanup extra disk space will be used similar to a compaction. But I was expecting the disk usage to go back down when it finished. I hope cleanup doesn't trigger a major compaction.
I'd rather not run major compactions because it means future minor compactions will take longer and use more CPU and disk.
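A sketch of the bucketing behaviour described in this thread may make it clearer why one huge post-major-compaction sstable drops out of minor compactions: files are grouped by similar size, and only buckets that reach the minimum threshold get compacted. The 1.5x similarity window and the threshold of 4 here are assumptions for illustration, not Cassandra's exact constants.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    public class SizeTieredBuckets {
        public static List<List<Long>> bucket(List<Long> sstableSizes, int minThreshold) {
            List<Long> sorted = new ArrayList<Long>(sstableSizes);
            Collections.sort(sorted);
            List<List<Long>> buckets = new ArrayList<List<Long>>();
            List<Long> current = new ArrayList<Long>();
            double bucketAverage = 0;
            for (long size : sorted) {
                if (!current.isEmpty() && size > bucketAverage * 1.5) {
                    buckets.add(current);               // close this bucket, start a new one
                    current = new ArrayList<Long>();
                }
                current.add(size);
                bucketAverage = average(current);
            }
            buckets.add(current);
            // only buckets with enough similar-sized files are worth a minor compaction;
            // a lone giant file never reaches the threshold and is simply left alone
            List<List<Long>> compactable = new ArrayList<List<Long>>();
            for (List<Long> b : buckets) {
                if (b.size() >= minThreshold) compactable.add(b);
            }
            return compactable;
        }

        private static double average(List<Long> sizes) {
            long sum = 0;
            for (long s : sizes) sum += s;
            return (double) sum / sizes.size();
        }
    }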
if nodetool operations abort with timeout, did the operation continue?
When doing a nodetool move, after about 15 minutes I got the below exception. The cassandra log seems to indicate that the move is still ongoing. Is this anything to worry about? Exception in thread "main" java.rmi.UnmarshalException: Error unmarshaling return header; nested exception is: java.io.EOFException at sun.rmi.transport.StreamRemoteCall.executeCall(StreamRemoteCall.java:209) at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:142) at com.sun.jmx.remote.internal.PRef.invoke(Unknown Source) at javax.management.remote.rmi.RMIConnectionImpl_Stub.invoke(Unknown Source) at javax.management.remote.rmi.RMIConnector$RemoteMBeanServerConnection.invoke(RMIConnector.java:993) at javax.management.MBeanServerInvocationHandler.invoke(MBeanServerInvocationHandler.java:288) at $Proxy0.move(Unknown Source) at org.apache.cassandra.tools.NodeProbe.move(NodeProbe.java:347) at org.apache.cassandra.tools.NodeCmd.main(NodeCmd.java:564) Caused by: java.io.EOFException at java.io.DataInputStream.readByte(DataInputStream.java:250) at sun.rmi.transport.StreamRemoteCall.executeCall(StreamRemoteCall.java:195)
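Since the move is issued over JMX and runs server-side, losing the client-side RMI connection should not by itself abort it. One way to double-check from outside nodetool is to read the node's operation mode over JMX; this is a sketch, and the MBean name, the OperationMode attribute and port 8080 are assumptions based on 0.7, so confirm them in jconsole.

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class CheckOperationMode {
        public static void main(String[] args) throws Exception {
            String host = args.length > 0 ? args[0] : "localhost";
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://" + host + ":8080/jmxrmi");
            JMXConnector jmx = JMXConnectorFactory.connect(url);
            try {
                MBeanServerConnection mbs = jmx.getMBeanServerConnection();
                ObjectName ss = new ObjectName("org.apache.cassandra.db:type=StorageService");
                // e.g. "Normal", "Leaving", "Moving" - anything other than Normal means it is still busy
                System.out.println("OperationMode: " + mbs.getAttribute(ss, "OperationMode"));
            } finally {
                jmx.close();
            }
        }
    }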
Disable Swap? batch_mutate failed: out of sequence response
Hi Jonathan - Would you recommend disabling system swap as a rule? I'm running on Debian 64bit and am seeing light swapping: total used free shared buffers cached Mem: 8003 7969 33 0 0 4254 -/+ buffers/cache: 3714 4288 Swap: 513 15498 On Apr 5, 2011, at 8:35 PM, Jonathan Ellis wrote: Step 1: disable swap. 2011/4/5 Héctor Izquierdo Seliva izquie...@strands.com: Update with more info: I'm still running into problems. Now I don't write more than 100 columns at a time, and I'm having lots of Stop-the-world gc pauses. I'm writing into three column families, with memtable_operations = 0.3 and memtable_throughput = 64. There is now swapping, and full GCs are taking around 5 seconds. I'm running cassandra with a heap of 8 GB. Should I tune this somehow? Is any of this wrong? -Original Message- From: Héctor Izquierdo Seliva [mailto:izquie...@strands.com] Sent: April-05-11 8:30 To: user@cassandra.apache.org Subject: batch_mutate failed: out of sequence response Hi everyone. I'm having trouble while inserting big amounts of data into cassandra. I'm getting this exception: batch_mutate failed: out of sequence response I'm guessing it is due to very big mutates. I have made the batch mutates smaller and it seems to be behaving. Can somebody shed some light? Thanks! -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
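The workaround Héctor describes, splitting one huge mutation into smaller batches, sketched generically below. Mutation and BatchClient are placeholders for whatever Thrift or Hector types are actually in use, and 100 per batch is just the figure from this thread.

    import java.util.ArrayList;
    import java.util.List;

    public class BatchChunker {
        interface BatchClient { void batchMutate(List<Mutation> batch) throws Exception; }
        static class Mutation { /* column writes for one key */ }

        static void insertInChunks(BatchClient client, List<Mutation> all, int chunkSize)
                throws Exception {
            List<Mutation> chunk = new ArrayList<Mutation>(chunkSize);
            for (Mutation m : all) {
                chunk.add(m);
                if (chunk.size() == chunkSize) {
                    client.batchMutate(chunk);          // e.g. 100 columns at a time
                    chunk = new ArrayList<Mutation>(chunkSize);
                }
            }
            if (!chunk.isEmpty()) {
                client.batchMutate(chunk);              // send whatever is left over
            }
        }
    }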
extreme memory consumption
I've seen the other posts about memory consumption, but I'm seeing some weird behavior with 0.7.4 with 5 GB heap size (64 bit system with 8 GB ram total)... note the virtual mem used 20.6 GB ?! and Shared 8.4 GB ?! PID USER PR NI VIRT RES SHR S %CPU %MEMTIME+ COMMAND 2390 root 20 0 000 D1 0.0 0:28.73 flush-104:0 31684 cassandr 20 0 20.6g 3.5g 8496 S1 45.4 4:08.91 java 17 root 20 0 000 S0 0.0 0:38.03 events/2 What could be going on here? config initial_token: auto_bootstrap: true hinted_handoff_enabled: true max_hint_window_in_ms: 360 # one hour hinted_handoff_throttle_delay_in_ms: 50 authenticator: org.apache.cassandra.auth.AllowAllAuthenticator authority: org.apache.cassandra.auth.AllowAllAuthority partitioner: org.apache.cassandra.dht.RandomPartitioner data_file_directories: - /var/lib/cassandra/data commitlog_directory: /var/lib/cassandra/commitlog saved_caches_directory: /var/lib/cassandra/saved_caches commitlog_rotation_threshold_in_mb: 128 commitlog_sync: periodic commitlog_sync_period_in_ms: 1 flush_largest_memtables_at: 0.75 reduce_cache_sizes_at: 0.85 reduce_cache_capacity_to: 0.6 disk_access_mode: auto concurrent_reads: 16 concurrent_writes: 32 sliced_buffer_size_in_kb: 64 storage_port: 7000 rpc_port: 9160 rpc_keepalive: true thrift_framed_transport_size_in_mb: 15 thrift_max_message_length_in_mb: 16 snapshot_before_compaction: false binary_memtable_throughput_in_mb: 256 column_index_size_in_kb: 64 in_memory_compaction_limit_in_mb: 64 rpc_timeout_in_ms: 1 endpoint_snitch: org.apache.cassandra.locator.RackInferringSnitch dynamic_snitch: true dynamic_snitch_update_interval_in_ms: 100 dynamic_snitch_reset_interval_in_ms: 60 dynamic_snitch_badness_threshold: 0.0 request_scheduler: org.apache.cassandra.scheduler.NoScheduler index_interval: 128 keyspaces: - name: DFS replica_placement_strategy: org.apache.cassandra.locator.OldNetworkTopologyStrategy replication_factor: 3 column_families: - name: main compare_with: BytesType keys_cached: 20 rows_cached: 200 row_cache_save_period_in_seconds: 0 key_cache_save_period_in_seconds: 3600
nothing happening in the cluster after a nodetool move
I added a node to the cluster and I am having a difficult time reassigning the new tokens. It seems after a while nothing shows up in the new node's logs and it just stays in status Leaving. nodetool netstats on all nodes shows Nothing streaming to/from. There is no activity in the other logs related to the move. The data size is not even that big, around 5 GB. What could be happening? Seems like the move is frozen.
Update: Re: nothing happening in the cluster after a nodetool move
Well, since my last post, about 10 minutes later, the node goes into bootstrap mode. It's kind of worrying that a lot of time goes by where it seems like nothing is happening, then all of a sudden things get going again. 22,584 keys. Time: 20,276ms. INFO [HintedHandoff:1] 2011-04-05 22:29:23,167 HintedHandOffManager.java (line 304) Started hinted handoff for endpoint /10.46.108.101 INFO [HintedHandoff:1] 2011-04-05 22:29:23,167 HintedHandOffManager.java (line 360) Finished hinted handoff of 0 rows to endpoint /10.46.108.101 LONG PAUSE WHERE NOTHING HAPPENS INFO [RMI TCP Connection(4)-10.46.108.102] 2011-04-05 22:43:38,770 StorageService.java (line 1637) Announcing that I have left the ring for 3ms INFO [RMI TCP Connection(4)-10.46.108.102] 2011-04-05 22:44:08,770 StorageService.java (line 1747) re-bootstrapping to new token 85070591730234615865843651857942052863 INFO [RMI TCP Connection(4)-10.46.108.102] 2011-04-05 22:44:08,771 ColumnFamilyStore.java (line 695) switching in a fresh Memtable for LocationInfo at CommitLogContext(file='/var/lib/cassandra/commitlog/CommitLog-1302035265949.log', position=25920946) INFO [RMI TCP Connection(4)-10.46.108.102] 2011-04-05 22:44:08,771 ColumnFamilyStore.java (line 1006) Enqueuing flush of Memtable-LocationInfo@1358281533(53 bytes, 2 operations) INFO [FlushWriter:1] 2011-04-05 22:44:08,772 Memtable.java (line 157) Writing Memtable-LocationInfo@1358281533(53 bytes, 2 operations) INFO [FlushWriter:1] 2011-04-05 22:44:08,825 Memtable.java (line 164) Completed flushing /var/lib/cassandra/data/system/LocationInfo-f-22-Data.db (163 bytes) INFO [RMI TCP Connection(4)-10.46.108.102] 2011-04-05 22:44:08,826 StorageService.java (line 505) Joining: sleeping 3 ms for pending range setup INFO [RMI TCP Connection(4)-10.46.108.102] 2011-04-05 22:44:38,826 StorageService.java (line 505) Bootstrapping INFO [CompactionExecutor:1] 2011-04-05 22:44:43,952 SSTableReader.java (line 154) Opening /var/lib/cassandra/data/DFS/main-f-128 INFO [CompactionExecutor:1] 2011-04-05 22:44:43,978 SSTableReader.java (line 154) Opening /var/lib/cassandra/data/DFS/main-f-129 INFO [CompactionExecutor:1] 2011-04-05 22:46:02,228 SSTableReader.java (line 154) Opening /var/lib/cassandra/data/DFS/main-f-130 On Apr 5, 2011, at 10:46 PM, Jonathan Colby wrote: I added a node to the cluster and I am having a difficult time reassigning the new tokens. It seems after a while nothing shows up in the new node's logs and it just stays in status Leaving. nodetool netstats on all nodes shows Nothing streaming to/from. There is no activity in the other logs related to the move. The data size is not even that big, around 5 GB.What could be happening? Seems like the move is frozen.
Re: nodetool cleanup - results in more disk use?
hi Aaron - The Datastax documentation brought to light the fact that over time, major compactions will be performed on bigger and bigger SSTables. They actually recommend against performing too many major compactions. Which is why I am wary to trigger too many major compactions ... http://www.datastax.com/docs/0.7/operations/scheduled_tasks Performing Major Compaction¶ A major compaction process merges all SSTables for all column families in a keyspace – not just similar sized ones, as in minor compaction. Note that this may create extremely large SStables that result in long intervals before the next minor compaction (and a resulting increase in CPU usage for each minor compaction). Though a major compaction ultimately frees disk space used by accumulated SSTables, during runtime it can temporarily double disk space usage. It is best to run major compactions, if at all, at times of low demand on the cluster. On Apr 4, 2011, at 1:57 PM, aaron morton wrote: cleanup reads each SSTable on disk and writes a new file that contains the same data with the exception of rows that are no longer in a token range the node is a replica for. It's not compacting the files into fewer files or purging tombstones. But it is re-writing all the data for the CF. Part of the process will trigger GC if needed to free up disk space from SSTables no longer needed. AFAIK having fewer bigger files will not cause longer minor compactions. Compaction thresholds are applied per bucket of files that share a similar size, there is normally more smaller files and fewer larger files. Aaron On 2 Apr 2011, at 01:45, Jonathan Colby wrote: I discovered that a Garbage collection cleans up the unused old SSTables. But I still wonder whether cleanup really does a full compaction. This would be undesirable if so. On Apr 1, 2011, at 4:08 PM, Jonathan Colby wrote: I ran node cleanup on a node in my cluster and discovered the disk usage went from 3.3 GB to 5.4 GB. Why is this? I thought cleanup just removed hinted handoff information. I read that *during* cleanup extra disk space will be used similar to a compaction. But I was expecting the disk usage to go back down when it finished. I hope cleanup doesn't trigger a major compaction. I'd rather not run major compactions because it means future minor compactions will take longer and use more CPU and disk.
Re: changing replication strategy and effects on replica nodes
Hi Aaron - Yes, I've read the part about changing the replication factor on a running cluster. I've even done it without a problem. The real point of my question was: do you now have unused replica data on the old replica nodes that you need to clean up manually? Any insight would be appreciated. On Apr 1, 2011, at 1:45 PM, aaron morton wrote: See the section on Replication here http://wiki.apache.org/cassandra/Operations#Replication It talks about how to change the RF and then says you can do the same when changing the placement strategy. It can be done, but is a little messy. Depending on your setup it may also be possible to copy / move the nodes manually by moving sstable files. I've not done it myself, are you able to run a test? Hope that helps. Aaron On 1 Apr 2011, at 02:04, Jonathan Colby wrote: From my understanding of replica copies, cassandra picks which nodes to replicate the data based on replication strategy, and those same replica partner nodes are always used according to token ring distribution. If you change the replication strategy, does cassandra pick new nodes to replicate to? (for example if you went from simple strategy to a networkTopology strategy where copies are to be sent to another datacenter) If so, do you now have unused replica data on the old replica nodes that you need to clean up manually?
nodetool cleanup - results in more disk use?
I ran node cleanup on a node in my cluster and discovered the disk usage went from 3.3 GB to 5.4 GB. Why is this? I thought cleanup just removed hinted handoff information. I read that *during* cleanup extra disk space will be used similar to a compaction. But I was expecting the disk usage to go back down when it finished. I hope cleanup doesn't trigger a major compaction. I'd rather not run major compactions because it means future minor compactions will take longer and use more CPU and disk.
Re: nodetool cleanup - results in more disk use?
I discovered that a Garbage collection cleans up the unused old SSTables. But I still wonder whether cleanup really does a full compaction. This would be undesirable if so. On Apr 1, 2011, at 4:08 PM, Jonathan Colby wrote: I ran node cleanup on a node in my cluster and discovered the disk usage went from 3.3 GB to 5.4 GB. Why is this? I thought cleanup just removed hinted handoff information. I read that *during* cleanup extra disk space will be used similar to a compaction. But I was expecting the disk usage to go back down when it finished. I hope cleanup doesn't trigger a major compaction. I'd rather not run major compactions because it means future minor compactions will take longer and use more CPU and disk.
changing replication strategy and effects on replica nodes
From my understanding of replica copies, cassandra picks which nodes to replicate the data based on replication strategy, and those same replica partner nodes are always used according to token ring distribution. If you change the replication strategy, does cassandra pick new nodes to replicate to? (for example if you went from simple strategy to a networkTopology strategy where copies are to be sent to another datacenter) If so, do you now have unused replica data on the old replica nodes that you need to clean up manually?
Re: How to determine if repair need to be run
silly question, would every cassandra installation need to have manual repairs done on it? It would seem cassandra's read repair and regular compaction would take care of keeping the data clean. Am I missing something? On Mar 30, 2011, at 7:46 PM, Peter Schuller wrote: I just wanted to chime in here and say some people NEVER run repair. Just so long as the OP is understanding that this implies taking an explicit decision to accept the misbehavior you will see as a result. I.e., the reason people survive not doing repairs in some cases is, as in your case, that they can actually live with the consequences such as old data magically re-appearing permanently. as it really increased on disk data. I have followed some threads and there are some conditions that I read repair can't handle. The For one thing, RR will only touch data that is read. And not even all data that is read at that (e.g. range slices don't imply repair). -- / Peter Schuller
Re: How to determine if repair need to be run
Peter - Thanks a lot for elaborating on repairs. Still, it's a bit fuzzy to me why it is so important to run a repair before the GCGraceSeconds kicks in. Does this mean a delete does not get replicated? In other words when I delete something on a node, doesn't cassandra set tombstones on its replica copies? And technically, isn't repair only needed for cases where things weren't properly propagated in the cluster? If all writes are written to the right replicas, and all deletes are written to all the replicas, and all nodes were available at all times, then everything should work as designed - without manual intervention, right? Thanks again. On Mar 31, 2011, at 6:17 PM, Peter Schuller wrote: silly question, would every cassandra installation need to have manual repairs done on it? It would seem cassandra's read repair and regular compaction would take care of keeping the data clean. Am I missing something? See my previous posts in this thread for the distinct reasons to run repair. Except in special circumstances where you know exactly what you're doing (mainly that no deletes are performed), you are *required* to run repair often enough for GCGraceSeconds: http://wiki.apache.org/cassandra/Operations#Frequency_of_nodetool_repair It seems that there needs to be some more elaborate documentation about this somewhere to point to since there seems to be confusion. Regular compaction does *not* imply repair. Read repair only works if (1) you touch all data within GCGraceSeconds, and (2) you touch it in such a way that read repair is enabled (e.g., not range scans), and (3) no node ever happens to be down, flap, or drop a request when you touch the data in question. Basically, unless you are really sure what you're doing - run repair. -- / Peter Schuller
difference between compaction, repair, clean
I'm a little unclear on the differences between the nodetool operations: - compaction - repair - clean I understand that compaction consolidates the SSTables and physically performs deletes by taking into account the Tombstones. But what do clean and repair do then?
Re: Central monitoring of Cassandra cluster
Cacti and Munin are great for graphing, nagios is good for monitoring. I wrote a very simple JMX proxy that you can send a request to and it retrieves the desired JMX beans. There are JMX proxies out there if you don't want to write your own, for example http://code.google.com/p/polarrose-jmx-rest-bridge/ There is even a JMX proxy that integrates with nagios. I don't remember the name but google will help you. On Mar 24, 2011, at 7:44 PM, mcasandra wrote: Can someone share if they have centralized monitoring for all cassandra servers? With many nodes it becomes difficult to monitor them individually unless we can look at data in one place. I am looking at solutions where this can be done. Looking at Cacti currently but not sure how to integrate it with JMX.
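A stripped-down version of the "very simple JMX proxy" idea: read one MBean attribute and print it, so Cacti/Munin/Nagios can invoke it as an external check or graph the output. The port (8080, the 0.7 default) and the example bean and attribute names in the comments are assumptions.

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class JmxRead {
        public static void main(String[] args) throws Exception {
            String host = args[0];        // e.g. a cassandra node's address
            String bean = args[1];        // e.g. org.apache.cassandra.db:type=StorageService
            String attribute = args[2];   // e.g. OperationMode
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://" + host + ":8080/jmxrmi");
            JMXConnector jmx = JMXConnectorFactory.connect(url);
            try {
                MBeanServerConnection mbs = jmx.getMBeanServerConnection();
                System.out.println(mbs.getAttribute(new ObjectName(bean), attribute));
            } finally {
                jmx.close();
            }
        }
    }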
how does cassandra pick its replicant peers?
Does anyone know how cassandra chooses the nodes for its other replicant copies? The first node gets the first copy because its token is assigned for that key. But what about the other copies of the data? Do the replicant nodes stay the same based on the token range? Or are the other copies sent to any random node based on its load and availability? I think this is important to understand because it affects how to plan for situations where a significant number of nodes are suddenly unavailable, such as the loss of a data center. If the replicants are copied just based on random availability, then quorum writes could survive on the remaining nodes. But if the replicant nodes are somehow pre-determined, those replicants may not be available and writes will fail.
Quorum, Hector, and datacenter preference
Hi - Our cluster is spread between 2 datacenters. We have a straight-forward IP assignment so that OldNetworkTopology (rackinferring snitch) works well. We have cassandra clients written in Hector in each of those data centers. The Hector clients all have a list of all cassandra nodes across both data centers. RF=3. Is there an order as to which data center gets the first write? In other words, would (or can) the Hector client do its first write to the cassandra nodes in its own data center? It would be ideal if Hector chose the local cassandra nodes. That way, if one data center is unreachable, the Quorum of replicas in cassandra is still reached (because it was written to the working data center first). Otherwise, if the cassandra writes are really random from the Hector client point-of-view, a data center outage would result in a read failure for any data that has 2 replicas in the lost data center. Is anyone doing this? Is there a flaw in my logic?
Re: Quorum, Hector, and datacenter preference
Indeed I found the big flaw in my own logic. Even writing to the local cassandra nodes does not guarantee where the replicas will end up. The decision where to write the first replica is based on the token ring, which is spread out on all nodes regardless of datacenter. Right? On Mar 24, 2011, at 2:02 PM, Jonathan Colby wrote: Hi - Our cluster is spread between 2 datacenters. We have a straight-forward IP assignment so that OldNetworkTopology (rackinferring snitch) works well. We have cassandra clients written in Hector in each of those data centers. The Hector clients all have a list of all cassandra nodes across both data centers. RF=3. Is there an order as to which data center gets the first write? In other words, would (or can) the Hector client do its first write to the cassandra nodes in its own data center? It would be ideal if Hector chose the local cassandra nodes. That way, if one data center is unreachable, the Quorum of replicas in cassandra is still reached (because it was written to the working data center first). Otherwise, if the cassandra writes are really random from the Hector client point-of-view, a data center outage would result in a read failure for any data that has 2 replicas in the lost data center. Is anyone doing this? Is there a flaw in my logic?
Deleting old SSTables
According to the Wiki Page on compaction: once compaction is finished, the old SSTable files may be deleted (http://wiki.apache.org/cassandra/MemtableSSTable) I thought the old SSTables would be deleted automatically, but this wiki page got me thinking otherwise. Question is, if it is true that old SSTables must be manually deleted, how can one safely identify which SSTables can be deleted?? Jon
Changing memtable_throughput_in_mb on a running system
It seems some settings like memtable_throughput_in_mb are Keyspace-specific (at least with 0.7.4). How can these settings best be changed on a running cluster? PS - preferably by a sysadmin using nodetool or cassandra-cli Thanks! Jon
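If these are the per-column-family settings from 0.7, one possibility is to change them on each running node over JMX and then persist the new value in the schema (for example with update column family in cassandra-cli) so it survives a restart. The bean pattern and attribute name below are assumptions based on 0.7's per-CF MBeans; check what the node actually registers in jconsole before relying on this sketch.

    import javax.management.Attribute;
    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class SetMemtableThroughput {
        public static void main(String[] args) throws Exception {
            String host = args[0];      // node to change
            String keyspace = args[1];  // e.g. DFS
            String cf = args[2];        // e.g. main
            int mb = Integer.parseInt(args[3]);
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://" + host + ":8080/jmxrmi");
            JMXConnector jmx = JMXConnectorFactory.connect(url);
            try {
                MBeanServerConnection mbs = jmx.getMBeanServerConnection();
                // assumed ObjectName pattern for the per-CF MBean in 0.7
                ObjectName cfBean = new ObjectName(
                        "org.apache.cassandra.db:type=ColumnFamilies,keyspace="
                        + keyspace + ",columnfamily=" + cf);
                mbs.setAttribute(cfBean, new Attribute("MemtableThroughputInMB", mb));
                System.out.println("now: " + mbs.getAttribute(cfBean, "MemtableThroughputInMB"));
            } finally {
                jmx.close();
            }
        }
    }

A JMX change like this only affects the running process; to make it stick across restarts you would also update the column family definition in the schema.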
Re: Deleting old SSTables
doooh. thanks! On Mar 22, 2011, at 3:27 PM, Jonathan Ellis wrote: From the next paragraph of the same wiki page: SSTables that are obsoleted by a compaction are deleted asynchronously when the JVM performs a GC. You can force a GC from jconsole if necessary, but Cassandra will force one itself if it detects that it is low on space. A compaction marker is also added to obsolete sstables so they can be deleted on startup if the server does not perform a GC before being restarted. On Tue, Mar 22, 2011 at 8:30 AM, Jonathan Colby jonathan.co...@gmail.com wrote: According to the Wiki Page on compaction: once compaction is finished, the old SSTable files may be deleted* * http://wiki.apache.org/cassandra/MemtableSSTable I thought the old SSTables would be deleted automatically, but this wiki page got me thinking otherwise. Question is, if it is true that old SSTables must be manually deleted, how can one safely identify which SSTables can be deleted?? Jon -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
Meaning of TotalReadLatencyMicros and TotalWriteLatencyMicros Statistics
Hi - On our recently live cassandra cluster of 5 nodes, we've noticed that the latency readings, especially Reads have gone up drastically. TotalReadLatencyMicros 5413483 TotalWriteLatencyMicros 1811824 I understand these are in microseconds, but what meaning do they have for the performance of the cluster? In other words what do these numbers actually measure. In our case, it looks like we have a read latency of 5.4 seconds, which is very troubling if I interpret this correctly. Are reads really taking an average of 5 seconds to complete??
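As I understand these counters, Total*LatencyMicros is a running sum of microseconds over every operation since the node started, not a per-request figure, so 5,413,483 us is not a 5.4 second read latency by itself. Dividing the change in the total by the change in the matching operation count between two samples gives an average; the counts below are made up purely for illustration.

    public class LatencyFromTotals {
        public static void main(String[] args) {
            // two samples of (TotalReadLatencyMicros, ReadCount), e.g. taken a minute apart
            long total1 = 5413483L, count1 = 110000L;   // hypothetical numbers
            long total2 = 5902141L, count2 = 118500L;

            double lifetimeAvgMs = (total1 / 1000.0) / count1;
            double recentAvgMs = ((total2 - total1) / 1000.0) / (count2 - count1);

            System.out.printf("lifetime average: %.3f ms per read%n", lifetimeAvgMs);
            System.out.printf("average over the sample window: %.3f ms per read%n", recentAvgMs);
        }
    }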
cassandra nodes with mixed hard disk sizes
This is a two part question ... 1. If you have cassandra nodes with different sized hard disks, how do you deal with assigning the token ring such that the nodes with larger disks get more data? In other words, given equally distributed token ranges, when the smaller disk nodes run out of space, the larger disk nodes will still have unused capacity. Or is installing a mixed hardware cluster a no-no? 2. What happens when a cassandra node runs out of disk space for its data files? Does it continue serving the data while not accepting new data? Or does the node break and require manual intervention? This info has eluded me elsewhere. Jon
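On question 1, one back-of-the-envelope approach is to place the initial tokens so each node's slice of the ring is proportional to its disk capacity rather than equal. The capacities below are made-up examples, and replication (RF > 1) will blur the effect, so treat this as a rough guide only.

    import java.math.BigInteger;

    public class WeightedTokens {
        public static void main(String[] args) {
            BigInteger ringSize = BigInteger.valueOf(2).pow(127); // RandomPartitioner token space
            long[] diskGb = { 500, 500, 1000, 2000 };             // hypothetical node capacities
            long totalGb = 0;
            for (long gb : diskGb) totalGb += gb;

            long cumulative = 0;
            for (int i = 0; i < diskGb.length; i++) {
                cumulative += diskGb[i];
                // token marks the end of this node's slice, sized by its share of capacity
                // (minus one so the last token stays inside the ring)
                BigInteger token = ringSize.multiply(BigInteger.valueOf(cumulative))
                                           .divide(BigInteger.valueOf(totalGb))
                                           .subtract(BigInteger.ONE);
                System.out.println("node " + i + " (" + diskGb[i] + " GB): initial_token " + token);
            }
        }
    }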
Re: script to modify cassandra.yaml file
We use Puppet to manage the cassandra.yaml in a different location from the installation. Ours is in /etc/cassandra/cassandra.yaml You can set the environment variable CASSANDRA_CONF (I believe that's the name; check cassandra.in.sh) and the startup script will pick this up as the configuration file to use. With Puppet you can manage the list of seeds, set the IP addresses, etc. dynamically. I even use it to set the initial tokens. It makes life a lot easier. On Mar 21, 2011, at 9:14 AM, Sasha Dolgy wrote: I use grep / awk / sed from within a bash script ... this works quite well. -sd On Mon, Mar 21, 2011 at 12:39 AM, Anurag Gujral anurag.guj...@gmail.com wrote: Hi All, I want to modify the values in the cassandra.yaml which comes with the cassandra-0.7 package based on values of hostnames, colo etc. Does someone know of a script which I can use that reads in the default cassandra.yaml and writes out a new cassandra.yaml with values based on the number of nodes in the cluster, hostname, colo name etc.
Replacing a dead seed
Hi - If a seed crashes (i.e., suddenly unavailable due to HW problem), what is the best way to replace the seed in the cluster? I've read that you should not bootstrap a seed. Therefore I came up with this procedure, but it seems pretty complicated. any better ideas? 1. update the seed list on all nodes, taking out the dead node and restart the nodes in the cluster so the new seed list is updated 2. then bootstrap the new (replacement ) node as a normal node (not yet as a seed) 3. when bootstrapping is done, make the new node a seed. 4. update the seed list again adding back the replacement seed (and rolling restart the cluster as in step 1) That seems to me like a whole lot of work. Surely there is a better way? Jon
OldNetworkTopologyStrategy with one data center
Hi - I have a question. Obviously there is no purpose in running OldNetworkTopologyStrategy in one data center. However, we want to share the same configuration in our production (multiple data centers) and pre-production (one data center) environments. My question is will org.apache.cassandra.locator.OldNetworkTopologyStrategy function with one data center and RackInferringSnitch? Jon
where to find the stress testing programs?
According to the Cassandra Wiki and the O'Reilly book, there is supposedly a contrib directory within the cassandra download containing the Python stress test script stress.py. It's not in the binary tarball of 0.7.3. Anyone know where to find it? Anyone know of other, maybe better stress testing scripts? Jon
Re: Virtual IP / hardware load balancing for cassandra nodes
Thanks guys. On Dec 20, 2010, at 5:44 PM, Dave Viner wrote: You can put a Cassandra cluster behind a load balancer. One thing to be cautious of is the health check. Just because the node is listening on port 9160 doesn't mean that it's healthy to serve requests. It is required, but not sufficient. The real test is the JMX values. Dave Viner On Mon, Dec 20, 2010 at 6:25 AM, Jonathan Colby jonathan.co...@gmail.com wrote: I was unable to find an example or documentation on my question. I'd like to know the best way to group a cluster of cassandra nodes behind a virtual ip. For example, can cassandra nodes be placed behind a Citrix Netscaler hardware load balancer? I can't imagine it being a problem, but in doing so would you break any cassandra functionality? The goal is to have the application talk to a single virtual ip and be directed to a random node in the cluster. I heard a little about adding the node addresses to Hector's load-balancing mechanism, but this doesn't seem too robust or easy to maintain. Thanks in advance.
Quorum and Datacenter loss
Hi cassandra experts - We're planning a cassandra cluster across 2 datacenters (datacenter-aware, random partitioning) with QUORUM consistency. It seems to me that with 2 datacenters, if one datacenter is lost, the reads/writes to cassandra will fail in the surviving datacenter because of the N/2 + 1 distribution of replicas. In other words, you need more than half of the replicas to respond but in the case of a datacenter loss you would only ever get 1/2 to respond at best. Is my logic wrong here? Is there a way to ensure the nodes in the alive datacenter respond successfully if the second datacenter is lost? Anyone have experience with this kind of problem? Thanks.
Re: Quorum and Datacenter loss
Thanks a lot Peter. So basically we would need to choose a consistency other than QUORUM.I think in our case consistency is not necessarily an issue since our data is write-once, read-many (immutable data). I suppose having a replication factor of 4 would result in two nodes in each datacenter having a copy of the data. If there's a flaw in my logic, please let me know : ] On Sun, Dec 12, 2010 at 2:04 PM, Peter Schuller peter.schul...@infidyne.com wrote: Is my logic wrong here? Is there a way to ensure the nodes in the alive datacenter respond successfully if the second datacenter is lost? Anyone have experience with this kind of problem? It's impossible to achieve the consistency and availability at the same time. See: (Assuming partition tolerance) Anyways, to expand a bit: The final consequence is that if you have a cluster that really does need QUORUM consistency, you won't be able to survive (in terms of availability, i.e., the cluster serving your traffic) data centers going down. If you want to continue operating in the case of a partition, you (1) cannot use QUORUM and (2) your application must be designed to work with and survive seeing inconsistent data. -- / Peter Schuller
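One caveat on the RF=4 idea: QUORUM needs floor(RF/2) + 1 responses, so with replicas split evenly across two datacenters, losing a whole datacenter leaves exactly RF/2 of them, which is always one short. The quick arithmetic below (even RF only, even split assumed) shows it; for immutable write-once data a lower read consistency level is the usual way out.

    public class QuorumMath {
        public static void main(String[] args) {
            // even RF, replicas split evenly across two datacenters
            for (int rf : new int[] { 2, 4, 6 }) {
                int quorum = rf / 2 + 1;    // responses QUORUM requires
                int surviving = rf / 2;     // replicas left after one datacenter is lost
                System.out.println("RF=" + rf + ": quorum needs " + quorum
                        + ", only " + surviving + " replicas survive -> quorum "
                        + (surviving >= quorum ? "still possible" : "fails"));
            }
        }
    }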
understanding the cassandra storage scaling
I have a very basic question which I have been unable to find in online documentation on cassandra. It seems like every node in a cassandra cluster contains all the data ever stored in the cluster (i.e., all nodes are identical). I don't understand how you can scale this on commodity servers with merely internal hard disks. In other words, if I want to store 5 TB of data, does that mean each node needs a hard disk capacity of 5 TB?? With HBase, memcached and other nosql solutions it is more clear how data is split up in the cluster and replicated for fault tolerance. Again, please excuse the rather basic question.
Re: understanding the cassandra storage scaling
Thanks Ran. This helps a little but unfortunately it's still a bit fuzzy for me. So is it not true that each node contains all the data in the cluster? I haven't come across any information on how clustered data is coordinated in cassandra. How does my query get directed to the right node? On Thu, Dec 9, 2010 at 11:35 AM, Ran Tavory ran...@gmail.com wrote: there are two numbers to look at, N the number of hosts in the ring (cluster) and R the number of replicas for each data item. R is configurable per column family. Typically for large clusters N > R. For very small clusters it makes sense for R to be close to N in which case cassandra is useful so the database doesn't have a single point of failure but not so much b/c of the size of the data. But for large clusters it rarely makes sense to have N=R, usually N > R. On Thu, Dec 9, 2010 at 12:28 PM, Jonathan Colby jonathan.co...@gmail.com wrote: I have a very basic question which I have been unable to find in online documentation on cassandra. It seems like every node in a cassandra cluster contains all the data ever stored in the cluster (i.e., all nodes are identical). I don't understand how you can scale this on commodity servers with merely internal hard disks. In other words, if I want to store 5 TB of data, does that mean each node needs a hard disk capacity of 5 TB?? With HBase, memcached and other nosql solutions it is more clear how data is split up in the cluster and replicated for fault tolerance. Again, please excuse the rather basic question. -- /Ran
Re: understanding the cassandra storage scaling
awesome! Thank you guys for the really quick answers and the links to the presentations. On Thu, Dec 9, 2010 at 12:06 PM, Sylvain Lebresne sylv...@yakaz.com wrote: This helps a little but unfortunately it's still a bit fuzzy for me. So is it not true that each node contains all the data in the cluster? Not at all. Basically each node is responsible for only a part of the data (a range really). But for each piece of data you can choose how many nodes it is on; this is the Replication Factor. For instance, if you choose to have RF=1, then each piece of data will be on exactly one node (this is usually a bad idea since it offers very weak durability guarantees but nevertheless, it can be done). If you choose RF=3, each piece of data is on 3 nodes (independently of the number of nodes your cluster has). You can have all data on all nodes, but for that you'll have to choose RF=#{nodes in the cluster}. But this is a very degenerate case. How does my query get directed to the right node? Each node in the cluster knows the ranges of data each other node holds. I suggest you watch the first video linked in this page http://wiki.apache.org/cassandra/ArticlesAndPresentations It explains this and more. -- Sylvain