Re: File Store
On Wed, Feb 20, 2013 at 6:47 PM, Kanwar Sangha kan...@mavenir.com wrote: Hi – I am looking for some inputs on file storage in Cassandra. Each file size can range from 200 KB – 3 MB. I don't see any limitation on the column size. But would it be a good idea to store these files as binary in the columns? We do the same, keeping a lot of small files (up to 15 MB). The limitation came from the Thrift side - its bindings require loading the whole file into memory, but that is affordable in our case. -- Sergey
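Sergey's point about the Thrift bindings loading the whole file into memory is why larger blobs are often split across several columns instead of stored as one value. A minimal sketch of that chunking idea (the 256 KB chunk size and the column-per-chunk layout are illustrative assumptions, not a Cassandra API):

```python
def chunk_file(data: bytes, chunk_size: int = 256 * 1024):
    """Split a file's bytes into fixed-size chunks, yielding
    (chunk_index, chunk_bytes) pairs. Each pair would be written as one
    column (name = index, value = bytes) under the file's row key, so a
    client only ever holds one chunk in memory per write."""
    for i in range(0, len(data), chunk_size):
        yield i // chunk_size, data[i:i + chunk_size]

def reassemble(chunks):
    """Concatenate chunks back into the original bytes, ordered by index."""
    return b"".join(part for _, part in sorted(chunks))
```

Reading the file back is then a column slice over the row followed by reassemble(); a 3 MB file becomes 12 columns of 256 KB each.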
Adding new nodes in a cluster with virtual nodes
Hello, We are using Cassandra 1.2.0. We have a cluster of 16 physical nodes, each node has 256 virtual nodes. We want to add 2 new nodes to our cluster: we follow the procedure as explained here: http://www.datastax.com/docs/1.2/operations/add_replace_nodes. After starting one of the new nodes, we can see that this new node has 256 tokens == looks good. We can see that this node is in the ring (using nodetool status) == looks good. After the bootstrap finishes on the new node, no data has been moved automatically from the old nodes to this new node. However, when we send insert queries to our cluster, the new node accepts the new rows. Please, could you tell me if we need to perform a nodetool repair after the bootstrap of the new node? What happens if we perform a nodetool cleanup on the old nodes before doing the nodetool repair? (Is there a risk of losing some data?) Regards. Jean Armel
key cache size
Hi - What is the approximate overhead of the key cache ? Say each key is 50 bytes. What would be the overhead for this key in the key cache ? Thanks, Kanwar
Re: Testing compaction strategies on a single production server?
Thanks Aaron, I hear you on the uncharted territory bit, we're definitely not gonna risk our live data unless we know it's safe to do what we suggested. :-) Oh well, I guess we'll have to set up a survey node instead. /Henrik On Thu, Feb 21, 2013 at 4:54 AM, aaron morton aa...@thelastpickle.com wrote: I *think* it will work. The step in the blog post of changing the compaction strategy before RING_DELAY expires is there to ensure no sstables are created before the strategy is changed. But I think you will be venturing into uncharted territory where there might be dragons. And not the fun Disney kind. While it may be more work, I personally would use one node in write survey mode to test LCS. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 20/02/2013, at 6:28 AM, Henrik Schröder skro...@gmail.com wrote: Well, that answer didn't really help. I know how to make a survey node, and I know how to simulate reads to it, it's just that that's a lot of work, and I wouldn't be sure that the simulated load is the same as the production load. We gather a lot of metrics from our production servers, so we know exactly how they perform over long periods of time. Changing a single server to run a different compaction strategy would allow us to know in detail how a different strategy would impact the cluster. So, is it possible to modify org.apache.cassandra.db.[keyspace].[column family].CompactionStrategyClass through jmx on a production server without any ill effects? Or is this only possible to do on a survey node while it is in a specific state? /Henrik On Tue, Feb 19, 2013 at 3:09 PM, Viktor Jevdokimov viktor.jevdoki...@adform.com wrote: Just turn off dynamic snitch on the survey node and make read requests from it directly with CL.ONE, watch histograms, compare.
Regarding switching compaction strategy, there's a lot of info already. Best regards / Pagarbiai, Viktor Jevdokimov, Senior Developer, viktor.jevdoki...@adform.com From: Henrik Schröder [mailto:skro...@gmail.com] Sent: Tuesday, February 19, 2013 15:57 To: user Subject: Testing compaction strategies on a single production server? Hey, Version 1.1 of Cassandra introduced live traffic sampling, which allows you to measure the performance of a node without it really joining the cluster: http://www.datastax.com/dev/blog/whats-new-in-cassandra-1-1-live-traffic-sampling That page mentions that you can change the compaction strategy through jmx if you want to test out a different strategy on your survey node. That's great, but it doesn't give you a complete view of how your performance would change, since you're not doing reads from the survey node. But what would happen if you used jmx to change the compaction strategy of a column family on a single *production* node?
Would that be a safe way to test it out or are there side-effects of doing that live? And if you do that, would running a major compaction transform the entire column family to the new format? Finally, if the test was a success, how do you proceed from there? Just change the schema? /Henrik
RE: Read IO
OK. Cassandra's default block size is 256k? Now say my data in the column is 4 MB, and the disk is giving me 4k block size random reads @ 100 IOPS. I can read a max of 400k in one second? Does that mean I would need multiple seeks to get the complete data? -Original Message- From: sc...@scode.org [mailto:sc...@scode.org] On Behalf Of Peter Schuller Sent: 21 February 2013 00:05 To: user@cassandra.apache.org Subject: Re: Read IO Is this correct ? Yes, at least under optimal conditions and assuming a reasonably sized row. Things like read-ahead (at the kernel level) will play into it; and if your read (even if assumed to be small) straddles two pages you might or might not take another read depending on your kernel settings (typically trading pollution of page cache vs. number of I/Os). -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
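The arithmetic in the question can be checked directly. A sketch, assuming the pessimistic case where every 4 KB block is a separate random I/O (in practice kernel read-ahead coalesces a contiguous 4 MB value into mostly sequential reads after the first seek):

```python
def random_reads_needed(value_bytes, block_bytes):
    # Ceiling division: how many block-sized reads cover the value.
    return -(-value_bytes // block_bytes)

def worst_case_seconds(value_bytes, block_bytes, iops):
    # Pessimistic model: every block is an independent random read.
    return random_reads_needed(value_bytes, block_bytes) / iops

reads = random_reads_needed(4 * 1024 * 1024, 4 * 1024)        # 1024 block reads
seconds = worst_case_seconds(4 * 1024 * 1024, 4 * 1024, 100)  # ~10.24 s
```

So 400 KB/s is the *random*-read ceiling at those numbers, but a 4 MB column value is laid out contiguously in the sstable: after one seek the rest streams at the disk's sequential rate, not one seek per block.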
Re: Read IO
Hi, On Feb 21, 2013, at 7:52, Kanwar Sangha kan...@mavenir.com wrote: Hi – Can someone explain the worst case IOPS for a read? No key cache, no row cache, sampling rate say 512. 1) Bloom filter will be checked to see existence of key (in RAM) 2) Index file sample (in RAM) will be checked to find approx. location in index file on disk 3) 1 IOP to read the actual index file on disk (DISK) 4) 1 IOP to get the data from the location in the sstable (DISK) Is this correct? As you were asking for the worst case, I would still add one step that would be a seek inside an SSTable from the row start to the queried columns using the column index. However, this applies only if you are querying a subset of columns in the row (not all) and the total row size exceeds column_index_size_in_kb (defaults to 64kB). So, as far as I have understood, the worst case steps (without any caches) are:
1. Check the SSTable bloom filters (in memory)
2. Use index samples to find the approximately correct place in the key index file (in memory)
3. Read the key index file until the correct key is found (1st disk seek/read)
4. Seek to the start of the row in the SSTable file and read the row headers, possibly including the column index (2nd seek/read)
5. Using the column index, seek to the correct place inside the SSTable file to actually read the columns (3rd seek/read)
If the row is very wide and you are asking for a random bunch of columns from here and there, step 5 might even be needed multiple times. Also, if your row has spread over many SSTables, each of them needs to be accessed (at least steps 1-4) to get the complete results for the query. All this in mind, if your node has any reasonable amount of reads, I'd say that in practice key index files will be page cached by the OS very quickly and thus a normal read would end up being either one seek (for small rows without the column index) or two (for wider rows). Of course, as Peter already pointed out, the more columns you ask for, the more the disk needs to read.
For a continuous set of columns the read should be linear, however. -Jouni
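Jouni's worst case can be boiled down to a seek count. A simplified model of my own (an approximation of the steps above, not Cassandra code): bloom filters and index samples cost nothing, each sstable holding the row costs a key-index read plus a row-header read, and a wide row adds a column-index jump per non-contiguous column range.

```python
def worst_case_seeks(sstables_with_row, wide_row=False, column_ranges=1):
    """Rough disk-seek count for one uncached read: each SSTable holding
    the row costs one key-index read and one row-header read; a wide row
    adds one seek per non-contiguous column range via the column index."""
    per_sstable = 2  # key index file + row start inside the sstable
    if wide_row:
        per_sstable += column_ranges  # column-index jumps
    return sstables_with_row * per_sstable
```

A small row in one sstable costs 2 seeks; a wide row spread over 3 sstables, one column range each, costs 9 - which is why OS page caching of the index files matters so much in practice.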
Re: Cassandra network latency tuning
I would like to understand how we can capture network latencies between a 1GbE and 10GbE for ex. Cassandra reports two latencies. The CF latencies reported by nodetool cfstats, nodetool cfhistograms and the CF MBeans cover the local time it takes to read or write the data. This does not include any local wait times, network latency or coordinator overhead. The Storage Proxy latency from nodetool proxyhistograms and the StorageProxy MBean is the total latency for a request on a coordinator. Under load, with a consistent workload, the CF latency should not vary too much, while the request latency can increase as wait time becomes more of a factor. Additionally, streaming is throttled, which you may want to increase; see the yaml file. We will soon be adding SSD's and was wondering how Cassandra can utilize the 10GbE and the SSD's and if there are specific tuning that is required. You may want to increase both the concurrent_writes and concurrent_reads in the yaml file to take advantage of the extra IO. Same for the compaction settings; the comments in the yaml file will help. With SSD and 10GbE you can easily hold more data on each node. Typically we advise 300GB to 500GB per node with HDD and 1GbE, because of the time repair and node replacement take. With SSD and 10GbE it will take less time. If you feel like being thorough, add repair and node replacement (all under load) to your test lineup. Hope that helps. - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 20/02/2013, at 1:44 PM, Brandon Walsh brandon_9021...@yahoo.com wrote: I have a 5 node cluster and currently running ver 1.2. Prior to full scale deployment, I'm running some benchmarks using YCSB. From a hadoop cluster deployment we saw an excellent improvement using higher speed networks.
However Cassandra does not include network latencies and I would like to understand how we can capture network latencies between a 1GbE and 10GbE for ex. As of now all the graphs look the same. We will soon be adding SSD's and was wondering how Cassandra can utilize the 10GbE and the SSD's and if there are specific tuning that is required.
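Aaron's tuning suggestions map to a handful of cassandra.yaml settings. An illustrative fragment for an SSD-backed, 10GbE node - the raised values are guesses to be benchmarked, not recommendations (1.2-era defaults, from memory, are 32 for both concurrent settings and 16 MB/s for compaction throughput):

```
# cassandra.yaml - illustrative overrides for SSD + 10GbE (benchmark first)
concurrent_reads: 64                              # default 32
concurrent_writes: 64                             # default 32
compaction_throughput_mb_per_sec: 64              # default 16
stream_throughput_outbound_megabits_per_sec: 400  # raise the streaming throttle
```

Raise these stepwise while watching pending reads/writes in nodetool tpstats, rather than jumping straight to large values.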
Re: How to limit query results like from row 50 to 100
CQL does not support offset but does have limit. See http://www.datastax.com/docs/1.2/cql_cli/cql/SELECT#specifying-rows-returned-using-limit Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 20/02/2013, at 1:47 PM, Mateus Ferreira e Freitas mateus.ffrei...@hotmail.com wrote: With CQL or an API.
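Since CQL has LIMIT but no OFFSET, the usual workaround is to page by remembering the last key seen and continuing from it. A sketch (the table and column names are hypothetical):

```
-- Page 1: first 50 rows.
SELECT * FROM users LIMIT 50;
-- Page 2: rows 51-100, continuing after the last partition key seen.
-- token() is needed because partition keys are ordered by token, not by value.
SELECT * FROM users WHERE token(id) > token('last_id_from_page_1') LIMIT 50;
```

For columns within a single wide row, the equivalent is a range condition on the clustering column (WHERE col > last_col LIMIT 50) instead of token().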
Re: Heap is N.N full. Immediately on startup
My first guess would be the bloom filter and index sampling from lots-o-rows. Check the row count in cfstats. Check the bloom filter size in cfstats. Background on memory requirements http://www.mail-archive.com/user@cassandra.apache.org/msg25762.html Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 20/02/2013, at 11:27 PM, Andras Szerdahelyi andras.szerdahe...@ignitionone.com wrote: Hey list, Any ideas (before I take a heap dump) what might be consuming my 8GB JVM heap at startup in Cassandra 1.1.6 besides
* row cache: not persisted and is at 0 keys when this warning is produced
* Memtables: no write traffic at startup, my app's column families are durable_writes:false
* Pending tasks: no pending tasks, except for 928 compactions (not sure where those are coming from)
I drew these conclusions from the StatusLogger output below:
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,198 GCInspector.java (line 122) GC for ConcurrentMarkSweep: 14959 ms for 2 collections, 7017934560 used; max is 8375238656
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,198 StatusLogger.java (line 57) Pool Name  Active  Pending  Blocked
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,199 StatusLogger.java (line 72) ReadStage 0 0 0
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,200 StatusLogger.java (line 72) RequestResponseStage 0 0 0
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,200 StatusLogger.java (line 72) ReadRepairStage 0 0 0
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,200 StatusLogger.java (line 72) MutationStage 0-1 0
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,201 StatusLogger.java (line 72) ReplicateOnWriteStage 0 0 0
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,201 StatusLogger.java (line 72) GossipStage 0 0 0
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,201 StatusLogger.java (line 72) AntiEntropyStage 0 0 0
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,201 StatusLogger.java (line 72) MigrationStage 0 0 0
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,201 StatusLogger.java (line 72) StreamStage 0 0 0
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,202 StatusLogger.java (line 72) MemtablePostFlusher 0 0 0
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,202 StatusLogger.java (line 72) FlushWriter 0 0 0
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,202 StatusLogger.java (line 72) MiscStage 0 0 0
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,202 StatusLogger.java (line 72) commitlog_archiver 0 0 0
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,203 StatusLogger.java (line 72) InternalResponseStage 0 0 0
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,212 StatusLogger.java (line 77) CompactionManager 0 928
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,212 StatusLogger.java (line 89) MessagingService n/a 0,0
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,212 StatusLogger.java (line 99) Cache Type  Size  Capacity  KeysToSave  Provider
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,212 StatusLogger.java (line 100) KeyCache 25 25 all
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,213 StatusLogger.java (line 106) RowCache 0 0 all org.apache.cassandra.cache.SerializingCacheProvider
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,213 StatusLogger.java (line 113) ColumnFamily  Memtable ops,data
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,213 StatusLogger.java (line 116) MYAPP_1.CF 0,0
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,213 StatusLogger.java (line 116) MYAPP_2.CF 0,0
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,213 StatusLogger.java (line 116) HiveMetaStore.MetaStore 0,0
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,213 StatusLogger.java (line 116) system.NodeIdInfo 0,0
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,213 StatusLogger.java (line 116) system.IndexInfo 0,0
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,213 StatusLogger.java (line 116) system.LocationInfo 0,0 INFO
Re: SSTable Num
Hi – I have around 6TB of data on 1 node Unless you have SSD and 10GbE you probably have too much data on there. Remember you need to run repair and that can take a long time with a lot of data. Also you may need to replace a node one day and moving 6TB will take a while. Or will the sstable compaction continue and eventually we will have 1 file ? No. The default size tiered strategy compacts files that are roughly the same size, and only when there are more than 4 (default) of them. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 21/02/2013, at 3:47 AM, Kanwar Sangha kan...@mavenir.com wrote: Hi – I have around 6TB of data on 1 node and the cfstats show 32 sstables. There is no compaction job running in the background. Is there a limit on the size per sstable? Or will the sstable compaction continue and eventually we will have 1 file? Thanks, Kanwar
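Aaron's description of size-tiered compaction can be sketched as a toy model: sstables are grouped into buckets of roughly equal size, and a bucket is only compacted once it holds the minimum threshold (default 4) of files. This is a simplified re-implementation for intuition, not the actual SizeTieredCompactionStrategy code (the 0.5x/1.5x bucket bounds are an assumption):

```python
def similar_size_buckets(sizes, low=0.5, high=1.5):
    """Group sstable sizes into buckets of 'roughly the same size': a
    file joins the first bucket whose average size it is within
    [low, high] x of, otherwise it starts a new bucket."""
    buckets = []
    for size in sorted(sizes):
        for bucket in buckets:
            avg = sum(bucket) / len(bucket)
            if low * avg <= size <= high * avg:
                bucket.append(size)
                break
        else:
            buckets.append([size])
    return buckets

def compactable(sizes, min_threshold=4):
    """Buckets eligible for compaction: at least min_threshold files."""
    return [b for b in similar_size_buckets(sizes) if len(b) >= min_threshold]
```

Under this rule, sitting at 32 sstables with no compaction running is normal: large, already-compacted files end up in buckets with fewer than 4 members and are never merged again, so the data will not converge to a single file.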
Re: how to debug slowdowns from these log snippets-more info 2
Some things to consider: Check for contention around the switch lock. This can happen if you get a lot of tables flushing at the same time, or if you have a lot of secondary indexes. It shows up as a pattern in the logs: as soon as the writer starts flushing a memtable another will be queued. Probably not happening here but can be a pain when a lot of memtables are flushed. I would turn on GC logging in cassandra-env.sh and watch that. After a full CMS flush, how full/empty is the tenured heap? If it still has a lot in it then you are running with too much cache / bloom filter / index sampling. You can also experiment with the Max Tenuring Threshold; try turning it up to 4 to start with. The GC logs will show you how much data is at each tenuring level. You can then see how much data is being tenured, and whether premature tenuring was an issue. I've seen premature tenuring cause issues with wide rows / long reads. Hope that helps. - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 21/02/2013, at 4:35 AM, Hiller, Dean dean.hil...@nrel.gov wrote: Oh, and my startup command that cassandra logged was a2.bigde.nrel.gov: xss = -ea -javaagent:/opt/cassandra/lib/jamm-0.2.5.jar -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Xms8021M -Xmx8021M -Xmn1600M -XX:+HeapDumpOnOutOfMemoryError -Xss128k And I remember from docs you don't want to go above 8G or java GC doesn't work out so well. I am not sure why this is not working out though. Dean On 2/20/13 7:16 AM, Hiller, Dean dean.hil...@nrel.gov wrote: Here is the printout before that log, which is probably important as well…
INFO [ScheduledTasks:1] 2013-02-20 07:14:00,375 GCInspector.java (line 122) GC for ConcurrentMarkSweep: 3618 ms for 2 collections, 7038159096 used; max is 8243904512
INFO [ScheduledTasks:1] 2013-02-20 07:14:00,375 StatusLogger.java (line 57) Pool Name  Active  Pending  Blocked
INFO [ScheduledTasks:1] 2013-02-20 07:14:00,375 StatusLogger.java (line 72) ReadStage 11 264 0
INFO [ScheduledTasks:1] 2013-02-20 07:14:00,376 StatusLogger.java (line 72) RequestResponseStage 0 0 0
INFO [ScheduledTasks:1] 2013-02-20 07:14:00,376 StatusLogger.java (line 72) ReadRepairStage 0 0 0
INFO [ScheduledTasks:1] 2013-02-20 07:14:00,376 StatusLogger.java (line 72) MutationStage1288 0
INFO [ScheduledTasks:1] 2013-02-20 07:14:00,376 StatusLogger.java (line 72) ReplicateOnWriteStage 0 0 0
INFO [ScheduledTasks:1] 2013-02-20 07:14:00,376 StatusLogger.java (line 72) GossipStage 1 7 0
INFO [ScheduledTasks:1] 2013-02-20 07:14:00,376 StatusLogger.java (line 72) AntiEntropyStage 0 0 0
INFO [ScheduledTasks:1] 2013-02-20 07:14:00,377 StatusLogger.java (line 72) MigrationStage 0 0 0
INFO [ScheduledTasks:1] 2013-02-20 07:14:00,377 StatusLogger.java (line 72) StreamStage 0 0 0
INFO [ScheduledTasks:1] 2013-02-20 07:14:00,377 StatusLogger.java (line 72) MemtablePostFlusher 0 0 0
INFO [ScheduledTasks:1] 2013-02-20 07:14:00,377 StatusLogger.java (line 72) FlushWriter 0 0 0
INFO [ScheduledTasks:1] 2013-02-20 07:14:00,377 StatusLogger.java (line 72) MiscStage 0 0 0
INFO [ScheduledTasks:1] 2013-02-20 07:14:00,377 StatusLogger.java (line 72) commitlog_archiver 0 0 0
INFO [ScheduledTasks:1] 2013-02-20 07:14:00,378 StatusLogger.java (line 72) InternalResponseStage 0 0 0
INFO [ScheduledTasks:1] 2013-02-20 07:14:00,378 StatusLogger.java (line 72) HintedHandoff 0 0 0
INFO [ScheduledTasks:1] 2013-02-20 07:14:00,378 StatusLogger.java (line 77) CompactionManager 4 5
INFO [ScheduledTasks:1] 2013-02-20 07:14:00,378 StatusLogger.java (line 89) MessagingService n/a 10,127
INFO [ScheduledTasks:1] 2013-02-20 07:14:00,378 StatusLogger.java (line 99) Cache Type  Size  Capacity  KeysToSave  Provider
INFO [ScheduledTasks:1] 2013-02-20 07:14:00,378 StatusLogger.java (line 100) KeyCache 1310719 1310719 all
INFO [ScheduledTasks:1] 2013-02-20 07:14:00,378 StatusLogger.java (line 106) RowCache 0 0 all org.apache.cassandra.cache.SerializingCacheProvider
INFO [ScheduledTasks:1] 2013-02-20 07:14:00,378 StatusLogger.java (line 113) ColumnFamily  Memtable ops,data
INFO [ScheduledTasks:1] 2013-02-20 07:14:00,379 StatusLogger.java
Re: Mutation dropped
What does rpc_timeout control? Only the reads/writes? Yes. like data stream, streaming_socket_timeout_in_ms in the yaml merkle tree request? Either no time out or a number of days, cannot remember which right now. What is the side effect if it's set to a really small number, say 20ms? You will probably get a lot more requests that fail with a TimedOutException. rpc_timeout needs to be longer than the time it takes a node to process the message, plus the time it takes the coordinator to do its thing. You can look at cfhistograms and proxyhistograms to get a better idea of how long a request takes in your system. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 21/02/2013, at 6:56 AM, Wei Zhu wz1...@yahoo.com wrote: What does rpc_timeout control? Only the reads/writes? How about other inter-node communication, like data stream, merkle tree request? What is a reasonable value for rpc_timeout? The default value of 10 seconds is way too long. What is the side effect if it's set to a really small number, say 20ms? Thanks. -Wei From: aaron morton aa...@thelastpickle.com To: user@cassandra.apache.org Sent: Tuesday, February 19, 2013 7:32 PM Subject: Re: Mutation dropped Does the rpc_timeout not control the client timeout ? No it is how long a node will wait for a response from other nodes before raising a TimedOutException if less than CL nodes have responded. Set the client side socket timeout using your preferred client. Is there any param which is configurable to control the replication timeout between nodes ? There is no such thing. rpc_timeout is roughly like that, but it's not right to think about it that way. i.e. if a message to a replica times out and CL nodes have already responded then we are happy to call the request complete.
Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 19/02/2013, at 1:48 AM, Kanwar Sangha kan...@mavenir.com wrote: Thanks Aaron. Does the rpc_timeout not control the client timeout ? Is there any param which is configurable to control the replication timeout between nodes ? Or is the same param used to control that, since the other node is also like a client ? From: aaron morton [mailto:aa...@thelastpickle.com] Sent: 17 February 2013 11:26 To: user@cassandra.apache.org Subject: Re: Mutation dropped You are hitting the maximum throughput on the cluster. The messages are dropped because the node fails to start processing them before rpc_timeout. However the request is still a success because the client requested CL was achieved. Testing with RF 2 and CL 1 really just tests the disks on one local machine. Both nodes replicate each row, and writes are sent to each replica, so the only thing the client is waiting on is the local node to write to its commit log. Testing with (and running in prod) RF 3 and CL QUORUM is a more real-world scenario. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 15/02/2013, at 9:42 AM, Kanwar Sangha kan...@mavenir.com wrote: Hi – Is there a parameter which can be tuned to prevent the mutations from being dropped ? Is this logic correct ? Node A and B with RF=2, CL=1. Load balanced between the two. -- Address Load Tokens Owns (effective) Host ID Rack UN 10.x.x.x 746.78 GB 256 100.0% dbc9e539-f735-4b0b-8067-b97a85522a1a rack1 UN 10.x.x.x 880.77 GB 256 100.0% 95d59054-be99-455f-90d1-f43981d3d778 rack1 Once we hit a very high TPS (around 50k/sec of inserts), the nodes start falling behind and we see the mutation dropped messages. But there are no failures on the client. Does that mean the other node is not able to persist the replicated data ? Is there some timeout associated with replicated data persistence ?
Thanks, Kanwar From: Kanwar Sangha [mailto:kan...@mavenir.com] Sent: 14 February 2013 09:08 To: user@cassandra.apache.org Subject: Mutation dropped Hi – I am doing a load test using YCSB across 2 nodes in a cluster and seeing a lot of mutation dropped messages. I understand that this is due to the replica not being written to the other node ? RF = 2, CL =1. From the wiki - For MUTATION messages this means that the mutation was not applied to all replicas it was sent to. The inconsistency will be repaired by Read Repair or Anti Entropy Repair Thanks, Kanwar
Re: very confused by jmap dump of cassandra
Cannot comment too much on the jmap but I can add my general compaction-is-hurting strategy. Try any or all of the following to get to a stable setup, then increase until things go bang. Set concurrent compactors to 2. Reduce compaction throughput by half. Reduce in_memory_compaction_limit. If you see compactions using a lot of sstables in the logs, reduce max_compaction_threshold. I can easily go higher than 8G on these systems as I have 32gig each node, but there were docs that said 8G is better for GC. More JVM memory is not the answer. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 21/02/2013, at 7:49 AM, Hiller, Dean dean.hil...@nrel.gov wrote: I took this jmap dump of cassandra (in production). Before I restarted the whole production cluster, I had some nodes running compaction and it looked like all memory had been consumed (kind of like cassandra is not clearing out the caches or memtables fast enough). I am still trying to debug why compaction causes slowness on the cluster, since all cassandra.yaml files are pretty much the defaults with size-tiered compaction. The weird thing is I dump and get a 5.4G heap.bin file and load that into Eclipse, which tells me the total is 142.8MB… So low? top was showing 1.9G at the time (and I took this top snapshot later, 2 hours after)… (how is the Eclipse profiler telling me the jmap dump showed 142.8MB in use instead of 1.9G in use?)
Tasks: 398 total, 1 running, 397 sleeping, 0 stopped, 0 zombie
Cpu(s): 2.8%us, 0.5%sy, 0.0%ni, 96.5%id, 0.1%wa, 0.0%hi, 0.1%si, 0.0%st
Mem: 32854680k total, 31910708k used, 943972k free, 89776k buffers
Swap: 33554424k total, 18288k used, 33536136k free, 23428596k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
20909 cassandr 20 0 64.1g 9.2g 2.1g S 75.7 29.4 182:37.92 java
22455 cassandr 20 0 15288 1340 824 R 3.9 0.0 0:00.02 top
It almost seems like cassandra is not being good about memory management here, as we slowly get into a situation where compaction is run which takes out our memory (configured for 8G). I can easily go higher than 8G on these systems as I have 32gig each node, but there were docs that said 8G is better for GC. Has anyone else taken a jmap dump of cassandra? Thanks, Dean
Re: very confused by jmap dump of cassandra
Roughly how much data do you have per node? Sent from my iPhone On Feb 20, 2013, at 10:49 AM, Hiller, Dean dean.hil...@nrel.gov wrote: I took this jmap dump of cassandra (in production). Before I restarted the whole production cluster, I had some nodes running compaction and it looked like all memory had been consumed (kind of like cassandra is not clearing out the caches or memtables fast enough). I am still trying to debug why compaction causes slowness on the cluster, since all cassandra.yaml files are pretty much the defaults with size-tiered compaction. The weird thing is I dump and get a 5.4G heap.bin file and load that into Eclipse, which tells me the total is 142.8MB… So low? top was showing 1.9G at the time (and I took this top snapshot later, 2 hours after)… (how is the Eclipse profiler telling me the jmap dump showed 142.8MB in use instead of 1.9G in use?)
Tasks: 398 total, 1 running, 397 sleeping, 0 stopped, 0 zombie
Cpu(s): 2.8%us, 0.5%sy, 0.0%ni, 96.5%id, 0.1%wa, 0.0%hi, 0.1%si, 0.0%st
Mem: 32854680k total, 31910708k used, 943972k free, 89776k buffers
Swap: 33554424k total, 18288k used, 33536136k free, 23428596k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
20909 cassandr 20 0 64.1g 9.2g 2.1g S 75.7 29.4 182:37.92 java
22455 cassandr 20 0 15288 1340 824 R 3.9 0.0 0:00.02 top
It almost seems like cassandra is not being good about memory management here, as we slowly get into a situation where compaction is run which takes out our memory (configured for 8G). I can easily go higher than 8G on these systems as I have 32gig each node, but there were docs that said 8G is better for GC. Has anyone else taken a jmap dump of cassandra? Thanks, Dean
RE: SSTable Num
No. The default size tiered strategy compacts files that are roughly the same size, and only when there are more than 4 (default) of them. Ok. So for 10 TB, I could have at least 4 SSTable files each of 2.5 TB ? From: aaron morton [mailto:aa...@thelastpickle.com] Sent: 21 February 2013 11:01 To: user@cassandra.apache.org Subject: Re: SSTable Num Hi - I have around 6TB of data on 1 node Unless you have SSD and 10GbE you probably have too much data on there. Remember you need to run repair and that can take a long time with a lot of data. Also you may need to replace a node one day and moving 6TB will take a while. Or will the sstable compaction continue and eventually we will have 1 file ? No. The default size tiered strategy compacts files that are roughly the same size, and only when there are more than 4 (default) of them. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 21/02/2013, at 3:47 AM, Kanwar Sangha kan...@mavenir.com wrote: Hi - I have around 6TB of data on 1 node and the cfstats show 32 sstables. There is no compaction job running in the background. Is there a limit on the size per sstable ? Or will the sstable compaction continue and eventually we will have 1 file ? Thanks, Kanwar
Re: cassandra vs. mongodb quick question(good additional info)
If you are lazy like me wolfram alpha can help http://www.wolframalpha.com/input/?i=transfer+42TB+at+10GbEa=UnitClash_*TB.*Tebibytes-- 10 hours 15 minutes 43.59 seconds Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 21/02/2013, at 11:31 AM, Wojciech Meler wojciech.me...@gmail.com wrote: you have 86400 seconds a day so 42T could take less than 12 hours on a 10Gb link 19 lut 2013 02:01, Hiller, Dean dean.hil...@nrel.gov wrote: I thought about this more, and even with a 10Gbit network, it would take 40 days to bring up a replacement node if mongodb did truly have 42T / node like I had heard. I wrote the below email to the person I heard this from going back to basics which really puts some perspective on it… (and a lot of people don't even have a 10Gbit network like we do) Nodes are hooked up by a 10G network at most right now where that is 10 gigabit. We are talking about 10 Terabytes on disk per node recently. Googling 10 gigabit in gigabytes gives me 1.25 gigabytes/second (yes I could have divided by 8 in my head but eh… course when I saw the number, I went duh). So trying to transfer 10 Terabytes or 10,000 Gigabytes to a node that we are bringing online to replace a dead node would take approximately 5 days??? This means no one else is using the bandwidth too ;). 10,000 Gigabytes * 1 second/1.25 * 1hr/60secs * 1 day / 24 hrs = 5.55 days. This is more likely 11 days if we only use 50% of the network. So bringing a new node up to speed is more like 11 days once it is crashed. I think this is the main reason the 1 Terabyte limit exists to begin with, right? From an ops perspective, this could sound like a nightmare scenario of waiting 10 days… maybe it is livable though. Either way, I thought it would be good to share the numbers. ALSO, that is assuming the bus with its 10 disks can keep up with 10G. Can it? What is the limit of throughput on a bus / second on the computers we have, as on wikipedia there is a huge variance? What is the rate of the disks too (multiplied by 10 of course)? Will they keep up with a 10G rate for bringing a new node online? This all comes into play even more so when you want to double the size of your cluster of course, as all nodes have to transfer half of what they have to all the new nodes that come online (cassandra actually has a very data center/rack aware topology to transfer data correctly to not use up all bandwidth unnecessarily… I am not sure mongodb has that). Anyways, just food for thought. From: aaron morton aa...@thelastpickle.com Reply-To: user@cassandra.apache.org Date: Monday, February 18, 2013 1:39 PM To: user@cassandra.apache.org, Vegard Berget p...@fantasista.no Subject: Re: cassandra vs. mongodb quick question My experience is repair of 300GB compressed data takes longer than 300GB of uncompressed, but I cannot point to an exact number. Calculating the differences is mostly CPU bound and works on the non compressed data. Streaming uses compression (after uncompressing the on disk data). So if you have 300GB of compressed data, take a look at how long repair takes and see if you are comfortable with that. You may also want to test replacing a node so you can get the procedure documented and understand how long it takes. The idea of the soft 300GB to 500GB limit came about because of a number of cases where people had 1 TB on a single node and they were surprised it took days to repair or replace. If you know how long things may take, and that fits in your operations, then go with it.
Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 18/02/2013, at 10:08 PM, Vegard Berget p...@fantasista.no wrote: Just out of curiosity: When using compression, does this affect this one way or another? Is 300G (compressed) SSTable size, or total size of data? .vegard, - Original Message - From: user@cassandra.apache.org To: user@cassandra.apache.org Cc: Sent: Mon, 18 Feb 2013 08:41:25 +1300 Subject: Re: cassandra vs. mongodb quick question If you have spinning disk and 1G networking and no virtual nodes, I would still say 300G to 500G is a soft limit. If you are using virtual nodes, SSD, JBOD disk configuration or faster networking you may go higher. The limiting factors are the time it takes to repair, the time it takes to replace a node, the memory considerations for 100's of millions of rows. If the performance of those operations is acceptable to you, then go crazy
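The back-and-forth above is easy to check numerically. A small sketch (assuming line-rate transfer with no protocol overhead, which real networks never achieve):

```python
def transfer_hours(tebibytes, gbit_per_s=10.0):
    """Hours to move `tebibytes` of data over a `gbit_per_s` link at
    full line rate, ignoring protocol overhead and disk throughput."""
    total_bytes = tebibytes * 2**40        # TiB -> bytes
    bytes_per_s = gbit_per_s * 1e9 / 8     # 10 Gb/s -> 1.25 GB/s
    return total_bytes / bytes_per_s / 3600

# 42 TiB over 10 GbE: ~10.3 hours, matching the Wolfram Alpha figure above
print(round(transfer_hours(42), 1))
# 10 TiB: well under a day, not 5.55 days -- the 5.55-day figure above
# comes from a minutes/seconds slip in the unit conversion (a factor of 60)
print(round(transfer_hours(10), 1))
```

Wojciech's 86,400-seconds-a-day point is the same correction: at 1.25 GB/s, a 10 GbE link moves roughly 108 TB per day at line rate, so repair and compaction time, not raw transfer, is the real bottleneck for large nodes.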
Re: key cache size
This is the key cache entry https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/cache/KeyCacheKey.java Note that the Descriptor is re-used. If you want to see key cache metrics, including bytes used, use nodetool info. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 22/02/2013, at 3:45 AM, Kanwar Sangha kan...@mavenir.com wrote: Hi – What is the approximate overhead of the key cache ? Say each key is 50 bytes. What would be the overhead for this key in the key cache ? Thanks, Kanwar
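For a rough feel of the numbers, a back-of-envelope sketch (the ~100-byte per-entry overhead is an assumption covering the KeyCacheKey object, the map entry, and the cached index position; verify the real figure with nodetool info):

```python
def key_cache_mib(num_keys, key_len=50, per_entry_overhead=100):
    """Rough key cache footprint in MiB: raw key bytes plus an *assumed*
    fixed per-entry JVM overhead (cache entry objects, cached position)."""
    return num_keys * (key_len + per_entry_overhead) / 2**20

print(round(key_cache_mib(1_000_000)))  # ~143 MiB for one million 50-byte keys
```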
Re: Heap is N.N full. Immediately on startup
Thank you - indeed my index interval is 64 with a CF of 300M rows + bloom filter false positive chance was default. Raising the index interval to 512 didn't fix this alone, so I guess I'll have to set the bloom filter to some reasonable value and scrub. From: aaron morton aa...@thelastpickle.com Reply-To: user@cassandra.apache.org Date: Thursday 21 February 2013 17:58 To: user@cassandra.apache.org Subject: Re: Heap is N.N full. Immediately on startup My first guess would be the bloom filter and index sampling from lots-o-rows. Check the row count in cfstats. Check the bloom filter size in cfstats. Background on memory requirements: http://www.mail-archive.com/user@cassandra.apache.org/msg25762.html Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 20/02/2013, at 11:27 PM, Andras Szerdahelyi andras.szerdahe...@ignitionone.com wrote: Hey list, Any ideas ( before I take a heap dump ) what might be consuming my 8GB JVM heap at startup in Cassandra 1.1.6 besides * row cache : not persisted and is at 0 keys when this warning is produced * Memtables : no write traffic at startup, my app's column families are durable_writes:false * Pending tasks : no pending tasks, except for 928 compactions ( not sure where those are coming from ) I drew these conclusions from the StatusLogger output below:
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,198 GCInspector.java (line 122) GC for ConcurrentMarkSweep: 14959 ms for 2 collections, 7017934560 used; max is 8375238656
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,198 StatusLogger.java (line 57) Pool Name Active Pending Blocked
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,199 StatusLogger.java (line 72) ReadStage 0 0 0
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,200 StatusLogger.java (line 72) RequestResponseStage 0 0 0
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,200 StatusLogger.java (line 72) ReadRepairStage 0 0 0
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,200 StatusLogger.java (line 72) MutationStage 0 -1 0
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,201 StatusLogger.java (line 72) ReplicateOnWriteStage 0 0 0
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,201 StatusLogger.java (line 72) GossipStage 0 0 0
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,201 StatusLogger.java (line 72) AntiEntropyStage 0 0 0
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,201 StatusLogger.java (line 72) MigrationStage 0 0 0
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,201 StatusLogger.java (line 72) StreamStage 0 0 0
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,202 StatusLogger.java (line 72) MemtablePostFlusher 0 0 0
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,202 StatusLogger.java (line 72) FlushWriter 0 0 0
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,202 StatusLogger.java (line 72) MiscStage 0 0 0
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,202 StatusLogger.java (line 72) commitlog_archiver 0 0 0
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,203 StatusLogger.java (line 72) InternalResponseStage 0 0 0
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,212 StatusLogger.java (line 77) CompactionManager 0 928
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,212 StatusLogger.java (line 89) MessagingService n/a 0,0
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,212 StatusLogger.java (line 99) Cache Type Size Capacity KeysToSave Provider
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,212 StatusLogger.java (line 100) KeyCache 25 25 all
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,213 StatusLogger.java (line 106) RowCache 0 0 all org.apache.cassandra.cache.SerializingCacheProvider
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,213 StatusLogger.java (line 113) ColumnFamily Memtable ops,data
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,213 StatusLogger.java (line 116) MYAPP_1.CF 0,0
INFO [ScheduledTasks:1] 2013-02-20
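Aaron's index-sampling guess fits the numbers in this thread. A sketch of why raising index_interval helps (per-sample heap cost varies with key size, so only the sample counts below are solid; any byte estimate on top of them is an assumption):

```python
def index_samples(rows, index_interval):
    """Cassandra 1.1 keeps one index sample on heap per index_interval keys,
    so the sample count scales inversely with index_interval."""
    return rows // index_interval

# Andras's CF: 300M rows at interval 64 vs the raised interval of 512
for interval in (64, 512):
    print(interval, index_samples(300_000_000, interval))
```

Going from interval 64 to 512 cuts the on-heap sample count eightfold, which is why it is the first knob to try before touching bloom filter settings.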
Re: Using Cassandra for read operations
To avoid disk I/Os, the best option we thought is to have data in memory. Is it a good idea to have memtable setup around 1/2 or 3/4 of heap size? Obviously flushing will take a lot of time but would that hurt that node's performance big time? Start with the defaults and test your workload. If memtables start flushing aggressively (because of write load or bad settings), that can cause compaction work on the disk, that might impair read I/O. Is there a way to figure out max read-latency for a bunch of read operations? Use nodetool's histogram feature to get a sense of outlier latency. We just need one column family with a long key Take time to tune your key caches and bloom filters. They use memory and have an impact on read performance. Given that cassandra provides off-heap row caching, in a machine 32 gb RAM, would it be wise to have a 10 gb row cache with 8 gb java heap? If you use the off heap cache, allow enough room for the filesystems' own cache, i.e. don't give over all of ram to the off heap cache. Also the off heap cache can slow you down with wide rows due to serialisation overhead, or cache invalidation thrashing if you are update heavy. if you use the on-heap cache, pay close attention to GC cycles and memory stability - if you are cycling/evicting through the cache at a high rate that can leave too much garbage in memory such that the garbage collector can't keep up. If the node doesn't have enough working memory after GC, it will _resize_ key and row caches. This will lead to degraded read performance and with some workloads can result in a vicious cycle. For our SLAs, a read of max 15-20 rows at once(using multi slice), should not take more than 4 ms. If you control your own hardware (and you probably should/must for this kind of latency demand) consider SSDs. You might want to carefully control background repair/compaction operations if predictable performance is your goal. You might want to avoid storing strings and use byte representations. 
If you have an application tier on the path consider caching in that tier as well to avoid the overhead of network calls and thrift processing. In a nutshell - - Start with defaults and tune based on small discrete adjustments and leave time to see the effect of each change. No-one will know your workload better than you and the questions you are asking are workload sensitive. - Allow time for tuning and spending time understanding the memory model and JVM GC. - Be very careful with caches. Leave enough room in the OS for its own disk cache. - Get an SSD Bill On 21 Feb 2013, at 19:03, amulya rattan talk2amu...@gmail.com wrote: Dear All, We are currently evaluating Cassandra for an application involving strict SLAs(Service level agreements). We just need one column family with a long key and approximately 70-80 bytes row. We are not concerned about write performance but are primarily concerned about read. For our SLAs, a read of max 15-20 rows at once(using multi slice), should not take more than 4 ms. Till now, on a single node setup, using cassandra' stress tool, the numbers are promising. But I am guessing that's because there is no network latency involved there and since we set memtable around 2gb(4 gb heap), we never had to get to Disk I/O. Assuming our nodes having 32GB RAM, a couple of questions regarding read: * To avoid disk I/Os, the best option we thought is to have data in memory. Is it a good idea to have memtable setup around 1/2 or 3/4 of heap size? Obviously flushing will take a lot of time but would that hurt that node's performance big time? * Cassandra stress tool only gives out average read latency. Is there a way to figure out max read-latency for a bunch of read operations? * How big a row cache can one have? Given that cassandra provides off-heap row caching, in a machine 32 gb RAM, would it be wise to have a 10 gb row cache with 8 gb java heap? And how big should the corresponding key cache be then? Any response is appreciated. ~Amulya
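On the max-latency question: since the stress tool only reports averages, one option (a sketch, assuming you can record per-request latencies client-side) is to summarise the samples yourself, much as nodetool's histograms do server-side:

```python
def latency_summary(latencies_ms):
    """Max and high-percentile read latency from client-recorded samples."""
    s = sorted(latencies_ms)
    idx = lambda p: min(len(s) - 1, int(p / 100 * len(s)))
    return {"p95": s[idx(95)], "p99": s[idx(99)], "max": s[-1]}

samples = [1.2, 0.8, 3.9, 1.1, 0.9, 7.5, 1.0, 1.3]
print(latency_summary(samples))  # the outliers, not the ~2 ms average, break a 4 ms SLA
```

For a strict SLA, the p99 and max figures matter far more than the mean, which is exactly why Bill points at histograms rather than averages.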
RE: cassandra vs. mongodb quick question(good additional info)
“The limiting factors are the time it take to repair, the time it takes to replace a node, the memory considerations for 100's of millions of rows. If you the performance of those operations is acceptable to you, then go crazy” If I have a node which is attached to a RAID and the node crashes but the data is still good on the drives, it would just mean bringing up the node using the same storage? Would this not be fast…? From: aaron morton [mailto:aa...@thelastpickle.com] Sent: 21 February 2013 11:46 To: user@cassandra.apache.org Subject: Re: cassandra vs. mongodb quick question(good additional info)
Cassandra with SAN
Hi - Is it a good idea to use Cassandra with SAN ? Say a SAN which provides me 8 Petabytes of storage. Would I not be I/O bound irrespective of the no of Cassandra machines and scaling by adding machines won't help ? Thanks Kanwar
Re: Cassandra with SAN
No, this is a really really bad idea and C* was not designed for this, in fact, it was designed so you don't need to have a large expensive SAN. Don't be tempted by the shiny expensive SAN. :) If money is no object instead throw SSD's in your nodes and run 10G between racks From: Kanwar Sangha kan...@mavenir.com Reply-To: user@cassandra.apache.org Date: Thursday, February 21, 2013 2:56 PM To: user@cassandra.apache.org Subject: Cassandra with SAN Hi – Is it a good idea to use Cassandra with SAN ? Say a SAN which provides me 8 Petabytes of storage. Would I not be I/O bound irrespective of the no of Cassandra machines and scaling by adding machines won’t help ? Thanks Kanwar Copy, by Barracuda, helps you store, protect, and share all your amazing things. Start today: www.copy.com.
Re: cassandra vs. mongodb quick question(good additional info)
The theoretical maximum of 10G is not even close to what you actually get. See the Intel/FedEx 10GbE case study: http://download.intel.com/support/network/sb/fedexcasestudyfinal.pdf On Thu, Feb 21, 2013 at 12:45 PM, aaron morton aa...@thelastpickle.com wrote: If you are lazy like me wolfram alpha can help -- 10 hours 15 minutes 43.59 seconds
RE: Cassandra with SAN
Ok. What would be the drawbacks :) From: Michael Kjellman [mailto:mkjell...@barracuda.com] Sent: 21 February 2013 17:12 To: user@cassandra.apache.org Subject: Re: Cassandra with SAN
Re: Cassandra with SAN
Who breaks a butterfly upon a wheel? It will work, but you'd have a distributed database running on a single point of failure storage fabric, thus destroying much of your benefits, unless you have enough discrete SAN units that you treat them as racks in your cassandra topology to ensure that you have data replicated across redundant SAN shelves|controllers|etc. You also would end up with redundancy at cross purposes in that the SAN will be striping data that Cassandra is already distributing efficiently. If the SAN is free and unused, it'll be fine as a Cassandra test platform. But I wouldn't spend a penny on SAN hardware instead of a much larger distributed cluster with commodity hardware. Derive your redundancy and performance from lots of hardware in lots of places, not expensive hardware in one place. --DRS On Feb 21, 2013, at 3:42 PM, Kanwar Sangha kan...@mavenir.com wrote: Ok. What would be the drawbacks :)
Re: Cassandra with SAN
Adding a Single Point of Failure when you chose a distributed database for probably a good reason. I'd also think you'd be tempted to have multiple terabytes per node. (so you're even more cost inefficient because you'll still need to buy the same number of nodes everyone else does even though you have the SAN). Then any operations are going to be unbearable (repair, cleanup). Also if you want to be multi dc, now you'll need two SANs. I can't think of one good reason to run C* with a SAN. From: Kanwar Sangha kan...@mavenir.com Reply-To: user@cassandra.apache.org Date: Thursday, February 21, 2013 3:42 PM To: user@cassandra.apache.org Subject: RE: Cassandra with SAN Ok. What would be the drawbacks :)
Re: Cassandra with SAN
Cassandra is designed to write and read data in a way that is optimized for physical spinning disks. Running C* on a SAN introduces a layer of abstraction that, at best negates those optimizations, and at worst introduces additional overhead. Sent from my iPhone On Feb 21, 2013, at 6:42 PM, Kanwar Sangha kan...@mavenir.com wrote: Ok. What would be the drawbacks :)
Re: Cassandra with SAN
I shouldn't have used the word spinning... SSDs are a great option as well. I also agree with all the expensive SPOF points others have made. Sent from my iPhone On Feb 21, 2013, at 6:56 PM, P. Taylor Goetz ptgo...@gmail.com wrote: Cassandra is designed to write and read data in a way that is optimized for physical spinning disks.
Re: Counting problem
There is a limit option, find it in the doc. On Fri, Feb 22, 2013 at 3:41 AM, Sri Ramya ramya.1...@gmail.com wrote: hi,, Cassandra can display maximum 100 rows in a Columnfamily. can i increase it. If it is possible please mention here. Thank you
Re: unsubscribe
On Tue, Feb 19, 2013 at 1:02 PM, Anurag Gujral anurag.guj...@gmail.com wrote: Unsubscribe me please. Thanks Could I interest you in picture of a lemur instead? http://goo.gl/RZw3e -- Eric Evans Acunu | http://www.acunu.com | @acunu
Re: Cassandra with SAN
As a counter argument though, anyone running a C* cluster on the Amazon cloud is going to be using SAN storage (or some kind of proprietary storage array) at the lowest layers...Amazon isn't going to have a bunch of JBOD running their cloud infrastructure. However, they've invested in the infrastructure to do it right. This is certainly true when using EBS, however it's generally not recommended to use EBS when running Cassandra. EBS has proven to be unreliable in the past and it's a bit of a SPOF. Instead, it's recommended to use the instance store disks that come with most instances (handy chart here: http://www.ec2instances.info/). These are the rough equivalent of local disks (probably host level RAID 10 storage if I'd have to guess.) -Jared On 22 February 2013 00:40, Michael Morris michael.m.mor...@gmail.com wrote: I'm running a 27 node cassandra cluster on SAN without issue. I will be perfectly clear though, the hosts are multi-homed to different switches/fabrics in the SAN, we have an _expensive_ EMC array, and other than a datacenter-wide power outage, there's no SPOF for the SAN. We use it because it's there, and it's already a sunk cost. I certainly would not go out of my way to purchase SAN infrastructure for a C* cluster, it just doesn't make sense (for all the reasons others have mentioned). Any more, you can load up a single 2U server with multi-TB worth of disk, so the aggregate storage capacity of your C* cluster could potentially be as much as a SAN you would purchase (and a lot less hassle too). As a counter argument though, anyone running a C* cluster on the Amazon cloud is going to be using SAN storage (or some kind of proprietary storage array) at the lowest layers...Amazon isn't going to have a bunch of JBOD running their cloud infrastructure. However, they've invested in the infrastructure to do it right. - Mike On Thu, Feb 21, 2013 at 6:08 PM, P. Taylor Goetz ptgo...@gmail.com wrote: I shouldn't have used the word spinning...
SSDs are a great option as well. I also agree with all the expensive SPOF points others have made.
Re: Cassandra with SAN
On Friday, February 22, 2013, Jared Biel wrote: As a counter argument though, anyone running a C* cluster on the Amazon cloud is going to be using SAN storage (or some kind of proprietary storage array) at the lowest layers...Amazon isn't going to have a bunch of JBOD running their cloud infrastructure. However, they've invested in the infrastructure to do it right. This is certainly true when using EBS, however it's generally not recommended to use EBS when running Cassandra. EBS has proven to be unreliable in the past and it's a bit of a SPOF. Instead, it's recommended to use the instance store disks that come with most instances (handy chart here: http://www.ec2instances.info/). These are the rough equivalent of local disks (probably host level RAID 10 storage if I'd have to guess.) -Jared On 22 February 2013 00:40, Michael Morris michael.m.mor...@gmail.comwrote: I'm running a 27 node cassandra cluster on SAN without issue. I will be perfectly clear though, the hosts are multi-homed to different switches/fabrics in the SAN, we have an _expensive_ EMC array, and other than a datacenter-wide power outage, there's no SPOF for the SAN. We use it because it's there, and it's already a sunk cost. I certainly would not go out of my way to purchase SAN infrastructure for a C* cluster, it just doesn't make sense (for all the reasons others have mentioned). Any more, you can load up a single 2U server with multi-TB worth of disk, so the aggregate storage capacity of your C* cluster could potentially be as much as a SAN you would purchase (and a lot less hassle too). As a counter argument though, anyone running a C* cluster on the Amazon cloud is going to be using SAN storage (or some kind of proprietary storage array) at the lowest layers...Amazon isn't going to have a bunch of JBOD running their cloud infrastructure. However, they've invested in the infrastructure to do it right. - Mike On Thu, Feb 21, 2013 at 6:08 PM, P. 
Taylor Goetz ptgo...@gmail.comwrote: I shouldn't have used the word spinning... SSDs are a great option as well. I also agree with all the expensive SPOF points others have made. Sent from my iPhone On Feb 21, 2013, at 6:56 PM, P. Taylor Goetz ptgo...@gmail.com wrote: Cassandra is designed to write and read data in a way that is optimized for physical spinning disks. Running C* on a SAN introduces a layer of abstraction that, at best negates those optimizations, and at worst introduces additional overhead. Sent from my iPhone On Feb 21, 2013, at 6:42 PM, Kanwar Sangha kan...@mavenir.com wrote: Ok. What would be the drawbacks J ** ** *From:* Michael Kjellman [mailto:mkjell...@barracuda.com] *Sent:* 21 February 2013 17:12 *To:* user@cassandra.apache.org *Subject:* Re: Cassandra with SAN ** ** No, this is a really really bad idea and C* was not designed for this, in fact, it was designed so you don't need to have a large expensive SAN. ** ** Don't be tempted by the shiny expensive SAN. :) ** ** If money is no object instead throw SSD's in your nodes and run 10G between racks ** ** *From: *Kanwar Sangha kan...@mavenir.com *Reply-To: *user@cassandra.apache.org
RE: Using Cassandra for read operations
Bill de hÓra already answered; I'd like to add. To achieve ~4ms reads (from the client's standpoint): 1. You can't use multi-slice, since different keys may live on different nodes, which requires internode communication. Design your data and reads to use one key/row. 2. Use ConsistencyLevel.ONE to avoid waiting for other nodes. 3. Use a smart client that selects endpoints by token (key) to send each request to the appropriate node, e.g. Astyanax (Java), or write such a client yourself. 4. Turn off the dynamic snitch. While the coordinator node may read locally, the dynamic snitch may redirect the read to another replica. 5. Use SSDs to avoid the re-caching hit when sstables are compacted. 6. If you do writes, the remaining issue is GC. Unless you're on the Azul Zing JVM (which I can't confirm to be better than Oracle HotSpot or JRockit; both have GC issues), you can't tune the JVM to make young-gen GC pauses as low as you need; you will be trading pause frequency against pause time. So if you can't afford Zing, also check out Aerospike (ex-CitrusLeaf), an alternative to Cassandra which is written in C and has no GC issues. From: Bill de hÓra [mailto:b...@dehora.net] Sent: Thursday, February 21, 2013 22:07 To: user@cassandra.apache.org Subject: Re: Using Cassandra for read operations In a nutshell: - Start with defaults and tune with small, discrete adjustments, leaving time to see the effect of each change. No one knows your workload better than you, and the questions you are asking are workload-sensitive. - Allow time for tuning, and spend time understanding the memory model and JVM GC. - Be very careful with caches. Leave enough room in the OS for its own disk cache. - Get an SSD. Bill On 21 Feb 2013, at 19:03, amulya rattan talk2amu...@gmail.com wrote: Dear All, We are currently evaluating Cassandra for an application involving strict SLAs (service level agreements). We just need one column family with a long key and approximately 70-80 bytes per row. We are not concerned about write performance, but are primarily concerned about read.
For our SLAs, a read of at most 15-20 rows at once (using multi-slice) should not take more than 4 ms. So far, on a single-node setup using Cassandra's stress tool, the numbers are promising. But I am guessing that's because there is no network latency involved, and since we set the memtable around 2 GB (4 GB heap), we never had to hit disk I/O. Assuming our nodes have 32 GB RAM, a couple of questions regarding reads: * To avoid disk I/O, the best option we can think of is to keep the data in memory. Is it a good idea to set the memtable to around 1/2 or 3/4 of the heap size? Obviously flushing will take a lot of time, but would that hurt that node's performance big time? * The Cassandra stress tool only reports average read latency. Is there a way to figure out the max read latency for a batch of read operations? * How big a row cache can one have? Given that Cassandra provides off-heap row caching, on a machine with 32 GB RAM, would it be wise to have a 10 GB row cache with an 8 GB Java heap? And how big should the corresponding key cache be then? Any response is appreciated. ~Amulya Best regards / Pagarbiai Viktor Jevdokimov Senior Developer Email: viktor.jevdoki...@adform.com Phone: +370 5 212 3063 Fax: +370 5 261 0453 J. Jasinskio 16C, LT-01112 Vilnius, Lithuania
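Viktor's point 3 above (a token-aware client) can be sketched as follows. This is an illustrative model only, not Astyanax's API: the node addresses and token values are made up, and real Cassandra token assignment differs in detail, though RandomPartitioner does derive tokens from an MD5 hash of the key.

```python
import bisect
import hashlib

# Sketch of token-aware routing: send each read directly to the node that
# owns the key, avoiding an extra coordinator hop. Nodes/tokens are invented.

class TokenRing:
    def __init__(self, tokens_to_nodes):
        # tokens_to_nodes: list of (token, node_address) pairs
        self.ring = sorted(tokens_to_nodes)
        self.tokens = [t for t, _ in self.ring]

    @staticmethod
    def token_for(key):
        # MD5-derived token, mirroring RandomPartitioner's general idea
        return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2 ** 127)

    def endpoint_for(self, key):
        # First node whose token is >= the key's token, wrapping around the ring
        i = bisect.bisect_left(self.tokens, self.token_for(key))
        return self.ring[i % len(self.ring)][1]

ring = TokenRing([(0, "10.0.0.1"), (2 ** 125, "10.0.0.2"), (2 ** 126, "10.0.0.3")])
primary = ring.endpoint_for("user:42")  # issue the CL.ONE read against this node
```

Combined with ConsistencyLevel.ONE and the dynamic snitch disabled, routing this way means the node receiving the request is also the one serving the data, which is what keeps the client-observed latency close to the local read latency.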
Re: Mutation dropped
Thanks Aaron for the great information, as always. I just checked cfhistograms, and only a handful of read latencies are bigger than 100ms, but in proxyhistograms there are 10 times as many greater than 100ms. We are using QUORUM for reads with RF=3, and I understand the coordinator needs to get digests from the other nodes and do read repair on a mismatch, etc. But is it normal to see the latency from proxyhistograms go beyond 100ms? Is there any way to improve that? We are tracking metrics from the client side, and we see the 95th percentile response time averages at 40ms, which is a bit high. Our 50th percentile is great, under 3ms. Any suggestion is very much appreciated. Thanks. -Wei - Original Message - From: aaron morton aa...@thelastpickle.com To: Cassandra User user@cassandra.apache.org Sent: Thursday, February 21, 2013 9:20:49 AM Subject: Re: Mutation dropped What does rpc_timeout control? Only the reads/writes? Yes. Like data streaming? streaming_socket_timeout_in_ms in the yaml. Merkle tree requests? Either no timeout or a number of days; I cannot remember which right now. What is the side effect if it's set to a really small number, say 20ms? You will probably get a lot more requests that fail with a TimedOutException. rpc_timeout needs to be longer than the time it takes a node to process the message, plus the time it takes the coordinator to do its thing. You can look at cfhistograms and proxyhistograms to get a better idea of how long a request takes in your system. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 21/02/2013, at 6:56 AM, Wei Zhu wz1...@yahoo.com wrote: What does rpc_timeout control? Only the reads/writes? How about other inter-node communication, like data streaming and merkle tree requests? What is a reasonable value for rpc_timeout? The default value of 10 seconds is way too long. What is the side effect if it's set to a really small number, say 20ms? Thanks.
-Wei From: aaron morton aa...@thelastpickle.com To: user@cassandra.apache.org Sent: Tuesday, February 19, 2013 7:32 PM Subject: Re: Mutation dropped Does the rpc_timeout not control the client timeout? No, it is how long a node will wait for a response from other nodes before raising a TimedOutException if fewer than CL nodes have responded. Set the client-side socket timeout using your preferred client. Is there any param which is configurable to control the replication timeout between nodes? There is no such thing. rpc_timeout is roughly like that, but it's not right to think about it that way; i.e., if a message to a replica times out but CL nodes have already responded, then we are happy to call the request complete. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 19/02/2013, at 1:48 AM, Kanwar Sangha kan...@mavenir.com wrote: Thanks Aaron. Does the rpc_timeout not control the client timeout? Is there any param which is configurable to control the replication timeout between nodes? Or is the same param used to control that, since the other node is also like a client? From: aaron morton [mailto:aa...@thelastpickle.com] Sent: 17 February 2013 11:26 To: user@cassandra.apache.org Subject: Re: Mutation dropped You are hitting the maximum throughput of the cluster. The messages are dropped because the node fails to start processing them before rpc_timeout. However, the request is still a success because the client-requested CL was achieved. Testing with RF=2 and CL=1 really just tests the disks on one local machine: both nodes replicate each row, and writes are sent to each replica, so the only thing the client is waiting on is the local node writing to its commit log. Testing with (and running in prod) RF=3 and CL QUORUM is a more real-world scenario.
Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 15/02/2013, at 9:42 AM, Kanwar Sangha kan...@mavenir.com wrote: Hi – Is there a parameter which can be tuned to prevent the mutations from being dropped? Is this logic correct? Node A and B with RF=2, CL=1, load balanced between the two. -- Address Load Tokens Owns (effective) Host ID Rack UN 10.x.x.x 746.78 GB 256 100.0% dbc9e539-f735-4b0b-8067-b97a85522a1a rack1 UN 10.x.x.x 880.77 GB 256 100.0% 95d59054-be99-455f-90d1-f43981d3d778 rack1 Once we hit a very high TPS (around 50k/sec of inserts), the nodes start falling behind and we see the "mutation dropped" messages, but there are no failures on the client. Does that mean the other node is not able to persist the replicated data? Is there some
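The semantics Aaron describes (a write is a client success once CL replicas acknowledge within rpc_timeout, even if a slow replica later drops the mutation) can be sketched with a simplified model. The replica latencies below are hypothetical, and real coordination is of course asynchronous; this only illustrates the success/dropped accounting.

```python
# Hedged sketch: a coordinator reports success when at least CL replicas
# acknowledge within rpc_timeout. Replicas that miss the deadline show up
# as dropped mutations but do not fail the client request.

def coordinator_write(replica_latencies_ms, cl, rpc_timeout_ms):
    """Return (client_success, dropped_mutations) for one write, RF = len(latencies)."""
    acked = sum(1 for lat in replica_latencies_ms if lat <= rpc_timeout_ms)
    dropped = len(replica_latencies_ms) - acked  # these surface as "mutation dropped"
    return acked >= cl, dropped

# RF=3, CL=QUORUM (2 of 3): two replicas answer quickly, one is overloaded.
ok, dropped = coordinator_write([3, 7, 12000], cl=2, rpc_timeout_ms=10000)
# ok is True (client sees success) even though one mutation was dropped.
```

This is why Kanwar sees dropped-mutation messages with no client-side failures: with RF=2 and CL=1, the fast replica's ack alone satisfies the request, while the lagging replica's mutation times out and is dropped (to be reconciled later by hinted handoff, read repair, or nodetool repair).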