Re: Timeout during stress test
Couple of hits here, one from jonathan and some previous discussions on the user list http://www.google.co.nz/search?q=cassandra+iostat Same here for cfhistograms http://www.google.co.nz/search?q=cassandra+cfhistograms cfhistograms includes information on the number of sstables read during recent requests. As your initial cfstats showed 236 sstables I thought it may be useful see if there was a high number of sstables been accessed per read. 70 requests per second is slow against a 6 node cluster where each node has 12 cores and 96GB of ram. Something is not right. Aaron On 12 Apr 2011, at 17:11, mcasandra wrote: aaron morton wrote: You'll need to provide more information, from the TP stats the read stage could not keep up. If the node is not CPU bound then it is probably IO bound. What sort of read? How many columns was it asking for ? How many columns do the rows have ? Was the test asking for different rows ? How many ops requests per second did it get up to? What do the io stats look like ? What does nodetool cfhistograms say ? It's simple read of 1M rows with one column of avg size of 200K. Got around 70 req per sec. Not sure how to intepret the iostats output with things happening async in cassandra. Can you give little description on how to interpret it? I have posted output of cfstats. Does cfhistograms provide better info? -- View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Timeout-during-stress-test-tp6262430p6263859.html Sent from the cassandra-u...@incubator.apache.org mailing list archive at Nabble.com.
Re: Timeout during stress test
0 0 17084 690 0 0 0 20501 960 0 0 0 24601 1272 0 0 0 29521 1734 0 0 0 35425 2262 0 0 0 42510 2734 0 0 0 51012 3098 0 0 0 61214 3426 0 0 0 73457 3879 0 0 0 88148 4157 0 0 0 1057784065 0 0 0 1269343804 0 0 0 1523212828 0 0 0 1827851699 0 0 0 219342 821 0 0 0 263210 300 0249214 0 315852 88 0149731 0 379022 12 0 0 0 454826 3 0 0 0 545791 0 0 0 0 654949 0 0 0 0 785939 0 0 0 0 943127 0 0 74915 0 1131752 0 0 0 0 1358102 0 0 0 0 1629722 0 0 0 0 1955666 0 0 0 0 2346799 0 0 0 0 2816159 0 0 0 0 3379391 0 0 22438 0 4055269 0 0 0 0 4866323 0 0 0 0 5839588 0 0 2559 0 7007506 0 0 0 0 8409007 0 0 0 0 10090808 0 0 0 0 12108970 0 0 0 0 14530764 0 0 0 0 17436917 0 0 0 0 20924300 0 0 0 0 25109160 0 0 0 0 -- View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Timeout-during-stress-test-tp6262430p6265925.html Sent from the cassandra-u...@incubator.apache.org mailing list archive at Nabble.com.
Timeout during stress test
I am running stress test using hector. In the client logs I see: me.prettyprint.hector.api.exceptions.HTimedOutException: TimedOutException() at me.prettyprint.cassandra.service.ExceptionsTranslatorImpl.translate(ExceptionsTranslatorImpl.java:32) at me.prettyprint.cassandra.service.HColumnFamilyImpl$1.execute(HColumnFamilyImpl.java:256) at me.prettyprint.cassandra.service.HColumnFamilyImpl$1.execute(HColumnFamilyImpl.java:227) at me.prettyprint.cassandra.service.Operation.executeAndSetResult(Operation.java:101) at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:221) at me.prettyprint.cassandra.model.ExecutingKeyspace.doExecuteOperation(ExecutingKeyspace.java:97) at me.prettyprint.cassandra.service.HColumnFamilyImpl.doExecuteSlice(HColumnFamilyImpl.java:227) at me.prettyprint.cassandra.service.HColumnFamilyImpl.getColumns(HColumnFamilyImpl.java:139) at com.riptano.cassandra.stress.SliceCommand.call(SliceCommand.java:48) at com.riptano.cassandra.stress.SliceCommand.call(SliceCommand.java:20) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) Caused by: TimedOutException() at org.apache.cassandra.thrift.Cassandra$get_slice_result.read(Cassandra.java:7174) at org.apache.cassandra.thrift.Cassandra$Client.recv_get_slice(Cassandra.java:540) at org.apache.cassandra.thrift.Cassandra$Client.get_slice(Cassandra.java:512) at me.prettyprint.cassandra.service.HColumnFamilyImpl$1.execute(HColumnFamilyImpl.java:236) But I don't see anything in cassandra logs. -- View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Timeout-during-stress-test-tp6262430p6262430.html Sent from the cassandra-u...@incubator.apache.org mailing list archive at Nabble.com.
Re: Timeout during stress test
I see this occurring often when all cassandra nodes all of a sudden show CPU spike. All reads fail for about 2 mts. GC.log and system.log doesn't reveal much. Only think I notice is that when I restart nodes there are tons of files that gets deleted. cfstats from one of the nodes looks like this: nodetool -h `hostname` tpstats Pool NameActive Pending Completed ReadStage2727 21491 RequestResponseStage 0 0 201641 MutationStage 0 0 236513 ReadRepairStage 0 0 7222 GossipStage 0 0 31498 AntiEntropyStage 0 0 0 MigrationStage0 0 0 MemtablePostFlusher 0 0324 StreamStage 0 0 0 FlushWriter 0 0324 FILEUTILS-DELETE-POOL 0 0 1220 MiscStage 0 0 0 FlushSorter 0 0 0 InternalResponseStage 0 0 0 HintedHandoff 1 3 9 -- Keyspace: StressKeyspace Read Count: 21957 Read Latency: 46.91765058978913 ms. Write Count: 222104 Write Latency: 0.008302124230090408 ms. Pending Tasks: 0 Column Family: StressStandard SSTable count: 286 Space used (live): 377916657941 Space used (total): 377916657941 Memtable Columns Count: 362 Memtable Data Size: 164403613 Memtable Switch Count: 326 Read Count: 21958 Read Latency: 631.464 ms. Write Count: 222104 Write Latency: 0.007 ms. Pending Tasks: 0 Key cache capacity: 100 Key cache size: 22007 Key cache hit rate: 0.002453626459907744 Row cache: disabled Compacted row minimum size: 87 Compacted row maximum size: 5839588 Compacted row mean size: 552698 -- View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Timeout-during-stress-test-tp6262430p6263087.html Sent from the cassandra-u...@incubator.apache.org mailing list archive at Nabble.com.
Re: Timeout during stress test
TimedOutException means the cluster could not perform the request in rpc_timeout time. The client should retry as the problem may be transitory. In this case read performance may have slowed down due to the number of sstables 286. It hard to tell without knowing what the workload is. Aaron On 12 Apr 2011, at 09:56, mcasandra wrote: I see this occurring often when all cassandra nodes all of a sudden show CPU spike. All reads fail for about 2 mts. GC.log and system.log doesn't reveal much. Only think I notice is that when I restart nodes there are tons of files that gets deleted. cfstats from one of the nodes looks like this: nodetool -h `hostname` tpstats Pool NameActive Pending Completed ReadStage2727 21491 RequestResponseStage 0 0 201641 MutationStage 0 0 236513 ReadRepairStage 0 0 7222 GossipStage 0 0 31498 AntiEntropyStage 0 0 0 MigrationStage0 0 0 MemtablePostFlusher 0 0324 StreamStage 0 0 0 FlushWriter 0 0324 FILEUTILS-DELETE-POOL 0 0 1220 MiscStage 0 0 0 FlushSorter 0 0 0 InternalResponseStage 0 0 0 HintedHandoff 1 3 9 -- Keyspace: StressKeyspace Read Count: 21957 Read Latency: 46.91765058978913 ms. Write Count: 222104 Write Latency: 0.008302124230090408 ms. Pending Tasks: 0 Column Family: StressStandard SSTable count: 286 Space used (live): 377916657941 Space used (total): 377916657941 Memtable Columns Count: 362 Memtable Data Size: 164403613 Memtable Switch Count: 326 Read Count: 21958 Read Latency: 631.464 ms. Write Count: 222104 Write Latency: 0.007 ms. Pending Tasks: 0 Key cache capacity: 100 Key cache size: 22007 Key cache hit rate: 0.002453626459907744 Row cache: disabled Compacted row minimum size: 87 Compacted row maximum size: 5839588 Compacted row mean size: 552698 -- View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Timeout-during-stress-test-tp6262430p6263087.html Sent from the cassandra-u...@incubator.apache.org mailing list archive at Nabble.com.
Re: Timeout during stress test
It looks like hector did retry on all the nodes and failed. Does this then mean cassandra is down for clients in this scenario? That would be bad. -- View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Timeout-during-stress-test-tp6262430p6263270.html Sent from the cassandra-u...@incubator.apache.org mailing list archive at Nabble.com.
Re: Timeout during stress test
It means the cluster is currently overloaded and unable to complete requests in time at the CL specified. Aaron On 12 Apr 2011, at 11:18, mcasandra wrote: It looks like hector did retry on all the nodes and failed. Does this then mean cassandra is down for clients in this scenario? That would be bad. -- View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Timeout-during-stress-test-tp6262430p6263270.html Sent from the cassandra-u...@incubator.apache.org mailing list archive at Nabble.com.
Re: Timeout during stress test
But I don't understand the reason for oveload. It was doing simple read of 12 threads and reasing 5 rows. Avg CPU only 20%, No GC issues that I see. I would expect cassandra to be able to process more with 6 nodes, 12 core, 96 GB RAM and 4 GB heap. -- View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Timeout-during-stress-test-tp6262430p6263470.html Sent from the cassandra-u...@incubator.apache.org mailing list archive at Nabble.com.
Re: Timeout during stress test
You'll need to provide more information, from the TP stats the read stage could not keep up. If the node is not CPU bound then it is probably IO bound. What sort of read? How many columns was it asking for ? How many columns do the rows have ? Was the test asking for different rows ? How many ops requests per second did it get up to? What do the io stats look like ? What does nodetool cfhistograms say ? Aaron On 12 Apr 2011, at 13:02, mcasandra wrote: But I don't understand the reason for oveload. It was doing simple read of 12 threads and reasing 5 rows. Avg CPU only 20%, No GC issues that I see. I would expect cassandra to be able to process more with 6 nodes, 12 core, 96 GB RAM and 4 GB heap. -- View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Timeout-during-stress-test-tp6262430p6263470.html Sent from the cassandra-u...@incubator.apache.org mailing list archive at Nabble.com.
Re: Timeout during stress test
I notice you have pending hinted handoffs? Look for errors related to that. We have seen occasional corruptions in the hinted handoff sstables, If you are stressing the system to its limits, you may also consider playing with more with the number of read/write threads (concurrent_reads/writes) as well as rate limiting the number of requests you can get per node (throttle limit). We have seen similar issue when sending large number of requests to a cluster (read/write threads running out, timeouts, nodes marked as down). Terje We have seen similar issues when On Tue, Apr 12, 2011 at 9:56 AM, aaron morton aa...@thelastpickle.comwrote: It means the cluster is currently overloaded and unable to complete requests in time at the CL specified. Aaron On 12 Apr 2011, at 11:18, mcasandra wrote: It looks like hector did retry on all the nodes and failed. Does this then mean cassandra is down for clients in this scenario? That would be bad. -- View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Timeout-during-stress-test-tp6262430p6263270.html Sent from the cassandra-u...@incubator.apache.org mailing list archive at Nabble.com.
Re: Timeout during stress test
aaron morton wrote: You'll need to provide more information, from the TP stats the read stage could not keep up. If the node is not CPU bound then it is probably IO bound. What sort of read? How many columns was it asking for ? How many columns do the rows have ? Was the test asking for different rows ? How many ops requests per second did it get up to? What do the io stats look like ? What does nodetool cfhistograms say ? It's simple read of 1M rows with one column of avg size of 200K. Got around 70 req per sec. Not sure how to intepret the iostats output with things happening async in cassandra. Can you give little description on how to interpret it? I have posted output of cfstats. Does cfhistograms provide better info? -- View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Timeout-during-stress-test-tp6262430p6263859.html Sent from the cassandra-u...@incubator.apache.org mailing list archive at Nabble.com.