Re: Timeout during stress test

2011-04-12 Thread aaron morton
Couple of hits here, one from jonathan and some previous discussions on the 
user list http://www.google.co.nz/search?q=cassandra+iostat

Same here for cfhistograms 
http://www.google.co.nz/search?q=cassandra+cfhistograms 
cfhistograms includes information on the number of sstables read during recent 
requests. As your initial cfstats showed 236 sstables I thought it may be 
useful see if there was a high number of sstables been accessed per read. 

70 requests per second is slow against a 6 node cluster where each node has 12 
cores and 96GB of ram. Something is not right.

Aaron 
On 12 Apr 2011, at 17:11, mcasandra wrote:

 
 aaron morton wrote:
 
 You'll need to provide more information, from the TP stats the read stage
 could not keep up. If the node is not CPU bound then it is probably IO
 bound. 
 
 
 What sort of read?
 How many columns was it asking for ? 
 How many columns do the rows have ?
 Was the test asking for different rows ?
 How many ops requests per second did it get up to?
 What do the io stats look like ? 
 What does nodetool cfhistograms say ?
 
 It's simple read of 1M rows with one column of avg size of 200K. Got around
 70 req per sec.
 
 Not sure how to intepret the iostats output with things happening async in
 cassandra. Can you give little description on how to interpret it?
 
 I have posted output of cfstats. Does cfhistograms provide better info?
 
 
 --
 View this message in context: 
 http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Timeout-during-stress-test-tp6262430p6263859.html
 Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
 Nabble.com.



Re: Timeout during stress test

2011-04-12 Thread mcasandra
 0  
  
0
17084  690 0 0  
  
0
20501  960 0 0  
  
0
24601 1272 0 0  
  
0
29521 1734 0 0  
  
0
35425 2262 0 0  
  
0
42510 2734 0 0  
  
0
51012 3098 0 0  
  
0
61214 3426 0 0  
  
0
73457 3879 0 0  
  
0
88148 4157 0 0  
  
0
1057784065 0 0  
  
0
1269343804 0 0  
  
0
1523212828 0 0  
  
0
1827851699 0 0  
  
0
219342 821 0 0  
  
0
263210 300 0249214  
  
0
315852  88 0149731  
  
0
379022  12 0 0  
  
0
454826   3 0 0  
  
0
545791   0 0 0  
  
0
654949   0 0 0  
  
0
785939   0 0 0  
  
0
943127   0 0 74915  
  
0
1131752  0 0 0  
  
0
1358102  0 0 0  
  
0
1629722  0 0 0  
  
0
1955666  0 0 0  
  
0
2346799  0 0 0  
  
0
2816159  0 0 0  
  
0
3379391  0 0 22438  
  
0
4055269  0 0 0  
  
0
4866323  0 0 0  
  
0
5839588  0 0  2559  
  
0
7007506  0 0 0  
  
0
8409007  0 0 0  
  
0
10090808 0 0 0  
  
0
12108970 0 0 0  
  
0
14530764 0 0 0  
  
0
17436917 0 0 0  
  
0
20924300 0 0 0  
  
0
25109160 0 0 0  
  
0


--
View this message in context: 
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Timeout-during-stress-test-tp6262430p6265925.html
Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
Nabble.com.


Timeout during stress test

2011-04-11 Thread mcasandra
I am running stress test using hector. In the client logs I see:

me.prettyprint.hector.api.exceptions.HTimedOutException: TimedOutException()
at
me.prettyprint.cassandra.service.ExceptionsTranslatorImpl.translate(ExceptionsTranslatorImpl.java:32)
at
me.prettyprint.cassandra.service.HColumnFamilyImpl$1.execute(HColumnFamilyImpl.java:256)
at
me.prettyprint.cassandra.service.HColumnFamilyImpl$1.execute(HColumnFamilyImpl.java:227)
at
me.prettyprint.cassandra.service.Operation.executeAndSetResult(Operation.java:101)
at
me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:221)
at
me.prettyprint.cassandra.model.ExecutingKeyspace.doExecuteOperation(ExecutingKeyspace.java:97)
at
me.prettyprint.cassandra.service.HColumnFamilyImpl.doExecuteSlice(HColumnFamilyImpl.java:227)
at
me.prettyprint.cassandra.service.HColumnFamilyImpl.getColumns(HColumnFamilyImpl.java:139)
at
com.riptano.cassandra.stress.SliceCommand.call(SliceCommand.java:48)
at
com.riptano.cassandra.stress.SliceCommand.call(SliceCommand.java:20)
at
java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: TimedOutException()
at
org.apache.cassandra.thrift.Cassandra$get_slice_result.read(Cassandra.java:7174)
at
org.apache.cassandra.thrift.Cassandra$Client.recv_get_slice(Cassandra.java:540)
at
org.apache.cassandra.thrift.Cassandra$Client.get_slice(Cassandra.java:512)
at
me.prettyprint.cassandra.service.HColumnFamilyImpl$1.execute(HColumnFamilyImpl.java:236)


But I don't see anything in cassandra logs.

--
View this message in context: 
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Timeout-during-stress-test-tp6262430p6262430.html
Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
Nabble.com.


Re: Timeout during stress test

2011-04-11 Thread mcasandra
I see this occurring often when all cassandra nodes all of a sudden show CPU
spike. All reads fail for about 2 mts. GC.log and system.log doesn't reveal
much.

Only think I notice is that when I restart nodes there are tons of files
that gets deleted. cfstats from one of the nodes looks like this:

nodetool -h `hostname` tpstats
Pool NameActive   Pending  Completed
ReadStage2727  21491
RequestResponseStage  0 0 201641
MutationStage 0 0 236513
ReadRepairStage   0 0   7222
GossipStage   0 0  31498
AntiEntropyStage  0 0  0
MigrationStage0 0  0
MemtablePostFlusher   0 0324
StreamStage   0 0  0
FlushWriter   0 0324
FILEUTILS-DELETE-POOL 0 0   1220
MiscStage 0 0  0
FlushSorter   0 0  0
InternalResponseStage 0 0  0
HintedHandoff 1 3  9

--


Keyspace: StressKeyspace
Read Count: 21957
Read Latency: 46.91765058978913 ms.
Write Count: 222104
Write Latency: 0.008302124230090408 ms.
Pending Tasks: 0
Column Family: StressStandard
SSTable count: 286
Space used (live): 377916657941
Space used (total): 377916657941
Memtable Columns Count: 362
Memtable Data Size: 164403613
Memtable Switch Count: 326
Read Count: 21958
Read Latency: 631.464 ms.
Write Count: 222104
Write Latency: 0.007 ms.
Pending Tasks: 0
Key cache capacity: 100
Key cache size: 22007
Key cache hit rate: 0.002453626459907744
Row cache: disabled
Compacted row minimum size: 87
Compacted row maximum size: 5839588
Compacted row mean size: 552698




--
View this message in context: 
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Timeout-during-stress-test-tp6262430p6263087.html
Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
Nabble.com.


Re: Timeout during stress test

2011-04-11 Thread aaron morton
TimedOutException means the cluster could not perform the request in 
rpc_timeout time. The client should retry as the problem may be transitory. 

In this case read performance may have slowed down due to the number of 
sstables 286. It hard to tell without knowing what the workload is.

Aaron

On 12 Apr 2011, at 09:56, mcasandra wrote:

 I see this occurring often when all cassandra nodes all of a sudden show CPU
 spike. All reads fail for about 2 mts. GC.log and system.log doesn't reveal
 much.
 
 Only think I notice is that when I restart nodes there are tons of files
 that gets deleted. cfstats from one of the nodes looks like this:
 
 nodetool -h `hostname` tpstats
 Pool NameActive   Pending  Completed
 ReadStage2727  21491
 RequestResponseStage  0 0 201641
 MutationStage 0 0 236513
 ReadRepairStage   0 0   7222
 GossipStage   0 0  31498
 AntiEntropyStage  0 0  0
 MigrationStage0 0  0
 MemtablePostFlusher   0 0324
 StreamStage   0 0  0
 FlushWriter   0 0324
 FILEUTILS-DELETE-POOL 0 0   1220
 MiscStage 0 0  0
 FlushSorter   0 0  0
 InternalResponseStage 0 0  0
 HintedHandoff 1 3  9
 
 --
 
 
 Keyspace: StressKeyspace
Read Count: 21957
Read Latency: 46.91765058978913 ms.
Write Count: 222104
Write Latency: 0.008302124230090408 ms.
Pending Tasks: 0
Column Family: StressStandard
SSTable count: 286
Space used (live): 377916657941
Space used (total): 377916657941
Memtable Columns Count: 362
Memtable Data Size: 164403613
Memtable Switch Count: 326
Read Count: 21958
Read Latency: 631.464 ms.
Write Count: 222104
Write Latency: 0.007 ms.
Pending Tasks: 0
Key cache capacity: 100
Key cache size: 22007
Key cache hit rate: 0.002453626459907744
Row cache: disabled
Compacted row minimum size: 87
Compacted row maximum size: 5839588
Compacted row mean size: 552698
 
 
 
 
 --
 View this message in context: 
 http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Timeout-during-stress-test-tp6262430p6263087.html
 Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
 Nabble.com.



Re: Timeout during stress test

2011-04-11 Thread mcasandra
It looks like hector did retry on all the nodes and failed. Does this then
mean cassandra is down for clients in this scenario? That would be bad.

--
View this message in context: 
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Timeout-during-stress-test-tp6262430p6263270.html
Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
Nabble.com.


Re: Timeout during stress test

2011-04-11 Thread aaron morton
It means the cluster is currently overloaded and unable to complete requests in 
time at the CL specified. 

Aaron

On 12 Apr 2011, at 11:18, mcasandra wrote:

 It looks like hector did retry on all the nodes and failed. Does this then
 mean cassandra is down for clients in this scenario? That would be bad.
 
 --
 View this message in context: 
 http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Timeout-during-stress-test-tp6262430p6263270.html
 Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
 Nabble.com.



Re: Timeout during stress test

2011-04-11 Thread mcasandra
But I don't understand the reason for oveload. It was doing simple read of 12
threads and reasing 5 rows. Avg CPU only 20%, No GC issues that I see. I
would expect cassandra to be able to process more with 6 nodes, 12 core, 96
GB RAM and 4 GB heap.

--
View this message in context: 
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Timeout-during-stress-test-tp6262430p6263470.html
Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
Nabble.com.


Re: Timeout during stress test

2011-04-11 Thread aaron morton
You'll need to provide more information, from the TP stats the read stage could 
not keep up. If the node is not CPU bound then it is probably IO bound. 


What sort of read?
How many columns was it asking for ? 
How many columns do the rows have ?
Was the test asking for different rows ?
How many ops requests per second did it get up to?
What do the io stats look like ? 
What does nodetool cfhistograms say ?

Aaron

On 12 Apr 2011, at 13:02, mcasandra wrote:

 But I don't understand the reason for oveload. It was doing simple read of 12
 threads and reasing 5 rows. Avg CPU only 20%, No GC issues that I see. I
 would expect cassandra to be able to process more with 6 nodes, 12 core, 96
 GB RAM and 4 GB heap.
 
 --
 View this message in context: 
 http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Timeout-during-stress-test-tp6262430p6263470.html
 Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
 Nabble.com.



Re: Timeout during stress test

2011-04-11 Thread Terje Marthinussen
I notice you have pending hinted handoffs?

Look for errors related to that. We have seen occasional corruptions in the
hinted handoff sstables,

If you are stressing the system to its limits, you may also consider playing
with more with the number of  read/write threads  (concurrent_reads/writes)
as well as rate limiting the number of requests you can get per node
(throttle limit).

We have seen similar issue when sending large number of requests to a
cluster (read/write threads running out, timeouts, nodes marked as down).

Terje

We have seen similar issues when

On Tue, Apr 12, 2011 at 9:56 AM, aaron morton aa...@thelastpickle.comwrote:

 It means the cluster is currently overloaded and unable to complete
 requests in time at the CL specified.

 Aaron

 On 12 Apr 2011, at 11:18, mcasandra wrote:

  It looks like hector did retry on all the nodes and failed. Does this
 then
  mean cassandra is down for clients in this scenario? That would be bad.
 
  --
  View this message in context:
 http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Timeout-during-stress-test-tp6262430p6263270.html
  Sent from the cassandra-u...@incubator.apache.org mailing list archive
 at Nabble.com.




Re: Timeout during stress test

2011-04-11 Thread mcasandra

aaron morton wrote:
 
 You'll need to provide more information, from the TP stats the read stage
 could not keep up. If the node is not CPU bound then it is probably IO
 bound. 
 
 
 What sort of read?
 How many columns was it asking for ? 
 How many columns do the rows have ?
 Was the test asking for different rows ?
 How many ops requests per second did it get up to?
 What do the io stats look like ? 
 What does nodetool cfhistograms say ?
 
It's simple read of 1M rows with one column of avg size of 200K. Got around
70 req per sec.

Not sure how to intepret the iostats output with things happening async in
cassandra. Can you give little description on how to interpret it?

I have posted output of cfstats. Does cfhistograms provide better info?


--
View this message in context: 
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Timeout-during-stress-test-tp6262430p6263859.html
Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
Nabble.com.