RE: propertyfilesnitch problem

2011-11-10 Thread Shu Zhang
At first, I also thought that one or more nodes in the cluster were broken 
or not responding. But nodetool cfstats shows all the nodes working as 
expected, and pings give me the expected inter-node latencies. Also, the 
scores calculated by the dynamic snitch in the steady state seem to 
correspond to how we configured the network topology. 

We're not timing out, but comparing periods when the dynamic snitch has 
appropriate scores with periods when it doesn't, the latency of LOCAL_QUORUM 
operations gets bumped up from ~10ms to ~100ms. QUORUM operations remain at 
~100ms regardless of dynamic snitch settings. We maintain a consistent load 
throughout the tests and there is no feedback mechanism.

Thanks,
Shu


RE: propertyfilesnitch problem

2011-11-09 Thread Shu Zhang
Hi, sorry to ask again, but I'm having trouble getting to the bottom of this...

Does anyone else see this? When dynamic snitch is turned off, the performance 
of LOCAL_QUORUM operations is as bad as QUORUM. The property file snitch 
appears to be properly configured. Any suggestions on how I can investigate 
further would be greatly appreciated.


Re: propertyfilesnitch problem

2011-11-09 Thread Peter Schuller
 2. With the same setup, after each period as defined by 
 dynamic_snitch_reset_interval_in_ms, the LOCAL_QUORUM performance greatly 
 degrades before drastically improving again within a minute.

This part sounds to me like one or more nodes in the cluster are
either broken and not responding at all, or overloaded. Restarts will
tend to temporarily cause additional pressure on nodes (particularly
I/O due to cache eviction issues).

Because the dynamic snitch won't ever know that the node is slow
(after a reset) until requests start actually timing out, it can be up
to rpc_timeout seconds before it gets snitched away. That sounds like
what you're seeing: on every reset, an rpc_timeout period of poor
latency for clients.
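
To illustrate the reset behaviour described above, here is a toy model in
Python. It is only a sketch and not Cassandra's DynamicEndpointSnitch code:
the class name, the mean-latency scoring and the node names are all
assumptions made for illustration. It shows why a ranking built from observed
latencies falls back to the static order right after a reset, until fresh
samples arrive.

```python
from collections import defaultdict


class ToyDynamicSnitch:
    """Hypothetical model: rank replicas by mean observed latency."""

    def __init__(self, static_order):
        # Order the underlying (static) snitch would return on its own.
        self.static_order = list(static_order)
        self.samples = defaultdict(list)

    def record(self, node, latency_ms):
        # A sample only exists once a response (or a timeout) has been seen.
        self.samples[node].append(latency_ms)

    def reset(self):
        # Analogue of the periodic reset: all latency history is dropped.
        self.samples.clear()

    def score(self, node):
        history = self.samples.get(node)
        # With no history, a node looks exactly as good as any other node.
        return sum(history) / len(history) if history else 0.0

    def sorted_replicas(self):
        # sorted() is stable, so equal scores keep the static order.
        return sorted(self.static_order, key=self.score)


snitch = ToyDynamicSnitch(["slow-node", "fast-node"])
snitch.record("slow-node", 100.0)
snitch.record("fast-node", 10.0)
print(snitch.sorted_replicas())  # fast-node is ranked first

snitch.reset()
print(snitch.sorted_replicas())  # back to the static order
```

In this model the slow node is preferred again after every reset, and stays
preferred until a new sample for it is recorded.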

Is rpc_timeout 60 seconds?

 4. With dynamic snitch turned on, QUORUM operations' performance is about the 
 same as using LOCAL_QUORUM when the dynamic snitch is off or the first minute 
 after a restart with the snitch turned on.

This is strange, unless it is coincidental.

Can you be more specific about the performance characteristics you're
seeing when degraded? For example:

* High latency, or timeouts?
* Are you getting Unavailable exceptions?
* Are you maintaining the same overall throughput or is there a
feedback mechanism such that when queries have high latency the
request rate decreases?
* Which data points are you using to consider something degraded?
What's matching in the QUORUM and LOCAL_QUORUM w/o dynsnitch cases?
-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)


propertyfilesnitch problem

2011-11-07 Thread Shu Zhang
Hi,

We have a 2 DC setup on version 0.7.9 and have observed the following:
1. Using a property file snitch with the dynamic snitch turned on, the 
performance of LOCAL_QUORUM operations is poor for a while (around a minute) 
after a cluster restart before drastically improving.
2. With the same setup, after each period as defined by 
dynamic_snitch_reset_interval_in_ms, the LOCAL_QUORUM performance greatly 
degrades before drastically improving again within a minute.
3. With the dynamic snitch turned off, LOCAL_QUORUM operations perform 
extremely poorly... same as the 1st minute after a restart.
4. With dynamic snitch turned on, QUORUM operations' performance is about the 
same as using LOCAL_QUORUM when the dynamic snitch is off or the first minute 
after a restart with the snitch turned on.

All of this seems to point to LOCAL_QUORUM operations not differentiating our 
DCs using the property file snitch, so that their performance effectively 
degrades to that of QUORUM when the dynamic snitch doesn't have appropriate 
scores.
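
As a rough illustration of the gap we would expect between the two 
consistency levels, here is a small Python sketch. The replication factor of 
3 per DC and the 10 ms / 100 ms round-trip times are assumptions for the 
example, not measurements, and the helper names are made up. It relies only 
on the quorum size being a strict majority of the replicas considered.

```python
def quorum(replication_factor):
    """Replicas that must respond: a strict majority."""
    return replication_factor // 2 + 1


def wait_latency_ms(latencies_ms, required):
    """Time until `required` responses are in: the required-th fastest one."""
    return sorted(latencies_ms)[required - 1]


# Assumed layout: 3 replicas in each of 2 DCs.
local = [10, 10, 10]       # assumed round trips inside the local DC, in ms
remote = [100, 100, 100]   # assumed round trips to the other DC, in ms

# LOCAL_QUORUM only needs a majority of the local DC's replicas.
local_quorum_ms = wait_latency_ms(local, quorum(len(local)))

# QUORUM needs a majority of all replicas, so at least one remote response.
all_replicas = local + remote
quorum_ms = wait_latency_ms(all_replicas, quorum(len(all_replicas)))

print(local_quorum_ms, quorum_ms)
```

Under these assumptions LOCAL_QUORUM is bounded by local round trips and 
QUORUM by a remote one, which is the difference we only see when the dynamic 
snitch has appropriate scores.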

Our main concern is the performance degradation at the periods defined by 
dynamic_snitch_reset_interval_in_ms.

The DynamicEndpointSnitch in steady state assigns scores that match the DCs 
we've configured through the network topology property file.

Our network topology property file appears to be properly configured and has 
been confirmed through the EndpointSnitchInfo mbean.
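
For anyone who wants to double-check a topology file offline, a minimal 
Python sketch follows. It assumes the usual address=DC:rack line format of 
the property file snitch; the addresses, DC names and rack names are 
hypothetical and are not our real configuration.

```python
from collections import defaultdict


def parse_topology(text):
    """Map each datacenter name to the sorted node addresses assigned to it."""
    by_dc = defaultdict(list)
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        node, _, placement = line.partition("=")
        dc, _, _rack = placement.partition(":")
        by_dc[dc.strip()].append(node.strip())
    return {dc: sorted(nodes) for dc, nodes in by_dc.items()}


# Hypothetical file contents, for illustration only.
SAMPLE = """
# address=datacenter:rack
10.0.1.1=DC1:RAC1
10.0.1.2=DC1:RAC1
10.0.2.1=DC2:RAC1
"""

for dc, nodes in sorted(parse_topology(SAMPLE).items()):
    print(dc, nodes)
```

Comparing this grouping with what the EndpointSnitchInfo mbean reports for 
each node is one way to spot a node assigned to the wrong DC.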

Please advise.

Thanks,
Shu
Medio Systems