Re: Read time get worse during dynamic snitch reset

aaron morton Tue, 12 Apr 2011 04:32:20 -0700

Something feels odd. 

From Peters nice write up of the dynamic snitch 
http://www.mail-archive.com/user@cassandra.apache.org/msg12092.html The 
RackInferringSnitch (and the PropertyFileSnitch) derive from the 
AbstractNetworkTopologySnitch and should...
"
In the case of the NetworkTopologyStrategy, it inherits the
implementation in AbstractNetworkTopologySnitch which sorts by
AbstractNetworkTopologySnitch.compareEndPoints(), which:


(1) Always prefers itself to any other node. So "myself" is always
"closest", no matter what.
(2) Else, always prefers a node in the same rack, to a node in a different rack.
(3) Else, always prefers a node in the same dc, to a node in a different dc.
"

AFAIK the (data) request should be going to the local DC even after the 
DynamicSnitch has reset the scores. Because the underlying RackInferringSnitch 
should prefer local nodes.

Just for fun check rack and dc assignments are what you thought using the 
operations on o.a.c.db.EndpointSnitchInfo bean in JConsole. Pass in the ip 
address for the nodes in each dc. If possible can you provide some info on the 
ip's in each dc?

Aaron
 
On 12 Apr 2011, at 18:24, shimi wrote:

> On Tue, Apr 12, 2011 at 12:26 AM, aaron morton <aa...@thelastpickle.com> 
> wrote:
> The reset interval clears the latency tracked for each node so a bad node 
> will be read from again. The scores for each node are then updated every 
> 100ms (default) using the last 100 responses from a node. 
> 
> How long does the bad performance last for?
> Only a few seconds and but there are a lot of read requests during this time
> 
> What CL are you reading at ? At Quorum with RF 4 the read request will be 
> sent to 3 nodes, ordered by proximity and wellness according to the dynamic 
> snitch. (for background recent discussion on dynamic snitch 
> http://www.mail-archive.com/user@cassandra.apache.org/msg12089.html)
> I am reading with CL of ONE,  read_repair_chance=0.33, RackInferringSnitch 
> and keys_cached = rows_cached = 0
> 
> You can take a look at the weights and timings used by the DynamicSnitch in 
> JConsole under o.a.c.db.DynamicSnitchEndpoint . Also at DEBUG log level you 
> will be able to see which nodes the request is sent to. 
> Everything looks OK. The weights are around 3 for the nodes in the same data 
> center and around 5 for the others. I will turn on the DEBUG level to see if 
> I can find more info.
> 
> My guess is the DynamicSnitch is doing the right thing and the slow down is a 
> node with a problem getting back into the list of nodes used for your read. 
> It's then moved down the list as it's bad performance is noticed.
> Looking the DynamicSnitch MBean I don't see any problems with any of the 
> nodes. My guess is that during the reset time there are reads that are sent 
> to the other data center. 
> 
> Hope that helps
> Aaron
> 
> Shimi
>  
> 
> On 12 Apr 2011, at 01:28, shimi wrote:
> 
>> I finally upgraded 0.6.x to 0.7.4.  The nodes are running with the new 
>> version for several days across 2 data centers.
>> I noticed that the read time in some of the nodes increase by x50-60 every 
>> ten minutes.
>> There was no indication in the logs for something that happen at the same 
>> time. The only thing that I know that is running every 10 minutes is the 
>> dynamic snitch reset.
>> So I changed dynamic_snitch_reset_interval_in_ms to 20 minutes and now I 
>> have the problem once in every 20 minutes.
>> 
>> I am running all nodes with:
>> replica_placement_strategy: 
>> org.apache.cassandra.locator.NetworkTopologyStrategy
>>       strategy_options:
>>         DC1 : 2
>>         DC2 : 2
>>       replication_factor: 4
>> 
>> (DC1 and DC2 are taken from the ips)
>> Does anyone familiar with this kind of behavior?
>> 
>> Shimi
>> 
> 
>

Re: Read time get worse during dynamic snitch reset

Reply via email to