Re: Interesting failure scenario, SolrCloud and ZK nodes on different times
I seem to recall seeing this on my cluster when we didn't have clocks in sync, but perhaps my memory is fuzzy as well.

-Grant

On Aug 7, 2013, at 7:41 AM, Erick Erickson erickerick...@gmail.com wrote:

Well, we're reconstructing a chain of _possibilities_ post-mortem, so there's not much I can say for sure. Mostly just throwing this out there in case it sparks some aha moments. Not knowing ZK well, anything I say is speculation.

But I speculate that this isn't really the root of the problem, given that we haven't been seeing the "ClusterState says we are the leader..." error go by the user lists for a while. It may well be a coincidence. The place that this happened reported that the problem seemed to be better after adjusting the ZK nodes' times. I know when I reconstruct events like this I'm never sure about cause and effect since I'm usually doing several things at once.

Erick

On Tue, Aug 6, 2013 at 5:51 PM, Chris Hostetter hossman_luc...@fucit.org wrote:

: When the times were coordinated, many of the problems with recovery went
: away. We're trying to reconstruct the scenario from memory, but it
: prompted me to pass the incident in case it sparked any thoughts.
: Specifically, I wonder if there's anything that comes to mind if the ZK
: nodes are significantly out of synch with each other time-wise.
:
: Does this mean that ntp or other strict time synchronization is important for
: SolrCloud? I strive for this anyway, just to ensure that when I'm researching
: log files between two machines that I can match things up properly.

I don't know if/how Solr/ZK is affected by having machines with clocks out of sync, but I do remember seeing discussions a while back about weird things happening to ZK client apps *while* time adjustments are taking place to get back in sync.

IIRC: as the local clock starts accelerating and jumping ahead in increments to correct itself with ntp, those jumps can confuse the ZK code into thinking it's been waiting a lot longer than it really has for the zk heartbeat (or whatever it's called), and that can trigger a timeout situation.

-Hoss

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Grant Ingersoll | @gsingers
http://www.lucidworks.com
Re: Interesting failure scenario, SolrCloud and ZK nodes on different times
Well, we're reconstructing a chain of _possibilities_ post-mortem, so there's not much I can say for sure. Mostly just throwing this out there in case it sparks some aha moments. Not knowing ZK well, anything I say is speculation.

But I speculate that this isn't really the root of the problem, given that we haven't been seeing the "ClusterState says we are the leader..." error go by the user lists for a while. It may well be a coincidence. The place that this happened reported that the problem seemed to be better after adjusting the ZK nodes' times. I know when I reconstruct events like this I'm never sure about cause and effect since I'm usually doing several things at once.

Erick

On Tue, Aug 6, 2013 at 5:51 PM, Chris Hostetter hossman_luc...@fucit.org wrote:

: When the times were coordinated, many of the problems with recovery went
: away. We're trying to reconstruct the scenario from memory, but it
: prompted me to pass the incident in case it sparked any thoughts.
: Specifically, I wonder if there's anything that comes to mind if the ZK
: nodes are significantly out of synch with each other time-wise.
:
: Does this mean that ntp or other strict time synchronization is important for
: SolrCloud? I strive for this anyway, just to ensure that when I'm researching
: log files between two machines that I can match things up properly.

I don't know if/how Solr/ZK is affected by having machines with clocks out of sync, but I do remember seeing discussions a while back about weird things happening to ZK client apps *while* time adjustments are taking place to get back in sync.

IIRC: as the local clock starts accelerating and jumping ahead in increments to correct itself with ntp, those jumps can confuse the ZK code into thinking it's been waiting a lot longer than it really has for the zk heartbeat (or whatever it's called), and that can trigger a timeout situation.
-Hoss
Interesting failure scenario, SolrCloud and ZK nodes on different times
I've become aware of a situation I thought I'd pass along. A SolrCloud installation had several ZK nodes that had very significantly offset times. They were being hit with the "ClusterState says we are the leader, but locally we don't think we are" error when nodes were recovering. Of course, whether this problem is now taken care of with recent Solr releases (I haven't seen this go by the user list for quite a while) I don't quite know.

When the times were coordinated, many of the problems with recovery went away. We're trying to reconstruct the scenario from memory, but it prompted me to pass the incident in case it sparked any thoughts. Specifically, I wonder if there's anything that comes to mind if the ZK nodes are significantly out of synch with each other time-wise.

FWIW,
Erick
Re: Interesting failure scenario, SolrCloud and ZK nodes on different times
On 8/6/2013 1:56 PM, Erick Erickson wrote:

I've become aware of a situation I thought I'd pass along. A SolrCloud installation had several ZK nodes that had very significantly offset times. They were being hit with the "ClusterState says we are the leader, but locally we don't think we are" error when nodes were recovering. Of course, whether this problem is now taken care of with recent Solr releases (I haven't seen this go by the user list for quite a while) I don't quite know.

When the times were coordinated, many of the problems with recovery went away. We're trying to reconstruct the scenario from memory, but it prompted me to pass the incident in case it sparked any thoughts. Specifically, I wonder if there's anything that comes to mind if the ZK nodes are significantly out of synch with each other time-wise.

Does this mean that ntp or other strict time synchronization is important for SolrCloud? I strive for this anyway, just to ensure that when I'm researching log files between two machines I can match things up properly.

Thanks,
Shawn
Re: Interesting failure scenario, SolrCloud and ZK nodes on different times
: When the times were coordinated, many of the problems with recovery went
: away. We're trying to reconstruct the scenario from memory, but it
: prompted me to pass the incident in case it sparked any thoughts.
: Specifically, I wonder if there's anything that comes to mind if the ZK
: nodes are significantly out of synch with each other time-wise.
:
: Does this mean that ntp or other strict time synchronization is important for
: SolrCloud? I strive for this anyway, just to ensure that when I'm researching
: log files between two machines that I can match things up properly.

I don't know if/how Solr/ZK is affected by having machines with clocks out of sync, but I do remember seeing discussions a while back about weird things happening to ZK client apps *while* time adjustments are taking place to get back in sync.

IIRC: as the local clock starts accelerating and jumping ahead in increments to correct itself with ntp, those jumps can confuse the ZK code into thinking it's been waiting a lot longer than it really has for the zk heartbeat (or whatever it's called), and that can trigger a timeout situation.

-Hoss
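
For anyone who wants to see the effect described above in isolation, here is a rough sketch in Python. It is not ZooKeeper's actual code, and whether any given ZK version measures heartbeats this way is not something this thread establishes. The names HeartbeatTracker, FakeClocks and demo are made up for illustration. The sketch only shows the general mechanism: a timeout computed from the wall clock can fire spuriously when the clock is stepped forward, while the same timeout computed from a monotonic clock does not.

```python
class HeartbeatTracker:
    """Hypothetical helper: decides whether a heartbeat is overdue.

    The clock is injected so the same logic can be driven by either a
    wall clock (like time.time) or a monotonic clock (like time.monotonic).
    """

    def __init__(self, timeout_s, clock):
        self.timeout_s = timeout_s
        self.clock = clock
        self.last_beat = clock()

    def beat(self):
        # Record that a heartbeat was just received.
        self.last_beat = self.clock()

    def expired(self):
        # True if more than timeout_s appears to have elapsed since the last beat.
        return self.clock() - self.last_beat > self.timeout_s


class FakeClocks:
    """Simulated clocks so the demo is deterministic (an assumption of this sketch)."""

    def __init__(self):
        self.wall = 1000.0  # pretend wall-clock seconds
        self.mono = 0.0     # pretend monotonic seconds

    def advance(self, seconds):
        # Real time passes: both clocks move forward together.
        self.wall += seconds
        self.mono += seconds

    def step_wall(self, seconds):
        # A time correction steps the wall clock; the monotonic clock is unaffected.
        self.wall += seconds


def demo():
    clocks = FakeClocks()
    wall_tracker = HeartbeatTracker(10.0, lambda: clocks.wall)
    mono_tracker = HeartbeatTracker(10.0, lambda: clocks.mono)

    clocks.advance(2.0)     # only 2 seconds of real time pass
    clocks.step_wall(30.0)  # the wall clock is stepped forward by 30 seconds

    return wall_tracker.expired(), mono_tracker.expired()


if __name__ == "__main__":
    wall_expired, mono_expired = demo()
    print("wall-clock tracker expired:", wall_expired)
    print("monotonic tracker expired:", mono_expired)
```

In real code the injected clocks would be time.time and time.monotonic from the standard library instead of the simulated ones. Note this only covers jumps on a single machine; it says nothing about nodes whose clocks simply disagree with each other, which is the other half of the question in this thread.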