Re: Interesting failure scenario, SolrCloud and ZK nodes on different times

2013-08-08 Thread Grant Ingersoll
I seem to recall seeing this on my cluster when we didn't have clocks in sync, 
but perhaps my memory is fuzzy as well.

-Grant

On Aug 7, 2013, at 7:41 AM, Erick Erickson erickerick...@gmail.com wrote:

 Well, we're reconstructing a chain of _possibilities_ post-mortem,
 so there's not much I can say for sure. Mostly just throwing this
 out there in case it sparks some aha moments. Since I don't know
 ZK well, anything I say is speculation.
 
 But I speculate that this isn't really the root of the problem, given
 that we haven't seen the "ClusterState says we are the leader..."
 error go by the user lists in a while. It may well be a coincidence. The
 place where this happened reported that the problem seemed to
 get better after the ZK nodes' times were adjusted. I know when I
 reconstruct events like this I'm never sure about cause and
 effect, since I'm usually doing several things at once.
 
 Erick
 
 
 On Tue, Aug 6, 2013 at 5:51 PM, Chris Hostetter hossman_luc...@fucit.org wrote:
 
 :  When the times were coordinated, many of the problems with recovery went
 :  away. We're trying to reconstruct the scenario from memory, but it
 :  prompted me to pass the incident along in case it sparked any thoughts.
 :  Specifically, I wonder if there's anything that comes to mind if the ZK
 :  nodes are significantly out of sync with each other time-wise.
 :
 : Does this mean that ntp or other strict time synchronization is important
 : for SolrCloud?  I strive for this anyway, just to ensure that when I'm
 : researching log files between two machines I can match things up properly.
 
 I don't know if/how Solr/ZK is affected by having machines with clocks out
 of sync, but I do remember seeing discussions a while back about weird
 things happening to ZK client apps *while* time adjustments are taking
 place to get back in sync.
 
 IIRC: as the local clock starts accelerating and jumping ahead in
 increments to correct itself with ntp, those jumps can confuse the
 ZK code into thinking it's been waiting a lot longer than it really
 has for the ZK heartbeat (or whatever it's called), and that can trigger a
 timeout situation.
 
 
 -Hoss
 
 
 


Grant Ingersoll | @gsingers
http://www.lucidworks.com

Re: Interesting failure scenario, SolrCloud and ZK nodes on different times

2013-08-07 Thread Erick Erickson
Well, we're reconstructing a chain of _possibilities_ post-mortem,
so there's not much I can say for sure. Mostly just throwing this
out there in case it sparks some aha moments. Since I don't know
ZK well, anything I say is speculation.

But I speculate that this isn't really the root of the problem, given
that we haven't seen the "ClusterState says we are the leader..."
error go by the user lists in a while. It may well be a coincidence. The
place where this happened reported that the problem seemed to
get better after the ZK nodes' times were adjusted. I know when I
reconstruct events like this I'm never sure about cause and
effect, since I'm usually doing several things at once.

Erick


On Tue, Aug 6, 2013 at 5:51 PM, Chris Hostetter hossman_luc...@fucit.org wrote:


 :  When the times were coordinated, many of the problems with recovery went
 :  away. We're trying to reconstruct the scenario from memory, but it
 :  prompted me to pass the incident along in case it sparked any thoughts.
 :  Specifically, I wonder if there's anything that comes to mind if the ZK
 :  nodes are significantly out of sync with each other time-wise.
 :
 : Does this mean that ntp or other strict time synchronization is important
 : for SolrCloud?  I strive for this anyway, just to ensure that when I'm
 : researching log files between two machines I can match things up properly.

 I don't know if/how Solr/ZK is affected by having machines with clocks out
 of sync, but I do remember seeing discussions a while back about weird
 things happening to ZK client apps *while* time adjustments are taking
 place to get back in sync.

 IIRC: as the local clock starts accelerating and jumping ahead in
 increments to correct itself with ntp, those jumps can confuse the
 ZK code into thinking it's been waiting a lot longer than it really
 has for the ZK heartbeat (or whatever it's called), and that can trigger a
 timeout situation.


 -Hoss





Interesting failure scenario, SolrCloud and ZK nodes on different times

2013-08-06 Thread Erick Erickson
I've become aware of a situation I thought I'd pass along. A SolrCloud
installation had several ZK nodes with very significantly offset times.
They were being hit with the "ClusterState says we are the leader, but
locally we don't think we are" error when nodes were recovering. Whether
this problem is now taken care of in recent Solr releases I don't quite
know; I haven't seen it go by the user's list for quite a while.
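
To spell out what that error means, here's a rough sketch of the mismatch
(hypothetical names, not actual Solr code): the cluster state published in
ZK says a node is the leader for a shard, while the node's own local state
disagrees.

    import java.util.Map;

    // Hypothetical illustration only -- not Solr's actual leader check.
    public final class LeaderSanityCheck {

        static void checkLeadership(Map<String, String> shardLeadersInZk,
                                    String shard, String myNodeName,
                                    boolean locallyBelieveWeAreLeader) {
            // ZK's published cluster state claims we lead this shard...
            boolean clusterStateSaysLeader =
                    myNodeName.equals(shardLeadersInZk.get(shard));
            // ...but our local state says otherwise: that's the mismatch.
            if (clusterStateSaysLeader && !locallyBelieveWeAreLeader) {
                throw new IllegalStateException(
                    "ClusterState says we are the leader, but locally we don't think we are");
            }
        }

        public static void main(String[] args) {
            // ZK says node1 leads shard1; locally node1 disagrees.
            Map<String, String> shardLeadersInZk = Map.of("shard1", "node1");
            checkLeadership(shardLeadersInZk, "shard1", "node1", false);
        }
    }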

When the times were coordinated, many of the problems with recovery went
away. We're trying to reconstruct the scenario from memory, but it prompted
me to pass the incident along in case it sparked any thoughts. Specifically,
I wonder if there's anything that comes to mind if the ZK nodes are
significantly out of sync with each other time-wise.

FWIW,
Erick


Re: Interesting failure scenario, SolrCloud and ZK nodes on different times

2013-08-06 Thread Shawn Heisey

On 8/6/2013 1:56 PM, Erick Erickson wrote:

I've become aware of a situation I thought I'd pass along. A SolrCloud
installation had several ZK nodes with very significantly offset
times. They were being hit with the "ClusterState says we are the
leader, but locally we don't think we are" error when nodes were
recovering. Whether this problem is now taken care of in recent Solr
releases I don't quite know; I haven't seen it go by the user's list
for quite a while.

When the times were coordinated, many of the problems with recovery went
away. We're trying to reconstruct the scenario from memory, but it
prompted me to pass the incident along in case it sparked any thoughts.
Specifically, I wonder if there's anything that comes to mind if the ZK
nodes are significantly out of sync with each other time-wise.


Does this mean that ntp or other strict time synchronization is
important for SolrCloud?  I strive for this anyway, just to ensure that
when I'm researching log files between two machines I can match
things up properly.
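
For what it's worth, a minimal sketch of checking a machine's clock offset
programmatically (the Apache Commons Net dependency and the pool.ntp.org
server are just my choices for the example; nothing Solr-specific):

    import java.net.InetAddress;
    import org.apache.commons.net.ntp.NTPUDPClient;
    import org.apache.commons.net.ntp.TimeInfo;

    // Sketch: ask an NTP server how far this machine's clock is off.
    public final class ClockOffsetCheck {
        public static void main(String[] args) throws Exception {
            NTPUDPClient client = new NTPUDPClient();
            client.setDefaultTimeout(5000); // don't hang on a dead server
            try {
                TimeInfo info =
                        client.getTime(InetAddress.getByName("pool.ntp.org"));
                info.computeDetails(); // derives offset/delay from the response
                System.out.println("clock offset vs NTP: "
                        + info.getOffset() + " ms");
            } finally {
                client.close();
            }
        }
    }

Even an offset of a couple of seconds between nodes makes matching up log
files painful, never mind whatever it does to the cluster itself.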


Thanks,
Shawn





Re: Interesting failure scenario, SolrCloud and ZK nodes on different times

2013-08-06 Thread Chris Hostetter

:  When the times were coordinated, many of the problems with recovery went
:  away. We're trying to reconstruct the scenario from memory, but it
:  prompted me to pass the incident along in case it sparked any thoughts.
:  Specifically, I wonder if there's anything that comes to mind if the ZK
:  nodes are significantly out of sync with each other time-wise.
: 
: Does this mean that ntp or other strict time synchronization is important
: for SolrCloud?  I strive for this anyway, just to ensure that when I'm
: researching log files between two machines I can match things up properly.

I don't know if/how Solr/ZK is affected by having machines with clocks out
of sync, but I do remember seeing discussions a while back about weird
things happening to ZK client apps *while* time adjustments are taking
place to get back in sync.

IIRC: as the local clock starts accelerating and jumping ahead in
increments to correct itself with ntp, those jumps can confuse the
ZK code into thinking it's been waiting a lot longer than it really
has for the ZK heartbeat (or whatever it's called), and that can trigger a
timeout situation.
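
Something like this toy sketch shows the failure mode I mean (my
illustration, not ZooKeeper's actual code; the 10-second session timeout is
made up): a timeout measured with the wall clock fires spuriously when ntp
steps the clock forward, while one measured with a monotonic clock does not.

    import java.util.concurrent.TimeUnit;

    // Toy demo of why wall-clock-based timeouts misbehave during an ntp
    // step adjustment. The timeout value and step size are invented.
    public final class ClockJumpDemo {

        static final long SESSION_TIMEOUT_MS = 10_000;

        public static void main(String[] args) {
            long wallStart = System.currentTimeMillis(); // ntp can move this
            long monoStart = System.nanoTime();          // ntp cannot

            // Pretend ntp just stepped the wall clock 30 seconds ahead
            // while we were really waiting only a few milliseconds.
            long simulatedNtpStepMs = 30_000;
            long wallNow = System.currentTimeMillis() + simulatedNtpStepMs;
            long monoNow = System.nanoTime();

            long wallElapsed = wallNow - wallStart;
            long monoElapsed = TimeUnit.NANOSECONDS.toMillis(monoNow - monoStart);

            // The wall-clock view "sees" a 30s wait and declares a timeout;
            // the monotonic view knows almost no real time has passed.
            System.out.println("wall-clock view: elapsed=" + wallElapsed
                    + " ms, timed out? " + (wallElapsed > SESSION_TIMEOUT_MS));
            System.out.println("monotonic view:  elapsed=" + monoElapsed
                    + " ms, timed out? " + (monoElapsed > SESSION_TIMEOUT_MS));
        }
    }

If that's what was happening, it would also explain why the trouble showed
up *while* the clocks were being adjusted rather than from the skew itself.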


-Hoss
