Re: Rogue query killed several replicas with OOM, after recovering - match all docs query problem

2013-04-22 Thread Timothy Potter
I have a little more info about this ... the numDocs for *:* fluctuates
between two values (a difference of 324 docs) depending on which nodes I
hit (distrib=true):

589,674,416
589,674,092

Using distrib=false, I found one shard with a mismatch:

shard15: {
  leader = 32,765,254
  replica = 32,764,930 diff:324
}

Interesting that the replica has more docs than the leader.
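
(For reference, a per-core check like the one above can be scripted along
these lines. This is only a sketch: the host names and core URLs are
hypothetical, and the important part is distrib=false, which makes each core
report its own numDocs instead of fanning the query out across the cluster.)

    import json, urllib.parse, urllib.request

    # Hypothetical leader and replica core URLs for the shard being checked.
    cores = {
        "shard15-leader":  "http://solr-a:8983/solr/collection1",
        "shard15-replica": "http://solr-b:8983/solr/collection1",
    }

    # distrib=false returns the core's own numDocs instead of a distributed count.
    params = urllib.parse.urlencode(
        {"q": "*:*", "rows": 0, "wt": "json", "distrib": "false"})

    counts = {}
    for name, base in cores.items():
        with urllib.request.urlopen(base + "/select?" + params) as resp:
            counts[name] = json.loads(resp.read().decode("utf-8"))["response"]["numFound"]

    for name, count in counts.items():
        print(name, count)
    print("diff:", abs(counts["shard15-leader"] - counts["shard15-replica"]))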

Unfortunately, due to some bad log management scripting on my part,
the logs were lost when these instances were restarted, which really
bums me out :-(

For now, I'm going to assume the replica with more docs is the one I
want to keep and will replicate the full index over to the other one.
Sorry about losing the logs :-(

Tim



Re: Rogue query killed several replicas with OOM, after recovering - match all docs query problem

2013-04-22 Thread Timothy Potter
Never mind - can't read my own output - the leader had more docs than the replica ;-)




Re: Rogue query killed several replicas with OOM, after recovering - match all docs query problem

2013-04-22 Thread Mark Miller
Bummer on the log loss :(

Good info, though. Somehow that replica became active without actually syncing?
This is heavily tested (though not with OOMs, I suppose), so I'm a little
surprised, but it's hard to speculate on how it happened without the logs.
Especially the logs from the node that is off would have been great - we would
see what it did when it recovered and why it thought it was in sync :(

- Mark




Re: Rogue query killed several replicas with OOM, after recovering - match all docs query problem

2013-04-22 Thread Mark Miller
What do you know about the # of docs you *should* have? Does that number match
up when you take the bad replica out of the equation?

- Mark




Re: Rogue query killed several replicas with OOM, after recovering - match all docs query problem

2013-04-22 Thread Timothy Potter
I ended up just nuking the index on the replica with fewer docs and
restarting it, which triggered the snap pull from the leader. So now
I'm in sync and have better processes in place to capture the
information if it happens again, which, given some of the queries my UI
team develops, is highly likely ;-)
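
(For anyone hitting the same thing, the manual step boils down to: stop the
node, remove the core's index directory, and start it again. A rough sketch
follows; the index path is an assumption and will differ per install.)

    # Sketch only: assumes the node is already stopped and that the path below
    # is where this core keeps its Lucene index (a made-up path; adjust for
    # your layout).
    import shutil
    from pathlib import Path

    index_dir = Path("/var/solr/collection1/data/index")

    if index_dir.exists():
        shutil.rmtree(index_dir)   # drop the out-of-sync index

    # On restart the core comes up empty, so recovery falls back to a full
    # replication (snap pull) from the shard leader.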

Also, all of our input data for Solr lives in Hive, so I'm doing some
id-to-id comparisons of what is in Solr vs. what is in Hive to find any
discrepancies.
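
(The comparison itself can be as simple as diffing two exported id lists. A
minimal sketch, assuming the ids have already been dumped to flat files with
one id per line; the file names are made up.)

    # Load each exported id list into a set and report what each side is missing.
    def load_ids(path):
        with open(path) as f:
            return set(line.strip() for line in f if line.strip())

    solr_ids = load_ids("solr_ids.txt")   # hypothetical export from Solr
    hive_ids = load_ids("hive_ids.txt")   # hypothetical export from Hive

    print("in Hive but missing from Solr:", len(hive_ids - solr_ids))
    print("in Solr but missing from Hive:", len(solr_ids - hive_ids))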

Again, sorry about the loss of the logs. This is a tough scenario to
try to re-create as it was a perfect storm of high indexing throughput
and a rogue query.

Tim




Re: Rogue query killed several replicas with OOM, after recovering - match all docs query problem

2013-04-22 Thread Mark Miller
No worries, thanks for the info. Let me know if you gain any more insight! I'd
love to figure out what happened here and address it. And I'm especially
interested in knowing whether you lost any updates, if you are able to determine that.

- Mark




Re: Rogue query killed several replicas with OOM, after recovering - match all docs query problem

2013-04-22 Thread Sudhakar Maddineni
We encountered a similar issue a few days back with the 4.0-Beta version.
We have a 6-node, 3-shard cluster setup, and one of our replica
servers [Tomcat] was not responding to any requests because it had reached
the max number of threads [200 by default]. To temporarily fix the issue, we
had to restart the server. After restarting, we realized that there were two
Tomcat processes running [the old one + the new one], so we manually killed
both Tomcat processes and did a clean start. Afterwards, we observed that the
numDocs on the replica server did not match the count on the leader.
So, is this discrepancy because we manually killed the process, which
interrupted the sync process?

Thx, Sudhakar.





Re: Rogue query killed several replicas with OOM, after recovering - match all docs query problem

2013-04-22 Thread Timothy Potter
Hi Sudhakar,

Unfortunately, we don't know the underlying cause, and I lost the logs
that could have helped diagnose it further. FWIW, I think this is an
extreme case, as I've lost nodes before and haven't had any
discrepancies after recovering. In my case, it was a perfect storm of
high-throughput indexing (8-10K docs/sec) and a very nasty query that
OOM'd about half the nodes in my cluster. Of course we want to get to
the bottom of this, but it's going to be hard to reproduce. The good
news is I'm recovered and the cluster is consistent again.

Cheers,
Tim


Re: Rogue query killed several replicas with OOM, after recovering - match all docs query problem

2013-04-20 Thread Mark Miller
Yeah, that's no good.

You might hit each node with distrib=false to get the doc counts.

Which ones have what you think are the right counts and which have the wrong
ones - e.g., is it all replicas that are off, or leaders as well?

You say several replicas - do you mean no leaders went down?

You might look closer at the logs for a node that has its count off.

Finally, I guess I'd try and track it in a JIRA issue.

- Mark




Re: Rogue query killed several replicas with OOM, after recovering - match all docs query problem

2013-04-20 Thread Timothy Potter
Thanks for responding, Mark. I'll collect the information you asked
about and open a JIRA once I have a little more understanding of what
happened. Hopefully I can piece together some story after going over
the logs.

As for replica / leader, I suspect some leaders went down but
fail-over to new leaders seemed to work fine. We lost about 9 nodes at
once and continued to serve queries, which is awesome.




Rogue query killed several replicas with OOM, after recovering - match all docs query problem

2013-04-19 Thread Timothy Potter
We had a rogue query take out several replicas in a large 4.2.0 cluster
today, due to OOMs (we use the JVM args to kill the process on OOM).
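
(For reference, a typical HotSpot flag for killing the process on OOM - the
exact args used in this cluster aren't shown, so this is only an assumption -
looks like:)

    -XX:OnOutOfMemoryError="kill -9 %p"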

After recovering, when I execute the match all docs query (*:*), I get a
different count each time.

In other words, if I execute q=*:* several times in a row, then I get a
different count back for numDocs.

This was not the case prior to the failure, as that is one thing we monitor
for.
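
(A check of that kind can be as simple as repeating the match-all query and
flagging any change in numFound between runs. A sketch with a hypothetical
endpoint; note that during active indexing the count will legitimately grow,
so this is only meaningful on a quiet index or when comparing replicas of the
same shard.)

    import json, time, urllib.parse, urllib.request

    # Hypothetical collection endpoint; adjust host, port, and core name.
    base = "http://solr-host:8983/solr/collection1/select"
    params = urllib.parse.urlencode({"q": "*:*", "rows": 0, "wt": "json"})

    last = None
    for _ in range(10):                      # ten probes, a few seconds apart
        with urllib.request.urlopen(base + "?" + params) as resp:
            count = json.loads(resp.read().decode("utf-8"))["response"]["numFound"]
        if last is not None and count != last:
            print("numDocs changed between queries: %d -> %d" % (last, count))
        last = count
        time.sleep(5)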

I think I should be worried ... any ideas on how to troubleshoot this? One
thing to mention is that several of my replicas had to do full recoveries
from the leader when they came back online. Indexing was happening when the
replicas failed.

Thanks.
Tim