[Bug 47141] Reindexing causes nagios alerts

bugzilla-daemon Wed, 17 Apr 2013 15:11:53 -0700

https://bugzilla.wikimedia.org/show_bug.cgi?id=47141


--- Comment #4 from Munagala Ramanath (Ram) <[email protected]> ---
As I mentioned at the platform meeting this week, we are seeing a number of
failures in the log files about not being able to connect to the RMI Registry
on various hosts. The exceptions look like this:

----------------------------------------------------------------------
2013-04-17 00:00:48,681 [pool-1-thread-64] WARN 
org.wikimedia.lsearch.interoperability.RMIMessengerClient  - Cannot contact RMI
registry for host search1020 : error during JRMP connection establishment;
nested exception is:
  java.net.SocketTimeoutException: Read timed out
2013-04-17 00:00:48,914 [pool-1-thread-41] WARN 
org.wikimedia.lsearch.interoperability.RMIMessengerClient  - Cannot contact RMI
registry for host search1020 : error during JRMP connection establishment;
nested exception is:
  java.net.SocketTimeoutException: Read timed out
---------------------------------------------------------------------

While a plausible explanation for client-side timeouts is, as Tim suggests,
that searches are having to access disk, failing to even connect to the RMI
Registry indicates a different issue: the system is so busy that it cannot
accept a connection.

So I wrote a script to do some rudimentary log analysis; the results are
appended below for search1015 and search1016. The histogram shows the number of
these failures on an hour-by-hour basis. Notice that we see a spike around 0500
which only abates around 3 to 4 hours later. This is consistent with the
failures being related to rsync'ing new snapshots since snapshots are generated
every day at 0430. However, there is a second spike around 0700 as well that is
unexplained.
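
For reference, an hourly tally of this kind can be sketched in Python as
below. This is a minimal illustration, not the script used to produce the
numbers in this comment. It assumes each WARN entry sits on one physical line
in the real log file (the excerpt above is wrapped by the mailer) and that the
timestamp format matches the excerpt. The names ENTRY, hourly_failures and
format_histogram are made up for the example.

```python
import re
from collections import Counter

# Assumption: timestamp and message share one line in the actual log file.
ENTRY = re.compile(
    r"^\d{4}-\d{2}-\d{2} (?P<hour>\d{2}):\d{2}:\d{2},\d{3} .*"
    r"Cannot contact RMI registry for host (?P<host>\S+)"
)


def hourly_failures(lines):
    """Count RMI registry connection failures per hour of day (0-23)."""
    counts = Counter()
    for line in lines:
        match = ENTRY.match(line)
        if match:
            counts[int(match.group("hour"))] += 1
    return counts


def format_histogram(counts):
    """Render one row per hour, followed by a total line."""
    rows = [f"{hour:2d} - : {counts.get(hour, 0):6d}" for hour in range(24)]
    rows.append(f"Total = {sum(counts.values())}")
    return "\n".join(rows)
```

Passing an open log file to hourly_failures() and printing the result of
format_histogram() gives one column of a table like the ones further down.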

The numbers are much higher for Apr-15 because I asked Peter to lower the log
level from WARN to TRACE over the weekend, resulting in substantially larger
log files (~11GB versus 370MB normally); I'm guessing that having to log so
much data itself caused additional failures.

All these failures involve attempts to reach search1019 or search1020, though
why there are many more failures for the latter than the former (counts shown
at the end of each histogram) is another item needing investigation.
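
The per-host counts can be tallied in the same way. Again, this is only a
sketch under the same assumptions about the log format; HOST and
failures_by_host are illustrative names, not part of lsearch.

```python
import re
from collections import Counter

# Matches the target host named in the RMIMessengerClient warning text.
HOST = re.compile(r"Cannot contact RMI registry for host (\S+)")


def failures_by_host(lines):
    """Count RMI registry connection failures per target host."""
    counts = Counter()
    for line in lines:
        match = HOST.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```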

Histograms of RMI Registry failures on search1015
Hr.   Apr-16  Apr-15
--------------------
 0 - :    0      217
 1 - :    7      142
 2 - :   12      267
 3 - :    3      333
 4 - :   63      504
 5 - :  599     1782
 6 - :  141      684
 7 - :  314      864
 8 - :   34      245
 9 - :   47       40
10 - :   10       64
11 - :   36       21
12 - :   10       74
13 - :   32      371
14 - :   71      447
15 - :   50      666
16 - :   65      827
17 - :  465      451
18 - :   66      589
19 - :  176        7
20 - :   79       16
21 - :   97       18
22 - :  144       28
23 - :   81        3
Total = 2602    8660
search1020: 2210
search1019: 390

Histograms of RMI Registry failures on search1016
Hr.   Apr-16  Apr-15
--------------------
 0 - :    0      208
 1 - :   10      166
 2 - :   23      338
 3 - :    0      351
 4 - :   71      485
 5 - :  621     1677
 6 - :  131      766
 7 - :  325      951
 8 - :   23      241
 9 - :   40       41
10 - :    9       57
11 - :   28       24
12 - :    6       74
13 - :   28      388
14 - :   65      479
15 - :   40      677
16 - :   65      827
17 - :  441      473
18 - :   61      114
19 - :  177       10
20 - :   94       28
21 - :   96       21
22 - :  180       20
23 - :   89        0
Total = 2623    8416
search1020: 2239
search1019: 382

