https://bugzilla.wikimedia.org/show_bug.cgi?id=47141
--- Comment #4 from Munagala Ramanath (Ram) <[email protected]> ---
As I mentioned at the platform meeting this week, we are seeing a number of
failures in the log files about not being able to connect to the RMI Registry
on various hosts. The exceptions look like this:

----------------------------------------------------------------------
2013-04-17 00:00:48,681 [pool-1-thread-64] WARN org.wikimedia.lsearch.interoperability.RMIMessengerClient - Cannot contact RMI registry for host search1020 : error during JRMP connection establishment; nested exception is: java.net.SocketTimeoutException: Read timed out
2013-04-17 00:00:48,914 [pool-1-thread-41] WARN org.wikimedia.lsearch.interoperability.RMIMessengerClient - Cannot contact RMI registry for host search1020 : error during JRMP connection establishment; nested exception is: java.net.SocketTimeoutException: Read timed out
----------------------------------------------------------------------

While a plausible explanation for client-side timeouts is, as Tim suggests,
that searches are having to access disk, failing to even connect to the RMI
Registry is indicative of a different issue -- the system being so busy that
it is not able to accept a connection.

So I wrote a script to do some rudimentary log analysis; the results are
appended below for search1015 and search1016. The histogram shows the number
of these failures on an hour-by-hour basis. Notice that we see a spike around
0500 which only abates around 3 to 4 hours later. This is consistent with the
failures being related to rsync'ing new snapshots, since snapshots are
generated every day at 0430. However, there is a second spike around 0700 as
well that is unexplained.

The numbers are much higher for Apr-15 since I asked Peter to change the log
level from WARN to TRACE over the weekend, resulting in substantially larger
log files (~11GB versus 370MB normally); I'm guessing that having to log so
much data itself caused additional failures.
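For anyone wanting to repeat this kind of analysis, the following is a minimal
Python sketch of how such an hourly count could be produced. It is not the
script used for the numbers in this comment. The regular expression assumes
that every failure line looks like the two lines quoted above, and the names
summarize and format_histogram are made up for illustration.

```python
import re
from collections import Counter

# Assumption: each failure line starts with "YYYY-MM-DD HH:MM:SS,mmm" and
# contains "Cannot contact RMI registry for host <name>", as in the excerpt.
FAILURE_RE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<hour>\d{2}):\d{2}:\d{2},\d{3} .*"
    r"Cannot contact RMI registry for host (?P<host>\S+)"
)


def summarize(lines):
    """Count registry failures per (date, hour) and per target host."""
    by_hour = Counter()
    by_host = Counter()
    for line in lines:
        match = FAILURE_RE.search(line)
        if match is None:
            continue  # not a registry failure line
        by_hour[(match.group("date"), int(match.group("hour")))] += 1
        by_host[match.group("host")] += 1
    return by_hour, by_host


def format_histogram(by_hour, date):
    """Render one day's counts as 24 hourly rows plus a total."""
    counts = [by_hour[(date, hour)] for hour in range(24)]
    rows = [f"{hour:2d} - : {count}" for hour, count in enumerate(counts)]
    rows.append(f"Total = {sum(counts)}")
    return "\n".join(rows)


# The two log lines quoted in this comment, plus one unrelated line.
SAMPLE = [
    "2013-04-17 00:00:48,681 [pool-1-thread-64] WARN "
    "org.wikimedia.lsearch.interoperability.RMIMessengerClient - "
    "Cannot contact RMI registry for host search1020 : error during JRMP "
    "connection establishment; nested exception is: "
    "java.net.SocketTimeoutException: Read timed out",
    "2013-04-17 00:00:48,914 [pool-1-thread-41] WARN "
    "org.wikimedia.lsearch.interoperability.RMIMessengerClient - "
    "Cannot contact RMI registry for host search1020 : error during JRMP "
    "connection establishment; nested exception is: "
    "java.net.SocketTimeoutException: Read timed out",
    "some unrelated log line",
]

if __name__ == "__main__":
    hours, hosts = summarize(SAMPLE)
    print(format_histogram(hours, "2013-04-17"))
    for host, count in hosts.most_common():
        print(f"{host}: {count}")
```

To run it against a real log, pass an open file object to summarize() in
place of SAMPLE; the file is read line by line, so large TRACE-level logs do
not need to fit in memory.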
All these failures involve attempts to reach search1019 or search1020, though
why there are many more failures for the latter than the former (counts shown
at the end of each histogram) is another item needing investigation.

Histograms of RMI Registry failures on search1015

Hr.       Apr-16   Apr-15
-------------------------
 0 - :         0      217
 1 - :         7      142
 2 - :        12      267
 3 - :         3      333
 4 - :        63      504
 5 - :       599     1782
 6 - :       141      684
 7 - :       314      864
 8 - :        34      245
 9 - :        47       40
10 - :        10       64
11 - :        36       21
12 - :        10       74
13 - :        32      371
14 - :        71      447
15 - :        50      666
16 - :        65      827
17 - :       465      451
18 - :        66      589
19 - :       176        7
20 - :        79       16
21 - :        97       18
22 - :       144       28
23 - :        81        3
Total =     2602     8660

search1020: 2210
search1019: 390

Histograms of RMI Registry failures on search1016

Hr.       Apr-16   Apr-15
-------------------------
 0 - :         0      208
 1 - :        10      166
 2 - :        23      338
 3 - :         0      351
 4 - :        71      485
 5 - :       621     1677
 6 - :       131      766
 7 - :       325      951
 8 - :        23      241
 9 - :        40       41
10 - :         9       57
11 - :        28       24
12 - :         6       74
13 - :        28      388
14 - :        65      479
15 - :        40      677
16 - :        65      827
17 - :       441      473
18 - :        61      114
19 - :       177       10
20 - :        94       28
21 - :        96       21
22 - :       180       20
23 - :        89        0
Total =     2623     8416

search1020: 2239
search1019: 382
