https://bugzilla.wikimedia.org/show_bug.cgi?id=69667
--- Comment #8 from [email protected] ---
Adding some more records:

To rule out temporary basic network problems, I started a script that pings all zookeepers from analytics1021 in an endless loop. That has already produced results:

There was no output for ~10 seconds between 15:26:04 and 15:26:14, but RTTs were low before and afterwards.

Since we have hardly any data to compare against, I collected the analytics1021 ganglia graphs that showed an exceptional increase or decrease during that period, and added the relevant parts of the kafka log and kafka gc log. (See attachment analytics1021-2014-09-04T15-26-03.tar.gz)

It seems most disks spiked.
The kafka log looks harmless.
The kafka gc log looks harmless too.

--------------------------------------------

Since both the pings and the kafka gc stopped during that time, the cause might not be network related, but rather the machine itself preempting processes and blocking them for some time.

So I started another monitor that (in an endless loop) logs the timestamp and then waits a second. That has already produced relevant results too:

There was no output for ~13 seconds between 20:47:18 and 20:47:31.

Again, most disks spiked.
The kafka log looks harmless.
The kafka gc log looks harmless too.

--------------------------------------------

Since I lack access to the system's logs: do they show anything interesting around 15:26:04 -- 15:26:14, or 20:47:18 -- 20:47:31?
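For anyone who wants to reproduce the second monitor, here is a minimal Python sketch of the idea: write a timestamp once per second, then scan the log for stretches where nothing was written. The actual script used on analytics1021 is not shown in this comment, so the function names (`heartbeat`, `load_timestamps`, `find_gaps`), the log format, and the 5-second threshold are all assumptions made for illustration.

```python
import time
from datetime import datetime, timedelta


def heartbeat(path, interval=1.0, beats=None):
    """Append one ISO timestamp to `path`, sleep `interval` seconds, repeat.

    beats=None loops forever (as the monitor in the comment does);
    a number limits the loop, which is handy for a dry run.
    """
    count = 0
    while beats is None or count < beats:
        # Reopen per beat so every line reaches the OS right away.
        with open(path, "a") as log:
            log.write(datetime.now().isoformat() + "\n")
        time.sleep(interval)
        count += 1


def load_timestamps(path):
    """Read the timestamps written by heartbeat() back into datetimes."""
    with open(path) as log:
        return [datetime.fromisoformat(line.strip()) for line in log if line.strip()]


def find_gaps(timestamps, threshold=timedelta(seconds=5)):
    """Return (before, after, length) for every pair of consecutive
    timestamps that are further apart than `threshold` (an assumed cutoff)."""
    gaps = []
    for before, after in zip(timestamps, timestamps[1:]):
        length = after - before
        if length > threshold:
            gaps.append((before, after, length))
    return gaps
```

Running `heartbeat("heartbeat.log")` in the background and later calling `find_gaps(load_timestamps("heartbeat.log"))` would list stalls like the ~13 second one above. Note that the sketch uses wall-clock time, so a clock step (for example from NTP) would also show up as a gap; it cannot tell that apart from the process being blocked.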
