https://bugzilla.wikimedia.org/show_bug.cgi?id=69667

--- Comment #8 from [email protected] ---
Adding some more records:

To rule out temporary basic network problems, I started a script to
ping all zookeepers from analytics1021 in an endless loop.
That produced results already:

  There was no output for ~10 seconds between 15:26:04 and 15:26:14

But round-trip times were low before and afterwards. Since we have
hardly any data to compare against, I collected the analytics1021
ganglia graphs that showed an exceptional increase or decrease during
that period, and added the relevant parts of the kafka log and kafka
gc log.
(See attachment analytics1021-2014-09-04T15-26-03.tar.gz)

It seems most disks spiked.
kafka log looks harmless.
kafka gc log looks harmless too.
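
For illustration, a gap like the one above could be located in
timestamped output with a sketch along these lines. This is not the
script that was actually run; the line format (each line starting with
HH:MM:SS), the 5 second threshold, and the sample lines are
assumptions made for the example.

```python
from datetime import datetime, timedelta


def parse_times(lines, fmt="%H:%M:%S"):
    """Parse the leading timestamp of each non-empty line.

    Assumes every line starts with an HH:MM:SS timestamp
    (a hypothetical log format, not taken from the real script).
    """
    return [datetime.strptime(line.split()[0], fmt)
            for line in lines if line.strip()]


def find_gaps(timestamps, threshold=timedelta(seconds=5)):
    """Return (previous, current, delta) for consecutive timestamps
    that are further apart than `threshold`."""
    gaps = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        delta = cur - prev
        if delta > threshold:
            gaps.append((prev, cur, delta))
    return gaps


if __name__ == "__main__":
    # Made-up sample lines; only the 15:26:04 / 15:26:14 pair
    # mirrors the gap reported above.
    sample = ["15:26:03 ok", "15:26:04 ok", "15:26:14 ok", "15:26:15 ok"]
    for prev, cur, delta in find_gaps(parse_times(sample)):
        print(f"no output for ~{delta.total_seconds():.0f}s "
              f"between {prev:%H:%M:%S} and {cur:%H:%M:%S}")
```

The sketch ignores dates, so it would misreport a gap that spans
midnight.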

--------------------------------------------

Since both the pings and kafka gc stopped during that time, the cause
might not be network related. Instead, the machine itself may be
preempting processes and blocking them for some time.

So I started another monitor that, in an endless loop, logs the
current timestamp and then waits a second.
That has already produced a relevant result too:

  There was no output for ~13 seconds between 20:47:18 and 20:47:31

Again most disks spiked.
kafka log looks harmless.
kafka gc log looks harmless too.
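
A minimal sketch of such a timestamp monitor, for anyone who wants to
run something similar on another host. It is not the monitor used
above: the function names, the 1 second tolerance, and the use of a
monotonic clock to flag late wakeups are assumptions for illustration.

```python
import time
from datetime import datetime


def stall_seconds(previous, current, interval=1.0, tolerance=1.0):
    """Return how far a wakeup overran the expected interval,
    or 0.0 if the overrun is within `tolerance` seconds."""
    overrun = (current - previous) - interval
    return overrun if overrun > tolerance else 0.0


def heartbeat(interval=1.0, tolerance=1.0, iterations=None):
    """Log a timestamp every `interval` seconds and mark late wakeups.

    With iterations=None the loop runs until interrupted.
    """
    previous = time.monotonic()
    count = 0
    while iterations is None or count < iterations:
        time.sleep(interval)
        current = time.monotonic()
        stalled = stall_seconds(previous, current, interval, tolerance)
        note = f"  STALL ~{stalled:.1f}s" if stalled else ""
        # flush so that a later stall cannot hide already written lines
        print(f"{datetime.now().isoformat(timespec='seconds')}{note}",
              flush=True)
        previous = current
        count += 1
```

Calling heartbeat() with no arguments loops until interrupted. The
stall check uses time.monotonic() so that wall-clock adjustments
(e.g. by NTP) do not show up as false stalls; the logged timestamp
itself is still wall-clock time so it can be matched against other
logs.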

--------------------------------------------

I lack access to the system's logs. Could someone with access check
whether they show anything interesting around
  15:26:04 -- 15:26:14, or
  20:47:18 -- 20:47:31
?

