https://bugzilla.wikimedia.org/show_bug.cgi?id=28144

--- Comment #6 from Tim Starling <[email protected]> 2011-03-22 06:20:04 UTC ---
Since I wasn't making much progress finding the bug by code review, I decided
to have a crack at debugging the running process with gdb, despite the lack of
symbols. I've determined the following:

* The hashtable hasn't grown; it still has hashpower=16.
* There are 134 entries still in the hashtable, so this wasn't an isolated
case.
* I looked at three entries; they all had count = processing = 2.
* By subtracting an appropriate offset from the address of the locks structure,
I could look at the client_data structure. For one of the hashtable entries,
there were two clients, with FDs 387 and 466. lsof says:

COMMAND    PID        USER   FD   TYPE DEVICE SIZE/OFF     NODE NAME
poolcount 1838 poolcounter  387u  sock    0,6      0t0 64556254 can't identify protocol
poolcount 1838 poolcounter  466u  sock    0,6      0t0 64552887 can't identify protocol
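
The offset subtraction described in the list above can be sketched in C. This is only an illustration: the struct layouts and the names client_locks and client_from_locks are hypothetical, not the actual poolcounterd definitions, which are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layouts for illustration only; the real structs differ. */
struct locks {
    int state;
};

struct client_data {
    int fd;                     /* socket descriptor of the client */
    struct locks client_locks;  /* embedded member the hashtable points at */
};

/*
 * Recover the enclosing client_data from a pointer to its embedded
 * locks member by subtracting the member's offset within the struct.
 */
struct client_data *client_from_locks(struct locks *l)
{
    assert(l != NULL);
    return (struct client_data *)
        ((char *)l - offsetof(struct client_data, client_locks));
}
```

Without debug symbols, gdb knows nothing about these types, so the same arithmetic has to be done by hand on raw addresses.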

There are 275 FDs which give "can't identify protocol", which is suspiciously
close to double the number of hashtable entries. Maybe these FDs were closed on
the remote side, but poolcounterd never called close() on them. That wouldn't
be a surprising scenario, since the structures seem to indicate that
free_client_data() was never called on them, and on_client() never calls
close() without first calling free_client_data().
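
The suspected scenario can be reproduced in isolation. The sketch below is not poolcounterd code: it uses an AF_UNIX socketpair() as a stand-in for a client connection, and the function names are made up for the example. It shows that after the peer goes away, the local descriptor stays allocated until the process itself calls close().

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 1 if fd is still an open descriptor in this process, else 0. */
int fd_is_open(int fd)
{
    assert(fd >= 0);
    return fcntl(fd, F_GETFD) != -1;
}

/*
 * Creates a connected socket pair, closes the "remote" end, and waits
 * for the "local" end to see EOF. Returns the local descriptor, which
 * is still open at that point, or -1 on error. The caller must close it.
 */
int local_fd_after_peer_close(void)
{
    int sv[2];
    char buf[1];

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
        return -1;

    close(sv[1]);                            /* the peer goes away */

    if (read(sv[0], buf, sizeof buf) != 0) { /* expect EOF (0 bytes) */
        close(sv[0]);
        return -1;
    }
    return sv[0];                            /* EOF seen, fd not released */
}
```

If a server notices EOF like this but skips its cleanup path, each such connection leaves one descriptor behind, which would fit the count of stale FDs above.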
