Re: PoCo::Client::Ping intermittently fails to emit pings
ok, i like the idea of catching this situation in my application
however, i'm having trouble distinguishing between the case where send() fails
and the case where send() succeeds but the target does not return an ICMP Reply
client_got_pong {
[...]
my ($request, $response) = @_[ARG0, ARG1];
[...]
my ($resp_address, $roundtrip_time, $resp_time, $resp_ttl) = @{$response};
if (defined $resp_address) {
# Celebrate receiving the ICMP Reply
}
else {
if (not defined $roundtrip_time and not defined $resp_ttl) {
# Whine about the call to send() failing
}
else {
# Weep about the ICMP Reply not appearing
}
}
[...]
}
but, in fact, $roundtrip_time and $resp_ttl are undefined for both cases: the
case in which send() failed and the case in which the ICMP Reply didn't arrive
how might i distinguish between the two cases?
--sk
Rocco Caputo wrote:
The post($sender, $event, ) call sends $event response back to your
code if send() fails. The response address, round-trip time, and TTL
are undefined. The response time is "now".
You've added logging in a good place, if you want to solve this by
modifying the module. It might be better to catch the $event in your
application and log the problem there.
Re: PoCo::Client::Ping intermittently fails to emit pings
The post($sender, $event, ) call sends $event response back to your code if send() fails. The response address, round-trip time, and TTL are undefined. The response time is "now". You've added logging in a good place, if you want to solve this by modifying the module. It might be better to catch the $event in your application and log the problem there. -- Rocco Caputo - [email protected] On Apr 21, 2009, at 16:31, Stuart Kendrick wrote: i'm walking through .../perl5/site_perl/5.10.0/POE/Component/Client/ Ping.pm where do we emit an error if we are unable to send an ICMP packet? seems to me that we decide to send the ping in the very last line of 'sub poco_ping_ping': [...] # Request a ping. This code borrows heavily from Net::Ping. sub poco_ping_ping { [...] _send_packet($kernel, $heap); } and we actually send the ping on line 362, within the '_send_packet' routine, specifically, from the 'send' phrase within the 'unless' clause: sub _send_packet { my ($kernel, $heap) = @_; return unless (scalar @{$heap->{queue}}); [...] # Send the packet. If send() fails, then we bail with an error. my @user_args = (); ($event, @user_args) = @$event if ref($event) eq "ARRAY"; DEBUG and warn "sending packet sequence number $seq\n"; unless (send($heap->{socket_handle}, $msg, ICMP_FLAGS, $socket_address)) { $kernel->post( $sender, $event, [ $address,# REQ_ADDRESS $timeout,# REQ_TIMEOUT time(), # REQ_TIME @user_args, # REQ_USER_ARGS ], [ undef, # RES_ADDRESS undef, # RES_ROUNDTRIP time(), # RES_TIME undef, # RES_TTL ], ); _check_for_close($kernel, $heap); return; } [...] now, i see the comment "If send() fails, then we bail with an error". but i don't see where we actually spit out an error message. seems to me that _send_packet returns with a value of 0 (that 'return' statement just after the call to _check_for_close) ... but i don't see us checking the return value of _send_packet, back in "sub poco_ping_ping" or ... does the "$kernel-post( ..." event trigger an error message, through some mechanism i don't understand? would it be reasonable for me to add the following, in order to track the success/failure of the call to send() ? [...] unless (send($heap->{socket_handle}, $msg, ICMP_FLAGS, $socket_address)) { warn "Unable to send() ICMP packet"; $kernel->post( [...] or would i be duplicating effort performed elsewhere? actually, what i really want to do is to log such events to syslog, so am i headed in a fruitful direction by adding the following? [...] unless (send($heap->{socket_handle}, $msg, ICMP_FLAGS, $socket_address)) { use Sys::Syslog; openlog('PoCo-Ping', 'nofatal,ndelay,pid', 'local0'); syslog('error', 'Unable to send() ICMP packet'); closelog(); $kernel->post( [...] --sk
Re: PoCo::Client::Ping intermittently fails to emit pings
i'm walking through .../perl5/site_perl/5.10.0/POE/Component/Client/Ping.pm
where do we emit an error if we are unable to send an ICMP packet?
seems to me that we decide to send the ping in the very last line of 'sub
poco_ping_ping':
[...]
# Request a ping. This code borrows heavily from Net::Ping.
sub poco_ping_ping {
[...]
_send_packet($kernel, $heap);
}
and we actually send the ping on line 362, within the '_send_packet' routine,
specifically, from the 'send' phrase within the 'unless' clause:
sub _send_packet {
my ($kernel, $heap) = @_;
return unless (scalar @{$heap->{queue}});
[...]
# Send the packet. If send() fails, then we bail with an error.
my @user_args = ();
($event, @user_args) = @$event if ref($event) eq "ARRAY";
DEBUG and warn "sending packet sequence number $seq\n";
unless (send($heap->{socket_handle}, $msg, ICMP_FLAGS, $socket_address)) {
$kernel->post(
$sender, $event,
[ $address,# REQ_ADDRESS
$timeout,# REQ_TIMEOUT
time(), # REQ_TIME
@user_args, # REQ_USER_ARGS
],
[ undef, # RES_ADDRESS
undef, # RES_ROUNDTRIP
time(), # RES_TIME
undef, # RES_TTL
],
);
_check_for_close($kernel, $heap);
return;
}
[...]
now, i see the comment "If send() fails, then we bail with an error". but i
don't see where we actually spit out an error message. seems to me that
_send_packet returns with a value of 0 (that 'return' statement just after the
call to _check_for_close) ... but i don't see us checking the return value of
_send_packet, back in "sub poco_ping_ping"
or ... does the "$kernel-post( ..." event trigger an error message, through some
mechanism i don't understand?
would it be reasonable for me to add the following, in order to track the
success/failure of the call to send() ?
[...]
unless (send($heap->{socket_handle}, $msg, ICMP_FLAGS, $socket_address)) {
warn "Unable to send() ICMP packet";
$kernel->post(
[...]
or would i be duplicating effort performed elsewhere?
actually, what i really want to do is to log such events to syslog, so am i
headed in a fruitful direction by adding the following?
[...]
unless (send($heap->{socket_handle}, $msg, ICMP_FLAGS, $socket_address)) {
use Sys::Syslog;
openlog('PoCo-Ping', 'nofatal,ndelay,pid', 'local0');
syslog('error', 'Unable to send() ICMP packet');
closelog();
$kernel->post(
[...]
--sk
Re: PoCo::Client::Ping intermittently fails to emit pings
hi rocco,
yes, ok, i think i'm following you:
-i know that these ICMP Echos aren't leaving the NIC (during these multi-minute
periods), from the sniffer i have posted just outside the box
-many ICMP Echos are exitting the NIC ... but i happen to know which boxes i'm
pinging using Nagios, and which I'm pinging using NodeWatch (which employs POE),
and the pings i'm seeing are headed to/from the Nagios-monitored boxes, not the
NodeWatch-monitored boxes. additionally, NodeWatch logs to syslog when it is
emitting pings ... and, in at least one of my traces, no ICMP Echos leave the
box at all at that point ... (presumably, Nagios was off doing something else at
the time)
-so these ICMP Echos are being dropped somewhere inside the box
-as you say, perhaps the NIC's outbound buffer is overflowing -- i hadn't
thought of that. checking the output of 'netstat -i' ... i see zeros underneath
the TX-ERR, TX-DRP, and TX-OVR columns (and under the RX-ERR/DRP/OVR columns as
well). this suggests that the NIC isn't dropping frames ... but i haven't a
clue how reliable this output is. i could imagine a NIC driver which doesn't
update the relevant counter somewhere, when it discards a frame. so let's keep
this possibility on the table
-i wonder if the OS tracks 'outbound ICMP drops'?
gnat> netstat -s -w
[...]
Icmp:
3510705 ICMP messages received
34835 input ICMP message failed.
ICMP input histogram:
destination unreachable: 111508
timeout in transit: 26581
echo requests: 84611
echo replies: 3288005
7155838 ICMP messages sent
0 ICMP messages failed
ICMP output histogram:
destination unreachable: 64167
echo request: 7007060
echo replies: 84611
i wonder what 'input ICMP message failed' means? are those ICMP Echos for which
the OS didn't receive any replies? or are those ICMP messages which the OS
received but dropped due to resource constraints? or ... ICMP messages which
the OS was asked to send but which the OS dropped due to resource constraints?
this morning, i poked through Benvunti's "Understanding Linux Network
Internals", without success. i have 'Understanding the Linux Kernel' on order.
i should probably dig up Steven's 'Implementation' (Vol 2?) as well. any
other tips on where to dig to learn what "input ICMP message failed" means?
-now, if my application were CPU-starved, wouldn't i see the pings go out
eventually? i mean, my PoCo pinger might have timed out already and have
declared the ping lost ... but wouldn't i see the outbound ICMP Echos appear,
eventually, in the packet trace? [by 'eventually', i mean delayed by seconds,
not delayed by minutes ... i haven't poked deep enough into the trace to see
whether or not the ICMP Echos appear minutes late ... not sure how i would
set-up that experiment in fact]
-yes, i do tend to burst the pings:
$max_pings = 100;
[...]
# Throw out more pings
while (@addrs and $heap->{open_pings} < $max_pings}) {
my $addr = shift @addrs;
log_it("Pinging $addr") if ($debug == 5 or $debug == 6);
# "Pinger, do a ping and return the results as a pong event. The
# address to ping is $addr."
$result = $kernel->post( 'pinger',
'ping',
'pong',
$addr,
$timeout{$addr}
);
[...]
i suppose i could try dropping $max_pings to something smaller. but, i'm not
ready to actually solve the problem; instead, i want to verify that i understand
the problem, notably, where exactly the OS is dropping the ball
==> anyway, so this is why i'm now checking the return code on post() ... that's
a start, perhaps, but as you point out, hardly the end of my trouble-shooting
efforts. i'm beginning to think that what i really want is Dtrace for Linux ...
which doesn't exist ... but perhaps i can squeeze enough juice out of SystemTap
to tackle this. stapprobes.netdev should allow me to see when data arrives on
the NIC ... so i could at least tell whether or not the ICMP Echo arrives at the
NIC ... and perhaps stapprobes.socket would tell me whether or not the OS
received the request to create a raw socket to service the ICMP Echo
==> yes, i am poking at this from other angles: what changed recently? well,
the answer is: a lot. i wiped the box, which was running SuSE 9.3, and
installed CentOS 5.3. that's quite a bit of change, right there! i'm using
perl-5.10.0 now, whereas i was using perl-5.8.8. my other NodeWatch boxes
already run CentOS 5.3 ... so i'm gradually replacing perl-5.8.8 with
perl-5.10.0 on them ... that would be a clue, if the problem starts appearing
once i'm using perl-5.10.0 the actual applications & cron jobs i run today on
the box are the same i was running prior to the OS change. yes, i'm staring at
the timing involved ... at this point, the timing has becom
Re: PoCo::Client::Ping intermittently fails to emit pings
On Apr 15, 2009, at 07:59, Stuart Kendrick wrote: my favorite pinging application (in-house app, uses POE::Component::Client::Ping) intermittently reports a rash of missed replies. happens roughly the same time each night; the condition lasts for several minutes Time-based failures implicate periodic jobs. What's cron doing around that time? Details: - perl-5.10.0, POE 1.0.4, POE::Component::Client::Ping 1.14 - CentOS 5.3 (2.6.18-128.el5 x86_64 x86_64 x86_64 GNU/Linux) - the application is pinging ~120 hosts every 30 seconds Does the application ping them all in a burst? This could cause lost packets. As I mentioned, ICMP isn't reliable, and congestion on the network or interface could cause silent packet drops. - the box gets busy during this time, in terms of disk I/O (average latency of ~100ms ... fifteen minute rolling load average spikes to ~18: likely stems from several disk-intensive jobs which run during this period). this busyness lasts for ~5 hours Heavy load could be causing your monitor program to lag, triggering timeouts---possibly before the ICMP requests are sent. Reducing the pinger's niceness slightly may help, but you may just be overloading the system. See what the machine needs to handle the work you've given it. Consider distributing the work across multiple machines. - the box hosts a number of pinging applications (Nagios, a bunch of in-house apps), typically employing fping Probably not an issue, unless they stream an awful lot of requests. Load spikes may push the pinging activity together, which could result in lost packets. For example, 120 pings over 30 seconds is fine, but a lag spike might delay them. The lagged ones will try to catch up in a burst when the lag spike ends---or when the OS says your pingers have been CPU starved long enough. - i've been running the application for a handful of years now, a handful of instances, monitoring ~100 - 500 hosts per instance. this is novel behavior Novel behavior suggests the question "What has changed?" The variable could be something new and significant or something that has gradually approached a capacity threshold. Some things to think about: Has the number of hosts to ping increased by a lot recently or steadily over time? Have the disk-intensive jobs' workload increased by a lot recently or steadily over time? -- Rocco Caputo - [email protected]
Re: PoCo::Client::Ping intermittently fails to emit pings
ICMP is not a reliable network protocol. IP is allowed to drop the packets without notice. The only way to detect the case is when too much time has elapsed without a response---$timeout is reached. Enqueuing your request in the event queue is not the same thing as sending an ICMP packet. post() may succeed where the ping fails. -- Rocco Caputo - [email protected] On Apr 19, 2009, at 12:44, Stuart Kendrick wrote: If the OS fails to execute a ping, does there exist a mechanism for the OS to tell POE about that? And for me to 'notice' such an incident? I was hoping that checking the return value on a 'post' would do that ... but thus far, no dice (i.e. the box on which my script is running intermittently fails to emit pings, but by script doesn't log anything.) [...] # "Pinger, do a ping and return the results as a pong event. The # address to ping is $addr." $result = $kernel->post( 'pinger', 'ping', 'pong', $addr, $timeout, ); log_error("Problem pinging $addr: $!") unless $result; [...] From http://search.cpan.org/~rcaputo/POE-1.004/lib/POE/Kernel.pm#post_DESTINATION,_EVENT_NAME_ [,_PARAMETER_LIST] "post() returns a Boolean value indicating whether the message was successfully enqueued. If post() returns false, $! is set to explain the failure:" --sk Stuart Kendrick wrote: my favorite pinging application (in-house app, uses POE::Component::Client::Ping) intermittently reports a rash of missed replies. happens roughly the same time each night; the condition lasts for several minutes
Re: PoCo::Client::Ping intermittently fails to emit pings
If the OS fails to execute a ping, does there exist a mechanism for the OS to
tell POE about that? And for me to 'notice' such an incident?
I was hoping that checking the return value on a 'post' would do that ... but
thus far, no dice (i.e. the box on which my script is running intermittently
fails to emit pings, but by script doesn't log anything.)
[...]
# "Pinger, do a ping and return the results as a pong event. The
# address to ping is $addr."
$result = $kernel->post( 'pinger',
'ping',
'pong',
$addr,
$timeout,
);
log_error("Problem pinging $addr: $!") unless $result;
[...]
From
http://search.cpan.org/~rcaputo/POE-1.004/lib/POE/Kernel.pm#post_DESTINATION,_EVENT_NAME_[,_PARAMETER_LIST]
"post() returns a Boolean value indicating whether the message was successfully
enqueued. If post() returns false, $! is set to explain the failure:"
--sk
Stuart Kendrick wrote:
my favorite pinging application (in-house app, uses
POE::Component::Client::Ping) intermittently reports a rash of missed
replies. happens roughly the same time each night; the condition lasts
for several minutes
