Re: PoCo::Client::Ping intermittently fails to emit pings

2009-04-24 Thread Stuart Kendrick

ok, i like the idea of catching this situation in my application

however, i'm having trouble distinguishing between the case where send() fails 
and the case where send() succeeds but the target does not return an ICMP Reply


client_got_pong {
  [...]
  my ($request, $response) = @_[ARG0, ARG1];
  [...]
  my ($resp_address, $roundtrip_time, $resp_time, $resp_ttl) = @{$response};

  if (defined $resp_address) {
  # Celebrate receiving the ICMP Reply
  }
  else {
  if (not defined $roundtrip_time and not defined $resp_ttl) {
  # Whine about the call to send() failing
  }
  else {
  # Weep about the ICMP Reply not appearing
  }
  }

  [...]
}

but, in fact, $roundtrip_time and $resp_ttl are undefined for both cases:  the 
case in which send() failed and the case in which the ICMP Reply didn't arrive


how might i distinguish between the two cases?

--sk




Rocco Caputo wrote:
The post($sender, $event, ) call sends $event response back to your 
code if send() fails.  The response address, round-trip time, and TTL 
are undefined.  The response time is "now".


You've added logging in a good place, if you want to solve this by 
modifying the module.  It might be better to catch the $event in your 
application and log the problem there.




Re: PoCo::Client::Ping intermittently fails to emit pings

2009-04-21 Thread Rocco Caputo
The post($sender, $event, ) call sends $event response back to  
your code if send() fails.  The response address, round-trip time, and  
TTL are undefined.  The response time is "now".


You've added logging in a good place, if you want to solve this by  
modifying the module.  It might be better to catch the $event in your  
application and log the problem there.


--
Rocco Caputo - [email protected]


On Apr 21, 2009, at 16:31, Stuart Kendrick wrote:

i'm walking through .../perl5/site_perl/5.10.0/POE/Component/Client/ 
Ping.pm


where do we emit an error if we are unable to send an ICMP packet?



seems to me that we decide to send the ping in the very last line of  
'sub poco_ping_ping':


[...]
# Request a ping.  This code borrows heavily from Net::Ping.

sub poco_ping_ping {
[...]
 _send_packet($kernel, $heap);
}


and we actually send the ping on line 362, within the '_send_packet'  
routine, specifically, from the 'send' phrase within the 'unless'   
clause:



sub _send_packet {
 my ($kernel, $heap) = @_;
 return unless (scalar @{$heap->{queue}});
[...]
 # Send the packet.  If send() fails, then we bail with an error.
 my @user_args = ();
 ($event, @user_args) = @$event if ref($event) eq "ARRAY";

 DEBUG and warn "sending packet sequence number $seq\n";
 unless (send($heap->{socket_handle}, $msg, ICMP_FLAGS,  
$socket_address)) {

   $kernel->post(
 $sender, $event,
 [ $address,# REQ_ADDRESS
   $timeout,# REQ_TIMEOUT
   time(),  # REQ_TIME
   @user_args,  # REQ_USER_ARGS
 ],
 [ undef,   # RES_ADDRESS
   undef,   # RES_ROUNDTRIP
   time(),  # RES_TIME
   undef,   # RES_TTL
 ],
   );
   _check_for_close($kernel, $heap);
   return;
 }
[...]


now, i see the comment "If send() fails, then we bail with an  
error".  but i don't see where we actually spit out an error  
message.  seems to me that _send_packet returns with a value of 0  
(that 'return' statement just after the call to  
_check_for_close) ... but i don't see us checking the return value  
of _send_packet, back in "sub poco_ping_ping"



or ... does the "$kernel-post( ..." event trigger an error message,  
through some mechanism i don't understand?


would it be reasonable for me to add the following, in order to  
track the success/failure of the call to send()  ?


[...]
 unless (send($heap->{socket_handle}, $msg, ICMP_FLAGS,  
$socket_address)) {

   warn "Unable to send() ICMP packet";
   $kernel->post(
[...]

or would i be duplicating effort performed elsewhere?

actually, what i really want to do is to log such events to syslog,  
so am i headed in a fruitful direction by adding the following?


[...]
 unless (send($heap->{socket_handle}, $msg, ICMP_FLAGS,  
$socket_address)) {

   use Sys::Syslog;
   openlog('PoCo-Ping', 'nofatal,ndelay,pid', 'local0');
   syslog('error', 'Unable to send() ICMP packet');
   closelog();
   $kernel->post(
[...]


--sk




Re: PoCo::Client::Ping intermittently fails to emit pings

2009-04-21 Thread Stuart Kendrick

i'm walking through .../perl5/site_perl/5.10.0/POE/Component/Client/Ping.pm

where do we emit an error if we are unable to send an ICMP packet?



seems to me that we decide to send the ping in the very last line of 'sub 
poco_ping_ping':


[...]
# Request a ping.  This code borrows heavily from Net::Ping.

sub poco_ping_ping {
[...]
  _send_packet($kernel, $heap);
}


and we actually send the ping on line 362, within the '_send_packet' routine, 
specifically, from the 'send' phrase within the 'unless'  clause:



sub _send_packet {
  my ($kernel, $heap) = @_;
  return unless (scalar @{$heap->{queue}});
[...]
  # Send the packet.  If send() fails, then we bail with an error.
  my @user_args = ();
  ($event, @user_args) = @$event if ref($event) eq "ARRAY";

  DEBUG and warn "sending packet sequence number $seq\n";
  unless (send($heap->{socket_handle}, $msg, ICMP_FLAGS, $socket_address)) {
$kernel->post(
  $sender, $event,
  [ $address,# REQ_ADDRESS
$timeout,# REQ_TIMEOUT
time(),  # REQ_TIME
@user_args,  # REQ_USER_ARGS
  ],
  [ undef,   # RES_ADDRESS
undef,   # RES_ROUNDTRIP
time(),  # RES_TIME
undef,   # RES_TTL
  ],
);
_check_for_close($kernel, $heap);
return;
  }
[...]


now, i see the comment "If send() fails, then we bail with an error".  but i 
don't see where we actually spit out an error message.  seems to me that 
_send_packet returns with a value of 0 (that 'return' statement just after the 
call to _check_for_close) ... but i don't see us checking the return value of 
_send_packet, back in "sub poco_ping_ping"



or ... does the "$kernel-post( ..." event trigger an error message, through some 
mechanism i don't understand?


would it be reasonable for me to add the following, in order to track the 
success/failure of the call to send()  ?


[...]
  unless (send($heap->{socket_handle}, $msg, ICMP_FLAGS, $socket_address)) {
warn "Unable to send() ICMP packet";
$kernel->post(
[...]

or would i be duplicating effort performed elsewhere?

actually, what i really want to do is to log such events to syslog, so am i 
headed in a fruitful direction by adding the following?


[...]
  unless (send($heap->{socket_handle}, $msg, ICMP_FLAGS, $socket_address)) {
use Sys::Syslog;
openlog('PoCo-Ping', 'nofatal,ndelay,pid', 'local0');
syslog('error', 'Unable to send() ICMP packet');
closelog();
$kernel->post(
[...]


--sk


Re: PoCo::Client::Ping intermittently fails to emit pings

2009-04-21 Thread Stuart Kendrick

hi rocco,

yes, ok, i think i'm following you:

-i know that these ICMP Echos aren't leaving the NIC (during these multi-minute 
periods), from the sniffer i have posted just outside the box


-many ICMP Echos are exitting the NIC ... but i happen to know which boxes i'm 
pinging using Nagios, and which I'm pinging using NodeWatch (which employs POE), 
and the pings i'm seeing are headed to/from the Nagios-monitored boxes, not the 
NodeWatch-monitored boxes.  additionally, NodeWatch logs to syslog when it is 
emitting pings ... and, in at least one of my traces, no ICMP Echos leave the 
box at all at that point ... (presumably, Nagios was off doing something else at 
the time)


-so these ICMP Echos are being dropped somewhere inside the box

-as you say, perhaps the NIC's outbound buffer is overflowing -- i hadn't 
thought of that.  checking the output of 'netstat -i' ... i see zeros underneath 
the TX-ERR, TX-DRP, and TX-OVR columns (and under the RX-ERR/DRP/OVR columns as 
well).  this suggests that the NIC isn't dropping frames ... but i haven't a 
clue how reliable this output is.  i could imagine a NIC driver which doesn't 
update the relevant counter somewhere, when it discards a frame.  so let's keep 
this possibility on the table


-i wonder if the OS tracks 'outbound ICMP drops'?

gnat> netstat -s -w
[...]
Icmp:
3510705 ICMP messages received
34835 input ICMP message failed.
ICMP input histogram:
destination unreachable: 111508
timeout in transit: 26581
echo requests: 84611
echo replies: 3288005
7155838 ICMP messages sent
0 ICMP messages failed
ICMP output histogram:
destination unreachable: 64167
echo request: 7007060
echo replies: 84611

i wonder what 'input ICMP message failed' means?  are those ICMP Echos for which 
the OS didn't receive any replies?  or are those ICMP messages which the OS 
received but dropped due to resource constraints?  or ... ICMP messages which 
the OS was asked to send but which the OS dropped due to resource constraints? 
this morning, i poked through Benvunti's "Understanding Linux Network 
Internals", without success.  i have 'Understanding the Linux Kernel' on order. 
 i should probably dig up Steven's 'Implementation' (Vol 2?) as well.  any 
other tips on where to dig to learn what "input ICMP message failed" means?


-now, if my application were CPU-starved, wouldn't i see the pings go out 
eventually?  i mean, my PoCo pinger might have timed out already and have 
declared the ping lost ... but wouldn't i see the outbound ICMP Echos appear, 
eventually, in the packet trace?  [by 'eventually', i mean delayed by seconds, 
not delayed by minutes ... i haven't poked deep enough into the trace to see 
whether or not the ICMP Echos appear minutes late ... not sure how i would 
set-up that experiment in fact]


-yes, i do tend to burst the pings:

  $max_pings = 100;

  [...]
  # Throw out more pings
  while (@addrs and $heap->{open_pings} < $max_pings}) {
my $addr = shift @addrs;
log_it("Pinging $addr") if ($debug == 5 or $debug == 6);

# "Pinger, do a ping and return the results as a pong event.  The
# address to ping is $addr."
$result = $kernel->post(  'pinger',
  'ping',
  'pong',
  $addr,
  $timeout{$addr}
   );
  [...]

i suppose i could try dropping $max_pings to something smaller.  but, i'm not 
ready to actually solve the problem; instead, i want to verify that i understand 
the problem, notably, where exactly the OS is dropping the ball



==> anyway, so this is why i'm now checking the return code on post() ... that's 
a start, perhaps, but as you point out, hardly the end of my trouble-shooting 
efforts.  i'm beginning to think that what i really want is Dtrace for Linux ... 
which doesn't exist ... but perhaps i can squeeze enough juice out of SystemTap 
to tackle this.  stapprobes.netdev should allow me to see when data arrives on 
the NIC ... so i could at least tell whether or not the ICMP Echo arrives at the 
NIC ... and perhaps stapprobes.socket would tell me whether or not the OS 
received the request to create a raw socket to service the ICMP Echo



==> yes, i am poking at this from other angles:  what changed recently?  well, 
the answer is:  a lot.  i wiped the box, which was running SuSE 9.3, and 
installed CentOS 5.3.  that's quite a bit of change, right there!  i'm using 
perl-5.10.0 now, whereas i was using perl-5.8.8.  my other NodeWatch boxes 
already run CentOS 5.3 ... so i'm gradually replacing perl-5.8.8 with 
perl-5.10.0 on them ... that would be a clue, if the problem starts appearing 
once i'm using perl-5.10.0  the actual applications & cron jobs i run today on 
the box are the same i was running prior to the OS change.  yes, i'm staring at 
the timing involved ... at this point, the timing has becom

Re: PoCo::Client::Ping intermittently fails to emit pings

2009-04-20 Thread Rocco Caputo

On Apr 15, 2009, at 07:59, Stuart Kendrick wrote:

my favorite pinging application (in-house app, uses  
POE::Component::Client::Ping) intermittently reports a rash of  
missed replies. happens roughly the same time each night; the  
condition lasts for several minutes


Time-based failures implicate periodic jobs.  What's cron doing around  
that time?



Details:
-  perl-5.10.0, POE 1.0.4, POE::Component::Client::Ping 1.14

-  CentOS 5.3 (2.6.18-128.el5 x86_64 x86_64 x86_64 GNU/Linux)

-  the application is pinging ~120 hosts every 30 seconds


Does the application ping them all in a burst?  This could cause lost  
packets.  As I mentioned, ICMP isn't reliable, and congestion on the  
network or interface could cause silent packet drops.


-  the box gets busy during this time, in terms of disk I/O (average  
latency of
  ~100ms ... fifteen minute rolling load average spikes to ~18:   
likely stems
  from several disk-intensive jobs which run during this period).   
this

  busyness lasts for ~5 hours


Heavy load could be causing your monitor program to lag, triggering  
timeouts---possibly before the ICMP requests are sent.  Reducing the  
pinger's niceness slightly may help, but you may just be overloading  
the system.  See what the machine needs to handle the work you've  
given it.  Consider distributing the work across multiple machines.


-  the box hosts a number of pinging applications (Nagios, a bunch  
of in-house

  apps), typically employing fping


Probably not an issue, unless they stream an awful lot of requests.   
Load spikes may push the pinging activity together, which could result  
in lost packets.  For example, 120 pings over 30 seconds is fine, but  
a lag spike might delay them.  The lagged ones will try to catch up in  
a burst when the lag spike ends---or when the OS says your pingers  
have been CPU starved long enough.


-  i've been running the application for a handful of years now, a  
handful of
  instances, monitoring ~100 - 500 hosts per instance.  this is  
novel behavior


Novel behavior suggests the question "What has changed?"  The variable  
could be something new and significant or something that has gradually  
approached a capacity threshold.  Some things to think about:  Has the  
number of hosts to ping increased by a lot recently or steadily over  
time?  Have the disk-intensive jobs' workload increased by a lot  
recently or steadily over time?


--
Rocco Caputo - [email protected]


Re: PoCo::Client::Ping intermittently fails to emit pings

2009-04-20 Thread Rocco Caputo
ICMP is not a reliable network protocol.  IP is allowed to drop the  
packets without notice.  The only way to detect the case is when too  
much time has elapsed without a response---$timeout is reached.


Enqueuing your request in the event queue is not the same thing as  
sending an ICMP packet.  post() may succeed where the ping fails.


--
Rocco Caputo - [email protected]


On Apr 19, 2009, at 12:44, Stuart Kendrick wrote:

If the OS fails to execute a ping, does there exist a mechanism for  
the OS to tell POE about that?  And for me to 'notice' such an  
incident?


I was hoping that checking the return value on a 'post' would do  
that ... but thus far, no dice (i.e. the box on which my script is  
running intermittently fails to emit pings, but by script doesn't  
log anything.)


[...]
   # "Pinger, do a ping and return the results as a pong event.  The
   # address to ping is $addr."
   $result = $kernel->post(  'pinger',
 'ping',
 'pong',
 $addr,
 $timeout,
   );
   log_error("Problem pinging $addr: $!") unless $result;
[...]

From http://search.cpan.org/~rcaputo/POE-1.004/lib/POE/Kernel.pm#post_DESTINATION,_EVENT_NAME_ 
[,_PARAMETER_LIST]


"post() returns a Boolean value indicating whether the message was  
successfully enqueued. If post() returns false, $! is set to explain  
the failure:"


--sk


Stuart Kendrick wrote:
my favorite pinging application (in-house app, uses  
POE::Component::Client::Ping) intermittently reports a rash of  
missed replies. happens roughly the same time each night; the  
condition lasts for several minutes




Re: PoCo::Client::Ping intermittently fails to emit pings

2009-04-19 Thread Stuart Kendrick
If the OS fails to execute a ping, does there exist a mechanism for the OS to 
tell POE about that?  And for me to 'notice' such an incident?


I was hoping that checking the return value on a 'post' would do that ... but 
thus far, no dice (i.e. the box on which my script is running intermittently 
fails to emit pings, but by script doesn't log anything.)


[...]
# "Pinger, do a ping and return the results as a pong event.  The
# address to ping is $addr."
$result = $kernel->post(  'pinger',
  'ping',
  'pong',
  $addr,
  $timeout,
);
log_error("Problem pinging $addr: $!") unless $result;
[...]

From 
http://search.cpan.org/~rcaputo/POE-1.004/lib/POE/Kernel.pm#post_DESTINATION,_EVENT_NAME_[,_PARAMETER_LIST]


"post() returns a Boolean value indicating whether the message was successfully 
enqueued. If post() returns false, $! is set to explain the failure:"


--sk


Stuart Kendrick wrote:
my favorite pinging application (in-house app, uses 
POE::Component::Client::Ping) intermittently reports a rash of missed 
replies. happens roughly the same time each night; the condition lasts 
for several minutes