Re: [Simple-evcorr-users] shellcmd with timeout

Matthieu Pérotin Mon, 23 May 2011 06:53:13 -0700

Le vendredi 20 mai 2011 à 18:11 +0300, Risto Vaarandi a écrit :
> On 05/20/2011 02:56 PM, Matthieu Pérotin wrote:
> > Hi,
> >
> > we recently experienced an annoying problem with processes that, in some
> > circonstances, would get stuck and never return. The fault here is
> > clearly on the processes side, but one can never be sure that a process
> > will return nicely... The consequence on SEC's side is that child
> > processes remain attached to the SEC process, cluttering its %children
> > hash table and adding to the complexity of the check_children sub.
> >
> > A solution to the problem would be to have the possibility to give a
> > timeout option to the shellcmd action: on expiration a sigterm (or
> > sigkill, I'm still not sure) would be issued to the process that was
> > launched.
> >
> > I could not find in the mailing list archives any message about a
> > similar issue, and as we really needed this feature I implemented it as
> > a new action (to retain backward compatibility, it could not bear the
> > same name), which takes two parameters: a timeout in seconds and the
> > command to launch.
> >
> > I'm not quite sure this mailing list is the right place for proposing
> > patchs. If not, could someone give me the right place for that ?
> >
> > Regards,
> > Matthieu.
> 
> hi Matthieu,
> 
> indeed, the mailing list is the proper way for proposing patches. 
> However, in this case it looks to me the issue can quite easily tackled 
> with the means provided by the standard UNIX shell, for example:
> 
> action=shellcmd (/bin/yourprog & PROID=$! ; sleep 10; kill -9 $PROID)
> 
> Since the shellcmd action allows for shell intepretation of the 
> commandline (provided that shell metacharacters are present), this 
> action will run /bin/yourprog in background and assign its PID to a 
> variable PROID. Then, the shell that started /bin/yourprog will sleep 
> for 10 seconds, and then kill /bin/yourprog (provided the process is 
> still running).
> 
> There are a number of other ways for tackling the issue, like the 
> employment of Perl:
> 
> action=shellcmd ( perl -e 'alarm(10); exec("/bin/yourprog")' )
> 
> In this case, since the command line does not contain shell 
> metacharacters, an interpreting shell is not started, but SEC rather 
> runs perl directly. In the started new process, we invoke the alarm(2) 
> system call for delivering the ALRM signal for the process itself after 
> 10 seconds. Then we simply run /bin/yourprog within the current process, 
> and since the alarm timer is inherited by /bin/yourprog, the process 
> will get it after 10 seconds and terminate (provided /bin/yourprog does 
> not set a handler to ALRM).
> 
> In the past, the users have taken advantage of similar shell/Perl 
> features for advanced job control, e.g., see 
> http://simple-evcorr.sourceforge.net/FAQ.html#21. So instead of patching 
> SEC, I'd take advantage of the features of shell or Perl, since they are 
> simply so much more advanced.
> 
> hope this helps,
> risto


Hi Risto,

the solutions you list are fine with us. The only objections I may have
are:
- the shell only solution induce an additional fork, which is not
necessary with a patch. It may be problematic in heavy loaded systems;
- we are loosing the return value of the 'true' command: the one that
will get caught by waitpid will be the one of kill. The perl solution
did not have this issue, but it made it even more difficult to send
anything other than a SIGALARM;
- it renders the rules more complicate and less readable, as it
introduces some advance job control features inside the execution
instruction.

Now, I have one more serious concern regarding the use of PID in the
example you are giving above. Even though it is unlikely under linux
(but not impossible), if the command ends before the timeout, the pid
may be reassigned and a SIGKILL would be issued to an arbitrary process.
I am not too familiar with other Unix systems but it may be more likely
under those. To solve this issue, using shell job control directives may
be a good solution, that would look like:

action=shellcmd (/bin/yourprog & sleep 10; kill -9 %%1)

As there is always only one process put in bg in the spawned shell, we
will always point to the right one.

But, now a new minor remark arise: a kill will always be issued,
disregarding the fact that the command did finish or not. This has
little consequence on the system point of view, but it fills the sec
logs with ugly messages "Child xxxx terminated with non-zero exitcode
1". They comes from kill and are emitted even though everything went as
expected: the process did not get stuck. Again, we can circumvent this
from happening with a proper test:

action=shellcmd (/bin/yourprog & sleep 10 ; jobs 1; if [ $? -eq 0 ];
then kill -9 %%1; fi)

but I'm starting to get the feeling that it is becoming quite difficult
do read for a shellcmd action, and we still don't get the exit value of
the command.

Anyway, I perfectly understand that you are reluctant to add a patch to
SEC, and that we have workarounds. 

I'm attaching the patch I did to this message nonetheless, in case
someone wants to give it a look. I made this one against the 2.6.0
version.

With this patch, everything that is discussed above can be implemented
using a new shellcmdto action:
action=shellcmdto 10 /bin/yourprog

Regards,
Matthieu.

-- 
Open Software R&D / Cluster Management
BARD / Bruyères-Le-Châtel
Bull, Architect of an Open World TM (www.bull.com)
phone +33 (0)1 69 26 62 51
matthieu.pero...@bull.net

--- sec	2011-03-03 18:12:16.000000000 +0100
+++ sec.patched	2011-05-20 13:42:07.000000000 +0200
@@ -751,6 +751,7 @@
   my($createafter, $event, $timestamp);
   my($lifetime, $context, $alias);
   my($variable, $value, $code, $codeptr, $params, $evalok);
+  my($timeout);
 
   if ($action =~ /^none$/i)  { return NONE; }
 
@@ -809,7 +810,32 @@
               "Warning - '$progname' is not executable");
     }
 
-    return (SHELLCOMMAND, $cmdline); 
+    return (SHELLCOMMAND, undef, $cmdline); 
+
+  }
+
+  elsif ($action =~ /^shellcmdto\s+(\S+)\s+(.+)/i) { 
+
+    $timeout = $1;
+    $cmdline = $2;
+
+    # strip outer parentheses if they exist
+    if ($cmdline =~ /^\s*\(\s*(.*)\)\s*$/)  { $cmdline = $1; }
+
+    # remove backslashes in front of the parentheses
+    $cmdline =~ s/\\([\(\)])/$1/g;
+
+    $progname = (split(' ', $cmdline))[0];
+
+    if (! -f $progname) {
+      log_msg(LOG_WARN, "Rule in $conffile at line $lineno:", 
+              "Warning - could not find '$progname'");
+    } elsif (! -x $progname) {
+      log_msg(LOG_WARN, "Rule in $conffile at line $lineno:", 
+              "Warning - '$progname' is not executable");
+    }
+
+    return (SHELLCOMMAND, $timeout, $cmdline); 
 
   }
 
@@ -3605,6 +3631,7 @@
 
   my($cmd) = $_[0];
   my($collect_output) = $_[1];
+  my($timeout) = $_[2];
   my($pid);
   local *READ_FH;   # we need to use 'local *', since each time we enter
                     # this procedure a new filehandle must be created that
@@ -3649,6 +3676,10 @@
       $children{$pid}->{"open"} = 1;
     }
 
+    if (defined $timeout) {
+      $children{$pid}->{"timeout"} = time() + $timeout;
+    }
+
     log_msg(LOG_DEBUG, "Child $pid created for command '$cmd'");
     return $pid; 
 
@@ -3864,9 +3895,10 @@
   my($actionlist) = $_[0];
   my($text) = $_[1];
   my($i) = $_[2];
-  my($cmdline, $text2);
+  my($timeout, $cmdline, $text2);
 
-  $cmdline = $actionlist->[$i+1];
+  $timeout = $actionlist->[$i+1];
+  $cmdline = $actionlist->[$i+2];
   $text2 = $text;
 
   # if -quoting flag was specified, mask apostrophes in $text2 
@@ -3881,7 +3913,7 @@
 
   log_msg(LOG_INFO, "Executing shell command '$cmdline'");
 
-  shell_cmd($cmdline);
+  shell_cmd($cmdline, undef, $timeout);
 
   return 2;
 }
@@ -7232,8 +7264,8 @@
     }
 
     elsif ($actionlist->[$i] == SHELLCOMMAND) { 
-      $result .= "shellcmd " . $actionlist->[$i+1]; 
-      $i += 2;
+      $result .= "shellcmdto " . $actionlist->[$i+2] . " " . $actionlist->[$i+1]; 
+      $i += 3;
     } 
 
     elsif ($actionlist->[$i] == SPAWN) { 
@@ -8311,6 +8343,7 @@
 
   my($pid, $exitcode);
 
+  my $now = time();
   # if the child was started by 'spawn' action, gather the child
   # standard output and create events (if child has more than PIPE_BUF
   # bytes to write, we must start reading from pipe before child 
@@ -8318,6 +8351,9 @@
 
   while ($pid = each(%children)) { 
     if ($children{$pid}->{"open"})  { consume_pipe($pid); }
+    if (exists $children{$pid}->{"timeout"} && $children{$pid}->{"timeout"} < $now) {
+      kill("TERM", $pid);
+    }
   }
 
   # get the exit status of every terminated child process.
@@ -9688,7 +9724,7 @@
 $actioncopyfunc[NONE] = \&copy_one_elem_action;
 $actioncopyfunc[LOGONLY] = \&copy_two_elem_action;
 $actioncopyfunc[WRITE] = \&copy_three_elem_action;
-$actioncopyfunc[SHELLCOMMAND] = \&copy_two_elem_action;
+$actioncopyfunc[SHELLCOMMAND] = \&copy_three_elem_action;
 $actioncopyfunc[SPAWN] = \&copy_two_elem_action;
 $actioncopyfunc[PIPE] = \&copy_three_elem_action;
 $actioncopyfunc[CREATECONTEXT] = \&copy_create_set_action;
@@ -9715,7 +9751,7 @@
 $actionsubstfunc[NONE] = \&subst_none_action;
 $actionsubstfunc[LOGONLY] = \&subst_two_elem_action;
 $actionsubstfunc[WRITE] = \&subst_three_elem_action;
-$actionsubstfunc[SHELLCOMMAND] = \&subst_two_elem_action;
+$actionsubstfunc[SHELLCOMMAND] = \&subst_three_elem_action;
 $actionsubstfunc[SPAWN] = \&subst_two_elem_action;
 $actionsubstfunc[PIPE] = \&subst_three_elem_action;
 $actionsubstfunc[CREATECONTEXT] = \&subst_create_set_action;

------------------------------------------------------------------------------
What Every C/C++ and Fortran developer Should Know!
Read this article and learn how Intel has extended the reach of its 
next-generation tools to help Windows* and Linux* C/C++ and Fortran 
developers boost performance applications - including clusters. 
http://p.sf.net/sfu/intel-dev2devmay

_______________________________________________
Simple-evcorr-users mailing list
Simple-evcorr-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/simple-evcorr-users

Re: [Simple-evcorr-users] shellcmd with timeout

Reply via email to