Re: [MTT devel] Analysis of hung jobs.

2009-10-06 Thread Ashley Pittman
On Tue, 2009-10-06 at 11:25 -0400, Ethan Mallove wrote:
> On Tue, Oct/06/2009 10:23:48AM, Ashley Pittman wrote:
> > 
> > Further to the mail linked below, padb is able to perform diagnostics,
> > including backtraces on hung jobs and integrates well into automated
> > testing environments.
> 
> Can padb get a backtrace from a non-debuggable MPI (e.g., not compiled
> with -g)?

It gets what is available from the application: without -g it will
give you function names only; with -g it will also give you file names
and line numbers and, optionally, variables with their types and values.

It can show the message queues regardless of the -g option.

Ashley.

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk



Re: [MTT devel] Analysis of hung jobs.

2009-10-06 Thread Ethan Mallove
On Tue, Oct/06/2009 10:23:48AM, Ashley Pittman wrote:
> 
> Further to the mail linked below, padb is able to perform diagnostics,
> including backtraces on hung jobs and integrates well into automated
> testing environments.

Can padb get a backtrace from a non-debuggable MPI (e.g., not compiled
with -g)?

-Ethan

> 
> The attached patch is a minimal change which should enable the
> functionality.  I don't, however, have access to a working MTT
> installation to test this.
> 
> http://www.open-mpi.org/community/lists/mtt-devel/2009/06/0415.php
> 
> This will require a HEAD version of padb, at least r273 to allow it to
> accept the pid of mpirun rather than a jobid assigned by the underlying
> resource manager.
> 
> Yours,
> 
> Ashley,
> 
> -- 
> 
> Ashley Pittman, Bath, UK.
> 
> Padb - A parallel job inspection tool for cluster computing
> http://padb.pittman.org.uk

> Index: lib/MTT/DoCommand.pm
> ===
> --- lib/MTT/DoCommand.pm  (revision 1322)
> +++ lib/MTT/DoCommand.pm  (working copy)
> @@ -359,6 +359,7 @@
>  }
>  my $killed_status = undef;
>  my $last_over = 0;
> +my $padb_output;
>  while ($done > 0) {
>  my $nfound = select($rout = $rin, undef, undef, $t);
>  if (vec($rout, fileno(OUTread), 1) == 1) {
> @@ -410,6 +411,8 @@
>  my $timeout_email_recipient = $MTT::Globals::Values->{docommand_timeout_notify_email};
>  my $timeout_notify_timeout  = $MTT::Globals::Values->{docommand_timeout_notify_timeout};
>  
> + $padb_output = `padb --config-option rmgr=mpirun --full-report=$pid`;
> +
>  if (defined($timeout_sentinel_file)) {
>  
>  # Email someone, if an email address has been specified
> @@ -493,6 +496,9 @@
>  # Return an anonymous hash containing the relevant data
>  
>  $ret->{result_stdout} = join('', @out);
> +if ( defined $padb_output ) {
> +  $ret->{result_stdout} .= "\n$padb_output";
> +}
>  $ret->{result_stderr} = join('', @err),
>  if (!$merge_output);
>  return $ret;
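
The idea behind the patch, capturing a diagnostic report when a command times out and appending it to the recorded stdout, can be sketched standalone as below. This is only an illustration under stated assumptions, not MTT code: collect_diagnostics and append_diagnostics are hypothetical helper names, and the only padb options used are the ones that appear in the patch. The command is overridable so the sketch can be tried without padb installed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of MTT): run a diagnostic command for a
# given PID and return everything it prints, or undef if it cannot be run.
sub collect_diagnostics {
    my ($pid, @cmd) = @_;

    # Default to the padb invocation taken from the patch.
    @cmd = ('padb', '--config-option', 'rmgr=mpirun', "--full-report=$pid")
        unless @cmd;

    # List-form pipe open runs the command without going through a shell.
    open(my $fh, '-|', @cmd) or return;
    local $/;                 # slurp mode: read all output at once
    my $out = <$fh>;
    close($fh);
    return $out;
}

# Hypothetical helper: append the report to result_stdout, as the last
# hunk of the patch does, but only when a report was actually collected.
sub append_diagnostics {
    my ($ret, $diag) = @_;
    $ret->{result_stdout} .= "\n$diag" if defined $diag && length $diag;
    return $ret;
}

# Demo with a stand-in command (the running perl) instead of padb.
my $report = collect_diagnostics($$, $^X, '-e', 'print "stand-in report\n"');
my $result = append_diagnostics({ result_stdout => "test output" }, $report);
print $result->{result_stdout};
```

Using the list form of open rather than backticks keeps the PID and options out of shell interpolation. Whether that matters for MTT is a judgment call, since $pid there comes from MTT itself.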
