Sorry for the delay in replying; this turned into a hectic week...

On Feb 4, 2009, at 11:28 AM, Hana Milani wrote:

> Jeff, Thanks for helping me.
>
>> Is this a Fortran program, perchance?
>
> Yes, it has been written by f77, but I have compiled it with
> gfortran. People have also done the same with no problem.
>
>> Do you have access to the source code? I wonder if the program is
>> internally raising an error and effectively aborting itself. Do you
>> know that the application runs correctly? Do you have any test data
>> sets that you can try that give known outputs?
>
> Yes, I have installed the source code. I have not been able to run
> the program in parallel, but I have run my inputs sequentially and
> got satisfactory results.

That's a good datapoint, but it's unfortunately not conclusive.

> If you allow me, I can send the details of the code to your email.

If it's small and simple, sure. I'm afraid I don't have the
time/resources to investigate a large complex application that is
misbehaving.

I don't have any more insights other than to re-state that *something*
is killing your application with SIGTERM. It is *likely* some other
entity on your node - a daemon or some other controller process. But
it is also possible (although probably less likely) that the
application is aborting itself.
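
One way to narrow down who is sending the SIGTERM is to install a
handler that reports the sender's PID. The following is only a minimal
C sketch and is not part of Open MPI or of your application. The names
install_sigterm_reporter, on_sigterm, write_long and
last_sigterm_sender are made up for illustration, and it assumes a
POSIX system where sigaction() with SA_SIGINFO fills in si_pid.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* PID of the last SIGTERM sender seen by the handler; 0 means none yet.
 * (Hypothetical name, kept global so a test program can inspect it.) */
static volatile sig_atomic_t last_sigterm_sender = 0;

/* Write a non-negative integer to fd using only async-signal-safe calls. */
static void write_long(int fd, long value)
{
    char buf[24];
    size_t i = sizeof(buf);

    do {
        buf[--i] = (char)('0' + value % 10);
        value /= 10;
    } while (value > 0 && i > 0);

    (void)write(fd, buf + i, sizeof(buf) - i);
}

/* SIGTERM handler: record and report the sender, then return.
 * It deliberately does not terminate the process. */
static void on_sigterm(int signo, siginfo_t *info, void *context)
{
    static const char msg[] = "SIGTERM received from pid ";

    (void)signo;
    (void)context;

    last_sigterm_sender = (sig_atomic_t)info->si_pid;
    (void)write(STDERR_FILENO, msg, sizeof(msg) - 1);
    write_long(STDERR_FILENO, (long)info->si_pid);
    (void)write(STDERR_FILENO, "\n", 1);
}

/* Install the handler. Returns 0 on success, -1 on failure
 * (errno is set by sigaction). */
int install_sigterm_reporter(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = on_sigterm;
    sa.sa_flags = SA_SIGINFO;   /* ask for the siginfo_t argument */
    sigemptyset(&sa.sa_mask);

    return sigaction(SIGTERM, &sa, NULL);
}
```

If install_sigterm_reporter() is called early in a small C test
program, a later SIGTERM should print the PID of the sending process
to stderr, and that PID can then be looked up with ps. Note that this
handler only reports; the process keeps running afterward, so it is a
diagnostic aid rather than something to leave in production code.
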
Are you able to run *any* MPI applications (especially those compiled
with Fortran) in parallel? E.g., the hello world and the ring
programs in the examples/ subdirectory in the OMPI distribution?
--
Jeff Squyres
Cisco Systems