Please understand that I'm only decent at the engineering side of
this: as a system administrator, I'm a decent engineer, but no more
than that.

On the previous configurations, this program seems to run with any
number of processors.  I believe these successful users have been using
LAM/MPI.  While I was waiting for a reply, I installed LAM/MPI.  The
results were similar to those from Open MPI.

While I can choose LAM/MPI, I'd prefer to port it to Open MPI since that
is where all the development and most of the support are.

I cannot choose the Portland compiler.  I must use either GNU or Intel
compilers on the Itanium2.

Ted (more responses below)

On November 7, 2007 at 8:39 AM, Squyres, Jeff wrote:

        On Nov 5, 2007, at 4:12 PM, Benjamin, Ted G. wrote:

        >> I have a code that runs with both Portland and Intel
        >> compilers on X86, AMD64 and Intel EM64T running various
        >> flavors of Linux on clusters.  I am trying to port it to a
        >> 2-CPU Itanium2 (ia64) running Red Hat Enterprise Linux 4.0;
        >> it has gcc 3.4.6-8 and the Intel Fortran compiler 10.0.026
        >> installed.  I have built Open MPI 1.2.4 using these
        >> compilers.
        >> When I built the Open MPI, I didn't do anything special.  I
        >> enabled debug, but that was really all.  Of course, you can
        >> see that in the config file that is attached.
        >> This system is not part of a cluster.  The two onboard CPUs
        >> (an HP zx6000) are the only processors on which the job
        >> runs.  The code must run on MPI because the source calls
        >> it.  I compiled the target software using the Fortran90
        >> compiler (mpif90).
        >> I've been running the code in the foreground so that I could
        >> keep an eye on its behavior.
        >> When I try to run the compiled and linked code [mpirun -np #
        >> {executable file}], it performs as shown below:

        >> (1) With the source compiled at optimization -O0 and -np 1,
        >> the job runs very slowly (6 days on the wall clock) to the
        >> correct answer on the benchmark;
        >> (2) With the source compiled at optimization -O0 and -np 2,
        >> the benchmark job fails with a segmentation violation;

        > Have you tried running your code through a memory-checking
        > debugger, and/or examining any corefiles that were generated
        > to see if there is a problem in your code?

        > I will certainly not guarantee that Open MPI is bug free, but
        > problems like this are *usually* application-level issues.
        > One place I always start is running the application in a
        > debugger to see if you can catch exactly where the Badness
        > happens.  This can be most helpful.

I have tried to run a debugger, but I am not an expert at it.  I could
not get Intel's idb debugger to give me a prompt, but I could get a
prompt from gdb.  I've looked over the manual, but I'm not sure how to
put in the breakpoints et al. that you geniuses use to evaluate a
program at critical junctures.  I actually used an "mpirun -np 2 gdb"
command to run it on 2 CPUs and attached the executable file at the
prompt.  When I did a run, it ran fine with no optimization and one
processor.  With 2 processors, it didn't seem to do anything.  All I
will say here is that I have a lot to learn, and I'm calling on my
friends for help with this.
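
In case it helps anyone spot what I did wrong, this is roughly the
invocation I understand the FAQ to mean; the executable name below is
just a placeholder, and I have not actually tried the xterm variant on
this machine:

     mpirun -np 2 gdb ./my_app            (both ranks in one terminal)
     mpirun -np 2 xterm -e gdb ./my_app   (one gdb window per rank)

and then, at each gdb prompt, something along the lines of

     (gdb) break MAIN__
     (gdb) run
     (gdb) backtrace

where MAIN__ is what I gather the Fortran main program is usually
named after compilation; if that symbol isn't there, breaking on the
program name might work instead.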

        >> (3) With the source compiled at all other optimization (-O1,
        >> -O2, -O3) and processor combinations (-np 1 and -np 2), it
        >> fails in what I would call a "quiescent" manner.  What I
        >> mean by this is that it does not produce any error
        >> messages.  When I submit the job, it produces a little
        >> standard output and it quits after 2-3 seconds.

        > That's fun.  Can you tell if it runs the app at all, or if it
        > dies before main() starts?  This is probably more of an issue
        > for your Intel support guy than us...

It's a Fortran program, and it starts in the main program.  I inserted
some PRINT* statements of the "PRINT *, 'Read the input at line 213'"
variety into the main program to see what would print.  It printed the
first four statements but never reached the last three.  The statements
that printed were in the set-up section of the program; the section
that was not reached contains a lot of matrix-setting and solving
subroutine calls.
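
For what it's worth, the markers were of roughly the following shape.
The subroutine names in this little example are made up, not the real
ones from the code; it is only meant to show the pattern:

     program marker_demo
       implicit none
       print *, 'marker 1: start of set-up'
       call setup_phase()
       print *, 'marker 2: set-up complete'
       ! in the real code, the markers after this point never printed
       call solve_phase()
       print *, 'marker 3: solve complete'
     contains
       subroutine setup_phase()
         print *, '   (set-up work goes here)'
       end subroutine setup_phase
       subroutine solve_phase()
         print *, '   (matrix set-up and solve calls go here)'
       end subroutine solve_phase
     end program marker_demo

One thing I worry about is that standard output is usually buffered,
so the last marker that appears may not be the last one that actually
executed before the failure.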

I'm going to point my Intel support person to this post and see where it
takes us.

        >> In an attempt to find the problem, the technical support
        >> agent at Intel has had me run some simple "Hello" problems.
        >> The first one is an MPI hello code that is the attached
        >> hello_mpi.f.  This ran as expected, and it echoed one
        >> "Hello" for each of the two processors.
        >> The second one is a non-MPI hello that is the attached
        >> hello.f90.  Since it is a non-MPI source, I was told that
        >> running it on a workstation with a properly configured MPI
        >> should only echo one "Hello"; the Intel agent told me that
        >> two such echoes indicate a problem with Open MPI.  It echoed
        >> twice, so now I have come to you for help.

        > I'm not sure what you mean by that.  If you:

        >      mpirun -np 4 hostname

        > where "hostname" is a non-MPI program (e.g., /bin/hostname),
        > you'll still see the output 4 times because you told MPI to
        > run 4 copies of "hostname".  In this way, Open MPI is just
        > being used as a job launcher.

        > So if I'm understanding you right,

        >    mpirun -np 2 my_non_mpi_f90_hello_app

        > should still print 2 copies of "hello".  If it does, then
        > Open MPI is doing exactly what it should do.

        > Specifically: Open MPI's mpirun can be used to launch non-MPI
        > applications (the same is not necessarily true for other MPI
        > implementations).

You understand correctly.  I am not an expert at MPI of any sort.
Both the MPI and non-MPI versions of "Hello" print once for each
invoked CPU; e.g.,

     mpirun -np 1 mpi_hello
and
     mpirun -np 1 non_mpi_hello

each print one "Hello, world", and

     mpirun -np 2 mpi_hello
and
     mpirun -np 2 non_mpi_hello

each print two "Hello, world"s.

        >> The other three attached files are the output requested on
        >> the "Getting Help" page - (1) the output of /sbin/ifconfig,
        >> (2) the output of ompi_info --all and (3) the config.log
        >> file.
        >> The installation of the Open MPI itself was as easy as could
        >> be.  I am really ignorant of how it works beyond what I've
        >> read from the FAQs and learned in a little digging, so I
        >> hope it's a simple solution.

        > FWIW, I see that you're using Open MPI v1.2.  Our latest
        > version is v1.2.4; if possible, you might want to try and
        > upgrade (e.g., delete your prior installation,
        > recompile/reinstall Open MPI, and then recompile/relink your
        > application against the new Open MPI installation); it has
        > all of our latest bug fixes, etc.

This is my mistake.  I attached an old version of ompi_info.txt.  I am
now attaching the correct version.  I already have 1.2.4 installed.
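
For completeness, the build really was plain.  The configure line was
along these lines (the install prefix is just a placeholder and I am
reconstructing this from memory; the exact invocation is in the
attached config.log), and the attached output came from ompi_info:

     ./configure --prefix=/opt/openmpi-1.2.4 --enable-debug \
                 CC=gcc F77=ifort FC=ifort
     make all install
     ompi_info --all > ompi_info.txt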

        > -- 
        > Jeff Squyres
        > Cisco Systems