Re: [OMPI users] MPI_COMPLEX16

2012-05-23 Thread David Singleton

On 05/23/2012 07:30 PM, Patrick Le Dot wrote:

David Singleton<David.Singleton  anu.edu.au>  writes:




I should have checked earlier - same for MPI_COMPLEX and MPI_COMPLEX8.

David

On 04/27/2012 08:43 AM, David Singleton wrote:


Apologies if this has already been covered somewhere. One of our users
has noticed that MPI_COMPLEX16 is flagged as an invalid type in 1.5.4
but not in 1.4.3 while MPI_DOUBLE_COMPLEX is accepted for both. This is
with either gfortran or intel-fc.
...


Hi,

I hit the same problem: MPI_COMPLEX8 and MPI_COMPLEX16 were available
in v1.4 but were removed in v1.5, and I don't understand why, except that
these types are not in the standard...

I have a patch to reintroduce them, so let me know what you think.



I would very much appreciate seeing that patch.

Thanks
David



Re: [OMPI users] MPI_COMPLEX16

2012-04-26 Thread David Singleton


I should have checked earlier - same for MPI_COMPLEX and MPI_COMPLEX8.

David

On 04/27/2012 08:43 AM, David Singleton wrote:


Apologies if this has already been covered somewhere. One of our users
has noticed that MPI_COMPLEX16 is flagged as an invalid type in 1.5.4
but not in 1.4.3 while MPI_DOUBLE_COMPLEX is accepted for both. This is
with either gfortran or intel-fc. Superficially, the configure looks
the same for 1.4.3 and 1.5.4, eg.
% grep COMPLEX16 opal/include/opal_config.h
#define OMPI_HAVE_F90_COMPLEX16 1
#define OMPI_HAVE_FORTRAN_COMPLEX16 1

Their test code (appended below) produces:

% module load openmpi/1.4.3
% mpif90 mpi_complex_test.f90
% mpirun -np 2 ./a.out
SUM1 (3.00,-1.00)
SUM2 (3.00,-1.00)
% module swap openmpi/1.5.4
% mpif90 mpi_complex_test.f90
% mpirun -np 2 ./a.out
[vayu1:1935] *** An error occurred in MPI_Reduce
[vayu1:1935] *** on communicator MPI_COMM_WORLD
[vayu1:1935] *** MPI_ERR_TYPE: invalid datatype
[vayu1:1935] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
SUM1 (3.00,-1.00)

Thanks for any help,
David


program mpi_test

   implicit none
   include 'mpif.h'
   integer, parameter :: r8 = selected_real_kind(12)
   complex(kind=r8) :: local, global
   integer :: ierr, myid, nproc

   call MPI_INIT (ierr)
   call MPI_COMM_RANK (MPI_COMM_WORLD, myid, ierr)
   call MPI_COMM_SIZE (MPI_COMM_WORLD, nproc, ierr)

   local = cmplx(myid+1.0, myid-1.0, kind=r8)
   call MPI_REDUCE (local, global, 1, MPI_DOUBLE_COMPLEX, MPI_SUM, 0, &
                    MPI_COMM_WORLD, ierr)
   if ( myid == 0 ) then
      print*, 'SUM1', global
   end if

   call MPI_REDUCE (local, global, 1, MPI_COMPLEX16, MPI_SUM, 0, &
                    MPI_COMM_WORLD, ierr)
   if ( myid == 0 ) then
      print*, 'SUM2', global
   end if

   call MPI_FINALIZE (ierr)

end program mpi_test



[OMPI users] MPI_COMPLEX16

2012-04-26 Thread David Singleton


Apologies if this has already been covered somewhere.  One of our users
has noticed that MPI_COMPLEX16 is flagged as an invalid type in 1.5.4
but not in 1.4.3 while MPI_DOUBLE_COMPLEX is accepted for both. This is
with either gfortran or intel-fc.  Superficially, the configure looks
the same for 1.4.3 and 1.5.4,  eg.
% grep COMPLEX16  opal/include/opal_config.h
#define OMPI_HAVE_F90_COMPLEX16 1
#define OMPI_HAVE_FORTRAN_COMPLEX16 1

Their test code (appended below) produces:

% module load openmpi/1.4.3
% mpif90 mpi_complex_test.f90
% mpirun -np 2 ./a.out
 SUM1 (3.00,-1.00)
 SUM2 (3.00,-1.00)
% module swap openmpi/1.5.4
% mpif90 mpi_complex_test.f90
% mpirun -np 2 ./a.out
[vayu1:1935] *** An error occurred in MPI_Reduce
[vayu1:1935] *** on communicator MPI_COMM_WORLD
[vayu1:1935] *** MPI_ERR_TYPE: invalid datatype
[vayu1:1935] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
 SUM1 (3.00,-1.00)

Thanks for any help,
David


program mpi_test

   implicit none
   include 'mpif.h'
   integer, parameter :: r8 = selected_real_kind(12)
   complex(kind=r8) :: local, global
   integer :: ierr, myid, nproc

   call MPI_INIT (ierr)
   call MPI_COMM_RANK (MPI_COMM_WORLD, myid, ierr)
   call MPI_COMM_SIZE (MPI_COMM_WORLD, nproc, ierr)

   local = cmplx(myid+1.0, myid-1.0, kind=r8)
   call MPI_REDUCE (local, global, 1, MPI_DOUBLE_COMPLEX, MPI_SUM, 0, &
MPI_COMM_WORLD, ierr)
   if ( myid == 0 ) then
  print*, 'SUM1', global
   end if

   call MPI_REDUCE (local, global, 1, MPI_COMPLEX16, MPI_SUM, 0, &
MPI_COMM_WORLD, ierr)
   if ( myid == 0 ) then
  print*, 'SUM2', global
   end if

   call MPI_FINALIZE (ierr)

end program mpi_test




Re: [OMPI users] openmpi/pbsdsh/Torque problem

2011-04-03 Thread David Singleton


On 04/04/2011 12:56 AM, Ralph Castain wrote:


What I still don't understand is why you are trying to do it this way. Why not 
just run

time mpirun -v -x LD_LIBRARY_PATH -x PATH -np 2 -machinefile .machineN 
/home/lma712/src/Virgin_10.1/lapw1Q_mpi lapw1Q_1.def

where machineN contains the names of the nodes where you want the MPI apps to 
execute? mpirun will only execute apps on those nodes, so this accomplishes the 
same thing as your script - only with a lot less pain.

Your script would just contain a sequence of these commands, each with its 
number of procs and machinefile as required.



Maybe I missed why this suggestion of forgetting about the ssh/pbsdsh altogether
was not feasible?  Just use mpirun (with its great tm support!) to distribute
MPI jobs.

A simple example:

vayu1:~/MPI > qsub -lncpus=24,vmem=24gb,walltime=10:00 -wd -I
qsub: waiting for job 574900.vu-pbs to start
qsub: job 574900.vu-pbs ready

[dbs900@v250 ~/MPI]$ wc -l $PBS_NODEFILE
24
[dbs900@v250 ~/MPI]$ head -12 $PBS_NODEFILE > m1
[dbs900@v250 ~/MPI]$ tail -12 $PBS_NODEFILE > m2
[dbs900@v250 ~/MPI]$ mpirun --machinefile m1 ./a2a143 12 30 & mpirun 
--machinefile m2 ./pp143


Check how the processes are distributed ...

vayu1:~ > qps 574900.vu-pbs
Node 0: v250:
  PID S    RSS    VSZ %MEM     TIME %CPU COMMAND
11420 S  2104  10396  0.0 00:00:00  0.0 -tcsh
11421 S   620  10552  0.0 00:00:00  0.0 pbs_demux
12471 S  2208  49324  0.0 00:00:00  0.9 /apps/openmpi/1.4.3/bin/mpirun 
--machinefile m1 ./a2a143 12 30
12472 S  2116  49312  0.0 00:00:00  0.0 /apps/openmpi/1.4.3/bin/mpirun 
--machinefile m2 ./pp143
12535 R 270160 565668  1.0 00:00:02 82.4 ./a2a143 12 30
12536 R 270032 565536  1.0 00:00:02 81.4 ./a2a143 12 30
12537 R 270012 565528  1.0 00:00:02 87.3 ./a2a143 12 30
12538 R 269992 565532  1.0 00:00:02 93.3 ./a2a143 12 30
12539 R 269980 565516  1.0 00:00:02 81.4 ./a2a143 12 30
12540 R 270008 565516  1.0 00:00:02 86.3 ./a2a143 12 30
12541 R 270008 565516  1.0 00:00:02 96.3 ./a2a143 12 30
12542 R 272064 567568  1.0 00:00:02 91.3 ./a2a143 12 30
Node 1: v251:
  PID S    RSS    VSZ %MEM     TIME %CPU COMMAND
10367 S  1872  40648  0.0 00:00:00  0.0 orted -mca ess env -mca orte_ess_jobid 113440 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri 
"113440.0;tcp://10.1.3.58:37339"
10368 S  1868  40648  0.0 00:00:00  0.0 orted -mca ess env -mca orte_ess_jobid 1444347904 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 3 --hnp-uri 
"1444347904.0;tcp://10.1.3.58:39610"

10372 R 271112 567556  1.0 00:00:04 74.5 ./a2a143 12 30
10373 R 271036 567564  1.0 00:00:04 71.5 ./a2a143 12 30
10374 R 271032 567560  1.0 00:00:04 66.5 ./a2a143 12 30
10375 R 273112 569612  1.1 00:00:04 68.5 ./a2a143 12 30
10378 R 552280 840712  2.2 00:00:04 100 ./pp143
10379 R 552280 840708  2.2 00:00:04 100 ./pp143
10380 R 552328 841576  2.2 00:00:04 100 ./pp143
10381 R 552788 841216  2.2 00:00:04 99.3 ./pp143
Node 2: v252:
  PID S    RSS    VSZ %MEM     TIME %CPU COMMAND
10152 S  1908  40780  0.0 00:00:00  0.0 orted -mca ess env -mca orte_ess_jobid 1444347904 -mca orte_ess_vpid 2 -mca orte_ess_num_procs 3 --hnp-uri 
"1444347904.0;tcp://10.1.3.58:39610"

10156 R 552384 840200  2.2 00:00:07 99.3 ./pp143
10157 R 551868 839692  2.2 00:00:06 99.3 ./pp143
10158 R 551400 839184  2.2 00:00:07 100 ./pp143
10159 R 551436 839184  2.2 00:00:06 98.3 ./pp143
10160 R 551760 839692  2.2 00:00:07 100 ./pp143
10161 R 551788 839824  2.2 00:00:07 97.3 ./pp143
10162 R 552256 840332  2.2 00:00:07 100 ./pp143
10163 R 552216 840340  2.2 00:00:07 99.3 ./pp143


You would have to do something smarter to get correct process binding etc.




Re: [OMPI users] openmpi/pbsdsh/Torque problem

2011-04-03 Thread David Singleton



You can prove this to yourself rather easily. Just ssh to a remote node and execute any command 
that lingers for awhile - say something simple like "sleep". Then kill the ssh and do a 
"ps" on the remote node. I guarantee that the command will have died.



Hmmm ...

vayu1:~ > ssh v37 sleep 600 &
[1] 30145
vayu1:~ > kill -9 30145
[1]  + Suspended (tty input) ssh v37 sleep 600
vayu1:~ >
[1]    Killed                 ssh v37 sleep 600
vayu1:~ > ssh v37 ps aux | grep dbs900 | grep sleep
dbs900   18774  0.0  0.0   9360  1348 ?        Ss   07:12   0:00 /bin/tcsh -c
sleep 600
dbs900   18806  0.0  0.0   3800   480 ?        S    07:12   0:00 sleep 600



Re: [OMPI users] Sending large boradcasts

2011-01-03 Thread David Singleton


Hi Brock,

That message should only be 2MB.  Are you sure it's not a mismatch of
message lengths in MPI_Bcast calls?

David
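(A quick size check, assuming 4-byte single-precision reals: 540 * 1080 = 583,200
elements, and 583,200 * 4 bytes is roughly 2.3 MB - just over 2 MB, not 2 GB.
Even with 8-byte reals it would only be about 4.7 MB.)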

On 01/04/2011 03:47 AM, Brock Palen wrote:

I have a user who reports that sending a broadcast of

540*1080 of reals (just over 2GB) fails with this:


*** An error occurred in MPI_Bcast
*** on communicator MPI_COMM_WORLD
*** MPI_ERR_TRUNCATE: message truncated
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)

I was reading the archives and there appears to be an issue with large 
messages.  I was a little confused, is there a way to send messages larger than 
2GB?

The user has access to some IB machines; per a note in the archives there was
an issue with writev(). Would this issue only be related to messages over
Ethernet?

Thanks just trying to get some clarification.

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
bro...@umich.edu
(734)936-1985




Re: [OMPI users] Open MPI vs IBM MPI performance help

2010-12-02 Thread David Singleton


http://www.open-mpi.org/faq/?category=running#oversubscribing

On 12/03/2010 06:25 AM, Price, Brian M (N-KCI) wrote:

Additional testing seems to show that the problem is related to barriers and 
how often they poll to determine whether or not it's time to leave.  Is there 
some MCA parameter or environment variable that allows me to control the 
frequency of polling while in barriers?
Thanks,
Brian Price

From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Price, Brian M (N-KCI)
Sent: Wednesday, December 01, 2010 11:29 AM
To: Open MPI Users
Cc: Stern, Craig J
Subject: EXTERNAL: [OMPI users] Open MPI vs IBM MPI performance help

OpenMPI version: 1.4.3
Platform: IBM P5, 32 processors, 256 GB memory, Symmetric Multi-Threading (SMT) 
enabled
Application: starts up 48 processes and does MPI using MPI_Barrier, MPI_Get, 
MPI_Put (lots of transfers, large amounts of data)
Issue:  When implemented using Open MPI vs. IBM's MPI ('poe' from HPC Toolkit), 
the application runs 3-5 times slower.
I suspect that IBM's MPI implementation must take advantage of some knowledge 
that it has about data transfers that Open MPI is not taking advantage of.
Any suggestions?
Thanks,
Brian Price



Re: [OMPI users] Memory affinity

2010-09-27 Thread David Singleton

On 09/28/2010 06:52 AM, Tim Prince wrote:

On 9/27/2010 12:21 PM, Gabriele Fatigati wrote:

HI Tim,

I have read that link, but I haven't understood whether enabling processor
affinity also enables memory affinity, because it is written that:

"Note that memory affinity support is enabled only when processor
affinity is enabled"

Can I set processor affinity without memory affinity? That is my
question.


2010/9/27 Tim Prince

On 9/27/2010 9:01 AM, Gabriele Fatigati wrote:

If OpenMPI is numa-compiled, is memory affinity enabled by default?
I ask because I didn't find a standalone memory affinity (or similar)
parameter to set to 1.



The FAQ http://www.open-mpi.org/faq/?category=tuning#using-paffinity
has a useful introduction to affinity. It's available in a default
build, but not enabled by default.


Memory affinity is implied by processor affinity. Your system libraries
are set up so as to cause any memory allocated to be made local to the
processor, if possible. That's one of the primary benefits of processor
affinity. Not being an expert in openmpi, I assume, in the absence of
further easily accessible documentation, there's no useful explicit way
to disable maffinity while using paffinity on platforms other than the
specified legacy platforms.



Memory allocation policy really needs to be independent of processor
binding policy.  The default memory policy (memory affinity) of "attempt
to allocate to the NUMA node of the cpu that made the allocation request
but fallback as needed" is flawed in a number of situations.  This is true
even when MPI jobs are given dedicated access to processors.  A common one is
where the local NUMA node is full of pagecache pages (from the checkpoint
of the last job to complete).  For those sites that support suspend/resume
based scheduling, NUMA nodes will generally contain pages from suspended
jobs. Ideally, the new (suspending) job should suffer a little bit of paging
overhead (pushing out the suspended job) to get ideal memory placement for
the next 6 or whatever hours of execution.

An mbind (MPOL_BIND) policy of binding to the one local NUMA node will not
work in the case of one process requiring more memory than that local NUMA
node.  One scenario is a master-slave where you might want:
  master (rank 0) bound to processor 0 but not memory bound
  slave (rank i) bound to processor i and memory bound to the local memory
of processor i.

They really are independent requirements.

Cheers,
David
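To make that concrete, below is a minimal C sketch (not Open MPI code; the CPU
numbers, NUMA node numbers and rank-to-node mapping are made up for
illustration) of the master/slave layout above, using the Linux
sched_setaffinity() and set_mempolicy() interfaces: the master gets processor
affinity only, while a slave gets processor affinity plus a strict MPOL_BIND
memory policy.

#define _GNU_SOURCE
#include <sched.h>      /* sched_setaffinity, CPU_ZERO, CPU_SET */
#include <numaif.h>     /* set_mempolicy, MPOL_BIND, MPOL_DEFAULT; link with -lnuma */
#include <stdio.h>

/* Bind the calling process to a single CPU. */
static int bind_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set);
}

/* Restrict all future allocations to a single NUMA node (strict binding). */
static int bind_memory(int node)
{
    unsigned long nodemask = 1UL << node;
    return set_mempolicy(MPOL_BIND, &nodemask, sizeof(nodemask) * 8);
}

int main(void)
{
    int rank = 3;                 /* illustrative; would come from MPI_Comm_rank */

    if (rank == 0) {
        bind_cpu(0);                            /* master: cpu binding only  */
        set_mempolicy(MPOL_DEFAULT, NULL, 0);   /* keep the fallback policy  */
    } else {
        bind_cpu(rank);                         /* slave: cpu binding ...    */
        bind_memory(rank / 4);                  /* ... plus strict local mem */
    }
    printf("rank %d bound\n", rank);
    return 0;
}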



Re: [OMPI users] spin-wait backoff

2010-09-03 Thread David Singleton

On 09/03/2010 10:05 PM, Jeff Squyres wrote:

On Sep 3, 2010, at 12:16 AM, Ralph Castain wrote:


Backing off the polling rate requires more application-specific logic like that 
offered below, so it is a little difficult for us to implement at the MPI 
library level. Not saying we eventually won't - just not sure anyone quite 
knows how to do so in a generalized form.


FWIW, we've *talked* about this kind of stuff among the developers -- it's at least 
somewhat similar to the "backoff to blocking communications instead of polling 
communications" issues.  That work in particular has been discussed for a long time 
but never implemented.

Are your jobs hanging because of deadlock (i.e., application error), or 
infrastructure error?  If they're hanging because of deadlock, there are some 
PMPI-based tools that might be able to help.



These are application deadlocks (like the well-known VASP calling MPI_Finalize
when it should be calling MPI_Abort!).  But I'm asking as a system manager with
dozens of apps run by dozens of users hanging and not being noticed for a day
or two because users are not attentive and, from outside the job, everything
looks OK.  So the problem is detection.  Are you suggesting there are PMPI
approaches we could apply to every production job on the system?

I now have a hack to opal_progress that seems to do what we want without any
impact on performance in the "good" case.  It basically involves keeping count
of the number of contiguous calls to opal_progress with no events completed.
When that hits a large number (eg 10^9), sleeping (maybe up to a second) on
every, say, 10^3-10^4 passes through opal_progress seems to do "the right
thing".  (Obviously, any event completion resets everything to spinning.)
There are a few magic numbers there that need to be overrideable by users.
Please let me know if this idea is blatantly flawed.

Thanks,
David
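For the record, a standalone sketch of the counting logic described above -
this is not the actual opal_progress patch; the function name and the
thresholds are illustrative and, as noted, the magic numbers would need to be
user-overrideable parameters:

#include <unistd.h>     /* usleep */
#include <stdint.h>

/* Illustrative thresholds - in a real patch these would be tunable. */
#define IDLE_SPIN_LIMIT  1000000000ULL   /* ~10^9 empty passes before backing off */
#define SLEEP_EVERY      10000           /* then sleep once per ~10^4 passes      */
#define SLEEP_USEC       100000          /* 100 ms per sleep (up to ~a second)    */

static uint64_t idle_passes = 0;

/* Call once per progress-loop pass with the number of events completed. */
void progress_backoff(int events_completed)
{
    if (events_completed > 0) {
        idle_passes = 0;                 /* any completion: back to full spinning */
        return;
    }
    if (++idle_passes > IDLE_SPIN_LIMIT && idle_passes % SLEEP_EVERY == 0) {
        usleep(SLEEP_USEC);              /* a hung job now shows near-zero CPU    */
    }
}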


[OMPI users] spin-wait backoff

2010-09-02 Thread David Singleton


I'm sure this has been discussed before but having watched hundreds of
thousands of cpuhrs being wasted by difficult-to-detect hung jobs, I'd
be keen to know why there isn't some sort of "spin-wait backoff" option.
For example, a way to specify spin-wait for x seconds/cycles/iterations
then backoff to lighter and lighter cpu usage.  At least that way, hung
jobs would become self-evident.

Maybe there is already some way of doing this?

Thanks,
David



Re: [OMPI users] Open MPI 1.4.2 released

2010-05-27 Thread David Singleton

On 05/28/2010 08:20 AM, Jeff Squyres wrote:

On May 16, 2010, at 5:21 AM, Aleksej Saushev wrote:


http://cvsweb.netbsd.org/bsdweb.cgi/pkgsrc/parallel/openmpi/patches/


Sorry for the high latency reply...

aa: We haven't added RPATH support yet.  We've talked about it but never done 
it.  There are some in OMPI who insist that rpath support needs to be optional. 
 A full patch solution would be appreciated.



We have problems with rpath overriding LD_RUN_PATH.  LD_RUN_PATH is
an intrinsic part of the way we configure our users' environment.  We
effectively use (impose) rpath but through the flexible, concatenatable
LD_RUN_PATH.

David


Re: [OMPI users] Hide Abort output

2010-03-31 Thread David Singleton


Yes, Dick has isolated the issue - novice users often believe Open MPI
(not their application) had a problem.  Anything along the lines he suggests
can only help.

David

On 04/01/2010 01:12 AM, Richard Treumann wrote:


I do not know what the OpenMPI message looks like or why people want to
hide it. It should be phrased to avoid any implication of a problem with
OpenMPI itself.

How about something like this:

"The application has called MPI_Abort. The application is terminated by
OpenMPI as the application demanded"


Dick Treumann  -  MPI Team
IBM Systems&  Technology Group
Dept X2ZA / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
Tele (845) 433-7846 Fax (845) 433-8363




   From:   "Jeff Squyres (jsquyres)"

   To:,

   Date:   03/31/2010 06:43 AM

   Subject:Re: [OMPI users] Hide Abort output

   Sent by:users-boun...@open-mpi.org






At present there is no such feature, but it should not be hard to add.

Can you guys be a little more specific about exactly what you are seeing
and exactly what you want to see?  (And what version you're working with -
I'll caveat my discussion that this may be a 1.5-and-forward thing)

-jms
Sent from my PDA.  No type good.

- Original Message -
From: users-boun...@open-mpi.org
To: Open MPI Users
Sent: Wed Mar 31 05:38:48 2010
Subject: Re: [OMPI users] Hide Abort output


I have to say this is a very common issue for our users.  They repeatedly
report the long Open MPI MPI_Abort() message in help queries and fail to
look for the application error message about the root cause.  A short
MPI_Abort() message that said "look elsewhere for the real error message"
would be useful.

Cheers,
David

On 03/31/2010 07:58 PM, Yves Caniou wrote:

Dear all,

I am using the MPI_Abort() command in an MPI program.
I would like to not see the note explaining that the command caused Open MPI
to kill all the jobs and so on.
I thought that I could find a --mca parameter, but couldn't grep it. The only
ones deal with the delay and printing more information (the stack).

Is there a means to avoid the printing of the note (except the 2>/dev/null
tip)? Or to delay this printing?

Thank you.

.Yves.





Re: [OMPI users] Hide Abort output

2010-03-31 Thread David Singleton


I have to say this is a very common issue for our users.  They repeatedly
report the long Open MPI MPI_Abort() message in help queries and fail to
look for the application error message about the root cause.  A short
MPI_Abort() message that said "look elsewhere for the real error message"
would be useful.

Cheers,
David

On 03/31/2010 07:58 PM, Yves Caniou wrote:

Dear all,

I am using the MPI_Abort() command in an MPI program.
I would like to not see the note explaining that the command caused Open MPI
to kill all the jobs and so on.
I thought that I could find a --mca parameter, but couldn't grep it. The only
ones deal with the delay and printing more information (the stack).

Is there a means to avoid the printing of the note (except the 2>/dev/null
tip)? Or to delay this printing?

Thank you.

.Yves.





Re: [OMPI users] Parallel file write in fortran (+mpi)

2010-02-02 Thread David Singleton


It's definitely not a bug in Lustre - it's an essential part of any
"coherent" cluster filesystem (CXFS, QFS, GFS, GPFS, ...).  The
whole point is that some people actually want to have meaningful
(non-garbage) files accessed using properly managed parallel IO
techniques.  The locking is necessary in this case.  You would see
the same issues on any of those filesystems.

David

On 02/03/2010 11:37 AM, Laurence Marks wrote:

Agreed that it is not good (and I am recoding some programs to avoid
this), but (and here life gets interesting) is this a "bug" in Lustre?

On Tue, Feb 2, 2010 at 5:59 PM, David Singleton
<david.single...@anu.edu.au>  wrote:


But it's a very bad idea on a "coherent", "POSIX" filesystem like Lustre.
Locks have to bounce around between the nodes for every write.  This can
be VERY slow (even for trivial amounts of "logging" IO) and thrash the
filesystem for other users.   So, yes, at our site, we include this sort
of "parallel IO" on our list of disallowed behaviour.  Not a good practice
to adopt in general.

David

On 02/03/2010 10:40 AM, Laurence Marks wrote:


I know it's wrong, but I don't think it is forbidden (which I
guess is what you are saying).

On Tue, Feb 2, 2010 at 5:31 PM, Jeff Squyres<jsquy...@cisco.com>wrote:


+1 on Nick's responses.

AFAIK, if you don't mind getting garbage in the output file, it should be
fine to do.  Specifically: it should not cause OS issues (crash, reboot,
corrupted filesystem, etc.) to do this -- but the file contents will likely
be garbage.

That being said, this situation likely falls into the "Doc, it hurts when
I do this..." category.  Meaning: you know it's wrong, so you probably
shouldn't be doing it anyway.  :-)


On Feb 2, 2010, at 4:50 PM, Nicolas Bock wrote:


Hi Laurence,

I don't know whether it's as bad as a deadly sin, but for us parallel
writes are a huge problem and we get complete garbage in the file. Take a
look at:

Implementing MPI-IO Atomic Mode and Shared File Pointers Using MPI
One-Sided Communication, Robert Latham, Robert Ross, Rajeev Thakur,
International Journal of High Performance Computing Applications, 21, 132
(2007).

They describe an implementation of a "mutex"-like object in MPI. If you
protect writes to the file with an exclusive lock you can serialize the
writes and make use of NFS's close-to-open cache coherence.

nick


On Tue, Feb 2, 2010 at 08:27, Laurence Marks<l-ma...@northwestern.edu>
  wrote:
I have a question concerning having many processors in an MPI job all
write to the same file -- not using mpi calls but with standard
fortran I/O. I know that this can lead to consistency issues, but it
can also lead to OS issues with some flavors of nfs.

At least in fortran, there is nothing "wrong" with doing this. My
question is whether this is "One of the Seven Deadly Sins" of mpi
programming, or just frowned on. (That is, it should be OK even if it
leads to nonsense files, and not lead to OS issues.) If it is a sin, I
would appreciate a link to where this is spelt out in some "official"
document or similar.

--
Laurence Marks
Department of Materials Science and Engineering
MSE Rm 2036 Cook Hall
2220 N Campus Drive
Northwestern University
Evanston, IL 60208, USA
Tel: (847) 491-3996 Fax: (847) 491-7820
email: L-marks at northwestern dot edu
Web: www.numis.northwestern.edu
Chair, Commission on Electron Crystallography of IUCR
www.numis.northwestern.edu/
Electron crystallography is the branch of science that uses electron
scattering and imaging to study the structure of matter.



--
Jeff Squyres
jsquy...@cisco.com













Re: [OMPI users] Parallel file write in fortran (+mpi)

2010-02-02 Thread David Singleton


But it's a very bad idea on a "coherent", "POSIX" filesystem like Lustre.
Locks have to bounce around between the nodes for every write.  This can
be VERY slow (even for trivial amounts of "logging" IO) and thrash the
filesystem for other users.   So, yes, at our site, we include this sort
of "parallel IO" on our list of disallowed behaviour.  Not a good practice
to adopt in general.

David

On 02/03/2010 10:40 AM, Laurence Marks wrote:

I know it's wrong, but I don't think it is forbidden (which I
guess is what you are saying).

On Tue, Feb 2, 2010 at 5:31 PM, Jeff Squyres  wrote:

+1 on Nick's responses.

AFAIK, if you don't mind getting garbage in the output file, it should be fine 
to do.  Specifically: it should not cause OS issues (crash, reboot, corrupted 
filesystem, etc.) to do this -- but the file contents will likely be garbage.

That being said, this situation likely falls into the "Doc, it hurts when I do 
this..." category.  Meaning: you know it's wrong, so you probably shouldn't be doing 
it anyway.  :-)


On Feb 2, 2010, at 4:50 PM, Nicolas Bock wrote:


Hi Laurence,

I don't know whether it's as bad as a deadly sin, but for us parallel writes 
are a huge problem and we get complete garbage in the file. Take a look at:

Implementing MPI-IO Atomic Mode and Shared File Pointers Using MPI One-Sided 
Communication, Robert Latham, Robert Ross, Rajeev Thakur, International Journal
of High Performance Computing Applications, 21, 132 (2007).

They describe an implementation of a "mutex"-like object in MPI. If you protect
writes to the file with an exclusive lock you can serialize the writes and make use of
NFS's close-to-open cache coherence.

nick


On Tue, Feb 2, 2010 at 08:27, Laurence Marks  wrote:
I have a question concerning having many processors in an MPI job all
write to the same file -- not using mpi calls but with standard
fortran I/O. I know that this can lead to consistency issues, but it
can also lead to OS issues with some flavors of nfs.

At least in fortran, there is nothing "wrong" with doing this. My
question is whether this is "One of the Seven Deadly Sins" of mpi
programming, or just frowned on. (That is, it should be OK even if it
leads to nonsense files, and not lead to OS issues.) If it is a sin, I
would appreciate a link to where this is spelt out in some "official"
document or similar.

--
Laurence Marks
Department of Materials Science and Engineering
MSE Rm 2036 Cook Hall
2220 N Campus Drive
Northwestern University
Evanston, IL 60208, USA
Tel: (847) 491-3996 Fax: (847) 491-7820
email: L-marks at northwestern dot edu
Web: www.numis.northwestern.edu
Chair, Commission on Electron Crystallography of IUCR
www.numis.northwestern.edu/
Electron crystallography is the branch of science that uses electron
scattering and imaging to study the structure of matter.



--
Jeff Squyres
jsquy...@cisco.com




Re: [OMPI users] exceedingly virtual memory consumption of MPI, environment if higher-setting "ulimit -s"

2009-12-02 Thread David Singleton


I think the issue is that if you *don't* specifically use
pthread_attr_setstacksize the pthread library will (can?) give
each thread a stack of size equal to the stacksize rlimit.

You are correct - this is not specifically an Open MPI issue
although if it is Open MPI spawning the threads, maybe it
should be actively setting pthread_attr_setstacksize.  But there
is an unfortunate tendency for users to set very large (or
unlimited) stacksize without realising these sorts of repercussions.
Try catching/looking at the stacktrace of an application that is in
infinite recursion with a 1GB+ stacksize after it finally gets a
SEGV ...
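For reference, a minimal sketch of what actively setting the thread stack size
looks like with the POSIX API (the 2 MB figure is just an illustrative choice,
not what Open MPI does):

#include <pthread.h>
#include <stdio.h>

static void *worker(void *arg)
{
    (void)arg;
    return NULL;
}

int main(void)
{
    pthread_attr_t attr;
    pthread_t tid;

    pthread_attr_init(&attr);
    /* Give the thread an explicit 2 MB stack instead of letting it
     * inherit the (possibly huge or unlimited) stacksize rlimit. */
    pthread_attr_setstacksize(&attr, 2 * 1024 * 1024);

    pthread_create(&tid, &attr, worker, NULL);
    pthread_join(tid, NULL);
    pthread_attr_destroy(&attr);

    printf("thread ran with a 2 MB stack\n");
    return 0;
}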


Jeff Squyres wrote:
I can't think of what OMPI would be doing related to the predefined 
stack size -- I am not aware of anywhere in the code where we look up 
the predefined stack size and then do something with it.


That being said, I don't know what the OS and resource consumption 
effects are of setting 1GB+ stack size on *any* application...  Have you 
tried non-MPI examples, potentially with applications as large as MPI 
applications but without the complexity of MPI?



On Nov 19, 2009, at 3:13 PM, David Singleton wrote:



Depending on the setup, threads often get allocated a thread local
stack with size equal to the stacksize rlimit.  Two threads maybe?

David

Terry Dontje wrote:
> A couple things to note.  First Sun MPI 8.2.1 is effectively OMPI
> 1.3.4.  I also reproduced the below issue using a C code so I think this
> is a general issue with OMPI and not Fortran based.
>
> I did a pmap of a process and there were two anon spaces equal to the
> stack space set by ulimit.
>
> In one case (setting 102400) the anon spaces were next to each other
> prior to all the loadable libraries.  In another case (setting 1024000)
> one anon space was located in the same area as the first case but the
> second space was deep into some memory used by ompi.
>
> Is any of this possibly related to the predefined handles?  Though I am
> not sure why it would expand based on stack size?.
>
> --td
>> Date: Thu, 19 Nov 2009 19:21:46 +0100
>> From: Paul Kapinos <kapi...@rz.rwth-aachen.de>
>> Subject: [OMPI users] exceedingly virtual memory consumption of MPI
>> environment if higher-setting "ulimit -s"
>> To: Open MPI Users <us...@open-mpi.org>
>> Message-ID: <4b058cba.3000...@rz.rwth-aachen.de>
>> Content-Type: text/plain; charset="iso-8859-1"; Format="flowed"
>>
>> Hi folks,
>>
>> we see exceedingly high *virtual* memory consumption by MPI processes
>> if "ulimit -s" (stack size) in the profile configuration was set higher.
>>
>> Furthermore, we believe every MPI process started wastes about double
>> the `ulimit -s` value that would be set in a fresh console
>> (that is, the value configured in e.g. .zshenv, *not* the value
>> actually set in the console from which mpiexec runs).
>>
>> Sun MPI 8.2.1, an empty MPI hello-world program
>> (the same happens even when running both processes on the same host).
>>
>> .zshenv: ulimit -s 10240   --> VmPeak:180072 kB
>> .zshenv: ulimit -s 102400  --> VmPeak:364392 kB
>> .zshenv: ulimit -s 1024000 --> VmPeak:2207592 kB
>> .zshenv: ulimit -s 2024000 --> VmPeak:4207592 kB
>> .zshenv: ulimit -s 2024 --> VmPeak:   39.7 GB
>> (see the attached files; the a.out binary is an MPI hello-world program
>> running a never-ending loop).
>>
>>
>>
>> Normally, we set the stack size ulimit to some 10 MB, but we see
>> a lot of codes which need *a lot* of stack space, e.g. Fortran codes,
>> OpenMP codes (and especially Fortran OpenMP codes). Users tend to
>> hard-code a higher stack size ulimit.
>>
>> Normally, using a lot of virtual memory is no problem, because
>> there is plenty of it :-) But if more than one person is
>> allowed to work on a computer, you have to divide the resources in
>> such a way that nobody can crash the box. We do not know how to limit
>> the real RAM used, so we divide the RAM by setting a
>> virtual memory ulimit (in our batch system, e.g.). That is, for us
>> "virtual memory consumption" = "real memory consumption",
>> and real memory is not as cheap as virtual memory.
>>
>>
>> So, why consume *twice* the stack size for each process?
>>
>> And why consume the virtual memory at all? We guess this virtual
>> memory is allocated for the stack (why else would it be related to the
>> stack size ulimit). But is such an allocation really needed? Is there a
>> way to avoid the

Re: [OMPI users] exceedingly virtual memory consumption of MPI, environment if higher-setting "ulimit -s"

2009-11-19 Thread David Singleton


Depending on the setup, threads often get allocated a thread local
stack with size equal to the stacksize rlimit.  Two threads maybe?

David

Terry Dontje wrote:
A couple things to note.  First Sun MPI 8.2.1 is effectively OMPI 
1.3.4.  I also reproduced the below issue using a C code so I think this 
is a general issue with OMPI and not Fortran based.


I did a pmap of a process and there were two anon spaces equal to the 
stack space set by ulimit.


In one case (setting 102400) the anon spaces were next to each other 
prior to all the loadable libraries.  In another case (setting 1024000) 
one anon space was located in the same area as the first case but the
second space was deep into some memory used by ompi.


Is any of this possibly related to the predefined handles?  Though I am 
not sure why it would expand based on stack size?.


--td

Date: Thu, 19 Nov 2009 19:21:46 +0100
From: Paul Kapinos 
Subject: [OMPI users] exceedingly virtual memory consumption of MPI
environment if higher-setting "ulimit -s"
To: Open MPI Users 
Message-ID: <4b058cba.3000...@rz.rwth-aachen.de>
Content-Type: text/plain; charset="iso-8859-1"; Format="flowed"

Hi folks,

we see exceedingly high *virtual* memory consumption by MPI processes
if "ulimit -s" (stack size) in the profile configuration was set higher.


Furthermore, we believe every MPI process started wastes about double the
`ulimit -s` value that would be set in a fresh console
(that is, the value configured in e.g. .zshenv, *not* the value
actually set in the console from which mpiexec runs).


Sun MPI 8.2.1, an empty MPI hello-world program
(the same happens even when running both processes on the same host).

.zshenv: ulimit -s 10240   --> VmPeak:180072 kB
.zshenv: ulimit -s 102400  --> VmPeak:364392 kB
.zshenv: ulimit -s 1024000 --> VmPeak:2207592 kB
.zshenv: ulimit -s 2024000 --> VmPeak:4207592 kB
.zshenv: ulimit -s 2024 --> VmPeak:   39.7 GB
(see the attached files; the a.out binary is an MPI hello-world program
running a never-ending loop).




Normally, we set the stack size ulimit to some 10 MB, but we see
a lot of codes which need *a lot* of stack space, e.g. Fortran codes,
OpenMP codes (and especially Fortran OpenMP codes). Users tend to
hard-code a higher stack size ulimit.


Normally, using a lot of virtual memory is no problem, because
there is plenty of it :-) But if more than one person is
allowed to work on a computer, you have to divide the resources in
such a way that nobody can crash the box. We do not know how to limit
the real RAM used, so we divide the RAM by setting a
virtual memory ulimit (in our batch system, e.g.). That is, for us

"virtual memory consumption" = "real memory consumption",
and real memory is not as cheap as virtual memory.


So, why consume *twice* the stack size for each process?

And why consume the virtual memory at all? We guess this virtual
memory is allocated for the stack (why else would it be related to the
stack size ulimit). But is such an allocation really needed? Is there a
way to avoid the waste of virtual memory?


best regards,
Paul Kapinos







Re: [OMPI users] custom modules per job (PBS/OpenMPI/environment-modules)

2009-11-17 Thread David Singleton


Hi Ralph,

Now I'm in a quandary - if I show you that it's actually Open MPI that is
propagating the environment then you are likely to "fix it" and then tm
users will lose a nice feature.  :-)

Can I suggest that "least surprise" would require that MPI tasks get
exactly the same environment/limits/... as mpirun so that "mpirun a.out"
behaves just like "a.out".  [Following this principle we modified
tm_spawn to propagate the caller's rlimits to the spawned tasks.]
A comment in orterun.c (see below) suggests that Open MPI is trying
to distinguish between "local" and "remote" processes.  I would have
thought that distinction should be invisible to users as much as possible
- a user asking for 4 cpus would like to see the same behaviour if all
4 are local or "2 local, 2 remote".

As to why tm does "The Right Thing": in the case of rsh/ssh the full
mpirun environment is given to the rsh/ssh process locally while in the tm
case it is an argument to tm_spawn and so gets given to the process (in
this case orted) being launched remotely. Relevant lines from 1.3.3 below.
PBS just passes along the environment it is told to.  We don't use Torque,
but as of 2.3.3 it was still the same as OpenPBS in this respect.

Michael just pointed out the slight flaw.  The environment should be
somewhat selectively propagated (exclude HOSTNAME etc).  I guess if you
were to "fix" plm_tm_module I would put the propagation behaviour in
tm_spawn and try to handle these exceptional cases.

Cheers,
David


orterun.c:

510 /* save the environment for launch purposes. This MUST be
511  * done so that we can pass it to any local procs we
512  * spawn - otherwise, those local procs won't see any
513  * non-MCA envars were set in the enviro prior to calling
514  * orterun
515  */
516 orte_launch_environ = opal_argv_copy(environ);


plm_rsh_module.c:

681 /* actually ssh the child */
682 static void ssh_child(int argc, char **argv,
683   orte_vpid_t vpid, int proc_vpid_index)
684 {

694 /* setup environment */
695 env = opal_argv_copy(orte_launch_environ);

766 execve(exec_path, exec_argv, env);


plm_tm_module.c:

128 static int plm_tm_launch_job(orte_job_t *jdata)
129 {

228 /* setup environment */
229 env = opal_argv_copy(orte_launch_environ);

311 rc = tm_spawn(argc, argv, env, node->launch_id, tm_task_ids + 
launched, tm_events + launched);



Ralph Castain wrote:

Not exactly. It completely depends on how Torque was set up - OMPI isn't
forwarding the environment. Torque is.

We made a design decision at the very beginning of the OMPI project not to 
forward non-OMPI envars unless directed to do so by the user. I'm afraid I 
disagree with Michael's claim that other MPIs do forward them - yes, MPICH 
does, but not all others do.

The world is bigger than MPICH and OMPI :-)

Since there is inconsistency in this regard between MPIs, we chose not to 
forward. Reason was simple: there is no way to know what is safe to forward vs 
what is not (e.g., what to do with DISPLAY), nor what the underlying 
environment is trying to forward vs what it isn't. It is very easy to get 
cross-wise and cause totally unexpected behavior, as users have complained 
about for years.

First, if you are using a managed environment like Torque, we recommend that 
you work with your sys admin to decide how to configure it. This is the best 
way to resolve a problem.

Second, if you are not using a managed environment and/or decide not to have that 
environment do the forwarding, you can tell OMPI to forward the envars you need by 
specifying them via the -x cmd line option. We already have a request to expand this 
capability, and I will be doing so as time permits. One option I'll be adding is the 
reverse of -x - i.e., "forward all envars -except- the specified one(s)".

HTH
ralph



Re: [OMPI users] custom modules per job (PBS/OpenMPI/environment-modules)

2009-11-17 Thread David Singleton


I can see the difference - we built Open MPI with tm support.  For some
reason, I thought mpirun fed its environment to orted (after orted is
launched) so orted can pass it on to MPI tasks.  That should be portable
between different launch mechanisms.  But it looks like tm launches
orted with the full mpirun environment (at the request of mpirun).

Cheers,
David


Michael Sternberg wrote:

Hi David,

Hmm, your demo is well-chosen and crystal-clear, yet the output is unexpected.  
I do not see environment vars passed by default here:


login3$ qsub -l nodes=2:ppn=1 -I
qsub: waiting for job 34683.mds01 to start
qsub: job 34683.mds01 ready

n102$ mpirun -n 2 -machinefile $PBS_NODEFILE hostname
n102
n085
n102$ mpirun -n 2 -machinefile $PBS_NODEFILE env | grep FOO
n102$ export FOO=BAR
n102$ mpirun -n 2 -machinefile $PBS_NODEFILE env | grep FOO
FOO=BAR
n102$ type mpirun
mpirun is hashed (/opt/soft/openmpi-1.3.2-intel10-1/bin/mpirun)


Curious, what do you get upon:

where mpirun


I built OpenMPI-1.3.2 here from source with:

CC=icc  CXX=icpc  FC=ifort  F77=ifort \
LDFLAGS='-Wl,-z,noexecstack' \
CFLAGS='-O2 -g -fPIC' \
CXXFLAGS='-O2 -g -fPIC' \
FFLAGS='-O2 -g -fPIC' \
./configure --prefix=$prefix \
--with-libnuma=/usr \
--with-openib=/usr \
--with-udapl \
--enable-mpirun-prefix-by-default \
--without-tm


I didn't find the behavior I saw strange, given that orterun(1) talks only about $OMPI_*
and inheritance from the remote shell.  It also mentions a "boot MCA module", 
about which I couldn't find much on open-mpi.org - hmm.


In the meantime, I did find a possible solution, namely, to tell ssh to pass a 
variable using SendEnv/AcceptEnv.  That variable is then seen by and can be 
interpreted (cautiously) in /etc/profile.d/ scripts.  A user could set it in 
the job file (or even qalter it post submission):

#PBS -v VARNAME=foo:bar:baz

For VARNAME, I think simply "MODULES" or "EXTRAMODULES" could do.


With best regards,
Michael



On Nov 17, 2009, at 4:29 , David Singleton wrote:

I'm not sure why you don't see Open MPI behaving like other MPIs w.r.t.
modules/environment on remote MPI tasks - we do.

xe:~ > qsub -q express -lnodes=2:ppn=8,walltime=10:00,vmem=2gb -I
qsub: waiting for job 376366.xepbs to start
qsub: job 376366.xepbs ready

[dbs900@x27 ~]$ module load openmpi
[dbs900@x27 ~]$ mpirun -n 2 --bynode hostname
x27
x28
[dbs900@x27 ~]$ mpirun -n 2 --bynode env | grep FOO
[dbs900@x27 ~]$ setenv FOO BAR
[dbs900@x27 ~]$ mpirun -n 2 --bynode env | grep FOO
FOO=BAR
FOO=BAR
[dbs900@x27 ~]$ mpirun -n 2 --bynode env | grep amber
[dbs900@x27 ~]$ module load amber
[dbs900@x27 ~]$ mpirun -n 2 --bynode env | grep amber
LOADEDMODULES=openmpi/1.3.3:amber/9
PATH=/apps/openmpi/1.3.3/bin:/home/900/dbs900/bin:/bin:/usr/bin::/opt/bin:/usr/X11R6/bin:/opt/pbs/bin:/sbin:/usr/sbin:/apps/amber/9/exe
_LMFILES_=/apps/Modules/modulefiles/openmpi/1.3.3:/apps/Modules/modulefiles/amber/9
AMBERHOME=/apps/amber/9
LOADEDMODULES=openmpi/1.3.3:amber/9
PATH=/apps/openmpi/1.3.3/bin:/home/900/dbs900/bin:/bin:/usr/bin:/opt/bin:/usr/X11R6/bin:/opt/pbs/bin:/sbin:/usr/sbin:/apps/amber/9/exe
_LMFILES_=/apps/Modules/modulefiles/openmpi/1.3.3:/apps/Modules/modulefiles/amber/9
AMBERHOME=/apps/amber/9

David


Michael Sternberg wrote:

Dear readers,
With OpenMPI, how would one go about requesting to load environment modules (of 
the http://modules.sourceforge.net/ kind) on remote nodes, augmenting those  
normally loaded there by shell dotfiles?
Background:
I run a RHEL-5/CentOS-5 cluster.  I load a bunch of default modules through 
/etc/profile.d/ and recommend to users to customize modules in ~/.bashrc.  A 
problem arises for PBS jobs which might need job-specific modules, e.g., to 
pick a specific flavor of an application.  With other MPI implementations 
(ahem) which export all (or judiciously nearly all) environment variables by 
default, you can say:
#PBS ...
module load foo # not for OpenMPI
mpirun -np 42 ... \
bar-app
Not so with OpenMPI - any such customization is only effective for processes on 
the master (=local) node of the job, and any variables changed by a given 
module would have to be specifically passed via mpirun -x VARNAME.   On the 
remote nodes, those variables are not available in the dotfiles because they 
are passed only once orted is live (after dotfile processing by the shell), 
which then immediately spawns the application binaries (right?)
I thought along the following lines:
(1) I happen to run Lustre, which would allow writing a file coherently across 
nodes prior to mpirun, and thus hook into the shell dotfile processing, but 
that seems rather crude.
(2) "mpirun -x PATH -x LD_LIBRARY_PATH …" would take care of a lot, but is not 
really general.
Is there a recommended way?
regards,
Michael


Re: [OMPI users] custom modules per job (PBS/OpenMPI/environment-modules)

2009-11-17 Thread David Singleton


Hi Michael,

I'm not sure why you don't see Open MPI behaving like other MPIs w.r.t.
modules/environment on remote MPI tasks - we do.

xe:~ > qsub -q express -lnodes=2:ppn=8,walltime=10:00,vmem=2gb -I
qsub: waiting for job 376366.xepbs to start
qsub: job 376366.xepbs ready

[dbs900@x27 ~]$ module load openmpi
[dbs900@x27 ~]$ mpirun -n 2 --bynode hostname
x27
x28
[dbs900@x27 ~]$ mpirun -n 2 --bynode env | grep FOO
[dbs900@x27 ~]$ setenv FOO BAR
[dbs900@x27 ~]$ mpirun -n 2 --bynode env | grep FOO
FOO=BAR
FOO=BAR
[dbs900@x27 ~]$ mpirun -n 2 --bynode env | grep amber
[dbs900@x27 ~]$ module load amber
[dbs900@x27 ~]$ mpirun -n 2 --bynode env | grep amber
LOADEDMODULES=openmpi/1.3.3:amber/9
PATH=/apps/openmpi/1.3.3/bin:/home/900/dbs900/bin:/bin:/usr/bin::/opt/bin:/usr/X11R6/bin:/opt/pbs/bin:/sbin:/usr/sbin:/apps/amber/9/exe
_LMFILES_=/apps/Modules/modulefiles/openmpi/1.3.3:/apps/Modules/modulefiles/amber/9
AMBERHOME=/apps/amber/9
LOADEDMODULES=openmpi/1.3.3:amber/9
PATH=/apps/openmpi/1.3.3/bin:/home/900/dbs900/bin:/bin:/usr/bin:/opt/bin:/usr/X11R6/bin:/opt/pbs/bin:/sbin:/usr/sbin:/apps/amber/9/exe
_LMFILES_=/apps/Modules/modulefiles/openmpi/1.3.3:/apps/Modules/modulefiles/amber/9
AMBERHOME=/apps/amber/9

David


Michael Sternberg wrote:

Dear readers,

With OpenMPI, how would one go about requesting to load environment modules (of 
the http://modules.sourceforge.net/ kind) on remote nodes, augmenting those  
normally loaded there by shell dotfiles?


Background:

I run a RHEL-5/CentOS-5 cluster.  I load a bunch of default modules through 
/etc/profile.d/ and recommend to users to customize modules in ~/.bashrc.  A 
problem arises for PBS jobs which might need job-specific modules, e.g., to 
pick a specific flavor of an application.  With other MPI implementations 
(ahem) which export all (or judiciously nearly all) environment variables by 
default, you can say:

#PBS ...

module load foo # not for OpenMPI

mpirun -np 42 ... \
bar-app

Not so with OpenMPI - any such customization is only effective for processes on 
the master (=local) node of the job, and any variables changed by a given 
module would have to be specifically passed via mpirun -x VARNAME.   On the 
remote nodes, those variables are not available in the dotfiles because they 
are passed only once orted is live (after dotfile processing by the shell), 
which then immediately spawns the application binaries (right?)

I thought along the following lines:

(1) I happen to run Lustre, which would allow writing a file coherently across 
nodes prior to mpirun, and thus hook into the shell dotfile processing, but 
that seems rather crude.

(2) "mpirun -x PATH -x LD_LIBRARY_PATH …" would take care of a lot, but is not 
really general.

Is there a recommended way?


regards,
Michael



[OMPI users] bug in MPI_Cart_create?

2009-10-13 Thread David Singleton


Looking back through the archives, a lot of people have hit error
messages like

> [bl302:26556] *** An error occurred in MPI_Cart_create
> [bl302:26556] *** on communicator MPI_COMM_WORLD
> [bl302:26556] *** MPI_ERR_ARG: invalid argument of some other kind
> [bl302:26556] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)

One of the reasons people *may* be hitting this is what I believe to
be an incorrect test in MPI_Cart_create():

    if (0 > reorder || 1 < reorder) {
        return OMPI_ERRHANDLER_INVOKE (old_comm, MPI_ERR_ARG,
                                       FUNC_NAME);
    }

reorder is a "logical" argument and "2.5.2 C bindings" in the MPI 1.3
standard says:

Logical flags are integers with value 0 meaning “false” and a
non-zero value meaning “true.”

So I'm not sure there should be any argument test.


We hit this because we (sorta erroneously) were trying to use a GNU build
of Open MPI with Intel compilers.  gfortran has true=1 while ifort has
true=-1.  It seems to all work (by luck, I know) except this test.  Are
there any other tests like this in Open MPI?

David
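To illustrate the point, a small sketch (not the actual Open MPI fix) showing
why the quoted 0/1 range check rejects a perfectly legal value, and what
treating any non-zero value as "true", per the standard's wording, would look
like:

#include <stdio.h>

/* Hypothetical helper: the standard says any non-zero integer is "true",
 * so normalize the flag rather than range-check it. */
static int logical_to_bool(int flag)
{
    return flag != 0;
}

int main(void)
{
    int gfortran_true = 1;    /* gfortran passes .true. as 1  */
    int ifort_true    = -1;   /* ifort passes .true. as -1    */

    /* The quoted range check rejects ifort's .true.: */
    printf("range check rejects ifort .true.? %s\n",
           (0 > ifort_true || 1 < ifort_true) ? "yes" : "no");

    printf("normalized: gfortran=%d, ifort=%d\n",
           logical_to_bool(gfortran_true), logical_to_bool(ifort_true));
    return 0;
}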


[OMPI users] PBS tm error returns

2009-08-13 Thread David Singleton


Maybe this should go to the devel list but I'll start here.

In tracking the way the PBS tm API propagates error information
back to clients, I noticed that Open MPI is making an incorrect
assumption.  (I'm looking 1.3.2.) The relevant code in
orte/mca/plm/tm/plm_tm_module.c is:

    /* TM poll for all the spawns */
    for (i = 0; i < launched; ++i) {
        rc = tm_poll(TM_NULL_EVENT, &event, 1, &local_err);
        if (TM_SUCCESS != rc) {
            errno = local_err;
            opal_output(0, "plm:tm: failed to poll for a spawned daemon,"
                        " return status = %d", rc);
            goto cleanup;
        }
    }

My reading of the way the tm API works is that tm_poll() can (will)
return TM_SUCCESS(0) even when the tm_spawn event being waited on failed,
i.e. local_err needs to be checked even if rc=0.  It looks like TM_
errors (rc values) are from tm protocol failures or incorrect calls
to tm.  local_err is to do with why the actual requested action failed
and is usually some sort of internal PBSE_ error code.  In fact it's
probably always PBSE_SYSTEM (15010) - I think it is for tm_spawn().

Something like the following is probably closer to what is needed.

    /* TM poll for all the spawns */
    for (i = 0; i < launched; ++i) {
        rc = tm_poll(TM_NULL_EVENT, &event, 1, &local_err);
        if (TM_SUCCESS != rc) {
            errno = local_err;
            opal_output(0, "plm:tm: failed to poll for a spawned daemon,"
                        " return status = %d", rc);
            goto cleanup;
        }
        if (local_err != 0) {
            errno = local_err;
            opal_output(0, "plm:tm: failed to spawn daemon,"
                        " error code = %d", errno);
            goto cleanup;
        }
    }

I checked Torque 2.3.3 to confirm that its tm behaviour is the same as
OpenPBS in this respect. No idea about PBSPro.


David


Re: [OMPI users] pgi and gcc runtime compatability

2008-12-07 Thread David Singleton


I seem to remember Fortran logicals being represented differently in
PGI than in other Fortran compilers (1 vs -1 maybe - can't remember).  That causes
grief with things like MPI_Test.

David

Brock Palen wrote:
I did something today that I was happy worked, but I want to know if
anyone has had problems with it.


At runtime (not compile time), would an Open MPI built with PGI work to run
a code that was compiled against the same version of Open MPI but built with gcc?
I tested a few apps today after I accidentally did this and found it
worked.  They were all C/C++ apps (NAMD and GROMACS), but what about
Fortran apps?  Should we expect problems if someone does this?


I am not going to encourage this, but it is more of an if-needed thing.


Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
bro...@umich.edu
(734)936-1985







[OMPI users] job abort on MPI task exit

2008-10-27 Thread David Singleton


Apologies if this has been covered in a previous thread - I
went back through a lot of posts without seeing anything
similar.

In an attempt to protect some users from themselves, I was hoping
that OpenMPI could be configured so that an MPI task calling
exit before calling MPI_Finalize() would cause job cleanup, i.e.
behave effectively as if MPI_Abort() was called.  The reason is
that many users don't realise they need to use MPI_Abort()
instead of Fortran stop or C exit.  If exit is called, all
other processes get stuck in the next blocking call and, for a
large walltime limit batch job, that can be a real waste of
resources.

I think LAM terminated the job if a task exited with non-zero
exit status or due to a signal. OpenMPI appears to clean up
only in the case of a signalled task.  Ideally, any exit before
MPI_Finalize() should be terminal.  Why is this not the case?

Thanks,
David
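For anyone wanting to reproduce the behaviour, a minimal hypothetical test case
(not from the original post): rank 1 calls exit() before MPI_Finalize(), and
rank 0 then sits in a blocking receive until the batch walltime limit expires.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, dummy = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 1) {
        /* "Error path": exits without calling MPI_Abort() or MPI_Finalize(). */
        fprintf(stderr, "rank 1 bailing out\n");
        exit(1);
    }

    /* Rank 0 blocks here indefinitely unless the runtime cleans up the job. */
    MPI_Recv(&dummy, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Finalize();
    return 0;
}

(Run with two ranks, e.g. mpirun -np 2 ./a.out.)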