Re: [OMPI users] Tmpdir work for first process only

2007-11-15 Thread Aurelien Bouteiller

Hi Clement,

First, if you run 400 processes on 16 nodes you will end up with around 25
processes on each node. Depending on the memory footprint of the
application, it may fail because of memory exhaustion. Usually I am
able to oversubscribe up to 64 NAS class B processes on 2GB, and fewer
than 16 class C.
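As a rough check on the numbers in this paragraph, the arithmetic can be sketched in Python. It assumes the processes are spread evenly over the nodes and that "2GB" means 2048 MiB per node; the variable names are illustrative, not anything Open MPI defines.

```python
# Rough oversubscription arithmetic. The counts come from the message above;
# the even spread and the 2048 MiB node size are assumptions.
total_procs = 400            # processes requested with -np 400
nodes = 16                   # nodes in the allocation

procs_per_node = total_procs / nodes

node_memory_mib = 2 * 1024   # "2GB" read as 2048 MiB
class_b_procs = 64           # class B processes that fit on such a node
class_c_procs = 16           # upper bound quoted for class C

budget_b_mib = node_memory_mib / class_b_procs
budget_c_mib = node_memory_mib / class_c_procs

print(f"processes per node: {procs_per_node:g}")
print(f"class B memory budget per process: {budget_b_mib:g} MiB")
print(f"class C memory budget per process: more than {budget_c_mib:g} MiB")
```

Under these assumptions an even spread gives 25 processes per node, and the class B figure corresponds to a budget of about 32 MiB per process.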


About your initial problem: tmpdir is the temporary directory for the
orte seed only. As you discovered, this parameter is ignored by all
the other processes. However, you can use the TMPDIR environment
variable to set the tmpdir on every Open MPI process. Just use mpirun
-x TMPDIR=/some/where to set it.
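A minimal sketch of the command line this suggests, written in Python so the arguments are easy to inspect before launching. It relies on Open MPI's `-x` option for exporting an environment variable to the launched processes; the helper name `mpirun_with_tmpdir` is hypothetical, and `/some/where` and `./testprogram` are the placeholder path and the program name used in this thread.

```python
import shlex


def mpirun_with_tmpdir(tmpdir, nprocs, program):
    """Build an mpirun argument list that exports TMPDIR to every process.

    Hypothetical helper for illustration only; it builds the command
    but does not run it.
    """
    return [
        "mpirun",
        "-x", f"TMPDIR={tmpdir}",  # export TMPDIR to the launched processes
        "-np", str(nprocs),
        program,
    ]


cmd = mpirun_with_tmpdir("/some/where", 400, "./testprogram")
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd)` would launch the job; printing it first makes it easy to confirm that TMPDIR is being exported.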


Regards,
Aurelien


On 15 Nov 2007, at 07:04, Clement Kam Man Chu wrote:

Thanks for your reply.  I am using the PBS job scheduler and I requested 16
CPUs to run 400 processes, but I don't know how many processes are
allocated on each CPU.  Do you think it is a problem?

Clement

Jeff Squyres wrote:

Are you running all of these processes on the same machine, or
multiple different machines?

If you're running 400 processes on the same machine, it may well be
that you are simply running out of memory or other OS resources.  In
particular, I've never seen iof fail that way before (iof is our I/O
forwarding subsystem).

Looking at the iof code, the error you're seeing occurs when iof is
trying to create a pipe between our OMPI "helper daemon" and the newly
spawned user executable and fails.  The only reason that I can guess
for why that would happen is if a max limit of pipes has been created
on a machine and the OS refuses to create any more...?



Re: [OMPI users] Tmpdir work for first process only

2007-11-15 Thread Clement Kam Man Chu

Thanks for your reply.  I am using the PBS job scheduler and I requested 16
CPUs to run 400 processes, but I don't know how many processes are allocated
on each CPU.  Do you think it is a problem?

Clement

Jeff Squyres wrote:

Are you running all of these processes on the same machine, or
multiple different machines?

If you're running 400 processes on the same machine, it may well be
that you are simply running out of memory or other OS resources.  In
particular, I've never seen iof fail that way before (iof is our I/O
forwarding subsystem).

Looking at the iof code, the error you're seeing occurs when iof is
trying to create a pipe between our OMPI "helper daemon" and the newly
spawned user executable and fails.  The only reason that I can guess
for why that would happen is if a max limit of pipes has been created
on a machine and the OS refuses to create any more...?




Re: [OMPI users] Tmpdir work for first process only

2007-11-15 Thread Jeff Squyres
Are you running all of these processes on the same machine, or
multiple different machines?

If you're running 400 processes on the same machine, it may well be
that you are simply running out of memory or other OS resources.  In
particular, I've never seen iof fail that way before (iof is our I/O
forwarding subsystem).

Looking at the iof code, the error you're seeing occurs when iof is
trying to create a pipe between our OMPI "helper daemon" and the newly
spawned user executable and fails.  The only reason that I can guess
for why that would happen is if a max limit of pipes has been created
on a machine and the OS refuses to create any more...?
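The kind of failure guessed at here can be reproduced in isolation. The sketch below is only an illustration, assuming a POSIX system and assuming that the limit in play is the per-process open-file limit (RLIMIT_NOFILE), which nothing in this thread confirms. It temporarily lowers that limit, opens pipes until the OS refuses, then restores the old limit. The function name `pipes_until_failure` and the limit of 64 are made up for the example.

```python
import errno
import os
import resource


def pipes_until_failure(soft_limit=64):
    """Lower the descriptor limit, then open pipes until the OS refuses.

    Returns (number of pipes created, errno of the failure).
    """
    old_soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if old_soft != resource.RLIM_INFINITY:
        soft_limit = min(soft_limit, old_soft)  # never raise the limit
    resource.setrlimit(resource.RLIMIT_NOFILE, (soft_limit, hard))

    fds = []
    failure = None
    try:
        while True:
            fds.extend(os.pipe())  # each pipe consumes two descriptors
    except OSError as exc:
        failure = exc.errno
    finally:
        for fd in fds:
            os.close(fd)
        # Put the original soft limit back for the rest of the process.
        resource.setrlimit(resource.RLIMIT_NOFILE, (old_soft, hard))
    return len(fds) // 2, failure


count, err = pipes_until_failure()
print(f"created {count} pipes before failing with {errno.errorcode.get(err)}")
```

A daemon that opens several pipes for every process it spawns uses up descriptors in the same way, so comparing `ulimit -n` on the node with the number of local processes is a cheap first check.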




On Nov 14, 2007, at 9:36 PM, Clement Kam Man Chu wrote:


Hi,

I have figured out why the tmpdir parameter works only for the first
process. I hit another problem when I tried to run 400 processes (there
is no problem with fewer than 400 processes). I got the error
"ORTE_ERROR_LOG: Out of resource in file base/iof_base_setup.c at line
106". I have attached the messages below:

[ac27:12442] [0,0,0] setting up session dir with
[ac27:12442] tmpdir /jobfs/z07/247752.ac-pbs
[ac27:12442] universe default-universe-12442
[ac27:12442] user kxc565
[ac27:12442] host ac27
[ac27:12442] jobid 0
[ac27:12442] procid 0
[ac27:12442] procdir: /jobfs/z07/247752.ac-pbs/openmpi-sessions-kxc565@ac27_0/default-universe-12442/0/0
[ac27:12442] jobdir: /jobfs/z07/247752.ac-pbs/openmpi-sessions-kxc565@ac27_0/default-universe-12442/0
[ac27:12442] unidir: /jobfs/z07/247752.ac-pbs/openmpi-sessions-kxc565@ac27_0/default-universe-12442
[ac27:12442] top: openmpi-sessions-kxc565@ac27_0
[ac27:12442] tmp: ??
[ac27:12442] [0,0,0] contact_file /jobfs/z07/247752.ac-pbs/openmpi-sessions-kxc565@ac27_0/default-universe-12442/universe-setup.txt
[ac27:12442] [0,0,0] wrote setup file
[ac27:12447] [0,0,1] setting up session dir with
[ac27:12447] universe default-universe-12442
[ac27:12447] user kxc565
[ac27:12447] host ac27
[ac27:12447] jobid 0
[ac27:12447] procid 1
[ac27:12447] procdir: /jobfs/z07/247752.ac-pbs/openmpi-sessions-kxc565@ac27_0/default-universe-12442/0/1
[ac27:12447] jobdir: /jobfs/z07/247752.ac-pbs/openmpi-sessions-kxc565@ac27_0/default-universe-12442/0
[ac27:12447] unidir: /jobfs/z07/247752.ac-pbs/openmpi-sessions-kxc565@ac27_0/default-universe-12442
[ac27:12447] top: openmpi-sessions-kxc565@ac27_0
[ac27:12447] tmp: /jobfs/z07/247752.ac-pbs
[ac27:12447] [0,0,1] ORTE_ERROR_LOG: Out of resource in file base/iof_base_setup.c at line 106
[ac27:12447] [0,0,1] ORTE_ERROR_LOG: Out of resource in file odls_default_module.c at line 663
[ac27:12447] [0,0,1] ORTE_ERROR_LOG: Out of resource in file odls_default_module.c at line 1191
[ac27:12447] [0,0,1] ORTE_ERROR_LOG: Out of resource in file orted.c at line 594
[ac27:12442] spawn: in job_state_callback(jobid = 1, state = 0x80)
mpirun noticed that job rank 0 with PID 0 on node ac27 exited on signal 15 (Terminated).
[ac27:12447] sess_dir_finalize: job session dir not empty - leaving
[ac27:12447] sess_dir_finalize: proc session dir not empty - leaving
[ac27:12442] sess_dir_finalize: proc session dir not empty - leaving


Thanks,
Clement

Re: [OMPI users] Tmpdir work for first process only

2007-11-15 Thread Clement Kam Man Chu

Hi,

I have figured out why the tmpdir parameter works only for the first
process. I hit another problem when I tried to run 400 processes (there
is no problem with fewer than 400 processes). I got the error
"ORTE_ERROR_LOG: Out of resource in file base/iof_base_setup.c at line
106". I have attached the messages below:


[ac27:12442] [0,0,0] setting up session dir with
[ac27:12442] tmpdir /jobfs/z07/247752.ac-pbs
[ac27:12442] universe default-universe-12442
[ac27:12442] user kxc565
[ac27:12442] host ac27
[ac27:12442] jobid 0
[ac27:12442] procid 0
[ac27:12442] procdir: /jobfs/z07/247752.ac-pbs/openmpi-sessions-kxc565@ac27_0/default-universe-12442/0/0
[ac27:12442] jobdir: /jobfs/z07/247752.ac-pbs/openmpi-sessions-kxc565@ac27_0/default-universe-12442/0
[ac27:12442] unidir: /jobfs/z07/247752.ac-pbs/openmpi-sessions-kxc565@ac27_0/default-universe-12442
[ac27:12442] top: openmpi-sessions-kxc565@ac27_0
[ac27:12442] tmp: ??
[ac27:12442] [0,0,0] contact_file /jobfs/z07/247752.ac-pbs/openmpi-sessions-kxc565@ac27_0/default-universe-12442/universe-setup.txt
[ac27:12442] [0,0,0] wrote setup file
[ac27:12447] [0,0,1] setting up session dir with
[ac27:12447] universe default-universe-12442
[ac27:12447] user kxc565
[ac27:12447] host ac27
[ac27:12447] jobid 0
[ac27:12447] procid 1
[ac27:12447] procdir: /jobfs/z07/247752.ac-pbs/openmpi-sessions-kxc565@ac27_0/default-universe-12442/0/1
[ac27:12447] jobdir: /jobfs/z07/247752.ac-pbs/openmpi-sessions-kxc565@ac27_0/default-universe-12442/0
[ac27:12447] unidir: /jobfs/z07/247752.ac-pbs/openmpi-sessions-kxc565@ac27_0/default-universe-12442
[ac27:12447] top: openmpi-sessions-kxc565@ac27_0
[ac27:12447] tmp: /jobfs/z07/247752.ac-pbs
[ac27:12447] [0,0,1] ORTE_ERROR_LOG: Out of resource in file base/iof_base_setup.c at line 106
[ac27:12447] [0,0,1] ORTE_ERROR_LOG: Out of resource in file odls_default_module.c at line 663
[ac27:12447] [0,0,1] ORTE_ERROR_LOG: Out of resource in file odls_default_module.c at line 1191
[ac27:12447] [0,0,1] ORTE_ERROR_LOG: Out of resource in file orted.c at line 594
[ac27:12442] spawn: in job_state_callback(jobid = 1, state = 0x80)
mpirun noticed that job rank 0 with PID 0 on node ac27 exited on signal 15 (Terminated).
[ac27:12447] sess_dir_finalize: job session dir not empty - leaving
[ac27:12447] sess_dir_finalize: proc session dir not empty - leaving
[ac27:12442] sess_dir_finalize: proc session dir not empty - leaving


Thanks,
Clement

Clement Kam Man Chu wrote:

Hi,

I am using Open MPI 1.2.3 on an ia64 machine. I typed "mpirun -d --tmpdir
/home/565/kxc565/tmpdir -mca btl sm -np 400 ./testprogram". I found that
only the first process uses my parameter setting to store its tmp files;
the second process uses its default setting and stores its tmp files in
the /tmp directory. How can I make all processes store their files in the
directory I specify? I have attached the messages from Open MPI for more
details. Thanks for any help.


Cheers,
Clement


[ac27:27928] [0,0,0] setting up session dir with
[ac27:27928] tmpdir /home/565/kxc565/tmpdir
[ac27:27928] universe default-universe-27928
[ac27:27928] user kxc565
[ac27:27928] host ac27
[ac27:27928] jobid 0
[ac27:27928] procid 0
[ac27:27928] procdir: /home/565/kxc565/tmpdir/openmpi-sessions-kxc565@ac27_0/default-universe-27928/0/0
[ac27:27928] jobdir: /home/565/kxc565/tmpdir/openmpi-sessions-kxc565@ac27_0/default-universe-27928/0
[ac27:27928] unidir: /home/565/kxc565/tmpdir/openmpi-sessions-kxc565@ac27_0/default-universe-27928
[ac27:27928] top: openmpi-sessions-kxc565@ac27_0
[ac27:27928] tmp: ?
[ac27:27928] [0,0,0] contact_file /home/565/kxc565/tmpdir/openmpi-sessions-kxc565@ac27_0/default-universe-27928/universe-setup.txt
[ac27:27928] [0,0,0] wrote setup file
[ac27:27932] [0,0,1] setting up session dir with
[ac27:27932] universe default-universe-27928
[ac27:27932] user kxc565
[ac27:27932] host ac27
[ac27:27932] jobid 0
[ac27:27932] procid 1
[ac27:27932] procdir: /tmp/openmpi-sessions-kxc565@ac27_0/default-universe-27928/0/1
[ac27:27932] jobdir: /tmp/openmpi-sessions-kxc565@ac27_0/default-universe-27928/0
[ac27:27932] unidir: /tmp/openmpi-sessions-kxc565@ac27_0/default-universe-27928
[ac27:27932] top: openmpi-sessions-kxc565@ac27_0
[ac27:27932] tmp: /tmp
[ac27:27932] [0,0,1] ORTE_ERROR_LOG: Out of resource in file base/iof_base_setup.c at line 106
[ac27:27932] [0,0,1] ORTE_ERROR_LOG: Out of resource in file odls_default_module.c at line 663
[ac27:27932] [0,0,1] ORTE_ERROR_LOG: Out of resource in file odls_default_module.c at line 1191
[ac27:27932] [0,0,1] ORTE_ERROR_LOG: Out of resource in file orted.c at line 594
[ac27:27928] spawn: in job_state_callback(jobid = 1, state = 0x80)
mpirun noticed that job rank 0 with PID 0 on node ac27 exited on signal 15 (Terminated).
[ac27:27932] sess_dir_finalize: job session dir not empty - leaving
[ac27:27932] sess_dir_finalize: proc session dir not empty - leaving
[ac27:27928] sess_dir_finalize: proc session dir not empty - leaving

  



--
Clement Kam Man Chu
Research