Re: [OMPI users] Tmpdir work for first process only
Hi Clement,

First, if you run 400 processes on 16 nodes you will end up with around 25 processes on each node. Depending on the memory footprint of the application, it can fail simply because of memory exhaustion. Usually I am able to oversubscribe up to 64 NAS class B processes on 2 GB, and fewer than 16 class C.

About your initial problem: tmpdir is the temporary directory for the ORTE seed only. As you discovered, this parameter is ignored by all the other processes. However, you can use the TMPDIR environment variable to set the tmpdir on every Open MPI process. Just use "mpirun -x TMPDIR=/some/where" to set it.

Regards,
Aurelien
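A minimal sketch of that suggestion applied to the command used in this thread (the TMPDIR value is a placeholder, and whether every daemon honors it on your system should be verified rather than assumed):

    # export TMPDIR to every process that mpirun starts
    mpirun -x TMPDIR=/some/where -mca btl sm -np 400 ./testprogram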
[OMPI users] MPI daemons suspend running job
Hi,

I am encountering a problem where a working child process freezes in the middle of its work and continues only when its parent process (which spawned it earlier) calls some MPI function. The issue is that, in order to accept client socket connections, the parent process is at the same time performing a select() on a server socket. It expects the child to finish its work, which will be collected and processed later. This cannot happen, because the child does not continue until the parent process calls Recv(), Probe(), or sends something itself.

Is it possible that some MPI daemon freezes the child process when its parent process goes to "sleep" while listening on a socket? If so, how can I avoid it? These are independent processes and, even though they are interlinked, they should not influence each other in this way.

Regards,
Murat
Re: [OMPI users] Suggestions on multi-compiler/multi-mpi build?
Really, modules is the only way to go; we use it to maintain no fewer than 12 versions of Open MPI compiled with PGI, GCC, NAG, and Intel.

Yes, as the other reply said, just set LD_LIBRARY_PATH. You can use "ldd executable" to see which libraries the executable will use under the current environment.

We also have users submit to Torque with #PBS -V, which copies the current environment along with the job. This allows users to submit and have multiple jobs running, each with its own environment. If you're using PBS or Torque, look at the qsub man page.

Another addition we came up with: after we load a default set of modules (PBS, a compiler, an MPI library), we look for a file in $HOME/privatemodules/default, which just has a list of module commands. This allows users to change their default modules at every login. For example, my default file contains "module load git", because I use git for managing files.

We have done other hacks; our software web page is built out of cron from the module files: http://cac.engin.umich.edu/resources/systems/nyxV2/software.html So we don't maintain software lists online, we just generate them dynamically. Modules is the admin's best friend.

Brock Palen
Center for Advanced Computing
bro...@umich.edu
(734) 936-1985

On Nov 15, 2007, at 8:54 AM, Jim Kusznir wrote:

Hi all: I'm trying to set up a cluster for a group of users with very different needs. So far, it looks like I need gcc, pgi, and intel to work with openmpi and mpich, with each user able to control which combination they get. This is turning out to be much more difficult than I expected. Someone has pointed me to environment-modules ("Modules"), which looks like it will be a critical part of the solution. I even noticed that the provided openmpi.spec file has some direct support for modules, which makes me quite happy! However, I still have many questions about how to set things up.

First, I get the impression that openmpi will need to be compiled with each compiler that will use it. If this is true, I'm not quite sure how to go about it. I could install in different directories for the user commands, but what about the libraries? I don't think I have a feasible way of selecting which library to use on the fly on the entire cluster for each user, so it seems it would be better to have all the libraries available. In addition, I will need RPMs to deploy efficiently on the cluster. I suspect I can change the versioning info and build with each compiler, but at this point, I don't even know how to reliably select which compiler rpmbuild will use (I've only succeeded in using gcc).

Finally, using modules, how do I set it up so that if a user changes compilers but stays with openmpi, it will load the correct openmpi paths? I know I can set up the openmpi module file to load after the compiler module and, based on that, select different paths depending on the currently loaded compiler module. If the user changes the compiler module, will that cause the mpi module to also be reloaded so the new settings are picked up? Or do I need this at all?

Thanks for all your help! --Jim
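A hedged sketch of the two conventions described above (module names, node counts, and the application name are made up for illustration and are not Brock's actual configuration):

    # $HOME/privatemodules/default -- extra module commands run at every login
    module load git

    # fragment of a Torque/PBS submission script
    #PBS -V                    # copy the submitting shell's environment (loaded modules) into the job
    #PBS -l nodes=2:ppn=4
    cd $PBS_O_WORKDIR
    mpirun ./myapp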
Re: [MTT users] unable to pull tests
Jeff Squyres wrote:
> Done.
>
> FWIW, we don't give out access to the ompi-tests repository to anyone except those on the Open MPI development team. The repository contains a bunch of MPI test suites that are all publicly available, but we've never bothered to look into redistribution rights.
>
> On Nov 14, 2007, at 3:34 PM, Karol Mroz wrote:
>> Ethan Mallove wrote:
>>> Do you have access to the ompi-tests repository? What happens if you do this command outside of MTT?
>>>
>>> $ svn export https://svn.open-mpi.org/svn/ompi-tests/trunk/onesided
>>>
>>> You could also try using "http" instead of "https" in your svn_url. -Ethan
>>
>> Hi, Ethan. So I just recently received access to the Open MPI devel repository (ompi), but clearly this did not include access to ompi-tests :) I'll email Jeff, as he was responsible for granting me access to ompi. Thanks for the help!

Thanks very much, Jeff.
--
Karol Mroz
km...@cs.ubc.ca
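For reference, the manual check Ethan describes is simply (the URL is the one from the thread; whether the plain-http scheme is enabled on that server is an assumption):

    # try the export by hand, outside of MTT
    svn export https://svn.open-mpi.org/svn/ompi-tests/trunk/onesided

    # if https is the problem, try the same URL with http in the MTT svn_url setting
    svn export http://svn.open-mpi.org/svn/ompi-tests/trunk/onesided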
Re: [OMPI users] Suggestions on multi-compiler/multi-mpi build?
We have almost exactly the situation you describe with our clusters. I'm not the system administrator, but I am the one who actually writes the module scripts.

It is best to compile Open MPI (and any other MPI) with each compiler separately; this is especially necessary for the Fortran and C++ bindings. We simply have a directory layout like /opt/openmpi/intel, /opt/openmpi/pgi, /opt/openmpi/gnu, etc. To compile Open MPI with a different compiler, you can use environment variables such as (for the Intel compiler) FC=ifort F90=ifort CC=icc CXX=icpc. You set these within the %build part of the rpm spec file. Similarly for MPICH2. (I would recommend MPICH2 over MPICH1 at this point.)

Users load a module to select a compiler+MPI combination. Loading the appropriate module sets the path for the version of mpicc/mpicxx/mpif90 chosen by the user; that in turn knows where its associated libraries live. Most users link the MPI libraries statically. If dynamic libraries are needed, they can be added to LD_LIBRARY_PATH by the module script.

I am away from my office until after the Thanksgiving holiday, but if you email me personally after that, I can send you some sample module scripts. I can also send you some sample spec files for the rpms we use.
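A minimal sketch of the %build idea described above for an Intel-compiled Open MPI (the prefix and configure line are illustrative assumptions, and the exact Fortran variable names expected by configure vary between MPI implementations and versions):

    %build
    # select the Intel compilers for this particular build
    export CC=icc CXX=icpc FC=ifort F90=ifort
    ./configure --prefix=/opt/openmpi/intel
    make %{?_smp_mflags}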
Re: [OMPI users] Tmpdir work for first process only
Thanks for your reply. I am using the PBS job scheduler and I requested 16 CPUs to run 400 processes, but I don't know how many processes are allocated to each CPU. Do you think that is the problem?

Clement
Re: [OMPI users] Error on running large number of processes
My guess is that this is similar to the last post: you are oversubscribing the nodes so heavily that the OS is running out of some resource (perhaps regular or registered memory?), such that Open MPI is unable to set up its network transport layers properly.

--
Jeff Squyres
Cisco Systems
Re: [OMPI users] Tmpdir work for first process only
Are you running all of these processes on the same machine, or on multiple different machines?

If you're running 400 processes on the same machine, it may well be that you are simply running out of memory or other OS resources. In particular, I've never seen iof fail that way before (iof is our I/O forwarding subsystem). Looking at the iof code, the error you're seeing occurs when iof tries to create a pipe between our OMPI "helper daemon" and the newly spawned user executable and fails. The only reason I can guess for why that would happen is if the maximum number of pipes has already been created on the machine and the OS refuses to create any more...?
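For what it's worth, a quick way to check whether a descriptor or process limit is what the daemon is hitting (Linux-specific; which limit actually applies on this particular ia64 system is an assumption):

    ulimit -n                    # per-process limit on open file descriptors (each pipe consumes two)
    ulimit -u                    # per-user limit on the number of processes
    cat /proc/sys/fs/file-max    # system-wide open-file limit on Linux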
[OMPI users] Error on running large number of processes
Hi,

I am using openmpi 1.2.3 on an ia64 machine with the PBS job scheduler. I can successfully run 100 processes on 16 CPUs, but I get an error if I run 200 processes on the same number of CPUs. The error is:

PML add procs failed
--> Returned "Temporarily out of resource" (-3) instead of "Success" (0)

Please help.

Regards,
Clement
--
Clement Kam Man Chu
Research Assistant
Faculty of Information Technology
Monash University, Caulfield Campus
Ph: 61 3 9903 2355
Re: [OMPI users] OpenMPI - compilation
On Nov 14, 2007, at 10:48 PM, Sajjad wrote:
> No, I didn't find any executable after I issued the command "mpicc mpitest1.c -o mpitest1"

If you're not finding the executable at all, then something else is very wrong. The "mpicc" command is just a "wrapper" compiler, meaning that it takes your command line, adds some more flags, and then invokes the underlying compiler (e.g., gcc). You can use the "--showme" flag to see exactly what command mpicc actually invokes:

mpicc mpitest1.c -o mpitest1 --showme

Then try running the command that that shows manually and see what happens. If your compiler is not producing executables at all, then you have some other (non-MPI) system-level issue.

> And sorry for dumping such an irrelevant chunk of data to the mailing list.

We actually request the output from ompi_info from users who are having run-time problems. See the "getting help" page on the OMPI web site.

Also, I still strongly recommend upgrading to the latest stable version of Open MPI if possible. The version you have (v1.1) should not be responsible for you not being able to create executables, though -- you may need to fix that independently of upgrading Open MPI.

--
Jeff Squyres
Cisco Systems
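A minimal, hedged way to act on that advice (hello.c is an invented file name used only to rule out a general toolchain problem):

    # 1. See what command the wrapper would actually run, then try that command by hand:
    mpicc mpitest1.c -o mpitest1 --showme

    # 2. Check that the underlying compiler can produce an executable at all:
    cat > hello.c <<'EOF'
    int main(void) { return 0; }
    EOF
    gcc hello.c -o hello && ls -l hello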
[OMPI users] OpenMPI - compilation
Hello Jeff,

No, I didn't find any executable after I issued the command "mpicc mpitest1.c -o mpitest1".

And sorry for dumping such an irrelevant chunk of data to the mailing list.

Sajjad
[OMPI users] OpenMPI - compilation
Hello Jeff,

I thought the following information might be helpful in tracking down the issue.

***

sajjad@sajjad:~$ ompi_info
Open MPI: 1.1
Open MPI SVN revision: r10477
Open RTE: 1.1
Open RTE SVN revision: r10477
OPAL: 1.1
OPAL SVN revision: r10477
Prefix: /usr
Configured architecture: x86_64-pc-linux-gnu
Configured by: buildd
Configured on: Tue May 1 15:04:40 GMT 2007
Configure host: yellow
Built by: buildd
Built on: Tue May 1 15:19:17 GMT 2007
Built host: yellow
C bindings: yes
C++ bindings: yes
Fortran77 bindings: yes (single underscore)
Fortran90 bindings: yes
Fortran90 bindings size: small
C compiler: gcc
C compiler absolute: /usr/bin/gcc
C++ compiler: g++
C++ compiler absolute: /usr/bin/g++
Fortran77 compiler: gfortran
Fortran77 compiler abs: /usr/bin/gfortran
Fortran90 compiler: gfortran
Fortran90 compiler abs: /usr/bin/gfortran
C profiling: yes
C++ profiling: yes
Fortran77 profiling: yes
Fortran90 profiling: yes
C++ exceptions: no
Thread support: posix (mpi: no, progress: no)
Internal debug support: no
MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
libltdl support: yes
MCA memory: ptmalloc2 (MCA v1.0, API v1.0, Component v1.1)
MCA paffinity: linux (MCA v1.0, API v1.0, Component v1.1)
MCA maffinity: first_use (MCA v1.0, API v1.0, Component v1.1)
MCA timer: linux (MCA v1.0, API v1.0, Component v1.1)
MCA allocator: basic (MCA v1.0, API v1.0, Component v1.0)
MCA allocator: bucket (MCA v1.0, API v1.0, Component v1.0)
MCA coll: basic (MCA v1.0, API v1.0, Component v1.1)
MCA coll: hierarch (MCA v1.0, API v1.0, Component v1.1)
MCA coll: self (MCA v1.0, API v1.0, Component v1.1)
MCA coll: sm (MCA v1.0, API v1.0, Component v1.1)
MCA coll: tuned (MCA v1.0, API v1.0, Component v1.1)
MCA io: romio (MCA v1.0, API v1.0, Component v1.1)
MCA mpool: openib (MCA v1.0, API v1.0, Component v1.1)
MCA mpool: sm (MCA v1.0, API v1.0, Component v1.1)
MCA pml: ob1 (MCA v1.0, API v1.0, Component v1.1)
MCA bml: r2 (MCA v1.0, API v1.0, Component v1.1)
MCA rcache: rb (MCA v1.0, API v1.0, Component v1.1)
MCA btl: openib (MCA v1.0, API v1.0, Component v1.1)
MCA btl: self (MCA v1.0, API v1.0, Component v1.1)
MCA btl: sm (MCA v1.0, API v1.0, Component v1.1)
MCA btl: tcp (MCA v1.0, API v1.0, Component v1.0)
MCA topo: unity (MCA v1.0, API v1.0, Component v1.1)
MCA osc: pt2pt (MCA v1.0, API v1.0, Component v1.0)
MCA gpr: null (MCA v1.0, API v1.0, Component v1.1)
MCA gpr: proxy (MCA v1.0, API v1.0, Component v1.1)
MCA gpr: replica (MCA v1.0, API v1.0, Component v1.1)
MCA iof: proxy (MCA v1.0, API v1.0, Component v1.1)
MCA iof: svc (MCA v1.0, API v1.0, Component v1.1)
MCA ns: proxy (MCA v1.0, API v1.0, Component v1.1)
MCA ns: replica (MCA v1.0, API v1.0, Component v1.1)
MCA oob: tcp (MCA v1.0, API v1.0, Component v1.0)
MCA ras: dash_host (MCA v1.0, API v1.0, Component v1.1)
MCA ras: hostfile (MCA v1.0, API v1.0, Component v1.1)
MCA ras: localhost (MCA v1.0, API v1.0, Component v1.1)
MCA ras: slurm (MCA v1.0, API v1.0, Component v1.1)
MCA rds: hostfile (MCA v1.0, API v1.0, Component v1.1)
MCA rds: resfile (MCA v1.0, API v1.0, Component v1.1)
MCA rmaps: round_robin (MCA v1.0, API v1.0, Component v1.1)
MCA rmgr: proxy (MCA v1.0, API v1.0, Component v1.1)
MCA rmgr: urm (MCA v1.0, API v1.0, Component v1.1)
MCA rml: oob (MCA v1.0, API v1.0, Component v1.1)
MCA pls: fork (MCA v1.0, API v1.0, Component v1.1)
MCA pls: rsh (MCA v1.0, API v1.0, Component v1.1)
MCA pls: slurm (MCA v1.0, API v1.0, Component v1.1)
MCA sds: env (MCA v1.0, API v1.0, Component v1.1)
MCA sds: pipe (MCA v1.0, API v1.0, Component v1.1)
MCA sds: seed (MCA v1.0, API v1.0, Component v1.1)
MCA sds: singleton (MCA v1.0, API v1.0, Component v1.1)
MCA sds: slurm (MCA v1.0, API v1.0, Component v1.1)
Re: [OMPI users] Tmpdir work for first process only
Hi,

I have figured out why the tmpdir parameter works only for the first process. I got another problem when I tried to run 400 processes (there is no problem under 400 processes). I got an error "ORTE_ERROR_LOG: Out of resource in file base/iof_base_setup.c at line 106". I have attached the message below:

[ac27:12442] [0,0,0] setting up session dir with
[ac27:12442] tmpdir /jobfs/z07/247752.ac-pbs
[ac27:12442] universe default-universe-12442
[ac27:12442] user kxc565
[ac27:12442] host ac27
[ac27:12442] jobid 0
[ac27:12442] procid 0
[ac27:12442] procdir: /jobfs/z07/247752.ac-pbs/openmpi-sessions-kxc565@ac27_0/default-universe-12442/0/0
[ac27:12442] jobdir: /jobfs/z07/247752.ac-pbs/openmpi-sessions-kxc565@ac27_0/default-universe-12442/0
[ac27:12442] unidir: /jobfs/z07/247752.ac-pbs/openmpi-sessions-kxc565@ac27_0/default-universe-12442
[ac27:12442] top: openmpi-sessions-kxc565@ac27_0
[ac27:12442] tmp: ??
[ac27:12442] [0,0,0] contact_file /jobfs/z07/247752.ac-pbs/openmpi-sessions-kxc565@ac27_0/default-universe-12442/universe-setup.txt
[ac27:12442] [0,0,0] wrote setup file
[ac27:12447] [0,0,1] setting up session dir with
[ac27:12447] universe default-universe-12442
[ac27:12447] user kxc565
[ac27:12447] host ac27
[ac27:12447] jobid 0
[ac27:12447] procid 1
[ac27:12447] procdir: /jobfs/z07/247752.ac-pbs/openmpi-sessions-kxc565@ac27_0/default-universe-12442/0/1
[ac27:12447] jobdir: /jobfs/z07/247752.ac-pbs/openmpi-sessions-kxc565@ac27_0/default-universe-12442/0
[ac27:12447] unidir: /jobfs/z07/247752.ac-pbs/openmpi-sessions-kxc565@ac27_0/default-universe-12442
[ac27:12447] top: openmpi-sessions-kxc565@ac27_0
[ac27:12447] tmp: /jobfs/z07/247752.ac-pbs
[ac27:12447] [0,0,1] ORTE_ERROR_LOG: Out of resource in file base/iof_base_setup.c at line 106
[ac27:12447] [0,0,1] ORTE_ERROR_LOG: Out of resource in file odls_default_module.c at line 663
[ac27:12447] [0,0,1] ORTE_ERROR_LOG: Out of resource in file odls_default_module.c at line 1191
[ac27:12447] [0,0,1] ORTE_ERROR_LOG: Out of resource in file orted.c at line 594
[ac27:12442] spawn: in job_state_callback(jobid = 1, state = 0x80)
mpirun noticed that job rank 0 with PID 0 on node ac27 exited on signal 15 (Terminated).
[ac27:12447] sess_dir_finalize: job session dir not empty - leaving
[ac27:12447] sess_dir_finalize: proc session dir not empty - leaving
[ac27:12442] sess_dir_finalize: proc session dir not empty - leaving

Thanks,
Clement

Clement Kam Man Chu wrote:

Hi, I am using openmpi 1.2.3 on an ia64 machine. I typed "mpirun -d --tmpdir /home/565/kxc565/tmpdir -mca btl sm -np 400 ./testprogram". I found that only the first process uses my parameter setting to store its tmp files; the second process used its default setting and stored its tmp files in the /tmp directory. How can I make all processes store their files in the directory I specify? I have attached the message from Open MPI below for more detail. Thanks for any help.

Cheers,
Clement

[ac27:27928] [0,0,0] setting up session dir with
[ac27:27928] tmpdir /home/565/kxc565/tmpdir
[ac27:27928] universe default-universe-27928
[ac27:27928] user kxc565
[ac27:27928] host ac27
[ac27:27928] jobid 0
[ac27:27928] procid 0
[ac27:27928] procdir: /home/565/kxc565/tmpdir/openmpi-sessions-kxc565@ac27_0/default-universe-27928/0/0
[ac27:27928] jobdir: /home/565/kxc565/tmpdir/openmpi-sessions-kxc565@ac27_0/default-universe-27928/0
[ac27:27928] unidir: /home/565/kxc565/tmpdir/openmpi-sessions-kxc565@ac27_0/default-universe-27928
[ac27:27928] top: openmpi-sessions-kxc565@ac27_0
[ac27:27928] tmp: ?
[ac27:27928] [0,0,0] contact_file /home/565/kxc565/tmpdir/openmpi-sessions-kxc565@ac27_0/default-universe-27928/universe-setup.txt
[ac27:27928] [0,0,0] wrote setup file
[ac27:27932] [0,0,1] setting up session dir with
[ac27:27932] universe default-universe-27928
[ac27:27932] user kxc565
[ac27:27932] host ac27
[ac27:27932] jobid 0
[ac27:27932] procid 1
[ac27:27932] procdir: /tmp/openmpi-sessions-kxc565@ac27_0/default-universe-27928/0/1
[ac27:27932] jobdir: /tmp/openmpi-sessions-kxc565@ac27_0/default-universe-27928/0
[ac27:27932] unidir: /tmp/openmpi-sessions-kxc565@ac27_0/default-universe-27928
[ac27:27932] top: openmpi-sessions-kxc565@ac27_0
[ac27:27932] tmp: /tmp
[ac27:27932] [0,0,1] ORTE_ERROR_LOG: Out of resource in file base/iof_base_setup.c at line 106
[ac27:27932] [0,0,1] ORTE_ERROR_LOG: Out of resource in file odls_default_module.c at line 663
[ac27:27932] [0,0,1] ORTE_ERROR_LOG: Out of resource in file odls_default_module.c at line 1191
[ac27:27932] [0,0,1] ORTE_ERROR_LOG: Out of resource in file orted.c at line 594
[ac27:27928] spawn: in job_state_callback(jobid = 1, state = 0x80)
mpirun noticed that job rank 0 with PID 0 on node ac27 exited on signal 15 (Terminated).
[ac27:27932] sess_dir_finalize: job session dir not empty - leaving
[ac27:27932] sess_dir_finalize: proc session dir not empty - leaving
[ac27:27928] sess_dir_finalize: proc session dir not empty - leaving

--
Clement Kam Man Chu
Research Assistant
Faculty of Information Technology
Monash University, Caulfield Campus