Excellent! Yes, we use pipe in several places, including in the run-time during various stages of launch, so that could be a problem.
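A quick way to check this outside of MPI, as a minimal sketch rather than anything from the original exchange: the tiny program below (pipetest.c is an illustrative name) just calls pipe() directly and reports errno, so it can be run once as the LDAP-authenticated user and once as a local user to see whether the system call itself misbehaves.

/* pipetest.c - standalone check of the pipe() call discussed above.
 * Build with: gcc -o pipetest pipetest.c
 * Run once as the LDAP-authenticated user and once as a local user
 * and compare the results. */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fds[2];

    if (pipe(fds) != 0) {
        /* Note: errno 13 is EACCES ("Permission denied"), which is a
         * different thing from *signal* 13 (SIGPIPE) reported by mpirun. */
        fprintf(stderr, "pipe() failed: %s (errno %d)\n",
                strerror(errno), errno);
        return 1;
    }

    printf("pipe() succeeded: read fd %d, write fd %d\n", fds[0], fds[1]);
    close(fds[0]);
    close(fds[1]);
    return 0;
}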
Also, be aware that other users have reported problems on LDAP-based systems when attempting to launch large jobs. The problem is that the OpenMPI launch system has no rate control in it - the LDAP slapd servers get overwhelmed by the launch when we ssh to a large number of nodes at once. I promised another user that I would concoct a fix for this problem, but I am taking a break from the project for a few months, so it may be a little while before a fix is available. When I do get it done, it may or may not make it into an OpenMPI release for some time - I'm not sure how they will decide to schedule the change (is it a "bug", or a new "feature"?). So I may do an interim release as a patch on the OpenRTE site (since that is the run-time underneath OpenMPI). I'll let people know via this mailing list either way.

Ralph


On 3/18/07 2:06 PM, "David Bronke" <whitel...@gmail.com> wrote:

> I just received an email from a friend who is helping me work on
> resolving this; he was able to trace the problem back to a pipe() call
> in OpenMPI apparently:
>
>> The problem is with the pipe() system call (which is invoked by
>> MPI_Send() as far as I can tell) by an LDAP-authenticated user. Still
>> working out where exactly that goes wrong, but the fact is that it isn't
>> actually a permissions problem - the reason it works as root is because
>> root is a local user and uses normal /etc/passwd authentication.
>
> I had forgotten to mention that we use LDAP for authentication on this
> machine; PAM and NSS are set up to use it, but I'm guessing that
> either OpenMPI itself or the pipe() system call won't check with them
> when needed... We have made some local users on the machine to get
> things going, but I'll probably have to find an LDAP mailing list to
> get this issue resolved.
>
> Thanks for all the help so far!
>
> On 3/16/07, Ralph Castain <r...@lanl.gov> wrote:
>> I'm afraid I have zero knowledge or experience with gentoo portage, so I
>> can't help you there. I always install our releases from the tarball source,
>> as it is pretty trivial to do and avoids any issues.
>>
>> I will have to defer to someone who knows that system to help you from here.
>> It sounds like an installation or configuration issue.
>>
>> Ralph
>>
>>
>> On 3/16/07 3:15 PM, "David Bronke" <whitel...@gmail.com> wrote:
>>
>>> On 3/15/07, Ralph Castain <r...@lanl.gov> wrote:
>>>> Hmmm... well, a few thoughts to hopefully help with the debugging. One
>>>> initial comment, though - 1.1.2 is quite old. You might want to upgrade to
>>>> 1.2 (releasing momentarily - you can use the last release candidate in the
>>>> interim, as it is identical).
>>>
>>> Version 1.2 doesn't seem to be in gentoo portage yet, so I may end up
>>> having to compile from source... I generally prefer to do everything
>>> from portage if possible, because it makes upgrades and maintenance
>>> much cleaner.
>>>
>>>> Meantime, looking at this output, there appear to be a couple of common
>>>> possibilities. First, I don't see any of the diagnostic output from after we
>>>> do a local fork (we do this prior to actually launching the daemon). Is it
>>>> possible your system doesn't allow you to fork processes (some don't, though
>>>> it's unusual)?
>>>
>>> I don't see any problems with forking on this system... I'm able to
>>> start a dbus daemon as a regular user without any problems.
>>>
>>>> Second, it could be that the "orted" program isn't being found in your path.
>>>> People often forget that the path in shells started up by programs isn't
>>>> necessarily the same as that in their login shell. You might try executing a
>>>> simple shell script that outputs the results of "which orted" to verify this
>>>> is correct.
>>>
>>> 'which orted' from a shell script gives me '/usr/bin/orted', which
>>> seems to be correct.
>>>
>>>> BTW, I should have asked as well: what are you running this on, and how did
>>>> you configure openmpi?
>>>
>>> I'm running this on two identical machines with 2 dual-core
>>> hyperthreading Xeon processors (EM64T). I simply installed OpenMPI
>>> using portage, with the USE flags "debug fortran pbs -threads". (I've
>>> also tried it with "-debug fortran pbs threads".)
>>>
>>>> Ralph
>>>>
>>>>
>>>> On 3/15/07 5:33 PM, "David Bronke" <whitel...@gmail.com> wrote:
>>>>
>>>>> I'm using OpenMPI version 1.1.2. I installed it using gentoo portage,
>>>>> so I think it has the right permissions... I tried doing 'equery f
>>>>> openmpi | xargs ls -dl' and inspecting the permissions of each file,
>>>>> and I don't see much out of the ordinary; it is all owned by
>>>>> root:root, but every file has read permission for user, group, and
>>>>> other (and execute for each as well when appropriate). From the debug
>>>>> output, I can tell that mpirun is creating the session tree in /tmp,
>>>>> and it does seem to be working fine... Here's the output when using
>>>>> --debug-daemons:
>>>>>
>>>>> $ mpirun -aborted 8 -v -d --debug-daemons -np 8 /workspace/bronke/mpi/hello
>>>>> [trixie:25228] [0,0,0] setting up session dir with
>>>>> [trixie:25228] universe default-universe
>>>>> [trixie:25228] user bronke
>>>>> [trixie:25228] host trixie
>>>>> [trixie:25228] jobid 0
>>>>> [trixie:25228] procid 0
>>>>> [trixie:25228] procdir:
>>>>> /tmp/openmpi-sessions-bronke@trixie_0/default-universe/0/0
>>>>> [trixie:25228] jobdir:
>>>>> /tmp/openmpi-sessions-bronke@trixie_0/default-universe/0
>>>>> [trixie:25228] unidir:
>>>>> /tmp/openmpi-sessions-bronke@trixie_0/default-universe
>>>>> [trixie:25228] top: openmpi-sessions-bronke@trixie_0
>>>>> [trixie:25228] tmp: /tmp
>>>>> [trixie:25228] [0,0,0] contact_file
>>>>> /tmp/openmpi-sessions-bronke@trixie_0/default-universe/universe-setup.txt
>>>>> [trixie:25228] [0,0,0] wrote setup file
>>>>> [trixie:25228] pls:rsh: local csh: 0, local bash: 1
>>>>> [trixie:25228] pls:rsh: assuming same remote shell as local shell
>>>>> [trixie:25228] pls:rsh: remote csh: 0, remote bash: 1
>>>>> [trixie:25228] pls:rsh: final template argv:
>>>>> [trixie:25228] pls:rsh: /usr/bin/ssh <template> orted --debug
>>>>> --debug-daemons --bootproxy 1 --name <template> --num_procs 2
>>>>> --vpid_start 0 --nodename <template> --universe
>>>>> bronke@trixie:default-universe --nsreplica
>>>>> "0.0.0;tcp://141.238.31.33:43838" --gprreplica
>>>>> "0.0.0;tcp://141.238.31.33:43838" --mpi-call-yield 0
>>>>> [trixie:25228] sess_dir_finalize: proc session dir not empty - leaving
>>>>> [trixie:25228] spawn: in job_state_callback(jobid = 1, state = 0x100)
>>>>> mpirun noticed that job rank 0 with PID 0 on node "localhost" exited
>>>>> on signal 13.
>>>>> [trixie:25228] sess_dir_finalize: proc session dir not empty - leaving
>>>>> [trixie:25228] sess_dir_finalize: proc session dir not empty - leaving
>>>>> [trixie:25228] sess_dir_finalize: proc session dir not empty - leaving
>>>>> [trixie:25228] sess_dir_finalize: proc session dir not empty - leaving
>>>>> [trixie:25228] sess_dir_finalize: proc session dir not empty - leaving
>>>>> [trixie:25228] sess_dir_finalize: proc session dir not empty - leaving
>>>>> [trixie:25228] sess_dir_finalize: proc session dir not empty - leaving
>>>>> [trixie:25228] spawn: in job_state_callback(jobid = 1, state = 0x80)
>>>>> mpirun noticed that job rank 0 with PID 0 on node "localhost" exited
>>>>> on signal 13.
>>>>> mpirun noticed that job rank 1 with PID 0 on node "localhost" exited
>>>>> on signal 13.
>>>>> mpirun noticed that job rank 2 with PID 0 on node "localhost" exited
>>>>> on signal 13.
>>>>> mpirun noticed that job rank 3 with PID 0 on node "localhost" exited
>>>>> on signal 13.
>>>>> mpirun noticed that job rank 4 with PID 0 on node "localhost" exited
>>>>> on signal 13.
>>>>> mpirun noticed that job rank 5 with PID 0 on node "localhost" exited
>>>>> on signal 13.
>>>>> mpirun noticed that job rank 6 with PID 0 on node "localhost" exited
>>>>> on signal 13.
>>>>> [trixie:25228] ERROR: A daemon on node localhost failed to start as
>>>>> expected.
>>>>> [trixie:25228] ERROR: There may be more information available from
>>>>> [trixie:25228] ERROR: the remote shell (see above).
>>>>> [trixie:25228] The daemon received a signal 13.
>>>>> 1 additional process aborted (not shown)
>>>>> [trixie:25228] sess_dir_finalize: found proc session dir empty - deleting
>>>>> [trixie:25228] sess_dir_finalize: found job session dir empty - deleting
>>>>> [trixie:25228] sess_dir_finalize: found univ session dir empty - deleting
>>>>> [trixie:25228] sess_dir_finalize: found top session dir empty - deleting
>>>>>
>>>>> On 3/15/07, Ralph H Castain <r...@lanl.gov> wrote:
>>>>>> It isn't a /dev issue. The problem is likely that the system lacks
>>>>>> sufficient permissions to either:
>>>>>>
>>>>>> 1. create the Open MPI session directory tree. We create a hierarchy of
>>>>>> subdirectories for temporary storage used for things like your shared memory
>>>>>> file - the location of the head of that tree can be specified at run time,
>>>>>> but has a series of built-in defaults it can search if you don't specify it
>>>>>> (we look at your environmental variables - e.g., TMP or TMPDIR - as well as
>>>>>> the typical Linux/Unix places). You might check to see what your tmp
>>>>>> directory is, and that you have write permission into it. Alternatively, you
>>>>>> can specify your own location (where you know you have permissions!) by
>>>>>> setting --tmpdir your-dir on the mpirun command line.
>>>>>>
>>>>>> 2. execute or access the various binaries and/or libraries. This is usually
>>>>>> caused when someone installs OpenMPI as root, and then tries to execute as a
>>>>>> non-root user. Best thing here is to either run through the installation
>>>>>> directory and add the correct permissions (assuming it is a system-level
>>>>>> install), or reinstall as the non-root user (if the install is solely for
>>>>>> you anyway).
>>>>>>
>>>>>> You can also set --debug-daemons on the mpirun command line to get more
>>>>>> diagnostic output from the daemons and then send that along.
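To make point 1 above concrete: the rough sketch below (tmpcheck.c is an illustrative name, and the mkdir/rmdir probe is only an approximation of what the session-directory code needs) checks the same candidate locations, TMP, TMPDIR, and /tmp, for write permission.

/* tmpcheck.c - rough check of the temp locations mentioned above
 * (TMP, TMPDIR, then /tmp), using a simple mkdir/rmdir probe.
 * Build with: gcc -o tmpcheck tmpcheck.c */
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

static void probe(const char *label, const char *dir)
{
    char path[4096];

    if (dir == NULL || *dir == '\0') {
        printf("%-7s: not set\n", label);
        return;
    }
    /* Try to create (and immediately remove) a scratch directory. */
    snprintf(path, sizeof(path), "%s/ompi-perm-probe-%d", dir, (int)getpid());
    if (mkdir(path, 0700) == 0) {
        printf("%-7s: %s is writable\n", label, dir);
        rmdir(path);
    } else {
        printf("%-7s: cannot create a directory under %s: %s\n",
               label, dir, strerror(errno));
    }
}

int main(void)
{
    probe("TMP", getenv("TMP"));
    probe("TMPDIR", getenv("TMPDIR"));
    probe("/tmp", "/tmp");
    return 0;
}

If the probe fails for the regular user but not for root, pointing mpirun at a known-writable location with --tmpdir, as suggested above, is the quickest way to confirm the diagnosis.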
>>>>>>
>>>>>> BTW: if possible, it helps us to advise you if we know which version of
>>>>>> OpenMPI you are using. ;-)
>>>>>>
>>>>>> Hope that helps.
>>>>>> Ralph
>>>>>>
>>>>>>
>>>>>> On 3/15/07 1:51 PM, "David Bronke" <whitel...@gmail.com> wrote:
>>>>>>
>>>>>>> Ok, now that I've figured out what the signal means, I'm wondering
>>>>>>> exactly what is running into permission problems... the program I'm
>>>>>>> running doesn't use any functions except printf, sprintf, and MPI_*...
>>>>>>> I was thinking that possibly changes to permissions on certain /dev
>>>>>>> entries in newer distros might cause this, but I'm not even sure what
>>>>>>> /dev entries would be used by MPI.
>>>>>>>
>>>>>>> On 3/15/07, McCalla, Mac <macmcca...@hess.com> wrote:
>>>>>>>> Hi,
>>>>>>>> If the perror command is available on your system it will tell
>>>>>>>> you what the message is associated with the signal value. On my system
>>>>>>>> RHEL4U3, it is permission denied.
>>>>>>>>
>>>>>>>> HTH,
>>>>>>>>
>>>>>>>> mac mccalla
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
>>>>>>>> Behalf Of David Bronke
>>>>>>>> Sent: Thursday, March 15, 2007 12:25 PM
>>>>>>>> To: us...@open-mpi.org
>>>>>>>> Subject: [OMPI users] Signal 13
>>>>>>>>
>>>>>>>> I've been trying to get OpenMPI working on two of the computers at a lab
>>>>>>>> I help administer, and I'm running into a rather large issue. When
>>>>>>>> running anything using mpirun as a normal user, I get the following
>>>>>>>> output:
>>>>>>>>
>>>>>>>> $ mpirun --no-daemonize --host
>>>>>>>> localhost,localhost,localhost,localhost,localhost,localhost,localhost,localhost
>>>>>>>> /workspace/bronke/mpi/hello
>>>>>>>> mpirun noticed that job rank 0 with PID 0 on node "localhost" exited on
>>>>>>>> signal 13.
>>>>>>>> [trixie:18104] ERROR: A daemon on node localhost failed to start as
>>>>>>>> expected.
>>>>>>>> [trixie:18104] ERROR: There may be more information available from
>>>>>>>> [trixie:18104] ERROR: the remote shell (see above).
>>>>>>>> [trixie:18104] The daemon received a signal 13.
>>>>>>>> 8 additional processes aborted (not shown)
>>>>>>>>
>>>>>>>> However, running the same exact command line as root works fine:
>>>>>>>>
>>>>>>>> $ sudo mpirun --no-daemonize --host
>>>>>>>> localhost,localhost,localhost,localhost,localhost,localhost,localhost,localhost
>>>>>>>> /workspace/bronke/mpi/hello
>>>>>>>> Password:
>>>>>>>> p is 8, my_rank is 0
>>>>>>>> p is 8, my_rank is 1
>>>>>>>> p is 8, my_rank is 2
>>>>>>>> p is 8, my_rank is 3
>>>>>>>> p is 8, my_rank is 6
>>>>>>>> p is 8, my_rank is 7
>>>>>>>> Greetings from process 1!
>>>>>>>>
>>>>>>>> Greetings from process 2!
>>>>>>>>
>>>>>>>> Greetings from process 3!
>>>>>>>>
>>>>>>>> p is 8, my_rank is 5
>>>>>>>> p is 8, my_rank is 4
>>>>>>>> Greetings from process 4!
>>>>>>>>
>>>>>>>> Greetings from process 5!
>>>>>>>>
>>>>>>>> Greetings from process 6!
>>>>>>>>
>>>>>>>> Greetings from process 7!
>>>>>>>>
>>>>>>>> I've looked up signal 13, and have found that it is apparently SIGPIPE;
>>>>>>>> I also found a thread on the LAM-MPI site:
>>>>>>>> http://www.lam-mpi.org/MailArchives/lam/2004/08/8486.php
>>>>>>>> However, this thread seems to indicate that the problem would be in the
>>>>>>>> application (/workspace/bronke/mpi/hello in this case), but there are no
>>>>>>>> pipes in use in this app, and the fact that it works as expected as root
>>>>>>>> doesn't seem to fit either. I have tried running mpirun with --verbose
>>>>>>>> and it doesn't show any more output than without it, so I've run into a
>>>>>>>> sort of dead-end on this issue. Does anyone know of any way I can figure
>>>>>>>> out what's going wrong or how I can fix it?
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>> --
>>>>>>>> David H. Bronke
>>>>>>>> Lead Programmer
>>>>>>>> G33X Nexus Entertainment
>>>>>>>> http://games.g33xnexus.com/precursors/
>>>>>>>>
>>>>>>>> v3sw5/7Hhw5/6ln4pr6Ock3ma7u7+8Lw3/7Tm3l6+7Gi2e4t4Mb7Hen5g8+9ORPa22s6MSr7p6
>>>>>>>> hackerkey.com
>>>>>>>> Support Web Standards! http://www.webstandards.org/
>>>>>>>> _______________________________________________
>>>>>>>> users mailing list
>>>>>>>> us...@open-mpi.org
>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>>>
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> users mailing list
>>>>>>>> us...@open-mpi.org
>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> users mailing list
>>>>>> us...@open-mpi.org
>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>
>>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> users mailing list
>>>> us...@open-mpi.org
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>
>>>
>>
>>
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>
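For later readers wondering how to decode such signal numbers: the small sketch below (signame.c is an illustrative name) maps a signal number to its description with strsignal(), and the comment notes why perror gives a different answer, since perror and strerror describe errno values rather than signals.

/* signame.c - print the name of a signal number, e.g. the "signal 13"
 * reported by mpirun above. Usage: ./signame 13
 * Build with: gcc -o signame signame.c */
#define _GNU_SOURCE   /* for strsignal() on older glibc */
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    int sig = (argc > 1) ? atoi(argv[1]) : SIGPIPE;

    /* strsignal() maps a *signal* number to its description; perror()
     * and strerror() map *errno* values, which is a different table:
     * errno 13 is EACCES ("Permission denied"), while signal 13 is
     * SIGPIPE ("Broken pipe"). */
    printf("signal %d: %s\n", sig, strsignal(sig));
    return 0;
}

On a typical Linux system, ./signame 13 should print "signal 13: Broken pipe", which matches the SIGPIPE conclusion above.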