Hi,

On 11.04.2013, at 10:41, Bernard Massot wrote:
> I'm new to parallel computing and have a problem with Open MPI jobs in
> gridengine. I compiled Open MPI with gridengine support and run programs
> with "mpirun program" as gridengine jobs.

0) Which version of Open MPI are you using?

> Most of the time everything just runs fine, but once in a while, under
> circumstances I don't understand, without modifying any parameter,
> mpirun will fail. It seems mpirun tries to run the program using SSH
> instead of gridengine's mechanism.
>
> I get the following error in the job output:
> ssh_askpass: exec(/usr/bin/ssh-askpass): No such file or directory
> Host key verification failed.
> A daemon died unexpectedly with status 129 while attempting
> to launch so we are aborting.
> [...]
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
>
> Indeed ssh-askpass is not available, since users are not supposed to, and
> not allowed to, connect to nodes with SSH. I verified with strace that
> mpirun is the one trying to use SSH. After that the job fails with the
> "dr" status.

This is a fine setup. I assume the setting in SGE's configuration is:

$ qconf -sconf
...
qlogin_command               builtin
qlogin_daemon                builtin
rlogin_command               builtin
rlogin_daemon                builtin
rsh_command                  builtin
rsh_daemon                   builtin

So Open MPI should detect that it's running under SGE and in the end issue
`qrsh -inherit ...`, without needing ssh to be available anywhere inside the
cluster, in case it has to start something on additional nodes. But starting
processes on additional nodes is not intended by your setup anyway, due to
the PE definition. (A sketch of how to make mpirun show which launcher it
picks follows below the quoted text.)

1) Is the `mpiexec` the users use the one you supplied, or can it happen
that by accident they used a different `mpiexec`, or even compiled their
application against a different MPI library?

2) The submitted jobscript is available on the node in
$SGE_JOB_SPOOL_DIR/job_scripts (the directory specified during installation
to be used by the exechosts): do the spooled scripts show anything unusual
regarding the PATH settings? (A small diagnostic sketch for 1) and 2)
follows below the quoted text.)

> I tried to run the job with a lot of "--mca" options to get more debug
> output from mpirun, but I didn't get interesting information.
> Gridengine master's log only says:
> 04/09/2013 16:16:44|worker|cerebro2|E|tightly integrated parallel task 206.1
> task 1.cerebro2-1 failed - killing job
>
> User's home directory is shared between master and nodes with NFS. Nodes
> are connected with a slow network, so I don't want an MPI job to spread
> over several nodes. Each node has 64 slots. My gridengine script is like
> this:

As long as all processes stay on one node only, this shouldn't happen at
all, as recent Open MPI uses only forks to start additional local processes.

3) Are all slots for a job coming from one queue only, or are the slots
collected from several queues on one and the same exechost? I.e.: is the
same PE "orte" attached to more than one queue? (A way to check this is
sketched below as well.)

-- Reuti

> #$ -pe orte 64
> mpirun -np "$NSLOTS" program
>
> $ qconf -sp orte
> pe_name            orte
> slots              9999
> user_lists         NONE
> xuser_lists        NONE
> start_proc_args    /bin/true
> stop_proc_args     /bin/true
> allocation_rule    $pe_slots
> control_slaves     TRUE
> job_is_first_task  FALSE
> urgency_slots      min
> accounting_summary FALSE
> $ ompi_info | grep gridengine
>                  MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.4.2)
>
> I'm using gridengine 6.2u5 on Debian Squeeze.
>
> Do you know what could happen?
> --
> Bernard Massot
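
A sketch for the launcher check mentioned above, assuming the 1.3/1.4-style
"plm" framework that your ompi_info output suggests (the verbosity parameter
name and level may differ in other releases):

mpirun --mca plm_base_verbose 10 -np "$NSLOTS" program

The verbose output should show whether the daemons would be started via
`qrsh -inherit` (tight SGE integration) or via the rsh/ssh launcher.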
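
And a minimal diagnostic block one could put at the top of the jobscript for
1) and 2) — the echo/which/cat lines are only illustrative additions, not
part of your original script:

#$ -pe orte 64
echo "PATH=$PATH"               # anything unusual prepended by the environment?
which mpirun mpiexec            # is it the installation you supplied?
ompi_info | grep -i gridengine  # are the gridengine components present?
echo "NSLOTS=$NSLOTS"
cat "$PE_HOSTFILE"              # host(s), slot counts and queue(s) granted by SGE
mpirun -np "$NSLOTS" program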
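
For 3), the PE-to-queue attachment and the queue instances actually granted
to a running job can be checked like this (queue names will of course differ
on your cluster):

$ qconf -sql                        # list all cluster queues
$ for q in $(qconf -sql); do echo "$q: $(qconf -sq "$q" | grep pe_list)"; done
$ qstat -g t -u '*'                 # one line per granted slot, incl. the queue instance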
