Hi,

On 11.04.2013, at 10:41, Bernard Massot wrote:
> I'm new to parallel computing and have a problem with Open MPI jobs in
> gridengine. I compiled Open MPI with gridengine support and run programs
> with "mpirun program" as gridengine jobs.

0) Which version of Open MPI are you using?

> Most of the time everything just runs fine, but once in a while, under
> circumstances I don't understand, without modifying any parameter,
> mpirun will fail. It seems mpirun tries to run the program using SSH
> instead of gridengine's mechanism.
>
> I get the following error in the job output:
> ssh_askpass: exec(/usr/bin/ssh-askpass): No such file or directory
> Host key verification failed.
> A daemon died unexpectedly with status 129 while attempting
> to launch so we are aborting.
> [...]
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
>
> Indeed ssh-askpass is not available, since users are not supposed to, and
> not allowed to, connect to nodes with SSH. I verified with strace that
> mpirun is the one trying to use SSH. After that the job fails with the
> "dr" status.

This is a fine setup. I assume the setting in SGE's configuration is:

$ qconf -sconf
...
qlogin_command               builtin
qlogin_daemon                builtin
rlogin_command               builtin
rlogin_daemon                builtin
rsh_command                  builtin
rsh_daemon                   builtin

So Open MPI should detect that it's running under SGE and in the end issue
`qrsh -inherit ...`, without needing ssh to be available anywhere inside the
cluster, in case it has to start something on additional nodes. But starting
processes on additional nodes is not intended by your setup anyway, due to
the PE definition. (A sketch of how to make mpirun show which launcher it
picks follows below the quoted text.)

1) Is the `mpiexec` the users use the one you supplied, or can it happen
that by accident they used a different `mpiexec`, or even compiled their
application against a different MPI library?

2) The submitted jobscript is available on the node in
$SGE_JOB_SPOOL_DIR/job_scripts (the directory specified during installation
to be used by the exechosts): do the spooled scripts show anything unusual
regarding the PATH settings? (A small diagnostic sketch for 1) and 2)
follows below the quoted text.)

> I tried to run the job with a lot of "--mca" options to get more debug
> output from mpirun, but I didn't get interesting information.
> Gridengine master's log only says:
> 04/09/2013 16:16:44|worker|cerebro2|E|tightly integrated parallel task 206.1
> task 1.cerebro2-1 failed - killing job
>
> User's home directory is shared between master and nodes with NFS. Nodes
> are connected with a slow network, so I don't want an MPI job to spread
> over several nodes. Each node has 64 slots. My gridengine script is like
> this:

As long as all processes stay on one node only, this shouldn't happen at
all, as recent Open MPI uses only forks to start additional local processes.

3) Are all slots for a job coming from one queue only, or are the slots
collected from several queues on one and the same exechost? I.e.: is the
same PE "orte" attached to more than one queue? (A way to check this is
sketched below as well.)

-- Reuti

> #$ -pe orte 64
> mpirun -np "$NSLOTS" program
>
> $ qconf -sp orte
> pe_name            orte
> slots              9999
> user_lists         NONE
> xuser_lists        NONE
> start_proc_args    /bin/true
> stop_proc_args     /bin/true
> allocation_rule    $pe_slots
> control_slaves     TRUE
> job_is_first_task  FALSE
> urgency_slots      min
> accounting_summary FALSE
> $ ompi_info | grep gridengine
>                  MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.4.2)
>
> I'm using gridengine 6.2u5 on Debian Squeeze.
>
> Do you know what could happen?
> --
> Bernard Massot
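
A sketch for the launcher check mentioned above, assuming the 1.3/1.4-style
"plm" framework that your ompi_info output suggests (the verbosity parameter
name and level may differ in other releases):

mpirun --mca plm_base_verbose 10 -np "$NSLOTS" program

The verbose output should show whether the daemons would be started via
`qrsh -inherit` (tight SGE integration) or via the rsh/ssh launcher.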
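
And a minimal diagnostic block one could put at the top of the jobscript for
1) and 2) — the echo/which/cat lines are only illustrative additions, not
part of your original script:

#$ -pe orte 64
echo "PATH=$PATH"               # anything unusual prepended by the environment?
which mpirun mpiexec            # is it the installation you supplied?
ompi_info | grep -i gridengine  # are the gridengine components present?
echo "NSLOTS=$NSLOTS"
cat "$PE_HOSTFILE"              # host(s), slot counts and queue(s) granted by SGE
mpirun -np "$NSLOTS" program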
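
For 3), the PE-to-queue attachment and the queue instances actually granted
to a running job can be checked like this (queue names will of course differ
on your cluster):

$ qconf -sql                        # list all cluster queues
$ for q in $(qconf -sql); do echo "$q: $(qconf -sq "$q" | grep pe_list)"; done
$ qstat -g t -u '*'                 # one line per granted slot, incl. the queue instance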
