Hello,

I'm new to parallel computing and have a problem with Open MPI jobs in
gridengine. I compiled Open MPI with gridengine support and run programs
with "mpirun program" as gridengine jobs.
Most of the time everything runs fine, but once in a while, under
circumstances I don't understand and without any parameter being
changed, mpirun fails. It seems mpirun tries to launch the program
through SSH instead of gridengine's mechanism.

I get the following error in the job output:
ssh_askpass: exec(/usr/bin/ssh-askpass): No such file or directory
Host key verification failed.
A daemon died unexpectedly with status 129 while attempting
to launch so we are aborting.
[...]
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.

Indeed, ssh-askpass is not available, since users are neither expected
nor allowed to connect to nodes with SSH. I verified with strace that
mpirun is the one trying to use SSH. After that, the job fails with the
"dr" status.
I tried running the job with many "--mca" options to get more debug
output from mpirun, but I didn't get any useful information.
Gridengine master's log only says:
04/09/2013 16:16:44|worker|cerebro2|E|tightly integrated parallel task 206.1 task 1.cerebro2-1 failed - killing job

The user's home directory is shared between the master and the nodes
over NFS. The nodes are connected by a slow network, so I don't want an
MPI job to spread across several nodes. Each node has 64 slots. My
gridengine script looks like this:
#$ -pe orte 64
mpirun -np "$NSLOTS" program
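
One check that could be added to the job script before the mpirun line is sketched below. It rests on an assumption I have not confirmed for this Open MPI version: that mpirun decides whether to use gridengine's launcher from the SGE_ROOT, ARC, PE_HOSTFILE and JOB_ID environment variables. The function name check_sge_env is hypothetical; it only prints each variable or reports it as missing, so that a failing run can be compared with a working one.

```shell
#!/bin/sh
# Hypothetical helper, not part of gridengine or Open MPI.
# Assumption: these four variables are what mpirun looks at to
# detect that it is running inside a gridengine job.
check_sge_env() {
    missing=0
    for var in SGE_ROOT ARC PE_HOSTFILE JOB_ID; do
        # Indirect lookup of the variable named in $var (POSIX sh).
        eval "value=\${$var:-}"
        if [ -z "$value" ]; then
            echo "missing: $var"
            missing=1
        else
            echo "$var=$value"
        fi
    done
    # Returns 0 when all four are set and non-empty, 1 otherwise.
    return "$missing"
}
```

Calling check_sge_env right before mpirun would put the result in the job output next to the error, if there is one.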

$ qconf -sp orte
pe_name            orte
slots              9999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $pe_slots
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE
$ ompi_info | grep gridengine
MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.4.2)
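
Since allocation_rule is $pe_slots, all 64 slots should be granted on a single host. A small sketch to confirm this from inside a job, assuming the usual $PE_HOSTFILE layout with the host name in the first column and the slot count in the second (the function name summarize_pe_hostfile is hypothetical):

```shell
#!/bin/sh
# Hypothetical helper: count the hosts and slots listed in a PE hostfile.
# Assumption: one line per host, slot count in the second column.
summarize_pe_hostfile() {
    awk '{ hosts++; slots += $2 }
         END { printf "hosts=%d slots=%d\n", hosts, slots }' "$1"
}
```

In the job script this would be called as summarize_pe_hostfile "$PE_HOSTFILE"; printing the output of hostname next to it would also show whether the node's own name matches the name gridengine granted.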

I'm using gridengine 6.2u5 on Debian Squeeze.

Do you know what could be happening?
-- 
Bernard Massot
