Hello, I'm new to parallel computing and have a problem with Open MPI jobs in gridengine. I compiled Open MPI with gridengine support and run programs with "mpirun program" as gridengine jobs. Most of the time everything runs fine, but once in a while, under circumstances I don't understand and without any parameter having been changed, mpirun fails. It seems that mpirun then tries to launch the program over SSH instead of using gridengine's own mechanism.
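One way to narrow this down would be a check at the top of the job script. The sketch below rests on an assumption that is not confirmed anywhere in this message: that Open MPI only picks gridengine's launcher when the variables SGE_ROOT, ARC, PE_HOSTFILE and JOB_ID are all present in the job environment. The helper name check_sge_env is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: report which of the variables ASSUMED to matter
# for gridengine detection are missing or empty in the environment.
check_sge_env() {
    missing=""
    for var in SGE_ROOT ARC PE_HOSTFILE JOB_ID; do
        # eval gives portable indirect expansion in plain sh
        eval "value=\${$var:-}"
        if [ -z "$value" ]; then
            missing="$missing $var"
        fi
    done
    if [ -n "$missing" ]; then
        # Print the names of the variables that were not found
        echo "missing:$missing"
        return 1
    fi
    echo "ok"
    return 0
}
```

Calling check_sge_env just before the mpirun line would leave a trace in the job output, so a failing run could be compared with a successful one.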
I get the following error in the job output:

    ssh_askpass: exec(/usr/bin/ssh-askpass): No such file or directory
    Host key verification failed.
    A daemon died unexpectedly with status 129 while attempting to launch so we are aborting.
    [...]
    mpirun noticed that the job aborted, but has no info as to the process that caused that situation.

Indeed, ssh-askpass is not available, since users are neither supposed nor allowed to connect to the nodes with SSH. I verified with strace that mpirun is the process trying to use SSH. After that, the job fails with the "dr" status.

I tried running the job with a lot of "--mca" options to get more debug output from mpirun, but I didn't get any useful information. The gridengine master's log only says:

    04/09/2013 16:16:44|worker|cerebro2|E|tightly integrated parallel task 206.1 task 1.cerebro2-1 failed - killing job

The users' home directories are shared between the master and the nodes over NFS. The nodes are connected by a slow network, so I don't want an MPI job to spread over several nodes. Each node has 64 slots.

My gridengine script looks like this:

    #$ -pe orte 64
    mpirun -np "$NSLOTS" program

The parallel environment is configured as follows:

    $ qconf -sp orte
    pe_name            orte
    slots              9999
    user_lists         NONE
    xuser_lists        NONE
    start_proc_args    /bin/true
    stop_proc_args     /bin/true
    allocation_rule    $pe_slots
    control_slaves     TRUE
    job_is_first_task  FALSE
    urgency_slots      min
    accounting_summary FALSE

Open MPI reports gridengine support:

    $ ompi_info | grep gridengine
    MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.4.2)

I'm using gridengine 6.2u5 on Debian Squeeze.

Do you know what could be happening?

-- 
Bernard Massot
