Hello all, this is my first time posting to this mailing list. About 1% or less of our qrsh grid jobs are failing in an unusual way.
We are running Open Grid Scheduler 2011.11 on CentOS 6.5. The small percentage of failing qrsh jobs get a non-zero exit status back to the submit host (exit status 1), and display this message:

    Your "qrsh" request could not be scheduled, try again later.

Note, we do include the "-now n" option on the command line. Also, the qacct log shows the job as having completed successfully:

    qsub_time     Thu Nov 13 14:17:47 2014
    start_time    Thu Nov 13 14:21:13 2014
    end_time      Thu Nov 13 14:25:15 2014
    granted_pe    NONE
    slots         1
    failed        0
    exit_status   0
    ru_wallclock  242
    ru_utime      226.439
    ru_stime      5.383

And reviewing the working directory, it does look like the job completed properly.

I'm not sure how to take the next step in debugging this problem. Any advice?

Brian Small
Northwest Logic
1100 NW Compton Drive, Ste. 100
Beaverton, OR 97006
Desk - 503-533-5800 x-320
Cell - 503-577-6869
Fax: 503-533-5900
E-mail - [email protected]
Web - www.nwlogic.com
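For anyone who wants to cross-check the exit status seen on the submit host against the accounting record, here is a rough Python sketch. It assumes the qacct output for one job has already been captured as text and consists of whitespace-separated "key value" lines like the ones quoted above. The helper names parse_qacct and accounting_says_ok are made up for illustration and are not part of Grid Engine.

```python
def parse_qacct(text):
    """Turn 'key value' lines from captured qacct output into a dict.

    Lines with fewer than two fields (blank lines, separator rules)
    are skipped. Values are kept as strings.
    """
    record = {}
    for line in text.splitlines():
        parts = line.strip().split(None, 1)
        if len(parts) == 2:
            record[parts[0]] = parts[1].strip()
    return record


def accounting_says_ok(record):
    """True when accounting shows failed 0 and exit_status 0.

    Only the first token of 'failed' is compared, on the assumption
    that the field may carry a trailing explanation after the code.
    """
    failed = record.get("failed", "").split()
    return bool(failed) and failed[0] == "0" and record.get("exit_status") == "0"


# Fields copied from the accounting record in this post.
SAMPLE = """\
qsub_time     Thu Nov 13 14:17:47 2014
start_time    Thu Nov 13 14:21:13 2014
end_time      Thu Nov 13 14:25:15 2014
granted_pe    NONE
slots         1
failed        0
exit_status   0
ru_wallclock  242
"""

if __name__ == "__main__":
    rec = parse_qacct(SAMPLE)
    print(rec["start_time"], accounting_says_ok(rec))
```

A wrapper around qrsh could log every case where the submit host sees exit status 1 while a check like this reports success, which would at least give a list of affected job IDs and times to compare against the qmaster messages file.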
