Sorry for the slow follow-up on this. Prentice, thank you for your suggestions.
The problem was not specific to any one submit host or any one execution host. We have 3 submit hosts and 5 execution hosts, and we run thousands of jobs a day; a few each day, sometimes more, sometimes less, would fail in this way.

In the end, I decided to get away from using qrsh for running our jobs and instead use qsub -sync y. Also, we had been redirecting the output of qrsh to a log file; instead, I now use the "-j y -o $logfile" options with qsub to create the log file.

In addition, our main job creation script ran the job in an unusual way:

    (cd $work_dir && qrsh -cwd -now n ...<more options and command line>...) 2>&1 > $logfile

I changed this to:

    qsub -wd $work_dir -sync y -j y -o $logfile ...<more options and command line>...

(There is a fuller sketch of the new submission path below.)

After making these changes, it seems the spurious failures have gone away. However, I did get one new type of failure showing up in the output logs:

    Unable to initialize environment because of error: cannot register event client. Only 99 event clients are allowed in the system

I assume this showed up as a result of using "qsub -sync y", since each -sync y submission registers a dynamic event client with the qmaster. It may have been related to the problem I was having with qrsh, but there I wasn't getting an error message. In any case, I think I have fixed this by setting the qmaster_params MAX_DYN_EC=500 (based on some web searches on this issue); see the second sketch below.

So at this time, it looks like my problem is fixed. But I've only got a few hours on the solution so far.
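For anyone who wants to replicate this, here is a minimal sketch of the new submission path. The script name, directories, and job arguments are hypothetical stand-ins for our real values; the key point is that with -sync y, qsub blocks until the job finishes and exits with the job's exit status, so the caller can check it directly:

    #!/bin/sh
    # Sketch only -- work_dir, logfile, and my_job.sh are illustrative names.
    work_dir=/path/to/workdir
    logfile=$work_dir/job.log

    # -wd      run the job in $work_dir
    # -sync y  block until the job finishes; qsub's exit status is the job's
    # -j y     merge stderr into stdout
    # -o       write the merged output to $logfile
    qsub -wd "$work_dir" -sync y -j y -o "$logfile" my_job.sh arg1 arg2
    status=$?
    if [ "$status" -ne 0 ]; then
        echo "job failed with exit status $status; see $logfile" >&2
        exit "$status"
    fi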
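And for the record, the MAX_DYN_EC change is made in the qmaster's global configuration. Roughly, assuming a stock OGS/SGE install (qconf -mconf opens the configuration in your editor):

    # show the current global configuration and look for qmaster_params
    qconf -sconf global | grep qmaster_params

    # edit the global configuration and set:
    #   qmaster_params    MAX_DYN_EC=500
    qconf -mconf global

(Whether the new limit takes effect without a qmaster restart may depend on the version, so check the docs for your release.)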
- Brian Small

Northwest Logic
Desk: 503-533-5800 x320
Mobile: 503-577-6869

> -----Original Message-----
> Date: Fri, 14 Nov 2014 13:39:25 -0500
> From: Prentice Bisbal <[email protected]>
> To: [email protected]
> Subject: Re: [gridengine users] Small percentage of qrsh jobs failing
>         on submit host, but successfully run on grid
> Message-ID: <[email protected]>
> Content-Type: text/plain; charset="iso-8859-1"; Format="flowed"
>
> Do you have multiple nodes that qrsh jobs could be sent to?
>
> I had a similar problem a couple of weeks ago. All the queues were set
> up to be batch or interactive, so a qrsh job could be assigned to any
> node in the cluster (that has since been fixed). So, the first step was
> to see if the problem was on all nodes, or just one that every
> successive qrsh job was sent to. To find this out, I used qrsh with the
> '-l hostname=...' option inside a loop, like this:
>
> for host in host1 host2 host3; do  # customize the host list for your hostnames
>     qrsh -l "hostname=$host"
> done
>
> Yes, this is tedious and requires typing 'exit' repeatedly, but I found
> out it was a single host causing the problem. Once I knew the host, I
> was able to look at its logs and configuration more closely to find the
> root of the problem.
>
> Prentice
>
> On 11/13/2014 06:34 PM, Brian Small wrote:
> >
> > Hello all, this is my first time posting to this mailing list.
> >
> > About 1% or less of our qrsh grid jobs are failing in an unusual way.
> >
> > We are running Open Grid Scheduler 2011.11 on CentOS 6.5.
> >
> > The small percentage of failing qrsh jobs get a non-zero exit status
> > back to the submit host (exit status 1), and display this message:
> >
> >     Your "qrsh" request could not be scheduled, try again later.
> >
> > Note, we do include the "-now n" option on the command line.
> >
> > Also, the qacct log shows the job as having completed successfully:
> >
> >     qsub_time    Thu Nov 13 14:17:47 2014
> >     start_time   Thu Nov 13 14:21:13 2014
> >     end_time     Thu Nov 13 14:25:15 2014
> >     granted_pe   NONE
> >     slots        1
> >     failed       0
> >     exit_status  0
> >     ru_wallclock 242
> >     ru_utime     226.439
> >     ru_stime     5.383
> >
> > And reviewing the working directory, it does look like the job
> > completed properly.
> >
> > I'm not sure how to take the next step in debugging this problem. Any
> > advice?
> >
> > Brian Small
> > Northwest Logic
> > 1100 NW Compton Drive, Ste. 100
> > Beaverton, OR 97006
> > Desk - 503-533-5800 x-320
> > Cell - 503-577-6869
> > Fax: 503-533-5900
> > E-mail - [email protected]
> > Web - www.nwlogic.com
>
> --
> Prentice Bisbal
> Manager of Information Technology
> Rutgers Discovery Informatics Institute (RDI2)
> Rutgers University
> http://rdi2.rutgers.edu

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
