Reuti,

Thanks for the quick response.

I am not running a script, but an executable (I also have "-b y" on the command 
line).  The executable runs a job, and that job seems to finish correctly 
(it has its own log file, which looks correct, and the job takes the right 
amount of time).  It is not trying to run another qrsh.

This is what it looks like is happening:

run on submit host: qrsh <command>

   <command> runs on an execution host on the grid, finishes successfully, with 
no error status (as reported in qacct)

back on the submit host, at around the time the grid job completes, qrsh 
returns an exit status of "1", and the message:

  Your "qrsh" request could not be scheduled, try again later.
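
To pin down exactly what comes back on the submit host, the qrsh call could be wrapped so that its exit status, error text, and return time are recorded for every job. The following is only a rough sketch: the helper name run_and_log is made up for illustration, and it assumes the "could not be scheduled" message is written to stderr, which has not been verified here.

```python
import subprocess
import time


def run_and_log(argv):
    """Run a command and return (exit_status, stderr_text, finished_at).

    `argv` is the full command line as a list, for example
    ["qrsh", "-now", "n", "-b", "y", "<command>"] (placeholder command).
    stdout is left untouched so the job's normal output still appears.
    """
    proc = subprocess.run(argv, stderr=subprocess.PIPE, text=True)
    # Timestamp taken when the command returns, for comparison with
    # the end_time reported by qacct.
    finished_at = time.strftime("%Y-%m-%d %H:%M:%S")
    return proc.returncode, proc.stderr.strip(), finished_at
```

Comparing the recorded return time against the end_time that qacct reports would show whether qrsh gives up before, at, or after the moment the job actually ends.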


And note, this is only happening on a small percentage of the jobs, all 
running the same <command> tool with different options.  The ones that fail 
are seemingly random.

I'm hoping someone can suggest a means of debugging this further.  There is 
nothing in the qmaster spool messages log, and the qacct log for the jobs that 
fail looks good as well.  It looks like a problem related to qrsh on the 
submit host that occurs only at the end of the job.
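
As a starting point for separating these cases from genuine job failures, the qrsh exit status could be cross-checked against the accounting record. A minimal sketch, assuming the qacct output has been captured as text in the "key value" layout quoted further down; the function names parse_qacct and is_submit_side_failure are hypothetical, not part of Grid Engine:

```python
def parse_qacct(text):
    """Parse 'key value' lines of qacct-style output into a dict of strings."""
    record = {}
    for line in text.splitlines():
        parts = line.split(None, 1)  # split on the first run of whitespace
        if len(parts) == 2:          # skip blank and single-token lines
            record[parts[0]] = parts[1].strip()
    return record


def is_submit_side_failure(qrsh_status, record):
    """True when qrsh returned non-zero but accounting shows a clean run."""
    return (
        qrsh_status != 0
        and record.get("failed") == "0"
        and record.get("exit_status") == "0"
    )
```

Jobs flagged this way are the ones where the execution side looks clean and only the submit-side qrsh reported an error, which is the pattern described above.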

- Brian Small
Northwest Logic
Desk: 503-533-5800 x320
Mobile: 503-577-6869


> -----Original Message-----
> From: Reuti [mailto:[email protected]]
> Sent: Thursday, November 13, 2014 4:17 PM
> To: Brian Small
> Cc: [email protected]
> Subject: Re: [gridengine users] Small percentage of qrsh jobs failing on
> submit host, but successfully run on grid
> 
> Hi,
> 
> Am 14.11.2014 um 00:34 schrieb Brian Small:
> 
> > Hello all, this is my first time posting to this mailing list.
> >
> > About 1% or less of our qrsh grid jobs are failing in an unusual way.
> >
> > We are running Open Grid Scheduler 2011.11 on CentOS 6.5.
> >
> > The small percentage of failing qrsh jobs get a non-zero exit status back to
> > the submit host (exit status 1), and display this message:
> 
> What do you start by `qrsh` - a binary or a script?
> 
> This sounds like the script that gets started wants to start another `qrsh`.
> If it's a script, using "#!/bin/sh -x" as its first line will list the
> executed commands.
> 
> -- Reuti
> 
> NB: The side effect of "-now n" is that the job will go to a queue of "qtype"
> set to "BATCH", while "-now y" will route to a queue with "qtype" being
> "INTERACTIVE" (the same applies when this option is used for `qsub`).
> 
> 
> > Your "qrsh" request could not be scheduled, try again later.
> >
> > Note, we do include the "-now n" option on the command line.
> >
> > Also the qacct log shows the job as having completed successfully:
> >
> > qsub_time    Thu Nov 13 14:17:47 2014
> > start_time   Thu Nov 13 14:21:13 2014
> > end_time     Thu Nov 13 14:25:15 2014
> > granted_pe   NONE
> > slots        1
> > failed       0
> > exit_status  0
> > ru_wallclock 242
> > ru_utime     226.439
> > ru_stime     5.383
> >
> > And reviewing the working directory, it does look like the job completed
> > properly.
> >
> > I'm not sure how to take the next step in debugging this problem.  Any
> > advice?
> >
> > Brian Small
> > Northwest Logic
> > 1100 NW Compton Drive, Ste. 100
> > Beaverton, OR  97006
> > Desk - 503-533-5800 x-320
> > Cell - 503-577-6869
> > Fax: 503-533-5900
> > E-mail - [email protected]
> > Web - www.nwlogic.com
> >
> > _______________________________________________
> > users mailing list
> > [email protected]
> > https://gridengine.org/mailman/listinfo/users

