Re: [gridengine users] command runs in grid engine but does not complete.

Dan Hyatt Tue, 09 Jun 2015 08:40:21 -0700

When I send a job to a queue, if the queue is busy it sends it to the
next queue (defeating the purpose of separate queues in my env). How do
I set the queues to run jobs ONLY in the appointed queue?

You should be able to do this with qsub -q <queuename> <job script>.
If that isn't doing it, then I suspect that either you are making soft
requests somehow or a JSV is rewriting your request. In any case, a qstat -j
on the job ID should reveal what grid engine thinks the job requires.
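
A rough sketch of that check, in case it helps. The queue name myqueue.q and the script job.sh are placeholders, and the hard_queue_list / soft_queue_list field names are as I remember them from qstat -j output, so verify them on your installation. The helper reads the qstat output on stdin, so it can also be tried on saved output.

```shell
# Keep only the queue-request lines from `qstat -j` output (read on stdin).
# Field names are assumed from memory of typical qstat -j output.
queue_requests() {
    grep -E '^(hard|soft)_queue_list:'
}

# Intended use (placeholders: myqueue.q, job.sh, <jobid>):
#   qsub -hard -q myqueue.q job.sh
#   qstat -j <jobid> | queue_requests
```

If a soft_queue_list line shows up, or the hard list differs from what was submitted, that would point at a soft request or a JSV rewriting the job.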

The execute nodes were updated, and some are not playing well in the
sandbox. When the grid sends a job there, it hangs, sends an error but

It's not clear what hangs: the job or the node?
What is the error that is being sent, and how is it being sent?

The nodes are hanging the job; some did not come up correctly. I had this problem when I was first building them.

I am taking them out using qmod -d

does not remove that blade from the execute node list like it did before.
Is there an easy way to manually test the execute nodes (there are 180),
and why is it not removing bad nodes from the available nodes as it did
before?  Before it would mark it unusable so when I list the execute
nodes I would see that the node was bad and it would not accept jobs.

It's not clear what was marking the node or how it was marking the node.

When a job dies as a result of some sort of error, grid engine tries
to figure out whether the cause is likely the node or the job. If it is the
node, it puts the appropriate queue instance into an error ('E') state. If it
is the job, it puts the job into an error state (Eqw, Erq or similar). One
possibility is that the errors are of a nature that grid engine takes for a
job problem rather than a node problem.
IIRC the exit status of the prolog can be used to set either the job or the
queue instance into an error state. Possibly you had a prolog that detected
problem nodes and has recently gone AWOL?
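
For illustration only, a prolog along those lines might look like the sketch below. The scratch-directory test is a made-up example of a node check, not something from your setup, and the exit-code meanings (99 reschedules the job, 100 puts the job in an error state, any other non-zero value puts the queue instance in an error state) are from my memory of sge_conf(5), so please check the man page for your version.

```shell
# Return 0 if the given directory exists and a file can be created in it,
# non-zero otherwise. The directory to test is chosen by the caller.
node_scratch_ok() {
    dir=$1
    [ -d "$dir" ] || return 1
    probe="$dir/.sge_prolog_probe.$$"
    # Subshell so that a failed redirection cannot abort the calling shell.
    ( : > "$probe" ) 2>/dev/null || return 1
    rm -f "$probe"
}

# In the prolog script itself (the /tmp path is a placeholder):
#   node_scratch_ok /tmp || exit 1
# As I recall, an exit status other than 0, 99 or 100 marks the queue
# instance as being in error, which keeps further jobs off that node.
```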

Possibly you had a load sensor and associated load_threshold that put the queues
into an alarmed ('a') state?  If that is the case you need to set them up again.

When I run qhost -j

I assume

HOSTNAME    ARCH      NCPU NSOC NCOR NTHR  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
blade5-5-5  lx-amd64    24    2   12   24  0.00  126.0G    1.2G    2.0G     0.0
blade5-5-6  lx-amd64    24    2   12   24     -  126.0G       -    2.0G       -
blade5-5-7  lx-amd64    24    2   12   24     -  126.0G       -    2.0G       -
blade5-5-8  lx-amd64    24    2   12   24  0.00  126.0G    1.2G    2.0G     0.0

means that 5-5-5 is up and working, and 5-5-6 and 5-5-7 are not available for use.
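
With 180 nodes, one way to pull out the hosts that show "-" in the LOAD column is a small filter like the sketch below. It assumes the qhost layout shown above (a header row containing a LOAD column) and assumes that "-" there means the host is not reporting load, which matches the reading above but is worth confirming against a node you know is down.

```shell
# Print the names of hosts whose LOAD column is "-" in qhost output (stdin).
# The LOAD column position is taken from the header row, so the filter does
# not depend on how many CPU topology columns this qhost version prints.
unreachable_hosts() {
    awk '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == "LOAD") col = i; next }
        col && $1 != "global" && $col == "-" { print $1 }
    '
}

# Intended use:
#   qhost | unreachable_hosts
```

The pseudo-host "global" is skipped because it normally shows "-" in every column.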


_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
