Hi,
I'm from Edinburgh, but I'm not using the cluster here; I have a private
cluster set up for a project elsewhere. Anyhow, the problem is that the
execution daemon fails to acknowledge jobs sent from the qmaster, causing
some jobs to be 'lost'. All the jobs are identical apart from their input
data sets, so they should not have crashed or finished so quickly. I have
run many tests, so it is not the case that I have simply overlooked their
execution.
For example, here are the entries in
/var/spool/gridengine/qmaster/messages for one job (ID 8424):
02/18/2013 20:39:23|worker|gad202|E|unable to find job 8424 from the scheduler order package
02/18/2013 20:39:23|schedu|gad202|E|unable to find job 8424 from the scheduler order package
02/18/2013 20:39:24|schedu|gad202|E|could not find job "8424" in master list
02/18/2013 20:39:24|schedu|gad202|E|callback function for event "1718. EVENT DEL JOB 8424.1" failed
Cheers,
Gaya
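To pull every qmaster message about one job out of a long messages file, a small filter can help. The sketch below assumes each entry is a single pipe-separated line of the form `MM/DD/YYYY HH:MM:SS|thread|host|level|message`, as in the excerpt above; the helper names `parse_line` and `entries_for_job` are hypothetical, and the last sample line is made up purely to show that job 18424 is not confused with job 8424.

```python
import re
from datetime import datetime
from typing import Iterable, List, NamedTuple, Optional


class LogEntry(NamedTuple):
    time: datetime
    thread: str
    host: str
    level: str   # e.g. "E" for error, as in the excerpt
    message: str


def parse_line(line: str) -> Optional[LogEntry]:
    """Split 'MM/DD/YYYY HH:MM:SS|thread|host|level|message'; None if malformed."""
    parts = line.rstrip("\n").split("|", 4)  # maxsplit keeps any '|' inside the message
    if len(parts) != 5:
        return None
    try:
        stamp = datetime.strptime(parts[0], "%m/%d/%Y %H:%M:%S")
    except ValueError:
        return None
    return LogEntry(stamp, parts[1], parts[2], parts[3], parts[4])


def entries_for_job(lines: Iterable[str], job_id: int) -> List[LogEntry]:
    """Return parsed entries whose message mentions job_id as a whole number."""
    # Lookarounds stop 8424 from matching inside 18424, but allow "8424.1".
    pattern = re.compile(rf"(?<!\d){job_id}(?!\d)")
    return [e for e in map(parse_line, lines)
            if e is not None and pattern.search(e.message)]


sample = [
    "02/18/2013 20:39:23|worker|gad202|E|unable to find job 8424 from the scheduler order package",
    '02/18/2013 20:39:24|schedu|gad202|E|could not find job "8424" in master list',
    '02/18/2013 20:39:24|schedu|gad202|E|callback function for event "1718. EVENT DEL JOB 8424.1" failed',
    "02/18/2013 20:40:00|schedu|gad202|I|job 18424.1 finished",  # hypothetical line
]
for entry in entries_for_job(sample, 8424):
    print(entry.time, entry.thread, entry.level, entry.message)
```

On the real file, `entries_for_job(open("/var/spool/gridengine/qmaster/messages"), 8424)` would apply the same filter, provided the entries are not wrapped across lines.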
On 23/02/13 05:59, Fritz Ferstl wrote:
Hi Gaya,
I see you are from Edinburgh, and the University of Edinburgh happens to be a
Univa Grid Engine customer. If you are part of that cluster and are in fact
using Univa Grid Engine, feel free to get your questions answered by our
support. We can take it offline if you have questions about that.
What Reuti has responded is correct, of course. I too would suspect failed jobs
or very short-running jobs that have simply finished. qstat -z and qacct will
let you check.
Cheers,
Fritz
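One way to follow the qacct suggestion above for many jobs at once is to parse the accounting records and flag the ones that did not end cleanly. This is a sketch under assumptions: it expects `qacct -j <id>` to print records separated by lines of `=` characters, with one `key value` field per line including `failed` and `exit_status`, and it treats any non-zero value in either field as a failure. The function names are hypothetical, and the sample text is invented for illustration, not taken from this thread.

```python
import subprocess
from typing import Dict, List


def parse_qacct(text: str) -> List[Dict[str, str]]:
    """Split 'qacct -j' style output into one dict per accounting record."""
    records: List[Dict[str, str]] = []
    current: Dict[str, str] = {}
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if set(stripped) == {"="}:
            # A separator line closes the previous record, if any.
            if current:
                records.append(current)
            current = {}
            continue
        parts = stripped.split(None, 1)  # key, then the rest of the line
        current[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    if current:
        records.append(current)
    return records


def looks_failed(record: Dict[str, str]) -> bool:
    """Assumed rule: non-zero 'failed' or 'exit_status' means the job failed."""
    failed = (record.get("failed") or "0").split()[0]
    exit_status = (record.get("exit_status") or "0").split()[0]
    return failed != "0" or exit_status != "0"


def qacct_job(job_id: int) -> List[Dict[str, str]]:
    """Run 'qacct -j <job_id>'; needs the Grid Engine tools on PATH."""
    result = subprocess.run(["qacct", "-j", str(job_id)],
                            capture_output=True, text=True, check=True)
    return parse_qacct(result.stdout)


# Invented sample output, only to exercise the parser.
sample = """==============================================================
jobnumber    8424
failed       0
exit_status  0
==============================================================
jobnumber    8425
failed       100 : assumedly after job
exit_status  137
"""
for rec in parse_qacct(sample):
    print(rec["jobnumber"], "failed" if looks_failed(rec) else "ok")
```

Jobs that appear in the accounting file with a clean status really did run; jobs that appear nowhere would point back at the lost-job problem instead.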
On 22.02.2013 at 18:04, Gaya Nadarajan <[email protected]> wrote:
Hi all,
I'm assigning slots to a queue that I have; right now the slot count is set
to the number of cores on the host. Do you know what consequence this has on
the number of jobs running? For example, I have assigned the queue 12 slots
and I'm trying to run 300 jobs on it. Should the jobs wait and all run
eventually? I had problems with jobs that stopped queueing and 'disappeared'.
Would increasing the number of slots be a better way around this?
Thanks,
Gaya
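A rough picture of what 12 slots mean for 300 jobs is given by the sketch below. It rests on simplifying assumptions that the thread does not state: every job needs exactly one slot, and jobs are dispatched first-come, first-served as soon as a slot frees up (the real scheduler also weighs priorities and other resource requests). Under those assumptions the jobs beyond the first 12 simply stay pending and start as earlier ones finish, so with equal runtimes the batch drains in ceil(300 / 12) = 25 waves. The helper names are hypothetical.

```python
import heapq
import math
from typing import List


def waves_needed(num_jobs: int, slots: int) -> int:
    """Number of dispatch 'waves' when all jobs have the same runtime."""
    if slots <= 0:
        raise ValueError("slots must be positive")
    return math.ceil(num_jobs / slots)


def makespan(runtimes: List[float], slots: int) -> float:
    """Simulate FIFO dispatch: each job starts on the slot that frees up first."""
    if slots <= 0:
        raise ValueError("slots must be positive")
    free_at = [0.0] * slots  # time at which each slot becomes free
    heapq.heapify(free_at)
    finish = 0.0
    for rt in runtimes:
        start = heapq.heappop(free_at)  # earliest available slot
        end = start + rt
        finish = max(finish, end)
        heapq.heappush(free_at, end)
    return finish


print(waves_needed(300, 12))      # 25
print(makespan([1.0] * 300, 12))  # 25.0 (in units of one job's runtime)
```

In this model, raising the slot count above the number of cores lets more jobs run at once, but they then share the same cores, so for equal-length CPU-bound jobs the total time would not be expected to drop much; pending jobs waiting for a slot is normal behaviour rather than a cause of jobs disappearing.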
--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users