Hi,

I'm from Edinburgh, but I'm not using the cluster here; I have a private cluster set up for a project elsewhere. Anyhow, the problem is associated with the execution daemon failing to acknowledge jobs sent from the qmaster, which causes some jobs to be 'lost'. All the jobs are identical apart from running on different data sets, so they should not have crashed or finished that quickly. I have run many tests, so it is not the case that I have overlooked their execution.

For example, here is the entry in /var/spool/gridengine/qmaster/messages for a job (id 8424):

02/18/2013 20:39:23|worker|gad202|E|unable to find job 8424 from the scheduler order package
02/18/2013 20:39:23|schedu|gad202|E|unable to find job 8424 from the scheduler order package
02/18/2013 20:39:24|schedu|gad202|E|could not find job "8424" in master list
02/18/2013 20:39:24|schedu|gad202|E|callback function for event "1718. EVENT DEL JOB 8424.1" failed
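
To see how many jobs are affected, the messages file could be scanned for every job id that appears in an error entry. The following is only a rough sketch: it assumes each entry has the pipe-separated layout seen above (date and time, component, host, level, message), and the function name `error_job_ids` is made up for illustration. Lines without that layout, such as wrapped continuation lines, are skipped.

```python
import re
from collections import Counter

# Assumed entry layout, inferred from the excerpt above:
#   date time|component|host|level|message
# Matches 'job 8424', 'job "8424"' and 'JOB 8424.1' (captures the job id only).
JOB_ID = re.compile(r'\bjob "?(\d+)', re.IGNORECASE)


def error_job_ids(lines):
    """Count error-level ("E") entries per job id mentioned in the message."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("|", 4)
        if len(fields) != 5 or fields[3] != "E":
            continue  # not an error entry, or a wrapped continuation line
        for job_id in JOB_ID.findall(fields[4]):
            counts[int(job_id)] += 1
    return counts
```

Passing an open file handle for /var/spool/gridengine/qmaster/messages as `lines` would give a per-job count of error entries, which can then be compared against the list of submitted job ids.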


Cheers,
Gaya


On 23/02/13 05:59, Fritz Ferstl wrote:
Hi Gaya,

I see you are from Edinburgh and Univ of Edinburgh happens to be a Univa Grid 
Engine customer. If you're part of that cluster and you are in fact using Univa 
Grid Engine then feel free to get your questions answered by our support. We 
can take it off-line if you've questions around that.

What Reuti has responded is correct, of course. I too would suspect failed jobs 
or very short-running jobs which have just finished. qstat -s z and qacct will 
allow you to check.
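
As a rough illustration of the qacct check, the sketch below runs `qacct -j <job_id>` and splits the output into one dictionary per accounting record. It assumes the usual output layout of a separator line of "=" characters followed by "key value" lines, and that qacct exits with a non-zero status when it has no record for the job. The function names are hypothetical, and the `failed` and `exit_status` keys mentioned afterwards are field names as I remember them, so please check them against your own output.

```python
import subprocess


def parse_qacct(text):
    """Split `qacct -j` output into one dict per accounting record."""
    records, current = [], {}
    for line in text.splitlines():
        if line.startswith("====="):  # assumed separator before each record
            if current:
                records.append(current)
            current = {}
            continue
        parts = line.split(None, 1)   # "key   value" -> [key, value]
        if len(parts) == 2:
            current[parts[0]] = parts[1].strip()
    if current:
        records.append(current)
    return records


def accounting_for(job_id):
    """Run `qacct -j <job_id>`; returns [] if qacct exits non-zero."""
    result = subprocess.run(["qacct", "-j", str(job_id)],
                            capture_output=True, text=True)
    if result.returncode != 0:
        return []
    return parse_qacct(result.stdout)
```

For a job such as 8424, an empty list from `accounting_for(8424)` would suggest the job never reached the accounting file, while a record whose `failed` or `exit_status` value is non-zero would point to a job that started and then failed.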

Cheers,

Fritz


On 22.02.2013 at 18:04, Gaya Nadarajan<[email protected]> wrote:

Hi all,

I'm assigning slots to a queue that I have; right now it is set to the 
number of cores on the host. Do you know what consequence this would have on 
the number of jobs running? For example, I have assigned the queue 12 slots, 
and I'm trying to run 300 jobs on it. Should the jobs wait and all run 
eventually? I have had problems where jobs stop queueing and 'disappear'. Would 
increasing the slots be a better way around this?

Thanks,
Gaya

--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
