Hello,

I use multiple queues to divide up available resources based on job run times.
Large parallel jobs typically span multiple queues, and this has generally
worked fine so far.  However, I recently increased the number of queues
(from 4 to 9) so that the time limits can be more fine-grained.  After this
change, I noticed that large parallel jobs consistently fail if more than
3-4 queues are in use on each host.  The failed jobs generate the
following message:

Execution daemon on host <hostname> didn't accept task

I see this problem using both "builtin" and SSH for job startup.

While the error message is different, I think this may be related to a problem
I had previously
(http://gridengine.org/pipermail/users/2012-November/005164.html).  In that
case, I was having problems starting large numbers of small parallel jobs at the
same time (which in turn caused jobs to start on many different queues at
the same time).  I suspect there is a race condition in this specific
scenario (many parallel jobs starting at the same time across multiple queues
on the same host).

Thanks,
Brendan