Hello,

I use multiple queues to divide up available resources based on job run times. Large parallel jobs typically span multiple queues, and this has generally worked fine so far. However, I recently increased the number of queues (from 4 to 9) so that the time limits can be more fine-grained. After this change I noticed that large parallel jobs consistently fail if more than 3-4 queues are in use on each host. The failed jobs generate the following messages:

    Execution daemon on host <hostname> didn't accept task

I see this problem using both "builtin" and SSH for job startup.

While the error message is different, I think this may be related to a problem I had previously (http://gridengine.org/pipermail/users/2012-November/005164.html). In that case I was having problems starting large numbers of small parallel jobs at the same time (which would in turn cause jobs to start on many different queues at the same time). I am thinking there must be some race condition going on in this specific scenario (many parallel jobs starting at the same time across multiple queues on the same host).

Thanks,
Brendan
