On Fri, 9 Feb 2018 at 1:29am, William Hay wrote:

On Thu, Feb 08, 2018 at 03:42:03PM -0800, Joshua Baker-LePain wrote:
 153758 0.51149 tomography USER1       qw    02/08/2018 14:03:05
 153759 0.00000 qss_svk_ge USER2       qw    02/08/2018 14:15:06          1 1
 153760 0.00000 qss_svk_ge USER2       qw    02/08/2018 14:15:06          1 1

with more jobs below that, all with 0.00000 priority.  Starting at 14:03:06
in the messages file, I see this:

02/08/2018 14:03:06|worker|wynq1|E|not enough (1) free slots in queue "ondemand.q@cin-id3" for job 153758.1

And in the schedule file I see this:

So why is it trying to give the job slots in ondemand.q?
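
An aside for anyone debugging the same symptom: if schedd_job_info is enabled in the scheduler configuration, qstat can print the scheduler's own reasoning for a pending job, which typically names the queue or quota rule blocking it. A minimal sketch, using one of the job IDs above:

  # Assumes schedd_job_info has been set to true (editable via qconf -msconf);
  # it is often left disabled because it adds scheduler overhead.
  qstat -j 153759      # look for the "scheduling info" section in the output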

Has the job in question requested the ondemand queue via -masterq by any chance? I have heard people who should know say that -masterq is somewhat buggy. I've never had a problem with -masterq myself but I don't use it much and we don't use RQS either. Possibly the alleged bugginess of -masterq manifests in the presence of RQS.
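
If it helps anyone grepping their submit scripts, a -masterq request looks like the sketch below; the PE name and job script are hypothetical.

  # Hypothetical 8-slot parallel job whose MASTER task is pinned to ondemand.q;
  # slave tasks may still be scheduled to other queues.
  qsub -pe mpi 8 -masterq ondemand.q myjob.sh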

I can't say for sure, but our documentation doesn't mention -masterq *and* we ask folks not to specify queues in their job options. So I would strongly suspect that the jobs that get stuck don't request a queue in any way.

Does the pe in question have job_is_first_task set to FALSE? If so, this may be a quirk in how RQS treats the MASTER task.

job_is_first_task is set to TRUE.
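
For anyone checking their own setup: job_is_first_task lives in the parallel environment definition and can be inspected with qconf. A trimmed sketch; the PE name is hypothetical.

  qconf -sp mpi
  pe_name            mpi
  allocation_rule    $fill_up
  control_slaves     TRUE
  job_is_first_task  TRUE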

As of now, I haven't had any more jobs get stuck since I made the changes yesterday. Thanks for pointing me towards the RQSes as the likely culprits.
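
For the archive, the quota sets themselves can be audited with stock tooling; the RQS name below is a placeholder.

  # List all resource quota sets, then dump one to see if it mentions ondemand.q
  qconf -srqsl
  qconf -srqs <rqs_name>
  # Show the quota limits currently applied to a user's jobs
  qquota -u USER2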

Joshua Baker-LePain
QB3 Shared Cluster Sysadmin