I removed the virtual_free complex and that seemed to solve things. However, I do need to use this complex.

I just realized that all jobs actually do request the complex, because it has a default value of 2G:

# qconf -sc | grep virtual_free
virtual_free        mem        MEMORY      <=    YES         JOB        2G       0

So I think what happened is that one exec host was full, which caused this error in master/spool/messages:

04/20/2017 15:28:49|worker|ibm068|E|host load value "virtual_free" exceeded: capacity is 18650263552.262142, job 5066834 requests additional 19327352832.000000

After that, all new jobs stay in the waiting state:

5066334 0.55500 pt_shell   huan         qw    04/20/2017 14:46:18         1
5066335 0.00000 calibredrv chengtai     qw    04/20/2017 14:46:18         1
5066336 0.00000 calibredrv chenyi       qw    04/20/2017 14:46:23         1
5066338 0.00000 calibredrv chenyi       qw    04/20/2017 14:46:33         1
5066339 0.00000 virtuoso   allenmo      qw    04/20/2017 14:46:56         1
5066341 0.00000 virtuoso   johnt        qw    04/20/2017 14:47:42         1
5066342 0.00000 calibre    yonglong     qw    04/20/2017 14:48:11         1
5066343 0.00000 nettran    felicia      qw    04/20/2017 14:48:18         1

Here are sample messages:

04/20/2017 15:28:42|worker|ibm068|E|cannot start job 5066834.1, as resources have changed during a scheduling run
04/20/2017 15:28:42|worker|ibm068|W|Skipping remaining 12 orders
04/20/2017 15:28:42|schedu|ibm068|E|cannot start job 5066834.1, as resources have changed during a scheduling run

So it seems that one host being full is affecting all the others?

Hope you can help me. And thank you again for replying to me. Really appreciate it.

John


-----Original Message-----
From: users-boun...@gridengine.org [mailto:users-boun...@gridengine.org] On Behalf Of John_Tai
Sent: Thursday, April 20, 2017 2:29
To: Reuti
Cc: users@gridengine.org
Subject: Re: [gridengine users] Queue dropped because it is full, except it is not

>> The queue is also defined as being "qtype INTERACTIVE"?

Yes, both interactive and batch.

>> And only a load of 7.75?

That was the current load.

>> Are there any consumable resource requests? I.e. is the memory perhaps fully
>> used up by the already running jobs (be it h_vmem, virtual_free or any
>> other consumable)?

Jobs are not submitted with any consumable requests, though I have set virtual_free as a complex.

>> Did you upgrade all nodes?

I did upgrade all exec hosts. Here are the error messages from the master:

04/20/2017 14:28:07|schedu|ibm068|E|cannot start job 5066074.1, as resources have changed during a scheduling run
04/20/2017 14:28:08|worker|ibm068|E|host load value "virtual_free" exceeded: capacity is 20690952192.524288, job 5066074 requests additional 21474836480.000000
04/20/2017 14:28:08|worker|ibm068|E|cannot start job 5066074.1, as resources have changed during a scheduling run
04/20/2017 14:28:08|worker|ibm068|W|Skipping remaining 32 orders
04/20/2017 14:28:08|schedu|ibm068|E|cannot start job 5066074.1, as resources have changed during a scheduling run
04/20/2017 14:28:09|worker|ibm068|E|host load value "virtual_free" exceeded: capacity is 20690952192.524288, job 5066074 requests additional 21474836480.000000
04/20/2017 14:28:09|worker|ibm068|E|cannot start job 5066074.1, as resources have changed during a scheduling run
04/20/2017 14:28:09|worker|ibm068|W|Skipping remaining 32 orders
04/20/2017 14:28:09|schedu|ibm068|E|cannot start job 5066074.1, as resources have changed during a scheduling run
04/20/2017 14:28:10|worker|ibm068|E|host load value "virtual_free" exceeded: capacity is 20690952192.524288, job 5066074 requests additional 21474836480.000000
04/20/2017 14:28:10|worker|ibm068|E|cannot start job 5066074.1, as resources have changed during a scheduling run
04/20/2017 14:28:10|worker|ibm068|W|Skipping remaining 32 orders
04/20/2017 14:28:10|schedu|ibm068|E|cannot start job 5066074.1, as resources have changed during a scheduling run
04/20/2017 14:28:11|worker|ibm068|E|host load value "virtual_free" exceeded: capacity is 20690952192.524288, job 5066074 requests additional 21474836480.000000
04/20/2017 14:28:11|worker|ibm068|E|cannot start job 5066074.1, as resources have changed during a scheduling run
04/20/2017 14:28:11|worker|ibm068|W|Skipping remaining 33 orders
04/20/2017 14:28:11|schedu|ibm068|E|cannot start job 5066074.1, as resources have changed during a scheduling run


-----Original Message-----
From: Reuti [mailto:re...@staff.uni-marburg.de]
Sent: Wednesday, April 19, 2017 7:26
To: John_Tai
Cc: users@gridengine.org
Subject: Re: [gridengine users] Queue dropped because it is full, except it is not

Hi,

> On 19.04.2017 at 09:00, John_Tai <john_...@smics.com> wrote:
>
> I am trying to submit a job to a specific host in the queue:
>
> # qrsh -verbose -q gui.q@ibm056
> Your job 5049542 ("QRLOGIN") has been submitted
> waiting for interactive job to be scheduled ...
>
>
> However it is in waiting state:
>
> # qstat -u johnt
> job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
> -----------------------------------------------------------------------------------------------------------------
> 5049542 0.55500 QRLOGIN    johnt        qw    04/19/2017 14:51:19                                    1

The queue is also defined as being "qtype INTERACTIVE"?

> # qstat -j 5049542 |grep gui.q
> hard_queue_list:            gui.q@ibm056
>                             queue instance "gui.q@dsbm05" dropped because it is full
>
> Here is the current status of the queue:
>
> # qstat -f |grep gui.q
> gui.q@dsbm04                   BIP   0/5/45         8.87     lx24-amd64
> gui.q@dsbm05                   BIP   0/55/55        7.75     lx24-amd64

And only a load of 7.75?

> gui.q@ibm056                   BIP   0/11/30        3.15     lx24-amd64

Are there any consumable resource requests? I.e. is the memory perhaps fully used up by the already running jobs (be it h_vmem, virtual_free or any other consumable)?

> gui.q@ibm057                   BIP   0/11/30        1.34     lx24-amd64
> gui.q@ibm058                   BIP   0/11/45        3.47     lx24-amd64
>
>
> The same goes for ibm057 and ibm058. It seems that dsbm05 being full blocks
> all following servers in the queue list. In fact I can submit to dsbm04,
> which precedes dsbm05.
>
> I recently upgraded from sge6.1 to sge6.2u6, though I can’t be sure that’s
> the only thing that’s changed. How do I even begin to debug this?

Did you upgrade all nodes?

-- Reuti

________________________________
This email (including its attachments, if any) may be confidential and proprietary information of SMIC, and intended only for the use of the named recipient(s) above. Any unauthorized use or disclosure of this email is strictly prohibited. If you are not the intended recipient(s), please notify the sender immediately and delete this email from your computer.

_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
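
For anyone checking the numbers in the "virtual_free" errors quoted in this thread, here is a minimal sketch in plain Python. It is not part of Grid Engine: the function name parse_exceeded and the regular expression are illustrative assumptions, derived only from the message format shown in the log excerpts above. It pulls the capacity and the additional request out of one messages line and prints both in GiB along with the shortfall.

```python
import re

# Pattern for the "host load value ... exceeded" message as it appears in the
# log excerpts quoted in this thread. The regex is an assumption based on
# those sample lines only, not on Grid Engine documentation.
EXCEEDED_RE = re.compile(
    r'host load value "(?P<name>[^"]+)" exceeded: '
    r'capacity is (?P<capacity>[\d.]+), '
    r'job (?P<job>\d+) requests additional (?P<request>[\d.]+)'
)

GIB = 1024 ** 3  # bytes per GiB


def parse_exceeded(line):
    """Return (complex name, job id, capacity, request) or None if no match.

    Capacity and request are returned in bytes, as floats.
    """
    m = EXCEEDED_RE.search(line)
    if m is None:
        return None
    return (
        m.group("name"),
        int(m.group("job")),
        float(m.group("capacity")),
        float(m.group("request")),
    )


if __name__ == "__main__":
    # Sample line copied from the thread.
    sample = (
        '04/20/2017 15:28:49|worker|ibm068|E|host load value '
        '"virtual_free" exceeded: capacity is 18650263552.262142, '
        'job 5066834 requests additional 19327352832.000000'
    )
    name, job, capacity, request = parse_exceeded(sample)
    print(
        f"{name}: job {job} requests {request / GIB:.2f} GiB, "
        f"capacity is {capacity / GIB:.2f} GiB, "
        f"short by {(request - capacity) / GIB:.2f} GiB"
    )
```

Both requests in the quoted errors are whole multiples of a GiB: 19327352832 bytes is 18 GiB and 21474836480 bytes is 20 GiB, in each case somewhat more than the capacity reported for the host at that moment.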