Removing the virtual_free complex seemed to solve the problem. However, I do 
need to use this complex.

I just realized that all jobs actually do request the complex, because there's 
a default value of 2G:

# qconf -sc | grep virtual_free
virtual_free        mem        MEMORY      <=    YES         JOB        2G        0
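
For reference, the row above can be split into named fields to make the default easier to spot. This is only a sketch: the column names are the ones printed in the `qconf -sc` header, `parse_complex_row` is a hypothetical helper, and the "implicit request" check reflects the behaviour described in this message (a consumable with a non-zero default is charged to jobs that do not request it explicitly).

```python
# Column names as printed in the header line of `qconf -sc`.
COLUMNS = ("name", "shortcut", "type", "relop",
           "requestable", "consumable", "default", "urgency")

def parse_complex_row(row):
    """Split one whitespace-separated `qconf -sc` row into a dict."""
    fields = row.split()
    if len(fields) != len(COLUMNS):
        raise ValueError("expected %d fields, got %d" % (len(COLUMNS), len(fields)))
    return dict(zip(COLUMNS, fields))

# The row from this message, with the wrapped urgency column re-joined.
row = "virtual_free mem MEMORY <= YES JOB 2G 0"
entry = parse_complex_row(row)

# Assumption: a consumable whose default is neither 0 nor NONE is applied
# to every job, even when the job does not request it.
implicit = entry["consumable"] != "NO" and entry["default"] not in ("0", "NONE")
print(entry["name"], entry["default"], implicit)
```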

So I think what happened is that one exec host was full, which caused this error 
in master/spool/messages:

04/20/2017 15:28:49|worker|ibm068|E|host load value "virtual_free" exceeded: capacity is 18650263552.262142, job 5066834 requests additional 19327352832.000000
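
Both numbers in that message are in bytes, so converting them makes the gap easier to see. The following is a minimal sketch, assuming the message has the shape shown above once the wrapped lines are re-joined; `parse_exceeded` is a hypothetical helper name and not part of Grid Engine.

```python
import re

# Pattern for the qmaster "host load value ... exceeded" message.
# Assumption: the line has already been re-joined into a single line.
EXCEEDED = re.compile(
    r'host load value "(?P<name>[^"]+)" exceeded:\s+'
    r'capacity is (?P<capacity>[\d.]+), '
    r'job (?P<job>\d+) requests additional\s+(?P<request>[\d.]+)'
)

GIB = 1024 ** 3  # bytes per GiB

def parse_exceeded(line):
    """Return the sizes in GiB, or None if the line does not match."""
    m = EXCEEDED.search(line)
    if m is None:
        return None
    capacity = float(m.group("capacity"))
    request = float(m.group("request"))
    return {
        "complex": m.group("name"),
        "job": int(m.group("job")),
        "capacity_gib": capacity / GIB,
        "request_gib": request / GIB,
        "shortfall_gib": (request - capacity) / GIB,
    }

line = ('04/20/2017 15:28:49|worker|ibm068|E|host load value "virtual_free" '
        'exceeded: capacity is 18650263552.262142, job 5066834 requests '
        'additional 19327352832.000000')
info = parse_exceeded(line)
print(info)
```

For the message above this gives a request of 18 GiB against roughly 17.4 GiB of remaining virtual_free on the host.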

And after that all new jobs are in waiting state:

5066334 0.55500 pt_shell   huan         qw    04/20/2017 14:46:18                                    1
5066335 0.00000 calibredrv chengtai     qw    04/20/2017 14:46:18                                    1
5066336 0.00000 calibredrv chenyi       qw    04/20/2017 14:46:23                                    1
5066338 0.00000 calibredrv chenyi       qw    04/20/2017 14:46:33                                    1
5066339 0.00000 virtuoso   allenmo      qw    04/20/2017 14:46:56                                    1
5066341 0.00000 virtuoso   johnt        qw    04/20/2017 14:47:42                                    1
5066342 0.00000 calibre    yonglong     qw    04/20/2017 14:48:11                                    1
5066343 0.00000 nettran    felicia      qw    04/20/2017 14:48:18                                    1

Here are sample messages:

04/20/2017 15:28:42|worker|ibm068|E|cannot start job 5066834.1, as resources have changed during a scheduling run
04/20/2017 15:28:42|worker|ibm068|W|Skipping remaining 12 orders
04/20/2017 15:28:42|schedu|ibm068|E|cannot start job 5066834.1, as resources have changed during a scheduling run

So it seems one host being full is affecting all others?

Hope you can help me. And thank you again for replying to me. Really appreciate 
it.

John




-----Original Message-----
From: users-boun...@gridengine.org [mailto:users-boun...@gridengine.org] On Behalf Of John_Tai
Sent: Thursday, April 20, 2017 2:29
To: Reuti
Cc: users@gridengine.org
Subject: Re: [gridengine users] Queue dropped because it is full, except it is not

>> The queue is also defined as being "qtype INTERACTIVE"?

Yes both interactive and batch.

>> And only a load of 7.75?

That was the current load.

>> Are there any consumable resource requests? I.e. is the memory perhaps fully 
>> used up by the already running jobs (be it h_vmem, virtual_free or any 
>> other consumable)?

Jobs are not submitted with any consumable requests, though I have defined 
virtual_free as a complex.

>> Did you upgrade all nodes?

I did upgrade all exec hosts.

Here are error messages from the master:

04/20/2017 14:28:07|schedu|ibm068|E|cannot start job 5066074.1, as resources have changed during a scheduling run
04/20/2017 14:28:08|worker|ibm068|E|host load value "virtual_free" exceeded: capacity is 20690952192.524288, job 5066074 requests additional 21474836480.000000
04/20/2017 14:28:08|worker|ibm068|E|cannot start job 5066074.1, as resources have changed during a scheduling run
04/20/2017 14:28:08|worker|ibm068|W|Skipping remaining 32 orders
04/20/2017 14:28:08|schedu|ibm068|E|cannot start job 5066074.1, as resources have changed during a scheduling run
04/20/2017 14:28:09|worker|ibm068|E|host load value "virtual_free" exceeded: capacity is 20690952192.524288, job 5066074 requests additional 21474836480.000000
04/20/2017 14:28:09|worker|ibm068|E|cannot start job 5066074.1, as resources have changed during a scheduling run
04/20/2017 14:28:09|worker|ibm068|W|Skipping remaining 32 orders
04/20/2017 14:28:09|schedu|ibm068|E|cannot start job 5066074.1, as resources have changed during a scheduling run
04/20/2017 14:28:10|worker|ibm068|E|host load value "virtual_free" exceeded: capacity is 20690952192.524288, job 5066074 requests additional 21474836480.000000
04/20/2017 14:28:10|worker|ibm068|E|cannot start job 5066074.1, as resources have changed during a scheduling run
04/20/2017 14:28:10|worker|ibm068|W|Skipping remaining 32 orders
04/20/2017 14:28:10|schedu|ibm068|E|cannot start job 5066074.1, as resources have changed during a scheduling run
04/20/2017 14:28:11|worker|ibm068|E|host load value "virtual_free" exceeded: capacity is 20690952192.524288, job 5066074 requests additional 21474836480.000000
04/20/2017 14:28:11|worker|ibm068|E|cannot start job 5066074.1, as resources have changed during a scheduling run
04/20/2017 14:28:11|worker|ibm068|W|Skipping remaining 33 orders
04/20/2017 14:28:11|schedu|ibm068|E|cannot start job 5066074.1, as resources have changed during a scheduling run



-----Original Message-----
From: Reuti [mailto:re...@staff.uni-marburg.de]
Sent: Wednesday, April 19, 2017 7:26
To: John_Tai
Cc: users@gridengine.org
Subject: Re: [gridengine users] Queue dropped because it is full, except it is not

Hi,

> On 19.04.2017 at 09:00, John_Tai <john_...@smics.com> wrote:
>
> I am trying to submit a job to a specific host in the queue:
>
> # qrsh -verbose -q gui.q@ibm056
> Your job 5049542 ("QRLOGIN") has been submitted
> waiting for interactive job to be scheduled ...
>
>
> However it is in waiting state:
>
> # qstat -u johnt
> job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
> -----------------------------------------------------------------------------------------------------------------
> 5049542 0.55500 QRLOGIN    johnt        qw    04/19/2017 14:51:19                                    1

The queue is also defined as being "qtype INTERACTIVE"?


> # qstat -j 5049542 |grep gui.q
> hard_queue_list:            gui.q@ibm056
>                             queue instance "gui.q@dsbm05" dropped because it is full
>
> Here is the current status of the queue:
>
> # qstat -f |grep gui.q
> gui.q@dsbm04                   BIP   0/5/45         8.87     lx24-amd64
> gui.q@dsbm05                   BIP   0/55/55        7.75     lx24-amd64

And only a load of 7.75?


> gui.q@ibm056                   BIP   0/11/30        3.15     lx24-amd64

Are there any consumable resource requests? I.e. is the memory perhaps fully 
used up by the already running jobs (be it h_vmem, virtual_free or any other 
consumable)?


> gui.q@ibm057                   BIP   0/11/30        1.34     lx24-amd64
> gui.q@ibm058                   BIP   0/11/45        3.47     lx24-amd64
>
>
> The same goes for ibm057 and ibm058. It seems that dsbm05 being full blocks 
> all following servers in the queue list. In fact I can submit to dsbm04, 
> which precedes dsbm05.
>
> I recently upgraded from sge6.1 to sge6.2u6, though I can’t be sure that’s 
> the only thing that’s changed. How do I even begin to debug this?

Did you upgrade all nodes?

-- Reuti
________________________________

This email (including its attachments, if any) may be confidential and 
proprietary information of SMIC, and intended only for the use of the named 
recipient(s) above. Any unauthorized use or disclosure of this email is 
strictly prohibited. If you are not the intended recipient(s), please notify 
the sender immediately and delete this email from your computer.

_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users