As Chris says, the output of "qstat -j <jobID>" should be the first place
to look. It should explain why the job is in an error state.
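
With this many jobs in Eqw, it may be easier to pull the job IDs out of the
qstat listing and check them one at a time. Below is a rough Python sketch,
not a tested tool: it assumes the default qstat column order shown in Dan's
listing (job ID, priority, name, user, state), and the function names are
made up for illustration. The third sample row (job 1144999, state qw) is
hypothetical and only there to show that non-error jobs are skipped.

```python
import re

# One row of the default `qstat` listing: job ID, priority, name, user, state.
# Assumption: columns are whitespace-separated, as in the listing in this thread.
ROW = re.compile(r"^\s*(\d+)\s+[\d.]+\s+(\S+)\s+(\S+)\s+(\S+)\s")


def eqw_job_ids(listing):
    """Return the IDs of jobs whose state column contains 'E' (e.g. Eqw)."""
    ids = []
    for line in listing.splitlines():
        match = ROW.match(line)
        if match and "E" in match.group(4):
            ids.append(match.group(1))
    return ids


def detail_commands(job_ids):
    """Build one `qstat -j <jobID>` command line per job ID."""
    return ["qstat -j " + job_id for job_id in job_ids]


sample = """\
1144122 0.55500 sas64      username       Eqw   09/27/2016 22:54:45     1
1144125 0.55500 sas64      username       Eqw   09/27/2016 22:55:35     1
1144999 0.55500 sas64      username       qw    09/27/2016 22:56:25     1
"""

for command in detail_commands(eqw_job_ids(sample)):
    print(command)
```

Running the printed commands on a submit host should show the details for
each job in error state.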

Ian

On Wed, Sep 28, 2016 at 8:06 AM, Dan Hyatt <dhy...@dsgmail.wustl.edu> wrote:

> Thanks,
>
> After what you said, it sounds like it is something the user is doing. But she
> is saying some of the jobs are working and some are being dumped because
> it's full.
>
>
> On 09/28/2016 09:41 AM, Chris Dagdigian wrote:
>
>>
>> I think the "queue instance dropped because ... full" is not related to
>> your user/job problem. The dropped message is a sign from the job placement
>> process that the queue instance was skipped during the active host
>> select-and-job-dispatch round because it had no more job slots free to take
>> new work. This would be a normal status alert on an active cluster with
>> lots of jobs in 'qw' state. No big deal basically unless you think a
>> resource, quota or some other thing is interfering.
>>
>> State "Eqw" is usually a sign that something went badly wrong with a job.
>> It can point to a significant issue, such as the user's UID/GID not
>> existing on the execution host, or it can be as simple as a user
>> error in a script (permission denied, path not found, etc.).
>>
>> What does "qstat -j <jobID>" tell you about the jobs in Eqw state? Any
>> interesting spool logs from the compute nodes or qmaster?
>>
>> Chris
>>
>>
>>
>>
>> Dan Hyatt wrote:
>>
>>>
>>> I am trying to narrow down what would cause this. I searched Google and
>>> the SGE resources and could not find a reason for
>>>
>>>   queue instance "VeryHighMem@blade5-5-8" dropped because it is full
>>>   queue instance "HighMem@blade5-1-4" dropped because it is full
>>>
>>> This is that one user almost every shop has who is incredible at their
>>> work but causes about 90% of the technical problems because of bad choices.
>>>
>>>
>>> Why would SGE queue the jobs for everyone else but, for this user,
>>> suddenly drop jobs "because it is full"?
>>>
>>> I have lots of jobs that went to "Eqw", as shown in the following:
>>> 1144122 0.55500 sas64      username       Eqw   09/27/2016 22:54:45     1
>>> 1144125 0.55500 sas64      username       Eqw   09/27/2016 22:55:35     1
>>> 1144127 0.55500 sas64      username       Eqw   09/27/2016 22:56:25     1
>>> 1144130 0.55500 sas64      username       Eqw   09/27/2016 22:57:15     1
>>> 1144134 0.55500 sas64      username       Eqw   09/27/2016 22:58:05     1
>>> 1144139 0.55500 sas64      username       Eqw   09/27/2016 22:58:55     1
>>> 1144142 0.55500 sas64      username       Eqw   09/27/2016 22:59:46     1
>>> 1144145 0.55500 sas64      username       Eqw   09/27/2016 23:00:36     1
>>> 1144151 0.55500 sas64      username       Eqw   09/27/2016 23:01:26     1
>>> 1144156 0.55500 sas64      username       Eqw   09/27/2016 23:02:16     1
>>> 1144161 0.55500 sas64      username       Eqw   09/27/2016 23:03:06     1
>>> 1144165 0.55500 sas64      username       Eqw   09/27/2016 23:03:56     1
>>> 1144169 0.55500 sas64      username       Eqw   09/27/2016 23:04:46     1
>>> 1144174 0.55500 sas64      username       Eqw   09/27/2016 23:05:36     1
>>> 1144177 0.55500 sas64      username       Eqw   09/27/2016 23:06:26     1
>>> 1144182 0.55500 sas64      username       Eqw   09/27/2016 23:07:17     1
>>> 1144186 0.55500 sas64      username       Eqw   09/27/2016 23:08:07     1
>>> 1144196 0.55500 sas64      username       Eqw   09/27/2016 23:08:57     1
>>> 1144204 0.55500 sas64      username       Eqw   09/27/2016 23:09:47     1
>>> 1144212 0.55500 sas64      username       Eqw   09/27/2016 23:10:37     1
>>> 1144217 0.55500 sas64      username       Eqw   09/27/2016 23:11:27     1
>>> 1144221 0.55500 sas64      username       Eqw   09/27/2016 23:12:17     1
>>> 1144224 0.55500 sas64      username       Eqw   09/27/2016 23:13:08     1
>>> 1144225 0.55500 sas64      username       Eqw   09/27/2016 23:13:58     1
>>> 1144227 0.55500 sas64      username       Eqw   09/27/2016 23:14:48     1
>>> 1144232 0.55500 sas64      username       Eqw   09/27/2016 23:15:38     1
>>> 1144236 0.55500 sas64      username       Eqw   09/27/2016 23:16:28     1
>>> 1144244 0.55500 sas64      username       Eqw   09/27/2016 23:17:18     1
>>> 1144255 0.55500 sas64      username       Eqw   09/27/2016 23:18:09     1
>>> 1144265 0.55500 sas64      username       Eqw   09/27/2016 23:18:59     1
>>> 1144276 0.55500 sas64      username       Eqw   09/27/2016 23:19:49     1
>>> 1144286 0.55500 sas64      username       Eqw   09/27/2016 23:20:39     1
>>> 1144295 0.55500 sas64      username       Eqw   09/27/2016 23:21:29     1
>>> 1144306 0.55500 sas64      username       Eqw   09/27/2016 23:22:19     1
>>> 1144316 0.55500 sas64      username       Eqw   09/27/2016 23:23:09     1
>>> 1144326 0.55500 sas64      username       Eqw   09/27/2016 23:23:59     1
>>> 1144335 0.55500 sas64      username       Eqw   09/27/2016 23:24:49     1
>>> 1144344 0.55500 sas64      username       Eqw   09/27/2016 23:25:39     1
>>> 1144351 0.55500 sas64      username       Eqw   09/27/2016 23:26:30     1
>>> 1144359 0.55500 sas64      username       Eqw   09/27/2016 23:27:20     1
>>> 1144366 0.55500 sas64      username       Eqw   09/27/2016 23:28:10     1
>>> 1144374 0.55500 sas64      username       Eqw   09/27/2016 23:29:00     1
>>> 1144416 0.55500 sas64      username       Eqw   09/27/2016 23:29:50     1
>>> 1144482 0.55500 sas64      username       Eqw   09/27/2016 23:30:40     1
>>> 1144484 0.55500 sas64      username       Eqw   09/27/2016 23:31:30     1
>>> 1144485 0.55500 sas64      username       Eqw   09/27/2016 23:32:20     1
>>> 1144486 0.55500 sas64      username       Eqw   09/27/2016 23:33:10     1
>>> 1144487 0.55500 sas64      username       Eqw   09/27/2016 23:34:00     1
>>> 1144491 0.55500 sas64      username       Eqw   09/27/2016 23:34:51     1
>>> 1144498 0.55500 sas64      username       Eqw   09/27/2016 23:35:41     1
>>> 1144499 0.55500 sas64      username       Eqw   09/27/2016 23:36:31     1
>>> 1144500 0.55500 sas64      username       Eqw   09/27/2016 23:37:21     1
>>> _______________________________________________
>>> users mailing list
>>> users@gridengine.org
>>> https://gridengine.org/mailman/listinfo/users
>>>
>>
>>



-- 
Ian Kaufman
Research Systems Administrator
UC San Diego, Jacobs School of Engineering ikaufman AT ucsd DOT edu
