Hello,

I set up a test queue to try out new prolog/epilog scripts, and I am seeing
some strange behavior when I submit a PE job to this queue: the job is
either never scheduled or stays pending for a very long time. I tried
several PEs with allocation rules of '1', '2', and '4', all to no avail.
Submitting a job without a PE makes it run immediately. I am using SGE
6.2u5.

Checking why it is not running:
$ qalter -w v 7301747
...
Job 7301747 cannot run because it exceeds limit "ilya/////" in rule "limit_slots_for_users/1"
Job 7301747 cannot run in PE "pe_1" because it only offers 0 slots
verification: no suitable queues

$ qconf -sp pe_1
pe_name            pe_1
slots              9999999
user_lists         NONE
xuser_lists        NONE
start_proc_args    startmpi.sh $pe_hostfile
stop_proc_args     stopmpi.sh $pe_hostfile
allocation_rule    1
control_slaves     TRUE
job_is_first_task  TRUE
urgency_slots      min
accounting_summary FALSE

$ qconf -srqs limit_slots_for_users
{
   name         limit_slots_for_users
   description  "limit the number of simultaneous slots any user can use"
   enabled      TRUE
   limit        users {*} to slots=800
}

And finally,
$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
7301584 0.60051 sleep      ilya        qw    04/20/2018 18:29:26                                    4
7301747 0.50051 sleep      ilya        qw    04/20/2018 18:36:23                                    1

So I am not running anything at the moment. If I submit a job with the same
PE to a production queue, it will get scheduled.
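
To double-check that I am nowhere near the 800-slot limit, here is a minimal
Python sketch that tallies slots per user and job state from the qstat
listing above. The sample text is pasted from that output with each job on
one line; the function name and the assumed column positions are my own and
are not anything provided by SGE.

```python
# Sketch: sum the slots column per (user, state) from plain qstat output.
# SLOT_LIMIT comes from the limit_slots_for_users rule shown earlier.
# slots_by_user_and_state is a hypothetical helper, not an SGE API.
from collections import defaultdict

QSTAT_SAMPLE = """\
7301584 0.60051 sleep      ilya        qw    04/20/2018 18:29:26                                    4
7301747 0.50051 sleep      ilya        qw    04/20/2018 18:36:23                                    1
"""

SLOT_LIMIT = 800  # from: limit users {*} to slots=800


def slots_by_user_and_state(qstat_text):
    """Return {(user, state): total_slots} for each job line."""
    totals = defaultdict(int)
    for line in qstat_text.splitlines():
        fields = line.split()
        # Job lines start with a numeric job ID; skip headers and separators.
        if len(fields) < 8 or not fields[0].isdigit():
            continue
        user, state = fields[3], fields[4]
        # Assumption: pending jobs have an empty queue column, so slots is
        # the 8th field; running jobs show a queue name there instead, and
        # slots moves to the 9th field.
        try:
            slots = int(fields[7])
        except ValueError:
            slots = int(fields[8])
        totals[(user, state)] += slots
    return dict(totals)


totals = slots_by_user_and_state(QSTAT_SAMPLE)
running = sum(n for (_, state), n in totals.items() if state == "r")
pending = sum(n for (_, state), n in totals.items() if state == "qw")
print(f"running slots: {running}, pending slots: {pending}, limit: {SLOT_LIMIT}")
# prints "running slots: 0, pending slots: 5, limit: 800"
```

With the two jobs shown, nothing is running and only 5 slots are requested,
so the 800-slot limit should not be what is blocking the job.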

A job that I left pending last night finally got scheduled after 7-8 hours.

The test queue is as follows:
$ qconf -sq test_gpu.q
qname                 test_gpu.q
hostlist              @gpu
seq_no                0
load_thresholds       np_load_avg=1.75
suspend_thresholds    NONE
nsuspend              1
suspend_interval      00:05:00
priority              0
min_cpu_interval      00:05:00
processors            UNDEFINED
qtype                 BATCH INTERACTIVE
ckpt_list             NONE
pe_list               make pe_1 pe_2 pe_3 pe_4 pe_slots
rerun                 TRUE
slots                 4
tmpdir                /data
shell                 /bin/sh
prolog                sgeg...@prolog.sh
epilog                sgeg...@epilog.sh
shell_start_mode      unix_behavior
starter_method        NONE
suspend_method        NONE
resume_method         NONE
terminate_method      custom_kill -p $job_pid -j $job_id
notify                00:00:60
owner_list            NONE
user_lists            system.g
xuser_lists           NONE
subordinate_list      NONE
complex_values        NONE
projects              NONE
xprojects             NONE
calendar              NONE
initial_state         default
s_rt                  INFINITY
h_rt                  INFINITY
s_cpu                 INFINITY
h_cpu                 INFINITY
s_fsize               INFINITY
h_fsize               INFINITY
s_data                INFINITY
h_data                INFINITY
s_stack               INFINITY
h_stack               INFINITY
s_core                1G
h_core                INFINITY
s_rss                 INFINITY
h_rss                 INFINITY
s_vmem                INFINITY
h_vmem                INFINITY


Any suggestions?

Thank you,
Ilya.
_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
