Good morning. I was going through the /opt/gridengine/default/spool/qmaster/messages file for additional hints and noticed that it was a little too big (~2.7 GB). I took the opportunity to just shut the qmaster down, rotate this file and empty it, and start the qmaster again. No other configuration was changed anywhere. Voila, the waiting simulations (with priority -27 and lower) started running just fine.
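For anyone who wants to script the same check, here is a minimal sketch of the "rotate and empty" step. It assumes the qmaster has already been stopped by whatever mechanism your site uses (that part is deliberately left out, since it differs between installations). The function name rotate_if_large and the size threshold are hypothetical, made up for illustration; they are not part of Grid Engine.

```shell
#!/bin/sh
# Sketch only: copy a large log file aside and empty the original.
# Assumption: the qmaster is stopped before this runs and restarted afterwards.
# rotate_if_large is a hypothetical helper, not a Grid Engine command.

# rotate_if_large FILE MAX_BYTES
# If FILE is larger than MAX_BYTES, copy it to FILE.<timestamp> and truncate
# FILE in place (so the original owner and permissions are kept).
# Prints "rotated", "skipped", or "missing".
rotate_if_large() {
    file=$1
    max_bytes=$2

    if [ ! -f "$file" ]; then
        echo "missing"
        return 1
    fi

    # Arithmetic expansion strips any padding that wc may print.
    size=$(($(wc -c < "$file")))

    if [ "$size" -gt "$max_bytes" ]; then
        stamp=$(date +%Y%m%d-%H%M%S)
        # Copy first, truncate only if the copy succeeded.
        cp -p "$file" "$file.$stamp" && : > "$file"
        echo "rotated"
    else
        echo "skipped"
    fi
}
```

With the qmaster stopped, a call such as `rotate_if_large /opt/gridengine/default/spool/qmaster/messages 1073741824` would set the file aside and empty it only if it has grown past 1 GiB. The 1 GiB threshold is an arbitrary choice for illustration, and the copy needs enough free disk space to hold the old log.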
I am running another set of such tests with priority values ranging from 0 through -100 to make sure I can reproduce the results.

Thank you all for your time and willingness to help. I appreciate it.

Best regards,
Gowtham

--
Gowtham, PhD
Director of Research Computing, IT
Research Associate Professor, ECE
Michigan Technological University

P: (906) 487-4096
F: (906) 487-2787
https://it.mtu.edu
https://hpc.mtu.edu


On Mon, Aug 6, 2018 at 7:44 PM Gowtham <g...@mtu.edu> wrote:

> Additionally, when I ran 'qstat -j', the list of pending jobs does not
> seem to contain these priority -27 or -28 jobs.
>
> *****************
> Jobs can not run because queue instance is not contained in its hard queue list
>     481017, 481018, 481020, 481021, 481022, 481024, 481214, 481251,
>     481297, 481298, 481299, 481329, 481366, 481370, 481405, 482154,
>     481406, 481407, 481408, 481409, 481410, 481411, 481412, 481413,
>     481860, 482162
>
> Jobs can not run because available slots combined under PE are not in range of job
>     481017, 481018, 481020, 481021, 481022, 481024, 481214, 481251,
>     481297, 481298, 481299, 481329, 481366, 481370, 481405, 482154,
>     481406, 481407, 481408, 481409, 481410, 481411, 481412, 481413,
>     481860, 482162
> *****************
>
> Best regards,
> Gowtham
>
>
> On Mon, Aug 6, 2018 at 7:37 PM Gowtham <g...@mtu.edu> wrote:
>
>> Thank you, Reuti.
>>
>> report_pjob_tickets was already set to TRUE in 'qconf -msconf'.
>>
>> Output of qstat -ext and qstat -pri is attached as a screenshot.
>>
>> I have verified the following a few times:
>>
>> 1. There are sufficient free slots/processors available in the queue
>> 2. There is sufficient mem_free resource available to start these
>>    waiting simulations
>> 3. When I submit a p = -26 simulation while the -27 and -28 are
>>    waiting, the -26 job (or any value closer to 0) still runs to
>>    completion successfully
>>
>> Any more thoughts/tips would be greatly appreciated.
>>
>> Best regards,
>> Gowtham
>>
>>
>> On Mon, Aug 6, 2018 at 4:50 PM Reuti <re...@staff.uni-marburg.de> wrote:
>>
>>> Hi,
>>>
>>> You can try to have a look at the extended output of `qstat`:
>>>
>>> $ qstat -ext
>>>
>>> $ qstat -pri
>>>
>>> In addition, the way the priority is honored and essentially computed
>>> is outlined here:
>>>
>>> $ man sge_priority
>>>
>>> Maybe this will shed some light on it and point to the cause of it.
>>>
>>> -- Reuti
>>>
>>> PS: You may also want to switch on the output of the computed tickets:
>>>
>>> $ qconf -ssconf
>>> …
>>> report_pjob_tickets TRUE
>>>
>>>
>>> On 06.08.2018 at 19:18, Gowtham wrote:
>>>
>>> > Greetings.
>>> >
>>> > I am using Rocks Cluster Distribution 6.1 and Grid Engine 2011.11p1.
>>> > All our simulations are submitted to the queue using the following
>>> > command format:
>>> >
>>> > qsub -p N SUBMISSION_SCRIPT.sh
>>> >
>>> > N is a negative integer ranging from -1 through -60 (we consider this
>>> > the "priority" of a research group).
>>> >
>>> > Until about a week or so ago, everything worked fine. Upon noticing
>>> > some simulations waiting in queue for longer than normal periods of
>>> > time (e.g., my own group's priority is -41), I submitted 60
>>> > simulations with priority values -1, -2, -3, ..., -60.
>>> >
>>> > I noticed that simulations with priority up to -26 ran just fine.
>>> > Those with -p value -27 and below just stay in 'qw' state. The usual
>>> > 'qstat -j SIM_ID' command does not say why a job is not running
>>> > (please see below the output for a simulation with priority -27).
>>> > Processors/slots are free and available in long.q.
>>> >
>>> > As far as I know and understand the Grid Engine documentation, -p
>>> > values range from -1023 through 1024, and users who are not
>>> > operators/admins are restricted to 0 through -1023.
>>> >
>>> > Any help in debugging/identifying the cause of this problem will be
>>> > greatly appreciated.
>>> >
>>> > ****************************************************************************************
>>> > job_number:          481703
>>> > exec_file:           job_scripts/481703
>>> > submission_time:     Mon Aug 6 12:48:07 2018
>>> > owner:               john
>>> > uid:                 38025
>>> > group:               jane-users
>>> > gid:                 506
>>> > sge_o_home:          /home/john
>>> > sge_o_log_name:      john
>>> > sge_o_path:          :/bin:/usr/bin:/usr/kerberos/bin:/share/apps/bin:/share/apps/sbin:/usr/X11R6/bin:/usr/java/latest/bin:/sbin:/usr/sbin:/usr/kerberos/sbin:/opt/gridengine/bin/lx26-amd64:/opt/gridengine/bin/linux-x64:/home/john/bin:/opt/ganglia/bin:/opt/rocks/bin:/opt/rocks/sbin
>>> > sge_o_shell:         /bin/bash
>>> > sge_o_tz:            America/Detroit
>>> > sge_o_workdir:       /misc/research/john/test_runs
>>> > sge_o_host:          login-0-2
>>> > account:             sge
>>> > cwd:                 /misc/research/john/test_runs
>>> > merge:               y
>>> > hard resource_list:  mem_free=2G
>>> > mail_list:           john@login-0-1.local
>>> > notify:              TRUE
>>> > job_name:            test_p27.sh
>>> > priority:            -27
>>> > jobshare:            0
>>> > hard_queue_list:     long.q
>>> > shell_list:          NONE:/bin/bash
>>> > env_list:
>>> > script_file:         test_p27.sh
>>> > scheduling info:     queue instance "long.q@compute-0-48.local" dropped because it is disabled
>>> >                      queue instance "long.q@compute-0-66.local" dropped because it is disabled
>>> >                      queue instance "long.q@compute-0-65.local" dropped because it is disabled
>>> >                      queue instance "long.q@compute-0-20.local" dropped because it is disabled
>>> >                      queue instance "long.q@compute-0-64.local" dropped because it is disabled
>>> >                      queue instance "repair.q@compute-0-36.local" dropped because it is disabled
>>> >                      queue instance "long.q@compute-0-63.local" dropped because it is full
>>> >                      queue instance "long.q@compute-0-50.local" dropped because it is full
>>> >                      ...
>>> >                      queue instance "long.q@compute-0-33.local" dropped because it is full
>>> >                      queue instance "long.q@compute-0-31.local" dropped because it is full
>>> >                      queue instance "long.q@compute-0-35.local" dropped because it is full
>>> >                      queue instance "long.q@compute-0-10.local" dropped because it is full
>>> >                      queue instance "long.q@compute-0-43.local" dropped because it is full
>>> >                      queue instance "short.q@compute-0-1.local" dropped because it is full
>>> >                      queue instance "short.q@compute-0-2.local" dropped because it is full
>>> >                      queue instance "short.q@compute-0-3.local" dropped because it is full
>>> >                      queue instance "short.q@compute-0-0.local" dropped because it is full
>>> >                      queue instance "medium.q@compute-0-6.local" dropped because it is full
>>> >                      queue instance "medium.q@compute-0-7.local" dropped because it is full
>>> >                      queue instance "medium.q@compute-0-5.local" dropped because it is full
>>> >                      queue instance "medium.q@compute-0-4.local" dropped because it is full
>>> > ****************************************************************************************
>>> >
>>> > Best regards,
>>> > Gowtham
>>> >
>>> > _______________________________________________
>>> > users mailing list
>>> > users@gridengine.org
>>> > https://gridengine.org/mailman/listinfo/users