Re: [gridengine users] Monitoring slot usage
Hi Simon:

We use 'Core Binding' to restrict users to the same number of cores as slots requested.

http://www.gridengine.eu/grid-engine-internals/87-exploiting-the-grid-engine-core-binding-feature

We use a jsv to assign the binding value (force compliance) based on the other job inputs: single slot and MPI jobs are bound to 1 core (for each slot requested), OpenMP jobs are bound to the number of slots requested in the pe option.

Or you might be able to just put '-binding linear:1' in $SGE_ROOT/default/common/sge_request, and then have users specify '-binding linear:#' if they're doing a SMP job. Test carefully! :)

-Hugh

From: users-boun...@gridengine.org [mailto:users-boun...@gridengine.org] On Behalf Of Simon Andrews
Sent: Thursday, July 30, 2015 11:01 AM
To: users@gridengine.org
Subject: [gridengine users] Monitoring slot usage

What is the recommended way of identifying jobs which are consuming more CPU than they've requested? I have an environment set up where people mostly submit SMP jobs through a parallel environment and we can use this information to schedule them appropriately. We've had several cases though where the jobs have used significantly more cores on the machine they're assigned to than they requested, so the nodes become overloaded and go into an alarm state.

What options do I have for monitoring the number of cores simultaneously used by a job and comparing this to the number which were requested so I can find cases where the actual usage is way above the request and kill them?

Thanks
Simon.

The Babraham Institute, Babraham Research Campus, Cambridge CB22 3AT Registered Charity No. 1053902. The information transmitted in this email is directed only to the addressee. If you received this in error, please contact the sender and delete this email from your system. The contents of this e-mail are the views of the sender and do not necessarily represent the views of the Babraham Institute. Full conditions at: www.babraham.ac.uk <http://www.babraham.ac.uk/terms>

___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
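The binding rule Hugh describes (one core for serial and MPI jobs, as many cores as slots for OpenMP jobs) can be sketched as a small shell function. This is only an illustration of the decision itself, not his site's JSV: the function name `binding_for_job` and the PE name `openmp` are hypothetical, and wiring the result into a real JSV is left out.

```shell
#!/bin/sh
# Sketch of the binding decision described above.
# Assumptions: the function name and the PE name "openmp" are made up
# for illustration; a real site would match its own PE names.
binding_for_job() {
    pe_name=$1   # empty string for a serial (single slot) job
    slots=$2     # slots requested with -pe; ignored unless the job is OpenMP
    case "$pe_name" in
        openmp) echo "linear:$slots" ;;  # threaded job: one core per requested slot
        *)      echo "linear:1" ;;       # serial or MPI job: one core per task
    esac
}

binding_for_job "" 1       # prints linear:1
binding_for_job openmp 8   # prints linear:8
binding_for_job mpi 16     # prints linear:1
```

The printed value is what would be passed to `-binding`, for example `-binding linear:8` for an 8-slot OpenMP job.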
Re: [gridengine users] Filling up nodes when using gepetools
On Thu, 30 Jul 2015 12:57:13 + Winkler, Ursula (ursula.wink...@uni-graz.at) wrote:

> > My suggestion was to modify your jsv/gepetools to force single node parallel jobs into PEs with $pe_slots allocation rules (which gives you control over where they are scheduled via queue_sort_method and load_formula) while sending the others to PEs with other (appropriate) allocation rules that won't cause (ii).
>
> Well, I created an additional PE with allocation_rule $pe_slots, and built an if condition into pe.jsv so that all jobs which request just a single node are assigned to this new PE. But the annoying situation didn't change. The scheduler configuration is set to queue_sort_method load and load_formula slots. So what am I still missing?

How is job_load_adjustments configured?
Re: [gridengine users] Monitoring slot usage
I have a similar issue too, especially when users run MPI+multithreaded jobs. Some multithreaded programs by default use all of the cores they find on a node.

For now I have a script that scans the CPU and RAM usage on all nodes and warns me if it finds any overloaded nodes.

I'm not sure whether SGE has a built-in ability to track the CPU cores each job uses, but it should not be difficult to prepare a script that does that routinely outside of SGE.

On Thu, Jul 30, 2015 at 11:00 AM, Simon Andrews simon.andr...@babraham.ac.uk wrote:

> What is the recommended way of identifying jobs which are consuming more CPU than they’ve requested? I have an environment set up where people mostly submit SMP jobs through a parallel environment and we can use this information to schedule them appropriately. We’ve had several cases though where the jobs have used significantly more cores on the machine they’re assigned to than they requested, so the nodes become overloaded and go into an alarm state.
>
> What options do I have for monitoring the number of cores simultaneously used by a job and comparing this to the number which were requested so I can find cases where the actual usage is way above the request and kill them?
>
> Thanks
> Simon.

--
Best,
Feng
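The comparison such an external script has to make can be sketched as follows. This is not Feng's script: the function name `over_request` and the default tolerance of 1.5 are assumptions, and collecting the summed %CPU of one job's processes (for instance from `ps -o pcpu=`) is not shown.

```shell
#!/bin/sh
# Sketch: flag a job whose CPU use exceeds its slot request.
# Assumptions: the caller has already summed the %CPU of all processes
# belonging to the job; the function name and the 1.5 default margin
# are illustrative only.
over_request() {
    pcpu_sum=$1        # summed %CPU of the job's processes (100 = one full core)
    slots=$2           # slots the job requested
    margin=${3:-1.5}   # tolerated factor above the request
    awk -v p="$pcpu_sum" -v s="$slots" -v m="$margin" \
        'BEGIN { if (p / 100 > s * m) exit 0; exit 1 }'
}

# 7.9 cores busy against a 2-slot request: flagged.
if over_request 790 2; then echo "over request"; else echo "ok"; fi
# 2.1 cores busy against a 2-slot request: within the margin.
if over_request 210 2; then echo "over request"; else echo "ok"; fi
```

A margin above 1 avoids flagging jobs for short bursts from helper processes; what value is sensible depends on the site.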
Re: [gridengine users] Filling up nodes when using gepetools
> Sorry to step in the discussion: `qstat -j ...` shows the requested one, the granted one is in `qstat -r`.
>
> $ qsub -pe * 2 test.sh
> Your job 44329 (test.sh) has been submitted
> $ qstat -j 44329
> ...
> parallel environment: * range: 2
> ...

My jobs:

qstat -j ...
...
parallel environment: gepetools_1host range 2
...

That's the PE I created for that purpose. So qstat -j shows the right info.

> $ qstat -r
> ...
> Requested PE: * 2
> Granted PE: make 2

qstat -r
...
Requested PE: gepetools_1host 2
Granted PE: gepetools_1host 2
Re: [gridengine users] Filling up nodes when using gepetools
> > Well, I created an additional PE with allocation_rule $pe_slots, and built an if condition into pe.jsv so that all jobs which request just a single node are assigned to this new PE. But the annoying situation didn't change. The scheduler configuration is set to queue_sort_method load and load_formula slots. So what am I still missing?
>
> I believe it should be a load_formula of -slots so the more slots are available (fewest used) the lower the load and the more attractive the node. The page Reuti pointed to manages to write this both ways around.

Setting load_formula to -slots doesn't change anything: every job still starts on a separate host (but in this case that should be the correct behaviour, if I don't misinterpret the instructions on the web page Reuti mentioned). I must be missing something else, and something pretty basic...
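The sign question discussed above can be made concrete with a toy ordering. The sketch assumes two things that the thread does not confirm in detail: that `slots` in load_formula evaluates to the free slots on a host, and that with queue_sort_method load the scheduler prefers the host with the lowest formula value. The host names and free-slot counts are invented.

```shell
#!/bin/sh
# Toy model of host ordering under two load formulas.
# Assumptions: "slots" means free slots per host and the lowest formula
# value wins; the hosts and numbers below are made up.
hosts='node1 10
node2 12
node3 4'

# load_formula slots: the host with the fewest free slots sorts first (fill up).
fill_up=$(printf '%s\n' "$hosts" | sort -k2,2n | head -n 1 | cut -d' ' -f1)

# load_formula -slots: negating the value puts the emptiest host first (spread out).
least_used=$(printf '%s\n' "$hosts" | awk '{ print $1, -$2 }' \
    | sort -k2,2n | head -n 1 | cut -d' ' -f1)

echo "slots  -> $fill_up"     # prints: slots  -> node3
echo "-slots -> $least_used"  # prints: -slots -> node2
```

Under those assumptions, jobs starting on separate hosts is what -slots would be expected to produce, which matches the observation in the message above. Real scheduling also depends on other settings such as job_load_adjustments.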
Re: [gridengine users] Filling up nodes when using gepetools
On Thu, 30 Jul 2015 06:12:52 + Winkler, Ursula (ursula.wink...@uni-graz.at) wrote:

> -----Original Message-----
> From: Reuti [mailto:re...@staff.uni-marburg.de]
> Sent: Wednesday, 29 July 2015 15:10
> To: Winkler, Ursula (ursula.wink...@uni-graz.at)
> Cc: users@gridengine.org
> Subject: Re: [gridengine users] Filling up nodes when using gepetools
>
> > Hi,
> >
> > On 29.07.2015 at 12:50, Winkler, Ursula (ursula.wink...@uni-graz.at) wrote:
> >
> > > Node1 has 12 cores/slots and 1 MPI job with 2 slots is running on it. A user submits job2, which requires at most 10 slots. Independently of the schedule_interval, job_load_adjustments, load_formula and/or load_adjustment_decay_time parameter settings, job2 usually won't start on Node1 if
> >
> > What about queue_sort_method?
>
> Doesn't work either.
>
> > As long as the requested PE has $pe_slots as allocation_rule, it should be possible to use a fill up configuration:
> >
> > https://blogs.oracle.com/sgrell/entry/grid_engine_scheduler_hacks_least
>
> Thank you for the link, I didn't know about the $pe_slots part. But unfortunately it still doesn't work, maybe because of the gepetools sub-PEs. Setting $pe_slots there too has the effect that jobs don't start anymore.
>
> Ursula

$pe_slots restricts you to a single node, so I'm guessing the jobs that don't start are jobs that need more than one node. While we don't use gepetools we do have a JSV that rewrites people's requested PE based on the number

What you need, I think, is something that routes jobs that request 1 node to PEs with a $pe_slots allocation rule, while other jobs are routed to PEs with an allocation rule equal to the requested ppn. In all cases the number of slots to request should be nodes*ppn.
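The routing rule in the last paragraph can be sketched as a shell function. It only shows the decision, not a working JSV: the function name `route_pe` and the PE names `smp_1host` and `mpi_ppnN` are hypothetical, and each such PE would have to exist with the matching allocation_rule.

```shell
#!/bin/sh
# Sketch of the PE routing rule: single-node jobs go to a $pe_slots PE,
# multi-node jobs go to a PE whose allocation rule equals ppn.
# Assumptions: function and PE names are invented for illustration.
route_pe() {
    nodes=$1
    ppn=$2
    slots=$((nodes * ppn))          # always request nodes*ppn slots
    if [ "$nodes" -eq 1 ]; then
        echo "smp_1host $slots"     # hypothetical PE with allocation_rule $pe_slots
    else
        echo "mpi_ppn${ppn} $slots" # hypothetical PE with allocation_rule = ppn
    fi
}

route_pe 1 10   # prints smp_1host 10
route_pe 4 12   # prints mpi_ppn12 48
```

The two printed fields correspond to the PE name and slot count that would end up in the job's `-pe` request.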
Re: [gridengine users] Filling up nodes when using gepetools
> I believe it should be a load_formula of -slots so the more slots are available (fewest used) the lower the load and the more attractive the node. The page Reuti pointed to manages to write this both ways around.

I'll try it out tomorrow - I'm not at the office now and it's a little bit difficult from here.
Re: [gridengine users] Filling up nodes when using gepetools
> On Thu, 30 Jul 2015 12:57:13 + Winkler, Ursula (ursula.wink...@uni-graz.at) wrote:
>
> > > My suggestion was to modify your jsv/gepetools to force single node parallel jobs into PEs with $pe_slots allocation rules (which gives you control over where they are scheduled via queue_sort_method and load_formula) while sending the others to PEs with other (appropriate) allocation rules that won't cause (ii).
> >
> > Well, I created an additional PE with allocation_rule $pe_slots, and built an if condition into pe.jsv so that all jobs which request just a single node are assigned to this new PE. But the annoying situation didn't change. The scheduler configuration is set to queue_sort_method load and load_formula slots. So what am I still missing?
>
> Ignore previous message. Me getting it back to front I think. That looks correct (I think). Have you checked the jobs show the right granted PE with qstat -j?

Yes, of course.
Re: [gridengine users] Monitoring slot usage
Thanks, core binding looks like it does what we need. Do I understand correctly that if a process spawns more threads than it has slots, those threads will simply be restricted to the core it’s been allocated, so they’ll just end up slowing down their own job, and that the job won’t actually get killed?

I’ll be very careful in testing this :-)

Simon.

From: MacMullan, Hugh hugh...@wharton.upenn.edu
Date: Thursday, 30 July 2015 16:20
To: Simon Andrews simon.andr...@babraham.ac.uk, users@gridengine.org
Subject: RE: Monitoring slot usage

Hi Simon:

We use 'Core Binding' to restrict users to the same number of cores as slots requested.

http://www.gridengine.eu/grid-engine-internals/87-exploiting-the-grid-engine-core-binding-feature

We use a jsv to assign the binding value (force compliance) based on the other job inputs: single slot and MPI jobs are bound to 1 core (for each slot requested), OpenMP jobs are bound to the number of slots requested in the pe option.

Or you might be able to just put '-binding linear:1' in $SGE_ROOT/default/common/sge_request, and then have users specify '-binding linear:#' if they're doing a SMP job. Test carefully! :)

-Hugh
Re: [gridengine users] Filling up nodes when using gepetools
On 30.07.2015 at 18:14, Winkler, Ursula (ursula.wink...@uni-graz.at) wrote:

> > On Thu, 30 Jul 2015 12:57:13 + Winkler, Ursula (ursula.wink...@uni-graz.at) wrote:
> >
> > > > My suggestion was to modify your jsv/gepetools to force single node parallel jobs into PEs with $pe_slots allocation rules (which gives you control over where they are scheduled via queue_sort_method and load_formula) while sending the others to PEs with other (appropriate) allocation rules that won't cause (ii).
> > >
> > > Well, I created an additional PE with allocation_rule $pe_slots, and built an if condition into pe.jsv so that all jobs which request just a single node are assigned to this new PE. But the annoying situation didn't change. The scheduler configuration is set to queue_sort_method load and load_formula slots. So what am I still missing?
> >
> > Ignore previous message. Me getting it back to front I think. That looks correct (I think). Have you checked the jobs show the right granted PE with qstat -j?
>
> Yes, of course.

Sorry to step in the discussion: `qstat -j ...` shows the requested one, the granted one is in `qstat -r`.

$ qsub -pe * 2 test.sh
Your job 44329 (test.sh) has been submitted
$ qstat -j 44329
...
parallel environment: * range: 2
...
$ qstat -r
...
Requested PE: * 2
Granted PE: make 2

-- Reuti
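To check the granted PE in a script rather than by eye, the "Granted PE:" line from `qstat -r` can be picked out with awk. This is a minimal sketch run against the sample lines quoted above rather than a live `qstat`; the helper name `granted_pe` is hypothetical, and the exact column layout of real output may differ between Grid Engine versions.

```shell
#!/bin/sh
# Sketch: extract the granted PE name from qstat -r style output.
# Assumptions: the helper name is invented, and the line has the form
# "Granted PE: <name> <slots>" as in the sample above.
granted_pe() {
    awk '/Granted PE:/ { print $3; exit }'
}

# Sample text copied from the message above, not live qstat output.
sample='Requested PE: * 2
Granted PE: make 2'

printf '%s\n' "$sample" | granted_pe   # prints make
```

On a live system the function would read from `qstat -r` instead of the sample text.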
Re: [gridengine users] Filling up nodes when using gepetools
On 30.07.2015 at 18:29, Reuti re...@staff.uni-marburg.de wrote:

> On 30.07.2015 at 18:14, Winkler, Ursula (ursula.wink...@uni-graz.at) wrote:
>
> > > Ignore previous message. Me getting it back to front I think. That looks correct (I think). Have you checked the jobs show the right granted PE with qstat -j?
> >
> > Yes, of course.
>
> Sorry to step in the discussion: `qstat -j ...` shows the requested one, the granted one is in `qstat -r`.
>
> $ qsub -pe * 2 test.sh
> Your job 44329 (test.sh) has been submitted
> $ qstat -j 44329
> ...
> parallel environment: * range: 2
> ...
> $ qstat -r
> ...
> Requested PE: * 2
> Granted PE: make 2
>
> -- Reuti

At the moment I can't say whether I checked it with qstat -j, but I did check it. When I'm in the office again I probably still have the output in some screen window, so I can tell you exactly.

And I did do a test: I temporarily removed the PE from the queue, with the result that the jobs could not start anymore (as expected).