[gridengine users] sge_shepherd 100% cpuload problem and running all jobs with numactl ?
Hello,

we have an aggregated cluster (ScaleMP vSMP system) with 192 cores and 2TB of RAM, and we have some trouble with a simple loop:

    for i in `seq 1 300`; do qsub simple.sh; done

Mostly it hangs after roughly 120 submitted jobs; the sge_shepherd processes all use 100% CPU and simple.sh isn't executed. How could I solve this?

The second problem, where I would need help: we need to use numactl --physcpubind for the shell scripts submitted to qsub. They need to run bound to a specific core (due to the huge size of this aggregated machine), but I don't see how I can put numactl in front of the submitted script, so that users don't need to bother with it or with working out which core is unused. Any suggestions? qsub mostly expects scripts to be submitted.

The facts:
we have SoGE 8.1.1
we use CentOS 6.2 on the system
all CPUs are Xeons with 6 cores (HT disabled) (16 nodes, 32 sockets, 192 cores) and 128GB RAM

Thanks,
Adam

--
Adam Podstawka
Leibniz-Institut DSMZ-Deutsche Sammlung von Mikroorganismen und Zellkulturen GmbH
Inhoffenstraße 7 B
38124 Braunschweig
Germany
http://www.dsmz.de
Director: Prof. Dr. Jörg Overmann
Local court: Braunschweig HRB 2570
Chairman of the supervisory board: MR Dr. Axel Kollatschny
DSMZ - A member of the Leibniz Association (WGL)
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
Re: [gridengine users] sge_shepherd 100% cpuload problem and running all jobs with numactl ?
On 24 August 2012 11:16, A. Podstawka adam.podsta...@dsmz.de wrote:
> Hello, we have an aggregated cluster (ScaleMP vSMP system) with 192 cores and 2TB of RAM and have some trouble with a simple:
> for i in `seq 1 300`; do qsub simple.sh; done
> Mostly it hangs after roughly 120 submitted jobs; the sge_shepherd processes all use 100% CPU and simple.sh isn't executed. How could I solve this?

I'd start with an strace of the shepherd to see what it is up to...

> The second problem, where I would need help: we need to use numactl --physcpubind for the shell scripts submitted to qsub. They need to run bound to a specific core (due to the huge size of this aggregated machine), but I don't see how I can put numactl in front of the submitted script, so that users don't need to bother with it or with working out which core is unused. Any suggestions? qsub mostly expects scripts to be submitted.

You could use a starter_method. As a recent version of Grid Engine, though, I think SoGE has the ability to bind cores itself, so you may not need to.
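The starter_method suggestion could be sketched roughly as follows. This is a minimal sketch under several assumptions, not a tested SoGE configuration: the script would be set as the queue's starter_method (for example via qconf -mq), jobs would be submitted with "-binding env ..." so that SGE_BINDING holds a space-separated list of processor numbers, and the helper name cpus_to_numactl_list is hypothetical. Check the queue_conf and qsub man pages of the installed version before relying on any of this.

```shell
#!/bin/sh
# Hypothetical job starter script; its install path and the queue it is
# attached to are site-specific assumptions.

# Convert a space-separated processor list (the format assumed for
# SGE_BINDING) into the comma-separated list numactl expects.
cpus_to_numactl_list() {
    printf '%s\n' "$1" | tr -s ' ' ',' | sed 's/^,//; s/,$//'
}

# Grid Engine passes the job script and its arguments as "$@".
# With no arguments (e.g. when the file is only sourced) nothing is started.
if [ "$#" -gt 0 ]; then
    # Shell requested for the job; falling back to /bin/sh is an assumption.
    job_shell="${SGE_STARTER_SHELL_PATH:-/bin/sh}"
    if [ -n "${SGE_BINDING:-}" ]; then
        cpus="$(cpus_to_numactl_list "$SGE_BINDING")"
        exec numactl --physcpubind="$cpus" "$job_shell" "$@"
    fi
    # No binding information available: start unbound rather than guess a core.
    exec "$job_shell" "$@"
fi
```

The point of this shape is that users keep submitting plain scripts with qsub, while the choice of core stays with the scheduler instead of being guessed in the wrapper.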
Re: [gridengine users] sge_shepherd 100% cpuload problem and running all jobs with numactl ?
Hi William,

On 24.08.2012 12:38, William Hay wrote:
> On 24 August 2012 11:16, A. Podstawka adam.podsta...@dsmz.de wrote:
>> Hello, we have an aggregated cluster (ScaleMP vSMP system) with 192 cores and 2TB of RAM and have some trouble with a simple:
>> for i in `seq 1 300`; do qsub simple.sh; done
>> Mostly it hangs after roughly 120 submitted jobs; the sge_shepherd processes all use 100% CPU and simple.sh isn't executed. How could I solve this?
>
> I'd start with an strace of the shepherd to see what it is up to...

OK, will try. Nice tip, I hadn't thought about strace.

>> The second problem, where I would need help: we need to use numactl --physcpubind for the shell scripts submitted to [...]
>
> You could use a starter_method.

OK, will look at it. Another idea of mine was a wrapper for qsub, so that the wrapper would then call the original qsub with an extra script containing numactl. Just an idea, but I prefer native functions, so thanks (:

> As a recent version of Grid Engine, though, I think SoGE has the ability to bind cores itself, so you may not need to.

I have tried it and the binding doesn't seem to work, but because of the problem with sge_shepherd I can't say this with 100% certainty.

Thanks,
Adam
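One way to check whether the native binding takes effect is a small test job that reports the CPUs it is allowed to run on. The sketch below is an illustration, not something from the thread: the script name check_binding.sh and the helper allowed_cpus are hypothetical, and it relies on the Linux Cpus_allowed_list field in /proc/<pid>/status.

```shell
#!/bin/sh
# Hypothetical test job (check_binding.sh): report this job's CPU affinity.

# Print the Cpus_allowed_list value from a /proc/<pid>/status style file.
# With no argument it reads /proc/self/status, i.e. the status of the awk
# process itself, which inherits its affinity from this script.
allowed_cpus() {
    awk -F':[ \t]*' '$1 == "Cpus_allowed_list" { print $2 }' "${1:-/proc/self/status}"
}

# JOB_ID is set by Grid Engine inside a job; outside a job nothing is printed.
if [ -n "${JOB_ID:-}" ]; then
    echo "job $JOB_ID on $(hostname): allowed CPUs $(allowed_cpus)"
fi
```

Submitted with something like "qsub -binding linear:1 check_binding.sh", the job's output file should list a single core if the binding is applied, and the full range of the machine if it is not.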
Re: [gridengine users] sge_shepherd 100% cpuload problem and running all jobs with numactl ?
A. Podstawka adam.podsta...@dsmz.de writes:
> Hello, we have an aggregated cluster (ScaleMP vSMP system) with 192 cores and 2TB of RAM and have some trouble with a simple:

I'm afraid I don't know ScaleMP, other than roughly what it does.

> for i in `seq 1 300`; do qsub simple.sh; done
> Mostly it hangs after roughly 120 submitted jobs

What exactly hangs? The qmaster?

> and the sge_shepherd processes are all using 100% CPU and simple.sh isn't executed. How could I solve this?

Do you mean they only do that when you have a lot of them running? William's advice is likely to be useful. (You can attach strace to a running process.) Are there any useful messages in syslog, or in the SGE messages file with the log level set to info? Do the jobs actually start? If so, what's in the trace file in the job directory under active_jobs in the spool area?

If you want to get really serious, it's possible to run the shepherd under gdb by using a suitable shepherd_cmd in the configuration and starting the execd by hand with SGE_ND=1 in the environment.

> The second problem, where I would need help: we need to use numactl --physcpubind for the shell scripts submitted to qsub. They need to run bound to a specific core (due to the huge size of this aggregated machine), but I don't see how I can put numactl in front of the submitted script, so that users don't need to bother with it or with working out which core is unused. Any suggestions? qsub mostly expects scripts to be submitted.

That's definitely not the right way to do it. You want to get the SGE core binding working. The hwloc library that SGE uses is supposed to work on ScaleMP, but if there's a problem with it (which version are you using?) the developers will be interested and will probably fix it reasonably quickly. If you have the hwloc utilities, do they work? E.g. can you do something like

    hwloc-bind core:1-2 hwloc-ps

and get sensible output?
--
Community Grid Engine: http://arc.liv.ac.uk/SGE/
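The diagnostic steps suggested in this reply (attaching strace to the busy shepherds and reading the per-job trace files) might be scripted along these lines. This is only a sketch: the function names are hypothetical, the execd spool directory is site-specific and must be passed in, and the strace commands are printed for review, not executed.

```shell
#!/bin/sh
# Hypothetical helpers for inspecting stuck shepherds.

# List the shepherd trace files under <spool>/active_jobs/<job dir>/trace.
# The spool directory is an argument because its location depends on the
# installation (local vs. shared spooling).
list_shepherd_traces() {
    for f in "$1"/active_jobs/*/trace; do
        if [ -f "$f" ]; then
            printf '%s\n' "$f"
        fi
    done
    return 0
}

# Print one strace command per PID given; nothing is attached here.
# -f follows child processes, -tt adds timestamps, -o writes to a file.
strace_attach_cmds() {
    for pid in "$@"; do
        printf 'strace -f -tt -p %s -o /tmp/shepherd.%s.strace\n' "$pid" "$pid"
    done
}
```

A possible use on the execution host would be `strace_attach_cmds $(pgrep -x sge_shepherd)` to get the commands for all running shepherds, and `list_shepherd_traces <execd spool dir>` to find the trace files to read; a shepherd spinning at 100% CPU will often show the same system call repeating in the strace output.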