On 24 August 2012 11:16, A. Podstawka <[email protected]> wrote:
> Hello,
>
> we have an aggregated cluster (ScaleMP vSMP System) with 192 cores and
> 2TB of RAM and have some trouble with a simple:
>
> for i in `seq 1 300`; do qsub simple.sh; done
>
> Mostly it hangs after roughly 120 submitted jobs, the
> sge_shepherds are all at 100% CPU load and simple.sh isn't
> executed. How can I solve this?

One possibility is that sge_execd or the shepherd is running out of some
resource and throwing a hissy fit as a result. Try raising the limits you
can (via setrlimit/ulimit) to unlimited before launching sge_execd and see
if that makes a difference. If it does, you should probably then lower
them again to the lowest value that works plus a fudge factor. I'd be
suspicious of the number of file handles myself.
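Something along these lines might do for that experiment. It's only a sketch, assuming a Linux host with /proc and bash; the sge_execd path at the end is a placeholder for whatever your startup script actually runs. It first shows the limits a running sge_execd really has, then raises the open-file limit in the current shell so a daemon started from it inherits the new value.

```shell
#!/bin/bash
# Sketch only: inspect and raise the open-file limit before starting sge_execd.
# Assumes Linux (/proc/<pid>/limits) and bash.

# 1. Show the limits an already running sge_execd actually has, if any.
for pid in $(pgrep -x sge_execd 2>/dev/null); do
    echo "sge_execd pid $pid:"
    grep -E 'Max open files|Max processes' "/proc/$pid/limits"
done

# 2. Raise the open-file limit in this shell; anything started from it
#    (sge_execd, and in turn its shepherds) inherits the new value.
#    Linux refuses "unlimited" for open files, so fall back to the hard limit.
#    In bash, setting a limit without -S/-H sets both soft and hard.
ulimit -n unlimited 2>/dev/null || ulimit -n "$(ulimit -Hn)"
echo "open files (soft/hard): $(ulimit -Sn)/$(ulimit -Hn)"

# 3. Start the daemon from this same shell so it picks up the raised limit.
#    The path below is a placeholder; use your site's usual startup command.
# "$SGE_ROOT/bin/lx-amd64/sge_execd"
```

As root you can also go above the current hard limit by giving an explicit number instead (e.g. `ulimit -n 65536`), which is also the form you'd use when trimming it back down once you know what works.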
William

> The second problem we have, where I would need help:
> we need to use "numactl --physcpubind" for the shell scripts submitted to
> qsub; they need to run bound to a specific core (due to the huge size of
> this aggregated machine), but I don't see how I can put numactl
> in front of the script submitted with qsub, so that users don't need to
> bother with it or with which core is free. Any suggestions? What gets
> submitted to qsub is mostly scripts.
>
> The facts:
> we have SoGE 8.1.1
> we use CentOS 6.2 on the system
> all CPUs are Xeons with 6 cores (HT disabled) (16 nodes, 32 sockets, 192
> cores) and 128GB RAM per node
>
> thanks
> Adam
>
> --
> Adam Podstawka
> Leibniz-Institut DSMZ-Deutsche Sammlung von Mikroorganismen
> und Zellkulturen GmbH
> Inhoffenstraße 7 B
> 38124 Braunschweig
> Germany
> http://www.dsmz.de
>
> Director: Prof. Dr. Jörg Overmann
> Local court: Braunschweig HRB 2570
> Chairman of the supervisory board: MR Dr. Axel Kollatschny
>
> DSMZ - A member of the Leibniz Association (WGL)
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
