"A. Podstawka" <[email protected]> writes:

>> I'm afraid I don't know ScaleMP, other than roughly what it does.

> it aggregates our 16 blades to one huge linux machine over the
> infiniband connection. So we see 192 cores:
Right, I know the idea, but I don't know the details of how it works.
It's possible there are issues with a large number of cores, or
something specific to the kernel-level environment.

>>> for i in `seq 1 300`; do qsub simple.sh; done
>>>
>>> mostly it hangs after round about 120 submitted jobs
>>
>> What exactly, hangs?  The qmaster?

> No, the for loop hangs for some time and then goes on with the next
> jobs.. hangs again... goes on.. till all are submitted.

I assume that's because the qmaster is working too hard.

> For 1000 jobs it took on the last try (time for i in...) 267
> minutes, i.e. 16 sec per job submit.  The jobs run and finish
> normally, but the submission takes quite a long time.

That doesn't sound good.  Do you have any unusual parameters or load?

> the strace showed for the process something like this:
> [root@scalemp ~]# strace -p 4974
> Process 4974 attached - interrupt to quit
> read(4, "w", 1) = 1
> read(4, "e", 1) = 1
> read(4, "r", 1) = 1
> read(4, " ", 1) = 1
> ..
> have attached a stracelog

[That was probably rather large for a mailing list.]  Unfortunately it
doesn't show anything very useful -- I was expecting it to be looping
somewhere after exec'ing the child process.

> but will try to raise the loglevel, just a hint where to do this for
> shepherd?

I don't think it will help for the shepherd, but there might be
something useful from the execd or the qmaster (indicating why it's
slow).

> [root@scalemp ~]# numactl --physcpubind 1,2,3,4 hwloc-ps
> 8729  L2Cache:1 L2Cache:2 L2Cache:3 L2Cache:4  hwloc-ps
> [root@scalemp ~]# hwloc-bind core:1-4 hwloc-ps
> 8730  L2Cache:1 L2Cache:2 L2Cache:3 L2Cache:4  hwloc-ps

OK, so hwloc seems to work, and the SGE binding should work.

> Seems to work, but a short try through sge (one of my first tries...)
> hasn't shown that working,

The shepherd trace file says binding hasn't been requested.  You need
to ask for it.
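To make "asking for it" concrete, a minimal sketch of requesting core
binding at submission time -- the script name simple.sh is the one
from the loop above, the linear amount (number of cores) and the
sge_request path are assumptions; check submit(1) and sge_request(5)
for your site's exact syntax and cell layout:

```
# Request one core, filled linearly, for a single job:
qsub -binding linear:1 simple.sh

# Or make it the default for all submissions by adding this line
# to the cluster-wide defaults file (path assumes the standard
# "default" cell under $SGE_ROOT):
#   $SGE_ROOT/default/common/sge_request
-binding linear:1
```

Whether the binding actually took effect can then be checked in the
shepherd trace file, or with hwloc-ps inside the job, as above.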
> mostly my problem is that I can't find any documentation on the
> binding process -

See -binding in submit(1).  Do you need more information than that?

> we need an automated binding to cores for the jobs,

Sure.  "-binding linear" most likely does what you want.  I have that
in the sge_request file, since it is a reasonable default.

I hope that helps with the binding, but I'm not sure what to suggest
for the shepherd and qmaster without more information.  I'll see if I
can think of anything, but perhaps someone else has a suggestion.

-- 
Community Grid Engine:  http://arc.liv.ac.uk/SGE/

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
