On 01.03.2013 at 12:13, Dave Love wrote:

> Reuti <[email protected]> writes:
> 
>> On 27.02.2013 at 20:56, Mikael Brandström Durling wrote:
>>> <snip>
>>>> 
>>>> In case you look deeper into the issue, it's also worth noting that there 
>>>> is no option to specify the target queue for `qrsh -inherit` when you 
>>>> get slots from different queues on the slave system:
>>>> 
>>>> https://arc.liv.ac.uk/trac/SGE/ticket/813
>>> 
>>> Ok. This could lead to incompatible changes to the -inherit behaviour if 
>>> the caller of `qrsh -inherit` has to specify the requested queue. On the 
>>> other hand, I have seen cases where an OMPI job has been allotted slots 
>>> from two different queues on an exec host, which has resulted in OMPI 
>>> launching two `qrsh -inherit` calls to the same host.
> 
> In my limited experience, you really don't want to split parallel jobs
> across queues (and you only add queues if there's something you have to
> hang off them).
> 
> I don't really understand what the complaint is here otherwise.  OMPI
> with h_vmem enforced works reasonably well for us (with a single queue).

The h_vmem limit isn't multiplied by the slot count on the slave nodes, even 
if you are getting slots from one queue only, although the correct value of 
$NSLOTS is known on the slave node:

$ qsub -pe mpich 4 -l h_vmem=256M test.sh
$ cat test.sh.o5664
pc15370 2 all.q@pc15370 UNDEFINED
pc15381 2 all.q@pc15381 UNDEFINED
Script pc15370: /tmp/5664.1.all.q 4
...
virtual memory          (kbytes, -v) 524288
...
Call pc15370: /tmp/5664.1.all.q 4
...
virtual memory          (kbytes, -v) 262144
...
Call pc15381: /tmp/5664.1.all.q 2
...
virtual memory          (kbytes, -v) 262144
...
Call pc15381: /tmp/5664.1.all.q 2
...
virtual memory          (kbytes, -v) 262144
...

It should also be 524288 on pc15381 (2 slots x 256M = 512M), at least for the 
first call.
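
For reference, the limit one would expect per host can be worked out from 
$PE_HOSTFILE. This is only a minimal sketch, assuming h_vmem is meant to scale 
with the number of slots granted on each host (summed across queues); the 
helper name expected_vmem_kb is hypothetical and not part of SGE:

```shell
#!/bin/sh
# expected_vmem_kb HOSTFILE PER_SLOT_KB
# Hypothetical helper: prints "host kbytes" for every host listed in a
# $PE_HOSTFILE-style file (columns: host, slots, queue, processor range).
expected_vmem_kb() {
    awk -v per_slot="$2" '
        NF >= 2 { slots[$1] += $2 }   # sum slots per host, across queues
        END { for (h in slots) print h, slots[h] * per_slot }
    ' "$1" | sort
}

# Inside a job one might call it as:
#   expected_vmem_kb "$PE_HOSTFILE" 262144
```

With the hostfile printed above and 262144 kbytes per slot, this gives 524288 
for both pc15370 and pc15381.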

-- Reuti

Script used:

#!/bin/sh
cat $PE_HOSTFILE
. /usr/sge/default/common/settings.sh
echo "Script $(hostname): $TMPDIR $NSLOTS"
ulimit -aH
for HOST in $(tail -n +2 $TMPDIR/machines); do
    qrsh -inherit $HOST 'echo "Call $HOST: $TMPDIR $NSLOTS"; ulimit -aH; sleep 60' &
done
wait


>> This was a bug and has since been fixed, from Open MPI 1.5.5 onward.
>> 
>> https://svn.open-mpi.org/trac/ompi/changeset/26163
>> 
>> It now always adds up all slots for a machine, even if they come from 
>> different queues.
> 
> You'll still get potential confusion from different TMPDIRs, though.  I
> never established whether there was any problem replacing the queue name
> with the cell name in TMPDIR construction, but I have a patch lying
> around to do it.
> 
>>> I'll think of this and add it as a comment to the ticket. Is that
>>> trac instance at arc.liv.ac.uk the best place, even though we are
>>> running OGS? I suppose so?
> 
> I'd be happy to have reports that might improve SGE (if I or someone
> else understands the issue), but I'm afraid I've been flamed for trying
> to help OGS users.
> 
> -- 
> Community Grid Engine:  http://arc.liv.ac.uk/SGE/
> 


