On 27.02.2013 at 20:56, Mikael Brandström Durling wrote:
> <snip>
>>
>> In case you look deeper into the issue, it's also worth noting that there
>> is no option to specify the target queue for `qrsh -inherit` in case you get
>> slots from different queues on the slave system:
>>
>> https://arc.liv.ac.uk/trac/SGE/ticket/813
>
> Ok. This could lead to incompatible changes to the -inherit behaviour, if the
> caller of `qrsh -inherit` has to specify the requested queue. On the other
> hand, I have seen cases where an OMPI job was allotted slots from two
> different queues on an exec host, which resulted in OMPI launching two
> `qrsh -inherit` calls to the same host.
This was a bug and is fixed as of Open MPI 1.5.5:

https://svn.open-mpi.org/trac/ompi/changeset/26163

It now always adds up all slots for a machine, even if they come from different queues. Please let me know if you still see this behavior.

>> Maybe it's related to $NSLOTS. If you get slots from one and the same
>> queue it indeed seems to be correct for the slave nodes. But for a local
>> `qrsh -inherit` on the master node of the serial job it looks like it is set
>> to the overall slot count instead.
>
> I noted that too. I will see if I get some spare time to hunt down this
> track. It seems that an ideal solution could be that $NSLOTS is set to the
> allotted number of slots for the current job (i.e. correct the number in the
> master job), and that `qrsh -inherit` could take an argument of the
> 'queue@host' type.
>
> I'll think about this and add it as a comment to the ticket. Is that Trac
> instance at arc.liv.ac.uk the best place, even though we are running OGS? I
> suppose so?

It's at least the place where I put all the issues I found. OGS has its own bug tracker though, maybe you could enter it in both places.

-- Reuti

> Mikael
>
>>
>> -- Reuti
>>
>>
>>> Mikael
>>>
>>>
>>> On 26 Feb 2013 at 21:32, Reuti <[email protected]> wrote:
>>>
>>>> On 26.02.2013 at 19:45, Mikael Brandström Durling wrote:
>>>>
>>>>> I have recently been trying to run Open MPI jobs spanning several nodes on
>>>>> our small cluster. However, it seems to me that sub-jobs launched with
>>>>> `qrsh -inherit` (by Open MPI) get killed at a memory limit of h_vmem,
>>>>> instead of h_vmem times the number of slots allocated to the sub-node.
>>>>
>>>> Unfortunately this is correct:
>>>>
>>>> https://arc.liv.ac.uk/trac/SGE/ticket/197
>>>>
>>>> The only way around it: use virtual_free instead and hope that the users
>>>> comply with this estimated value.
>>>>
>>>> -- Reuti
>>>>
>>>>
>>>>> Is there any way to get the correct allocation to the sub nodes?
>>>>> I have some vague memory that I have read something about this. As it
>>>>> behaves now, it is impossible for us to run large-memory MPI jobs. Would
>>>>> making h_vmem a per-job consumable, rather than a per-slot one, give any
>>>>> other behaviour?
>>>>>
>>>>> We are using OGS GE2011.11.
>>>>>
>>>>> Thanks for any hints on this issue,
>>>>>
>>>>> Mikael
>>>>
>>>
>>
>

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
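The Open MPI 1.5.5 fix Reuti mentions (adding up all slots for a machine even when they come from different queues) can be sketched roughly as follows. The real fix lives in Open MPI's C code; this is only a Python illustration of the aggregation over PE_HOSTFILE-style entries, and the hostnames and queue names in the sample data are made up:

```python
# Sketch: aggregate granted slots per host across queues, as a
# PE_HOSTFILE lists them. Each line has the form:
#     hostname slots queue processor_range
# A host appearing twice (two queues) should yield ONE total, so
# only one `qrsh -inherit` per host is needed.
from collections import defaultdict

def slots_per_host(pe_hostfile_lines):
    """Sum slot counts per host, ignoring which queue granted them."""
    totals = defaultdict(int)
    for line in pe_hostfile_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        host, slots = fields[0], int(fields[1])
        totals[host] += slots
    return dict(totals)

# node01 was granted slots from two queues; after aggregation it is
# treated as a single host with 4 slots.
example = [
    "node01 2 all.q@node01 UNDEFINED",
    "node01 2 long.q@node01 UNDEFINED",
    "node02 4 all.q@node02 UNDEFINED",
]
print(slots_per_host(example))  # {'node01': 4, 'node02': 4}
```

Before the fix, each PE_HOSTFILE line was treated separately, which is how two `qrsh -inherit` calls to the same host could arise.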
