I've sent a patch that fixes a bug related to the use of this switch option.
The patch is for slurm-2.4.0, but you can probably apply it to 2.3.2 as well.

As I said when the patch was submitted, there are two problems:

1) a job submitted with the switch option could stay pending forever
2) even when there are enough nodes available to serve the request,
including the switch requirement, the job could end up being
allocated across more switches than requested.

If the problem persists after applying the patch, I can take a look at
the code for your case.
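
As a reminder of what the option does, here is a minimal usage sketch,
using the same syntax that appears later in this thread (the node count,
wait time and program name below are only illustrative):

    # Ask for an allocation spread over at most 2 leaf switches, waiting
    # up to 60 minutes for such a layout before accepting whatever
    # distribution is available.
    srun --switch=2@60 -N 16 ./my_program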

On 01/19/2012 03:48 AM, Alex Besogonov wrote:
> Hm, I've tried it. Doesn't seem to work, unfortunately. For example,
> if I start one node and run
> "srun --switch=2@1 sleep 30" and "srun --switch=2@1 ls", then the
> second command actually
> causes two extra nodes to be started.
>
> On Wed, Jan 18, 2012 at 5:55 PM, Alejandro Lucero Palau
> <alejandro.luc...@bsc.es> wrote:
>   
>> This behaviour is similar to the --switch=2@60 option, which requests up to 2
>> switches for distributing the job and waits up to 60 minutes before
>> accepting whatever distribution is available. This is done in the
>> select/cons_res plugin, so I guess something similar could be done for
>> this cloud requirement. Probably just a new field or flag is needed to
>> check for it.
>>
>> On 01/18/2012 12:39 PM, Alex Besogonov wrote:
>>     
>>> I'm working on Amazon EC2 integration with Slurm. I've found several
>>> issues (like the inability to work with CLOUD nodes without DNS names),
>>> but they look fairly easy to fix. CLOUD mode with suspend/restore works
>>> OK too.
>>>
>>> However, I have another question - is it possible to somehow make
>>> Slurm work in a 'reluctant' mode? Let me explain: nodes on Amazon EC2
>>> are billed in one-hour increments. So if I start 10 "srun sleep 10"
>>> jobs, SLURM is going to resume 10 nodes, causing me to be billed for 20
>>> hours of CPU time even though all the jobs could be completed on a
>>> single host in the time it takes to start all the EC2 nodes.
>>>
>>> I've tried to play with ResumeRate but it simply doesn't work well enough.
>>>
>>> So I'm thinking about a scheduler that will work in conjunction with
>>> the backfill scheduler. It'll wait until there's at least one task in
>>> the queue that has been awaiting execution for more than N seconds
>>> before it starts resuming new nodes.
>>>
>>> Is it feasible or is there a better way to do it?
>>>
>>>       
>>
>   
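
Regarding the node-resume behaviour discussed in the quoted messages, the
relevant knobs live in slurm.conf. A rough sketch of that section (the
paths, node names and values below are invented, just to show which
parameters interact with ResumeRate):

    # Power-saving / cloud section of slurm.conf (illustrative values only)
    SuspendTime=300                                 # seconds idle before a node is suspended
    SuspendProgram=/usr/local/sbin/ec2_suspend.sh   # stops the EC2 instance
    ResumeProgram=/usr/local/sbin/ec2_resume.sh     # boots the EC2 instance
    ResumeRate=2                                    # nodes resumed per minute
    ResumeTimeout=600                               # seconds allowed for a resumed node to respond
    NodeName=cloud[01-10] State=CLOUD CPUs=2

None of this changes the one-hour billing issue by itself; it only shows
where the resume rate limiting mentioned above is configured.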

