Hi all,

My apologies if this arrives twice. The first time I sent it, my
subscription to the list was not yet complete.

I am attempting to use Slurm as a test automation system, chosen for its
fairly advanced queueing and job control capabilities and because it scales
very well.
However, since our use case is a bit outside Slurm's standard usage, we are
hitting some issues that don't appear to have obvious solutions.

In our current setup, the Slurm nodes are hosts attached to a test system.
Our pipeline (greatly simplified) is to install some software on the test
system and then run sets of tests against it.
In our old pipeline, this was done in a single job. With Slurm, I was hoping
to decouple these two actions, since that makes the whole pipeline more
robust to update failures and gives us finer-grained job control for the
actual test run.

I would like users to be able to queue jobs with constraints indicating
which software version they need. Separately, an automated job would scan
the queue, spot jobs that cannot be allocated due to missing resources, and
queue software installs accordingly. We attempted to implement this with the
Active/Available Features configuration: HealthCheck and Epilog scripts
scrape the test system for software properties (version, commit, etc.) and
assign them to the node as Features. Once an install completes and the
Features are updated, queued jobs would start to be allocated on those
nodes.
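
To make that concrete, the feature-publishing fragment of our
Epilog/HealthCheck script looks roughly like the sketch below
(query_test_system is a hypothetical stand-in for our in-house tooling, not
a real command):

  #!/bin/bash
  # Scrape the attached test system for its installed software version.
  VERSION="Version-$(query_test_system --field version)"

  # Publish it as a node feature so constrained jobs can match on it.
  scontrol update NodeName="$(hostname -s)" \
      AvailableFeatures="${VERSION}" ActiveFeatures="${VERSION}"

The automated scanner would then be little more than a loop over
squeue --states=PENDING -o "%i %r %f" (job id, reason, requested features),
queueing an install wherever a requested feature is not advertised by any
node. We are also aware that, as far as we can tell, features set via
scontrol do not persist across a slurmctld restart or reconfigure unless
they are also defined in slurm.conf.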

Herein lies the conundrum. If a user submits a job constrained to run on
Version A, but all nodes in the cluster are currently configured with
Features=Version-B, Slurm will reject the job at submission, indicating an
invalid feature specification.
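
To illustrate (the script name is made up; the error is the one we see):

  $ sbatch --constraint=Version-A run_tests.sh
  sbatch: error: Batch job submission failed: Invalid feature specification

The job is never queued at all, so there is nothing for our scanner to find.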
I completely understand why Features are implemented this way, so my
question is: is there a workaround, or some other Slurm capability, that I
could use to achieve this behavior? Otherwise my options seem to be:

  1.  Go back to how we did it before, in a single job. The pipeline would
be no more robust than it was, but at least we would still be able to
leverage Slurm's other queueing capabilities.
  2.  Write our own Feature or Job Submit plugin that customizes this
behavior just for us. This seems possible, but it adds lead time and
complexity.

It's not feasible to list every branch/version/commit as AvailableFeatures
in the node configuration: our branch ecosystem is quite large, and
maintaining that list would not scale.
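
For reference, that static approach would mean maintaining something like
the following in slurm.conf, extended with a new feature string for every
branch/version/commit anyone might ever target (node names hypothetical):

  NodeName=testhost[01-16] Feature=Version-A,Version-B,Version-C,...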

Thanks,

Raj Sahae  |  Manager, Software QA
3500 Deer Creek Rd, Palo Alto, CA 94304
m. +1 (408) 230-8531  |  rsa...@tesla.com
