Hi everyone, I have a question about process memory usage. Right now I'm using 'sched/backfill' with CR_CPU_Memory as the select type parameter. Apart from having to use the "defer" scheduler parameter because of the large number of job submissions, everything is working fine.
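For reference, the relevant scheduling lines in my slurm.conf look roughly like this (quoting from memory, so treat the exact SelectType plugin name as approximate):

    SchedulerType=sched/backfill
    SchedulerParameters=defer
    SelectType=select/cons_res
    SelectTypeParameters=CR_CPU_Memory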
I have one particular case, however, where memory usage leads to sub-optimal utilization, and I would like to hear if there is a better suggestion/approach/configuration. As you may remember from my previous messages, I'm running batches (on the order of 10-20k submissions) of bioinformatics programs; in this case the program is "merlin". In the batch I'm currently running, "normal" usage is about ~1 GB per process, but in roughly 5% of cases it spikes to 9 GB. Unfortunately, I cannot determine a priori which processes are going to need the extra memory.

I think you already see the problem. If I set a limit of ~1 GB I can maximize CPU usage, but 5% of the jobs (some taking as long as 6 hours) will be killed. If I set a 9 GB memory limit, I can load less than 30% of my current CPU capacity. At this point it is simply more worthwhile to run everything with a ~1 GB limit and re-run the killed instances. I cannot just ignore memory allocation either, since it has already happened that all of those jobs landed on a single 64-core machine that could not handle them.

I'm wondering if the GANG scheduler can help me here. I can add plenty of swap space if necessary, as long as the VM is not thrashing all the time. It would be very nice if the scheduler would simply put those processes to sleep when a threshold is hit, and re-schedule the allocation based on current memory usage. Thanks for any pointers.
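P.S. To make the numbers above concrete, each job is currently submitted more or less like this (the script name and argument are placeholders on my part; --mem is in MB):

    sbatch -n1 --mem=1024 ./run_merlin.sh <input_set>

and my (possibly wrong) understanding is that trying gang scheduling would mean something like the following in slurm.conf, with the partition name, node list and time slice below being guesses of mine rather than what I actually run:

    # suspend/resume time-slicing for jobs sharing the same resources
    PreemptMode=SUSPEND,GANG
    SchedulerTimeSlice=120
    # allow up to two jobs per resource on the partition
    # (Shared=FORCE:2 instead of OverSubscribe on older versions, I believe)
    PartitionName=batch Nodes=<nodelist> OverSubscribe=FORCE:2 Default=YES

Does that look like the right direction?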