Good news: changing MessageTimeout to 300 seems to have resolved the
issue. This is the only change I kept, in addition to the
SchedulerParameters and SlurmctldPorts changes listed above. I rolled back
all the other changes.
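
For anyone following along, the relevant slurm.conf line would look roughly
like the sketch below. It assumes the value is in seconds and that the
documented default is 10; the other two settings are as listed above and are
left out here.

```
# slurm.conf (fragment, illustrative only)
# MessageTimeout: time allowed for a round-trip communication, in seconds.
# The default is 10; 300 is the value that made the problem go away here.
MessageTimeout=300
```

After editing the file, "scontrol show config | grep MessageTimeout" should
report the value the controller is actually using.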

The bad news is that my log file has thousands of messages that read
srun: WARNING: MessageTimeout is too high for effective fault-tolerance
and
sbatch: WARNING: MessageTimeout is too high for effective fault-tolerance

I will have to experiment with lower values of MessageTimeout to see if I
can get rid of this warning without bringing the original problem back.
