Good news: changing MessageTimeout to 300 seems to have resolved the issue. Apart from the SchedulerParameters and SlurmctldPort changes listed above, this is the only change I kept. I rolled back all the others.
The bad news is that my log file has thousands of messages that read "srun: WARNING: MessageTimeout is too high for effective fault-tolerance" and "sbatch: WARNING: MessageTimeout is too high for effective fault-tolerance". I will have to experiment with the value of MessageTimeout to see if I can get rid of this warning.
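
For anyone following along, the setting in question would look roughly like the fragment below in slurm.conf. This is only a sketch: the value 300 is the one reported above, while the commented-out lower value is a hypothetical starting point for experimentation, not a tested recommendation. The warning appears to be emitted once the value exceeds about 100 seconds in some Slurm releases; treat that threshold as an assumption and verify it against the version you run.

```
# slurm.conf (fragment). Only the value 300 comes from the post above.
# The documented default for MessageTimeout is 10 seconds.
MessageTimeout=300

# Hypothetical lower value to try while looking for the point where
# the "too high for effective fault-tolerance" warning stops
# (assumed threshold, not confirmed for every Slurm version):
#MessageTimeout=100
```

After editing the file and reloading the configuration, the value the controller is actually using can be checked with `scontrol show config | grep -i MessageTimeout`.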
