Hey Paul,

Unless you have a very busy cluster (hundreds of jobs per second) or
are running very large jobs (>2000 nodes), I don't think this will be
very useful. That said, I would expect
MsgAggregationParams=WindowMsgs=10,WindowTime=10 to be closer to what
you want; WindowTime=100 may be too long a wait. I am surprised at
the threading of your slurmctld, though. I would expect it to have
much less threading. Be sure to restart all your slurmds as well as
the slurmctld when you change the parameter. Try again and see whether
lowering WindowTime improves your situation.
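
For reference, the suggested setting would appear in slurm.conf roughly
as sketched below. The comments are my reading of the slurm.conf man
page for 15.08 (WindowTime in milliseconds) and should be checked
against the documentation for your installed version.

```
# slurm.conf fragment (sketch; keep the file consistent across the
# controller and all compute nodes)
# WindowMsgs: maximum number of messages collected before forwarding
# WindowTime: maximum time to collect messages (assumed milliseconds)
MsgAggregationParams=WindowMsgs=10,WindowTime=10
```
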
Danny
On 12/08/15 10:25, Paul Edmon wrote:
We recently upgraded to 15.08.4 from 14.11 and we wanted to try out
the MsgAggregation to see if that would improve cluster throughput and
responsiveness. However, when we turned it on with the settings
WindowMsgs=10 and WindowTime=100, everything slowed to a crawl and it
looked like the slurmctld was threading like crazy. When we turned it
off, everything returned to normal. Does anyone have any suggestions
or guidelines for what to set MsgAggregationParams to? I'm guessing
it depends on the size of the cluster: we have the same settings on
our test cluster, but it has about a tenth as many nodes as our main
one, so I suspect this is a scaling problem.
Thoughts? Anyone else using MsgAggregation?
-Paul Edmon-