Thanks!
I admit, I'm confused because "/usr/lpp/mmfs/bin/mmcommon
notifyOverload" does in fact start tracing for me on one of our clusters
(technically 2, one in dev, one in prod). It did *not* start it on
another test cluster. It looks to me like the difference is the
mmsdrservport setting. On clusters where it's set to 0, tracing *does*
start. On clusters where it's set to the default of 1191 (didn't try any
other value) tracing *does not* start. I can toggle the behavior by
changing the value of mmsdrservport back and forth.
I do have a PMR open for this so I'll follow up there too. Thanks again
for the help.
-Aaron
On 3/12/18 11:13 AM, IBM Spectrum Scale wrote:
/usr/lpp/mmfs/bin/mmcommon notifyOverload will not cause tracing to be
started. One can verify that using the underlying command being called,
as shown in the following example, where /tmp/n lists the node names
that will get the notification, one per line, and the IP address is that
of the file system manager from which the command is issued.
*/usr/lpp/mmfs/bin/mmsdrcli notifyOverload /tmp/n 1191 192.168.117.131 3 8*
The only case in which the deadlock detection code will initiate tracing
is when debugDataControl is set to "heavy" and tracing is not already
started. Then, on deadlock detection, tracing is turned on for 20 seconds
and then turned off. That can be tested using a command like
*/usr/lpp/mmfs/bin/mmsdrcli notifyDeadlock /tmp/n 1191 192.168.117.131 3 8*
And then mmfs.log will tell you what's going on. That's not a silent action.
*2018-03-12_10:16:11.243-0400: [N] sdrServ: Received deadlock
notification from 192.168.117.131*
*2018-03-12_10:16:11.243-0400: [N] GPFS will attempt to collect debug
data on this node.*
*2018-03-12_10:16:11.953-0400: [I] Tracing in overwrite mode <== tracing
started*
*Trace started: Wait 20 seconds before cut and stop trace*
*2018-03-12_10:16:37.147-0400: [I] Tracing disabled <== tracing stopped
20 seconds later*
*mmtrace: move /tmp/mmfs/lxtrace.trc.c69bc2xn01.cpu0
/tmp/mmfs/trcfile.2018-03-12_10.16.11.2982.deadlock.c69bc2xn01.cpu0*
*mmtrace: formatting
/tmp/mmfs/trcfile.2018-03-12_10.16.11.2982.deadlock.c69bc2xn01 to
/tmp/mmfs/trcrpt.2018-03-12_10.16.11.2982.deadlock.c69bc2xn01.gz*
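To check a node's log for this sequence without reading it by eye, a minimal sketch along these lines may help. The function name trace_events is made up for illustration, and the match strings are copied from the excerpt above, so other releases may word these messages differently.

```shell
#!/usr/bin/env bash
# Sketch: list tracing-related events found in a GPFS log read from stdin.
# trace_events is a hypothetical helper name, not a GPFS command.
# The patterns come from the log excerpt in this thread (assumption: the
# wording is the same on your release).
trace_events() {
  awk '
    # Print the timestamp (first field, trailing colon removed) and a label.
    function emit(label,   ts) { ts = $1; sub(/:$/, "", ts); print ts, label }
    /Received deadlock notification/ { emit("deadlock-notified") }
    /Tracing in overwrite mode/      { emit("trace-started") }
    /Tracing disabled/               { emit("trace-stopped") }
  '
}
```

Usage would be something like `trace_events < /var/adm/ras/mmfs.log.latest`, assuming the log is in its usual location on your nodes. No output means none of the three messages were found.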
> What's odd is there are no log events to indicate an overload occurred.
The overload message is only seen in mmfs.log when debugDataControl is
"heavy". mmdiag --deadlock shows overload-related information starting
with 4.2.3.
*# mmdiag --deadlock*
*=== mmdiag: deadlock ===*
*Effective deadlock detection threshold on c69bc2xn01 is 1800 seconds*
*Effective deadlock detection threshold on c69bc2xn01 is 360 seconds for
short waiters*
*Cluster c69bc2xn01.gpfs.net is overloaded. The overload index on
c69bc2xn01 is 0.01812 <==*
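If you want to collect the overload index from several nodes, a rough sketch for pulling it out of that output follows. The function name overload_index is hypothetical. It assumes the "overload index on <node> is <value>" sentence appears on a single line in the real output (it is only wrapped here by the mail client) and that the wording matches the example above.

```shell
#!/usr/bin/env bash
# Sketch: print "<node> <overload index>" from mmdiag --deadlock output on stdin.
# overload_index is a hypothetical helper name, not a GPFS command.
# Assumption: the sentence "overload index on <node> is <value>" sits on one line.
overload_index() {
  awk '
    /overload index on/ {
      # Find the words "index on <node> is <value>" and print node and value.
      for (i = 1; i < NF; i++)
        if ($i == "index" && $(i+1) == "on" && $(i+3) == "is")
          print $(i+2), $(i+4)
    }
  '
}
```

It could be used as `mmdiag --deadlock | overload_index`. It prints nothing when no overload line is present.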
___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
--
Aaron Knister
NASA Center for Climate Simulation (Code 606.2)
Goddard Space Flight Center
(301) 286-2776