Re: [gpfsug-discuss] spontaneous tracing?

2018-03-12 Thread Aaron Knister

Thanks!

I admit, I'm confused because "/usr/lpp/mmfs/bin/mmcommon 
notifyOverload" does in fact start tracing for me on one of our clusters 
(technically 2, one in dev, one in prod). It did *not* start it on 
another test cluster. It looks to me like the difference is the 
mmsdrservport settings. On clusters where it's set to 0 tracing *does* 
start. On clusters where it's set to the default of 1191 (didn't try any 
other value) tracing *does not* start. I can toggle the behavior by 
changing the value of mmsdrservport back and forth.


I do have a PMR open for this so I'll follow up there too. Thanks again 
for the help.


-Aaron

On 3/12/18 11:13 AM, IBM Spectrum Scale wrote:
/usr/lpp/mmfs/bin/mmcommon notifyOverload will not cause tracing to be 
started. You can verify this with the underlying command that it calls, 
as shown in the following example, where /tmp/n contains the names of 
the nodes that will get the notification, one per line, and the IP 
address is that of the file system manager from which the command is 
issued.


/usr/lpp/mmfs/bin/mmsdrcli notifyOverload /tmp/n 1191 192.168.117.131 3 8

The only case in which the deadlock detection code initiates tracing is 
when debugDataControl is set to "heavy" and tracing is not already 
running. Then, on deadlock detection, tracing is turned on for 20 
seconds and then turned off.

That can be tested with a command like:
/usr/lpp/mmfs/bin/mmsdrcli notifyDeadlock /tmp/n 1191 192.168.117.131 3 8

And then mmfs.log will tell you what's going on. That's not a silent action.

2018-03-12_10:16:11.243-0400: [N] sdrServ: Received deadlock notification from 192.168.117.131
2018-03-12_10:16:11.243-0400: [N] GPFS will attempt to collect debug data on this node.
2018-03-12_10:16:11.953-0400: [I] Tracing in overwrite mode  <== tracing started
Trace started: Wait 20 seconds before cut and stop trace
2018-03-12_10:16:37.147-0400: [I] Tracing disabled  <== tracing stopped 20 seconds later
mmtrace: move /tmp/mmfs/lxtrace.trc.c69bc2xn01.cpu0 /tmp/mmfs/trcfile.2018-03-12_10.16.11.2982.deadlock.c69bc2xn01.cpu0
mmtrace: formatting /tmp/mmfs/trcfile.2018-03-12_10.16.11.2982.deadlock.c69bc2xn01 to /tmp/mmfs/trcrpt.2018-03-12_10.16.11.2982.deadlock.c69bc2xn01.gz


 > What's odd is there are no log events to indicate an overload occurred.

The overload message appears in mmfs.log only when debugDataControl is 
"heavy". Starting with 4.2.3, mmdiag --deadlock also shows 
overload-related information.


# mmdiag --deadlock

=== mmdiag: deadlock ===

Effective deadlock detection threshold on c69bc2xn01 is 1800 seconds
Effective deadlock detection threshold on c69bc2xn01 is 360 seconds for short waiters

Cluster c69bc2xn01.gpfs.net is overloaded. The overload index on c69bc2xn01 is 0.01812  <==



___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss



--
Aaron Knister
NASA Center for Climate Simulation (Code 606.2)
Goddard Space Flight Center
(301) 286-2776


Re: [gpfsug-discuss] spontaneous tracing?

2018-03-12 Thread IBM Spectrum Scale


/usr/lpp/mmfs/bin/mmcommon notifyOverload will not cause tracing to be
started.  You can verify this with the underlying command that it calls,
as shown in the following example, where /tmp/n contains the names of the
nodes that will get the notification, one per line, and the IP address is
that of the file system manager from which the command is issued.

/usr/lpp/mmfs/bin/mmsdrcli notifyOverload /tmp/n 1191 192.168.117.131 3 8
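
For anyone who wants to repeat the test, the node file described above can be prepared along these lines. This is only a sketch: write_nodefile is a hypothetical helper, not a GPFS command, and the node names in the usage comment are placeholders.

```shell
# Hypothetical helper (not part of GPFS): write the given node names to a
# node file, one name per line, which is the layout described for /tmp/n.
write_nodefile() {
    nodefile=$1
    shift
    # Refuse to write an empty node file.
    [ "$#" -gt 0 ] || return 1
    printf '%s\n' "$@" > "$nodefile"
}

# Usage (placeholder node names):
#   write_nodefile /tmp/n nodeA nodeB
# The mmsdrcli command line itself is then used exactly as shown in this
# message; the meaning of its trailing arguments is not explained here.
```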

The only case in which the deadlock detection code initiates tracing is
when debugDataControl is set to "heavy" and tracing is not already running.
Then, on deadlock detection, tracing is turned on for 20 seconds and then
turned off.

That can be tested with a command like:
/usr/lpp/mmfs/bin/mmsdrcli notifyDeadlock /tmp/n 1191 192.168.117.131 3 8

And then mmfs.log will tell you what's going on.  That's not a silent
action.

2018-03-12_10:16:11.243-0400: [N] sdrServ: Received deadlock notification from 192.168.117.131
2018-03-12_10:16:11.243-0400: [N] GPFS will attempt to collect debug data on this node.
2018-03-12_10:16:11.953-0400: [I] Tracing in overwrite mode  <== tracing started
Trace started: Wait 20 seconds before cut and stop trace
2018-03-12_10:16:37.147-0400: [I] Tracing disabled  <== tracing stopped 20 seconds later
mmtrace: move /tmp/mmfs/lxtrace.trc.c69bc2xn01.cpu0 /tmp/mmfs/trcfile.2018-03-12_10.16.11.2982.deadlock.c69bc2xn01.cpu0
mmtrace: formatting /tmp/mmfs/trcfile.2018-03-12_10.16.11.2982.deadlock.c69bc2xn01 to /tmp/mmfs/trcrpt.2018-03-12_10.16.11.2982.deadlock.c69bc2xn01.gz
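
To pick these events out of a busy log, a simple filter may help. This is a minimal sketch that matches only the message texts shown in the excerpt above; trace_events is a hypothetical helper name, and the log path in the usage comment is an assumption about where mmfs.log lives on your nodes.

```shell
# Hypothetical helper: read mmfs.log content on stdin and print only the
# lines that show a deadlock notification or tracing being started/stopped.
# The patterns are the message texts from the excerpt in this message.
trace_events() {
    grep -E 'Received deadlock notification|Tracing in overwrite mode|Tracing disabled'
}

# Usage (log path is an assumption; adjust for your installation):
#   trace_events < /var/adm/ras/mmfs.log.latest
```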

> What's odd is there are no log events to indicate an overload occurred.

The overload message appears in mmfs.log only when debugDataControl is
"heavy".  Starting with 4.2.3, mmdiag --deadlock also shows
overload-related information.

# mmdiag --deadlock

=== mmdiag: deadlock ===

Effective deadlock detection threshold on c69bc2xn01 is 1800 seconds
Effective deadlock detection threshold on c69bc2xn01 is 360 seconds for short waiters

Cluster c69bc2xn01.gpfs.net is overloaded. The overload index on c69bc2xn01 is 0.01812  <==
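
If you want to watch the overload index over time, the value can be pulled out of that output. This is a sketch that assumes the line format shown in the example above, which may differ between releases; overload_index is a hypothetical helper name.

```shell
# Hypothetical helper: read "mmdiag --deadlock" output on stdin and print
# the overload index, assuming the "The overload index on <node> is <value>"
# wording shown in the example output. Prints nothing if no such line exists.
overload_index() {
    sed -n 's/.*overload index on [^ ]* is \([0-9.]*\).*/\1/p'
}

# Usage:
#   mmdiag --deadlock | overload_index
```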