Re: [gpfsug-discuss] SGExceptionLogBufferFullThread waiter

2016-10-15 Thread Olaf Weiser
indeed, it is .. consider the most recent GPFS
releases .. lots of enhancements/improvements in terms of file creation
rate are included here .. cheers

Mit freundlichen Grüßen / Kind regards

Olaf Weiser

EMEA Storage Competence Center Mainz, Germany / IBM Systems, Storage
Platform
---
IBM Deutschland
IBM Allee 1
71139 Ehningen
Phone: +49-170-579-44-66
E-Mail: olaf.wei...@de.ibm.com



From:        "Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]"
To:          gpfsug main discussion list
Date:        10/15/2016 10:07 AM
Subject:     Re: [gpfsug-discuss] SGExceptionLogBufferFullThread waiter
Sent by:     gpfsug-discuss-boun...@spectrumscale.org

Understood. Thank you for your help.

By the way, I was able to figure out by poking mmpmon gfis that the job is
performing 20k a second each of inode creations, updates and deletions across
64 nodes. There's my 60k iops on the backend. While I'm impressed and not
surprised GPFS can keep up with this...that's a pretty hefty workload.

Re: [gpfsug-discuss] SGExceptionLogBufferFullThread waiter

2016-10-15 Thread Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]
Understood. Thank you for your help.

By the way, I was able to figure out by poking mmpmon gfis that the job is 
performing 20k a second each of inode creations, updates and deletions across 
64 nodes. There's my 60k iops on the backend. While I'm impressed and not 
surprised GPFS can keep up with this...that's a pretty hefty workload.
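
(A minimal sketch of how such rates can be sampled - using the documented
mmpmon fs_io_s request rather than the undocumented gfis one mentioned above;
the command-file path and the 1-second interval are just illustrative:)

  echo fs_io_s > /tmp/mmpmon.in            # request per-filesystem I/O counters
  mmpmon -p -i /tmp/mmpmon.in -r 2 -d 1000
  # the parseable output includes _oc_/_cc_ (opens/closes), _rdc_/_wc_
  # (reads/writes) and _iu_ (inode updates); per-second rates are the
  # deltas between the two samples divided by the 1000 ms delay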



From: Olaf Weiser
Sent: 10/15/16, 12:47 PM
To: gpfsug main discussion list
Subject: Re: [gpfsug-discuss] SGExceptionLogBufferFullThread waiter
well - hard to say.. 60K IO may or may not be a problem... it depends on your 
storage backends..

check the response times to the physical disk on the NSD server... concerning 
the output you provided ... check particularly
10.1.53.5  and 10.1.53.7

 if they are in the same (bad/poor) range .. then your storage back end is 
in trouble or maybe just too heavily utilized ...
 if the response times to physical disks on the NSD server are ok .. then 
maybe the network from client <--> NSD server is somehow in trouble ..
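
(A minimal sketch of one way to do that check, run on the NSD servers; the
mmdiag path assumes the default /usr/lpp/mmfs/bin install location and the
iostat interval/count are just illustrative:)

  /usr/lpp/mmfs/bin/mmdiag --iohist | head -40   # GPFS' own recent I/O history;
                                                 # healthy disks keep "time ms" low
  iostat -x 5 3                                  # OS view: watch the await /
                                                 # service-time columns for the NSD LUNs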




From:        Aaron Knister <aaron.s.knis...@nasa.gov>
To:          <gpfsug-discuss@spectrumscale.org>
Date:        10/15/2016 08:28 AM
Subject:     Re: [gpfsug-discuss] SGExceptionLogBufferFullThread waiter
Sent by:     gpfsug-discuss-boun...@spectrumscale.org




It absolutely does, thanks Olaf!

The tasks running on these nodes are running on 63 other nodes and
generating ~60K iop/s of metadata writes and I *think* about the same in
reads. Do you think that could be contributing to the higher waiter
times? I'm not sure quite what the job is up to. It's seemingly doing
very little data movement, the cpu %used is very low but the load is
rather high.

-Aaron

On 10/15/16 11:23 AM, Olaf Weiser wrote:
> from your file system configuration .. mmlsfs -L .. you'll find the
> size of the LOG
> since release 4.x you can change it, but you need to re-mount the FS
> on every client to make the change effective ...
>
> when a client initiates writes/changes to GPFS it needs to record its
> changes in the log - once the log reaches a certain fill level, GPFS
> triggers so-called logWrapThreads to write content to disk and so free
> space
>
> with your given numbers ... double digit [ms] waiter times .. your fs
> probably gets slowed down .. and there's something suspect with the
> storage, because LOG IOs are rather small and should not take that long
>
> to give you an example from a healthy environment... the IO times are so
> small, that you usually don't see waiters for this..
>
> I/O start time RW    Buf type disk:sectorNum     nSec  time ms      tag1
>      tag2           Disk UID typ      NSD node context   thread
> --------------- -- ----------- ----------------- ----- -------- ---------
> --------- ------------------ --- -------------- -------- ------------------------------
> 06:23:32.358851  W     logData    2:524306424        8    0.439
> 0         0  C0A70D08:57CF40D1 cli   192.167.20.17 LogData
> SGExceptionLogBufferFullThread
> 06:23:33.576367  W     logData    1:524257280        8    0.646
> 0         0  C0A70D08:57CF40D0 cli   192.167.20.16 LogData
> SGExceptionLogBufferFullThread
> 06:23:32.358851  W     logData    2:524306424        8    0.439
> 0         0  C0A70D08:57CF40D1 cli   192.167.20.17 LogData
> SGExceptionLogBufferFullThread
> 06:23:33.576367  W     logData    1:524257280        8    0.646
> 0         0  C0A70D08:57CF40D0 cli   192.167.20.16 LogData
> SGExceptionLogBufferFullThread
> 06:23:32.212426  W   iallocSeg    1:524490048       64    0.733
> 2       245  C0A70D08:57CF40D0 cli   192.167.20.16 Logwrap
> LogWrapHelperThread
> 06:23:32.212412  W     logWrap    2:524552192        8    0.755
> 0    179200  C0A70D08:57CF40D1 cli   192.167.20.17 Logwrap
> LogWrapHelperThread
> 06:23:32.212432  W     logWrap    2:525162760        8    0.737
> 0    125473  C0A70D08:57CF40D1 cli   192.167.20.17 Logwrap
> LogWrapHelperThread
> 06:23:32.212416  W   iallocSeg    2:524488384       64    0.763
> 2       347  C0A70D08:57CF40D1 cli   192.167.20.17 Logwrap
> LogWrapHelperThread
> 06:23:32.212414  W     logWrap    2:525266944        8    2.160
> 0    177664  C0A70D08:57CF40D1 cli   192.167.20.17 Logwrap
> LogWrapHelperThread
>
>
> hope this helps ..
>
>
> Mit freundlichen Grüßen / Kind regards
>
>
> Olaf Weiser
>
> EMEA Storage Competence Center Mainz, Germany / IBM Systems, Storage
> Platform,
> ---
> IBM Deutschland
> IBM Allee 1
> 71139 Ehningen
> Phone: +49-170-579-44-66
> E-Mail: olaf.wei...@de.ibm.com
> 

Re: [gpfsug-discuss] SGExceptionLogBufferFullThread waiter

2016-10-15 Thread Olaf Weiser
well - hard to say.. 60K IO may or may not be a problem... it depends on your
storage backends..

check the response times to the physical disk on the NSD server... concerning
the output you provided ... check particularly
10.1.53.5  and 10.1.53.7

 if they are in the same (bad/poor) range .. then your storage back end is
in trouble or maybe just too heavily utilized ...
 if the response times to physical disks on the NSD server are ok .. then
maybe the network from client <--> NSD server is somehow in trouble ..



From:        Aaron Knister <aaron.s.knis...@nasa.gov>
To:          <gpfsug-discuss@spectrumscale.org>
Date:        10/15/2016 08:28 AM
Subject:     Re: [gpfsug-discuss] SGExceptionLogBufferFullThread waiter
Sent by:     gpfsug-discuss-boun...@spectrumscale.org

It absolutely does, thanks Olaf!

The tasks running on these nodes are running on 63 other nodes and
generating ~60K iop/s of metadata writes and I *think* about the same in
reads. Do you think that could be contributing to the higher waiter
times? I'm not sure quite what the job is up to. It's seemingly doing
very little data movement, the cpu %used is very low but the load is
rather high.

-Aaron

Re: [gpfsug-discuss] SGExceptionLogBufferFullThread waiter

2016-10-15 Thread Aaron Knister

It absolutely does, thanks Olaf!

The tasks running on these nodes are running on 63 other nodes and 
generating ~60K iop/s of metadata writes and I *think* about the same in 
reads. Do you think that could be contributing to the higher waiter 
times? I'm not sure quite what the job is up to. It's seemingly doing 
very little data movement, the cpu %used is very low but the load is 
rather high.


-Aaron

On 10/15/16 11:23 AM, Olaf Weiser wrote:

from your file system configuration .. mmlsfs -L .. you'll find the
size of the LOG
since release 4.x you can change it, but you need to re-mount the FS
on every client to make the change effective ...

when a client initiates writes/changes to GPFS it needs to record its
changes in the log - once the log reaches a certain fill level, GPFS
triggers so-called logWrapThreads to write content to disk and so free
space
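
(A minimal sketch of inspecting and raising the log size; the -L option
letters for mmlsfs/mmchfs are quoted from memory, and "gpfs0" / "32M" are
placeholders, so verify against the man pages first:)

  mmlsfs gpfs0 -L      # reports the current logfile size
  mmchfs gpfs0 -L 32M  # larger recovery log; as noted above, it only takes
                       # effect once the FS is re-mounted on every client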

with your given numbers ... double digit [ms] waiter times .. your fs
probably gets slowed down .. and there's something suspect with the
storage, because LOG IOs are rather small and should not take that long

to give you an example from a healthy environment... the IO times are so
small, that you usually don't see waiters for this..

I/O start time RW    Buf type disk:sectorNum     nSec  time ms      tag1
     tag2           Disk UID typ      NSD node context   thread
--------------- -- ----------- ----------------- ----- -------- ---------
--------- ------------------ --- -------------- -------- ------------------------------
06:23:32.358851  W     logData    2:524306424        8    0.439
0         0  C0A70D08:57CF40D1 cli   192.167.20.17 LogData
SGExceptionLogBufferFullThread
06:23:33.576367  W     logData    1:524257280        8    0.646
0         0  C0A70D08:57CF40D0 cli   192.167.20.16 LogData
SGExceptionLogBufferFullThread
06:23:32.358851  W     logData    2:524306424        8    0.439
0         0  C0A70D08:57CF40D1 cli   192.167.20.17 LogData
SGExceptionLogBufferFullThread
06:23:33.576367  W     logData    1:524257280        8    0.646
0         0  C0A70D08:57CF40D0 cli   192.167.20.16 LogData
SGExceptionLogBufferFullThread
06:23:32.212426  W   iallocSeg    1:524490048       64    0.733
2       245  C0A70D08:57CF40D0 cli   192.167.20.16 Logwrap
LogWrapHelperThread
06:23:32.212412  W     logWrap    2:524552192        8    0.755
0    179200  C0A70D08:57CF40D1 cli   192.167.20.17 Logwrap
LogWrapHelperThread
06:23:32.212432  W     logWrap    2:525162760        8    0.737
0    125473  C0A70D08:57CF40D1 cli   192.167.20.17 Logwrap
LogWrapHelperThread
06:23:32.212416  W   iallocSeg    2:524488384       64    0.763
2       347  C0A70D08:57CF40D1 cli   192.167.20.17 Logwrap
LogWrapHelperThread
06:23:32.212414  W     logWrap    2:525266944        8    2.160
0    177664  C0A70D08:57CF40D1 cli   192.167.20.17 Logwrap
LogWrapHelperThread


hope this helps ..


Mit freundlichen Grüßen / Kind regards


Olaf Weiser

EMEA Storage Competence Center Mainz, Germany / IBM Systems, Storage
Platform,
---
IBM Deutschland
IBM Allee 1
71139 Ehningen
Phone: +49-170-579-44-66
E-Mail: olaf.wei...@de.ibm.com
---
IBM Deutschland GmbH / Vorsitzender des Aufsichtsrats: Martin Jetter
Geschäftsführung: Martina Koederitz (Vorsitzende), Susanne Peter,
Norbert Janzen, Dr. Christian Keller, Ivo Koerner, Markus Koerner
Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht
Stuttgart, HRB 14562 / WEEE-Reg.-Nr. DE 99369940



From:        Aaron Knister 
To:          gpfsug main discussion list 
Date:        10/15/2016 07:23 AM
Subject:     [gpfsug-discuss] SGExceptionLogBufferFullThread waiter
Sent by:     gpfsug-discuss-boun...@spectrumscale.org




I've got a node that's got some curious waiters on it (see below). Could
someone explain what the "SGExceptionLogBufferFullThread" waiter means?

Thanks!

-Aaron

=== mmdiag: waiters ===
0x7040D600 waiting 0.038822715 seconds,
SGExceptionLogBufferFullThread: on ThCond 0x7FFFDBB07628
(0x7FFFDBB07628) (parallelWaitCond), reason 'wait for parallel write'
for NSD I/O completion on node 10.1.53.5 
0x7FFFE83F3D60 waiting 0.039629116 seconds, CleanBufferThread: on ThCond
0x17B1488 (0x17B1488) (MsgRecordCondvar), reason 'RPC wait' for NSD I/O
completion on node 10.1.53.7 
0x7FFFE8373A90 waiting 0.038921480 seconds, CleanBufferThread: on ThCond
0x7FFFCD2B4E30 (0x7FFFCD2B4E30) (LogFileBufferDescriptorCondvar), reason
'force wait on force active buffer write'
0x42CD9B0 waiting 0.028227004 seconds, CleanBufferThread: on ThCond
0x7FFFCD2B4E30 (0x7FFFCD2B4E30) (LogFileBufferDescriptorCondvar), reason
'force wait for buffer write to complete'
0x7FFFE0F0EAD0