Re: [PATCH] apparmor: Fix network performance issue in aa_label_sk_perm
On 09/07/2018 09:37 AM, John Johansen wrote:
> hey Tony,
>
> thanks for the patch, I am curious did your investigation look
> into what parts of DEFINE_AUDIT_SK are causing the issue?

Hi JJ.

Attached are the perf annotations for DEFINE_AUDIT_SK (percentages are relative to the fn).

Our kernel performance testing is carried out with default installs, which means AppArmor is enabled but the performance tests run unconfined. It was obvious that the overhead of DEFINE_AUDIT_SK was significant for smaller packet sizes (typical of synthetic benchmarks) and that it didn't need to execute for the unconfined case, hence the patch. I didn't spend any time looking at the performance of confined tasks; it may be worth your time to look at this.

Comparing my current tip (2601dd392dd1) to tip+patch I'm seeing an increase of 3-6% in netperf throughput for packet sizes 64-1024.

HTH

Tony

Percent |      Source code & Disassembly of vmlinux for cycles:ppp (117 samples)
---------------------------------------------------------------------------------
         :
         :      Disassembly of section .text:
         :
         :      813fbec0 <aa_label_sk_perm>:
         :      aa_label_sk_perm():
         :                      type));
         :      }
         :
         :      static int aa_label_sk_perm(struct aa_label *label, const char *op, u32 request,
         :                                  struct sock *sk)
         :      {
    0.00 :   813fbec0:  callq  81a017f0 <__fentry__>
    2.56 :   813fbec5:  push   %r14
    0.00 :   813fbec7:  mov    %rcx,%r14
         :              struct aa_profile *profile;
         :              DEFINE_AUDIT_SK(sa, op, sk);
    0.00 :   813fbeca:  mov    $0x7,%ecx
         :      {
    0.00 :   813fbecf:  push   %r13
    3.42 :   813fbed1:  mov    %edx,%r13d
    0.00 :   813fbed4:  push   %r12
    0.00 :   813fbed6:  push   %rbp
    0.00 :   813fbed7:  mov    %rdi,%rbp
    5.13 :   813fbeda:  push   %rbx
    0.00 :   813fbedb:  sub    $0xb8,%rsp
         :              DEFINE_AUDIT_SK(sa, op, sk);
    0.00 :   813fbee2:  movzwl 0x10(%r14),%r9d
         :      {
    1.71 :   813fbee7:  mov    %gs:0x28,%rax
    0.00 :   813fbef0:  mov    %rax,0xb0(%rsp)
    0.00 :   813fbef8:  xor    %eax,%eax
         :              DEFINE_AUDIT_SK(sa, op, sk);
    0.00 :   813fbefa:  lea    0x78(%rsp),%rdx
    1.71 :   813fbeff:  lea    0x20(%rsp),%r8
    0.00 :   813fbf04:  movq   $0x0,(%rsp)
    0.00 :   813fbf0c:  movq   $0x0,0x10(%rsp)
    0.00 :   813fbf15:  mov    %rdx,%rdi
   14.53 :   813fbf18:  rep stos %rax,%es:(%rdi)
    1.71 :   813fbf1b:  mov    $0xb,%ecx
    0.00 :   813fbf20:  mov    %r8,%rdi
    0.00 :   813fbf23:  mov    %r14,0x80(%rsp)
   18.80 :   813fbf2b:  rep stos %rax,%es:(%rdi)
    0.00 :   813fbf2e:  mov    %rsi,0x28(%rsp)
    1.71 :   813fbf33:  mov    %r9w,0x88(%rsp)
    0.00 :   813fbf3c:  cmp    $0x1,%r9w
    0.00 :   813fbf41:  je     813fbfa1
    0.00 :   813fbf43:  mov    $0x2,%eax
    0.00 :   813fbf48:  test   %r14,%r14
    0.00 :   813fbf4b:  je     813fbfa1
   14.53 :   813fbf4d:  mov    %al,(%rsp)
    0.00 :   813fbf50:  movzwl 0x1ea(%r14),%eax
         :              AA_BUG(!sk);
         :
         :              if (unconfined(label))
         :                      return 0;
         :
         :              return fn_for_each_confined(label, profile,
    0.00 :   813fbf58:  xor    %r12d,%r12d
         :              DEFINE_AUDIT_SK(sa, op, sk);
    0.00 :   813fbf5b:  mov    %r8,0x18(%rsp)
    8.55 :   813fbf60:  mov    %eax,0x58(%rsp)
    0.00 :   813fbf64:  movzbl 0x1e9(%r14),%eax
    0.00 :   813fbf6c:  mov    %rdx,0x8(%rsp)
    0.00 :   813fbf71:  mov    %eax,0x5c(%rsp)
         :              if (unconfined(label))
    8.55 :   813fbf75:  testb  $0x2,0x40(%rbp)
    0.00 :   813fbf79:  je     813fbfa8
         :                              aa_profile_af_sk_perm(profile, &sa, request, sk));
         :      }
    0.00 :   813fbf7b:  mov    0xb0(%rsp),%rdx
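The hot spots in the annotation above are the two "rep stos" runs: DEFINE_AUDIT_SK() zero-fills its on-stack audit data (7 and 11 quad-words at a time, roughly 150 bytes per call). As a rough sketch of what such a macro boils down to -- the structures and field names here are invented for illustration, not the kernel's actual definitions:

/*
 * Illustration only -- not the kernel's DEFINE_AUDIT_SK().  The macro
 * amounts to declaring two audit structures on the stack and zero-filling
 * them, which is what the 7- and 11-quad-word "rep stos" runs above do.
 * Field names and sizes are invented stand-ins.
 */
struct sock;					/* opaque, as in the kernel */

struct sketch_net_audit {
	unsigned short family;
	struct sock *sk;
	unsigned long pad[5];			/* rest of the zeroed area */
};

struct sketch_common_audit {
	int type;
	const char *op;
	struct sketch_net_audit *net;
	unsigned long pad[8];			/* rest of the zeroed area */
};

#define SKETCH_DEFINE_AUDIT_SK(name, op_str, sock_ptr)			\
	struct sketch_net_audit name##_net = { .sk = (sock_ptr) };	\
	struct sketch_common_audit name = { .op = (op_str),		\
					    .net = &name##_net }

Every field not named in the initializers is zeroed, so the compiler emits a memset (rep stos) for both structures on every call, whether or not the audit data is ever used.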
Re: [PATCH] apparmor: Fix network performance issue in aa_label_sk_perm
On 09/06/2018 09:33 PM, Tony Jones wrote:
> The netperf benchmark shows a 5.73% reduction in throughput for
> small (64 byte) transfers by unconfined tasks.
>
> DEFINE_AUDIT_SK() in aa_label_sk_perm() should not be performed
> unconditionally, rather only when the label is confined.
>
> netperf-tcp
>                       56974a6fc^             56974a6fc
> Min     64       563.48 (   0.00%)      531.17 (  -5.73%)
> Min     128     1056.92 (   0.00%)      999.44 (  -5.44%)
> Min     256     1945.95 (   0.00%)     1867.97 (  -4.01%)
> Min     1024    6761.40 (   0.00%)     6364.23 (  -5.87%)
> Min     2048   11110.53 (   0.00%)    10606.20 (  -4.54%)
> Min     3312   13692.67 (   0.00%)    13158.41 (  -3.90%)
> Min     4096   14926.29 (   0.00%)    14457.46 (  -3.14%)
> Min     8192   18399.34 (   0.00%)    18091.65 (  -1.67%)
> Min     16384  21384.13 (   0.00%)    21158.05 (  -1.06%)
> Hmean   64       564.96 (   0.00%)      534.38 (  -5.41%)
> Hmean   128     1064.42 (   0.00%)     1010.12 (  -5.10%)
> Hmean   256     1965.85 (   0.00%)     1879.16 (  -4.41%)
> Hmean   1024    6839.77 (   0.00%)     6478.70 (  -5.28%)
> Hmean   2048   11154.80 (   0.00%)    10671.13 (  -4.34%)
> Hmean   3312   13838.12 (   0.00%)    13249.01 (  -4.26%)
> Hmean   4096   15009.99 (   0.00%)    14561.36 (  -2.99%)
> Hmean   8192   18975.57 (   0.00%)    18326.54 (  -3.42%)
> Hmean   16384  21440.44 (   0.00%)    21324.59 (  -0.54%)
> Stddev  64         1.24 (   0.00%)        2.85 (-130.64%)
> Stddev  128        4.51 (   0.00%)        6.53 ( -44.84%)
> Stddev  256       11.67 (   0.00%)        8.50 (  27.16%)
> Stddev  1024      48.33 (   0.00%)       75.07 ( -55.34%)
> Stddev  2048      54.82 (   0.00%)       65.16 ( -18.86%)
> Stddev  3312     153.57 (   0.00%)       56.29 (  63.35%)
> Stddev  4096     100.25 (   0.00%)       88.50 (  11.72%)
> Stddev  8192     358.13 (   0.00%)      169.99 (  52.54%)
> Stddev  16384     43.99 (   0.00%)      141.82 (-222.39%)
>
> Signed-off-by: Tony Jones
> Fixes: 56974a6fcfef ("apparmor: add base infastructure for socket mediation")

hey Tony,

thanks for the patch, I am curious did your investigation look
into what parts of DEFINE_AUDIT_SK are causing the issue?

regardless, I have pulled it into apparmor next

> ---
>  security/apparmor/net.c | 15 +++++++++------
>  1 file changed, 9 insertions(+), 6 deletions(-)
>
> diff --git a/security/apparmor/net.c b/security/apparmor/net.c
> index bb24cfa0a164..d5d72dd1ca1f 100644
> --- a/security/apparmor/net.c
> +++ b/security/apparmor/net.c
> @@ -146,17 +146,20 @@ int aa_af_perm(struct aa_label *label, const char *op, u32 request, u16 family,
>  static int aa_label_sk_perm(struct aa_label *label, const char *op, u32 request,
>  			    struct sock *sk)
>  {
> -	struct aa_profile *profile;
> -	DEFINE_AUDIT_SK(sa, op, sk);
> +	int error = 0;
>
>  	AA_BUG(!label);
>  	AA_BUG(!sk);
>
> -	if (unconfined(label))
> -		return 0;
> +	if (!unconfined(label)) {
> +		struct aa_profile *profile;
> +		DEFINE_AUDIT_SK(sa, op, sk);
>
> -	return fn_for_each_confined(label, profile,
> -			aa_profile_af_sk_perm(profile, &sa, request, sk));
> +		error = fn_for_each_confined(label, profile,
> +				aa_profile_af_sk_perm(profile, &sa, request, sk));
> +	}
> +
> +	return error;
>  }
>
>  int aa_sk_perm(const char *op, u32 request, struct sock *sk)
[PATCH] apparmor: Fix network performance issue in aa_label_sk_perm
The netperf benchmark shows a 5.73% reduction in throughput for
small (64 byte) transfers by unconfined tasks.

DEFINE_AUDIT_SK() in aa_label_sk_perm() should not be performed
unconditionally, rather only when the label is confined.

netperf-tcp
                      56974a6fc^             56974a6fc
Min     64       563.48 (   0.00%)      531.17 (  -5.73%)
Min     128     1056.92 (   0.00%)      999.44 (  -5.44%)
Min     256     1945.95 (   0.00%)     1867.97 (  -4.01%)
Min     1024    6761.40 (   0.00%)     6364.23 (  -5.87%)
Min     2048   11110.53 (   0.00%)    10606.20 (  -4.54%)
Min     3312   13692.67 (   0.00%)    13158.41 (  -3.90%)
Min     4096   14926.29 (   0.00%)    14457.46 (  -3.14%)
Min     8192   18399.34 (   0.00%)    18091.65 (  -1.67%)
Min     16384  21384.13 (   0.00%)    21158.05 (  -1.06%)
Hmean   64       564.96 (   0.00%)      534.38 (  -5.41%)
Hmean   128     1064.42 (   0.00%)     1010.12 (  -5.10%)
Hmean   256     1965.85 (   0.00%)     1879.16 (  -4.41%)
Hmean   1024    6839.77 (   0.00%)     6478.70 (  -5.28%)
Hmean   2048   11154.80 (   0.00%)    10671.13 (  -4.34%)
Hmean   3312   13838.12 (   0.00%)    13249.01 (  -4.26%)
Hmean   4096   15009.99 (   0.00%)    14561.36 (  -2.99%)
Hmean   8192   18975.57 (   0.00%)    18326.54 (  -3.42%)
Hmean   16384  21440.44 (   0.00%)    21324.59 (  -0.54%)
Stddev  64         1.24 (   0.00%)        2.85 (-130.64%)
Stddev  128        4.51 (   0.00%)        6.53 ( -44.84%)
Stddev  256       11.67 (   0.00%)        8.50 (  27.16%)
Stddev  1024      48.33 (   0.00%)       75.07 ( -55.34%)
Stddev  2048      54.82 (   0.00%)       65.16 ( -18.86%)
Stddev  3312     153.57 (   0.00%)       56.29 (  63.35%)
Stddev  4096     100.25 (   0.00%)       88.50 (  11.72%)
Stddev  8192     358.13 (   0.00%)      169.99 (  52.54%)
Stddev  16384     43.99 (   0.00%)      141.82 (-222.39%)

Signed-off-by: Tony Jones
Fixes: 56974a6fcfef ("apparmor: add base infastructure for socket mediation")
---
 security/apparmor/net.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/security/apparmor/net.c b/security/apparmor/net.c
index bb24cfa0a164..d5d72dd1ca1f 100644
--- a/security/apparmor/net.c
+++ b/security/apparmor/net.c
@@ -146,17 +146,20 @@ int aa_af_perm(struct aa_label *label, const char *op, u32 request, u16 family,
 static int aa_label_sk_perm(struct aa_label *label, const char *op, u32 request,
 			    struct sock *sk)
 {
-	struct aa_profile *profile;
-	DEFINE_AUDIT_SK(sa, op, sk);
+	int error = 0;
 
 	AA_BUG(!label);
 	AA_BUG(!sk);
 
-	if (unconfined(label))
-		return 0;
+	if (!unconfined(label)) {
+		struct aa_profile *profile;
+		DEFINE_AUDIT_SK(sa, op, sk);
 
-	return fn_for_each_confined(label, profile,
-			aa_profile_af_sk_perm(profile, &sa, request, sk));
+		error = fn_for_each_confined(label, profile,
+				aa_profile_af_sk_perm(profile, &sa, request, sk));
+	}
+
+	return error;
 }
 
 int aa_sk_perm(const char *op, u32 request, struct sock *sk)
--
2.18.0
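The shape of the fix is general: keep zero-initialized bookkeeping data inside the branch that actually needs it, so the fast path never pays for the memset. A stand-alone sketch of the before/after pattern (invented names, not the kernel code itself):

#include <stdio.h>

/* Stand-in for the audit data DEFINE_AUDIT_SK() builds on the stack. */
struct big_audit { char buf[160]; };

/* Before: the zero-fill happens on every call, confined or not. */
static int check_eager(int confined)
{
	struct big_audit sa = { 0 };

	if (!confined)
		return 0;		/* fast path already paid for the memset */
	return sa.buf[0];
}

/* After: the zero-fill only happens when the label is confined. */
static int check_lazy(int confined)
{
	int error = 0;

	if (confined) {
		struct big_audit sa = { 0 };	/* zeroed only when needed */

		error = sa.buf[0];
	}
	return error;
}

int main(void)
{
	printf("%d %d\n", check_eager(0), check_lazy(0));
	return 0;
}

With the declaration moved into the confined branch, the compiler only emits the memset on the path that reaches it, which is exactly what the patch does for the unconfined case.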
Re: network performance get regression from 2.6 to 3.10 by each version
On 05/02/2014 12:40 PM, V JobNickname wrote:
I have an ARM platform which works with older 2.6.28 Linux Kernel and the embedded NIC driver
I profile the TCP Tx using netperf 2.6 by command "./netperf -H {serverip} -l 300".

Is your ARM platform a multi-core one? If so, you may need/want to look into making certain the assignment of NIC interrupts and netperf have remained constant through your tests. You can bind netperf to a specific CPU via either "taskset" or the global -T option. You can check the interrupt assignment(s) for the queue(s) from the NIC by looking at /proc/interrupts and perhaps via other means.

It would also be good to know if the drops in throughput correspond to an increase in service demand (CPU per unit of work). To that end, adding a global -c option to measure local (netperf side) CPU utilization would be a good idea.

Still, even armed with that information, tracking down the regression or regressions will be no small feat, particularly since the timespan is so long. A very good reason to be trying the newer versions as they appear, even if only briefly, rather than leaving it for so long.

happy benchmarking,

rick jones
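taskset pins a process from the command line; the same effect can be had from inside a test harness with sched_setaffinity(). A minimal Linux-only sketch, assuming CPU 0 is the CPU the NIC interrupts are steered to:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

/* Pin the calling process to one CPU so CPU/interrupt placement stays
 * constant across benchmark runs -- the same effect as "taskset -c 0". */
int main(void)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(0, &set);		/* CPU 0 is an assumption; adjust to taste */

	if (sched_setaffinity(0, sizeof(set), &set) != 0) {
		perror("sched_setaffinity");
		return 1;
	}

	printf("pinned to CPU 0, pid %d\n", (int)getpid());
	/* ... run or exec the benchmark workload here ... */
	return 0;
}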
network performance get regression from 2.6 to 3.10 by each version
I have an ARM platform which works with the older 2.6.28 Linux kernel and the embedded NIC driver. I profile the TCP Tx using netperf 2.6 by command "./netperf -H {serverip} -l 300". In 2.6.28 the TCP Tx can reach 190 Mbps.

Recently I have been porting the platform to the long-term kernel versions 2.6.32.61, 3.4.88 and 3.10, and I get a lower TCP Tx throughput with each newer version:

2.6.32.61 is about 184 Mbps
3.4.88 is about 173 Mbps
3.10.0 is about 160 Mbps

So I tried porting to more EOL versions:

3.0.38   184 Mbps
3.2.0    179 Mbps
3.2.57   177 Mbps
3.5.0    168 Mbps
3.5.7    166 Mbps
3.6.0    162 Mbps
3.6.11   163 Mbps

The newer versions have slower performance. The kernels were downloaded from kernel.org for the porting. To touch as few files as possible, I only ported the basic requirements for MACHINE_START ("io_map", "interrupt", "timer") and added the NIC driver. The only patch the NIC driver needs from 2.x to 3.x is to group the fops into "net_device_ops"; there is no change in the xmit, receive and ISR handling flow. Actually, the NIC driver files from 3.2 to 3.6 are identical by diff.

The only other difference is the .config file, and I have tried to keep the configurations identical as well; the only differences are options that are new to, or removed from, each version. I have no idea whether the performance regression is due to the network stack of each version or to some feature I have to configure on the newer versions.

Any suggestion I can try in order to dig out the root cause? Or does someone have a similar observation from their own experience?

The following is the .config which I used for 3.0.38 and the .config diffs between the versions. I can't find any option difference in the .config diffs that would affect the performance.

.config of kernel 3.0.38

#
# Automatically generated make config: don't edit
# Linux/arm 3.0.38 Kernel Configuration
#
CONFIG_ARM=y
CONFIG_SYS_SUPPORTS_APM_EMULATION=y
# CONFIG_ARCH_USES_GETTIMEOFFSET is not set
CONFIG_KTIME_SCALAR=y
CONFIG_HAVE_PROC_CPU=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_HAVE_LATENCYTOP_SUPPORT=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_TRACE_IRQFLAGS_SUPPORT=y
CONFIG_HARDIRQS_SW_RESEND=y
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_RWSEM_GENERIC_SPINLOCK=y
CONFIG_ARCH_HAS_CPU_IDLE_WAIT=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_NEED_DMA_MAP_STATE=y
CONFIG_VECTORS_BASE=0x
# CONFIG_ARM_PATCH_PHYS_VIRT is not set
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"
CONFIG_HAVE_IRQ_WORK=y

#
# General setup
#
CONFIG_EXPERIMENTAL=y
CONFIG_BROKEN_ON_SMP=y
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_CROSS_COMPILE="arm-linux-"
CONFIG_LOCALVERSION=""
# CONFIG_LOCALVERSION_AUTO is not set
CONFIG_HAVE_KERNEL_GZIP=y
CONFIG_HAVE_KERNEL_LZMA=y
CONFIG_HAVE_KERNEL_LZO=y
CONFIG_KERNEL_GZIP=y
# CONFIG_KERNEL_LZMA is not set
# CONFIG_KERNEL_LZO is not set
CONFIG_DEFAULT_HOSTNAME="(none)"
# CONFIG_SWAP is not set
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
# CONFIG_POSIX_MQUEUE is not set
# CONFIG_BSD_PROCESS_ACCT is not set
# CONFIG_FHANDLE is not set
# CONFIG_TASKSTATS is not set
# CONFIG_AUDIT is not set
CONFIG_HAVE_GENERIC_HARDIRQS=y

#
# IRQ subsystem
#
CONFIG_GENERIC_HARDIRQS=y
CONFIG_HAVE_SPARSE_IRQ=y
CONFIG_GENERIC_IRQ_SHOW=y
# CONFIG_SPARSE_IRQ is not set

#
# RCU Subsystem
#
# CONFIG_TREE_PREEMPT_RCU is not set
# CONFIG_TINY_RCU is not set
CONFIG_TINY_PREEMPT_RCU=y
CONFIG_PREEMPT_RCU=y
# CONFIG_RCU_TRACE is not set
# CONFIG_TREE_RCU_TRACE is not set
# CONFIG_RCU_BOOST is not set
# CONFIG_IKCONFIG is not set
CONFIG_LOG_BUF_SHIFT=16
# CONFIG_CGROUPS is not set
CONFIG_NAMESPACES=y
# CONFIG_UTS_NS is not set
CONFIG_IPC_NS=y
# CONFIG_USER_NS is not set
# CONFIG_PID_NS is not set
# CONFIG_NET_NS is not set
# CONFIG_SCHED_AUTOGROUP is not set
CONFIG_SYSFS_DEPRECATED=y
CONFIG_SYSFS_DEPRECATED_V2=y
# CONFIG_RELAY is not set
CONFIG_BLK_DEV_INITRD=y
CONFIG_INITRAMFS_SOURCE=""
CONFIG_RD_GZIP=y
# CONFIG_RD_BZIP2 is not set
# CONFIG_RD_LZMA is not set
# CONFIG_RD_XZ is not set
# CONFIG_RD_LZO is not set
# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
CONFIG_SYSCTL=y
CONFIG_ANON_INODES=y
CONFIG_EXPERT=y
CONFIG_UID16=y
# CONFIG_SYSCTL_SYSCALL is not set
CONFIG_KALLSYMS=y
# CONFIG_KALLSYMS_ALL is not set
CONFIG_HOTPLUG=y
CONFIG_PRINTK=y
CONFIG_BUG=y
# CONFIG_ELF_CORE is not set
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_EPOLL=y
CONFIG_SIGNALFD=y
CONFIG_TIMERFD=y
CONFIG_EVENTFD=y
CONFIG_SHMEM=y
CONFIG_AIO=y
CONFIG_EMBEDDED=y
CONFIG_HAVE_PERF_EVENTS=y
CONFIG_PERF_USE_VMALLOC=y

#
# Kernel Performance Events And Counters
#
# CONFIG_PERF_EVENTS is not set
# CONFIG_PERF_COUNTERS is not set
# CONFIG_VM_EVENT_COUNTERS is not set
CONFIG_COMPAT_BRK=y
CONFIG_SLAB=y
# CONFIG_SLUB is not set
# CONFIG_SLOB is not set
# CONFIG_PROFILING is not set
CONFIG_HAVE_OPROFILE=y
# CONFIG_KPROBES is not set
CONFIG_HAVE_KPROBES=y
CONFIG_HAVE_KRETPROBES=y
CONFIG_HAVE_REGS_AND_STACK_ACCESS_API=y
CONFIG_HAVE_DMA_API_DEBUG=y

#
# GCOV-based kernel profiling
#
CONFIG_HAVE_GENERIC_DMA_COHERENT=y
CONFIG_SLABINFO=y
CONFIG_RT_MUTEXES=y
CONFIG_BASE_SMALL=0
CONFIG_MODULES=y
#
Re: Poor network performance x86_64.. also with 3.13
On Sun, Feb 09, 2014 at 10:14:34AM -0800, Eric Dumazet wrote:
> tcp_rmem[2] = 16777
>
> Come on, the 640KB barrier was broken a long time ago ;)
>
> Feel free to investigate, I wont ;)

Me too - it's not like I don't have anything else to do. :-)

I was just wondering why 3.10 was fine even with these settings and 3.12 wasn't. Here's the original report:

"I recently upgraded the Kernel from version 3.10 to latest stable 3.12.8, did the usual "make oldconfig" (resulting config attached). But now I noticed some _really_ low network performance."

Link: http://lkml.kernel.org/r/52dad66f.7080...@dragonslave.de

--
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
Re: Poor network performance x86_64.. also with 3.13
On Sun, 2014-02-09 at 16:31 +0100, Borislav Petkov wrote:
> On Sun, Feb 09, 2014 at 04:05:11PM +0100, Daniel Exner wrote:
> > > cat /etc/sysctl.d/net.conf
> > > net.ipv4.tcp_window_scaling = 1
> > > net.core.rmem_max = 16777216
> > > net.ipv4.tcp_rmem = 4096 87380 16777
> > > net.ipv4.tcp_wmem = 4096 1638
> >
> > After removing those values I finally had sane iperf values.
> > No idea how those got there, perhaps they made sense when I first setup
> > the box, which is some years ago..
>
> The only question that is left to clarify now is why do those values
> have effect on 3.12.x and not on 3.10...

tcp_rmem[2] = 16777

Come on, the 640KB barrier was broken a long time ago ;)

Feel free to investigate, I wont ;)
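tcp_rmem is "min default max" in bytes, and the max is the ceiling for receive-buffer autotuning, so a max of 16777 bytes (instead of the intended 16777216) pins the window to a couple of segments. One quick way to see what a socket is actually granted is getsockopt(SO_RCVBUF); a small sketch, keeping in mind the value printed is the kernel's bookkeeping figure rather than exact payload space:

#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

/* Print the receive buffer the kernel assigned to a fresh TCP socket.
 * With tcp_rmem capped at 16777 bytes, autotuning can never grow the
 * window enough for a fast link; dropping the bogus sysctl restores
 * the much larger defaults. */
int main(void)
{
	int fd = socket(AF_INET, SOCK_STREAM, 0);
	int rcvbuf = 0;
	socklen_t len = sizeof(rcvbuf);

	if (fd < 0 || getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, &len) != 0) {
		perror("SO_RCVBUF");
		return 1;
	}
	printf("SO_RCVBUF = %d bytes\n", rcvbuf);
	close(fd);
	return 0;
}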
Re: Poor network performance x86_64.. also with 3.13
On Sun, Feb 09, 2014 at 04:05:11PM +0100, Daniel Exner wrote:
> > cat /etc/sysctl.d/net.conf
> > net.ipv4.tcp_window_scaling = 1
> > net.core.rmem_max = 16777216
> > net.ipv4.tcp_rmem = 4096 87380 16777
> > net.ipv4.tcp_wmem = 4096 1638
>
> After removing those values I finally had sane iperf values.
> No idea how those got there, perhaps they made sense when I first setup
> the box, which is some years ago..

The only question that is left to clarify now is why do those values have effect on 3.12.x and not on 3.10...

--
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
Re: Poor network performance x86_64.. also with 3.13
Hi all,

Am Mon, 20 Jan 2014 23:37:52 +0100 schrieb Borislav Petkov:
> On Mon, Jan 20, 2014 at 11:27:25PM +0100, Daniel Exner wrote:
> > I just did the same procedure with Kernel Version 3.13: same poor
> > rates.
> >
> > I think I will try to see if 3.12.6 was still ok and bisect from
> > there.
>
> Or try something more coarse-grained like 3.11 first, then 3.12 and
> then the -rcs in between.

I must apologize for suspecting the kernel for my problems. After some bisect attempts I finally noticed the following:

> cat /etc/sysctl.d/net.conf
> net.ipv4.tcp_window_scaling = 1
> net.core.rmem_max = 16777216
> net.ipv4.tcp_rmem = 4096 87380 16777
> net.ipv4.tcp_wmem = 4096 1638

After removing those values I finally had sane iperf values. No idea how those got there, perhaps they made sense when I first set up the box, which is some years ago..

Anyway, thanks all for your help :)

Greetings
Daniel Exner
Re: Poor network performance x86_64.. also with 3.13
On 01/20/2014 11:37 PM, Borislav Petkov wrote:
> On Mon, Jan 20, 2014 at 11:27:25PM +0100, Daniel Exner wrote:
> > I just did the same procedure with Kernel Version 3.13: same poor rates.
> >
> > I think I will try to see if 3.12.6 was still ok and bisect from there.
>
> Or try something more coarse-grained like 3.11 first, then 3.12 and
> then the -rcs in between.

Hm, on my machine 3.13 (latest git) has double the throughput of 3.11 (distro compiled) on the loopback interface: 68Gb vs 33Gb (iperf).
Re: Poor network performance x86_64.. also with 3.13
On Mon, Jan 20, 2014 at 11:27:25PM +0100, Daniel Exner wrote:
> I just did the same procedure with Kernel Version 3.13: same poor rates.
>
> I think I will try to see if 3.12.6 was still ok and bisect from there.

Or try something more coarse-grained like 3.11 first, then 3.12 and then the -rcs in between.

--
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
Poor network performance x86_64.. also with 3.13
Hi,

Am 18.01.2014 23:46, schrieb Daniel Exner:
> Hi again,
>
> Am 18.01.2014 20:50, schrieb Borislav Petkov:
>> + netdev.
> Thx
>
> Am 18.01.2014 20:49, schrieb Holger Hoffstätte:
>> [This mail was also posted to gmane.linux.kernel.]
>>
>> On Sat, 18 Jan 2014 20:30:55 +0100, Daniel Exner wrote:
>>
>>> I recently upgraded the Kernel from version 3.10 to latest
>>> stable 3.12.8, did the usual "make oldconfig" (resulting
>>> config attached).
>>>
>>> But now I noticed some _really_ low network performance.
>>
>> Try: sysctl net.ipv4.tcp_limit_output_bytes=262144
>
> Tried that. Even 10 times the value. Same effect.

I just did the same procedure with Kernel Version 3.13: same poor rates.

I think I will try to see if 3.12.6 was still ok and bisect from there.

Greetings
Daniel
Re: 3.12.8 poor network performance x86_64
Hi again,

Am 18.01.2014 20:50, schrieb Borislav Petkov:
> + netdev.
Thx

Am 18.01.2014 20:49, schrieb Holger Hoffstätte:
> [This mail was also posted to gmane.linux.kernel.]
>
> On Sat, 18 Jan 2014 20:30:55 +0100, Daniel Exner wrote:
>
>> I recently upgraded the Kernel from version 3.10 to latest stable
>> 3.12.8, did the usual "make oldconfig" (resulting config attached).
>>
>> But now I noticed some _really_ low network performance.
>
> Try: sysctl net.ipv4.tcp_limit_output_bytes=262144

Tried that. Even 10 times the value. Same effect.

Is there something like that on a lower level of the network stack I might try to change? Could that be something in the cgroups layer?

Should I send a dmesg or anything else?

Greetings
Daniel
Re: 3.12.8 poor network performance x86_64
On Sat, 18 Jan 2014 20:30:55 +0100, Daniel Exner wrote: > I recently upgraded the Kernel from version 3.10 to latest stable > 3.12.8, did the usual "make oldconfig" (resulting config attached). > > But now I noticed some _really_ low network performance. Try: sysctl net.ipv4.tcp_limit_output_bytes=262144 Holger -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
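tcp_limit_output_bytes is the TCP small-queues limit introduced in 3.6; it caps how many bytes a single socket may have queued below the TCP layer at any time. Checking what a box is currently running with needs no root; a small sketch:

#include <stdio.h>

/* Read the TCP small-queues limit Holger mentions above. */
int main(void)
{
	FILE *f = fopen("/proc/sys/net/ipv4/tcp_limit_output_bytes", "r");
	long val;

	if (!f || fscanf(f, "%ld", &val) != 1) {
		perror("tcp_limit_output_bytes");
		return 1;
	}
	printf("tcp_limit_output_bytes = %ld\n", val);
	fclose(f);
	return 0;
}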
Re: Major network performance regression in 3.7
> "Willy" == Willy Tarreau writes: Willy> On Sun, Jan 06, 2013 at 11:00:15AM -0800, Eric Dumazet wrote: >> On Sun, 2013-01-06 at 10:51 -0800, Eric Dumazet wrote: >> > > >> > > (sd->len is usually 4096, which is expected, but sd->total_len value is >> > > huge in your case, so we always set the flag in fs/splice.c) >> > >> > I am testing : >> > >> >if (sd->len < sd->total_len && pipe->nrbufs > 1) >> > more |= MSG_SENDPAGE_NOTLAST; >> > >> >> Yes, this should fix the problem : >> >> If there is no following buffer in the pipe, we should not set NOTLAST. >> >> diff --git a/fs/splice.c b/fs/splice.c >> index 8890604..6909d89 100644 >> --- a/fs/splice.c >> +++ b/fs/splice.c >> @@ -696,8 +696,10 @@ static int pipe_to_sendpage(struct pipe_inode_info >> *pipe, >> return -EINVAL; >> >> more = (sd->flags & SPLICE_F_MORE) ? MSG_MORE : 0; >> -if (sd->len < sd->total_len) >> + >> +if (sd->len < sd->total_len && pipe->nrbufs > 1) >> more |= MSG_SENDPAGE_NOTLAST; >> + >> return file->f_op->sendpage(file, buf->page, buf->offset, sd-> len, , more); >> } Willy> OK it works like a charm here now ! I can't break it anymore, so it Willy> looks like you finally got it ! It's still broken, there's no comments in the code to explain all this magic to mere mortals! *grin* John -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Major network performance regression in 3.7
> "Willy" == Willy Tarreau writes: Willy> On Sun, Jan 06, 2013 at 04:49:35PM -0500, John Stoffel wrote: >> > "Willy" == Willy Tarreau writes: >> Willy> On Sun, Jan 06, 2013 at 11:00:15AM -0800, Eric Dumazet wrote: >> >> On Sun, 2013-01-06 at 10:51 -0800, Eric Dumazet wrote: >> >> > > >> >> > > (sd->len is usually 4096, which is expected, but sd->total_len value >> >> > > is >> >> > > huge in your case, so we always set the flag in fs/splice.c) >> >> > >> >> > I am testing : >> >> > >> >> >if (sd->len < sd->total_len && pipe->nrbufs > 1) >> >> > more |= MSG_SENDPAGE_NOTLAST; >> >> > >> >> >> >> Yes, this should fix the problem : >> >> >> >> If there is no following buffer in the pipe, we should not set NOTLAST. >> >> >> >> diff --git a/fs/splice.c b/fs/splice.c >> >> index 8890604..6909d89 100644 >> >> --- a/fs/splice.c >> >> +++ b/fs/splice.c >> >> @@ -696,8 +696,10 @@ static int pipe_to_sendpage(struct pipe_inode_info >> >> *pipe, >> >> return -EINVAL; >> >> >> >> more = (sd->flags & SPLICE_F_MORE) ? MSG_MORE : 0; >> >> - if (sd->len < sd->total_len) >> >> + >> >> + if (sd->len < sd->total_len && pipe->nrbufs > 1) >> >> more |= MSG_SENDPAGE_NOTLAST; >> >> + >> >> return file->f_op->sendpage(file, buf->page, buf->offset, sd-> len, , more); >> >> } >> Willy> OK it works like a charm here now ! I can't break it anymore, so it Willy> looks like you finally got it ! >> >> It's still broken, there's no comments in the code to explain all this >> magic to mere mortals! *grin* Willy> I would generally agree, but when Eric fixes such a thing, he Willy> generally goes with lengthy details in the commit message. I'm sure he will too, I just wanted to nudge him because while I sorta followed this discussion, I see lots of pain down the road if the code wasn't updated with some nice big fat comments. Great job finding this code and testing, testing, testing. John -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Major network performance regression in 3.7
On Sun, Jan 06, 2013 at 04:49:35PM -0500, John Stoffel wrote: > > "Willy" == Willy Tarreau writes: > > Willy> On Sun, Jan 06, 2013 at 11:00:15AM -0800, Eric Dumazet wrote: > >> On Sun, 2013-01-06 at 10:51 -0800, Eric Dumazet wrote: > >> > > > >> > > (sd->len is usually 4096, which is expected, but sd->total_len value is > >> > > huge in your case, so we always set the flag in fs/splice.c) > >> > > >> > I am testing : > >> > > >> >if (sd->len < sd->total_len && pipe->nrbufs > 1) > >> > more |= MSG_SENDPAGE_NOTLAST; > >> > > >> > >> Yes, this should fix the problem : > >> > >> If there is no following buffer in the pipe, we should not set NOTLAST. > >> > >> diff --git a/fs/splice.c b/fs/splice.c > >> index 8890604..6909d89 100644 > >> --- a/fs/splice.c > >> +++ b/fs/splice.c > >> @@ -696,8 +696,10 @@ static int pipe_to_sendpage(struct pipe_inode_info > >> *pipe, > >> return -EINVAL; > >> > >> more = (sd->flags & SPLICE_F_MORE) ? MSG_MORE : 0; > >> - if (sd->len < sd->total_len) > >> + > >> + if (sd->len < sd->total_len && pipe->nrbufs > 1) > >> more |= MSG_SENDPAGE_NOTLAST; > >> + > >> return file->f_op->sendpage(file, buf->page, buf->offset, > sd-> len, , more); > >> } > > Willy> OK it works like a charm here now ! I can't break it anymore, so it > Willy> looks like you finally got it ! > > It's still broken, there's no comments in the code to explain all this > magic to mere mortals! *grin* I would generally agree, but when Eric fixes such a thing, he generally goes with lengthy details in the commit message. Willy -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Major network performance regression in 3.7
On Sun, Jan 06, 2013 at 11:39:31AM -0800, Eric Dumazet wrote:
> On Sun, 2013-01-06 at 20:34 +0100, Willy Tarreau wrote:
>
> > OK it works like a charm here now ! I can't break it anymore, so it
> > looks like you finally got it !
> >
> > I noticed that the data rate was higher when the loopback's MTU
> > is exactly a multiple of 4096 (making the 64k choice optimal)
> > while I would have assumed that in order to efficiently splice
> > TCP segments, we'd need to have some space for IP/TCP headers
> > and n*4k for the payload.
> >
> > I also got the transfer freezes again a few times when starting
> > tcpdump on the server, but this is not 100% reproducible I'm afraid.
> > So I'll bring this back when I manage to get some analysable pattern.
> >
> > The spliced transfer through all the chain haproxy works fine again
> > at 10gig with your fix. The issue is closed for me. Feel free to add
> > my Tested-By if you want.
>
> Good to know !
>
> What is the max speed you get now ?

Line rate with 1500 MTU and LRO enabled :

#   time     eth1(ikb      ipk        okb       opk)  eth2(ikb       ipk        okb      opk)
1357060023   19933.3   41527.7  9355538.2   62167.7  9757888.1  808701.1   19400.3  40417.7
1357060024   26124.1   54425.5  9290064.9   48804.4  9778294.0  810210.0   18068.8  37643.3
1357060025   27015.2   56281.1  9296115.3   46868.8  9797125.9  811271.1    8790.1  18312.2
1357060026   27556.0   57408.8  9291701.4   46805.5  9805371.6  811410.0    3494.8   7280.0
1357060027   27577.0   57452.2  9293606.8   46804.4  9806122.3  811314.4    2558.7   5330.0
1357060028   27476.1   57242.2  9296885.4   46830.0  9794537.3  810527.7    2516.1   5242.2
                                ^^^^^^^^^             ^^^^^^^^^
                                kbps out              kbps in

eth1=facing the client
eth2=facing the server

Top reports the following usage :

Cpu0  :  0.0%us,  0.0%sy,  0.0%ni, 31.7%id,  0.0%wa,  0.0%hi, 68.3%si,  0.0%st
Cpu1  :  1.0%us, 37.3%sy,  0.0%ni, 61.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

(IRQ bound to cpu 0, haproxy to cpu 1)

This is a core2duo 2.66 GHz and the myris are 1st generation.

BTW I was very happy to see that the LRO->GRO conversion patches in 3.8-rc2 don't affect byte rate anymore (just a minor CPU usage increase but this is not critical here), now I won't complain about it being slower anymore, you won :-)

With the GRO patches backported, still at 1500 MTU but with GRO now :

Cpu0  :  0.0%us,  0.0%sy,  0.0%ni, 28.7%id,  0.0%wa,  0.0%hi, 71.3%si,  0.0%st
Cpu1  :  0.0%us, 37.6%sy,  0.0%ni, 62.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

#   time     eth1(ikb      ipk        okb       opk)  eth2(ikb       ipk        okb      opk)
1357058637   18319.3   38165.5  9401736.3   65159.9  9761613.4  808963.3   19403.6  40424.4
1357058638   20009.8   41687.7  9400903.7   62706.6  9770555.8  809522.2   18696.5  38951.1
1357058639   25439.5   52999.9  9301635.3   50267.7  9773666.7  809721.1   19174.1  39946.6
1357058640   26808.2   55850.0  9298301.4   46876.6  9790470.1  810843.3   12408.7  25851.1
1357058641   27110.9   56481.1  9297009.2   46832.2  9803308.4  811339.9    5692.5  11859.9
1357058642   27411.1   57106.6  9291419.2   46796.6  9806846.5  811378.8    2804.4   5842.2

This kernel is getting really good :-)

Cheers,
Willy
Re: Major network performance regression in 3.7
On Sun, 2013-01-06 at 20:34 +0100, Willy Tarreau wrote: > OK it works like a charm here now ! I can't break it anymore, so it > looks like you finally got it ! > > I noticed that the data rate was higher when the loopback's MTU > is exactly a multiple of 4096 (making the 64k choice optimal) > while I would have assumed that in order to efficiently splice > TCP segments, we'd need to have some space for IP/TCP headers > and n*4k for the payload. > > I also got the transfer freezes again a few times when starting > tcpdump on the server, but this is not 100% reproducible I'm afraid. > So I'll bring this back when I manage to get some analysable pattern. > > The spliced transfer through all the chain haproxy works fine again > at 10gig with your fix. The issue is closed for me. Feel free to add > my Tested-By if you want. > Good to know ! What is the max speed you get now ? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Major network performance regression in 3.7
On Sun, Jan 06, 2013 at 11:00:15AM -0800, Eric Dumazet wrote: > On Sun, 2013-01-06 at 10:51 -0800, Eric Dumazet wrote: > > > > > > (sd->len is usually 4096, which is expected, but sd->total_len value is > > > huge in your case, so we always set the flag in fs/splice.c) > > > > I am testing : > > > >if (sd->len < sd->total_len && pipe->nrbufs > 1) > > more |= MSG_SENDPAGE_NOTLAST; > > > > Yes, this should fix the problem : > > If there is no following buffer in the pipe, we should not set NOTLAST. > > diff --git a/fs/splice.c b/fs/splice.c > index 8890604..6909d89 100644 > --- a/fs/splice.c > +++ b/fs/splice.c > @@ -696,8 +696,10 @@ static int pipe_to_sendpage(struct pipe_inode_info *pipe, > return -EINVAL; > > more = (sd->flags & SPLICE_F_MORE) ? MSG_MORE : 0; > - if (sd->len < sd->total_len) > + > + if (sd->len < sd->total_len && pipe->nrbufs > 1) > more |= MSG_SENDPAGE_NOTLAST; > + > return file->f_op->sendpage(file, buf->page, buf->offset, > sd->len, , more); > } OK it works like a charm here now ! I can't break it anymore, so it looks like you finally got it ! I noticed that the data rate was higher when the loopback's MTU is exactly a multiple of 4096 (making the 64k choice optimal) while I would have assumed that in order to efficiently splice TCP segments, we'd need to have some space for IP/TCP headers and n*4k for the payload. I also got the transfer freezes again a few times when starting tcpdump on the server, but this is not 100% reproducible I'm afraid. So I'll bring this back when I manage to get some analysable pattern. The spliced transfer through all the chain haproxy works fine again at 10gig with your fix. The issue is closed for me. Feel free to add my Tested-By if you want. Thank you Eric :-) Willy -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
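For context on the "multiple of 4096" observation above and the n*4096 + 52 figure used later in the thread: the arithmetic below is a sketch assuming IPv4 with the TCP timestamp option, which is the usual loopback setup but is not spelled out explicitly in the thread.

    /*
     * Why the "good" loopback MTUs follow n*4096 + 52 (assuming IPv4 + TCP
     * with the 12-byte timestamp option; other setups shift the constant):
     *
     *   header room          = 20 (IPv4) + 20 (TCP) + 12 (timestamps) = 52
     *   16436 (lo default)   = 4 * 4096 + 52
     *   payload per segment  = MTU - 52 = n * 4096
     *
     * i.e. every segment's payload is whole pages, so splice() can hand over
     * full page references instead of copying from partially filled pages.
     */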
Re: Major network performance regression in 3.7
On Sun, 2013-01-06 at 10:51 -0800, Eric Dumazet wrote: > > > > (sd->len is usually 4096, which is expected, but sd->total_len value is > > huge in your case, so we always set the flag in fs/splice.c) > > I am testing : > >if (sd->len < sd->total_len && pipe->nrbufs > 1) > more |= MSG_SENDPAGE_NOTLAST; > Yes, this should fix the problem : If there is no following buffer in the pipe, we should not set NOTLAST. diff --git a/fs/splice.c b/fs/splice.c index 8890604..6909d89 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -696,8 +696,10 @@ static int pipe_to_sendpage(struct pipe_inode_info *pipe, return -EINVAL; more = (sd->flags & SPLICE_F_MORE) ? MSG_MORE : 0; - if (sd->len < sd->total_len) + + if (sd->len < sd->total_len && pipe->nrbufs > 1) more |= MSG_SENDPAGE_NOTLAST; + return file->f_op->sendpage(file, buf->page, buf->offset, sd->len, &pos, more); } -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
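To make the interaction explicit, here is a toy userspace model (not the kernel code itself) of the two decisions being discussed: fs/splice.c choosing whether to set MSG_SENDPAGE_NOTLAST, and do_tcp_sendpages() choosing whether to push, as quoted elsewhere in this thread. The names len, total_len and nrbufs simply mirror the fields above.

    #include <stdbool.h>
    #include <stddef.h>

    /* Producer side (pipe_to_sendpage): only promise "more is coming" when
     * another pipe buffer is actually queued behind this one. */
    static bool sets_notlast(size_t len, size_t total_len, int nrbufs)
    {
            /* pre-fix: return len < total_len;
             * with a 1 GB splice request this is true for every buffer,
             * including the last one currently sitting in the pipe */
            return len < total_len && nrbufs > 1;
    }

    /* Consumer side (do_tcp_sendpages): defer the push while NOTLAST is set.
     * If the flag was set on the final buffer, nothing ever pushes the tail
     * and the transfer stalls until more data happens to arrive. */
    static bool pushes_now(size_t copied, bool notlast)
    {
            return copied > 0 && !notlast;
    }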
Re: Major network performance regression in 3.7
> > (sd->len is usually 4096, which is expected, but sd->total_len value is > huge in your case, so we always set the flag in fs/splice.c) I am testing : if (sd->len < sd->total_len && pipe->nrbufs > 1) more |= MSG_SENDPAGE_NOTLAST; -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Major network performance regression in 3.7
On Sun, 2013-01-06 at 10:39 -0800, Eric Dumazet wrote: > On Sun, 2013-01-06 at 18:35 +0100, Willy Tarreau wrote: > > > Unfortunately it does not work any better, which means to me > > that we don't leave via this code path. I tried other tricks > > which failed too. I need to understand this part better before > > randomly fiddling with it. > > > > OK, now I have your test program, I can work on a fix, dont worry ;) > > The MSG_SENDPAGE_NOTLAST logic needs to be tweaked. > (sd->len is usually 4096, which is expected, but sd->total_len value is huge in your case, so we always set the flag in fs/splice.c) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Major network performance regression in 3.7
On Sun, 2013-01-06 at 18:35 +0100, Willy Tarreau wrote: > Unfortunately it does not work any better, which means to me > that we don't leave via this code path. I tried other tricks > which failed too. I need to understand this part better before > randomly fiddling with it. > OK, now I have your test program, I can work on a fix, dont worry ;) The MSG_SENDPAGE_NOTLAST logic needs to be tweaked. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Major network performance regression in 3.7
On Sun, Jan 06, 2013 at 09:10:55AM -0800, Eric Dumazet wrote: > On Sun, 2013-01-06 at 17:44 +0100, Willy Tarreau wrote: > > On Sun, Jan 06, 2013 at 08:39:53AM -0800, Eric Dumazet wrote: > > > Hmm, I'll have to check if this really can be reverted without hurting > > > vmsplice() again. > > > > Looking at the code I've been wondering whether we shouldn't transform > > the condition to perform the push if we can't push more segments, but > > I don't know what to rely on. It would be something like this : > > > >if (copied && > > (!(flags & MSG_SENDPAGE_NOTLAST) || cant_push_more)) > > tcp_push(sk, flags, mss_now, tp->nonagle); > > Good point ! > > Maybe the following fix then ? > > > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c > index 1ca2536..7ba0717 100644 > --- a/net/ipv4/tcp.c > +++ b/net/ipv4/tcp.c > @@ -941,8 +941,10 @@ out: > return copied; > > do_error: > - if (copied) > + if (copied) { > + flags &= ~MSG_SENDPAGE_NOTLAST; > goto out; > + } > out_err: > return sk_stream_error(sk, flags, err); > } Unfortunately it does not work any better, which means to me that we don't leave via this code path. I tried other tricks which failed too. I need to understand this part better before randomly fiddling with it. Willy -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Major network performance regression in 3.7
On Sun, 2013-01-06 at 17:44 +0100, Willy Tarreau wrote: > On Sun, Jan 06, 2013 at 08:39:53AM -0800, Eric Dumazet wrote: > > Hmm, I'll have to check if this really can be reverted without hurting > > vmsplice() again. > > Looking at the code I've been wondering whether we shouldn't transform > the condition to perform the push if we can't push more segments, but > I don't know what to rely on. It would be something like this : > >if (copied && > (!(flags & MSG_SENDPAGE_NOTLAST) || cant_push_more)) > tcp_push(sk, flags, mss_now, tp->nonagle); Good point ! Maybe the following fix then ? diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 1ca2536..7ba0717 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -941,8 +941,10 @@ out: return copied; do_error: - if (copied) + if (copied) { + flags &= ~MSG_SENDPAGE_NOTLAST; goto out; + } out_err: return sk_stream_error(sk, flags, err); } -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Major network performance regression in 3.7
On Sun, 2013-01-06 at 16:51 +0100, Willy Tarreau wrote: > Hi Eric, > > Oh sorry, I didn't really want to pollute the list with links and configs, > especially during the initial report with various combined issues :-( > > The client is my old "inject" tool, available here : > > http://git.1wt.eu/web?p=inject.git > > The server is my "httpterm" tool, available here : > > http://git.1wt.eu/web?p=httpterm.git > Use "-O3 -DENABLE_POLL -DENABLE_EPOLL -DENABLE_SPLICE" for CFLAGS. > > I'm starting httpterm this way : > httpterm -D -L :8000 -P 256 > => it starts a server on port 8000, and sets pipe size to 256 kB. It >uses SPLICE_F_MORE on output data but removing it did not fix the >issue one of the early tests. > > Then I'm starting inject this way : > inject -o 1 -u 1 -G 0:8000/?s=1g > => 1 user, 1 object at a time, and fetch /?s=1g from the loopback. >The server will then emit 1 GB of data using splice(). > > It's possible to disable splicing on the server using -dS. The client > "eats" data using recv(MSG_TRUNC) to avoid a useless copy. > > > TCP has very low defaults concerning initial window, and it appears you > > set RCVBUF to even smaller values. > > Yes, you're right, my bootup scripts still change the default value, though > I increase them to larger values during the tests (except the one where you > saw win 8030 due to the default rmem set to 16060). I've been using this > value in the past with older kernels because it allowed an integer number > of segments to fit into the default window, and offered optimal performance > with large numbers of concurrent connections. Since 2.6, tcp_moderate_rcvbuf > works very well and this is not needed anymore. > > Anyway, it does not affect the test here. Good kernels are OK whatever the > default value, and bad kernels are bad whatever the default value too. > > Hmmm finally it's this commit again : > >2f53384 tcp: allow splice() to build full TSO packets > > I'm saying "again" because we already diagnosed a similar effect several > months ago that was revealed by this patch and we fixed it with the > following one, though I remember that we weren't completely sure it > would fix everything : > >bad115c tcp: do_tcp_sendpages() must try to push data out on oom conditions > > Just out of curiosity, I tried to re-apply the patch above just after the > first one but it did not change anything (after all it changed a symptom > which appeared in different conditions). > > Interestingly, this commit (2f53384) significantly improved performance > on spliced data over the loopback (more than 50% in this test). In 3.7, > it seems to have no positive effect anymore. I reverted it using the > following patch and now the problem is fixed (mtu=64k works fine now) : > > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c > index e457c7a..61e4517 100644 > --- a/net/ipv4/tcp.c > +++ b/net/ipv4/tcp.c > @@ -935,7 +935,7 @@ wait_for_memory: > } > > out: > - if (copied && !(flags & MSG_SENDPAGE_NOTLAST)) > + if (copied) > tcp_push(sk, flags, mss_now, tp->nonagle); > return copied; > > Regards, > Willy > Hmm, I'll have to check if this really can be reverted without hurting vmsplice() again. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Major network performance regression in 3.7
On Sun, Jan 06, 2013 at 08:39:53AM -0800, Eric Dumazet wrote: > Hmm, I'll have to check if this really can be reverted without hurting > vmsplice() again. Looking at the code I've been wondering whether we shouldn't transform the condition to perform the push if we can't push more segments, but I don't know what to rely on. It would be something like this : if (copied && (!(flags & MSG_SENDPAGE_NOTLAST) || cant_push_more)) tcp_push(sk, flags, mss_now, tp->nonagle); Willy -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Major network performance regression in 3.7
Hi Eric, On Sun, Jan 06, 2013 at 06:59:02AM -0800, Eric Dumazet wrote: > On Sun, 2013-01-06 at 10:24 +0100, Willy Tarreau wrote: > > > It does not change anything to the tests above unfortunately. It did not > > even stabilize the unstable runs. > > > > I'll check if I can spot the original commit which caused the regression > > for MTUs that are not n*4096+52. > > Since you don't post your program, I wont be able to help, just by > guessing what it does... Oh sorry, I didn't really want to pollute the list with links and configs, especially during the initial report with various combined issues :-( The client is my old "inject" tool, available here : http://git.1wt.eu/web?p=inject.git The server is my "httpterm" tool, available here : http://git.1wt.eu/web?p=httpterm.git Use "-O3 -DENABLE_POLL -DENABLE_EPOLL -DENABLE_SPLICE" for CFLAGS. I'm starting httpterm this way : httpterm -D -L :8000 -P 256 => it starts a server on port 8000, and sets pipe size to 256 kB. It uses SPLICE_F_MORE on output data but removing it did not fix the issue one of the early tests. Then I'm starting inject this way : inject -o 1 -u 1 -G 0:8000/?s=1g => 1 user, 1 object at a time, and fetch /?s=1g from the loopback. The server will then emit 1 GB of data using splice(). It's possible to disable splicing on the server using -dS. The client "eats" data using recv(MSG_TRUNC) to avoid a useless copy. > TCP has very low defaults concerning initial window, and it appears you > set RCVBUF to even smaller values. Yes, you're right, my bootup scripts still change the default value, though I increase them to larger values during the tests (except the one where you saw win 8030 due to the default rmem set to 16060). I've been using this value in the past with older kernels because it allowed an integer number of segments to fit into the default window, and offered optimal performance with large numbers of concurrent connections. Since 2.6, tcp_moderate_rcvbuf works very well and this is not needed anymore. Anyway, it does not affect the test here. Good kernels are OK whatever the default value, and bad kernels are bad whatever the default value too. Hmmm finally it's this commit again : 2f53384 tcp: allow splice() to build full TSO packets I'm saying "again" because we already diagnosed a similar effect several months ago that was revealed by this patch and we fixed it with the following one, though I remember that we weren't completely sure it would fix everything : bad115c tcp: do_tcp_sendpages() must try to push data out on oom conditions Just out of curiosity, I tried to re-apply the patch above just after the first one but it did not change anything (after all it changed a symptom which appeared in different conditions). Interestingly, this commit (2f53384) significantly improved performance on spliced data over the loopback (more than 50% in this test). In 3.7, it seems to have no positive effect anymore. 
I reverted it using the following patch and now the problem is fixed (mtu=64k works fine now) : diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index e457c7a..61e4517 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -935,7 +935,7 @@ wait_for_memory: } out: - if (copied && !(flags & MSG_SENDPAGE_NOTLAST)) + if (copied) tcp_push(sk, flags, mss_now, tp->nonagle); return copied; Regards, Willy -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
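For readers who want to picture the data path under test without digging through httpterm: the sketch below is a guess at the relevant server-side loop (data source -> pipe -> socket via splice(), with SPLICE_F_MORE as described above). The function name splice_send, the file descriptor source and the 256 kB pipe size (mirroring -P 256) are assumptions for illustration; this is not httpterm's actual code.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    /* Push `total` bytes from `filefd` to the connected socket `sockfd`
     * through a pipe: the splice() path discussed in this thread. */
    static int splice_send(int filefd, int sockfd, size_t total)
    {
            int pfd[2];
            int err = 0;

            if (pipe(pfd) < 0)
                    return -1;
            fcntl(pfd[1], F_SETPIPE_SZ, 256 * 1024);   /* like httpterm -P 256 */

            while (total > 0 && !err) {
                    /* data source -> pipe */
                    ssize_t in = splice(filefd, NULL, pfd[1], NULL, total,
                                        SPLICE_F_MORE | SPLICE_F_MOVE);
                    if (in <= 0) {
                            err = 1;
                            break;
                    }
                    total -= (size_t)in;
                    /* pipe -> socket: this is where pipe_to_sendpage() runs
                     * and where MSG_SENDPAGE_NOTLAST gets set */
                    while (in > 0) {
                            ssize_t out = splice(pfd[0], NULL, sockfd, NULL,
                                                 (size_t)in,
                                                 SPLICE_F_MORE | SPLICE_F_MOVE);
                            if (out <= 0) {
                                    err = 1;
                                    break;
                            }
                            in -= out;
                    }
            }
            close(pfd[0]);
            close(pfd[1]);
            return err ? -1 : 0;
    }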
Re: Major network performance regression in 3.7
On Sun, 2013-01-06 at 10:24 +0100, Willy Tarreau wrote: > It does not change anything to the tests above unfortunately. It did not > even stabilize the unstable runs. > > I'll check if I can spot the original commit which caused the regression > for MTUs that are not n*4096+52. Since you don't post your program, I wont be able to help, just by guessing what it does... TCP has very low defaults concerning initial window, and it appears you set RCVBUF to even smaller values. Here we can see "win 8030", this is not a sane value... 18:32:08.071602 IP 127.0.0.1.26792 > 127.0.0.1.8000: S 2036886615:2036886615(0) win 8030 <mss 65495,nop,nop,sackOK,nop,wscale 9> 18:32:08.071605 IP 127.0.0.1.8000 > 127.0.0.1.26792: S 126397113:126397113(0) ack 2036886616 win 8030 <mss 65495,nop,nop,sackOK,nop,wscale 9> So you apparently changed /proc/sys/net/ipv4/tcp_rmem or SO_RCVBUF ? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
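The "win 8030" Eric points at is shaped by the receive buffer at connect time. As a rough illustration, the sketch below shows the SO_RCVBUF route; the thread later attributes the small window to the default tcp_rmem (16060) rather than an explicit setsockopt(), so the function connect_small_rcvbuf and the hard-coded size are purely illustrative.

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* Connect with a deliberately small receive buffer.  SO_RCVBUF has to be
     * set before connect() for it to shape the window offered in the SYN. */
    static int connect_small_rcvbuf(const char *ip, int port)
    {
            int fd = socket(AF_INET, SOCK_STREAM, 0);
            int sz = 16060;             /* the rmem default mentioned above */
            struct sockaddr_in sa;

            if (fd < 0)
                    return -1;
            setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &sz, sizeof(sz));

            memset(&sa, 0, sizeof(sa));
            sa.sin_family = AF_INET;
            sa.sin_port = htons(port);
            inet_pton(AF_INET, ip, &sa.sin_addr);

            if (connect(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
                    close(fd);
                    return -1;
            }
            return fd;
    }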
Re: Major network performance regression in 3.7
On Sun, Jan 06, 2013 at 11:25:25AM +0100, Willy Tarreau wrote: > OK good news here, the performance drop on the myri was caused by a > problem between the keyboard and the chair. After the reboot series, > I forgot to reload the firmware so the driver used the less efficient > firmware from the NIC (it performs just as if LRO is disabled). > > That makes me think that I should try 3.8-rc2 since LRO was removed > there :-/ Just for the record, I tested 3.8-rc2, and the myri works as fast with GRO there as it used to work with LRO in previous kernels. The softirq work has increased from 26 to 48% but there is no performance drop when using GRO anymore. Andrew has done a good job ! Willy -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Major network performance regression in 3.7
On Sun, Jan 06, 2013 at 12:46:58PM +0100, Romain Francoise wrote: > Willy Tarreau writes: > > > That makes me think that I should try 3.8-rc2 since LRO was removed > > there :-/ > > Better yet, find a way to automate these tests so they can run continually > against net-next and find problems early... There is no way scripts will plug cables and turn on sleeping hardware unfortunately. I'm already following network updates closely enough to spot occasional regressions that are naturally expected due to the number of changes. Also, automated tests won't easily report a behaviour analysis, and behaviour is important in networking. You don't want to accept 100ms pauses all the time for example (and that's just an example). Right now my lab is simplified enough so that I can test something like 100 patches in a week-end, I think that's already fine. Willy -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Major network performance regression in 3.7
Willy Tarreau writes: > That makes me think that I should try 3.8-rc2 since LRO was removed > there :-/ Better yet, find a way to automate these tests so they can run continually against net-next and find problems early... -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Major network performance regression in 3.7
On Sun, Jan 06, 2013 at 10:24:35AM +0100, Willy Tarreau wrote: > But before that I'll try to find the recent one causing the myri10ge to > slow down, it should take less time to bisect. OK good news here, the performance drop on the myri was caused by a problem between the keyboard and the chair. After the reboot series, I forgot to reload the firmware so the driver used the less efficient firmware from the NIC (it performs just as if LRO is disabled). That makes me think that I should try 3.8-rc2 since LRO was removed there :-/ The only remaining issue really is the loopback then. Cheers, Willy -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Major network performance regression in 3.7
On Sat, Jan 05, 2013 at 11:35:24PM -0800, Eric Dumazet wrote: > On Sun, 2013-01-06 at 03:52 +0100, Willy Tarreau wrote: > > > OK so I observed no change with this patch, either on the loopback > > data rate at >16kB MTU, or on the myri. I'm keeping it at hand for > > experimentation anyway. > > > > Yeah, there was no bug. I rewrote it for net-next as a cleanup/optim > only. I have re-applied your last rewrite and noticed a small but nice performance improvement on a single stream over the loopback : 1 session 10 sessions - without the patch : 55.8 Gbps 68.4 Gbps - with the patch: 56.4 Gbps 70.4 Gbps This was with the loopback reverted to 16kB MTU of course. > > Concerning the loopback MTU, I find it strange that the MTU changes > > the splice() behaviour and not send/recv. I thought that there could > > be a relation between the MTU and the pipe size, but it does not > > appear to be the case either, as I tried various sizes between 16kB > > and 256kB without achieving original performance. > > > It probably is related to a too small receive window, given the MTU was > multiplied by 4, I guess we need to make some adjustments In fact even if I set it to 32kB it breaks. I have tried to progressively increase the loopback's MTU from the default 16436, by steps of 4096 : tcp_rmem = 256 kB tcp_rmem = 256 kB pipe size = 64 kB pipe size = 256 kB 16436 : 55.8 Gbps 65.2 Gbps 20532 : 32..48 Gbps unstable24..45 Gbps unstable 24628 : 56.0 Gbps 66.4 Gbps 28724 : 58.6 Gbps 67.8 Gbps 32820 : 54.5 Gbps 61.7 Gbps 36916 : 56.8 Gbps 65.5 Gbps 41012 : 57.8..58.2 Gbps ~stable 67.5..68.8 Gbps ~stable 45108 : 59.4 Gbps 70.0 Gbps 49204 : 61.2 Gbps 71.1 Gbps 53300 : 58.8 Gbps 70.6 Gbps 57396 : 60.2 Gbps 70.8 Gbps 61492 : 61.4 Gbps 71.1 Gbps tcp_rmem = 1 MB tcp_rmem = 1 MB pipe size = 64 kB pipe size = 256 kB 16436 : 16..34 Gbps unstable49.5 or 65.2 Gbps (unstable) 20532 : 7..15 Gbps unstable15..32 Gbps unstable 24628 : 36..48 Gbps unstable34..61 Gbps unstable 28724 : 40..51 Gbps unstable40..69 Gbps unstable 32820 : 40..55 Gbps unstable59.9..62.3 Gbps ~stable 36916 : 38..51 Gbps unstable66.0 Gbps 41012 : 30..42 Gbps unstable47..66 Gbps unstable 45108 : 59.5 Gbps 71.2 Gbps 49204 : 61.3 Gbps 74.0 Gbps 53300 : 63.1 Gbps 74.5 Gbps 57396 : 64.6 Gbps 74.7 Gbps 61492 : 61..66 Gbps unstable76.5 Gbps So as long as we maintain the MTU to n*4096 + 52, performance is still almost OK. It is interesting to see that the transfer rate is unstable at many values and that it depends both on the rmem and pipe size, just as if some segments sometimes remained stuck for too long. And if I pick a value which does not match n*4096+52, such as 61492+2048 = 63540, then the transfer falls to about 50-100 Mbps again. So there's clearly something related to the copy of segments from incomplete pages instead of passing them via the pipe. It is possible that this bug has been there for a long time and that we never detected it because nobody plays with the loopback MTU. I have tried with 2.6.35 : 16436 : 31..33 Gbps 61492 : 48..50 Gbps 63540 : 50..53 Gbps => so at least it's not affected Even forcing the MTU to 16384 maintains 30..33 Gbps almost stable. On 3.5.7.2 : 16436 : 23..27 Gbps 61492 : 61..64 Gbps 63540 : 40..100 Mbps => the problem was already there. Since there were many splice changes in 3.5, I'd suspect that the issue appeared there though I could be wrong. 
> You also could try : > > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c > index 1ca2536..b68cdfb 100644 > --- a/net/ipv4/tcp.c > +++ b/net/ipv4/tcp.c > @@ -1482,6 +1482,9 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t > *desc, > break; > } > used = recv_actor(desc, skb, offset, len); > + /* Clean up data we have read: This will do ACK frames. > */ > + if (used > 0) > + tcp_cleanup_rbuf(sk, used); > if (used < 0) { > if (!copied) > copied = used; It does not change anything to the tests above unfortunately. It did not even stabilize the unstable runs. I'll check if I can spot the original commit which caused the regression for MTUs that are not n*4096+52. But before that I'll try to find the recent one causing the myri10ge to slow down, it should take less time to bisect. Regards, Willy
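The MTU sweep above was presumably driven with something like "ip link set lo mtu N" between runs; for completeness, here is a small sketch of doing the same from C with the SIOCSIFMTU ioctl. The helper set_mtu is not part of Willy's tools, just an illustration.

    #include <net/if.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* Set an interface MTU, e.g. set_mtu("lo", 16436 + 4096) to step
     * through the n*4096 + 52 series used in the sweep above. */
    static int set_mtu(const char *ifname, int mtu)
    {
            struct ifreq ifr;
            int fd = socket(AF_INET, SOCK_DGRAM, 0);
            int ret;

            if (fd < 0)
                    return -1;
            memset(&ifr, 0, sizeof(ifr));
            strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
            ifr.ifr_mtu = mtu;
            ret = ioctl(fd, SIOCSIFMTU, &ifr);
            close(fd);
            return ret;
    }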
Re: Major network performance regression in 3.7
"Willy" == Willy Tarreau w...@1wt.eu writes: Willy> On Sun, Jan 06, 2013 at 04:49:35PM -0500, John Stoffel wrote: "Willy" == Willy Tarreau w...@1wt.eu writes: Willy> On Sun, Jan 06, 2013 at 11:00:15AM -0800, Eric Dumazet wrote: On Sun, 2013-01-06 at 10:51 -0800, Eric Dumazet wrote: (sd->len is usually 4096, which is expected, but sd->total_len value is huge in your case, so we always set the flag in fs/splice.c) I am testing : if (sd->len < sd->total_len && pipe->nrbufs > 1) more |= MSG_SENDPAGE_NOTLAST; Yes, this should fix the problem : If there is no following buffer in the pipe, we should not set NOTLAST. diff --git a/fs/splice.c b/fs/splice.c index 8890604..6909d89 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -696,8 +696,10 @@ static int pipe_to_sendpage(struct pipe_inode_info *pipe, return -EINVAL; more = (sd->flags & SPLICE_F_MORE) ? MSG_MORE : 0; - if (sd->len < sd->total_len) + + if (sd->len < sd->total_len && pipe->nrbufs > 1) more |= MSG_SENDPAGE_NOTLAST; + return file->f_op->sendpage(file, buf->page, buf->offset, sd->len, &pos, more); } Willy> OK it works like a charm here now ! I can't break it anymore, so it Willy> looks like you finally got it ! It's still broken, there's no comments in the code to explain all this magic to mere mortals! *grin* Willy> I would generally agree, but when Eric fixes such a thing, he Willy> generally goes with lengthy details in the commit message. I'm sure he will too, I just wanted to nudge him because while I sorta followed this discussion, I see lots of pain down the road if the code wasn't updated with some nice big fat comments. Great job finding this code and testing, testing, testing. John -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Major network performance regression in 3.7
>>>>> "Willy" == Willy Tarreau <w...@1wt.eu> writes:

Willy> On Sun, Jan 06, 2013 at 11:00:15AM -0800, Eric Dumazet wrote:
Willy> > On Sun, 2013-01-06 at 10:51 -0800, Eric Dumazet wrote:
Willy> > > (sd->len is usually 4096, which is expected, but sd->total_len
Willy> > > value is huge in your case, so we always set the flag in fs/splice.c)
Willy> > >
Willy> > > I am testing :
Willy> > >
Willy> > > if (sd->len < sd->total_len && pipe->nrbufs > 1)
Willy> > > 	more |= MSG_SENDPAGE_NOTLAST;
Willy> >
Willy> > Yes, this should fix the problem : If there is no following buffer
Willy> > in the pipe, we should not set NOTLAST.
Willy> >
Willy> > diff --git a/fs/splice.c b/fs/splice.c
Willy> > index 8890604..6909d89 100644
Willy> > --- a/fs/splice.c
Willy> > +++ b/fs/splice.c
Willy> > @@ -696,8 +696,10 @@ static int pipe_to_sendpage(struct pipe_inode_info *pipe,
Willy> >  		return -EINVAL;
Willy> >  
Willy> >  	more = (sd->flags & SPLICE_F_MORE) ? MSG_MORE : 0;
Willy> > -	if (sd->len < sd->total_len)
Willy> > +
Willy> > +	if (sd->len < sd->total_len && pipe->nrbufs > 1)
Willy> >  		more |= MSG_SENDPAGE_NOTLAST;
Willy> > +
Willy> >  	return file->f_op->sendpage(file, buf->page, buf->offset, sd->len,
Willy> >  				    &pos, more);
Willy> >  }

Willy> OK it works like a charm here now ! I can't break it anymore, so it
Willy> looks like you finally got it !

It's still broken, there's no comments in the code to explain all this
magic to mere mortals!  *grin*

John
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: Major network performance regression in 3.7
On Sun, 2013-01-06 at 03:52 +0100, Willy Tarreau wrote: > OK so I observed no change with this patch, either on the loopback > data rate at >16kB MTU, or on the myri. I'm keeping it at hand for > experimentation anyway. > Yeah, there was no bug. I rewrote it for net-next as a cleanup/optim only. > Concerning the loopback MTU, I find it strange that the MTU changes > the splice() behaviour and not send/recv. I thought that there could > be a relation between the MTU and the pipe size, but it does not > appear to be the case either, as I tried various sizes between 16kB > and 256kB without achieving original performance. It probably is related to a too small receive window, given the MTU was multiplied by 4, I guess we need to make some adjustments You also could try : diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 1ca2536..b68cdfb 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -1482,6 +1482,9 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc, break; } used = recv_actor(desc, skb, offset, len); + /* Clean up data we have read: This will do ACK frames. */ + if (used > 0) + tcp_cleanup_rbuf(sk, used); if (used < 0) { if (!copied) copied = used; -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Major network performance regression in 3.7
On Sat, Jan 05, 2013 at 06:16:31PM -0800, Eric Dumazet wrote:
> On Sat, 2013-01-05 at 17:51 -0800, Eric Dumazet wrote:
> > On Sat, 2013-01-05 at 17:40 -0800, Eric Dumazet wrote:
> > > On Sun, 2013-01-06 at 02:30 +0100, Willy Tarreau wrote:
> > >
> > > > Ah interesting because these were some of the mm patches that I had
> > > > tried to revert.
> > >
> > > Hmm, or we should fix __skb_splice_bits()
> > >
> > > I'll send a patch.
> > >
> >
> > Could you try the following ?
>
> Or more exactly...
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 3ab989b..01f222c 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -1736,11 +1736,8 @@ static bool __splice_segment(struct page *page, unsigned int poff,
>  		return false;
>  	}
>  
> -	/* ignore any bits we already processed */
> -	if (*off) {
> -		__segment_seek(&page, &poff, &plen, *off);
> -		*off = 0;
> -	}
> +	__segment_seek(&page, &poff, &plen, *off);
> +	*off = 0;
>  
>  	do {
>  		unsigned int flen = min(*len, plen);
> @@ -1768,14 +1765,15 @@ static bool __skb_splice_bits(struct sk_buff *skb, struct pipe_inode_info *pipe,
>  			      struct splice_pipe_desc *spd, struct sock *sk)
>  {
>  	int seg;
> +	struct page *page = virt_to_page(skb->data);
> +	unsigned int poff = skb->data - (unsigned char *)page_address(page);
>  
>  	/* map the linear part :
>  	 * If skb->head_frag is set, this 'linear' part is backed by a
>  	 * fragment, and if the head is not shared with any clones then
>  	 * we can avoid a copy since we own the head portion of this page.
>  	 */
> -	if (__splice_segment(virt_to_page(skb->data),
> -			     (unsigned long) skb->data & (PAGE_SIZE - 1),
> +	if (__splice_segment(page, poff,
>  			     skb_headlen(skb),
>  			     offset, len, skb, spd,
>  			     skb_head_is_locked(skb),

OK so I observed no change with this patch, either on the loopback
data rate at >16kB MTU, or on the myri. I'm keeping it at hand for
experimentation anyway.

Concerning the loopback MTU, I find it strange that the MTU changes
the splice() behaviour and not send/recv. I thought that there could
be a relation between the MTU and the pipe size, but it does not
appear to be the case either, as I tried various sizes between 16kB
and 256kB without achieving original performance.

I've started to bisect the 10GE issue again (since both issues are
unrelated), but I'll finish tomorrow, it's time to get some sleep now.

Best regards,
Willy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: Major network performance regression in 3.7
On Sun, 2013-01-06 at 03:32 +0100, Willy Tarreau wrote: > It's 0cf833ae (net: loopback: set default mtu to 64K). And I could > reproduce it with 3.6 by setting loopback's MTU to 65536 by hand. > The trick is that once the MTU has been set to this large a value, > even when I set it back to 16kB the problem persists. > Well, this MTU change can uncover a prior bug, or make it happen faster, for sure. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Major network performance regression in 3.7
On Sat, Jan 05, 2013 at 06:22:13PM -0800, Eric Dumazet wrote: > On Sun, 2013-01-06 at 03:18 +0100, Willy Tarreau wrote: > > On Sat, Jan 05, 2013 at 06:16:31PM -0800, Eric Dumazet wrote: > > > On Sat, 2013-01-05 at 17:51 -0800, Eric Dumazet wrote: > > > > On Sat, 2013-01-05 at 17:40 -0800, Eric Dumazet wrote: > > > > > On Sun, 2013-01-06 at 02:30 +0100, Willy Tarreau wrote: > > > > > > > > > > > Ah interesting because these were some of the mm patches that I had > > > > > > tried to revert. > > > > > > > > > > Hmm, or we should fix __skb_splice_bits() > > > > > > > > > > I'll send a patch. > > > > > > > > > > > > > Could you try the following ? > > > > > > Or more exactly... > > > > The first one did not change a iota unfortunately. I'm about to > > spot the commit causing the loopback regression. It's a few patches > > before the first one you pointed. It's almost finished and I test > > your patch below immediately after. > > I bet you are going to find commit > 69b08f62e17439ee3d436faf0b9a7ca6fffb78db > (net: use bigger pages in __netdev_alloc_frag ) > > Am I wrong ? Yes this time you guessed wrong :-) Well maybe it's participating to the issue. It's 0cf833ae (net: loopback: set default mtu to 64K). And I could reproduce it with 3.6 by setting loopback's MTU to 65536 by hand. The trick is that once the MTU has been set to this large a value, even when I set it back to 16kB the problem persists. Now I'm retrying your other patch to see if it brings the 10GE back to full speed. Willy -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
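For anyone who wants to reproduce the loopback test programmatically rather than with ifconfig/ip, the MTU can also be forced from a small program. A minimal sketch (needs root), assuming only the standard Linux SIOCSIFMTU ioctl; the interface name and value mirror the by-hand test described above:

#include <net/if.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>

/* Force the loopback MTU to 65536, mirroring the "set by hand" test above. */
int main(void)
{
	struct ifreq ifr;
	int fd = socket(AF_INET, SOCK_DGRAM, 0);	/* any socket works for interface ioctls */

	if (fd < 0) {
		perror("socket");
		return 1;
	}

	memset(&ifr, 0, sizeof(ifr));
	strncpy(ifr.ifr_name, "lo", IFNAMSIZ - 1);
	ifr.ifr_mtu = 65536;				/* the value that triggers the issue */

	if (ioctl(fd, SIOCSIFMTU, &ifr) < 0) {
		perror("SIOCSIFMTU");
		close(fd);
		return 1;
	}
	close(fd);
	return 0;
}

Lowering the MTU again goes through the same ioctl with a smaller ifr_mtu.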
Re: Major network performance regression in 3.7
On Sun, 2013-01-06 at 03:18 +0100, Willy Tarreau wrote: > On Sat, Jan 05, 2013 at 06:16:31PM -0800, Eric Dumazet wrote: > > On Sat, 2013-01-05 at 17:51 -0800, Eric Dumazet wrote: > > > On Sat, 2013-01-05 at 17:40 -0800, Eric Dumazet wrote: > > > > On Sun, 2013-01-06 at 02:30 +0100, Willy Tarreau wrote: > > > > > > > > > Ah interesting because these were some of the mm patches that I had > > > > > tried to revert. > > > > > > > > Hmm, or we should fix __skb_splice_bits() > > > > > > > > I'll send a patch. > > > > > > > > > > Could you try the following ? > > > > Or more exactly... > > The first one did not change a iota unfortunately. I'm about to > spot the commit causing the loopback regression. It's a few patches > before the first one you pointed. It's almost finished and I test > your patch below immediately after. I bet you are going to find commit 69b08f62e17439ee3d436faf0b9a7ca6fffb78db (net: use bigger pages in __netdev_alloc_frag ) Am I wrong ? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Major network performance regression in 3.7
On Sat, Jan 05, 2013 at 06:16:31PM -0800, Eric Dumazet wrote:
> On Sat, 2013-01-05 at 17:51 -0800, Eric Dumazet wrote:
> > On Sat, 2013-01-05 at 17:40 -0800, Eric Dumazet wrote:
> > > On Sun, 2013-01-06 at 02:30 +0100, Willy Tarreau wrote:
> > >
> > > > Ah interesting because these were some of the mm patches that I had
> > > > tried to revert.
> > >
> > > Hmm, or we should fix __skb_splice_bits()
> > >
> > > I'll send a patch.
> > >
> >
> > Could you try the following ?
>
> Or more exactly...

The first one did not change an iota unfortunately. I'm about to
spot the commit causing the loopback regression. It's a few patches
before the first one you pointed. It's almost finished and I test
your patch below immediately after.

Thanks,
Willy

> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 3ab989b..01f222c 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -1736,11 +1736,8 @@ static bool __splice_segment(struct page *page, unsigned int poff,
>  		return false;
>  	}
>  
> -	/* ignore any bits we already processed */
> -	if (*off) {
> -		__segment_seek(&page, &poff, &plen, *off);
> -		*off = 0;
> -	}
> +	__segment_seek(&page, &poff, &plen, *off);
> +	*off = 0;
>  
> 	do {
>  		unsigned int flen = min(*len, plen);
> @@ -1768,14 +1765,15 @@ static bool __skb_splice_bits(struct sk_buff *skb, struct pipe_inode_info *pipe,
>  			      struct splice_pipe_desc *spd, struct sock *sk)
>  {
>  	int seg;
> +	struct page *page = virt_to_page(skb->data);
> +	unsigned int poff = skb->data - (unsigned char *)page_address(page);
>  
>  	/* map the linear part :
>  	 * If skb->head_frag is set, this 'linear' part is backed by a
>  	 * fragment, and if the head is not shared with any clones then
>  	 * we can avoid a copy since we own the head portion of this page.
>  	 */
> -	if (__splice_segment(virt_to_page(skb->data),
> -			     (unsigned long) skb->data & (PAGE_SIZE - 1),
> +	if (__splice_segment(page, poff,
>  			     skb_headlen(skb),
>  			     offset, len, skb, spd,
>  			     skb_head_is_locked(skb),
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: Major network performance regression in 3.7
On Sat, 2013-01-05 at 17:51 -0800, Eric Dumazet wrote:
> On Sat, 2013-01-05 at 17:40 -0800, Eric Dumazet wrote:
> > On Sun, 2013-01-06 at 02:30 +0100, Willy Tarreau wrote:
> >
> > > Ah interesting because these were some of the mm patches that I had
> > > tried to revert.
> >
> > Hmm, or we should fix __skb_splice_bits()
> >
> > I'll send a patch.
> >
>
> Could you try the following ?

Or more exactly...

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 3ab989b..01f222c 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1736,11 +1736,8 @@ static bool __splice_segment(struct page *page, unsigned int poff,
 		return false;
 	}
 
-	/* ignore any bits we already processed */
-	if (*off) {
-		__segment_seek(&page, &poff, &plen, *off);
-		*off = 0;
-	}
+	__segment_seek(&page, &poff, &plen, *off);
+	*off = 0;
 
 	do {
 		unsigned int flen = min(*len, plen);
@@ -1768,14 +1765,15 @@ static bool __skb_splice_bits(struct sk_buff *skb, struct pipe_inode_info *pipe,
 			      struct splice_pipe_desc *spd, struct sock *sk)
 {
 	int seg;
+	struct page *page = virt_to_page(skb->data);
+	unsigned int poff = skb->data - (unsigned char *)page_address(page);
 
 	/* map the linear part :
 	 * If skb->head_frag is set, this 'linear' part is backed by a
 	 * fragment, and if the head is not shared with any clones then
 	 * we can avoid a copy since we own the head portion of this page.
 	 */
-	if (__splice_segment(virt_to_page(skb->data),
-			     (unsigned long) skb->data & (PAGE_SIZE - 1),
+	if (__splice_segment(page, poff,
 			     skb_headlen(skb),
 			     offset, len, skb, spd,
 			     skb_head_is_locked(skb),
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: Major network performance regression in 3.7
On Sat, 2013-01-05 at 17:40 -0800, Eric Dumazet wrote: > On Sun, 2013-01-06 at 02:30 +0100, Willy Tarreau wrote: > > > Ah interesting because these were some of the mm patches that I had > > tried to revert. > > Hmm, or we should fix __skb_splice_bits() > > I'll send a patch. > Could you try the following ? diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 3ab989b..c5246be 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -1768,14 +1768,15 @@ static bool __skb_splice_bits(struct sk_buff *skb, struct pipe_inode_info *pipe, struct splice_pipe_desc *spd, struct sock *sk) { int seg; + struct page *page = virt_to_page(skb->data); + unsigned int poff = skb->data - (unsigned char *)page_address(page); /* map the linear part : * If skb->head_frag is set, this 'linear' part is backed by a * fragment, and if the head is not shared with any clones then * we can avoid a copy since we own the head portion of this page. */ - if (__splice_segment(virt_to_page(skb->data), -(unsigned long) skb->data & (PAGE_SIZE - 1), + if (__splice_segment(page, poff, skb_headlen(skb), offset, len, skb, spd, skb_head_is_locked(skb), -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Major network performance regression in 3.7
On Sun, 2013-01-06 at 02:30 +0100, Willy Tarreau wrote: > Ah interesting because these were some of the mm patches that I had > tried to revert. Hmm, or we should fix __skb_splice_bits() I'll send a patch. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Major network performance regression in 3.7
On Sat, Jan 05, 2013 at 05:21:16PM -0800, Eric Dumazet wrote: > On Sun, 2013-01-06 at 01:50 +0100, Willy Tarreau wrote: > > > Yes, I've removed all zero counters in this short view for easier > > reading (complete version appended at the end of this email). This > > was after around 140 GB were transferred : > > OK I only wanted to make sure skb were not linearized in xmit. > > Could you try to disable CONFIG_COMPACTION ? It's already disabled. > ( This is the other thread mentioning this : "ppoll() stuck on POLLIN > while TCP peer is sending" ) Ah interesting because these were some of the mm patches that I had tried to revert. Willy -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Major network performance regression in 3.7
On Sun, 2013-01-06 at 01:50 +0100, Willy Tarreau wrote: > Yes, I've removed all zero counters in this short view for easier > reading (complete version appended at the end of this email). This > was after around 140 GB were transferred : OK I only wanted to make sure skb were not linearized in xmit. Could you try to disable CONFIG_COMPACTION ? ( This is the other thread mentioning this : "ppoll() stuck on POLLIN while TCP peer is sending" ) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Major network performance regression in 3.7
On Sat, Jan 05, 2013 at 04:02:03PM -0800, Eric Dumazet wrote: > On Sun, 2013-01-06 at 00:29 +0100, Willy Tarreau wrote: > > > > 2) Another possibility would be that Myri card/driver doesnt like very > > > well high order pages. > > > > It looks like it has not changed much since 3.6 :-/ I really suspect > > something is wrong with memory allocation. I have tried reverting many > > patches affecting the mm/ directory just in case but I did not come to > > anything useful yet. > > > > Hmm, I was referring to TCP stack now using order-3 pages instead of > order-0 ones > > See commit 5640f7685831e088fe6c2e1f863a6805962f8e81 > (net: use a per task frag allocator) OK, so you think there are two distinct problems ? I have tried to revert this one but it did not change the performance, I'm still saturating at ~6.9 Gbps. > Could you please post : > > ethtool -S eth0 Yes, I've removed all zero counters in this short view for easier reading (complete version appended at the end of this email). This was after around 140 GB were transferred : # ethtool -S eth1|grep -vw 0 NIC statistics: rx_packets: 8001500 tx_packets: 10015409 rx_bytes: 480115998 tx_bytes: 148825674976 tx_boundary: 2048 WC: 1 irq: 45 MSI: 1 read_dma_bw_MBs: 1200 write_dma_bw_MBs: 1614 read_write_dma_bw_MBs: 2101 serial_number: 320061 link_changes: 2 link_up: 1 tx_pkt_start: 10015409 tx_pkt_done: 10015409 tx_req: 93407411 tx_done: 93407411 rx_small_cnt: 8001500 wake_queue: 187727 stop_queue: 187727 LRO aggregated: 146 LRO flushed: 146 LRO avg aggr: 1 LRO no_desc: 80 Quite honnestly, this is typically the pattern what I'm used to observe here. I'm now trying to bisect, hopefully we'll get something exploitable. Cheers, Willy - full ethtool -S NIC statistics: rx_packets: 8001500 tx_packets: 10015409 rx_bytes: 480115998 tx_bytes: 148825674976 rx_errors: 0 tx_errors: 0 rx_dropped: 0 tx_dropped: 0 multicast: 0 collisions: 0 rx_length_errors: 0 rx_over_errors: 0 rx_crc_errors: 0 rx_frame_errors: 0 rx_fifo_errors: 0 rx_missed_errors: 0 tx_aborted_errors: 0 tx_carrier_errors: 0 tx_fifo_errors: 0 tx_heartbeat_errors: 0 tx_window_errors: 0 tx_boundary: 2048 WC: 1 irq: 45 MSI: 1 MSIX: 0 read_dma_bw_MBs: 1200 write_dma_bw_MBs: 1614 read_write_dma_bw_MBs: 2101 serial_number: 320061 watchdog_resets: 0 link_changes: 2 link_up: 1 dropped_link_overflow: 0 dropped_link_error_or_filtered: 0 dropped_pause: 0 dropped_bad_phy: 0 dropped_bad_crc32: 0 dropped_unicast_filtered: 0 dropped_multicast_filtered: 0 dropped_runt: 0 dropped_overrun: 0 dropped_no_small_buffer: 0 dropped_no_big_buffer: 0 --- slice -: 0 tx_pkt_start: 10015409 tx_pkt_done: 10015409 tx_req: 93407411 tx_done: 93407411 rx_small_cnt: 8001500 rx_big_cnt: 0 wake_queue: 187727 stop_queue: 187727 tx_linearized: 0 LRO aggregated: 146 LRO flushed: 146 LRO avg aggr: 1 LRO no_desc: 80 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Major network performance regression in 3.7
On Sun, 2013-01-06 at 00:29 +0100, Willy Tarreau wrote: > > 2) Another possibility would be that Myri card/driver doesnt like very > > well high order pages. > > It looks like it has not changed much since 3.6 :-/ I really suspect > something is wrong with memory allocation. I have tried reverting many > patches affecting the mm/ directory just in case but I did not come to > anything useful yet. > Hmm, I was referring to TCP stack now using order-3 pages instead of order-0 ones See commit 5640f7685831e088fe6c2e1f863a6805962f8e81 (net: use a per task frag allocator) Could you please post : ethtool -S eth0 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Major network performance regression in 3.7
Hi Eric, On Sat, Jan 05, 2013 at 03:18:46PM -0800, Eric Dumazet wrote: > Hi Willy, another good finding during the week end ! ;) Yes, I wanted to experiment with TFO and stopped on this :-) > 1) This looks like interrupts are spreaded on multiple cpus, and this > give Out Of Order problems with TCP stack. No, I forgot to mention this, I have tried to bind IRQs to a single core, with the server either on the same or another one, but the problem remained. Also, the loopback is much more affected and doesn't use IRQs. And BTW tcpdump on the loopback shouldn't drop that many packets (up to 90% even at low rate). I just noticed something, transferring data using netcat on the loopback doesn't affect tcpdump. So it's likely only the spliced data that are affected. > 2) Another possibility would be that Myri card/driver doesnt like very > well high order pages. It looks like it has not changed much since 3.6 :-/ I really suspect something is wrong with memory allocation. I have tried reverting many patches affecting the mm/ directory just in case but I did not come to anything useful yet. I'm continuing to dig. Thanks, Willy -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Major network performance regression in 3.7
On Sat, 2013-01-05 at 22:49 +0100, Willy Tarreau wrote: > Hi, > > I'm observing multiple apparently unrelated network performance > issues in 3.7, to the point that I'm doubting it comes from the > network stack. > > My setup involves 3 machines connected point-to-point with myri > 10GE NICs (the middle machine has 2 NICs). The middle machine > normally runs haproxy, the other two run either an HTTP load > generator or a dummy web server : > > > [ client ] <> [ haproxy ] <> [ server ] > > Usually transferring HTTP objects from the server to the client > via haproxy causes no problem at 10 Gbps for moderately large > objects. > > This time I observed that it was not possible to go beyond 6.8 Gbps, > with all the chain idling a lot. I tried to change the IRQ rate, CPU > affinity, tcp_rmem/tcp_wmem, disabling flow control, etc... the usual > knobs, nothing managed to go beyond. > > So I removed haproxy from the equation, and simply started the client > on the middle machine. Same issue. I thought about concurrency issues, > so I reduced to a single connection, and nothing changed (usually I > achieve 10G even with a single connection with large enough TCP windows). > I tried to start tcpdump and the transfer immediately stalled and did not > come back after I stopped tcpdump. This was reproducible several times > but not always. > > So I first thought about an issue in the myri10ge driver and wanted to > confirm that everything was OK on the middle machine. > > I started the server on it and aimed the client at it via the loopback. > The transfer rate was even worse : randomly oscillating between 10 and > 100 MB/s ! Normally on the loop back, I get several GB/s here. > > Running tcpdump on the loopback showed be several very concerning issues : > > 1) lots of packets are lost before reaching tcpdump. The trace shows that >these segments are ACKed so they're correctly received, but tcpdump >does not get them. Tcpdump stats at the end report impressive numbers, >around 90% packet dropped from the capture! > > 2) ACKs seem to be immediately delivered but do not trigger sending, the >system seems to be running with delayed ACKs, as it waits 40 or 200ms >before restarting, and this is visible even in the first round trips : > >- connection setup : > >18:32:08.071602 IP 127.0.0.1.26792 > 127.0.0.1.8000: S > 2036886615:2036886615(0) win 8030 >18:32:08.071605 IP 127.0.0.1.8000 > 127.0.0.1.26792: S > 126397113:126397113(0) ack 2036886616 win 8030 65495,nop,nop,sackOK,nop,wscale 9> >18:32:08.071614 IP 127.0.0.1.26792 > 127.0.0.1.8000: . ack 126397114 win 16 > >- GET /?s=1g HTTP/1.0 > >18:32:08.071649 IP 127.0.0.1.26792 > 127.0.0.1.8000: P > 2036886616:2036886738(122) ack 126397114 win 16 > >- HTTP/1.1 200 OK with the beginning of the response : > >18:32:08.071672 IP 127.0.0.1.8000 > 127.0.0.1.26792: . > 126397114:126401210(4096) ack 2036886738 win 16 >18:32:08.071676 IP 127.0.0.1.26792 > 127.0.0.1.8000: . ack 126401210 win > 250 >==> 200ms pause here >18:32:08.275493 IP 127.0.0.1.8000 > 127.0.0.1.26792: P > 126401210:126463006(61796) ack 2036886738 win 16 >==> 40ms pause here >18:32:08.315493 IP 127.0.0.1.26792 > 127.0.0.1.8000: . ack 126463006 win > 256 >18:32:08.315498 IP 127.0.0.1.8000 > 127.0.0.1.26792: . > 126463006:126527006(64000) ack 2036886738 win 16 > >... and so on > >My server is using splice() with the SPLICE_F_MORE flag to send data. >I noticed that not using splice and relying on send(MSG_MORE) instead >I don't get the issue. 
> > 3) I wondered if this had something to do with the 64k MTU on the loopback >so I lowered it to 16kB. The performance was even worse (about 5MB/s). >Starting tcpdump managed to make my transfer stall, just like with the >myri10ge. In this last test, I noticed that there were some real drops, >because there were some SACKs : > >18:45:16.699951 IP 127.0.0.1.8000 > 127.0.0.1.8002: P > 956153186:956169530(16344) ack 131668746 win 16 >18:45:16.699956 IP 127.0.0.1.8002 > 127.0.0.1.8000: . ack 956169530 win 64 >18:45:16.904119 IP 127.0.0.1.8000 > 127.0.0.1.8002: P > 957035762:957052106(16344) ack 131668746 win 16 >18:45:16.904122 IP 127.0.0.1.8002 > 127.0.0.1.8000: . ack 957052106 win 703 >18:45:16.904124 IP 127.0.0.1.8000 > 127.0.0.1.8002: P > 957052106:957099566(47460) ack 131668746 win 16 >18:45:17.108117 IP 127.0.0.1.8000 > 127.0.0.1.8002: P > 957402550:957418894(16344) ack 131668746 win 16 >18:45:17.108119 IP 127.0.0.1.8002 > 127
Major network performance regression in 3.7
Hi, I'm observing multiple apparently unrelated network performance issues in 3.7, to the point that I'm doubting it comes from the network stack. My setup involves 3 machines connected point-to-point with myri 10GE NICs (the middle machine has 2 NICs). The middle machine normally runs haproxy, the other two run either an HTTP load generator or a dummy web server : [ client ] <> [ haproxy ] <> [ server ] Usually transferring HTTP objects from the server to the client via haproxy causes no problem at 10 Gbps for moderately large objects. This time I observed that it was not possible to go beyond 6.8 Gbps, with all the chain idling a lot. I tried to change the IRQ rate, CPU affinity, tcp_rmem/tcp_wmem, disabling flow control, etc... the usual knobs, nothing managed to go beyond. So I removed haproxy from the equation, and simply started the client on the middle machine. Same issue. I thought about concurrency issues, so I reduced to a single connection, and nothing changed (usually I achieve 10G even with a single connection with large enough TCP windows). I tried to start tcpdump and the transfer immediately stalled and did not come back after I stopped tcpdump. This was reproducible several times but not always. So I first thought about an issue in the myri10ge driver and wanted to confirm that everything was OK on the middle machine. I started the server on it and aimed the client at it via the loopback. The transfer rate was even worse : randomly oscillating between 10 and 100 MB/s ! Normally on the loop back, I get several GB/s here. Running tcpdump on the loopback showed be several very concerning issues : 1) lots of packets are lost before reaching tcpdump. The trace shows that these segments are ACKed so they're correctly received, but tcpdump does not get them. Tcpdump stats at the end report impressive numbers, around 90% packet dropped from the capture! 2) ACKs seem to be immediately delivered but do not trigger sending, the system seems to be running with delayed ACKs, as it waits 40 or 200ms before restarting, and this is visible even in the first round trips : - connection setup : 18:32:08.071602 IP 127.0.0.1.26792 > 127.0.0.1.8000: S 2036886615:2036886615(0) win 8030 18:32:08.071605 IP 127.0.0.1.8000 > 127.0.0.1.26792: S 126397113:126397113(0) ack 2036886616 win 8030 18:32:08.071614 IP 127.0.0.1.26792 > 127.0.0.1.8000: . ack 126397114 win 16 - GET /?s=1g HTTP/1.0 18:32:08.071649 IP 127.0.0.1.26792 > 127.0.0.1.8000: P 2036886616:2036886738(122) ack 126397114 win 16 - HTTP/1.1 200 OK with the beginning of the response : 18:32:08.071672 IP 127.0.0.1.8000 > 127.0.0.1.26792: . 126397114:126401210(4096) ack 2036886738 win 16 18:32:08.071676 IP 127.0.0.1.26792 > 127.0.0.1.8000: . ack 126401210 win 250 ==> 200ms pause here 18:32:08.275493 IP 127.0.0.1.8000 > 127.0.0.1.26792: P 126401210:126463006(61796) ack 2036886738 win 16 ==> 40ms pause here 18:32:08.315493 IP 127.0.0.1.26792 > 127.0.0.1.8000: . ack 126463006 win 256 18:32:08.315498 IP 127.0.0.1.8000 > 127.0.0.1.26792: . 126463006:126527006(64000) ack 2036886738 win 16 ... and so on My server is using splice() with the SPLICE_F_MORE flag to send data. I noticed that not using splice and relying on send(MSG_MORE) instead I don't get the issue. 3) I wondered if this had something to do with the 64k MTU on the loopback so I lowered it to 16kB. The performance was even worse (about 5MB/s). Starting tcpdump managed to make my transfer stall, just like with the myri10ge. 
In this last test, I noticed that there were some real drops, because there were some SACKs : 18:45:16.699951 IP 127.0.0.1.8000 > 127.0.0.1.8002: P 956153186:956169530(16344) ack 131668746 win 16 18:45:16.699956 IP 127.0.0.1.8002 > 127.0.0.1.8000: . ack 956169530 win 64 18:45:16.904119 IP 127.0.0.1.8000 > 127.0.0.1.8002: P 957035762:957052106(16344) ack 131668746 win 16 18:45:16.904122 IP 127.0.0.1.8002 > 127.0.0.1.8000: . ack 957052106 win 703 18:45:16.904124 IP 127.0.0.1.8000 > 127.0.0.1.8002: P 957052106:957099566(47460) ack 131668746 win 16 18:45:17.108117 IP 127.0.0.1.8000 > 127.0.0.1.8002: P 957402550:957418894(16344) ack 131668746 win 16 18:45:17.108119 IP 127.0.0.1.8002 > 127.0.0.1.8000: . ack 957418894 win 1846 18:45:17.312115 IP 127.0.0.1.8000 > 127.0.0.1.8002: P 957672806:957689150(16344) ack 131668746 win 16 18:45:17.312117 IP 127.0.0.1.8002 > 127.0.0.1.8000: . ack 957689150 win 2902 18:45:17.516114 IP 127.0.0.1.8000 > 127.0.0.1.8002: P 958962966:958979310(16344) ack 131668746 win 16 18:45:17.516116 IP 127.0.0.1.8002 > 127.0.0.1.8000: . ack 958979310 win 7941 18:45:17.516150 IP 127.0.0.1.8002 > 127.0.0.1.8000: . ack 959503678 win 9926 18:45:17.516151 IP 127.0.0.1.8002 > 127.0.0.1.8000: . ack 959503678 win 9926 Plea
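Since the whole report revolves around spliced transfers, here is a minimal sketch of the kind of socket-to-socket forwarding loop described above: splice() from the source into a pipe, then from the pipe into the destination, hinting with SPLICE_F_MORE that more data follows. It is illustrative only (chunking and error handling are simplified) and is not haproxy's or the test server's actual code.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Forward up to 'total' bytes from src_fd (a connected socket or a file)
 * to dst_fd (a connected socket) through an anonymous pipe, without
 * copying the data through user space. A real server would drop
 * SPLICE_F_MORE on the final chunk.
 */
static int splice_forward(int src_fd, int dst_fd, size_t total)
{
	int pipefd[2];

	if (pipe(pipefd) < 0) {
		perror("pipe");
		return -1;
	}

	while (total > 0) {
		/* source -> pipe */
		ssize_t in = splice(src_fd, NULL, pipefd[1], NULL, total,
				    SPLICE_F_MOVE | SPLICE_F_MORE);
		if (in <= 0) {
			if (in < 0)
				perror("splice in");
			break;
		}

		/* pipe -> destination; SPLICE_F_MORE hints more data follows */
		ssize_t left = in;
		while (left > 0) {
			ssize_t out = splice(pipefd[0], NULL, dst_fd, NULL, left,
					     SPLICE_F_MOVE | SPLICE_F_MORE);
			if (out <= 0) {
				if (out < 0)
					perror("splice out");
				goto done;
			}
			left  -= out;
			total -= out;
		}
	}
done:
	close(pipefd[0]);
	close(pipefd[1]);
	return 0;
}

/* Tiny demo: forward a redirected stdin to a redirected stdout. */
int main(void)
{
	return splice_forward(STDIN_FILENO, STDOUT_FILENO, (size_t)-1) < 0;
}

It is transfers of this shape, a large total request spliced through a pipe that is often only partially filled, that kept hitting the MSG_SENDPAGE_NOTLAST logic fixed at the top of this thread.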