Re: [PATCH] apparmor: Fix network performance issue in aa_label_sk_perm

2018-09-07 Thread Tony Jones
On 09/07/2018 09:37 AM, John Johansen wrote:

> hey Tony,
> 
> Thanks for the patch. I am curious: did your investigation look
> into which parts of DEFINE_AUDIT_SK are causing the issue?

Hi JJ.

Attached are the perf annotations for DEFINE_AUDIT_SK (percentages are relative
to the function). Our kernel performance testing is carried out with default
installs, which means AppArmor is enabled but the performance tests run
unconfined. It was obvious that the overhead of DEFINE_AUDIT_SK was significant
for smaller packet sizes (typical of synthetic benchmarks) and that it didn't
need to execute in the unconfined case, hence the patch. I didn't spend any
time looking at the performance of confined tasks; it may be worth your time
to look at this.

Comparing my current tip (2601dd392dd1) to tip+patch, I'm seeing an increase
of 3-6% in netperf throughput for packet sizes 64-1024.

HTH

Tony

 Percent |  Source code & Disassembly of vmlinux for cycles:ppp (117 samples)
--------------------------------------------------------------------------------
         :
         :  Disassembly of section .text:
         :
         :  813fbec0 <aa_label_sk_perm>:
         :  aa_label_sk_perm():
         :      type));
         :  }
         :
         :  static int aa_label_sk_perm(struct aa_label *label, const char *op, u32 request,
         :                              struct sock *sk)
         :  {
    0.00 :  813fbec0:   callq  81a017f0 <__fentry__>
    2.56 :  813fbec5:   push   %r14
    0.00 :  813fbec7:   mov    %rcx,%r14
         :  struct aa_profile *profile;
         :  DEFINE_AUDIT_SK(sa, op, sk);
    0.00 :  813fbeca:   mov    $0x7,%ecx
         :  {
    0.00 :  813fbecf:   push   %r13
    3.42 :  813fbed1:   mov    %edx,%r13d
    0.00 :  813fbed4:   push   %r12
    0.00 :  813fbed6:   push   %rbp
    0.00 :  813fbed7:   mov    %rdi,%rbp
    5.13 :  813fbeda:   push   %rbx
    0.00 :  813fbedb:   sub    $0xb8,%rsp
         :  DEFINE_AUDIT_SK(sa, op, sk);
    0.00 :  813fbee2:   movzwl 0x10(%r14),%r9d
         :  {
    1.71 :  813fbee7:   mov    %gs:0x28,%rax
    0.00 :  813fbef0:   mov    %rax,0xb0(%rsp)
    0.00 :  813fbef8:   xor    %eax,%eax
         :  DEFINE_AUDIT_SK(sa, op, sk);
    0.00 :  813fbefa:   lea    0x78(%rsp),%rdx
    1.71 :  813fbeff:   lea    0x20(%rsp),%r8
    0.00 :  813fbf04:   movq   $0x0,(%rsp)
    0.00 :  813fbf0c:   movq   $0x0,0x10(%rsp)
    0.00 :  813fbf15:   mov    %rdx,%rdi
   14.53 :  813fbf18:   rep stos %rax,%es:(%rdi)
    1.71 :  813fbf1b:   mov    $0xb,%ecx
    0.00 :  813fbf20:   mov    %r8,%rdi
    0.00 :  813fbf23:   mov    %r14,0x80(%rsp)
   18.80 :  813fbf2b:   rep stos %rax,%es:(%rdi)
    0.00 :  813fbf2e:   mov    %rsi,0x28(%rsp)
    1.71 :  813fbf33:   mov    %r9w,0x88(%rsp)
    0.00 :  813fbf3c:   cmp    $0x1,%r9w
    0.00 :  813fbf41:   je     813fbfa1 <aa_label_sk_perm+0xe1>
    0.00 :  813fbf43:   mov    $0x2,%eax
    0.00 :  813fbf48:   test   %r14,%r14
    0.00 :  813fbf4b:   je     813fbfa1 <aa_label_sk_perm+0xe1>
   14.53 :  813fbf4d:   mov    %al,(%rsp)
    0.00 :  813fbf50:   movzwl 0x1ea(%r14),%eax
         :  AA_BUG(!sk);
         :
         :  if (unconfined(label))
         :          return 0;
         :
         :  return fn_for_each_confined(label, profile,
    0.00 :  813fbf58:   xor    %r12d,%r12d
         :  DEFINE_AUDIT_SK(sa, op, sk);
    0.00 :  813fbf5b:   mov    %r8,0x18(%rsp)
    8.55 :  813fbf60:   mov    %eax,0x58(%rsp)
    0.00 :  813fbf64:   movzbl 0x1e9(%r14),%eax
    0.00 :  813fbf6c:   mov    %rdx,0x8(%rsp)
    0.00 :  813fbf71:   mov    %eax,0x5c(%rsp)
         :  if (unconfined(label))
    8.55 :  813fbf75:   testb  $0x2,0x40(%rbp)
    0.00 :  813fbf79:   je     813fbfa8 <aa_label_sk_perm+0xe8>
         :          aa_profile_af_sk_perm(profile, &sa, request, sk));
         :  }
    0.00 :  813fbf7b:   mov    0xb0(%rsp),%rdx
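The hot spots above are the two `rep stos` instructions (14.53% and 18.80%): the zeroing of the on-stack audit structures that DEFINE_AUDIT_SK expands to. A hypothetical userspace sketch of the shape of the fix — deferring an expensive zero-initialization into the only branch that needs it (the struct, its size, and the function names are illustrative, not the kernel's):

```c
#include <string.h>

/* Stand-in for the audit payload DEFINE_AUDIT_SK zeroes (size is a guess). */
struct audit_data {
	char buf[200];
	int type;
};

static int check_confined(const struct audit_data *ad, int request)
{
	/* Placeholder for the per-profile permission walk. */
	return (ad->type & request) ? 0 : -1;
}

/* Before: the struct is zeroed on every call, confined or not. */
int perm_before(int unconfined, int request)
{
	struct audit_data ad = { 0 };	/* memset happens unconditionally */

	ad.type = request;
	if (unconfined)
		return 0;
	return check_confined(&ad, request);
}

/* After: the zeroing only runs when the label is confined. */
int perm_after(int unconfined, int request)
{
	int error = 0;

	if (!unconfined) {
		struct audit_data ad = { 0 };	/* memset only on this path */

		ad.type = request;
		error = check_confined(&ad, request);
	}
	return error;
}
```

The unconfined fast path in `perm_after` never touches the 200-byte struct, which is exactly what removes the `rep stos` cost for unconfined tasks.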

Re: [PATCH] apparmor: Fix network performance issue in aa_label_sk_perm

2018-09-07 Thread John Johansen
On 09/06/2018 09:33 PM, Tony Jones wrote:
> The netperf benchmark shows a 5.73% reduction in throughput for 
> small (64 byte) transfers by unconfined tasks.
> 
> DEFINE_AUDIT_SK() in aa_label_sk_perm() should not be performed 
> unconditionally, rather only when the label is confined.
> 
> netperf-tcp
>                  56974a6fc^            56974a6fc
> Min    64         563.48 (   0.00%)     531.17 (  -5.73%)
> Min    128       1056.92 (   0.00%)     999.44 (  -5.44%)
> Min    256       1945.95 (   0.00%)    1867.97 (  -4.01%)
> Min    1024      6761.40 (   0.00%)    6364.23 (  -5.87%)
> Min    2048         0.53 (   0.00%)   10606.20 (  -4.54%)
> Min    3312     13692.67 (   0.00%)   13158.41 (  -3.90%)
> Min    4096     14926.29 (   0.00%)   14457.46 (  -3.14%)
> Min    8192     18399.34 (   0.00%)   18091.65 (  -1.67%)
> Min    16384    21384.13 (   0.00%)   21158.05 (  -1.06%)
> Hmean  64         564.96 (   0.00%)     534.38 (  -5.41%)
> Hmean  128       1064.42 (   0.00%)    1010.12 (  -5.10%)
> Hmean  256       1965.85 (   0.00%)    1879.16 (  -4.41%)
> Hmean  1024      6839.77 (   0.00%)    6478.70 (  -5.28%)
> Hmean  2048     11154.80 (   0.00%)   10671.13 (  -4.34%)
> Hmean  3312     13838.12 (   0.00%)   13249.01 (  -4.26%)
> Hmean  4096     15009.99 (   0.00%)   14561.36 (  -2.99%)
> Hmean  8192     18975.57 (   0.00%)   18326.54 (  -3.42%)
> Hmean  16384    21440.44 (   0.00%)   21324.59 (  -0.54%)
> Stddev 64           1.24 (   0.00%)       2.85 (-130.64%)
> Stddev 128          4.51 (   0.00%)       6.53 ( -44.84%)
> Stddev 256         11.67 (   0.00%)       8.50 (  27.16%)
> Stddev 1024        48.33 (   0.00%)      75.07 ( -55.34%)
> Stddev 2048        54.82 (   0.00%)      65.16 ( -18.86%)
> Stddev 3312       153.57 (   0.00%)      56.29 (  63.35%)
> Stddev 4096       100.25 (   0.00%)      88.50 (  11.72%)
> Stddev 8192       358.13 (   0.00%)     169.99 (  52.54%)
> Stddev 16384       43.99 (   0.00%)     141.82 (-222.39%)
> 
> Signed-off-by: Tony Jones 
> Fixes: 56974a6fcfef ("apparmor: add base infastructure for socket mediation")

hey Tony,

Thanks for the patch. I am curious: did your investigation look
into which parts of DEFINE_AUDIT_SK are causing the issue?

Regardless, I have pulled it into apparmor-next.

> ---
>  security/apparmor/net.c | 15 +--
>  1 file changed, 9 insertions(+), 6 deletions(-)
> 
> diff --git a/security/apparmor/net.c b/security/apparmor/net.c
> index bb24cfa0a164..d5d72dd1ca1f 100644
> --- a/security/apparmor/net.c
> +++ b/security/apparmor/net.c
> @@ -146,17 +146,20 @@ int aa_af_perm(struct aa_label *label, const char *op, u32 request, u16 family,
>  static int aa_label_sk_perm(struct aa_label *label, const char *op, u32 request,
>                              struct sock *sk)
>  {
> - struct aa_profile *profile;
> - DEFINE_AUDIT_SK(sa, op, sk);
> + int error = 0;
>  
>   AA_BUG(!label);
>   AA_BUG(!sk);
>  
> - if (unconfined(label))
> - return 0;
> + if (!unconfined(label)) {
> + struct aa_profile *profile;
> + DEFINE_AUDIT_SK(sa, op, sk);
>  
> - return fn_for_each_confined(label, profile,
> - aa_profile_af_sk_perm(profile, &sa, request, sk));
> + error = fn_for_each_confined(label, profile,
> + aa_profile_af_sk_perm(profile, &sa, request, sk));
> + }
> +
> + return error;
>  }
>  
>  int aa_sk_perm(const char *op, u32 request, struct sock *sk)
> 



[PATCH] apparmor: Fix network performance issue in aa_label_sk_perm

2018-09-06 Thread Tony Jones
The netperf benchmark shows a 5.73% reduction in throughput for 
small (64 byte) transfers by unconfined tasks.

DEFINE_AUDIT_SK() in aa_label_sk_perm() should not be performed 
unconditionally, rather only when the label is confined.

netperf-tcp
                 56974a6fc^            56974a6fc
Min    64         563.48 (   0.00%)     531.17 (  -5.73%)
Min    128       1056.92 (   0.00%)     999.44 (  -5.44%)
Min    256       1945.95 (   0.00%)    1867.97 (  -4.01%)
Min    1024      6761.40 (   0.00%)    6364.23 (  -5.87%)
Min    2048         0.53 (   0.00%)   10606.20 (  -4.54%)
Min    3312     13692.67 (   0.00%)   13158.41 (  -3.90%)
Min    4096     14926.29 (   0.00%)   14457.46 (  -3.14%)
Min    8192     18399.34 (   0.00%)   18091.65 (  -1.67%)
Min    16384    21384.13 (   0.00%)   21158.05 (  -1.06%)
Hmean  64         564.96 (   0.00%)     534.38 (  -5.41%)
Hmean  128       1064.42 (   0.00%)    1010.12 (  -5.10%)
Hmean  256       1965.85 (   0.00%)    1879.16 (  -4.41%)
Hmean  1024      6839.77 (   0.00%)    6478.70 (  -5.28%)
Hmean  2048     11154.80 (   0.00%)   10671.13 (  -4.34%)
Hmean  3312     13838.12 (   0.00%)   13249.01 (  -4.26%)
Hmean  4096     15009.99 (   0.00%)   14561.36 (  -2.99%)
Hmean  8192     18975.57 (   0.00%)   18326.54 (  -3.42%)
Hmean  16384    21440.44 (   0.00%)   21324.59 (  -0.54%)
Stddev 64           1.24 (   0.00%)       2.85 (-130.64%)
Stddev 128          4.51 (   0.00%)       6.53 ( -44.84%)
Stddev 256         11.67 (   0.00%)       8.50 (  27.16%)
Stddev 1024        48.33 (   0.00%)      75.07 ( -55.34%)
Stddev 2048        54.82 (   0.00%)      65.16 ( -18.86%)
Stddev 3312       153.57 (   0.00%)      56.29 (  63.35%)
Stddev 4096       100.25 (   0.00%)      88.50 (  11.72%)
Stddev 8192       358.13 (   0.00%)     169.99 (  52.54%)
Stddev 16384       43.99 (   0.00%)     141.82 (-222.39%)
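For reference, the Hmean rows above are harmonic means of the per-run throughputs (the appropriate average for rates); a minimal computation sketch:

```c
/* Harmonic mean of n positive throughput samples: n / sum(1/x_i).
 * Returns 0 on empty or non-positive input. */
static double hmean(const double *x, int n)
{
	double inv_sum = 0.0;
	int i;

	if (n <= 0)
		return 0.0;
	for (i = 0; i < n; i++) {
		if (x[i] <= 0.0)
			return 0.0;
		inv_sum += 1.0 / x[i];
	}
	return (double)n / inv_sum;
}
```

Unlike the arithmetic mean, the harmonic mean is pulled toward the slowest runs, which is why benchmark harnesses prefer it for throughput.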

Signed-off-by: Tony Jones 
Fixes: 56974a6fcfef ("apparmor: add base infastructure for socket mediation")
---
 security/apparmor/net.c | 15 +--
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/security/apparmor/net.c b/security/apparmor/net.c
index bb24cfa0a164..d5d72dd1ca1f 100644
--- a/security/apparmor/net.c
+++ b/security/apparmor/net.c
@@ -146,17 +146,20 @@ int aa_af_perm(struct aa_label *label, const char *op, u32 request, u16 family,
 static int aa_label_sk_perm(struct aa_label *label, const char *op, u32 request,
                             struct sock *sk)
 {
-   struct aa_profile *profile;
-   DEFINE_AUDIT_SK(sa, op, sk);
+   int error = 0;
 
AA_BUG(!label);
AA_BUG(!sk);
 
-   if (unconfined(label))
-   return 0;
+   if (!unconfined(label)) {
+   struct aa_profile *profile;
+   DEFINE_AUDIT_SK(sa, op, sk);
 
-   return fn_for_each_confined(label, profile,
-   aa_profile_af_sk_perm(profile, &sa, request, sk));
+   error = fn_for_each_confined(label, profile,
+   aa_profile_af_sk_perm(profile, &sa, request, sk));
+   }
+
+   return error;
 }
 
 int aa_sk_perm(const char *op, u32 request, struct sock *sk)
-- 
2.18.0



Re: network performance get regression from 2.6 to 3.10 by each version

2014-05-05 Thread Rick Jones

On 05/02/2014 12:40 PM, V JobNickname wrote:

I have an ARM platform which works with older 2.6.28 Linux Kernel and
the embedded NIC driver
I profile the TCP Tx using netperf 2.6 by command "./netperf -H
{serverip} -l 300".


Is your ARM platform a multi-core one?  If so, you may need/want to look 
into making certain the assignment of NIC interrupts and netperf have 
remained constant through your tests.  You can bind netperf to a 
specific CPU via either "taskset" or the global -T option.  You can 
check the interrupt assignment(s) for the queue(s) from the NIC by 
looking at /proc/interrupts and perhaps via other means.
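The `taskset` binding Rick mentions can also be done from inside a test harness. A small sketch using the Linux-specific `sched_setaffinity(2)` (pinning the calling process to one CPU; the CPU number is an assumption to adjust for your topology):

```c
#define _GNU_SOURCE
#include <sched.h>

/* Pin the calling process to a single CPU so benchmark runs are
 * comparable across kernels (what `taskset -c <cpu>` does externally).
 * Returns 0 on success. */
static int pin_to_cpu(int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	return sched_setaffinity(0, sizeof(set), &set);	/* pid 0 = self */
}

/* Read the mask back to confirm the pin took effect. */
static int pinned_cpu_count(void)
{
	cpu_set_t set;

	if (sched_getaffinity(0, sizeof(set), &set) != 0)
		return -1;
	return CPU_COUNT(&set);
}
```

netperf's global -T option does the equivalent for its own data path; combining a pin like this with a check of /proc/interrupts keeps process and IRQ placement constant between runs.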


It would also be good to know if the drops in throughput correspond to 
an increase in service demand (CPU per unit of work).  To that end, 
adding a global -c option to measure local (netperf side) CPU 
utilization would be a good idea.


Still, even armed with that information, tracking down the regression or 
regressions will be no small feat particularly since the timespan is so 
long.  A very good reason to be trying the newer versions as they 
appear, even if only briefly, rather than leaving it for so long.


happy benchmarking,

rick jones
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


network performance get regression from 2.6 to 3.10 by each version

2014-05-02 Thread V JobNickname
I have an ARM platform that runs the older 2.6.28 Linux kernel with an
embedded NIC driver. I profile TCP Tx using netperf 2.6 with the command
"./netperf -H {serverip} -l 300".

In 2.6.28, TCP Tx can reach 190 Mbps.

Recently I have been porting the platform to the long-term kernel versions
2.6.32.61, 3.4.88, and 3.10, and TCP Tx throughput regresses with each
newer version:

2.6.32.61 is about 184 Mbps
3.4.88 is about 173 Mbps
3.10.0 is about 160 Mbps

So I also tried porting to more EOL versions:

3.0.38   184 Mbps
3.2.0    179 Mbps
3.2.57   177 Mbps
3.5.0    168 Mbps
3.5.7    166 Mbps
3.6.0    162 Mbps
3.6.11   163 Mbps

Each newer version has slower performance.

The kernels were downloaded from kernel.org for the porting. To touch as few
files as possible, I ported only the basic requirements for MACHINE_START
("io_map", "interrupt", "timer") and added the NIC driver. The only patch
the NIC driver needed from 2.x to 3.x was grouping the fops into
"net_device_ops"; there were no changes to the xmit, receive, or ISR handling
flow. In fact, the NIC driver files from 3.2 to 3.6 are identical by diff.
The only other difference is the .config file, and I have tried to keep the
configurations identical as well, apart from options added or removed between
versions. I have no idea whether the performance regression is due to the
network stack of each version or to some feature I should configure in the
newer versions. Any suggestions for digging out the root cause? Or has anyone
made a similar observation?

The following is the .config I used for 3.0.38, plus the .config diffs
between each version. I can't find any option difference in the .config
diffs that would affect performance.



.config of kernel 3.0.38

#
# Automatically generated make config: don't edit
# Linux/arm 3.0.38 Kernel Configuration
#
CONFIG_ARM=y
CONFIG_SYS_SUPPORTS_APM_EMULATION=y
# CONFIG_ARCH_USES_GETTIMEOFFSET is not set
CONFIG_KTIME_SCALAR=y
CONFIG_HAVE_PROC_CPU=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_HAVE_LATENCYTOP_SUPPORT=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_TRACE_IRQFLAGS_SUPPORT=y
CONFIG_HARDIRQS_SW_RESEND=y
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_RWSEM_GENERIC_SPINLOCK=y
CONFIG_ARCH_HAS_CPU_IDLE_WAIT=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_NEED_DMA_MAP_STATE=y
CONFIG_VECTORS_BASE=0x
# CONFIG_ARM_PATCH_PHYS_VIRT is not set
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"
CONFIG_HAVE_IRQ_WORK=y

#
# General setup
#
CONFIG_EXPERIMENTAL=y
CONFIG_BROKEN_ON_SMP=y
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_CROSS_COMPILE="arm-linux-"
CONFIG_LOCALVERSION=""
# CONFIG_LOCALVERSION_AUTO is not set
CONFIG_HAVE_KERNEL_GZIP=y
CONFIG_HAVE_KERNEL_LZMA=y
CONFIG_HAVE_KERNEL_LZO=y
CONFIG_KERNEL_GZIP=y
# CONFIG_KERNEL_LZMA is not set
# CONFIG_KERNEL_LZO is not set
CONFIG_DEFAULT_HOSTNAME="(none)"
# CONFIG_SWAP is not set
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
# CONFIG_POSIX_MQUEUE is not set
# CONFIG_BSD_PROCESS_ACCT is not set
# CONFIG_FHANDLE is not set
# CONFIG_TASKSTATS is not set
# CONFIG_AUDIT is not set
CONFIG_HAVE_GENERIC_HARDIRQS=y

#
# IRQ subsystem
#
CONFIG_GENERIC_HARDIRQS=y
CONFIG_HAVE_SPARSE_IRQ=y
CONFIG_GENERIC_IRQ_SHOW=y
# CONFIG_SPARSE_IRQ is not set

#
# RCU Subsystem
#
# CONFIG_TREE_PREEMPT_RCU is not set
# CONFIG_TINY_RCU is not set
CONFIG_TINY_PREEMPT_RCU=y
CONFIG_PREEMPT_RCU=y
# CONFIG_RCU_TRACE is not set
# CONFIG_TREE_RCU_TRACE is not set
# CONFIG_RCU_BOOST is not set
# CONFIG_IKCONFIG is not set
CONFIG_LOG_BUF_SHIFT=16
# CONFIG_CGROUPS is not set
CONFIG_NAMESPACES=y
# CONFIG_UTS_NS is not set
CONFIG_IPC_NS=y
# CONFIG_USER_NS is not set
# CONFIG_PID_NS is not set
# CONFIG_NET_NS is not set
# CONFIG_SCHED_AUTOGROUP is not set
CONFIG_SYSFS_DEPRECATED=y
CONFIG_SYSFS_DEPRECATED_V2=y
# CONFIG_RELAY is not set
CONFIG_BLK_DEV_INITRD=y
CONFIG_INITRAMFS_SOURCE=""
CONFIG_RD_GZIP=y
# CONFIG_RD_BZIP2 is not set
# CONFIG_RD_LZMA is not set
# CONFIG_RD_XZ is not set
# CONFIG_RD_LZO is not set
# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
CONFIG_SYSCTL=y
CONFIG_ANON_INODES=y
CONFIG_EXPERT=y
CONFIG_UID16=y
# CONFIG_SYSCTL_SYSCALL is not set
CONFIG_KALLSYMS=y
# CONFIG_KALLSYMS_ALL is not set
CONFIG_HOTPLUG=y
CONFIG_PRINTK=y
CONFIG_BUG=y
# CONFIG_ELF_CORE is not set
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_EPOLL=y
CONFIG_SIGNALFD=y
CONFIG_TIMERFD=y
CONFIG_EVENTFD=y
CONFIG_SHMEM=y
CONFIG_AIO=y
CONFIG_EMBEDDED=y
CONFIG_HAVE_PERF_EVENTS=y
CONFIG_PERF_USE_VMALLOC=y

#
# Kernel Performance Events And Counters
#
# CONFIG_PERF_EVENTS is not set
# CONFIG_PERF_COUNTERS is not set
# CONFIG_VM_EVENT_COUNTERS is not set
CONFIG_COMPAT_BRK=y
CONFIG_SLAB=y
# CONFIG_SLUB is not set
# CONFIG_SLOB is not set
# CONFIG_PROFILING is not set
CONFIG_HAVE_OPROFILE=y
# CONFIG_KPROBES is not set
CONFIG_HAVE_KPROBES=y
CONFIG_HAVE_KRETPROBES=y
CONFIG_HAVE_REGS_AND_STACK_ACCESS_API=y
CONFIG_HAVE_DMA_API_DEBUG=y

#
# GCOV-based kernel profiling
#
CONFIG_HAVE_GENERIC_DMA_COHERENT=y
CONFIG_SLABINFO=y
CONFIG_RT_MUTEXES=y
CONFIG_BASE_SMALL=0
CONFIG_MODULES=y
# 

Re: Poor network performance x86_64.. also with 3.13

2014-02-09 Thread Borislav Petkov
On Sun, Feb 09, 2014 at 10:14:34AM -0800, Eric Dumazet wrote:
> tcp_rmem[2] = 16777
> 
> Come on, the 640KB barrier was broken a long time ago ;)
> 
> Feel free to investigate, I won't ;)

Me too - it's not like I don't have anything else to do. :-)

I was just wondering why 3.10 was fine even with these settings and 3.12
wasn't. Here's the original report:

"I recently upgraded the Kernel from version 3.10 to latest stable
3.12.8, did the usual "make oldconfig" (resulting config attached).

But now I noticed some _really_ low network performance."

Link: http://lkml.kernel.org/r/52dad66f.7080...@dragonslave.de

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Poor network performance x86_64.. also with 3.13

2014-02-09 Thread Eric Dumazet
On Sun, 2014-02-09 at 16:31 +0100, Borislav Petkov wrote:
> On Sun, Feb 09, 2014 at 04:05:11PM +0100, Daniel Exner wrote:
> > > cat /etc/sysctl.d/net.conf
> > > net.ipv4.tcp_window_scaling = 1
> > > net.core.rmem_max = 16777216
> > > net.ipv4.tcp_rmem = 4096 87380 16777
> > > net.ipv4.tcp_wmem = 4096   1638
> > 
> > After removing those values I finally had sane iperf values.
> > No idea how those got there, perhaps they made sense when I first setup
> > the box, which is some years ago..
> 
> The only question that is left to clarify now is why do those values
> have effect on 3.12.x and not on 3.10...

tcp_rmem[2] = 16777

Come on, the 640KB barrier was broken a long time ago ;)

Feel free to investigate, I wont ;)





Re: Poor network performance x86_64.. also with 3.13

2014-02-09 Thread Borislav Petkov
On Sun, Feb 09, 2014 at 04:05:11PM +0100, Daniel Exner wrote:
> > cat /etc/sysctl.d/net.conf
> > net.ipv4.tcp_window_scaling = 1
> > net.core.rmem_max = 16777216
> > net.ipv4.tcp_rmem = 4096 87380 16777
> > net.ipv4.tcp_wmem = 4096   1638
> 
> After removing those values I finally had sane iperf values.
> No idea how those got there, perhaps they made sense when I first setup
> the box, which is some years ago..

The only question that is left to clarify now is why do those values
have effect on 3.12.x and not on 3.10...

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.


Re: Poor network performance x86_64.. also with 3.13

2014-02-09 Thread Daniel Exner
Hi all,

On Mon, 20 Jan 2014 23:37:52 +0100, Borislav Petkov wrote:

> On Mon, Jan 20, 2014 at 11:27:25PM +0100, Daniel Exner wrote:
> > I just did the same procedure with Kernel Version 3.13: same poor
> > rates.
> > 
> > I think I will try to see of 3.12.6 was still ok and bisect from
> > there.
> 
> Or try something more coarse-grained like 3.11 first, then 3.12 and
> then the -rcs in between.
> 

I must apologize for suspecting the kernel for my problems. After some
bisect attempts I finally noticed the following:

> cat /etc/sysctl.d/net.conf
> net.ipv4.tcp_window_scaling = 1
> net.core.rmem_max = 16777216
> net.ipv4.tcp_rmem = 4096 87380 16777
> net.ipv4.tcp_wmem = 4096   1638

After removing those values I finally had sane iperf values.
No idea how those got there, perhaps they made sense when I first setup
the box, which is some years ago..

Anyway, thanks all for your help :)

Greetings
Daniel Exner


Re: Poor network performance x86_64.. also with 3.13

2014-01-20 Thread Branimir Maksimovic

On 01/20/2014 11:37 PM, Borislav Petkov wrote:

On Mon, Jan 20, 2014 at 11:27:25PM +0100, Daniel Exner wrote:

I just did the same procedure with Kernel Version 3.13: same poor rates.

I think I will try to see of 3.12.6 was still ok and bisect from there.

Or try something more coarse-grained like 3.11 first, then 3.12 and then
the -rcs in between.


Hm, on my machine 3.13 (latest git) has double the throughput of 3.11 (distro
compiled) on the loopback interface. 68Gb vs 33Gb (iperf).





Re: Poor network performance x86_64.. also with 3.13

2014-01-20 Thread Borislav Petkov
On Mon, Jan 20, 2014 at 11:27:25PM +0100, Daniel Exner wrote:
> I just did the same procedure with Kernel Version 3.13: same poor rates.
> 
> I think I will try to see of 3.12.6 was still ok and bisect from there.

Or try something more coarse-grained like 3.11 first, then 3.12 and then
the -rcs in between.

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.


Poor network performance x86_64.. also with 3.13

2014-01-20 Thread Daniel Exner
Hi,

On 18.01.2014 23:46, Daniel Exner wrote:
> Hi again,
> 
> On 18.01.2014 20:50, Borislav Petkov wrote:
>> + netdev.
> Thx
> 
> On 18.01.2014 20:49, Holger Hoffstätte wrote:
> [This mail was also posted to gmane.linux.kernel.]
> 
>> On Sat, 18 Jan 2014 20:30:55 +0100, Daniel Exner wrote:
> 
>>> I recently upgraded the Kernel from version 3.10 to latest 
>>> stable 3.12.8, did the usual "make oldconfig" (resulting
>>> config attached).
>>> 
>>> But now I noticed some _really_ low network performance.
> 
>> Try: sysctl net.ipv4.tcp_limit_output_bytes=262144
> 
> Tried that. Even 10 times the value. Same effect.


I just did the same procedure with Kernel Version 3.13: same poor rates.

I think I will try to see if 3.12.6 was still ok and bisect from there.

Greetings
Daniel


Re: 3.12.8 poor network performance x86_64

2014-01-18 Thread Daniel Exner
Hi again,

On 18.01.2014 20:50, Borislav Petkov wrote:
> + netdev.
Thx

On 18.01.2014 20:49, Holger Hoffstätte wrote:
> [This mail was also posted to gmane.linux.kernel.]
> 
> On Sat, 18 Jan 2014 20:30:55 +0100, Daniel Exner wrote:
> 
>> I recently upgraded the Kernel from version 3.10 to latest
>> stable 3.12.8, did the usual "make oldconfig" (resulting config
>> attached).
>> 
>> But now I noticed some _really_ low network performance.
> 
> Try: sysctl net.ipv4.tcp_limit_output_bytes=262144

Tried that. Even 10 times the value. Same effect.

Is there something like that on a lower level of the network stack I
might try to change?

Could that be something in the cgroups layer?

Should I send a dmesg or anything else?

Greetings
Daniel


Re: 3.12.8 poor network performance x86_64

2014-01-18 Thread Holger Hoffstätte
On Sat, 18 Jan 2014 20:30:55 +0100, Daniel Exner wrote:

> I recently upgraded the Kernel from version 3.10 to latest stable
> 3.12.8, did the usual "make oldconfig" (resulting config attached).
> 
> But now I noticed some _really_ low network performance.

Try: sysctl net.ipv4.tcp_limit_output_bytes=262144

Holger



Re: Major network performance regression in 3.7

2013-01-06 Thread John Stoffel
> "Willy" == Willy Tarreau  writes:

Willy> On Sun, Jan 06, 2013 at 11:00:15AM -0800, Eric Dumazet wrote:
>> On Sun, 2013-01-06 at 10:51 -0800, Eric Dumazet wrote:
>> > > 
>> > > (sd->len is usually 4096, which is expected, but sd->total_len value is
>> > > huge in your case, so we always set the flag in fs/splice.c)
>> > 
>> > I am testing :
>> > 
>> >if (sd->len < sd->total_len && pipe->nrbufs > 1)
>> > more |= MSG_SENDPAGE_NOTLAST;
>> > 
>> 
>> Yes, this should fix the problem :
>> 
>> If there is no following buffer in the pipe, we should not set NOTLAST.
>> 
>> diff --git a/fs/splice.c b/fs/splice.c
>> index 8890604..6909d89 100644
>> --- a/fs/splice.c
>> +++ b/fs/splice.c
>> @@ -696,8 +696,10 @@ static int pipe_to_sendpage(struct pipe_inode_info 
>> *pipe,
>> return -EINVAL;
>> 
>> more = (sd->flags & SPLICE_F_MORE) ? MSG_MORE : 0;
>> -if (sd->len < sd->total_len)
>> +
>> +if (sd->len < sd->total_len && pipe->nrbufs > 1)
>> more |= MSG_SENDPAGE_NOTLAST;
>> +
>> return file->f_op->sendpage(file, buf->page, buf->offset,
>> sd->len, &pos, more);
>> }
 
Willy> OK it works like a charm here now ! I can't break it anymore, so it
Willy> looks like you finally got it !

It's still broken, there's no comments in the code to explain all this
magic to mere mortals!  *grin*

John


Re: Major network performance regression in 3.7

2013-01-06 Thread John Stoffel
> "Willy" == Willy Tarreau  writes:

Willy> On Sun, Jan 06, 2013 at 04:49:35PM -0500, John Stoffel wrote:
>> > "Willy" == Willy Tarreau  writes:
>> 
Willy> On Sun, Jan 06, 2013 at 11:00:15AM -0800, Eric Dumazet wrote:
>> >> On Sun, 2013-01-06 at 10:51 -0800, Eric Dumazet wrote:
>> >> > > 
>> >> > > (sd->len is usually 4096, which is expected, but sd->total_len value 
>> >> > > is
>> >> > > huge in your case, so we always set the flag in fs/splice.c)
>> >> > 
>> >> > I am testing :
>> >> > 
>> >> >if (sd->len < sd->total_len && pipe->nrbufs > 1)
>> >> > more |= MSG_SENDPAGE_NOTLAST;
>> >> > 
>> >> 
>> >> Yes, this should fix the problem :
>> >> 
>> >> If there is no following buffer in the pipe, we should not set NOTLAST.
>> >> 
>> >> diff --git a/fs/splice.c b/fs/splice.c
>> >> index 8890604..6909d89 100644
>> >> --- a/fs/splice.c
>> >> +++ b/fs/splice.c
>> >> @@ -696,8 +696,10 @@ static int pipe_to_sendpage(struct pipe_inode_info 
>> >> *pipe,
>> >> return -EINVAL;
>> >> 
>> >> more = (sd->flags & SPLICE_F_MORE) ? MSG_MORE : 0;
>> >> - if (sd->len < sd->total_len)
>> >> +
>> >> + if (sd->len < sd->total_len && pipe->nrbufs > 1)
>> >> more |= MSG_SENDPAGE_NOTLAST;
>> >> +
>> >> return file->f_op->sendpage(file, buf->page, buf->offset,
>> >> sd->len, &pos, more);
>> >> }
>> 
Willy> OK it works like a charm here now ! I can't break it anymore, so it
Willy> looks like you finally got it !
>> 
>> It's still broken, there's no comments in the code to explain all this
>> magic to mere mortals!  *grin*

Willy> I would generally agree, but when Eric fixes such a thing, he
Willy> generally goes with lengthy details in the commit message.

I'm sure he will too, I just wanted to nudge him because while I sorta
followed this discussion, I see lots of pain down the road if the code
wasn't updated with some nice big fat comments.

Great job finding this code and testing, testing, testing.

John



Re: Major network performance regression in 3.7

2013-01-06 Thread Willy Tarreau
On Sun, Jan 06, 2013 at 04:49:35PM -0500, John Stoffel wrote:
> > "Willy" == Willy Tarreau  writes:
> 
> Willy> On Sun, Jan 06, 2013 at 11:00:15AM -0800, Eric Dumazet wrote:
> >> On Sun, 2013-01-06 at 10:51 -0800, Eric Dumazet wrote:
> >> > > 
> >> > > (sd->len is usually 4096, which is expected, but sd->total_len value is
> >> > > huge in your case, so we always set the flag in fs/splice.c)
> >> > 
> >> > I am testing :
> >> > 
> >> >if (sd->len < sd->total_len && pipe->nrbufs > 1)
> >> > more |= MSG_SENDPAGE_NOTLAST;
> >> > 
> >> 
> >> Yes, this should fix the problem :
> >> 
> >> If there is no following buffer in the pipe, we should not set NOTLAST.
> >> 
> >> diff --git a/fs/splice.c b/fs/splice.c
> >> index 8890604..6909d89 100644
> >> --- a/fs/splice.c
> >> +++ b/fs/splice.c
> >> @@ -696,8 +696,10 @@ static int pipe_to_sendpage(struct pipe_inode_info 
> >> *pipe,
> >> return -EINVAL;
> >> 
> >> more = (sd->flags & SPLICE_F_MORE) ? MSG_MORE : 0;
> >> -  if (sd->len < sd->total_len)
> >> +
> >> +  if (sd->len < sd->total_len && pipe->nrbufs > 1)
> >> more |= MSG_SENDPAGE_NOTLAST;
> >> +
> >> return file->f_op->sendpage(file, buf->page, buf->offset,
> >> sd->len, &pos, more);
> >> }
>  
> Willy> OK it works like a charm here now ! I can't break it anymore, so it
> Willy> looks like you finally got it !
> 
> It's still broken, there's no comments in the code to explain all this
> magic to mere mortals!  *grin*

I would generally agree, but when Eric fixes such a thing, he
generally goes with lengthy details in the commit message.

Willy



Re: Major network performance regression in 3.7

2013-01-06 Thread Willy Tarreau
On Sun, Jan 06, 2013 at 11:39:31AM -0800, Eric Dumazet wrote:
> On Sun, 2013-01-06 at 20:34 +0100, Willy Tarreau wrote:
> 
> > OK it works like a charm here now ! I can't break it anymore, so it
> > looks like you finally got it !
> > 
> > I noticed that the data rate was higher when the loopback's MTU
> > is exactly a multiple of 4096 (making the 64k choice optimal)
> > while I would have assumed that in order to efficiently splice
> > TCP segments, we'd need to have some space for IP/TCP headers
> > and n*4k for the payload.
> > 
> > I also got the transfer freezes again a few times when starting
> > tcpdump on the server, but this is not 100% reproducible I'm afraid.
> > So I'll bring this back when I manage to get some analysable pattern.
> > 
> > The spliced transfer through all the chain haproxy works fine again
> > at 10gig with your fix. The issue is closed for me. Feel free to add
> > my Tested-By if you want.
> > 
> 
> Good to know !
> 
> What is the max speed you get now ?

Line rate with 1500 MTU and LRO enabled :

#   time   eth1(ikb  ipk okb  opk)eth2(ikb   ipk  okbopk) 

1357060023 19933.3 41527.7 9355538.2 62167.7  9757888.1 808701.1 19400.3 40417.7
1357060024 26124.1 54425.5 9290064.9 48804.4  9778294.0 810210.0 18068.8 37643.3
1357060025 27015.2 56281.1 9296115.3 46868.8  9797125.9 811271.1 8790.1 18312.2 
1357060026 27556.0 57408.8 9291701.4 46805.5  9805371.6 811410.0 3494.8 7280.0 
1357060027 27577.0 57452.2 9293606.8 46804.4  9806122.3 811314.4 2558.7 5330.0 
1357060028 27476.1 57242.2 9296885.4 46830.0  9794537.3 810527.7 2516.1 5242.2 
(marked columns: eth1 "okb" = kbps out, eth2 "ikb" = kbps in)
eth1=facing the client
eth2=facing the server

Top reports the following usage :

Cpu0  :  0.0%us,  0.0%sy,  0.0%ni, 31.7%id,  0.0%wa,  0.0%hi, 68.3%si,  0.0%st
Cpu1  :  1.0%us, 37.3%sy,  0.0%ni, 61.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

(IRQ bound to cpu 0, haproxy to cpu 1)

This is a core2duo 2.66 GHz and the myris are 1st generation.

BTW I was very happy to see that the LRO->GRO conversion patches in 3.8-rc2
don't affect byte rate anymore (just a minor CPU usage increase but this is
not critical here), now I won't complain about it being slower anymore, you
won :-)


With the GRO patches backported, still at 1500 MTU but with GRO now :

Cpu0  :  0.0%us,  0.0%sy,  0.0%ni, 28.7%id,  0.0%wa,  0.0%hi, 71.3%si,  0.0%st
Cpu1  :  0.0%us, 37.6%sy,  0.0%ni, 62.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

#   time   eth1(ikb  ipk okb  opk)eth2(ikb   ipk  okbopk) 
1357058637 18319.3 38165.5 9401736.3 65159.9  9761613.4 808963.3 19403.6 40424.4
1357058638 20009.8 41687.7 9400903.7 62706.6  9770555.8 809522.2 18696.5 38951.1
1357058639 25439.5 52999.9 9301635.3 50267.7  9773666.7 809721.1 19174.1 39946.6
1357058640 26808.2 55850.0 9298301.4 46876.6  9790470.1 810843.3 12408.7 25851.1
1357058641 27110.9 56481.1 9297009.2 46832.2  9803308.4 811339.9 5692.5 11859.9
1357058642 27411.1 57106.6 9291419.2 46796.6  9806846.5 811378.8 2804.4 5842.2

This kernel is getting really good :-)

Cheers,
Willy



Re: Major network performance regression in 3.7

2013-01-06 Thread Eric Dumazet
On Sun, 2013-01-06 at 20:34 +0100, Willy Tarreau wrote:

> OK it works like a charm here now ! I can't break it anymore, so it
> looks like you finally got it !
> 
> I noticed that the data rate was higher when the loopback's MTU
> is exactly a multiple of 4096 (making the 64k choice optimal)
> while I would have assumed that in order to efficiently splice
> TCP segments, we'd need to have some space for IP/TCP headers
> and n*4k for the payload.
> 
> I also got the transfer freezes again a few times when starting
> tcpdump on the server, but this is not 100% reproducible I'm afraid.
> So I'll bring this back when I manage to get some analysable pattern.
> 
> The spliced transfer through all the chain haproxy works fine again
> at 10gig with your fix. The issue is closed for me. Feel free to add
> my Tested-By if you want.
> 

Good to know !

What is the max speed you get now ?




Re: Major network performance regression in 3.7

2013-01-06 Thread Willy Tarreau
On Sun, Jan 06, 2013 at 11:00:15AM -0800, Eric Dumazet wrote:
> On Sun, 2013-01-06 at 10:51 -0800, Eric Dumazet wrote:
> > > 
> > > (sd->len is usually 4096, which is expected, but sd->total_len value is
> > > huge in your case, so we always set the flag in fs/splice.c)
> > 
> > I am testing :
> > 
> >if (sd->len < sd->total_len && pipe->nrbufs > 1)
> > more |= MSG_SENDPAGE_NOTLAST;
> > 
> 
> Yes, this should fix the problem :
> 
> If there is no following buffer in the pipe, we should not set NOTLAST.
> 
> diff --git a/fs/splice.c b/fs/splice.c
> index 8890604..6909d89 100644
> --- a/fs/splice.c
> +++ b/fs/splice.c
> @@ -696,8 +696,10 @@ static int pipe_to_sendpage(struct pipe_inode_info *pipe,
>   return -EINVAL;
>  
>   more = (sd->flags & SPLICE_F_MORE) ? MSG_MORE : 0;
> - if (sd->len < sd->total_len)
> +
> + if (sd->len < sd->total_len && pipe->nrbufs > 1)
>   more |= MSG_SENDPAGE_NOTLAST;
> +
>   return file->f_op->sendpage(file, buf->page, buf->offset,
> sd->len, &pos, more);
>  }
 
OK it works like a charm here now ! I can't break it anymore, so it
looks like you finally got it !

I noticed that the data rate was higher when the loopback's MTU
is exactly a multiple of 4096 (making the 64k choice optimal)
while I would have assumed that in order to efficiently splice
TCP segments, we'd need to have some space for IP/TCP headers
and n*4k for the payload.

I also got the transfer freezes again a few times when starting
tcpdump on the server, but this is not 100% reproducible I'm afraid.
So I'll bring this back when I manage to get some analysable pattern.

The spliced transfer through all the chain haproxy works fine again
at 10gig with your fix. The issue is closed for me. Feel free to add
my Tested-By if you want.

Thank you Eric :-)
Willy



Re: Major network performance regression in 3.7

2013-01-06 Thread Eric Dumazet
On Sun, 2013-01-06 at 10:51 -0800, Eric Dumazet wrote:
> > 
> > (sd->len is usually 4096, which is expected, but sd->total_len value is
> > huge in your case, so we always set the flag in fs/splice.c)
> 
> I am testing :
> 
>if (sd->len < sd->total_len && pipe->nrbufs > 1)
> more |= MSG_SENDPAGE_NOTLAST;
> 

Yes, this should fix the problem :

If there is no following buffer in the pipe, we should not set NOTLAST.

diff --git a/fs/splice.c b/fs/splice.c
index 8890604..6909d89 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -696,8 +696,10 @@ static int pipe_to_sendpage(struct pipe_inode_info *pipe,
return -EINVAL;
 
more = (sd->flags & SPLICE_F_MORE) ? MSG_MORE : 0;
-   if (sd->len < sd->total_len)
+
+   if (sd->len < sd->total_len && pipe->nrbufs > 1)
more |= MSG_SENDPAGE_NOTLAST;
+
return file->f_op->sendpage(file, buf->page, buf->offset,
sd->len, &pos, more);
 }




Re: Major network performance regression in 3.7

2013-01-06 Thread Eric Dumazet

> 
> (sd->len is usually 4096, which is expected, but sd->total_len value is
> huge in your case, so we always set the flag in fs/splice.c)

I am testing :

   if (sd->len < sd->total_len && pipe->nrbufs > 1)
more |= MSG_SENDPAGE_NOTLAST;





Re: Major network performance regression in 3.7

2013-01-06 Thread Eric Dumazet
On Sun, 2013-01-06 at 10:39 -0800, Eric Dumazet wrote:
> On Sun, 2013-01-06 at 18:35 +0100, Willy Tarreau wrote:
> 
> > Unfortunately it does not work any better, which means to me
> > that we don't leave via this code path. I tried other tricks
> > which failed too. I need to understand this part better before
> > randomly fiddling with it.
> > 
> 
> OK, now I have your test program, I can work on a fix, dont worry ;)
> 
> The MSG_SENDPAGE_NOTLAST logic needs to be tweaked.
> 


(sd->len is usually 4096, which is expected, but sd->total_len value is
huge in your case, so we always set the flag in fs/splice.c)




Re: Major network performance regression in 3.7

2013-01-06 Thread Eric Dumazet
On Sun, 2013-01-06 at 18:35 +0100, Willy Tarreau wrote:

> Unfortunately it does not work any better, which means to me
> that we don't leave via this code path. I tried other tricks
> which failed too. I need to understand this part better before
> randomly fiddling with it.
> 

OK, now I have your test program, I can work on a fix, dont worry ;)

The MSG_SENDPAGE_NOTLAST logic needs to be tweaked.




Re: Major network performance regression in 3.7

2013-01-06 Thread Willy Tarreau
On Sun, Jan 06, 2013 at 09:10:55AM -0800, Eric Dumazet wrote:
> On Sun, 2013-01-06 at 17:44 +0100, Willy Tarreau wrote:
> > On Sun, Jan 06, 2013 at 08:39:53AM -0800, Eric Dumazet wrote:
> > > Hmm, I'll have to check if this really can be reverted without hurting
> > > vmsplice() again.
> > 
> > Looking at the code I've been wondering whether we shouldn't transform
> > the condition to perform the push if we can't push more segments, but
> > I don't know what to rely on. It would be something like this :
> > 
> >if (copied &&
> >   (!(flags & MSG_SENDPAGE_NOTLAST) || cant_push_more))
> > tcp_push(sk, flags, mss_now, tp->nonagle);
> 
> Good point !
> 
> Maybe the following fix then ?
> 
> 
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 1ca2536..7ba0717 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -941,8 +941,10 @@ out:
>   return copied;
>  
>  do_error:
> - if (copied)
> + if (copied) {
> + flags &= ~MSG_SENDPAGE_NOTLAST;
>   goto out;
> + }
>  out_err:
>   return sk_stream_error(sk, flags, err);
>  }

Unfortunately it does not work any better, which means to me
that we don't leave via this code path. I tried other tricks
which failed too. I need to understand this part better before
randomly fiddling with it.

Willy



Re: Major network performance regression in 3.7

2013-01-06 Thread Eric Dumazet
On Sun, 2013-01-06 at 17:44 +0100, Willy Tarreau wrote:
> On Sun, Jan 06, 2013 at 08:39:53AM -0800, Eric Dumazet wrote:
> > Hmm, I'll have to check if this really can be reverted without hurting
> > vmsplice() again.
> 
> Looking at the code I've been wondering whether we shouldn't transform
> the condition to perform the push if we can't push more segments, but
> I don't know what to rely on. It would be something like this :
> 
>if (copied &&
>   (!(flags & MSG_SENDPAGE_NOTLAST) || cant_push_more))
> tcp_push(sk, flags, mss_now, tp->nonagle);

Good point !

Maybe the following fix then ?


diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 1ca2536..7ba0717 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -941,8 +941,10 @@ out:
return copied;
 
 do_error:
-   if (copied)
+   if (copied) {
+   flags &= ~MSG_SENDPAGE_NOTLAST;
goto out;
+   }
 out_err:
return sk_stream_error(sk, flags, err);
 }




Re: Major network performance regression in 3.7

2013-01-06 Thread Eric Dumazet
On Sun, 2013-01-06 at 16:51 +0100, Willy Tarreau wrote:
> Hi Eric,
> 

> Oh sorry, I didn't really want to pollute the list with links and configs,
> especially during the initial report with various combined issues :-(
> 
> The client is my old "inject" tool, available here :
> 
>  http://git.1wt.eu/web?p=inject.git
> 
> The server is my "httpterm" tool, available here :
> 
>  http://git.1wt.eu/web?p=httpterm.git
>  Use "-O3 -DENABLE_POLL -DENABLE_EPOLL -DENABLE_SPLICE" for CFLAGS.
> 
> I'm starting httpterm this way :
> httpterm -D -L :8000 -P 256
> => it starts a server on port 8000, and sets pipe size to 256 kB. It
>uses SPLICE_F_MORE on output data but removing it did not fix the
>issue in one of the early tests.
> 
> Then I'm starting inject this way :
> inject -o 1 -u 1 -G 0:8000/?s=1g
> => 1 user, 1 object at a time, and fetch /?s=1g from the loopback.
>The server will then emit 1 GB of data using splice().
> 
> It's possible to disable splicing on the server using -dS. The client
> "eats" data using recv(MSG_TRUNC) to avoid a useless copy.
> 
> > TCP has very low defaults concerning initial window, and it appears you
> > set RCVBUF to even smaller values.
> 
> Yes, you're right, my bootup scripts still change the default value, though
> I increase them to larger values during the tests (except the one where you
> saw win 8030 due to the default rmem set to 16060). I've been using this
> value in the past with older kernels because it allowed an integer number
> of segments to fit into the default window, and offered optimal performance
> with large numbers of concurrent connections. Since 2.6, tcp_moderate_rcvbuf
> works very well and this is not needed anymore.
> 
> Anyway, it does not affect the test here. Good kernels are OK whatever the
> default value, and bad kernels are bad whatever the default value too.
> 
> Hmmm finally it's this commit again :
> 
>2f53384 tcp: allow splice() to build full TSO packets
> 
> I'm saying "again" because we already diagnosed a similar effect several
> months ago that was revealed by this patch and we fixed it with the
> following  one, though I remember that we weren't completely sure it
> would fix everything :
> 
>bad115c tcp: do_tcp_sendpages() must try to push data out on oom conditions
> 
> Just out of curiosity, I tried to re-apply the patch above just after the
> first one but it did not change anything (after all it changed a symptom
> which appeared in different conditions).
> 
> Interestingly, this commit (2f53384) significantly improved performance
> on spliced data over the loopback (more than 50% in this test). In 3.7,
> it seems to have no positive effect anymore. I reverted it using the
> following patch and now the problem is fixed (mtu=64k works fine now) :
> 
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index e457c7a..61e4517 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -935,7 +935,7 @@ wait_for_memory:
>   }
>  
>  out:
> - if (copied && !(flags & MSG_SENDPAGE_NOTLAST))
> + if (copied)
>   tcp_push(sk, flags, mss_now, tp->nonagle);
>   return copied;
> 
> Regards,
> Willy
> 

Hmm, I'll have to check if this really can be reverted without hurting
vmsplice() again.





Re: Major network performance regression in 3.7

2013-01-06 Thread Willy Tarreau
On Sun, Jan 06, 2013 at 08:39:53AM -0800, Eric Dumazet wrote:
> Hmm, I'll have to check if this really can be reverted without hurting
> vmsplice() again.

Looking at the code I've been wondering whether we shouldn't transform
the condition to perform the push if we can't push more segments, but
I don't know what to rely on. It would be something like this :

   if (copied &&
  (!(flags & MSG_SENDPAGE_NOTLAST) || cant_push_more))
tcp_push(sk, flags, mss_now, tp->nonagle);

Willy



Re: Major network performance regression in 3.7

2013-01-06 Thread Willy Tarreau
Hi Eric,

On Sun, Jan 06, 2013 at 06:59:02AM -0800, Eric Dumazet wrote:
> On Sun, 2013-01-06 at 10:24 +0100, Willy Tarreau wrote:
> 
> > It does not change anything to the tests above unfortunately. It did not
> > even stabilize the unstable runs.
> > 
> > I'll check if I can spot the original commit which caused the regression
> > for MTUs that are not n*4096+52.
> 
> Since you don't post your program, I won't be able to help, just by
> guessing what it does...

Oh sorry, I didn't really want to pollute the list with links and configs,
especially during the initial report with various combined issues :-(

The client is my old "inject" tool, available here :

 http://git.1wt.eu/web?p=inject.git

The server is my "httpterm" tool, available here :

 http://git.1wt.eu/web?p=httpterm.git
 Use "-O3 -DENABLE_POLL -DENABLE_EPOLL -DENABLE_SPLICE" for CFLAGS.

I'm starting httpterm this way :
httpterm -D -L :8000 -P 256
=> it starts a server on port 8000, and sets pipe size to 256 kB. It
   uses SPLICE_F_MORE on output data but removing it did not fix the
    issue in one of the early tests.

Then I'm starting inject this way :
inject -o 1 -u 1 -G 0:8000/?s=1g
=> 1 user, 1 object at a time, and fetch /?s=1g from the loopback.
   The server will then emit 1 GB of data using splice().

It's possible to disable splicing on the server using -dS. The client
"eats" data using recv(MSG_TRUNC) to avoid a useless copy.

> TCP has very low defaults concerning initial window, and it appears you
> set RCVBUF to even smaller values.

Yes, you're right, my bootup scripts still change the default value, though
I increase them to larger values during the tests (except the one where you
saw win 8030 due to the default rmem set to 16060). I've been using this
value in the past with older kernels because it allowed an integer number
of segments to fit into the default window, and offered optimal performance
with large numbers of concurrent connections. Since 2.6, tcp_moderate_rcvbuf
works very well and this is not needed anymore.

Anyway, it does not affect the test here. Good kernels are OK whatever the
default value, and bad kernels are bad whatever the default value too.

Hmmm finally it's this commit again :

   2f53384 tcp: allow splice() to build full TSO packets

I'm saying "again" because we already diagnosed a similar effect several
months ago that was revealed by this patch and we fixed it with the
following  one, though I remember that we weren't completely sure it
would fix everything :

   bad115c tcp: do_tcp_sendpages() must try to push data out on oom conditions

Just out of curiosity, I tried to re-apply the patch above just after the
first one but it did not change anything (after all it changed a symptom
which appeared in different conditions).

Interestingly, this commit (2f53384) significantly improved performance
on spliced data over the loopback (more than 50% in this test). In 3.7,
it seems to have no positive effect anymore. I reverted it using the
following patch and now the problem is fixed (mtu=64k works fine now) :

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index e457c7a..61e4517 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -935,7 +935,7 @@ wait_for_memory:
}
 
 out:
-   if (copied && !(flags & MSG_SENDPAGE_NOTLAST))
+   if (copied)
tcp_push(sk, flags, mss_now, tp->nonagle);
return copied;

Regards,
Willy



Re: Major network performance regression in 3.7

2013-01-06 Thread Eric Dumazet
On Sun, 2013-01-06 at 10:24 +0100, Willy Tarreau wrote:

> It does not change anything to the tests above unfortunately. It did not
> even stabilize the unstable runs.
> 
> I'll check if I can spot the original commit which caused the regression
> for MTUs that are not n*4096+52.

Since you don't post your program, I won't be able to help, just by
guessing what it does...

TCP has very low defaults concerning initial window, and it appears you
set RCVBUF to even smaller values.

Here we can see "win 8030", this is not a sane value...

18:32:08.071602 IP 127.0.0.1.26792 > 127.0.0.1.8000: S 2036886615:2036886615(0) win 8030
18:32:08.071605 IP 127.0.0.1.8000 > 127.0.0.1.26792: S 126397113:126397113(0) ack 2036886616 win 8030

So you apparently changed /proc/sys/net/ipv4/tcp_rmem or SO_RCVBUF ?




Re: Major network performance regression in 3.7

2013-01-06 Thread Willy Tarreau
On Sun, Jan 06, 2013 at 11:25:25AM +0100, Willy Tarreau wrote:
> OK good news here, the performance drop on the myri was caused by a
> problem between the keyboard and the chair. After the reboot series,
> I forgot to reload the firmware so the driver used the less efficient
> firmware from the NIC (it performs just as if LRO is disabled).
> 
> That makes me think that I should try 3.8-rc2 since LRO was removed
> there :-/

Just for the record, I tested 3.8-rc2, and the myri works as fast with
GRO there as it used to work with LRO in previous kernels. The softirq
work has increased from 26 to 48% but there is no performance drop when
using GRO anymore. Andrew has done a good job !

Willy



Re: Major network performance regression in 3.7

2013-01-06 Thread Willy Tarreau
On Sun, Jan 06, 2013 at 12:46:58PM +0100, Romain Francoise wrote:
> Willy Tarreau  writes:
> 
> > That makes me think that I should try 3.8-rc2 since LRO was removed
> > there :-/
> 
> Better yet, find a way to automate these tests so they can run continually
> against net-next and find problems early...

There is no way scripts will plug cables and turn on sleeping hardware
unfortunately. I'm already following network updates closely enough to
spot occasional regressions that are naturally expected due to the number
of changes.

Also, automated tests won't easily report a behaviour analysis, and
behaviour is important in networking. You don't want to accept 100ms
pauses all the time for example (and that's just an example).

Right now my lab is simplified enough so that I can test something like
100 patches in a week-end, I think that's already fine.

Willy



Re: Major network performance regression in 3.7

2013-01-06 Thread Romain Francoise
Willy Tarreau  writes:

> That makes me think that I should try 3.8-rc2 since LRO was removed
> there :-/

Better yet, find a way to automate these tests so they can run continually
against net-next and find problems early...


Re: Major network performance regression in 3.7

2013-01-06 Thread Willy Tarreau
On Sun, Jan 06, 2013 at 10:24:35AM +0100, Willy Tarreau wrote:
> But before that I'll try to find the recent one causing the myri10ge to
> slow down, it should take less time to bisect.

OK good news here, the performance drop on the myri was caused by a
problem between the keyboard and the chair. After the reboot series,
I forgot to reload the firmware so the driver used the less efficient
firmware from the NIC (it performs just as if LRO is disabled).

That makes me think that I should try 3.8-rc2 since LRO was removed
there :-/

The only remaining issue really is the loopback then.

Cheers,
Willy



Re: Major network performance regression in 3.7

2013-01-06 Thread Willy Tarreau
On Sat, Jan 05, 2013 at 11:35:24PM -0800, Eric Dumazet wrote:
> On Sun, 2013-01-06 at 03:52 +0100, Willy Tarreau wrote:
> 
> > OK so I observed no change with this patch, either on the loopback
> > data rate at >16kB MTU, or on the myri. I'm keeping it at hand for
> > experimentation anyway.
> > 
> 
> Yeah, there was no bug. I rewrote it for net-next as a cleanup/optim
> only.

I have re-applied your last rewrite and noticed a small but nice
performance improvement on a single stream over the loopback :

1 session   10 sessions
  - without the patch :   55.8 Gbps   68.4 Gbps
  - with the patch:   56.4 Gbps   70.4 Gbps

This was with the loopback reverted to 16kB MTU of course.

> > Concerning the loopback MTU, I find it strange that the MTU changes
> > the splice() behaviour and not send/recv. I thought that there could
> > be a relation between the MTU and the pipe size, but it does not
> > appear to be the case either, as I tried various sizes between 16kB
> > and 256kB without achieving original performance.
> 
> 
> It probably is related to a too small receive window, given the MTU was
> multiplied by 4, I guess we need to make some adjustments

In fact even if I set it to 32kB it breaks.

I have tried to progressively increase the loopback's MTU from the default
16436, by steps of 4096 :

tcp_rmem = 256 kB   tcp_rmem = 256 kB
pipe size = 64 kB   pipe size = 256 kB

16436 : 55.8 Gbps   65.2 Gbps
20532 : 32..48 Gbps unstable24..45 Gbps unstable
24628 : 56.0 Gbps   66.4 Gbps
28724 : 58.6 Gbps   67.8 Gbps
32820 : 54.5 Gbps   61.7 Gbps
36916 : 56.8 Gbps   65.5 Gbps
41012 : 57.8..58.2 Gbps ~stable 67.5..68.8 Gbps ~stable
45108 : 59.4 Gbps   70.0 Gbps
49204 : 61.2 Gbps   71.1 Gbps
53300 : 58.8 Gbps   70.6 Gbps
57396 : 60.2 Gbps   70.8 Gbps
61492 : 61.4 Gbps   71.1 Gbps

tcp_rmem = 1 MB tcp_rmem = 1 MB
pipe size = 64 kB   pipe size = 256 kB

16436 : 16..34 Gbps unstable49.5 or 65.2 Gbps (unstable)
20532 :  7..15 Gbps unstable15..32 Gbps unstable
24628 : 36..48 Gbps unstable34..61 Gbps unstable
28724 : 40..51 Gbps unstable40..69 Gbps unstable
32820 : 40..55 Gbps unstable59.9..62.3 Gbps ~stable
36916 : 38..51 Gbps unstable66.0 Gbps
41012 : 30..42 Gbps unstable47..66 Gbps unstable
45108 : 59.5 Gbps   71.2 Gbps
49204 : 61.3 Gbps   74.0 Gbps
53300 : 63.1 Gbps   74.5 Gbps
57396 : 64.6 Gbps   74.7 Gbps
61492 : 61..66 Gbps unstable76.5 Gbps

So as long as we maintain the MTU to n*4096 + 52, performance is still
almost OK. It is interesting to see that the transfer rate is unstable
at many values and that it depends both on the rmem and pipe size, just
as if some segments sometimes remained stuck for too long.

And if I pick a value which does not match n*4096+52, such as
61492+2048 = 63540, then the transfer falls to about 50-100 Mbps again.

So there's clearly something related to the copy of segments from
incomplete pages instead of passing them via the pipe.

It is possible that this bug has been there for a long time and that
we never detected it because nobody plays with the loopback MTU.

I have tried with 2.6.35 :

16436 : 31..33 Gbps
61492 : 48..50 Gbps
63540 : 50..53 Gbps  => so at least it's not affected

Even forcing the MTU to 16384 maintains 30..33 Gbps almost stable.

On 3.5.7.2 :

16436 : 23..27 Gbps
61492 : 61..64 Gbps
63540 : 40..100 Mbps  => the problem was already there.

Since there were many splice changes in 3.5, I'd suspect that the issue
appeared there though I could be wrong.

> You also could try :
> 
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 1ca2536..b68cdfb 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -1482,6 +1482,9 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t 
> *desc,
>   break;
>   }
>   used = recv_actor(desc, skb, offset, len);
> + /* Clean up data we have read: This will do ACK frames. 
> */
> + if (used > 0)
> + tcp_cleanup_rbuf(sk, used);
>   if (used < 0) {
>   if (!copied)
>   copied = used;

It does not change anything to the tests above unfortunately. It did not
even stabilize the unstable runs.

I'll check if I can spot the original commit which caused the regression
for MTUs that are not n*4096+52.

But before that I'll try to find the recent one causing the myri10ge to
slow down, it should take less time to bisect.

Regards,
Willy

Re: Major network performance regression in 3.7

2013-01-06 Thread Willy Tarreau
On Sat, Jan 05, 2013 at 11:35:24PM -0800, Eric Dumazet wrote:
 On Sun, 2013-01-06 at 03:52 +0100, Willy Tarreau wrote:
 
  OK so I observed no change with this patch, either on the loopback
  data rate at 16kB MTU, or on the myri. I'm keeping it at hand for
  experimentation anyway.
  
 
 Yeah, there was no bug. I rewrote it for net-next as a cleanup/optim
 only.

I have re-applied your last rewrite and noticed a small but nice
performance improvement on a single stream over the loopback :

1 session   10 sessions
  - without the patch :   55.8 Gbps   68.4 Gbps
  - with the patch:   56.4 Gbps   70.4 Gbps

This was with the loopback reverted to 16kB MTU of course.

  Concerning the loopback MTU, I find it strange that the MTU changes
  the splice() behaviour and not send/recv. I thought that there could
  be a relation between the MTU and the pipe size, but it does not
  appear to be the case either, as I tried various sizes between 16kB
  and 256kB without achieving original performance.
 
 
 It probably is related to a too small receive window, given the MTU was
 multiplied by 4, I guess we need to make some adjustments

In fact even if I set it to 32kB it breaks.

I have tried to progressively increase the loopback's MTU from the default
16436, by steps of 4096 :

tcp_rmem = 256 kB   tcp_rmem = 256 kB
pipe size = 64 kB   pipe size = 256 kB

16436 : 55.8 Gbps   65.2 Gbps
20532 : 32..48 Gbps unstable24..45 Gbps unstable
24628 : 56.0 Gbps   66.4 Gbps
28724 : 58.6 Gbps   67.8 Gbps
32820 : 54.5 Gbps   61.7 Gbps
36916 : 56.8 Gbps   65.5 Gbps
41012 : 57.8..58.2 Gbps ~stable 67.5..68.8 Gbps ~stable
45108 : 59.4 Gbps   70.0 Gbps
49204 : 61.2 Gbps   71.1 Gbps
53300 : 58.8 Gbps   70.6 Gbps
57396 : 60.2 Gbps   70.8 Gbps
61492 : 61.4 Gbps   71.1 Gbps

tcp_rmem = 1 MB tcp_rmem = 1 MB
pipe size = 64 kB   pipe size = 256 kB

16436 : 16..34 Gbps unstable49.5 or 65.2 Gbps (unstable)
20532 :  7..15 Gbps unstable15..32 Gbps unstable
24628 : 36..48 Gbps unstable34..61 Gbps unstable
28724 : 40..51 Gbps unstable40..69 Gbps unstable
32820 : 40..55 Gbps unstable59.9..62.3 Gbps ~stable
36916 : 38..51 Gbps unstable66.0 Gbps
41012 : 30..42 Gbps unstable47..66 Gbps unstable
45108 : 59.5 Gbps   71.2 Gbps
49204 : 61.3 Gbps   74.0 Gbps
53300 : 63.1 Gbps   74.5 Gbps
57396 : 64.6 Gbps   74.7 Gbps
61492 : 61..66 Gbps unstable76.5 Gbps

So as long as we maintain the MTU to n*4096 + 52, performance is still
almost OK. It is interesting to see that the transfer rate is unstable
at many values and that it depends both on the rmem and pipe size, just
as if some segments sometimes remained stuck for too long.

And if I pick a value which does not match n*4096+52, such as
61492+2048 = 63540, then the transfer falls to about 50-100 Mbps again.

So there's clearly something related to the copy of segments from
incomplete pages instead of passing them via the pipe.

It is possible that this bug has been there for a long time and that
we never detected it because nobody plays with the loopback MTU.

I have tried with 2.6.35 :

16436 : 31..33 Gbps
61492 : 48..50 Gbps
63540 : 50..53 Gbps  = so at least it's not affected

Even forcing the MTU to 16384 maintains 30..33 Gbps almost stable.

On 3.5.7.2 :

16436 : 23..27 Gbps
61492 : 61..64 Gbps
63540 : 40..100 Mbps  = the problem was already there.

Since there were many splice changes in 3.5, I'd suspect that the issue
appeared there though I could be wrong.

 You also could try :
 
 diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
 index 1ca2536..b68cdfb 100644
 --- a/net/ipv4/tcp.c
 +++ b/net/ipv4/tcp.c
 @@ -1482,6 +1482,9 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t 
 *desc,
   break;
   }
   used = recv_actor(desc, skb, offset, len);
 + /* Clean up data we have read: This will do ACK frames. 
 */
 + if (used  0)
 + tcp_cleanup_rbuf(sk, used);
   if (used  0) {
   if (!copied)
   copied = used;

It does not change anything to the tests above unfortunately. It did not
even stabilize the unstable runs.

I'll check if I can spot the original commit which caused the regression
for MTUs that are not n*4096+52.

But before that I'll try to find the recent one causing the myri10ge to
slow down, it should take less time to bisect.

Regards,

Re: Major network performance regression in 3.7

2013-01-06 Thread Willy Tarreau
On Sun, Jan 06, 2013 at 10:24:35AM +0100, Willy Tarreau wrote:
 But before that I'll try to find the recent one causing the myri10ge to
 slow down, it should take less time to bisect.

OK good news here, the performance drop on the myri was caused by a
problem between the keyboard and the chair. After the reboot series,
I forgot to reload the firmware so the driver used the less efficient
firmware from the NIC (it performs just as if LRO is disabled).

That makes me think that I should try 3.8-rc2 since LRO was removed
there :-/

The only remaining issue really is the loopback then.

Cheers,
Willy

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Major network performance regression in 3.7

2013-01-06 Thread Romain Francoise
Willy Tarreau w...@1wt.eu writes:

 That makes me think that I should try 3.8-rc2 since LRO was removed
 there :-/

Better yet, find a way to automate these tests so they can run continually
against net-next and find problems early...
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Major network performance regression in 3.7

2013-01-06 Thread Willy Tarreau
On Sun, Jan 06, 2013 at 12:46:58PM +0100, Romain Francoise wrote:
 Willy Tarreau w...@1wt.eu writes:
 
  That makes me think that I should try 3.8-rc2 since LRO was removed
  there :-/
 
 Better yet, find a way to automate these tests so they can run continually
 against net-next and find problems early...

There is no way scripts will plug cables and turn on sleeping hardware
unfortunately. I'm already following network updates closely enough to
spot occasional regressions that are naturally expected due to the number
of changes.

Also, automated tests won't easily report a behaviour analysis, and
behaviour is important in networking. You don't want to accept 100ms
pauses all the time for example (and that's just an example).

Right now my lab is simplified enough so that I can test something like
100 patches in a week-end, I think that's already fine.

Willy



Re: Major network performance regression in 3.7

2013-01-06 Thread Willy Tarreau
On Sun, Jan 06, 2013 at 11:25:25AM +0100, Willy Tarreau wrote:
 OK good news here, the performance drop on the myri was caused by a
 problem between the keyboard and the chair. After the reboot series,
 I forgot to reload the firmware so the driver used the less efficient
 firmware from the NIC (it performs just as if LRO is disabled).
 
 That makes me think that I should try 3.8-rc2 since LRO was removed
 there :-/

Just for the record, I tested 3.8-rc2, and the myri works as fast with
GRO there as it used to work with LRO in previous kernels. The softirq
work has increased from 26 to 48% but there is no performance drop when
using GRO anymore. Andrew has done a good job !

Willy



Re: Major network performance regression in 3.7

2013-01-06 Thread Eric Dumazet
On Sun, 2013-01-06 at 10:24 +0100, Willy Tarreau wrote:

 It does not change anything to the tests above unfortunately. It did not
 even stabilize the unstable runs.
 
 I'll check if I can spot the original commit which caused the regression
 for MTUs that are not n*4096+52.

Since you don't post your program, I won't be able to help, just by
guessing what it does...

TCP has very low defaults concerning initial window, and it appears you
set RCVBUF to even smaller values.

Here we can see win 8030, this is not a sane value...

18:32:08.071602 IP 127.0.0.1.26792 > 127.0.0.1.8000: S 2036886615:2036886615(0) 
win 8030 <mss 65495,nop,nop,sackOK,nop,wscale 9>
18:32:08.071605 IP 127.0.0.1.8000 > 127.0.0.1.26792: S 126397113:126397113(0) 
ack 2036886616 win 8030 <mss 65495,nop,nop,sackOK,nop,wscale 9>

So you apparently changed /proc/sys/net/ipv4/tcp_rmem or SO_RCVBUF ?




Re: Major network performance regression in 3.7

2013-01-06 Thread Willy Tarreau
Hi Eric,

On Sun, Jan 06, 2013 at 06:59:02AM -0800, Eric Dumazet wrote:
 On Sun, 2013-01-06 at 10:24 +0100, Willy Tarreau wrote:
 
  It does not change anything to the tests above unfortunately. It did not
  even stabilize the unstable runs.
  
  I'll check if I can spot the original commit which caused the regression
  for MTUs that are not n*4096+52.
 
 Since you don't post your program, I won't be able to help, just by
 guessing what it does...

Oh sorry, I didn't really want to pollute the list with links and configs,
especially during the initial report with various combined issues :-(

The client is my old inject tool, available here :

 http://git.1wt.eu/web?p=inject.git

The server is my httpterm tool, available here :

 http://git.1wt.eu/web?p=httpterm.git
 Use -O3 -DENABLE_POLL -DENABLE_EPOLL -DENABLE_SPLICE for CFLAGS.

I'm starting httpterm this way :
httpterm -D -L :8000 -P 256
=> it starts a server on port 8000, and sets pipe size to 256 kB. It
   uses SPLICE_F_MORE on output data but removing it did not fix the
   issue in one of the early tests.

Then I'm starting inject this way :
inject -o 1 -u 1 -G 0:8000/?s=1g
=> 1 user, 1 object at a time, and fetch /?s=1g from the loopback.
   The server will then emit 1 GB of data using splice().

It's possible to disable splicing on the server using -dS. The client
eats data using recv(MSG_TRUNC) to avoid a useless copy.

 TCP has very low defaults concerning initial window, and it appears you
 set RCVBUF to even smaller values.

Yes, you're right, my bootup scripts still change the default value, though
I increase them to larger values during the tests (except the one where you
saw win 8030 due to the default rmem set to 16060). I've been using this
value in the past with older kernels because it allowed an integer number
of segments to fit into the default window, and offered optimal performance
with large numbers of concurrent connections. Since 2.6, tcp_moderate_rcvbuf
works very well and this is not needed anymore.

Anyway, it does not affect the test here. Good kernels are OK whatever the
default value, and bad kernels are bad whatever the default value too.

Hmmm finally it's this commit again :

   2f53384 tcp: allow splice() to build full TSO packets

I'm saying again because we already diagnosed a similar effect several
months ago that was revealed by this patch and we fixed it with the
following  one, though I remember that we weren't completely sure it
would fix everything :

   bad115c tcp: do_tcp_sendpages() must try to push data out on oom conditions

Just out of curiosity, I tried to re-apply the patch above just after the
first one but it did not change anything (after all it changed a symptom
which appeared in different conditions).

Interestingly, this commit (2f53384) significantly improved performance
on spliced data over the loopback (more than 50% in this test). In 3.7,
it seems to have no positive effect anymore. I reverted it using the
following patch and now the problem is fixed (mtu=64k works fine now) :

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index e457c7a..61e4517 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -935,7 +935,7 @@ wait_for_memory:
}
 
 out:
-   if (copied && !(flags & MSG_SENDPAGE_NOTLAST))
+   if (copied)
        tcp_push(sk, flags, mss_now, tp->nonagle);
return copied;

Regards,
Willy



Re: Major network performance regression in 3.7

2013-01-06 Thread Willy Tarreau
On Sun, Jan 06, 2013 at 08:39:53AM -0800, Eric Dumazet wrote:
 Hmm, I'll have to check if this really can be reverted without hurting
 vmsplice() again.

Looking at the code I've been wondering whether we shouldn't transform
the condition to perform the push if we can't push more segments, but
I don't know what to rely on. It would be something like this :

   if (copied &&
       (!(flags & MSG_SENDPAGE_NOTLAST) || cant_push_more))
           tcp_push(sk, flags, mss_now, tp->nonagle);

Willy



Re: Major network performance regression in 3.7

2013-01-06 Thread Eric Dumazet
On Sun, 2013-01-06 at 16:51 +0100, Willy Tarreau wrote:
 Hi Eric,
 

 Oh sorry, I didn't really want to pollute the list with links and configs,
 especially during the initial report with various combined issues :-(
 
 The client is my old inject tool, available here :
 
  http://git.1wt.eu/web?p=inject.git
 
 The server is my httpterm tool, available here :
 
  http://git.1wt.eu/web?p=httpterm.git
  Use -O3 -DENABLE_POLL -DENABLE_EPOLL -DENABLE_SPLICE for CFLAGS.
 
 I'm starting httpterm this way :
 httpterm -D -L :8000 -P 256
 => it starts a server on port 8000, and sets pipe size to 256 kB. It
    uses SPLICE_F_MORE on output data but removing it did not fix the
    issue in one of the early tests.
 
 Then I'm starting inject this way :
 inject -o 1 -u 1 -G 0:8000/?s=1g
 => 1 user, 1 object at a time, and fetch /?s=1g from the loopback.
The server will then emit 1 GB of data using splice().
 
 It's possible to disable splicing on the server using -dS. The client
 eats data using recv(MSG_TRUNC) to avoid a useless copy.
 
  TCP has very low defaults concerning initial window, and it appears you
  set RCVBUF to even smaller values.
 
 Yes, you're right, my bootup scripts still change the default value, though
 I increase them to larger values during the tests (except the one where you
 saw win 8030 due to the default rmem set to 16060). I've been using this
 value in the past with older kernels because it allowed an integer number
 of segments to fit into the default window, and offered optimal performance
 with large numbers of concurrent connections. Since 2.6, tcp_moderate_rcvbuf
 works very well and this is not needed anymore.
 
 Anyway, it does not affect the test here. Good kernels are OK whatever the
 default value, and bad kernels are bad whatever the default value too.
 
 Hmmm finally it's this commit again :
 
2f53384 tcp: allow splice() to build full TSO packets
 
 I'm saying again because we already diagnosed a similar effect several
 months ago that was revealed by this patch and we fixed it with the
 following  one, though I remember that we weren't completely sure it
 would fix everything :
 
bad115c tcp: do_tcp_sendpages() must try to push data out on oom conditions
 
 Just out of curiosity, I tried to re-apply the patch above just after the
 first one but it did not change anything (after all it changed a symptom
 which appeared in different conditions).
 
 Interestingly, this commit (2f53384) significantly improved performance
 on spliced data over the loopback (more than 50% in this test). In 3.7,
 it seems to have no positive effect anymore. I reverted it using the
 following patch and now the problem is fixed (mtu=64k works fine now) :
 
 diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
 index e457c7a..61e4517 100644
 --- a/net/ipv4/tcp.c
 +++ b/net/ipv4/tcp.c
 @@ -935,7 +935,7 @@ wait_for_memory:
   }
  
  out:
 - if (copied && !(flags & MSG_SENDPAGE_NOTLAST))
 + if (copied)
   tcp_push(sk, flags, mss_now, tp->nonagle);
   return copied;
 
 Regards,
 Willy
 

Hmm, I'll have to check if this really can be reverted without hurting
vmsplice() again.





Re: Major network performance regression in 3.7

2013-01-06 Thread Eric Dumazet
On Sun, 2013-01-06 at 17:44 +0100, Willy Tarreau wrote:
 On Sun, Jan 06, 2013 at 08:39:53AM -0800, Eric Dumazet wrote:
  Hmm, I'll have to check if this really can be reverted without hurting
  vmsplice() again.
 
 Looking at the code I've been wondering whether we shouldn't transform
 the condition to perform the push if we can't push more segments, but
 I don't know what to rely on. It would be something like this :
 
    if (copied &&
       (!(flags & MSG_SENDPAGE_NOTLAST) || cant_push_more))
         tcp_push(sk, flags, mss_now, tp->nonagle);

Good point !

Maybe the following fix then ?


diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 1ca2536..7ba0717 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -941,8 +941,10 @@ out:
return copied;
 
 do_error:
-   if (copied)
+   if (copied) {
+   flags &= ~MSG_SENDPAGE_NOTLAST;
goto out;
+   }
 out_err:
return sk_stream_error(sk, flags, err);
 }




Re: Major network performance regression in 3.7

2013-01-06 Thread Willy Tarreau
On Sun, Jan 06, 2013 at 09:10:55AM -0800, Eric Dumazet wrote:
 On Sun, 2013-01-06 at 17:44 +0100, Willy Tarreau wrote:
  On Sun, Jan 06, 2013 at 08:39:53AM -0800, Eric Dumazet wrote:
   Hmm, I'll have to check if this really can be reverted without hurting
   vmsplice() again.
  
  Looking at the code I've been wondering whether we shouldn't transform
  the condition to perform the push if we can't push more segments, but
  I don't know what to rely on. It would be something like this :
  
  if (copied &&
     (!(flags & MSG_SENDPAGE_NOTLAST) || cant_push_more))
   tcp_push(sk, flags, mss_now, tp->nonagle);
 
 Good point !
 
 Maybe the following fix then ?
 
 
 diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
 index 1ca2536..7ba0717 100644
 --- a/net/ipv4/tcp.c
 +++ b/net/ipv4/tcp.c
 @@ -941,8 +941,10 @@ out:
   return copied;
  
  do_error:
 - if (copied)
 + if (copied) {
 + flags &= ~MSG_SENDPAGE_NOTLAST;
   goto out;
 + }
  out_err:
   return sk_stream_error(sk, flags, err);
  }

Unfortunately it does not work any better, which means to me
that we don't leave via this code path. I tried other tricks
which failed too. I need to understand this part better before
randomly fiddling with it.

Willy



Re: Major network performance regression in 3.7

2013-01-06 Thread Eric Dumazet
On Sun, 2013-01-06 at 18:35 +0100, Willy Tarreau wrote:

 Unfortunately it does not work any better, which means to me
 that we don't leave via this code path. I tried other tricks
 which failed too. I need to understand this part better before
 randomly fiddling with it.
 

OK, now I have your test program, I can work on a fix, don't worry ;)

The MSG_SENDPAGE_NOTLAST logic needs to be tweaked.




Re: Major network performance regression in 3.7

2013-01-06 Thread Eric Dumazet
On Sun, 2013-01-06 at 10:39 -0800, Eric Dumazet wrote:
 On Sun, 2013-01-06 at 18:35 +0100, Willy Tarreau wrote:
 
  Unfortunately it does not work any better, which means to me
  that we don't leave via this code path. I tried other tricks
  which failed too. I need to understand this part better before
  randomly fiddling with it.
  
 
 OK, now I have your test program, I can work on a fix, don't worry ;)
 
 The MSG_SENDPAGE_NOTLAST logic needs to be tweaked.
 


(sd->len is usually 4096, which is expected, but sd->total_len value is
huge in your case, so we always set the flag in fs/splice.c)




Re: Major network performance regression in 3.7

2013-01-06 Thread Eric Dumazet

 
 (sd->len is usually 4096, which is expected, but sd->total_len value is
 huge in your case, so we always set the flag in fs/splice.c)

I am testing :

   if (sd->len < sd->total_len && pipe->nrbufs > 1)
           more |= MSG_SENDPAGE_NOTLAST;





Re: Major network performance regression in 3.7

2013-01-06 Thread Eric Dumazet
On Sun, 2013-01-06 at 10:51 -0800, Eric Dumazet wrote:
  
  (sd->len is usually 4096, which is expected, but sd->total_len value is
  huge in your case, so we always set the flag in fs/splice.c)
 
 I am testing :
 
 if (sd->len < sd->total_len && pipe->nrbufs > 1)
 more |= MSG_SENDPAGE_NOTLAST;
 

Yes, this should fix the problem :

If there is no following buffer in the pipe, we should not set NOTLAST.

diff --git a/fs/splice.c b/fs/splice.c
index 8890604..6909d89 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -696,8 +696,10 @@ static int pipe_to_sendpage(struct pipe_inode_info *pipe,
return -EINVAL;
 
 more = (sd->flags & SPLICE_F_MORE) ? MSG_MORE : 0;
-   if (sd->len < sd->total_len)
+
+   if (sd->len < sd->total_len && pipe->nrbufs > 1)
 more |= MSG_SENDPAGE_NOTLAST;
+
 return file->f_op->sendpage(file, buf->page, buf->offset,
 sd->len, pos, more);
 }




Re: Major network performance regression in 3.7

2013-01-06 Thread Willy Tarreau
On Sun, Jan 06, 2013 at 11:00:15AM -0800, Eric Dumazet wrote:
 On Sun, 2013-01-06 at 10:51 -0800, Eric Dumazet wrote:
   
    (sd->len is usually 4096, which is expected, but sd->total_len value is
   huge in your case, so we always set the flag in fs/splice.c)
  
  I am testing :
  
  if (sd->len < sd->total_len && pipe->nrbufs > 1)
  more |= MSG_SENDPAGE_NOTLAST;
  
 
 Yes, this should fix the problem :
 
 If there is no following buffer in the pipe, we should not set NOTLAST.
 
 diff --git a/fs/splice.c b/fs/splice.c
 index 8890604..6909d89 100644
 --- a/fs/splice.c
 +++ b/fs/splice.c
 @@ -696,8 +696,10 @@ static int pipe_to_sendpage(struct pipe_inode_info *pipe,
   return -EINVAL;
  
  more = (sd->flags & SPLICE_F_MORE) ? MSG_MORE : 0;
 - if (sd->len < sd->total_len)
 +
 + if (sd->len < sd->total_len && pipe->nrbufs > 1)
   more |= MSG_SENDPAGE_NOTLAST;
 +
   return file->f_op->sendpage(file, buf->page, buf->offset,
   sd->len, pos, more);
  }
 
OK it works like a charm here now ! I can't break it anymore, so it
looks like you finally got it !

I noticed that the data rate was higher when the loopback's MTU
is exactly a multiple of 4096 (making the 64k choice optimal)
while I would have assumed that in order to efficiently splice
TCP segments, we'd need to have some space for IP/TCP headers
and n*4k for the payload.

I also got the transfer freezes again a few times when starting
tcpdump on the server, but this is not 100% reproducible I'm afraid.
So I'll bring this back when I manage to get some analysable pattern.

The spliced transfer through all the chain haproxy works fine again
at 10gig with your fix. The issue is closed for me. Feel free to add
my Tested-By if you want.

Thank you Eric :-)
Willy



Re: Major network performance regression in 3.7

2013-01-06 Thread Eric Dumazet
On Sun, 2013-01-06 at 20:34 +0100, Willy Tarreau wrote:

 OK it works like a charm here now ! I can't break it anymore, so it
 looks like you finally got it !
 
 I noticed that the data rate was higher when the loopback's MTU
 is exactly a multiple of 4096 (making the 64k choice optimal)
 while I would have assumed that in order to efficiently splice
 TCP segments, we'd need to have some space for IP/TCP headers
 and n*4k for the payload.
 
 I also got the transfer freezes again a few times when starting
 tcpdump on the server, but this is not 100% reproducible I'm afraid.
 So I'll bring this back when I manage to get some analysable pattern.
 
 The spliced transfer through all the chain haproxy works fine again
 at 10gig with your fix. The issue is closed for me. Feel free to add
 my Tested-By if you want.
 

Good to know !

What is the max speed you get now ?




Re: Major network performance regression in 3.7

2013-01-06 Thread Willy Tarreau
On Sun, Jan 06, 2013 at 11:39:31AM -0800, Eric Dumazet wrote:
 On Sun, 2013-01-06 at 20:34 +0100, Willy Tarreau wrote:
 
  OK it works like a charm here now ! I can't break it anymore, so it
  looks like you finally got it !
  
  I noticed that the data rate was higher when the loopback's MTU
  is exactly a multiple of 4096 (making the 64k choice optimal)
  while I would have assumed that in order to efficiently splice
  TCP segments, we'd need to have some space for IP/TCP headers
  and n*4k for the payload.
  
  I also got the transfer freezes again a few times when starting
  tcpdump on the server, but this is not 100% reproducible I'm afraid.
  So I'll bring this back when I manage to get some analysable pattern.
  
  The spliced transfer through all the chain haproxy works fine again
  at 10gig with your fix. The issue is closed for me. Feel free to add
  my Tested-By if you want.
  
 
 Good to know !
 
 What is the max speed you get now ?

Line rate with 1500 MTU and LRO enabled :

#   time   eth1(ikb  ipk okb  opk)eth2(ikb   ipk  okbopk) 

1357060023 19933.3 41527.7 9355538.2 62167.7  9757888.1 808701.1 19400.3 40417.7
1357060024 26124.1 54425.5 9290064.9 48804.4  9778294.0 810210.0 18068.8 37643.3
1357060025 27015.2 56281.1 9296115.3 46868.8  9797125.9 811271.1 8790.1 18312.2 
1357060026 27556.0 57408.8 9291701.4 46805.5  9805371.6 811410.0 3494.8 7280.0 
1357060027 27577.0 57452.2 9293606.8 46804.4  9806122.3 811314.4 2558.7 5330.0 
1357060028 27476.1 57242.2 9296885.4 46830.0  9794537.3 810527.7 2516.1 5242.2 
   ^^^^^^
   kbps out   kbps in
eth1=facing the client
eth2=facing the server

Top reports the following usage :

Cpu0  :  0.0%us,  0.0%sy,  0.0%ni, 31.7%id,  0.0%wa,  0.0%hi, 68.3%si,  0.0%st
Cpu1  :  1.0%us, 37.3%sy,  0.0%ni, 61.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

(IRQ bound to cpu 0, haproxy to cpu 1)

This is a core2duo 2.66 GHz and the myris are 1st generation.

BTW I was very happy to see that the LRO-GRO conversion patches in 3.8-rc2
don't affect byte rate anymore (just a minor CPU usage increase but this is
not critical here), now I won't complain about it being slower anymore, you
won :-)


With the GRO patches backported, still at 1500 MTU but with GRO now :

Cpu0  :  0.0%us,  0.0%sy,  0.0%ni, 28.7%id,  0.0%wa,  0.0%hi, 71.3%si,  0.0%st
Cpu1  :  0.0%us, 37.6%sy,  0.0%ni, 62.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

#   time   eth1(ikb  ipk okb  opk)eth2(ikb   ipk  okbopk) 
1357058637 18319.3 38165.5 9401736.3 65159.9  9761613.4 808963.3 19403.6 40424.4
1357058638 20009.8 41687.7 9400903.7 62706.6  9770555.8 809522.2 18696.5 38951.1
1357058639 25439.5 52999.9 9301635.3 50267.7  9773666.7 809721.1 19174.1 39946.6
1357058640 26808.2 55850.0 9298301.4 46876.6  9790470.1 810843.3 12408.7 25851.1
1357058641 27110.9 56481.1 9297009.2 46832.2  9803308.4 811339.9 5692.5 11859.9
1357058642 27411.1 57106.6 9291419.2 46796.6  9806846.5 811378.8 2804.4 5842.2

This kernel is getting really good :-)

Cheers,
Willy



Re: Major network performance regression in 3.7

2013-01-06 Thread Willy Tarreau
On Sun, Jan 06, 2013 at 04:49:35PM -0500, John Stoffel wrote:
  Willy == Willy Tarreau w...@1wt.eu writes:
 
 Willy On Sun, Jan 06, 2013 at 11:00:15AM -0800, Eric Dumazet wrote:
  On Sun, 2013-01-06 at 10:51 -0800, Eric Dumazet wrote:

(sd->len is usually 4096, which is expected, but sd->total_len value is
huge in your case, so we always set the flag in fs/splice.c)
   
   I am testing :
   
  if (sd->len < sd->total_len && pipe->nrbufs > 1)
   more |= MSG_SENDPAGE_NOTLAST;
   
  
  Yes, this should fix the problem :
  
  If there is no following buffer in the pipe, we should not set NOTLAST.
  
  diff --git a/fs/splice.c b/fs/splice.c
  index 8890604..6909d89 100644
  --- a/fs/splice.c
  +++ b/fs/splice.c
  @@ -696,8 +696,10 @@ static int pipe_to_sendpage(struct pipe_inode_info 
  *pipe,
  return -EINVAL;
  
  more = (sd->flags & SPLICE_F_MORE) ? MSG_MORE : 0;
 -  if (sd->len < sd->total_len)
 +
 +  if (sd->len < sd->total_len && pipe->nrbufs > 1)
  more |= MSG_SENDPAGE_NOTLAST;
 +
  return file->f_op->sendpage(file, buf->page, buf->offset,
 sd->len, pos, more);
  }
  
 Willy OK it works like a charm here now ! I can't break it anymore, so it
 Willy looks like you finally got it !
 
 It's still broken, there's no comments in the code to explain all this
 magic to mere mortals!  *grin*

I would generally agree, but when Eric fixes such a thing, he
generally goes with lengthy details in the commit message.

Willy



Re: Major network performance regression in 3.7

2013-01-06 Thread John Stoffel
 Willy == Willy Tarreau w...@1wt.eu writes:

Willy On Sun, Jan 06, 2013 at 04:49:35PM -0500, John Stoffel wrote:
  Willy == Willy Tarreau w...@1wt.eu writes:
 
Willy On Sun, Jan 06, 2013 at 11:00:15AM -0800, Eric Dumazet wrote:
  On Sun, 2013-01-06 at 10:51 -0800, Eric Dumazet wrote:

(sd->len is usually 4096, which is expected, but sd->total_len value is
huge in your case, so we always set the flag in fs/splice.c)
   
   I am testing :
   
  if (sd->len < sd->total_len && pipe->nrbufs > 1)
   more |= MSG_SENDPAGE_NOTLAST;
   
  
  Yes, this should fix the problem :
  
  If there is no following buffer in the pipe, we should not set NOTLAST.
  
  diff --git a/fs/splice.c b/fs/splice.c
  index 8890604..6909d89 100644
  --- a/fs/splice.c
  +++ b/fs/splice.c
  @@ -696,8 +696,10 @@ static int pipe_to_sendpage(struct pipe_inode_info 
  *pipe,
  return -EINVAL;
  
  more = (sd->flags & SPLICE_F_MORE) ? MSG_MORE : 0;
  - if (sd->len < sd->total_len)
  +
  + if (sd->len < sd->total_len && pipe->nrbufs > 1)
  more |= MSG_SENDPAGE_NOTLAST;
  +
  return file->f_op->sendpage(file, buf->page, buf->offset,
 sd->len, pos, more);
  }
 
Willy OK it works like a charm here now ! I can't break it anymore, so it
Willy looks like you finally got it !
 
 It's still broken, there's no comments in the code to explain all this
 magic to mere mortals!  *grin*

Willy I would generally agree, but when Eric fixes such a thing, he
Willy generally goes with lengthy details in the commit message.

I'm sure he will too, I just wanted to nudge him because while I sorta
followed this discussion, I see lots of pain down the road if the code
wasn't updated with some nice big fat comments.

Great job finding this code and testing, testing, testing.

John



Re: Major network performance regression in 3.7

2013-01-06 Thread John Stoffel
 Willy == Willy Tarreau w...@1wt.eu writes:

Willy On Sun, Jan 06, 2013 at 11:00:15AM -0800, Eric Dumazet wrote:
 On Sun, 2013-01-06 at 10:51 -0800, Eric Dumazet wrote:
   
   (sd->len is usually 4096, which is expected, but sd->total_len value is
   huge in your case, so we always set the flag in fs/splice.c)
  
  I am testing :
  
 if (sd->len < sd->total_len && pipe->nrbufs > 1)
  more |= MSG_SENDPAGE_NOTLAST;
  
 
 Yes, this should fix the problem :
 
 If there is no following buffer in the pipe, we should not set NOTLAST.
 
 diff --git a/fs/splice.c b/fs/splice.c
 index 8890604..6909d89 100644
 --- a/fs/splice.c
 +++ b/fs/splice.c
 @@ -696,8 +696,10 @@ static int pipe_to_sendpage(struct pipe_inode_info 
 *pipe,
 return -EINVAL;
 
  more = (sd->flags & SPLICE_F_MORE) ? MSG_MORE : 0;
  - if (sd->len < sd->total_len)
  +
  + if (sd->len < sd->total_len && pipe->nrbufs > 1)
  more |= MSG_SENDPAGE_NOTLAST;
  +
  return file->f_op->sendpage(file, buf->page, buf->offset,
 sd->len, pos, more);
 }
 
Willy OK it works like a charm here now ! I can't break it anymore, so it
Willy looks like you finally got it !

It's still broken, there's no comments in the code to explain all this
magic to mere mortals!  *grin*

John


Re: Major network performance regression in 3.7

2013-01-05 Thread Eric Dumazet
On Sun, 2013-01-06 at 03:52 +0100, Willy Tarreau wrote:

> OK so I observed no change with this patch, either on the loopback
> data rate at >16kB MTU, or on the myri. I'm keeping it at hand for
> experimentation anyway.
> 

Yeah, there was no bug. I rewrote it for net-next as a cleanup/optim
only.

> Concerning the loopback MTU, I find it strange that the MTU changes
> the splice() behaviour and not send/recv. I thought that there could
> be a relation between the MTU and the pipe size, but it does not
> appear to be the case either, as I tried various sizes between 16kB
> and 256kB without achieving original performance.


It probably is related to a too small receive window, given the MTU was
multiplied by 4, I guess we need to make some adjustments

You also could try :

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 1ca2536..b68cdfb 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1482,6 +1482,9 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
break;
}
used = recv_actor(desc, skb, offset, len);
+   /* Clean up data we have read: This will do ACK frames. */
+   if (used > 0)
+   tcp_cleanup_rbuf(sk, used);
if (used < 0) {
if (!copied)
copied = used;




Re: Major network performance regression in 3.7

2013-01-05 Thread Willy Tarreau
On Sat, Jan 05, 2013 at 06:16:31PM -0800, Eric Dumazet wrote:
> On Sat, 2013-01-05 at 17:51 -0800, Eric Dumazet wrote:
> > On Sat, 2013-01-05 at 17:40 -0800, Eric Dumazet wrote:
> > > On Sun, 2013-01-06 at 02:30 +0100, Willy Tarreau wrote:
> > > 
> > > > Ah interesting because these were some of the mm patches that I had
> > > > tried to revert.
> > > 
> > > Hmm, or we should fix __skb_splice_bits()
> > > 
> > > I'll send a patch.
> > > 
> > 
> > Could you try the following ?
> 
> Or more exactly...
> 
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 3ab989b..01f222c 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -1736,11 +1736,8 @@ static bool __splice_segment(struct page *page, unsigned int poff,
>   return false;
>   }
>  
> - /* ignore any bits we already processed */
> - if (*off) {
> - __segment_seek(&page, &poff, &plen, *off);
> - *off = 0;
> - }
> + __segment_seek(&page, &poff, &plen, *off);
> + *off = 0;
>  
>   do {
>   unsigned int flen = min(*len, plen);
> @@ -1768,14 +1765,15 @@ static bool __skb_splice_bits(struct sk_buff *skb, struct pipe_inode_info *pipe,
> struct splice_pipe_desc *spd, struct sock *sk)
>  {
>   int seg;
> + struct page *page = virt_to_page(skb->data);
> + unsigned int poff = skb->data - (unsigned char *)page_address(page);
>  
>   /* map the linear part :
>* If skb->head_frag is set, this 'linear' part is backed by a
>* fragment, and if the head is not shared with any clones then
>* we can avoid a copy since we own the head portion of this page.
>*/
> - if (__splice_segment(virt_to_page(skb->data),
> -  (unsigned long) skb->data & (PAGE_SIZE - 1),
> + if (__splice_segment(page, poff,
>skb_headlen(skb),
>offset, len, skb, spd,
>skb_head_is_locked(skb),
> 

OK so I observed no change with this patch, either on the loopback
data rate at >16kB MTU, or on the myri. I'm keeping it at hand for
experimentation anyway.

Concerning the loopback MTU, I find it strange that the MTU changes
the splice() behaviour and not send/recv. I thought that there could
be a relation between the MTU and the pipe size, but it does not
appear to be the case either, as I tried various sizes between 16kB
and 256kB without achieving original performance.

I've started to bisect the 10GE issue again (since both issues are
unrelated), but I'll finish tomorrow, it's time to get some sleep
now.

Best regards,
Willy



Re: Major network performance regression in 3.7

2013-01-05 Thread Eric Dumazet
On Sun, 2013-01-06 at 03:32 +0100, Willy Tarreau wrote:

> It's 0cf833ae (net: loopback: set default mtu to 64K). And I could
> reproduce it with 3.6 by setting loopback's MTU to 65536 by hand.
> The trick is that once the MTU has been set to this large a value,
> even when I set it back to 16kB the problem persists.
> 

Well, this MTU change can uncover a prior bug, or make it happen faster,
for sure.





Re: Major network performance regression in 3.7

2013-01-05 Thread Willy Tarreau
On Sat, Jan 05, 2013 at 06:22:13PM -0800, Eric Dumazet wrote:
> On Sun, 2013-01-06 at 03:18 +0100, Willy Tarreau wrote:
> > On Sat, Jan 05, 2013 at 06:16:31PM -0800, Eric Dumazet wrote:
> > > On Sat, 2013-01-05 at 17:51 -0800, Eric Dumazet wrote:
> > > > On Sat, 2013-01-05 at 17:40 -0800, Eric Dumazet wrote:
> > > > > On Sun, 2013-01-06 at 02:30 +0100, Willy Tarreau wrote:
> > > > > 
> > > > > > Ah interesting because these were some of the mm patches that I had
> > > > > > tried to revert.
> > > > > 
> > > > > Hmm, or we should fix __skb_splice_bits()
> > > > > 
> > > > > I'll send a patch.
> > > > > 
> > > > 
> > > > Could you try the following ?
> > > 
> > > Or more exactly...
> > 
> > The first one did not change an iota, unfortunately. I'm about to
> > pin down the commit causing the loopback regression. It's a few patches
> > before the first one you pointed to. It's almost finished, and I'll test
> > your patch below immediately after.
> 
> I bet you are going to find commit
> 69b08f62e17439ee3d436faf0b9a7ca6fffb78db
> (net: use bigger pages in __netdev_alloc_frag )
> 
> Am I wrong ?

Yes, this time you guessed wrong :-) Well, maybe it's contributing
to the issue.

It's 0cf833ae (net: loopback: set default mtu to 64K). And I could
reproduce it with 3.6 by setting loopback's MTU to 65536 by hand.
The trick is that once the MTU has been set to this large a value,
even when I set it back to 16kB the problem persists.

Now I'm retrying your other patch to see if it brings the 10GE back
to full speed.

Willy



Re: Major network performance regression in 3.7

2013-01-05 Thread Eric Dumazet
On Sun, 2013-01-06 at 03:18 +0100, Willy Tarreau wrote:
> On Sat, Jan 05, 2013 at 06:16:31PM -0800, Eric Dumazet wrote:
> > On Sat, 2013-01-05 at 17:51 -0800, Eric Dumazet wrote:
> > > On Sat, 2013-01-05 at 17:40 -0800, Eric Dumazet wrote:
> > > > On Sun, 2013-01-06 at 02:30 +0100, Willy Tarreau wrote:
> > > > 
> > > > > Ah interesting because these were some of the mm patches that I had
> > > > > tried to revert.
> > > > 
> > > > Hmm, or we should fix __skb_splice_bits()
> > > > 
> > > > I'll send a patch.
> > > > 
> > > 
> > > Could you try the following ?
> > 
> > Or more exactly...
> 
> The first one did not change an iota, unfortunately. I'm about to
> pin down the commit causing the loopback regression. It's a few patches
> before the first one you pointed to. It's almost finished, and I'll test
> your patch below immediately after.

I bet you are going to find commit
69b08f62e17439ee3d436faf0b9a7ca6fffb78db
(net: use bigger pages in __netdev_alloc_frag )

Am I wrong ?




Re: Major network performance regression in 3.7

2013-01-05 Thread Willy Tarreau
On Sat, Jan 05, 2013 at 06:16:31PM -0800, Eric Dumazet wrote:
> On Sat, 2013-01-05 at 17:51 -0800, Eric Dumazet wrote:
> > On Sat, 2013-01-05 at 17:40 -0800, Eric Dumazet wrote:
> > > On Sun, 2013-01-06 at 02:30 +0100, Willy Tarreau wrote:
> > > 
> > > > Ah interesting because these were some of the mm patches that I had
> > > > tried to revert.
> > > 
> > > Hmm, or we should fix __skb_splice_bits()
> > > 
> > > I'll send a patch.
> > > 
> > 
> > Could you try the following ?
> 
> Or more exactly...

The first one did not change an iota, unfortunately. I'm about to
pin down the commit causing the loopback regression. It's a few patches
before the first one you pointed to. It's almost finished, and I'll test
your patch below immediately after.

Thanks,
Willy

> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 3ab989b..01f222c 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -1736,11 +1736,8 @@ static bool __splice_segment(struct page *page, 
> unsigned int poff,
>   return false;
>   }
>  
> - /* ignore any bits we already processed */
> - if (*off) {
> - __segment_seek(&page, &poff, &plen, *off);
> - *off = 0;
> - }
> + __segment_seek(&page, &poff, &plen, *off);
> + *off = 0;
>  
>   do {
>   unsigned int flen = min(*len, plen);
> @@ -1768,14 +1765,15 @@ static bool __skb_splice_bits(struct sk_buff *skb, 
> struct pipe_inode_info *pipe,
> struct splice_pipe_desc *spd, struct sock *sk)
>  {
>   int seg;
> + struct page *page = virt_to_page(skb->data);
> + unsigned int poff = skb->data - (unsigned char *)page_address(page);
>  
>   /* map the linear part :
>* If skb->head_frag is set, this 'linear' part is backed by a
>* fragment, and if the head is not shared with any clones then
>* we can avoid a copy since we own the head portion of this page.
>*/
> - if (__splice_segment(virt_to_page(skb->data),
> -  (unsigned long) skb->data & (PAGE_SIZE - 1),
> + if (__splice_segment(page, poff,
>skb_headlen(skb),
>offset, len, skb, spd,
>skb_head_is_locked(skb),
> 


Re: Major network performance regression in 3.7

2013-01-05 Thread Eric Dumazet
On Sat, 2013-01-05 at 17:51 -0800, Eric Dumazet wrote:
> On Sat, 2013-01-05 at 17:40 -0800, Eric Dumazet wrote:
> > On Sun, 2013-01-06 at 02:30 +0100, Willy Tarreau wrote:
> > 
> > > Ah interesting because these were some of the mm patches that I had
> > > tried to revert.
> > 
> > Hmm, or we should fix __skb_splice_bits()
> > 
> > I'll send a patch.
> > 
> 
> Could you try the following ?

Or more exactly...

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 3ab989b..01f222c 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1736,11 +1736,8 @@ static bool __splice_segment(struct page *page, unsigned 
int poff,
return false;
}
 
-   /* ignore any bits we already processed */
-   if (*off) {
-   __segment_seek(&page, &poff, &plen, *off);
-   *off = 0;
-   }
+   __segment_seek(&page, &poff, &plen, *off);
+   *off = 0;
 
do {
unsigned int flen = min(*len, plen);
@@ -1768,14 +1765,15 @@ static bool __skb_splice_bits(struct sk_buff *skb, 
struct pipe_inode_info *pipe,
  struct splice_pipe_desc *spd, struct sock *sk)
 {
int seg;
+   struct page *page = virt_to_page(skb->data);
+   unsigned int poff = skb->data - (unsigned char *)page_address(page);
 
/* map the linear part :
 * If skb->head_frag is set, this 'linear' part is backed by a
 * fragment, and if the head is not shared with any clones then
 * we can avoid a copy since we own the head portion of this page.
 */
-   if (__splice_segment(virt_to_page(skb->data),
-(unsigned long) skb->data & (PAGE_SIZE - 1),
+   if (__splice_segment(page, poff,
 skb_headlen(skb),
 offset, len, skb, spd,
 skb_head_is_locked(skb),




Re: Major network performance regression in 3.7

2013-01-05 Thread Eric Dumazet
On Sat, 2013-01-05 at 17:40 -0800, Eric Dumazet wrote:
> On Sun, 2013-01-06 at 02:30 +0100, Willy Tarreau wrote:
> 
> > Ah interesting because these were some of the mm patches that I had
> > tried to revert.
> 
> Hmm, or we should fix __skb_splice_bits()
> 
> I'll send a patch.
> 

Could you try the following ?

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 3ab989b..c5246be 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1768,14 +1768,15 @@ static bool __skb_splice_bits(struct sk_buff *skb, 
struct pipe_inode_info *pipe,
  struct splice_pipe_desc *spd, struct sock *sk)
 {
int seg;
+   struct page *page = virt_to_page(skb->data);
+   unsigned int poff = skb->data - (unsigned char *)page_address(page);
 
/* map the linear part :
 * If skb->head_frag is set, this 'linear' part is backed by a
 * fragment, and if the head is not shared with any clones then
 * we can avoid a copy since we own the head portion of this page.
 */
-   if (__splice_segment(virt_to_page(skb->data),
-(unsigned long) skb->data & (PAGE_SIZE - 1),
+   if (__splice_segment(page, poff,
 skb_headlen(skb),
 offset, len, skb, spd,
 skb_head_is_locked(skb),




Re: Major network performance regression in 3.7

2013-01-05 Thread Eric Dumazet
On Sun, 2013-01-06 at 02:30 +0100, Willy Tarreau wrote:

> Ah interesting because these were some of the mm patches that I had
> tried to revert.

Hmm, or we should fix __skb_splice_bits()

I'll send a patch.




Re: Major network performance regression in 3.7

2013-01-05 Thread Willy Tarreau
On Sat, Jan 05, 2013 at 05:21:16PM -0800, Eric Dumazet wrote:
> On Sun, 2013-01-06 at 01:50 +0100, Willy Tarreau wrote:
> 
> > Yes, I've removed all zero counters in this short view for easier
> > reading (complete version appended at the end of this email). This
> > was after around 140 GB were transferred :
> 
> OK I only wanted to make sure skb were not linearized in xmit.
> 
> Could you try to disable CONFIG_COMPACTION ?

It's already disabled.

> ( This is the other thread mentioning this : "ppoll() stuck on POLLIN
> while TCP peer is sending" )

Ah interesting because these were some of the mm patches that I had
tried to revert.

Willy



Re: Major network performance regression in 3.7

2013-01-05 Thread Eric Dumazet
On Sun, 2013-01-06 at 01:50 +0100, Willy Tarreau wrote:

> Yes, I've removed all zero counters in this short view for easier
> reading (complete version appended at the end of this email). This
> was after around 140 GB were transferred :

OK I only wanted to make sure skb were not linearized in xmit.

Could you try to disable CONFIG_COMPACTION ?

( This is the other thread mentioning this : "ppoll() stuck on POLLIN
while TCP peer is sending" )






Re: Major network performance regression in 3.7

2013-01-05 Thread Willy Tarreau
On Sat, Jan 05, 2013 at 04:02:03PM -0800, Eric Dumazet wrote:
> On Sun, 2013-01-06 at 00:29 +0100, Willy Tarreau wrote:
> 
> > > 2) Another possibility would be that Myri card/driver doesnt like very
> > > well high order pages.
> > 
> > It looks like it has not changed much since 3.6 :-/ I really suspect
> > something is wrong with memory allocation. I have tried reverting many
> > patches affecting the mm/ directory just in case but I did not come to
> > anything useful yet.
> > 
> 
> Hmm, I was referring to TCP stack now using order-3 pages instead of
> order-0 ones
> 
> See commit 5640f7685831e088fe6c2e1f863a6805962f8e81
> (net: use a per task frag allocator)

OK, so you think there are two distinct problems ?

I have tried to revert this one, but it did not change the performance; I'm
still saturating at ~6.9 Gbps.

> Could you please post :
> 
> ethtool -S eth0

Yes, I've removed all zero counters in this short view for easier
reading (complete version appended at the end of this email). This
was after around 140 GB were transferred :

# ethtool -S eth1|grep -vw 0
NIC statistics:
 rx_packets: 8001500
 tx_packets: 10015409
 rx_bytes: 480115998
 tx_bytes: 148825674976
 tx_boundary: 2048
 WC: 1
 irq: 45
 MSI: 1
 read_dma_bw_MBs: 1200
 write_dma_bw_MBs: 1614
 read_write_dma_bw_MBs: 2101
 serial_number: 320061
 link_changes: 2
 link_up: 1
 tx_pkt_start: 10015409
 tx_pkt_done: 10015409
 tx_req: 93407411
 tx_done: 93407411
 rx_small_cnt: 8001500
 wake_queue: 187727
 stop_queue: 187727
 LRO aggregated: 146
 LRO flushed: 146
 LRO avg aggr: 1
 LRO no_desc: 80

Quite honestly, this is typical of the pattern I'm used to
observing here. I'm now trying to bisect; hopefully we'll get
something exploitable.

Cheers,
Willy

- full ethtool -S 

NIC statistics:
 rx_packets: 8001500
 tx_packets: 10015409
 rx_bytes: 480115998
 tx_bytes: 148825674976
 rx_errors: 0
 tx_errors: 0
 rx_dropped: 0
 tx_dropped: 0
 multicast: 0
 collisions: 0
 rx_length_errors: 0
 rx_over_errors: 0
 rx_crc_errors: 0
 rx_frame_errors: 0
 rx_fifo_errors: 0
 rx_missed_errors: 0
 tx_aborted_errors: 0
 tx_carrier_errors: 0
 tx_fifo_errors: 0
 tx_heartbeat_errors: 0
 tx_window_errors: 0
 tx_boundary: 2048
 WC: 1
 irq: 45
 MSI: 1
 MSIX: 0
 read_dma_bw_MBs: 1200
 write_dma_bw_MBs: 1614
 read_write_dma_bw_MBs: 2101
 serial_number: 320061
 watchdog_resets: 0
 link_changes: 2
 link_up: 1
 dropped_link_overflow: 0
 dropped_link_error_or_filtered: 0
 dropped_pause: 0
 dropped_bad_phy: 0
 dropped_bad_crc32: 0
 dropped_unicast_filtered: 0
 dropped_multicast_filtered: 0
 dropped_runt: 0
 dropped_overrun: 0
 dropped_no_small_buffer: 0
 dropped_no_big_buffer: 0
 ----------- slice ---------: 0
 tx_pkt_start: 10015409
 tx_pkt_done: 10015409
 tx_req: 93407411
 tx_done: 93407411
 rx_small_cnt: 8001500
 rx_big_cnt: 0
 wake_queue: 187727
 stop_queue: 187727
 tx_linearized: 0
 LRO aggregated: 146
 LRO flushed: 146
 LRO avg aggr: 1
 LRO no_desc: 80



Re: Major network performance regression in 3.7

2013-01-05 Thread Eric Dumazet
On Sun, 2013-01-06 at 00:29 +0100, Willy Tarreau wrote:

> > 2) Another possibility would be that Myri card/driver doesnt like very
> > well high order pages.
> 
> It looks like it has not changed much since 3.6 :-/ I really suspect
> something is wrong with memory allocation. I have tried reverting many
> patches affecting the mm/ directory just in case but I did not come to
> anything useful yet.
> 

Hmm, I was referring to TCP stack now using order-3 pages instead of
order-0 ones

See commit 5640f7685831e088fe6c2e1f863a6805962f8e81
(net: use a per task frag allocator)

Could you please post :

ethtool -S eth0





Re: Major network performance regression in 3.7

2013-01-05 Thread Willy Tarreau
Hi Eric,

On Sat, Jan 05, 2013 at 03:18:46PM -0800, Eric Dumazet wrote:
> Hi Willy, another good finding during the week end ! ;)

Yes, I wanted to experiment with TFO and stopped on this :-)

> 1) This looks like interrupts are spreaded on multiple cpus, and this
> give Out Of Order problems with TCP stack.

No, I forgot to mention this, I have tried to bind IRQs to a single
core, with the server either on the same or another one, but the
problem remained.

Also, the loopback is much more affected and doesn't use IRQs. And
BTW tcpdump on the loopback shouldn't drop that many packets (up to
90% even at low rate). I just noticed something: transferring data
using netcat on the loopback doesn't affect tcpdump. So it's likely
only the spliced data that are affected.

> 2) Another possibility would be that Myri card/driver doesnt like very
> well high order pages.

It looks like it has not changed much since 3.6 :-/ I really suspect
something is wrong with memory allocation. I have tried reverting many
patches affecting the mm/ directory just in case but I did not come to
anything useful yet.

I'm continuing to dig.

Thanks,
Willy



Re: Major network performance regression in 3.7

2013-01-05 Thread Eric Dumazet
On Sat, 2013-01-05 at 22:49 +0100, Willy Tarreau wrote:
> Hi,
> 
> I'm observing multiple apparently unrelated network performance
> issues in 3.7, to the point that I'm doubting it comes from the
> network stack.
> 
> My setup involves 3 machines connected point-to-point with myri
> 10GE NICs (the middle machine has 2 NICs). The middle machine
> normally runs haproxy, the other two run either an HTTP load
> generator or a dummy web server :
> 
> 
>   [ client ] <> [ haproxy ] <> [ server ]
> 
> Usually transferring HTTP objects from the server to the client
> via haproxy causes no problem at 10 Gbps for moderately large
> objects.
> 
> This time I observed that it was not possible to go beyond 6.8 Gbps,
> with all the chain idling a lot. I tried to change the IRQ rate, CPU
> affinity, tcp_rmem/tcp_wmem, disabling flow control, etc... the usual
> knobs, nothing managed to go beyond.
> 
> So I removed haproxy from the equation, and simply started the client
> on the middle machine. Same issue. I thought about concurrency issues,
> so I reduced to a single connection, and nothing changed (usually I
> achieve 10G even with a single connection with large enough TCP windows).
> I tried to start tcpdump and the transfer immediately stalled and did not
> come back after I stopped tcpdump. This was reproducible several times
> but not always.
> 
> So I first thought about an issue in the myri10ge driver and wanted to
> confirm that everything was OK on the middle machine.
> 
> I started the server on it and aimed the client at it via the loopback.
> The transfer rate was even worse : randomly oscillating between 10 and
> 100 MB/s ! Normally on the loopback, I get several GB/s here.
> 
> Running tcpdump on the loopback showed me several very concerning issues :
> 
> 1) lots of packets are lost before reaching tcpdump. The trace shows that
>these segments are ACKed so they're correctly received, but tcpdump
>does not get them. Tcpdump stats at the end report impressive numbers,
>around 90% packet dropped from the capture!
> 
> 2) ACKs seem to be immediately delivered but do not trigger sending, the
>system seems to be running with delayed ACKs, as it waits 40 or 200ms
>before restarting, and this is visible even in the first round trips :
> 
>- connection setup :
> 
>18:32:08.071602 IP 127.0.0.1.26792 > 127.0.0.1.8000: S 
> 2036886615:2036886615(0) win 8030 
>18:32:08.071605 IP 127.0.0.1.8000 > 127.0.0.1.26792: S 
> 126397113:126397113(0) ack 2036886616 win 8030  65495,nop,nop,sackOK,nop,wscale 9>
>18:32:08.071614 IP 127.0.0.1.26792 > 127.0.0.1.8000: . ack 126397114 win 16
> 
>- GET /?s=1g HTTP/1.0
> 
>18:32:08.071649 IP 127.0.0.1.26792 > 127.0.0.1.8000: P 
> 2036886616:2036886738(122) ack 126397114 win 16
> 
>- HTTP/1.1 200 OK with the beginning of the response :
> 
>18:32:08.071672 IP 127.0.0.1.8000 > 127.0.0.1.26792: . 
> 126397114:126401210(4096) ack 2036886738 win 16
>18:32:08.071676 IP 127.0.0.1.26792 > 127.0.0.1.8000: . ack 126401210 win 
> 250
>==> 200ms pause here
>18:32:08.275493 IP 127.0.0.1.8000 > 127.0.0.1.26792: P 
> 126401210:126463006(61796) ack 2036886738 win 16
>==> 40ms pause here
>18:32:08.315493 IP 127.0.0.1.26792 > 127.0.0.1.8000: . ack 126463006 win 
> 256
>18:32:08.315498 IP 127.0.0.1.8000 > 127.0.0.1.26792: . 
> 126463006:126527006(64000) ack 2036886738 win 16
> 
>... and so on
> 
>My server is using splice() with the SPLICE_F_MORE flag to send data.
>I noticed that not using splice and relying on send(MSG_MORE) instead
>I don't get the issue.
> 
> 3) I wondered if this had something to do with the 64k MTU on the loopback
>so I lowered it to 16kB. The performance was even worse (about 5MB/s).
>Starting tcpdump managed to make my transfer stall, just like with the
>myri10ge. In this last test, I noticed that there were some real drops,
>because there were some SACKs :
> 
>18:45:16.699951 IP 127.0.0.1.8000 > 127.0.0.1.8002: P 
> 956153186:956169530(16344) ack 131668746 win 16
>18:45:16.699956 IP 127.0.0.1.8002 > 127.0.0.1.8000: . ack 956169530 win 64
>18:45:16.904119 IP 127.0.0.1.8000 > 127.0.0.1.8002: P 
> 957035762:957052106(16344) ack 131668746 win 16
>18:45:16.904122 IP 127.0.0.1.8002 > 127.0.0.1.8000: . ack 957052106 win 703
>18:45:16.904124 IP 127.0.0.1.8000 > 127.0.0.1.8002: P 
> 957052106:957099566(47460) ack 131668746 win 16
>18:45:17.108117 IP 127.0.0.1.8000 > 127.0.0.1.8002: P 
> 957402550:957418894(16344) ack 131668746 win 16
>18:45:17.108119 IP 127.0.0.1.8002 > 127

Major network performance regression in 3.7

2013-01-05 Thread Willy Tarreau
Hi,

I'm observing multiple apparently unrelated network performance
issues in 3.7, to the point that I'm doubting it comes from the
network stack.

My setup involves 3 machines connected point-to-point with myri
10GE NICs (the middle machine has 2 NICs). The middle machine
normally runs haproxy, the other two run either an HTTP load
generator or a dummy web server :


  [ client ] <> [ haproxy ] <> [ server ]

Usually transferring HTTP objects from the server to the client
via haproxy causes no problem at 10 Gbps for moderately large
objects.

This time I observed that it was not possible to go beyond 6.8 Gbps,
with all the chain idling a lot. I tried to change the IRQ rate, CPU
affinity, tcp_rmem/tcp_wmem, disabling flow control, etc... the usual
knobs, nothing managed to go beyond.

So I removed haproxy from the equation, and simply started the client
on the middle machine. Same issue. I thought about concurrency issues,
so I reduced to a single connection, and nothing changed (usually I
achieve 10G even with a single connection with large enough TCP windows).
I tried to start tcpdump and the transfer immediately stalled and did not
come back after I stopped tcpdump. This was reproducible several times
but not always.

So I first thought about an issue in the myri10ge driver and wanted to
confirm that everything was OK on the middle machine.

I started the server on it and aimed the client at it via the loopback.
The transfer rate was even worse : randomly oscillating between 10 and
100 MB/s ! Normally on the loopback, I get several GB/s here.

Running tcpdump on the loopback showed me several very concerning issues :

1) lots of packets are lost before reaching tcpdump. The trace shows that
   these segments are ACKed so they're correctly received, but tcpdump
   does not get them. Tcpdump stats at the end report impressive numbers,
   around 90% of packets dropped from the capture!

2) ACKs seem to be immediately delivered but do not trigger sending, the
   system seems to be running with delayed ACKs, as it waits 40 or 200ms
   before restarting, and this is visible even in the first round trips :

   - connection setup :

   18:32:08.071602 IP 127.0.0.1.26792 > 127.0.0.1.8000: S 
2036886615:2036886615(0) win 8030 <mss 65495,nop,nop,sackOK,nop,wscale 9>
   18:32:08.071605 IP 127.0.0.1.8000 > 127.0.0.1.26792: S 
126397113:126397113(0) ack 2036886616 win 8030 <mss 65495,nop,nop,sackOK,nop,wscale 9>
   18:32:08.071614 IP 127.0.0.1.26792 > 127.0.0.1.8000: . ack 126397114 win 16

   - GET /?s=1g HTTP/1.0

   18:32:08.071649 IP 127.0.0.1.26792 > 127.0.0.1.8000: P 
2036886616:2036886738(122) ack 126397114 win 16

   - HTTP/1.1 200 OK with the beginning of the response :

   18:32:08.071672 IP 127.0.0.1.8000 > 127.0.0.1.26792: . 
126397114:126401210(4096) ack 2036886738 win 16
   18:32:08.071676 IP 127.0.0.1.26792 > 127.0.0.1.8000: . ack 126401210 win 250
   ==> 200ms pause here
   18:32:08.275493 IP 127.0.0.1.8000 > 127.0.0.1.26792: P 
126401210:126463006(61796) ack 2036886738 win 16
   ==> 40ms pause here
   18:32:08.315493 IP 127.0.0.1.26792 > 127.0.0.1.8000: . ack 126463006 win 256
   18:32:08.315498 IP 127.0.0.1.8000 > 127.0.0.1.26792: . 
126463006:126527006(64000) ack 2036886738 win 16

   ... and so on

   My server is using splice() with the SPLICE_F_MORE flag to send data.
   I noticed that not using splice and relying on send(MSG_MORE) instead
   I don't get the issue.

3) I wondered if this had something to do with the 64k MTU on the loopback
   so I lowered it to 16kB. The performance was even worse (about 5MB/s).
   Starting tcpdump managed to make my transfer stall, just like with the
   myri10ge. In this last test, I noticed that there were some real drops,
   because there were some SACKs :

   18:45:16.699951 IP 127.0.0.1.8000 > 127.0.0.1.8002: P 
956153186:956169530(16344) ack 131668746 win 16
   18:45:16.699956 IP 127.0.0.1.8002 > 127.0.0.1.8000: . ack 956169530 win 64
   18:45:16.904119 IP 127.0.0.1.8000 > 127.0.0.1.8002: P 
957035762:957052106(16344) ack 131668746 win 16
   18:45:16.904122 IP 127.0.0.1.8002 > 127.0.0.1.8000: . ack 957052106 win 703
   18:45:16.904124 IP 127.0.0.1.8000 > 127.0.0.1.8002: P 
957052106:957099566(47460) ack 131668746 win 16
   18:45:17.108117 IP 127.0.0.1.8000 > 127.0.0.1.8002: P 
957402550:957418894(16344) ack 131668746 win 16
   18:45:17.108119 IP 127.0.0.1.8002 > 127.0.0.1.8000: . ack 957418894 win 1846
   18:45:17.312115 IP 127.0.0.1.8000 > 127.0.0.1.8002: P 
957672806:957689150(16344) ack 131668746 win 16
   18:45:17.312117 IP 127.0.0.1.8002 > 127.0.0.1.8000: . ack 957689150 win 2902
   18:45:17.516114 IP 127.0.0.1.8000 > 127.0.0.1.8002: P 
958962966:958979310(16344) ack 131668746 win 16
   18:45:17.516116 IP 127.0.0.1.8002 > 127.0.0.1.8000: . ack 958979310 win 7941
   18:45:17.516150 IP 127.0.0.1.8002 > 127.0.0.1.8000: . ack 959503678 win 9926 
<nop,nop,sack 1 {959405614:959421958}>
   18:45:17.516151 IP 127.0.0.1.8002 > 127.0.0.1.8000: . ack 959503678 win 9926 


Re: Major network performance regression in 3.7

2013-01-05 Thread Eric Dumazet
On Sat, 2013-01-05 at 22:49 +0100, Willy Tarreau wrote:
 Hi,
 
 I'm observing multiple apparently unrelated network performance
 issues in 3.7, to the point that I'm doubting it comes from the
 network stack.
 
 My setup involves 3 machines connected point-to-point with myri
 10GE NICs (the middle machine has 2 NICs). The middle machine
 normally runs haproxy, the other two run either an HTTP load
 generator or a dummy web server :
 
 
   [ client ]  [ haproxy ]  [ server ]
 
 Usually transferring HTTP objects from the server to the client
 via haproxy causes no problem at 10 Gbps for moderately large
 objects.
 
 This time I observed that it was not possible to go beyond 6.8 Gbps,
 with all the chain idling a lot. I tried to change the IRQ rate, CPU
 affinity, tcp_rmem/tcp_wmem, disabling flow control, etc... the usual
 knobs, nothing managed to go beyond.
 
 So I removed haproxy from the equation, and simply started the client
 on the middle machine. Same issue. I thought about concurrency issues,
 so I reduced to a single connection, and nothing changed (usually I
 achieve 10G even with a single connection with large enough TCP windows).
 I tried to start tcpdump and the transfer immediately stalled and did not
 come back after I stopped tcpdump. This was reproducible several times
 but not always.
 
 So I first thought about an issue in the myri10ge driver and wanted to
 confirm that everything was OK on the middle machine.
 
 I started the server on it and aimed the client at it via the loopback.
 The transfer rate was even worse : randomly oscillating between 10 and
 100 MB/s ! Normally on the loopback, I get several GB/s here.
 
 Running tcpdump on the loopback showed me several very concerning issues :
 
 1) lots of packets are lost before reaching tcpdump. The trace shows that
these segments are ACKed so they're correctly received, but tcpdump
does not get them. Tcpdump stats at the end report impressive numbers,
    around 90% of packets dropped from the capture!
 
 2) ACKs seem to be immediately delivered but do not trigger sending, the
system seems to be running with delayed ACKs, as it waits 40 or 200ms
before restarting, and this is visible even in the first round trips :
 
- connection setup :
 
    18:32:08.071602 IP 127.0.0.1.26792 > 127.0.0.1.8000: S 
 2036886615:2036886615(0) win 8030 <mss 65495,nop,nop,sackOK,nop,wscale 9>
    18:32:08.071605 IP 127.0.0.1.8000 > 127.0.0.1.26792: S 
 126397113:126397113(0) ack 2036886616 win 8030 <mss 
 65495,nop,nop,sackOK,nop,wscale 9>
    18:32:08.071614 IP 127.0.0.1.26792 > 127.0.0.1.8000: . ack 126397114 win 16
 
- GET /?s=1g HTTP/1.0
 
    18:32:08.071649 IP 127.0.0.1.26792 > 127.0.0.1.8000: P 
 2036886616:2036886738(122) ack 126397114 win 16
 
- HTTP/1.1 200 OK with the beginning of the response :
 
    18:32:08.071672 IP 127.0.0.1.8000 > 127.0.0.1.26792: . 
 126397114:126401210(4096) ack 2036886738 win 16
    18:32:08.071676 IP 127.0.0.1.26792 > 127.0.0.1.8000: . ack 126401210 win 
 250
    == 200ms pause here
    18:32:08.275493 IP 127.0.0.1.8000 > 127.0.0.1.26792: P 
 126401210:126463006(61796) ack 2036886738 win 16
    == 40ms pause here
    18:32:08.315493 IP 127.0.0.1.26792 > 127.0.0.1.8000: . ack 126463006 win 
 256
    18:32:08.315498 IP 127.0.0.1.8000 > 127.0.0.1.26792: . 
 126463006:126527006(64000) ack 2036886738 win 16
 
... and so on
 
My server is using splice() with the SPLICE_F_MORE flag to send data.
 I noticed that when not using splice and relying on send(MSG_MORE)
 instead, I don't get the issue.
 
 3) I wondered if this had something to do with the 64k MTU on the loopback
so I lowered it to 16kB. The performance was even worse (about 5MB/s).
Starting tcpdump managed to make my transfer stall, just like with the
myri10ge. In this last test, I noticed that there were some real drops,
because there were some SACKs :
 
    18:45:16.699951 IP 127.0.0.1.8000 > 127.0.0.1.8002: P 
 956153186:956169530(16344) ack 131668746 win 16
    18:45:16.699956 IP 127.0.0.1.8002 > 127.0.0.1.8000: . ack 956169530 win 64
    18:45:16.904119 IP 127.0.0.1.8000 > 127.0.0.1.8002: P 
 957035762:957052106(16344) ack 131668746 win 16
    18:45:16.904122 IP 127.0.0.1.8002 > 127.0.0.1.8000: . ack 957052106 win 703
    18:45:16.904124 IP 127.0.0.1.8000 > 127.0.0.1.8002: P 
 957052106:957099566(47460) ack 131668746 win 16
    18:45:17.108117 IP 127.0.0.1.8000 > 127.0.0.1.8002: P 
 957402550:957418894(16344) ack 131668746 win 16
    18:45:17.108119 IP 127.0.0.1.8002 > 127.0.0.1.8000: . ack 957418894 win 
 1846
    18:45:17.312115 IP 127.0.0.1.8000 > 127.0.0.1.8002: P 
 957672806:957689150(16344) ack 131668746 win 16
    18:45:17.312117 IP 127.0.0.1.8002 > 127.0.0.1.8000: . ack 957689150 win 
 2902
    18:45:17.516114 IP 127.0.0.1.8000 > 127.0.0.1.8002: P 
 958962966:958979310(16344) ack 131668746 win 16
    18:45:17.516116 IP 127.0.0.1.8002 > 127.0.0.1.8000: . ack 958979310 win 
 7941
18:45:17.516150 IP

Re: Major network performance regression in 3.7

2013-01-05 Thread Willy Tarreau
Hi Eric,

On Sat, Jan 05, 2013 at 03:18:46PM -0800, Eric Dumazet wrote:
 Hi Willy, another good finding during the weekend ! ;)

Yes, I wanted to experiment with TFO and stopped on this :-)

 1) This looks like interrupts are spread on multiple cpus, and this
 gives Out Of Order problems with the TCP stack.

No, I forgot to mention this, I have tried to bind IRQs to a single
core, with the server either on the same or another one, but the
problem remained.

Also, the loopback is much more affected and doesn't use IRQs. And
BTW tcpdump on the loopback shouldn't drop that many packets (up to
90% even at low rate). I just noticed something: transferring data
using netcat on the loopback doesn't affect tcpdump. So it's likely
only the spliced data that are affected.

 2) Another possibility would be that Myri card/driver doesnt like very
 well high order pages.

It looks like it has not changed much since 3.6 :-/ I really suspect
something is wrong with memory allocation. I have tried reverting many
patches affecting the mm/ directory just in case but I did not come to
anything useful yet.

I'm continuing to dig.

Thanks,
Willy

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Major network performance regression in 3.7

2013-01-05 Thread Eric Dumazet
On Sun, 2013-01-06 at 00:29 +0100, Willy Tarreau wrote:

  2) Another possibility would be that Myri card/driver doesnt like very
  well high order pages.
 
 It looks like it has not changed much since 3.6 :-/ I really suspect
 something is wrong with memory allocation. I have tried reverting many
 patches affecting the mm/ directory just in case but I did not come to
 anything useful yet.
 

Hmm, I was referring to TCP stack now using order-3 pages instead of
order-0 ones

See commit 5640f7685831e088fe6c2e1f863a6805962f8e81
(net: use a per task frag allocator)

Could you please post :

ethtool -S eth0





Re: Major network performance regression in 3.7

2013-01-05 Thread Willy Tarreau
On Sat, Jan 05, 2013 at 04:02:03PM -0800, Eric Dumazet wrote:
 On Sun, 2013-01-06 at 00:29 +0100, Willy Tarreau wrote:
 
   2) Another possibility would be that Myri card/driver doesnt like very
   well high order pages.
  
  It looks like it has not changed much since 3.6 :-/ I really suspect
  something is wrong with memory allocation. I have tried reverting many
  patches affecting the mm/ directory just in case but I did not come to
  anything useful yet.
  
 
 Hmm, I was referring to TCP stack now using order-3 pages instead of
 order-0 ones
 
 See commit 5640f7685831e088fe6c2e1f863a6805962f8e81
 (net: use a per task frag allocator)

OK, so you think there are two distinct problems ?

I have tried to revert this one but it did not change the performance;
I'm still saturating at ~6.9 Gbps.

 Could you please post :
 
 ethtool -S eth0

Yes, I've removed all zero counters in this short view for easier
reading (complete version appended at the end of this email). This
was after around 140 GB were transferred :

# ethtool -S eth1|grep -vw 0
NIC statistics:
 rx_packets: 8001500
 tx_packets: 10015409
 rx_bytes: 480115998
 tx_bytes: 148825674976
 tx_boundary: 2048
 WC: 1
 irq: 45
 MSI: 1
 read_dma_bw_MBs: 1200
 write_dma_bw_MBs: 1614
 read_write_dma_bw_MBs: 2101
 serial_number: 320061
 link_changes: 2
 link_up: 1
 tx_pkt_start: 10015409
 tx_pkt_done: 10015409
 tx_req: 93407411
 tx_done: 93407411
 rx_small_cnt: 8001500
 wake_queue: 187727
 stop_queue: 187727
 LRO aggregated: 146
 LRO flushed: 146
 LRO avg aggr: 1
 LRO no_desc: 80

Quite honestly, this is typically the pattern I'm used to
observing here. I'm now trying to bisect; hopefully we'll get
something exploitable.

Cheers,
Willy

- full ethtool -S 

NIC statistics:
 rx_packets: 8001500
 tx_packets: 10015409
 rx_bytes: 480115998
 tx_bytes: 148825674976
 rx_errors: 0
 tx_errors: 0
 rx_dropped: 0
 tx_dropped: 0
 multicast: 0
 collisions: 0
 rx_length_errors: 0
 rx_over_errors: 0
 rx_crc_errors: 0
 rx_frame_errors: 0
 rx_fifo_errors: 0
 rx_missed_errors: 0
 tx_aborted_errors: 0
 tx_carrier_errors: 0
 tx_fifo_errors: 0
 tx_heartbeat_errors: 0
 tx_window_errors: 0
 tx_boundary: 2048
 WC: 1
 irq: 45
 MSI: 1
 MSIX: 0
 read_dma_bw_MBs: 1200
 write_dma_bw_MBs: 1614
 read_write_dma_bw_MBs: 2101
 serial_number: 320061
 watchdog_resets: 0
 link_changes: 2
 link_up: 1
 dropped_link_overflow: 0
 dropped_link_error_or_filtered: 0
 dropped_pause: 0
 dropped_bad_phy: 0
 dropped_bad_crc32: 0
 dropped_unicast_filtered: 0
 dropped_multicast_filtered: 0
 dropped_runt: 0
 dropped_overrun: 0
 dropped_no_small_buffer: 0
 dropped_no_big_buffer: 0
 ----------- slice ---------: 0
 tx_pkt_start: 10015409
 tx_pkt_done: 10015409
 tx_req: 93407411
 tx_done: 93407411
 rx_small_cnt: 8001500
 rx_big_cnt: 0
 wake_queue: 187727
 stop_queue: 187727
 tx_linearized: 0
 LRO aggregated: 146
 LRO flushed: 146
 LRO avg aggr: 1
 LRO no_desc: 80



Re: Major network performance regression in 3.7

2013-01-05 Thread Eric Dumazet
On Sun, 2013-01-06 at 01:50 +0100, Willy Tarreau wrote:

 Yes, I've removed all zero counters in this short view for easier
 reading (complete version appended at the end of this email). This
 was after around 140 GB were transferred :

OK, I only wanted to make sure skbs were not linearized in xmit.

Could you try to disable CONFIG_COMPACTION ?

( This is the other thread mentioning this : ppoll() stuck on POLLIN
while TCP peer is sending )






Re: Major network performance regression in 3.7

2013-01-05 Thread Willy Tarreau
On Sat, Jan 05, 2013 at 05:21:16PM -0800, Eric Dumazet wrote:
 On Sun, 2013-01-06 at 01:50 +0100, Willy Tarreau wrote:
 
  Yes, I've removed all zero counters in this short view for easier
  reading (complete version appended at the end of this email). This
  was after around 140 GB were transferred :
 
 OK, I only wanted to make sure skbs were not linearized in xmit.
 
 Could you try to disable CONFIG_COMPACTION ?

It's already disabled.

 ( This is the other thread mentioning this : ppoll() stuck on POLLIN
 while TCP peer is sending )

Ah interesting because these were some of the mm patches that I had
tried to revert.

Willy



Re: Major network performance regression in 3.7

2013-01-05 Thread Eric Dumazet
On Sun, 2013-01-06 at 02:30 +0100, Willy Tarreau wrote:

 Ah interesting because these were some of the mm patches that I had
 tried to revert.

Hmm, or we should fix __skb_splice_bits()

I'll send a patch.




Re: Major network performance regression in 3.7

2013-01-05 Thread Eric Dumazet
On Sat, 2013-01-05 at 17:40 -0800, Eric Dumazet wrote:
 On Sun, 2013-01-06 at 02:30 +0100, Willy Tarreau wrote:
 
  Ah interesting because these were some of the mm patches that I had
  tried to revert.
 
 Hmm, or we should fix __skb_splice_bits()
 
 I'll send a patch.
 

Could you try the following ?

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 3ab989b..c5246be 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1768,14 +1768,15 @@ static bool __skb_splice_bits(struct sk_buff *skb, 
struct pipe_inode_info *pipe,
  struct splice_pipe_desc *spd, struct sock *sk)
 {
int seg;
+   struct page *page = virt_to_page(skb->data);
+   unsigned int poff = skb->data - (unsigned char *)page_address(page);
 
/* map the linear part :
 * If skb->head_frag is set, this 'linear' part is backed by a
 * fragment, and if the head is not shared with any clones then
 * we can avoid a copy since we own the head portion of this page.
 */
-   if (__splice_segment(virt_to_page(skb->data),
-(unsigned long) skb->data & (PAGE_SIZE - 1),
+   if (__splice_segment(page, poff,
 skb_headlen(skb),
 offset, len, skb, spd,
 skb_head_is_locked(skb),




Re: Major network performance regression in 3.7

2013-01-05 Thread Eric Dumazet
On Sat, 2013-01-05 at 17:51 -0800, Eric Dumazet wrote:
 On Sat, 2013-01-05 at 17:40 -0800, Eric Dumazet wrote:
  On Sun, 2013-01-06 at 02:30 +0100, Willy Tarreau wrote:
  
   Ah interesting because these were some of the mm patches that I had
   tried to revert.
  
  Hmm, or we should fix __skb_splice_bits()
  
  I'll send a patch.
  
 
 Could you try the following ?

Or more exactly...

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 3ab989b..01f222c 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1736,11 +1736,8 @@ static bool __splice_segment(struct page *page, unsigned 
int poff,
return false;
}
 
-   /* ignore any bits we already processed */
-   if (*off) {
-   __segment_seek(page, poff, plen, *off);
-   *off = 0;
-   }
+   __segment_seek(page, poff, plen, *off);
+   *off = 0;
 
do {
unsigned int flen = min(*len, plen);
@@ -1768,14 +1765,15 @@ static bool __skb_splice_bits(struct sk_buff *skb, 
struct pipe_inode_info *pipe,
  struct splice_pipe_desc *spd, struct sock *sk)
 {
int seg;
+   struct page *page = virt_to_page(skb->data);
+   unsigned int poff = skb->data - (unsigned char *)page_address(page);
 
/* map the linear part :
 * If skb->head_frag is set, this 'linear' part is backed by a
 * fragment, and if the head is not shared with any clones then
 * we can avoid a copy since we own the head portion of this page.
 */
-   if (__splice_segment(virt_to_page(skb->data),
-(unsigned long) skb->data & (PAGE_SIZE - 1),
+   if (__splice_segment(page, poff,
 skb_headlen(skb),
 offset, len, skb, spd,
 skb_head_is_locked(skb),




Re: Major network performance regression in 3.7

2013-01-05 Thread Willy Tarreau
On Sat, Jan 05, 2013 at 06:16:31PM -0800, Eric Dumazet wrote:
 On Sat, 2013-01-05 at 17:51 -0800, Eric Dumazet wrote:
  On Sat, 2013-01-05 at 17:40 -0800, Eric Dumazet wrote:
   On Sun, 2013-01-06 at 02:30 +0100, Willy Tarreau wrote:
   
Ah interesting because these were some of the mm patches that I had
tried to revert.
   
   Hmm, or we should fix __skb_splice_bits()
   
   I'll send a patch.
   
  
  Could you try the following ?
 
 Or more exactly...

The first one did not change an iota, unfortunately. I'm about to
spot the commit causing the loopback regression. It's a few patches
before the first one you pointed to. It's almost finished, and I'll test
your patch below immediately after.

Thanks,
Willy

 diff --git a/net/core/skbuff.c b/net/core/skbuff.c
 index 3ab989b..01f222c 100644
 --- a/net/core/skbuff.c
 +++ b/net/core/skbuff.c
 @@ -1736,11 +1736,8 @@ static bool __splice_segment(struct page *page, 
 unsigned int poff,
   return false;
   }
  
 - /* ignore any bits we already processed */
 - if (*off) {
 - __segment_seek(page, poff, plen, *off);
 - *off = 0;
 - }
 + __segment_seek(page, poff, plen, *off);
 + *off = 0;
  
   do {
   unsigned int flen = min(*len, plen);
 @@ -1768,14 +1765,15 @@ static bool __skb_splice_bits(struct sk_buff *skb, 
 struct pipe_inode_info *pipe,
 struct splice_pipe_desc *spd, struct sock *sk)
  {
   int seg;
 + struct page *page = virt_to_page(skb->data);
 + unsigned int poff = skb->data - (unsigned char *)page_address(page);
  
   /* map the linear part :
* If skb->head_frag is set, this 'linear' part is backed by a
* fragment, and if the head is not shared with any clones then
* we can avoid a copy since we own the head portion of this page.
*/
 - if (__splice_segment(virt_to_page(skb->data),
 -  (unsigned long) skb->data & (PAGE_SIZE - 1),
 + if (__splice_segment(page, poff,
skb_headlen(skb),
offset, len, skb, spd,
skb_head_is_locked(skb),
 


Re: Major network performance regression in 3.7

2013-01-05 Thread Eric Dumazet
On Sun, 2013-01-06 at 03:18 +0100, Willy Tarreau wrote:
 On Sat, Jan 05, 2013 at 06:16:31PM -0800, Eric Dumazet wrote:
  On Sat, 2013-01-05 at 17:51 -0800, Eric Dumazet wrote:
   On Sat, 2013-01-05 at 17:40 -0800, Eric Dumazet wrote:
On Sun, 2013-01-06 at 02:30 +0100, Willy Tarreau wrote:

 Ah interesting because these were some of the mm patches that I had
 tried to revert.

Hmm, or we should fix __skb_splice_bits()

I'll send a patch.

   
   Could you try the following ?
  
  Or more exactly...
 
 The first one did not change an iota, unfortunately. I'm about to
 spot the commit causing the loopback regression. It's a few patches
 before the first one you pointed to. It's almost finished, and I'll test
 your patch below immediately after.

I bet you are going to find commit
69b08f62e17439ee3d436faf0b9a7ca6fffb78db
(net: use bigger pages in __netdev_alloc_frag )

Am I wrong ?




Re: Major network performance regression in 3.7

2013-01-05 Thread Willy Tarreau
On Sat, Jan 05, 2013 at 06:22:13PM -0800, Eric Dumazet wrote:
 On Sun, 2013-01-06 at 03:18 +0100, Willy Tarreau wrote:
  On Sat, Jan 05, 2013 at 06:16:31PM -0800, Eric Dumazet wrote:
   On Sat, 2013-01-05 at 17:51 -0800, Eric Dumazet wrote:
On Sat, 2013-01-05 at 17:40 -0800, Eric Dumazet wrote:
 On Sun, 2013-01-06 at 02:30 +0100, Willy Tarreau wrote:
 
  Ah interesting because these were some of the mm patches that I had
  tried to revert.
 
 Hmm, or we should fix __skb_splice_bits()
 
 I'll send a patch.
 

Could you try the following ?
   
   Or more exactly...
  
  The first one did not change a iota unfortunately. I'm about to
  spot the commit causing the loopback regression. It's a few patches
  before the first one you pointed. It's almost finished and I test
  your patch below immediately after.
 
 I bet you are going to find commit
 69b08f62e17439ee3d436faf0b9a7ca6fffb78db
 (net: use bigger pages in __netdev_alloc_frag )
 
 Am I wrong ?

Yes, this time you guessed wrong :-) Well, maybe it's contributing
to the issue.

It's 0cf833ae (net: loopback: set default mtu to 64K). And I could
reproduce it with 3.6 by setting loopback's MTU to 65536 by hand.
The trick is that once the MTU has been set to this large a value,
even when I set it back to 16kB the problem persists.

Now I'm retrying your other patch to see if it brings the 10GE back
to full speed.

Willy



Re: Major network performance regression in 3.7

2013-01-05 Thread Eric Dumazet
On Sun, 2013-01-06 at 03:32 +0100, Willy Tarreau wrote:

 It's 0cf833ae (net: loopback: set default mtu to 64K). And I could
 reproduce it with 3.6 by setting loopback's MTU to 65536 by hand.
 The trick is that once the MTU has been set to this large a value,
 even when I set it back to 16kB the problem persists.
 

Well, this MTU change can uncover a prior bug, or make it happen faster,
for sure.




