Re: Is ndo_do_ioctl still acceptable?

2015-11-12 Thread Jason A. Donenfeld
On Thu, Nov 12, 2015 at 9:30 PM, Austin S Hemmelgarn
 wrote:
>>
> On the other hand, based on what you are saying about your device, it sounds
> like you are working on some kind of cryptographically secured (either
> authenticated or encrypted or both) tunnel, in which case the fact that
> security is easier to handle with netlink than ioctls becomes important.  If
> you can't ensure security of the endpoint configuration, you can't ensure
> security of the tunnel itself.

Could you substantiate the claim that "security is easier to handle
with netlink"? I've never heard this, and I don't know why it would be
the case. Are you referring to the fact that the copy_to/from_user dance
of ioctl opens up more potential vulnerabilities than netlink's
abstracted validation? Or something else? I'm just confused here...


Re: [PATCH] stmmac: avoid ipq806x constant overflow warning

2015-11-12 Thread Arnd Bergmann
On Thursday 12 November 2015 12:25:28 David Miller wrote:
> From: Arnd Bergmann 
> Date: Thu, 12 Nov 2015 15:12:48 +0100
> 
> > Building dwmac-ipq806x on a 64-bit architecture produces a harmless
> > warning from gcc:
> > 
> > stmmac/dwmac-ipq806x.c: In function 'ipq806x_gmac_probe':
> > include/linux/bitops.h:6:19: warning: overflow in implicit constant conversion [-Woverflow]
> >   val = QSGMII_PHY_CDR_EN |
> > stmmac/dwmac-ipq806x.c:333:8: note: in expansion of macro 'QSGMII_PHY_CDR_EN'
> >  #define QSGMII_PHY_CDR_EN   BIT(0)
> >  #define BIT(nr)   (1UL << (nr))
> > 
> > The compiler warns about the fact that a 64-bit literal is passed
> > into a function that takes a 32-bit argument. I could not fully understand
> > why it warns despite the fact that this number is always small enough
> > to fit, but changing the use of BIT() macros into the equivalent hexadecimal
> > representation avoids the warning.
> > 
> > Signed-off-by: Arnd Bergmann 
> > Fixes: b1c17215d718 ("stmmac: add ipq806x glue layer")
> 
> I've seen this warning too on x86_64 and had been meaning to look
> into it, thanks for taking the initiative. 
> 
> Moving away from using BIT() is somewhat disappointing, because we
> want to encourage people to use these macros.

Ok, I never really liked that macro, so I didn't mind removing it. ;-)

I spent too much time working at IBM where all internal documentation
uses bit numbers that call the MSB bit 0, and drivers randomly use
either the IBM notation or the one everyone else uses, so I always
found the hex numbers way more intuitive.

> Also I don't even understand the compiler's behavior, it's warning
> about QSGMII_PHY_CDR_EN but if you define only that to "0x1u" it still
> warns about QSGMII_PHY_CDR_EN.
> 
> The warning goes away only if you change all 5 BIT() uses.

Yes, I have spent the time to analyze the problem now and it all makes
sense. A proper patch follows.

Arnd


Re: [PATCH v2] stmmac: avoid ipq806x constant overflow warning

2015-11-12 Thread David Miller
From: Arnd Bergmann 
Date: Thu, 12 Nov 2015 22:03:40 +0100

> @@ -337,11 +337,11 @@ static int ipq806x_gmac_probe(struct platform_device *pdev)
>QSGMII_PHY_RX_SIGNAL_DETECT_EN |
>QSGMII_PHY_TX_DRIVER_EN |
>QSGMII_PHY_QSGMII_EN |
> -  0x4 << QSGMII_PHY_PHASE_LOOP_GAIN_OFFSET |
> -  0x3 << QSGMII_PHY_RX_DC_BIAS_OFFSET |
> -  0x1 << QSGMII_PHY_RX_INPUT_EQU_OFFSET |
> -  0x2 << QSGMII_PHY_CDR_PI_SLEW_OFFSET |
> -  0xC << QSGMII_PHY_TX_DRV_AMP_OFFSET);
> +  0x4ul << QSGMII_PHY_PHASE_LOOP_GAIN_OFFSET |
> +  0x3ul << QSGMII_PHY_RX_DC_BIAS_OFFSET |
> +  0x1ul << QSGMII_PHY_RX_INPUT_EQU_OFFSET |
> +  0x2ul << QSGMII_PHY_CDR_PI_SLEW_OFFSET |
> +  0xCul << QSGMII_PHY_TX_DRV_AMP_OFFSET);
>   }
>  

Indeed, this looks so much better.

Applied.

Thanks for looking more deeply into this!


Re: [linux-4.4-mw] BUG: unable to handle kernel paging request ip_vs_out.constprop

2015-11-12 Thread Ido Schimmel
Thu, Nov 12, 2015 at 07:12:03PM IST, li...@eikelenboom.it wrote:
>On 2015-11-12 17:52, Eric Dumazet wrote:
>> On Thu, 2015-11-12 at 16:16 +0100, Sander Eikelenboom wrote:
>> 
>>> > Thanks for the report, please try following patch :
>>> 
>>> Hi Eric,
>>> 
>>> Thanks for the patch!
>>> Got it up and running at the moment, but since I don't have a clear
>>> trigger it will take 1 or 2 days before I can report something back.
>> 
>> Don't worry, I have a pretty good picture of the bug and the patch must
>> fix it.
>> 
>> I'll submit it formally asap.
>
>Ok.
>
>Do you know what these new warnings are for?
>(Apparently all networking including bridging works fine, so is this
>just overly verbose logging?)

Yes, I think I do. I can send a patch tomorrow morning unless someone
beats me to it.

Thanks for reporting!

>
>[  207.033768] vif vif-1-0 vif1.0: set_features() failed (-1); wanted
>0x00044803, left 0x000400114813
>[  207.033780] vif vif-1-0 vif1.0: set_features() failed (-1); wanted
>0x00044803, left 0x000400114813
>[  207.245435] xen_bridge: error setting offload STP state on port
>1(vif1.0)
>[  207.245442] vif vif-1-0 vif1.0: failed to set HW ageing time
>[  207.245443] xen_bridge: error setting offload STP state on port
>1(vif1.0)
>[  207.245491] vif vif-1-0 vif1.0: set_features() failed (-1); wanted
>0x00044803, left 0x000400114813
>
>--
>Sander


[PATCH V2 0/2] arm64: bpf: correct JIT stack setup and make it align with ARM64 AAPCS

2015-11-12 Thread Yang Shi

Changelog in V2:
Split to two patches according to the suggestion from Zi Shen Lim
Show A64_FP in stack layout diagram
Correct "+64" to "-64"

Yang Shi (2):
  arm64: bpf: fix JIT frame pointer setup
  arm64: bpf: make BPF prologue and epilogue align with ARM64 AAPCS

 arch/arm64/net/bpf_jit_comp.c | 38 +++---
 1 file changed, 31 insertions(+), 7 deletions(-)


Re: [PATCH net 1/3] ipv6: Avoid creating RTF_CACHE from a rt that is not managed by fib6 tree

2015-11-12 Thread Chris Siebenmann
> The original bug report:
> https://bugzilla.redhat.com/show_bug.cgi?id=1272571
> 
> The setup has an IPv4 GRE tunnel running inside IPsec.  The bug
> happens when ndisc starts sending router solicitations on the gre
> interface.  The simplified oops stack looks like:
[...]
> Reported-by: Chris Siebenmann 

For what it's worth, this change appears to fix my issue in preliminary
testing. I haven't actually brought it up on a production machine with a
GRE tunnel that actually works, but the sequence of operations on my
test machine that used to reliably oops no longer does so.

(I tested this against both a quite recent upstream git pull and the
current Fedora 22 kernel source with this patch blindly shoved in on top
by just running 'patch'.)

- cks


Re: Is ndo_do_ioctl still acceptable?

2015-11-12 Thread Austin S Hemmelgarn

On 2015-11-12 11:58, Jason A. Donenfeld wrote:

Hi Stephen,

Thanks for your response.

On Thu, Nov 12, 2015 at 5:34 PM, Stephen Hemminger
 wrote:

The problem is that ioctls are device-specific, and therefore create a
dependency on the unique features supported by your device.
The question always comes up: why is this new API not something general?


In this case, it really is for unique features of my device. My device
has its own unique notion of a "peer" based on a particular elliptic
curve point and some other interesting things. It's not something
generalizable to other devices. The thing that makes my particular
device special is this set of attributes that I need to make configurable. I
think then, by your criteria, ioctl would actually be perfect. In
other words, I interpret what you wrote to mean "generalizable:
netlink. device-specific: ioctl." If that's a decent summary, then
ioctl is certainly good for me.

On the other hand, based on what you are saying about your device, it 
sounds like you are working on some kind of cryptographically secured 
(either authenticated or encrypted or both) tunnel, in which case the 
fact that security is easier to handle with netlink than ioctls becomes 
important.  If you can't ensure security of the endpoint configuration, 
you can't ensure security of the tunnel itself.







Re: [PATCH net] sctp: translate host order to network order when setting a hmacid

2015-11-12 Thread Vlad Yasevich
On 11/12/2015 12:07 AM, Xin Long wrote:
> SCTP auth cannot work correctly when a hmacid is set manually, because
> we didn't store the hmacid in network byte order; fix it by adding the
> conversion in sctp_auth_ep_set_hmacs().
> 
> Even if we set the hmacid in network order from userspace, it still
> can't work, because of this condition in sctp_auth_ep_set_hmacs():
> 
>   if (id > SCTP_AUTH_HMAC_ID_MAX)
>   return -EOPNOTSUPP;
> 
> so this wasn't working before and thus it won't break compatibility.
> 
> Signed-off-by: Xin Long 
> Signed-off-by: Marcelo Ricardo Leitner 
> ---
>  net/sctp/auth.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/net/sctp/auth.c b/net/sctp/auth.c
> index 4f15b7d..1543e39 100644
> --- a/net/sctp/auth.c
> +++ b/net/sctp/auth.c
> @@ -809,8 +809,8 @@ int sctp_auth_ep_set_hmacs(struct sctp_endpoint *ep,
>   if (!has_sha1)
>   return -EINVAL;
>  
> - memcpy(ep->auth_hmacs_list->hmac_ids, &hmacs->shmac_idents[0],
> - hmacs->shmac_num_idents * sizeof(__u16));
> + for (i = 0; i < hmacs->shmac_num_idents; i++)
> + ep->auth_hmacs_list->hmac_ids[i] = htons(hmacs->shmac_idents[i]);
>   ep->auth_hmacs_list->param_hdr.length = htons(sizeof(sctp_paramhdr_t) +
>   hmacs->shmac_num_idents * sizeof(__u16));
>   return 0;
> 

Acked-by: Vlad Yasevich 

-vlad


[PATCH iproute2] ip_tunnel: determine tunnel address family from the tunnel type

2015-11-12 Thread Konstantin Shemyak
When creating an IP tunnel over IPv6, the address family must be passed as
an option, e.g.

ip -6 tunnel add mode ip6gre local 1::1 remote 2::2

This makes it impossible to create both IPv4 and IPv6 tunnels in one batch.

In fact the address family option is redundant here, as each tunnel mode is
relevant for only one address family.
The patch determines from the tunnel mode whether the applicable address
family is AF_INET6 instead of the default AF_INET, making the "-6" option
unnecessary for "ip tunnel add".

Signed-off-by: Konstantin Shemyak 
---
 ip/iptunnel.c  | 26 ++
 testsuite/tests/ip/tunnel/add_tunnel.t | 14 ++
 2 files changed, 40 insertions(+)
 create mode 100755 testsuite/tests/ip/tunnel/add_tunnel.t

diff --git a/ip/iptunnel.c b/ip/iptunnel.c
index 78fa988..7826a37 100644
--- a/ip/iptunnel.c
+++ b/ip/iptunnel.c
@@ -629,8 +629,34 @@ static int do_6rd(int argc, char **argv)
return tnl_6rd_ioctl(cmd, medium, );
 }
 
+static int tunnel_mode_is_ipv6(char *tunnel_mode) {
+   char *ipv6_modes[] = {
+   "ipv6/ipv6", "ip6ip6",
+   "vti6",
+   "ip/ipv6", "ipv4/ipv6", "ipip6", "ip4ip6",
+   "ip6gre", "gre/ipv6",
+   "any/ipv6", "any"
+   };
+   int i;
+
+   for (i = 0; i < sizeof(ipv6_modes) / sizeof(char *); i++) {
+   if (strcmp(ipv6_modes[i], tunnel_mode) == 0)
+   return 1;
+   }
+   return 0;
+}
+
 int do_iptunnel(int argc, char **argv)
 {
+   int i;
+
+   for (i = 0; i < argc - 1; i++) {
+   if (strcmp(argv[i], "mode") == 0) {
+   if (tunnel_mode_is_ipv6(argv[i + 1]))
+   preferred_family = AF_INET6;
+   break;
+   }
+   }
switch (preferred_family) {
case AF_UNSPEC:
preferred_family = AF_INET;
diff --git a/testsuite/tests/ip/tunnel/add_tunnel.t b/testsuite/tests/ip/tunnel/add_tunnel.t
new file mode 100755
index 000..18f6e37
--- /dev/null
+++ b/testsuite/tests/ip/tunnel/add_tunnel.t
@@ -0,0 +1,14 @@
+#!/bin/sh
+
+source lib/generic.sh
+
+TUNNEL_NAME="tunnel_test_ip"
+
+ts_log "[Testing add/del tunnels]"
+
+ts_ip "$0" "Add GRE tunnel over IPv4" tunnel add name $TUNNEL_NAME mode gre 
local 1.1.1.1 remote 2.2.2.2
+ts_ip "$0" "Del GRE tunnel over IPv4" tunnel del $TUNNEL_NAME
+
+ts_ip "$0" "Add GRE tunnel over IPv6" tunnel add name $TUNNEL_NAME mode ip6gre 
local dead:beef::1 remote dead:beef::2
+ts_ip "$0" "Del GRE tunnel over IPv6" tunnel del $TUNNEL_NAME
+
-- 
1.9.1
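
As a quick standalone check of the new helper (a throwaway harness written
for this note, not part of the patch), showing which "mode" strings flip the
preferred address family:

#include <stdio.h>
#include <string.h>

/* copied from the patch above */
static int tunnel_mode_is_ipv6(char *tunnel_mode)
{
	char *ipv6_modes[] = {
		"ipv6/ipv6", "ip6ip6",
		"vti6",
		"ip/ipv6", "ipv4/ipv6", "ipip6", "ip4ip6",
		"ip6gre", "gre/ipv6",
		"any/ipv6", "any"
	};
	unsigned int i;

	for (i = 0; i < sizeof(ipv6_modes) / sizeof(char *); i++) {
		if (strcmp(ipv6_modes[i], tunnel_mode) == 0)
			return 1;
	}
	return 0;
}

int main(void)
{
	char *modes[] = { "gre", "ipip", "ip6gre", "ipip6", "vti6" };
	unsigned int i;

	/* "ip tunnel add ... mode <X>" would now pick the family automatically */
	for (i = 0; i < sizeof(modes) / sizeof(char *); i++)
		printf("mode %-6s -> %s\n", modes[i],
		       tunnel_mode_is_ipv6(modes[i]) ? "AF_INET6" : "AF_INET (default)");
	return 0;
}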



[PATCH v2] stmmac: avoid ipq806x constant overflow warning

2015-11-12 Thread Arnd Bergmann
Building dwmac-ipq806x on a 64-bit architecture produces a harmless
warning from gcc:

stmmac/dwmac-ipq806x.c: In function 'ipq806x_gmac_probe':
include/linux/bitops.h:6:19: warning: overflow in implicit constant conversion [-Woverflow]
  val = QSGMII_PHY_CDR_EN |
stmmac/dwmac-ipq806x.c:333:8: note: in expansion of macro 'QSGMII_PHY_CDR_EN'
 #define QSGMII_PHY_CDR_EN   BIT(0)
 #define BIT(nr)   (1UL << (nr))

This is a result of the type conversion rules in C, when we take the
logical OR of multiple different types. In particular, we have
an unsigned long

QSGMII_PHY_CDR_EN == BIT(0) == (1ul << 0) == 0x0001ul

and a signed int

0xC << QSGMII_PHY_TX_DRV_AMP_OFFSET == 0xc0000000

which together gives a signed long value

0xffffffffc0000001l

and when this is passed into a function that takes an unsigned int type,
gcc warns about the signed overflow and the loss of the upper 32 bits that
are all ones.

This patch adds 'ul' type modifiers to the literal numbers passed in
here, so now the expression remains an 'unsigned long' with the upper
bits all zero, and that avoids the signed overflow and the warning.
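
For reference, a minimal standalone sketch of the same conversion (the
driver's register-write helper is stood in for by a hypothetical write_reg();
the shift amount of 28 is assumed, being what puts 0xC into the sign bit):

#include <stdio.h>

#define BIT(nr) (1UL << (nr))	/* unsigned long, as in include/linux/bitops.h */

static void write_reg(unsigned int val)	/* 32-bit argument, as described above */
{
	printf("val = 0x%x\n", val);
}

int main(void)
{
	/* 0xC << 28 is a negative signed int; OR'ed with the unsigned long BIT(0)
	 * it becomes 0xffffffffc0000001 on a 64-bit build, which no longer fits in
	 * the 32-bit parameter, so gcc may warn about the implicit conversion: */
	write_reg(BIT(0) | 0xC << 28);

	/* with the 'ul' suffix the constant stays 0xc0000001ul and fits: */
	write_reg(BIT(0) | 0xCul << 28);
	return 0;
}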

Signed-off-by: Arnd Bergmann 
Fixes: b1c17215d718 ("stmmac: add ipq806x glue layer")

diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-ipq806x.c b/drivers/net/ethernet/stmicro/stmmac/dwmac-ipq806x.c
index 9d89bdbf029f..82de68b1a452 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac-ipq806x.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-ipq806x.c
@@ -337,11 +337,11 @@ static int ipq806x_gmac_probe(struct platform_device *pdev)
 QSGMII_PHY_RX_SIGNAL_DETECT_EN |
 QSGMII_PHY_TX_DRIVER_EN |
 QSGMII_PHY_QSGMII_EN |
-0x4 << QSGMII_PHY_PHASE_LOOP_GAIN_OFFSET |
-0x3 << QSGMII_PHY_RX_DC_BIAS_OFFSET |
-0x1 << QSGMII_PHY_RX_INPUT_EQU_OFFSET |
-0x2 << QSGMII_PHY_CDR_PI_SLEW_OFFSET |
-0xC << QSGMII_PHY_TX_DRV_AMP_OFFSET);
+0x4ul << QSGMII_PHY_PHASE_LOOP_GAIN_OFFSET |
+0x3ul << QSGMII_PHY_RX_DC_BIAS_OFFSET |
+0x1ul << QSGMII_PHY_RX_INPUT_EQU_OFFSET |
+0x2ul << QSGMII_PHY_CDR_PI_SLEW_OFFSET |
+0xCul << QSGMII_PHY_TX_DRV_AMP_OFFSET);
}
 
plat_dat->has_gmac = true;



[PATCH 2/2] arm64: bpf: make BPF prologue and epilogue align with ARM64 AAPCS

2015-11-12 Thread Yang Shi
Save and restore FP/LR in BPF prog prologue and epilogue, save SP to FP
in prologue in order to get the correct stack backtrace.

However, the ARM64 JIT used FP (x29) as the eBPF fp register. FP is subject
to change during function calls, so it may cause the BPF prog stack base
address to change too.

Use x25 to replace FP as the BPF stack base register (fp). Since x25 is a
callee-saved register, it stays intact across function calls.
It is initialized in the BPF prog prologue every time the BPF prog starts
to run. When the BPF prog exits, it can simply be discarded.

So, the BPF stack layout looks like:

 high
 original A64_SP =>   0:+-+ BPF prologue
| | FP/LR and callee saved registers
 BPF fp register => -64:+-+
| |
| ... | BPF prog stack
| |
| |
 current A64_SP/FP =>   +-+
| |
| ... | Function call stack
| |
+-+
  low

CC: Zi Shen Lim 
CC: Xi Wang 
Signed-off-by: Yang Shi 
---
 arch/arm64/net/bpf_jit_comp.c | 34 +-
 1 file changed, 29 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c
index ac8b548..8753bb7 100644
--- a/arch/arm64/net/bpf_jit_comp.c
+++ b/arch/arm64/net/bpf_jit_comp.c
@@ -50,7 +50,7 @@ static const int bpf2a64[] = {
[BPF_REG_8] = A64_R(21),
[BPF_REG_9] = A64_R(22),
/* read-only frame pointer to access stack */
-   [BPF_REG_FP] = A64_FP,
+   [BPF_REG_FP] = A64_R(25),
/* temporary register for internal BPF JIT */
[TMP_REG_1] = A64_R(23),
[TMP_REG_2] = A64_R(24),
@@ -155,17 +155,41 @@ static void build_prologue(struct jit_ctx *ctx)
stack_size += 4; /* extra for skb_copy_bits buffer */
stack_size = STACK_ALIGN(stack_size);
 
+   /*
+* BPF prog stack layout
+*
+* high
+* original A64_SP =>   0:+-+ BPF prologue
+*| | FP/LR and callee saved registers
+* BPF fp register => -64:+-+
+*| |
+*| ... | BPF prog stack
+*| |
+*| |
+* current A64_SP/FP =>   +-+
+*| |
+*| ... | Function call stack
+*| |
+*+-+
+*  low
+*
+*/
+
+   /* Save FP and LR registers to stay aligned with ARM64 AAPCS */
+   emit(A64_PUSH(A64_FP, A64_LR, A64_SP), ctx);
+
/* Save callee-saved register */
emit(A64_PUSH(r6, r7, A64_SP), ctx);
emit(A64_PUSH(r8, r9, A64_SP), ctx);
if (ctx->tmp_used)
emit(A64_PUSH(tmp1, tmp2, A64_SP), ctx);
 
-   /* Set up frame pointer */
+   /* Set up BPF prog stack base register (x25) */
emit(A64_MOV(1, fp, A64_SP), ctx);
 
-   /* Set up BPF stack */
+   /* Set up function call stack */
emit(A64_SUB_I(1, A64_SP, A64_SP, stack_size), ctx);
+   emit(A64_MOV(1, A64_FP, A64_SP), ctx);
 
/* Clear registers A and X */
emit_a64_mov_i64(ra, 0, ctx);
@@ -196,8 +220,8 @@ static void build_epilogue(struct jit_ctx *ctx)
emit(A64_POP(r8, r9, A64_SP), ctx);
emit(A64_POP(r6, r7, A64_SP), ctx);
 
-   /* Restore frame pointer */
-   emit(A64_MOV(1, fp, A64_SP), ctx);
+   /* Restore FP/LR registers */
+   emit(A64_POP(A64_FP, A64_LR, A64_SP), ctx);
 
/* Set return value */
emit(A64_MOV(1, A64_R(0), r0), ctx);
-- 
2.0.2



Re: [PATCH] net: phy: at803x: support interrupt on 8030 and 8035

2015-11-12 Thread Mason
On 12/11/2015 20:14, Florian Fainelli wrote:
> On 12/11/15 11:09, Måns Rullgård wrote:
>> On 12 November 2015 19:06:23 GMT+00:00, Mason wrote:
>>> On 12/11/2015 18:40, Mans Rullgard wrote:
 Commit 77a993942 "phy/at8031: enable at8031 to work on interrupt mode"
 added interrupt support for the 8031 PHY but left out the other two
 chips supported by this driver.

 This patch sets the .ack_interrupt and .config_intr functions for the
 8030 and 8035 drivers as well.

 Signed-off-by: Mans Rullgard 
 ---
 I have only tested this with an 8035.  I can't find a datasheet for
 the 8030, but since 8031, 8032, and 8035 all have the same register
 layout, there's a good chance 8030 does as well.
 ---
  drivers/net/phy/at803x.c | 4 
  1 file changed, 4 insertions(+)

 diff --git a/drivers/net/phy/at803x.c b/drivers/net/phy/at803x.c
 index fabf11d..2d020a3 100644
 --- a/drivers/net/phy/at803x.c
 +++ b/drivers/net/phy/at803x.c
 @@ -308,6 +308,8 @@ static struct phy_driver at803x_driver[] = {
.flags  = PHY_HAS_INTERRUPT,
.config_aneg= genphy_config_aneg,
.read_status= genphy_read_status,
 +  .ack_interrupt  = at803x_ack_interrupt,
 +  .config_intr= at803x_config_intr,
.driver = {
.owner = THIS_MODULE,
},
 @@ -327,6 +329,8 @@ static struct phy_driver at803x_driver[] = {
.flags  = PHY_HAS_INTERRUPT,
.config_aneg= genphy_config_aneg,
.read_status= genphy_read_status,
 +  .ack_interrupt  = at803x_ack_interrupt,
 +  .config_intr= at803x_config_intr,
.driver = {
.owner = THIS_MODULE,
},
>>>
>>> Shouldn't we take the opportunity to clean up the duplicated register
>>> definitions? (I'll send an informal patch to spur discussion.)
>>>
>>> Regards.
>>
>> That can be done independently. Feel free to send a patch.
> 
> Agreed, that deserve a separate patch.

Isn't there a problem when at803x_set_wol() sets the AT803X_WOL_ENABLE
bit, but a DISABLE/ENABLE cycle through at803x_config_intr() will
discard that bit?

Regards.



[PATCH 1/2] arm64: bpf: fix JIT frame pointer setup

2015-11-12 Thread Yang Shi
BPF fp should point to the top of the BPF prog stack. The original
implementation made it point to the bottom incorrectly.
Move A64_SP to fp before reserving the BPF prog stack space.

CC: Zi Shen Lim 
CC: Xi Wang 
Signed-off-by: Yang Shi 
---
 arch/arm64/net/bpf_jit_comp.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c
index a44e529..ac8b548 100644
--- a/arch/arm64/net/bpf_jit_comp.c
+++ b/arch/arm64/net/bpf_jit_comp.c
@@ -161,12 +161,12 @@ static void build_prologue(struct jit_ctx *ctx)
if (ctx->tmp_used)
emit(A64_PUSH(tmp1, tmp2, A64_SP), ctx);
 
-   /* Set up BPF stack */
-   emit(A64_SUB_I(1, A64_SP, A64_SP, stack_size), ctx);
-
/* Set up frame pointer */
emit(A64_MOV(1, fp, A64_SP), ctx);
 
+   /* Set up BPF stack */
+   emit(A64_SUB_I(1, A64_SP, A64_SP, stack_size), ctx);
+
/* Clear registers A and X */
emit_a64_mov_i64(ra, 0, ctx);
emit_a64_mov_i64(rx, 0, ctx);
-- 
2.0.2



Re: [PATCH] tools/net: Use include/uapi with __EXPORTED_HEADERS__

2015-11-12 Thread Daniel Borkmann

On 11/11/2015 11:24 PM, Kamal Mostafa wrote:

Use the local uapi headers to keep in sync with "recently" added #define's
(e.g. SKF_AD_VLAN_TPID).  Refactored CFLAGS, and bpf_asm doesn't need -I.

Fixes: 3f356385e8a449e1d7cfc6b6f8d634ac4f5581a0
Signed-off-by: Kamal Mostafa 


Acked-by: Daniel Borkmann 


Re: [PATCH stable <= 3.18] net: add length argument to skb_copy_and_csum_datagram_iovec

2015-11-12 Thread Sabrina Dubroca
2015-11-10, 16:03:52 -0800, Greg Kroah-Hartman wrote:
> On Tue, Nov 10, 2015 at 05:59:26PM -0600, Josh Hunt wrote:
> > On Thu, Oct 29, 2015 at 5:00 AM, Sabrina Dubroca  
> > wrote:
> > > 2015-10-15, 14:25:03 +0200, Sabrina Dubroca wrote:
> > >> Without this length argument, we can read past the end of the iovec in
> > >> memcpy_toiovec because we have no way of knowing the total length of the
> > >> iovec's buffers.
> > >>
> > >> This is needed for stable kernels where 89c22d8c3b27 ("net: Fix skb
> > >> csum races when peeking") has been backported but that don't have the
> > >> ioviter conversion, which is almost all the stable trees <= 3.18.
> > >>
> > >> This also fixes a kernel crash for NFS servers when the client uses
> > >>  -onfsvers=3,proto=udp to mount the export.
> > >>
> > >> Signed-off-by: Sabrina Dubroca 
> > >> Reviewed-by: Hannes Frederic Sowa 
> > >
> > > Fixes CVE-2015-8019.
> > > http://www.openwall.com/lists/oss-security/2015/10/29/1
> > >
> > > --
> > > Sabrina
> > 
> > Greg
> > 
> > Do you have this in your queue? I saw a few other stables pick this
> > up, but haven't seen it in 3.14 or 3.18 yet. It wasn't clear to me if
> > this had been fully reviewed yet.
> 
> I rely on Dave to package up networking stable patches and forward them
> on to me, that's why you haven't seen it be picked up yet.
> 
> thanks,
> 
> greg k-h

David, can you queue this up?

Thanks,

-- 
Sabrina


Re: [PATCH] vhost: move is_le setup to the backend

2015-11-12 Thread Greg Kurz
On Fri, 30 Oct 2015 12:42:35 +0100
Greg Kurz  wrote:

> The vq->is_le field is used to fix endianness when accessing the vring via
> the cpu_to_vhost16() and vhost16_to_cpu() helpers in the following cases:
> 
> 1) host is big endian and device is modern virtio
> 
> 2) host has cross-endian support and device is legacy virtio with a different
>endianness than the host
> 
> Both cases rely on the VHOST_SET_FEATURES ioctl, but 2) also needs the
> VHOST_SET_VRING_ENDIAN ioctl to be called by userspace. Since vq->is_le
> is only needed when the backend is active, it was decided to set it at
> backend start.
> 
> This is currently done in vhost_init_used()->vhost_init_is_le() but it
> obfuscates the core vhost code. This patch moves the is_le setup to a
> dedicated function that is called from the backend code.
> 
> Note vhost_net is the only backend that can pass vq->private_data == NULL to
> vhost_init_used(), hence the "if (sock)" branch.
> 
> No behaviour change.
> 
> Signed-off-by: Greg Kurz 
> ---

Ping ?

>  drivers/vhost/net.c   |6 ++
>  drivers/vhost/scsi.c  |3 +++
>  drivers/vhost/test.c  |2 ++
>  drivers/vhost/vhost.c |   12 +++-
>  drivers/vhost/vhost.h |1 +
>  5 files changed, 19 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 9eda69e40678..d6319cb2664c 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -917,6 +917,12 @@ static long vhost_net_set_backend(struct vhost_net *n, 
> unsigned index, int fd)
> 
>   vhost_net_disable_vq(n, vq);
>   vq->private_data = sock;
> +
> + if (sock)
> + vhost_set_is_le(vq);
> + else
> + vq->is_le = virtio_legacy_is_little_endian();
> +
>   r = vhost_init_used(vq);
>   if (r)
>   goto err_used;
> diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
> index e25a23692822..e2644a301fa5 100644
> --- a/drivers/vhost/scsi.c
> +++ b/drivers/vhost/scsi.c
> @@ -1276,6 +1276,9 @@ vhost_scsi_set_endpoint(struct vhost_scsi *vs,
>   vq = &vs->vqs[i].vq;
>   mutex_lock(&vq->mutex);
>   vq->private_data = vs_tpg;
> +
> + vhost_set_is_le(vq);
> +
>   vhost_init_used(vq);
>   mutex_unlock(&vq->mutex);
>   }
> diff --git a/drivers/vhost/test.c b/drivers/vhost/test.c
> index f2882ac98726..b1c7df502211 100644
> --- a/drivers/vhost/test.c
> +++ b/drivers/vhost/test.c
> @@ -196,6 +196,8 @@ static long vhost_test_run(struct vhost_test *n, int test)
>   oldpriv = vq->private_data;
>   vq->private_data = priv;
> 
> + vhost_set_is_le(vq);
> +
>   r = vhost_init_used(&n->vqs[index]);
> 
>   mutex_unlock(&vq->mutex);
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index eec2f11809ff..6be863dcbd13 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -113,6 +113,12 @@ static void vhost_init_is_le(struct vhost_virtqueue *vq)
>  }
>  #endif /* CONFIG_VHOST_CROSS_ENDIAN_LEGACY */
> 
> +void vhost_set_is_le(struct vhost_virtqueue *vq)
> +{
> + vhost_init_is_le(vq);
> +}
> +EXPORT_SYMBOL_GPL(vhost_set_is_le);
> +
>  static void vhost_poll_func(struct file *file, wait_queue_head_t *wqh,
>   poll_table *pt)
>  {
> @@ -1156,12 +1162,8 @@ int vhost_init_used(struct vhost_virtqueue *vq)
>  {
>   __virtio16 last_used_idx;
>   int r;
> - if (!vq->private_data) {
> - vq->is_le = virtio_legacy_is_little_endian();
> + if (!vq->private_data)
>   return 0;
> - }
> -
> - vhost_init_is_le(vq);
> 
>   r = vhost_update_used_flags(vq);
>   if (r)
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index 4772862b71a7..8a62041959fe 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -162,6 +162,7 @@ bool vhost_enable_notify(struct vhost_dev *, struct 
> vhost_virtqueue *);
> 
>  int vhost_log_write(struct vhost_virtqueue *vq, struct vhost_log *log,
>   unsigned int log_num, u64 len);
> +void vhost_set_is_le(struct vhost_virtqueue *vq);
> 
>  #define vq_err(vq, fmt, ...) do {  \
>   pr_debug(pr_fmt(fmt), ##__VA_ARGS__);   \
> 



Re: [PATCH 2/2] arm64: bpf: add BPF XADD instruction

2015-11-12 Thread Peter Zijlstra
On Wed, Nov 11, 2015 at 03:40:15PM -0800, Alexei Starovoitov wrote:
> On Wed, Nov 11, 2015 at 11:21:35PM +0100, Peter Zijlstra wrote:
> > On Wed, Nov 11, 2015 at 11:55:59AM -0800, Alexei Starovoitov wrote:
> > > Therefore things like memory barriers, full set of atomics are not 
> > > applicable
> > > in bpf world.
> > 
> > There are still plenty of wait-free constructs one can make using them.
> 
> yes, but all such lock-free algos are typically based on cmpxchg8b and a
> tight loop, so it would be very hard for the verifier to prove termination
> of such loops. I think when we need to add something like this, we'll
> add a new bpf insn that will be membarrier+cmpxchg8b+check+loop as
> a single insn, so it cannot be misused.
> I don't know of any concrete use case yet. All possible though.

So this is where the 'unconditional' atomic ops come in handy.

Like the x86: xchg, lock {xadd,add,sub,inc,dec,or,and,xor}

Those do not have a loop, and then you can create truly wait-free
things; even some applications of cmpxchg do not actually need the loop.

But this class of wait-free constructs is indeed significantly smaller
than the class of lock-less constructs.

> btw, support for mini loops was requested many times in the past.
> I guess we'd have to add something like this, but it's tricky.
> Mainly because control flow graph analysis becomes much more complicated.

Agreed, that does sound like an 'interesting' problem :-)

Something like:

atomic_op(ptr, f)
{
for (;;) {
val = *ptr;
new = f(val);
old = cmpxchg(ptr, val, new);
if (old == val)
break;

cpu_relax();
}
}

might be castable as an instruction I suppose, but I'm not sure you have
function references in (e)BPF.

The above is 'sane' if f is sane (although there is a
starvation case, which is why things like sparc (iirc) need an
increasing backoff instead of cpu_relax()).
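
As a standalone C11 illustration of the distinction (a sketch for this note,
not code from the thread), the unconditional op needs no loop while the
cmpxchg emulation does:

#include <stdatomic.h>
#include <stdio.h>

static _Atomic long counter;

/* wait-free: one unconditional atomic op (a single LOCK XADD on x86) */
static long stat_inc(void)
{
	return atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed);
}

/* lock-free but not wait-free: the cmpxchg loop sketched above, which may
 * retry indefinitely under contention */
static long stat_inc_cmpxchg(void)
{
	long old = atomic_load_explicit(&counter, memory_order_relaxed);

	while (!atomic_compare_exchange_weak_explicit(&counter, &old, old + 1,
						      memory_order_relaxed,
						      memory_order_relaxed))
		;	/* a failed CAS refreshes 'old' with the current value */
	return old;
}

int main(void)
{
	stat_inc();
	stat_inc_cmpxchg();
	printf("counter = %ld\n", atomic_load_explicit(&counter, memory_order_relaxed));
	return 0;
}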


[PATCH] bpf: samples: exclude asm/sysreg.h for arm64

2015-11-12 Thread Yang Shi
commit 338d4f49d6f7114a017d294ccf7374df4f998edc
("arm64: kernel: Add support for Privileged Access Never") includes sysreg.h
into futex.h and uaccess.h. But the inline assembly used by asm/sysreg.h is
incompatible with llvm, so it causes a BPF samples build failure for ARM64.
Since sysreg.h is not needed by the BPF samples, just exclude it from the
Makefile by defining __ASM_SYSREG_H.
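
For context, pre-defining __ASM_SYSREG_H works because that macro is the
header's include guard; a standalone illustration (the guarded body below is
a stand-in, not the real header):

#include <stdio.h>

#define __ASM_SYSREG_H		/* what the Makefile's -D__ASM_SYSREG_H does */

/* stand-in for arch/arm64/include/asm/sysreg.h and its guard */
#ifndef __ASM_SYSREG_H
#define __ASM_SYSREG_H
#error "the inline assembly that llvm cannot build would live here"
#endif

int main(void)
{
	puts("built without ever seeing the guarded header body");
	return 0;
}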

Signed-off-by: Yang Shi 
---
 samples/bpf/Makefile | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 79b4596..edd638b 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -67,10 +67,13 @@ HOSTLOADLIBES_lathist += -lelf
 # point this to your LLVM backend with bpf support
 LLC=$(srctree)/tools/bpf/llvm/bld/Debug+Asserts/bin/llc
 
+# The inline assembly used by asm/sysreg.h is incompatible with llvm.
+# But there is no easy way to fix it, so just exclude it since it is
+# useless for BPF samples.
 $(obj)/%.o: $(src)/%.c
clang $(NOSTDINC_FLAGS) $(LINUXINCLUDE) $(EXTRA_CFLAGS) \
-   -D__KERNEL__ -Wno-unused-value -Wno-pointer-sign \
+   -D__KERNEL__ -D__ASM_SYSREG_H -Wno-unused-value -Wno-pointer-sign \
-O2 -emit-llvm -c $< -o -| $(LLC) -march=bpf -filetype=obj -o $@
clang $(NOSTDINC_FLAGS) $(LINUXINCLUDE) $(EXTRA_CFLAGS) \
-   -D__KERNEL__ -Wno-unused-value -Wno-pointer-sign \
+   -D__KERNEL__ -D__ASM_SYSREG_H -Wno-unused-value -Wno-pointer-sign \
-O2 -emit-llvm -c $< -o -| $(LLC) -march=bpf -filetype=asm -o $@.s
-- 
2.0.2



[PATCH 08/14] net: tcp_memcontrol: sanitize tcp memory accounting callbacks

2015-11-12 Thread Johannes Weiner
There won't be a tcp control soft limit, so integrating the memcg code
into the global skmem limiting scheme complicates things
unnecessarily. Replace this with simple and clear charge and uncharge
calls--hidden behind a jump label--to account skb memory.

Note that this is not purely aesthetic: as a result of shoehorning the
per-memcg code into the same memory accounting functions that handle
the global level, the old code would compare the per-memcg consumption
against the smaller of the per-memcg limit and the global limit. This
allowed the total consumption of multiple sockets to exceed the global
limit, as long as the individual sockets stayed within bounds. After
this change, the code will always compare the per-memcg consumption to
the per-memcg limit, and the global consumption to the global limit,
and thus close this loophole.
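
To make the loophole concrete, a small standalone arithmetic sketch (the
numbers are invented for illustration; this is not kernel code):

#include <stdio.h>

int main(void)
{
	long global_limit = 150;		/* pages, invented */
	long memcg_limit  = 100;		/* each of two cgroups */
	long usage[2]     = { 100, 100 };	/* each stays within its own limit */
	long total        = usage[0] + usage[1];

	/* old scheme: per-memcg consumption checked against min(per-memcg, global) */
	long old_effective = memcg_limit < global_limit ? memcg_limit : global_limit;
	printf("old: each checked against %ld -> total %ld quietly exceeds global %ld\n",
	       old_effective, total, global_limit);

	/* new scheme: per-memcg vs per-memcg limit AND global vs global limit */
	printf("new: the separate global check (%ld > %ld) now catches this\n",
	       total, global_limit);
	return 0;
}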

Without a soft limit, the per-memcg memory pressure state in sockets
is generally questionable. However, we did it until now, so we
continue to enter it when the hard limit is hit, and packets are
dropped, to let other sockets in the cgroup know that they shouldn't
grow their transmit windows, either. However, keep it simple in the
new callback model and leave memory pressure lazily when the next
packet is accepted (as opposed to doing it synchronously when packets
are processed). When packets are dropped, network performance will
already be in the toilet, so that should be a reasonable trade-off.

As described above, consumption is now checked on the per-memcg level
and the global level separately. Likewise, memory pressure states are
maintained on both the per-memcg level and the global level, and a
socket is considered under pressure when either level asserts as much.

Signed-off-by: Johannes Weiner 
---
 include/linux/memcontrol.h | 12 -
 include/net/sock.h | 63 ++
 include/net/tcp.h  |  5 ++--
 mm/memcontrol.c| 32 +++
 net/core/sock.c| 26 +++
 net/ipv4/tcp_output.c  |  7 --
 6 files changed, 69 insertions(+), 76 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 96ca3d3..906dfff 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -676,12 +676,6 @@ void mem_cgroup_count_vm_event(struct mm_struct *mm, enum 
vm_event_item idx)
 }
 #endif /* CONFIG_MEMCG */
 
-enum {
-   UNDER_LIMIT,
-   SOFT_LIMIT,
-   OVER_LIMIT,
-};
-
 #ifdef CONFIG_CGROUP_WRITEBACK
 
 struct list_head *mem_cgroup_cgwb_list(struct mem_cgroup *memcg);
@@ -711,6 +705,12 @@ static inline void mem_cgroup_wb_stats(struct 
bdi_writeback *wb,
 struct sock;
 void sock_update_memcg(struct sock *sk);
 void sock_release_memcg(struct sock *sk);
+bool mem_cgroup_charge_skmem(struct cg_proto *proto, unsigned int nr_pages);
+void mem_cgroup_uncharge_skmem(struct cg_proto *proto, unsigned int nr_pages);
+static inline bool mem_cgroup_under_socket_pressure(struct cg_proto *proto)
+{
+   return proto->memory_pressure;
+}
 #endif /* CONFIG_INET && CONFIG_MEMCG_KMEM */
 
 #ifdef CONFIG_MEMCG_KMEM
diff --git a/include/net/sock.h b/include/net/sock.h
index 2eefc99..8cc7613 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1126,8 +1126,8 @@ static inline bool sk_under_memory_pressure(const struct 
sock *sk)
if (!sk->sk_prot->memory_pressure)
return false;
 
-   if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
-   return !!sk->sk_cgrp->memory_pressure;
+   if (mem_cgroup_sockets_enabled && sk->sk_cgrp &&
+   mem_cgroup_under_socket_pressure(sk->sk_cgrp))
 
return !!*sk->sk_prot->memory_pressure;
 }
@@ -1141,9 +1141,6 @@ static inline void sk_leave_memory_pressure(struct sock 
*sk)
 
if (*memory_pressure)
*memory_pressure = 0;
-
-   if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
-   sk->sk_cgrp->memory_pressure = 0;
 }
 
 static inline void sk_enter_memory_pressure(struct sock *sk)
@@ -1151,76 +1148,30 @@ static inline void sk_enter_memory_pressure(struct sock 
*sk)
if (!sk->sk_prot->enter_memory_pressure)
return;
 
-   if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
-   sk->sk_cgrp->memory_pressure = 1;
-
sk->sk_prot->enter_memory_pressure(sk);
 }
 
 static inline long sk_prot_mem_limits(const struct sock *sk, int index)
 {
-   long limit = sk->sk_prot->sysctl_mem[index];
-
-   if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
-   limit = min_t(long, limit, sk->sk_cgrp->memory_allocated.limit);
-
-   return limit;
-}
-
-static inline void memcg_memory_allocated_add(struct cg_proto *prot,
- unsigned long amt,
- int *parent_status)
-{
-   struct page_counter *counter;
-
-   if (page_counter_try_charge(&prot->memory_allocated, amt, &counter))
- 

[PATCH 09/14] net: tcp_memcontrol: simplify linkage between socket and page counter

2015-11-12 Thread Johannes Weiner
There won't be any separate counters for socket memory consumed by
protocols other than TCP in the future. Remove the indirection and
link sockets directly to their owning memory cgroup.

Signed-off-by: Johannes Weiner 
---
 include/linux/memcontrol.h   | 18 +++-
 include/net/sock.h   | 37 ---
 include/net/tcp.h|  4 +--
 include/net/tcp_memcontrol.h |  1 -
 mm/memcontrol.c  | 57 ++--
 net/core/sock.c  | 52 +---
 net/ipv4/tcp_ipv4.c  |  7 +
 net/ipv4/tcp_memcontrol.c| 70 ++--
 net/ipv4/tcp_output.c|  4 +--
 net/ipv6/tcp_ipv6.c  |  3 --
 10 files changed, 71 insertions(+), 182 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 906dfff..1c71f27 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -99,16 +99,6 @@ struct cg_proto {
struct page_counter memory_allocated;   /* Current allocated 
memory. */
int memory_pressure;
unsigned long   flags;
-   /*
-* memcg field is used to find which memcg we belong directly
-* Each memcg struct can hold more than one cg_proto, so container_of
-* won't really cut.
-*
-* The elegant solution would be having an inverse function to
-* proto_cgroup in struct proto, but that means polluting the structure
-* for everybody, instead of just for memcg users.
-*/
-   struct mem_cgroup   *memcg;
 };
 
 #ifdef CONFIG_MEMCG
@@ -705,11 +695,11 @@ static inline void mem_cgroup_wb_stats(struct 
bdi_writeback *wb,
 struct sock;
 void sock_update_memcg(struct sock *sk);
 void sock_release_memcg(struct sock *sk);
-bool mem_cgroup_charge_skmem(struct cg_proto *proto, unsigned int nr_pages);
-void mem_cgroup_uncharge_skmem(struct cg_proto *proto, unsigned int nr_pages);
-static inline bool mem_cgroup_under_socket_pressure(struct cg_proto *proto)
+bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages);
+void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int 
nr_pages);
+static inline bool mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg)
 {
-   return proto->memory_pressure;
+   return memcg->tcp_mem.memory_pressure;
 }
 #endif /* CONFIG_INET && CONFIG_MEMCG_KMEM */
 
diff --git a/include/net/sock.h b/include/net/sock.h
index 8cc7613..b439dcc 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -69,22 +69,6 @@
 #include 
 #include 
 
-struct cgroup;
-struct cgroup_subsys;
-#ifdef CONFIG_NET
-int mem_cgroup_sockets_init(struct mem_cgroup *memcg, struct cgroup_subsys 
*ss);
-void mem_cgroup_sockets_destroy(struct mem_cgroup *memcg);
-#else
-static inline
-int mem_cgroup_sockets_init(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
-{
-   return 0;
-}
-static inline
-void mem_cgroup_sockets_destroy(struct mem_cgroup *memcg)
-{
-}
-#endif
 /*
  * This structure really needs to be cleaned up.
  * Most of it is for TCP, and not used by any of
@@ -310,7 +294,7 @@ struct cg_proto;
   *@sk_security: used by security modules
   *@sk_mark: generic packet mark
   *@sk_classid: this socket's cgroup classid
-  *@sk_cgrp: this socket's cgroup-specific proto data
+  *@sk_memcg: this socket's memory cgroup association
   *@sk_write_pending: a write to stream socket waits to start
   *@sk_state_change: callback to indicate change in the state of the sock
   *@sk_data_ready: callback to indicate there is data to be processed
@@ -447,7 +431,7 @@ struct sock {
 #ifdef CONFIG_CGROUP_NET_CLASSID
u32 sk_classid;
 #endif
-   struct cg_proto *sk_cgrp;
+   struct mem_cgroup   *sk_memcg;
void(*sk_state_change)(struct sock *sk);
void(*sk_data_ready)(struct sock *sk);
void(*sk_write_space)(struct sock *sk);
@@ -1051,18 +1035,6 @@ struct proto {
 #ifdef SOCK_REFCNT_DEBUG
atomic_tsocks;
 #endif
-#ifdef CONFIG_MEMCG_KMEM
-   /*
-* cgroup specific init/deinit functions. Called once for all
-* protocols that implement it, from cgroups populate function.
-* This function has to setup any files the protocol want to
-* appear in the kmem cgroup filesystem.
-*/
-   int (*init_cgroup)(struct mem_cgroup *memcg,
-  struct cgroup_subsys *ss);
-   void(*destroy_cgroup)(struct mem_cgroup *memcg);
-   struct cg_proto *(*proto_cgroup)(struct mem_cgroup *memcg);
-#endif
 };
 
 int proto_register(struct proto *prot, int alloc_slab);
@@ -1126,8 +1098,9 @@ static inline bool sk_under_memory_pressure(const struct 
sock *sk)

[PATCH 07/14] net: tcp_memcontrol: simplify the per-memcg limit access

2015-11-12 Thread Johannes Weiner
tcp_memcontrol replicates the global sysctl_mem limit array per
cgroup, but it only ever sets these entries to the value of the
memory_allocated page_counter limit. Use the latter directly.

Signed-off-by: Johannes Weiner 
---
 include/linux/memcontrol.h | 1 -
 include/net/sock.h | 8 +---
 net/ipv4/tcp_memcontrol.c  | 8 
 3 files changed, 5 insertions(+), 12 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 185df8c..96ca3d3 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -98,7 +98,6 @@ enum cg_proto_flags {
 struct cg_proto {
struct page_counter memory_allocated;   /* Current allocated 
memory. */
int memory_pressure;
-   longsysctl_mem[3];
unsigned long   flags;
/*
 * memcg field is used to find which memcg we belong directly
diff --git a/include/net/sock.h b/include/net/sock.h
index ed141b3..2eefc99 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1159,10 +1159,12 @@ static inline void sk_enter_memory_pressure(struct sock 
*sk)
 
 static inline long sk_prot_mem_limits(const struct sock *sk, int index)
 {
-   long *prot = sk->sk_prot->sysctl_mem;
+   long limit = sk->sk_prot->sysctl_mem[index];
+
if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
-   prot = sk->sk_cgrp->sysctl_mem;
-   return prot[index];
+   limit = min_t(long, limit, sk->sk_cgrp->memory_allocated.limit);
+
+   return limit;
 }
 
 static inline void memcg_memory_allocated_add(struct cg_proto *prot,
diff --git a/net/ipv4/tcp_memcontrol.c b/net/ipv4/tcp_memcontrol.c
index 8965638..c383e68 100644
--- a/net/ipv4/tcp_memcontrol.c
+++ b/net/ipv4/tcp_memcontrol.c
@@ -21,9 +21,6 @@ int tcp_init_cgroup(struct mem_cgroup *memcg, struct 
cgroup_subsys *ss)
if (!cg_proto)
return 0;
 
-   cg_proto->sysctl_mem[0] = sysctl_tcp_mem[0];
-   cg_proto->sysctl_mem[1] = sysctl_tcp_mem[1];
-   cg_proto->sysctl_mem[2] = sysctl_tcp_mem[2];
cg_proto->memory_pressure = 0;
cg_proto->memcg = memcg;
 
@@ -54,7 +51,6 @@ EXPORT_SYMBOL(tcp_destroy_cgroup);
 static int tcp_update_limit(struct mem_cgroup *memcg, unsigned long nr_pages)
 {
struct cg_proto *cg_proto;
-   int i;
int ret;
 
cg_proto = tcp_prot.proto_cgroup(memcg);
@@ -65,10 +61,6 @@ static int tcp_update_limit(struct mem_cgroup *memcg, 
unsigned long nr_pages)
if (ret)
return ret;
 
-   for (i = 0; i < 3; i++)
-   cg_proto->sysctl_mem[i] = min_t(long, nr_pages,
-   sysctl_tcp_mem[i]);
-
if (nr_pages == PAGE_COUNTER_MAX)
clear_bit(MEMCG_SOCK_ACTIVE, &cg_proto->flags);
else {
-- 
2.6.2



Re: [PATCH] net: phy: at803x: support interrupt on 8030 and 8035

2015-11-12 Thread Måns Rullgård
Mason  writes:

> On 12/11/2015 20:14, Florian Fainelli wrote:
>> On 12/11/15 11:09, Måns Rullgård wrote:
>>> On 12 November 2015 19:06:23 GMT+00:00, Mason wrote:
 On 12/11/2015 18:40, Mans Rullgard wrote:
> Commit 77a993942 "phy/at8031: enable at8031 to work on interrupt mode"
> added interrupt support for the 8031 PHY but left out the other two
> chips supported by this driver.
>
> This patch sets the .ack_interrupt and .config_intr functions for the
> 8030 and 8035 drivers as well.
>
> Signed-off-by: Mans Rullgard 
> ---
> I have only tested this with an 8035.  I can't find a datasheet for
> the 8030, but since 8031, 8032, and 8035 all have the same register
> layout, there's a good chance 8030 does as well.
> ---
>  drivers/net/phy/at803x.c | 4 
>  1 file changed, 4 insertions(+)
>
> diff --git a/drivers/net/phy/at803x.c b/drivers/net/phy/at803x.c
> index fabf11d..2d020a3 100644
> --- a/drivers/net/phy/at803x.c
> +++ b/drivers/net/phy/at803x.c
> @@ -308,6 +308,8 @@ static struct phy_driver at803x_driver[] = {
>   .flags  = PHY_HAS_INTERRUPT,
>   .config_aneg= genphy_config_aneg,
>   .read_status= genphy_read_status,
> + .ack_interrupt  = at803x_ack_interrupt,
> + .config_intr= at803x_config_intr,
>   .driver = {
>   .owner = THIS_MODULE,
>   },
> @@ -327,6 +329,8 @@ static struct phy_driver at803x_driver[] = {
>   .flags  = PHY_HAS_INTERRUPT,
>   .config_aneg= genphy_config_aneg,
>   .read_status= genphy_read_status,
> + .ack_interrupt  = at803x_ack_interrupt,
> + .config_intr= at803x_config_intr,
>   .driver = {
>   .owner = THIS_MODULE,
>   },

 Shouldn't we take the opportunity to clean up the duplicated register
 definitions? (I'll send an informal patch to spur discussion.)

 Regards.
>>>
>>> That can be done independently. Feel free to send a patch.
>> 
>> Agreed, that deserve a separate patch.
>
> Isn't there a problem when at803x_set_wol() sets the AT803X_WOL_ENABLE
> bit, but a DISABLE/ENABLE cycle through at803x_config_intr() will
> discard that bit?

Possibly, but fixing that should be yet another patch.

-- 
Måns Rullgård
m...@mansr.com


[PATCH 12/14] mm: memcontrol: move socket code for unified hierarchy accounting

2015-11-12 Thread Johannes Weiner
The unified hierarchy memory controller will account socket
memory. Move the infrastructure functions accordingly.

Signed-off-by: Johannes Weiner 
Acked-by: Michal Hocko 
---
 mm/memcontrol.c | 148 
 1 file changed, 74 insertions(+), 74 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e7f1a79..408fb04 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -294,80 +294,6 @@ static inline struct mem_cgroup 
*mem_cgroup_from_id(unsigned short id)
return mem_cgroup_from_css(css);
 }
 
-/* Writing them here to avoid exposing memcg's inner layout */
-#if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
-
-struct static_key memcg_sockets_enabled_key;
-EXPORT_SYMBOL(memcg_sockets_enabled_key);
-
-void sock_update_memcg(struct sock *sk)
-{
-   struct mem_cgroup *memcg;
-
-   /* Socket cloning can throw us here with sk_cgrp already
-* filled. It won't however, necessarily happen from
-* process context. So the test for root memcg given
-* the current task's memcg won't help us in this case.
-*
-* Respecting the original socket's memcg is a better
-* decision in this case.
-*/
-   if (sk->sk_memcg) {
-   BUG_ON(mem_cgroup_is_root(sk->sk_memcg));
-   css_get(&sk->sk_memcg->css);
-   return;
-   }
-
-   rcu_read_lock();
-   memcg = mem_cgroup_from_task(current);
-   if (memcg != root_mem_cgroup &&
-   test_bit(MEMCG_SOCK_ACTIVE, &memcg->tcp_mem.flags) &&
-   css_tryget_online(&memcg->css))
-   sk->sk_memcg = memcg;
-   rcu_read_unlock();
-}
-EXPORT_SYMBOL(sock_update_memcg);
-
-void sock_release_memcg(struct sock *sk)
-{
-   WARN_ON(!sk->sk_memcg);
-   css_put(&sk->sk_memcg->css);
-}
-
-/**
- * mem_cgroup_charge_skmem - charge socket memory
- * @memcg: memcg to charge
- * @nr_pages: number of pages to charge
- *
- * Charges @nr_pages to @memcg. Returns %true if the charge fit within
- * @memcg's configured limit, %false if the charge had to be forced.
- */
-bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
-{
-   struct page_counter *counter;
-
-   if (page_counter_try_charge(&memcg->tcp_mem.memory_allocated,
-   nr_pages, &counter)) {
-   memcg->tcp_mem.memory_pressure = 0;
-   return true;
-   }
-   page_counter_charge(&memcg->tcp_mem.memory_allocated, nr_pages);
-   memcg->tcp_mem.memory_pressure = 1;
-   return false;
-}
-
-/**
- * mem_cgroup_uncharge_skmem - uncharge socket memory
- * @memcg - memcg to uncharge
- * @nr_pages - number of pages to uncharge
- */
-void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
-{
-   page_counter_uncharge(&memcg->tcp_mem.memory_allocated, nr_pages);
-}
-
-#endif
-
 #ifdef CONFIG_MEMCG_KMEM
 /*
  * This will be the memcg's index in each cache's ->memcg_params.memcg_caches.
@@ -5538,6 +5464,80 @@ void mem_cgroup_replace_page(struct page *oldpage, 
struct page *newpage)
commit_charge(newpage, memcg, true);
 }
 
+/* Writing them here to avoid exposing memcg's inner layout */
+#if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
+
+struct static_key memcg_sockets_enabled_key;
+EXPORT_SYMBOL(memcg_sockets_enabled_key);
+
+void sock_update_memcg(struct sock *sk)
+{
+   struct mem_cgroup *memcg;
+
+   /* Socket cloning can throw us here with sk_cgrp already
+* filled. It won't however, necessarily happen from
+* process context. So the test for root memcg given
+* the current task's memcg won't help us in this case.
+*
+* Respecting the original socket's memcg is a better
+* decision in this case.
+*/
+   if (sk->sk_memcg) {
+   BUG_ON(mem_cgroup_is_root(sk->sk_memcg));
+   css_get(&sk->sk_memcg->css);
+   return;
+   }
+
+   rcu_read_lock();
+   memcg = mem_cgroup_from_task(current);
+   if (memcg != root_mem_cgroup &&
+   test_bit(MEMCG_SOCK_ACTIVE, &memcg->tcp_mem.flags) &&
+   css_tryget_online(&memcg->css))
+   sk->sk_memcg = memcg;
+   rcu_read_unlock();
+}
+EXPORT_SYMBOL(sock_update_memcg);
+
+void sock_release_memcg(struct sock *sk)
+{
+   WARN_ON(!sk->sk_memcg);
+   css_put(&sk->sk_memcg->css);
+}
+
+/**
+ * mem_cgroup_charge_skmem - charge socket memory
+ * @memcg: memcg to charge
+ * @nr_pages: number of pages to charge
+ *
+ * Charges @nr_pages to @memcg. Returns %true if the charge fit within
+ * @memcg's configured limit, %false if the charge had to be forced.
+ */
+bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
+{
+   struct page_counter *counter;
+
+   if (page_counter_try_charge(&memcg->tcp_mem.memory_allocated,
+   nr_pages, &counter)) {
+   memcg->tcp_mem.memory_pressure = 0;
+   

[PATCH 11/14] mm: memcontrol: do not account memory+swap on unified hierarchy

2015-11-12 Thread Johannes Weiner
The unified hierarchy memory controller doesn't expose the memory+swap
counter to userspace, but its accounting is hardcoded in all charge
paths right now, including the per-cpu charge cache ("the stock").

To avoid adding yet more pointless memory+swap accounting with the
socket memory support in unified hierarchy, disable the counter
altogether when in unified hierarchy mode.

Signed-off-by: Johannes Weiner 
---
 mm/memcontrol.c | 44 +---
 1 file changed, 25 insertions(+), 19 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 658bef2..e7f1a79 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -87,6 +87,12 @@ int do_swap_account __read_mostly;
 #define do_swap_account0
 #endif
 
+/* Whether legacy memory+swap accounting is active */
+static bool do_memsw_account(void)
+{
+   return !cgroup_subsys_on_dfl(memory_cgrp_subsys) && do_swap_account;
+}
+
 static const char * const mem_cgroup_stat_names[] = {
"cache",
"rss",
@@ -1177,7 +1183,7 @@ static unsigned long mem_cgroup_margin(struct mem_cgroup 
*memcg)
if (count < limit)
margin = limit - count;
 
-   if (do_swap_account) {
+   if (do_memsw_account()) {
count = page_counter_read(&memcg->memsw);
limit = READ_ONCE(memcg->memsw.limit);
if (count <= limit)
@@ -1280,7 +1286,7 @@ void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, 
struct task_struct *p)
pr_cont(":");
 
for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) {
-   if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account)
+   if (i == MEM_CGROUP_STAT_SWAP && !do_memsw_account())
continue;
pr_cont(" %s:%luKB", mem_cgroup_stat_names[i],
K(mem_cgroup_read_stat(iter, i)));
@@ -1903,7 +1909,7 @@ static void drain_stock(struct memcg_stock_pcp *stock)
 
if (stock->nr_pages) {
page_counter_uncharge(&old->memory, stock->nr_pages);
-   if (do_swap_account)
+   if (do_memsw_account())
page_counter_uncharge(&old->memsw, stock->nr_pages);
css_put_many(&old->css, stock->nr_pages);
stock->nr_pages = 0;
@@ -2033,11 +2039,11 @@ retry:
if (consume_stock(memcg, nr_pages))
return 0;
 
-   if (!do_swap_account ||
+   if (!do_memsw_account() ||
page_counter_try_charge(&memcg->memsw, batch, &counter)) {
if (page_counter_try_charge(&memcg->memory, batch, &counter))
goto done_restock;
-   if (do_swap_account)
+   if (do_memsw_account())
page_counter_uncharge(&memcg->memsw, batch);
mem_over_limit = mem_cgroup_from_counter(counter, memory);
} else {
@@ -2124,7 +2130,7 @@ force:
 * temporarily by force charging it.
 */
page_counter_charge(&memcg->memory, nr_pages);
-   if (do_swap_account)
+   if (do_memsw_account())
page_counter_charge(&memcg->memsw, nr_pages);
css_get_many(&memcg->css, nr_pages);
 
@@ -2161,7 +2167,7 @@ static void cancel_charge(struct mem_cgroup *memcg, 
unsigned int nr_pages)
return;
 
page_counter_uncharge(&memcg->memory, nr_pages);
-   if (do_swap_account)
+   if (do_memsw_account())
page_counter_uncharge(&memcg->memsw, nr_pages);
 
css_put_many(&memcg->css, nr_pages);
@@ -2441,7 +2447,7 @@ void __memcg_kmem_uncharge(struct page *page, int order)
 
page_counter_uncharge(&memcg->kmem, nr_pages);
page_counter_uncharge(&memcg->memory, nr_pages);
-   if (do_swap_account)
+   if (do_memsw_account())
page_counter_uncharge(&memcg->memsw, nr_pages);
 
page->mem_cgroup = NULL;
@@ -3154,7 +3160,7 @@ static int memcg_stat_show(struct seq_file *m, void *v)
BUILD_BUG_ON(ARRAY_SIZE(mem_cgroup_lru_names) != NR_LRU_LISTS);
 
for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) {
-   if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account)
+   if (i == MEM_CGROUP_STAT_SWAP && !do_memsw_account())
continue;
seq_printf(m, "%s %lu\n", mem_cgroup_stat_names[i],
   mem_cgroup_read_stat(memcg, i) * PAGE_SIZE);
@@ -3176,14 +3182,14 @@ static int memcg_stat_show(struct seq_file *m, void *v)
}
seq_printf(m, "hierarchical_memory_limit %llu\n",
   (u64)memory * PAGE_SIZE);
-   if (do_swap_account)
+   if (do_memsw_account())
seq_printf(m, "hierarchical_memsw_limit %llu\n",
   (u64)memsw * PAGE_SIZE);
 
for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) {
unsigned long long val = 0;
 
-   if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account)
+   if (i == MEM_CGROUP_STAT_SWAP && 

[PATCH 10/14] mm: memcontrol: generalize the socket accounting jump label

2015-11-12 Thread Johannes Weiner
The unified hierarchy memory controller is going to use this jump
label as well to control the networking callbacks. Move it to the
memory controller code and give it a more generic name.

Signed-off-by: Johannes Weiner 
---
 include/linux/memcontrol.h | 4 
 include/net/sock.h | 7 ---
 mm/memcontrol.c| 3 +++
 net/core/sock.c| 5 -
 net/ipv4/tcp_memcontrol.c  | 4 ++--
 5 files changed, 9 insertions(+), 14 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 1c71f27..4cf5afa 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -693,6 +693,8 @@ static inline void mem_cgroup_wb_stats(struct bdi_writeback 
*wb,
 
 #if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
 struct sock;
+extern struct static_key memcg_sockets_enabled_key;
+#define mem_cgroup_sockets_enabled static_key_false(&memcg_sockets_enabled_key)
 void sock_update_memcg(struct sock *sk);
 void sock_release_memcg(struct sock *sk);
 bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages);
@@ -701,6 +703,8 @@ static inline bool mem_cgroup_under_socket_pressure(struct 
mem_cgroup *memcg)
 {
return memcg->tcp_mem.memory_pressure;
 }
+#else
+#define mem_cgroup_sockets_enabled 0
 #endif /* CONFIG_INET && CONFIG_MEMCG_KMEM */
 
 #ifdef CONFIG_MEMCG_KMEM
diff --git a/include/net/sock.h b/include/net/sock.h
index b439dcc..bf1b901 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1065,13 +1065,6 @@ static inline void sk_refcnt_debug_release(const struct 
sock *sk)
 #define sk_refcnt_debug_release(sk) do { } while (0)
 #endif /* SOCK_REFCNT_DEBUG */
 
-#if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_NET)
-extern struct static_key memcg_socket_limit_enabled;
-#define mem_cgroup_sockets_enabled static_key_false(&memcg_socket_limit_enabled)
-#else
-#define mem_cgroup_sockets_enabled 0
-#endif
-
 static inline bool sk_stream_memory_free(const struct sock *sk)
 {
if (sk->sk_wmem_queued >= sk->sk_sndbuf)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 89b1d9e..658bef2 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -291,6 +291,9 @@ static inline struct mem_cgroup 
*mem_cgroup_from_id(unsigned short id)
 /* Writing them here to avoid exposing memcg's inner layout */
 #if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
 
+struct static_key memcg_sockets_enabled_key;
+EXPORT_SYMBOL(memcg_sockets_enabled_key);
+
 void sock_update_memcg(struct sock *sk)
 {
struct mem_cgroup *memcg;
diff --git a/net/core/sock.c b/net/core/sock.c
index 6486b0d..c5435b5 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -201,11 +201,6 @@ EXPORT_SYMBOL(sk_net_capable);
 static struct lock_class_key af_family_keys[AF_MAX];
 static struct lock_class_key af_family_slock_keys[AF_MAX];
 
-#if defined(CONFIG_MEMCG_KMEM)
-struct static_key memcg_socket_limit_enabled;
-EXPORT_SYMBOL(memcg_socket_limit_enabled);
-#endif
-
 /*
  * Make lock validator output more readable. (we pre-construct these
  * strings build-time, so that runtime initialization of socket
diff --git a/net/ipv4/tcp_memcontrol.c b/net/ipv4/tcp_memcontrol.c
index 47addc3..17df9dd 100644
--- a/net/ipv4/tcp_memcontrol.c
+++ b/net/ipv4/tcp_memcontrol.c
@@ -34,7 +34,7 @@ void tcp_destroy_cgroup(struct mem_cgroup *memcg)
return;
 
if (test_bit(MEMCG_SOCK_ACTIVATED, >tcp_mem.flags))
-   static_key_slow_dec(&memcg_socket_limit_enabled);
+   static_key_slow_dec(&memcg_sockets_enabled_key);
 }
 
 static int tcp_update_limit(struct mem_cgroup *memcg, unsigned long nr_pages)
@@ -73,7 +73,7 @@ static int tcp_update_limit(struct mem_cgroup *memcg, 
unsigned long nr_pages)
 */
if (!test_and_set_bit(MEMCG_SOCK_ACTIVATED,
  >tcp_mem.flags))
-   static_key_slow_inc(&memcg_socket_limit_enabled);
+   static_key_slow_inc(&memcg_sockets_enabled_key);
set_bit(MEMCG_SOCK_ACTIVE, &memcg->tcp_mem.flags);
}
 
-- 
2.6.2



[PATCH 13/14] mm: memcontrol: account socket memory in unified hierarchy memory controller

2015-11-12 Thread Johannes Weiner
Socket memory can be a significant share of overall memory consumed by
common workloads. In order to provide reasonable resource isolation in
the unified hierarchy, this type of memory needs to be included in the
tracking/accounting of a cgroup under active memory resource control.

Overhead is only incurred when a non-root control group is created AND
the memory controller is instructed to track and account the memory
footprint of that group. cgroup.memory=nosocket can be specified on
the boot commandline to override any runtime configuration and
forcibly exclude socket memory from active memory resource control.
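
As a sketch of how such a commandline switch is typically wired up (the
actual parsing hunk sits further down in the full patch and is not
visible in this excerpt, so treat the function body here as an
assumption around the cgroup_memory_nosocket flag shown below):

  static int __init cgroup_memory(char *s)
  {
          char *token;

          while ((token = strsep(&s, ",")) != NULL) {
                  if (!*token)
                          continue;
                  /* "cgroup.memory=nosocket" disables socket accounting */
                  if (!strcmp(token, "nosocket"))
                          cgroup_memory_nosocket = true;
          }
          return 0;
  }
  __setup("cgroup.memory=", cgroup_memory);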

Signed-off-by: Johannes Weiner 
---
 include/linux/memcontrol.h |  12 -
 mm/memcontrol.c| 131 +
 2 files changed, 118 insertions(+), 25 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 4cf5afa..809d6de 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -256,6 +256,10 @@ struct mem_cgroup {
struct wb_domain cgwb_domain;
 #endif
 
+#ifdef CONFIG_INET
+   struct work_struct  socket_work;
+#endif
+
/* List of events which userspace want to receive */
struct list_head event_list;
spinlock_t event_list_lock;
@@ -691,7 +695,7 @@ static inline void mem_cgroup_wb_stats(struct bdi_writeback 
*wb,
 
 #endif /* CONFIG_CGROUP_WRITEBACK */
 
-#if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
+#ifdef CONFIG_INET
 struct sock;
 extern struct static_key memcg_sockets_enabled_key;
+#define mem_cgroup_sockets_enabled static_key_false(&memcg_sockets_enabled_key)
@@ -701,11 +705,15 @@ bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, 
unsigned int nr_pages);
 void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int 
nr_pages);
 static inline bool mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg)
 {
+#ifdef CONFIG_MEMCG_KMEM
return memcg->tcp_mem.memory_pressure;
+#else
+   return false;
+#endif
 }
 #else
 #define mem_cgroup_sockets_enabled 0
-#endif /* CONFIG_INET && CONFIG_MEMCG_KMEM */
+#endif /* CONFIG_INET */
 
 #ifdef CONFIG_MEMCG_KMEM
 extern struct static_key memcg_kmem_enabled_key;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 408fb04..cad9525 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -80,6 +80,9 @@ struct mem_cgroup *root_mem_cgroup __read_mostly;
 
 #define MEM_CGROUP_RECLAIM_RETRIES 5
 
+/* Socket memory accounting disabled? */
+static bool cgroup_memory_nosocket;
+
 /* Whether the swap controller is active */
 #ifdef CONFIG_MEMCG_SWAP
 int do_swap_account __read_mostly;
@@ -1923,6 +1926,18 @@ static int memcg_cpu_hotplug_callback(struct 
notifier_block *nb,
return NOTIFY_OK;
 }
 
+static void reclaim_high(struct mem_cgroup *memcg,
+unsigned int nr_pages,
+gfp_t gfp_mask)
+{
+   do {
+   if (page_counter_read(&memcg->memory) <= memcg->high)
+   continue;
+   mem_cgroup_events(memcg, MEMCG_HIGH, 1);
+   try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, true);
+   } while ((memcg = parent_mem_cgroup(memcg)));
+}
+
 /*
  * Scheduled by try_charge() to be executed from the userland return path
  * and reclaims memory over the high limit.
@@ -1930,20 +1945,13 @@ static int memcg_cpu_hotplug_callback(struct 
notifier_block *nb,
 void mem_cgroup_handle_over_high(void)
 {
unsigned int nr_pages = current->memcg_nr_pages_over_high;
-   struct mem_cgroup *memcg, *pos;
+   struct mem_cgroup *memcg;
 
if (likely(!nr_pages))
return;
 
-   pos = memcg = get_mem_cgroup_from_mm(current->mm);
-
-   do {
-   if (page_counter_read(&pos->memory) <= pos->high)
-   continue;
-   mem_cgroup_events(pos, MEMCG_HIGH, 1);
-   try_to_free_mem_cgroup_pages(pos, nr_pages, GFP_KERNEL, true);
-   } while ((pos = parent_mem_cgroup(pos)));
-
+   memcg = get_mem_cgroup_from_mm(current->mm);
+   reclaim_high(memcg, nr_pages, GFP_KERNEL);
css_put(&memcg->css);
current->memcg_nr_pages_over_high = 0;
 }
@@ -4141,6 +4149,8 @@ struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup 
*memcg)
 }
 EXPORT_SYMBOL(parent_mem_cgroup);
 
+static void socket_work_func(struct work_struct *work);
+
 static struct cgroup_subsys_state * __ref
 mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 {
@@ -4180,6 +4190,9 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state 
*parent_css)
 #ifdef CONFIG_CGROUP_WRITEBACK
INIT_LIST_HEAD(&memcg->cgwb_list);
 #endif
+#ifdef CONFIG_INET
+   INIT_WORK(&memcg->socket_work, socket_work_func);
+#endif
return &memcg->css;
 
 free_out:
@@ -4237,6 +4250,11 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css)
if (ret)
return ret;
 
+#ifdef CONFIG_INET
+   if (cgroup_subsys_on_dfl(memory_cgrp_subsys) && 

Re: [PATCH] bpf: samples: exclude asm/sysreg.h for arm64

2015-11-12 Thread Alexei Starovoitov
On Thu, Nov 12, 2015 at 02:07:46PM -0800, Yang Shi wrote:
> commit 338d4f49d6f7114a017d294ccf7374df4f998edc
> ("arm64: kernel: Add support for Privileged Access Never") includes sysreg.h
> into futex.h and uaccess.h. But, the inline assembly used by asm/sysreg.h is
> incompatible with llvm so it will cause BPF samples build failure for ARM64.
> Since sysreg.h is useless for BPF samples, just exclude it from Makefile via
> defining __ASM_SYSREG_H.
> 
> Signed-off-by: Yang Shi 

not the prettiest fix, but good enough for sample code.
Acked-by: Alexei Starovoitov 



[PATCH iproute2 -next] {f,m}_bpf: allow for sharing maps

2015-11-12 Thread Daniel Borkmann
This larger work addresses one of the bigger remaining issues on
tc's eBPF frontend, that is, to allow for persistent file descriptors.
Whenever tc parses the ELF object, extracts and loads maps into the
kernel, these file descriptors will be out of reach after the tc
instance exits.

Meaning, for simple (unnested) programs which contain one or
multiple maps, the kernel holds a reference, and they will live
on inside the kernel until the program holding them is unloaded,
but they will be out of reach for user space, even worse with
(also multiple nested) tail calls.

For this issue, we introduced the concept of an agent that can
receive the set of file descriptors from the tc instance creating
them, in order to be able to further inspect/update map data for
a specific use case. However, while that is more tied towards
specific applications, it still doesn't easily allow for sharing
maps across multiple tc instances and would require a daemon to
be running in the background. F.e. when a map should be shared by
two eBPF programs, one attached to ingress, one to egress, this
currently doesn't work with the tc frontend.

This work solves exactly that, i.e. if requested, maps can now be
_arbitrarily_ shared between object files (PIN_GLOBAL_NS) or within
a single object (but various program sections, PIN_OBJECT_NS) without
"loosing" the file descriptor set. To make that happen, we use eBPF
object pinning introduced in kernel commit b2197755b263 ("bpf: add
support for persistent maps/progs") for exactly this purpose.

The shipped examples/bpf/bpf_shared.c code from this patch can be
easily applied, for instance, as:

 - classifier-classifier shared:

  tc filter add dev foo parent 1: bpf obj shared.o sec egress
  tc filter add dev foo parent ffff: bpf obj shared.o sec ingress

 - classifier-action shared (here: late binding to a dummy classifier):

  tc actions add action bpf obj shared.o sec egress pass index 42
  tc filter add dev foo parent ffff: bpf obj shared.o sec ingress
  tc filter add dev foo parent 1: bpf bytecode '1,6 0 0 4294967295,' \
 action bpf index 42

The toy example increments a shared counter on egress and dumps its
value on ingress (had no sharing been chosen (PIN_NONE), the dumped
value would of course be 0, due to two separate map instances being created):

  [...]
  -0 [002] ..s. 38264.788234: : map val: 4
  -0 [002] ..s. 38264.788919: : map val: 4
  -0 [002] ..s. 38264.789599: : map val: 5
  [...]

... thus if both sections reference the pinned map(s) in question,
tc will take care of fetching the appropriate file descriptor.
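
On the program side, a shareable map is just a regular map definition
with the new pinning member set; a minimal sketch (field names follow
iproute2's include/bpf_elf.h as extended by this patch, so double-check
against the shipped examples/bpf/bpf_shared.c):

  #include <stdint.h>
  #include <linux/bpf.h>
  #include "../../include/bpf_elf.h"

  /* single-slot array map, pinned into the global namespace */
  struct bpf_elf_map __attribute__((section("maps"), used)) map_sh = {
          .type        = BPF_MAP_TYPE_ARRAY,
          .size_key    = sizeof(uint32_t),
          .size_value  = sizeof(uint32_t),
          .max_elem    = 1,
          .pinning     = PIN_GLOBAL_NS, /* or PIN_OBJECT_NS / PIN_NONE */
  };

Any other section or object file declaring a map with the same name and
PIN_GLOBAL_NS then gets the already pinned instance back instead of a
fresh map.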

The patch has been tested extensively on both, classifier and
action sides.

Signed-off-by: Daniel Borkmann 
---
 Hi Stephen,

 this requires a header rebase to get BPF_OBJ_PIN/BPF_OBJ_GET related
 things included. I've not included it here as you have your scripts
 for that anyway. The patch is targeted at the -next branch, so we have
 a good amount of linger time. I'll follow-up later with a refresh of
 the man-page to document all recently introduced features for {f,m}_bpf
 from this and previous patches.

 Thanks!

 examples/bpf/bpf_funcs.h  |7 +
 examples/bpf/bpf_shared.c |   54 ++
 examples/bpf/bpf_shared.h |4 -
 include/bpf_elf.h |6 +
 include/utils.h   |3 +
 tc/e_bpf.c|   18 +-
 tc/f_bpf.c|  131 +
 tc/m_bpf.c|  158 ++
 tc/tc_bpf.c   | 1259 +
 tc/tc_bpf.h   |   73 ++-
 10 files changed, 1105 insertions(+), 608 deletions(-)
 create mode 100644 examples/bpf/bpf_shared.c

diff --git a/examples/bpf/bpf_funcs.h b/examples/bpf/bpf_funcs.h
index 1545fa9..1369401 100644
--- a/examples/bpf/bpf_funcs.h
+++ b/examples/bpf/bpf_funcs.h
@@ -1,6 +1,10 @@
 #ifndef __BPF_FUNCS__
 #define __BPF_FUNCS__
 
+#include 
+
+#include "../../include/bpf_elf.h"
+
 /* Misc macros. */
 #ifndef __maybe_unused
 # define __maybe_unused__attribute__ ((__unused__))
@@ -43,6 +47,9 @@ static unsigned int (*get_smp_processor_id)(void) 
__maybe_unused =
 static unsigned int (*get_prandom_u32)(void) __maybe_unused =
(void *) BPF_FUNC_get_prandom_u32;
 
+static int (*bpf_printk)(const char *fmt, int fmt_size, ...) __maybe_unused =
+   (void *) BPF_FUNC_trace_printk;
+
 /* LLVM built-in functions that an eBPF C program may use to emit
  * BPF_LD_ABS and BPF_LD_IND instructions.
  */
diff --git a/examples/bpf/bpf_shared.c b/examples/bpf/bpf_shared.c
new file mode 100644
index 000..a8dc39c
--- /dev/null
+++ b/examples/bpf/bpf_shared.c
@@ -0,0 +1,54 @@
+#include 
+
+#include "bpf_funcs.h"
+
+/* Minimal, stand-alone toy map pinning example:
+ *
+ * clang -target bpf -O2 [...] -o bpf_shared.o -c bpf_shared.c
+ * tc filter add dev foo parent 1: bpf obj bpf_shared.o sec egress
+ * tc filter add dev foo parent ffff: bpf obj bpf_shared.o sec ingress
+ *
+ * Both 

[PATCH 00/14] mm: memcontrol: account socket memory in unified hierarchy

2015-11-12 Thread Johannes Weiner
Hi,

this is version 3 of the patches to add socket memory accounting to
the unified hierarchy memory controller. Changes since v2 include:

- Fixed an underflow bug in the mem+swap counter that came through the
  design of the per-cpu charge cache. To fix that, the unused mem+swap
  counter is now fully patched out on unified hierarchy. Double whammy.

- Restored the counting jump label such that the networking callbacks
  get patched out again when the last memory-controlled cgroup goes
  away. The code was already there, so we might as well keep it.

- Broke down the massive tcp_memcontrol rewrite patch into smaller
  logical pieces to (hopefully) make it easier to review and verify.

---

Socket buffer memory can make up a significant share of a workload's
memory footprint that can be directly linked to userspace activity,
and so it needs to be part of the memory controller to provide proper
resource isolation/containment.

Historically, socket buffers were accounted in a separate counter,
without any pressure equalization between anonymous memory, page
cache, and the socket buffers. When the socket buffer pool was
exhausted, buffer allocations would fail hard and cause network
performance to tank, regardless of whether there was still memory
available to the group or not. Likewise, struggling anonymous or cache
workingsets could not dip into an idle socket memory pool. Because of
this, the feature was not usable for many real life applications.

To not repeat this mistake, the new memory controller will account all
types of memory pages it is tracking on behalf of a cgroup in a single
pool. Upon pressure, the VM reclaims and shrinks and puts pressure on
whatever memory consumer in that pool is within its reach.

For socket memory, pressure feedback is provided through vmpressure
events. When the VM has trouble freeing memory, the network code is
instructed to stop growing the cgroup's transmit windows.

This series begins with a rework of the existing tcp memory controller
that simplifies and cleans up the code while allowing us to have only
one set of networking hooks for both memory controller versions. The
original behavior of the existing tcp controller should be preserved.

It then adds socket accounting to the v2 memory controller, including
the use of the per-cpu charge cache and async memory.high enforcement
from socket memory charges.

Lastly, vmpressure is hooked up to the socket code so that it stops
growing transmit windows when the VM has trouble reclaiming memory.
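
To give an idea of what this means for the socket code, the charge and
uncharge sites added by the series end up looking roughly like the
sketch below (illustrative only; the exact hooks are in the
net/core/sock.c and TCP changes listed underneath):

  /* on allocation, e.g. from __sk_mem_schedule(): */
  if (mem_cgroup_sockets_enabled && sk->sk_memcg &&
      !mem_cgroup_charge_skmem(sk->sk_memcg, nr_pages))
          goto suppress_allocation;

  /* on release, e.g. from __sk_mem_reclaim(): */
  if (mem_cgroup_sockets_enabled && sk->sk_memcg)
          mem_cgroup_uncharge_skmem(sk->sk_memcg, nr_pages);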

 include/linux/memcontrol.h   |  71 ++
 include/net/sock.h   | 149 ++--
 include/net/tcp.h|   5 +-
 include/net/tcp_memcontrol.h |   1 -
 mm/backing-dev.c |   2 +-
 mm/memcontrol.c  | 303 +++--
 mm/vmpressure.c  |  25 +++-
 mm/vmscan.c  |  31 +++--
 net/core/sock.c  |  78 +++
 net/ipv4/tcp.c   |   3 +-
 net/ipv4/tcp_ipv4.c  |   9 +-
 net/ipv4/tcp_memcontrol.c|  85 
 net/ipv4/tcp_output.c|   7 +-
 net/ipv6/tcp_ipv6.c  |   3 -
 14 files changed, 353 insertions(+), 419 deletions(-)



[PATCH 02/14] mm: vmscan: simplify memcg vs. global shrinker invocation

2015-11-12 Thread Johannes Weiner
Letting shrink_slab() handle the root_mem_cgroup, and implicitly the
!CONFIG_MEMCG case, allows shrink_zone() to invoke the shrinkers
unconditionally from within the memcg iteration loop.

Signed-off-by: Johannes Weiner 
Acked-by: Michal Hocko 
---
 include/linux/memcontrol.h |  2 ++
 mm/vmscan.c| 31 ---
 2 files changed, 18 insertions(+), 15 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 9a7a24a..251bb51 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -502,6 +502,8 @@ void mem_cgroup_split_huge_fixup(struct page *head);
 #else /* CONFIG_MEMCG */
 struct mem_cgroup;
 
+#define root_mem_cgroup NULL
+
 static inline void mem_cgroup_events(struct mem_cgroup *memcg,
 enum mem_cgroup_events_index idx,
 unsigned int nr)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a4507ec..e4f5b3c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -411,6 +411,10 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
struct shrinker *shrinker;
unsigned long freed = 0;
 
+   /* Global shrinker mode */
+   if (memcg == root_mem_cgroup)
+   memcg = NULL;
+
if (memcg && !memcg_kmem_is_active(memcg))
return 0;
 
@@ -2410,11 +2414,22 @@ static bool shrink_zone(struct zone *zone, struct 
scan_control *sc,
shrink_lruvec(lruvec, swappiness, sc, &lru_pages);
zone_lru_pages += lru_pages;
 
-   if (memcg && is_classzone)
+   /*
+* Shrink the slab caches in the same proportion that
+* the eligible LRU pages were scanned.
+*/
+   if (is_classzone) {
shrink_slab(sc->gfp_mask, zone_to_nid(zone),
memcg, sc->nr_scanned - scanned,
lru_pages);
 
+   if (reclaim_state) {
+   sc->nr_reclaimed +=
+   reclaim_state->reclaimed_slab;
+   reclaim_state->reclaimed_slab = 0;
+   }
+   }
+
/*
 * Direct reclaim and kswapd have to scan all memory
 * cgroups to fulfill the overall scan target for the
@@ -2432,20 +2447,6 @@ static bool shrink_zone(struct zone *zone, struct 
scan_control *sc,
}
} while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));
 
-   /*
-* Shrink the slab caches in the same proportion that
-* the eligible LRU pages were scanned.
-*/
-   if (global_reclaim(sc) && is_classzone)
-   shrink_slab(sc->gfp_mask, zone_to_nid(zone), NULL,
-   sc->nr_scanned - nr_scanned,
-   zone_lru_pages);
-
-   if (reclaim_state) {
-   sc->nr_reclaimed += reclaim_state->reclaimed_slab;
-   reclaim_state->reclaimed_slab = 0;
-   }
-
vmpressure(sc->gfp_mask, sc->target_mem_cgroup,
   sc->nr_scanned - nr_scanned,
   sc->nr_reclaimed - nr_reclaimed);
-- 
2.6.2



[PATCH 14/14] mm: memcontrol: hook up vmpressure to socket pressure

2015-11-12 Thread Johannes Weiner
Let the networking stack know when a memcg is under reclaim pressure
so that it can clamp its transmit windows accordingly.

Whenever the reclaim efficiency of a cgroup's LRU lists drops low
enough for a MEDIUM or HIGH vmpressure event to occur, assert a
pressure state in the socket and tcp memory code that tells it to curb
consumption growth from sockets associated with said control group.

vmpressure events are naturally edge triggered, so for hysteresis
assert socket pressure for a second to allow for subsequent vmpressure
events to occur before letting the socket code return to normal.

This will likely need finetuning for a wider variety of workloads, but
for now stick to the vmpressure presets and keep hysteresis simple.
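
Concretely, the assertion side of the hysteresis is just a timestamp
bump from the vmpressure path, along these lines (sketch, not the
literal hunk; the vmpressure.c change is truncated in this excerpt):

  if (level > VMPRESSURE_LOW) {
          /* keep socket memory growth curbed for about one second */
          memcg->socket_pressure = jiffies + HZ;
  }

mem_cgroup_under_socket_pressure(), shown in the memcontrol.h hunk
below, then simply compares jiffies against that deadline up the
hierarchy.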

Signed-off-by: Johannes Weiner 
---
 include/linux/memcontrol.h | 29 +
 mm/memcontrol.c| 15 +--
 mm/vmpressure.c| 25 -
 3 files changed, 46 insertions(+), 23 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 809d6de..dba43cb 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -258,6 +258,7 @@ struct mem_cgroup {
 
 #ifdef CONFIG_INET
struct work_struct  socket_work;
+   unsigned long   socket_pressure;
 #endif
 
/* List of events which userspace want to receive */
@@ -303,18 +304,34 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *, 
struct zone *);
 
 bool task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *memcg);
 struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
-struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg);
 
 static inline
 struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
return css ? container_of(css, struct mem_cgroup, css) : NULL;
 }
 
+#define mem_cgroup_from_counter(counter, member)   \
+   container_of(counter, struct mem_cgroup, member)
+
 struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *,
   struct mem_cgroup *,
   struct mem_cgroup_reclaim_cookie *);
 void mem_cgroup_iter_break(struct mem_cgroup *, struct mem_cgroup *);
 
+/**
+ * parent_mem_cgroup - find the accounting parent of a memcg
+ * @memcg: memcg whose parent to find
+ *
+ * Returns the parent memcg, or NULL if this is the root or the memory
+ * controller is in legacy no-hierarchy mode.
+ */
+static inline struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg)
+{
+   if (!memcg->memory.parent)
+   return NULL;
+   return mem_cgroup_from_counter(memcg->memory.parent, memory);
+}
+
 static inline bool mem_cgroup_is_descendant(struct mem_cgroup *memcg,
  struct mem_cgroup *root)
 {
@@ -706,10 +723,14 @@ void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, 
unsigned int nr_pages);
 static inline bool mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg)
 {
 #ifdef CONFIG_MEMCG_KMEM
-   return memcg->tcp_mem.memory_pressure;
-#else
-   return false;
+   if (memcg->tcp_mem.memory_pressure)
+   return true;
 #endif
+   do {
+   if (time_before(jiffies, memcg->socket_pressure))
+   return true;
+   } while ((memcg = parent_mem_cgroup(memcg)));
+   return false;
 }
 #else
 #define mem_cgroup_sockets_enabled 0
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index cad9525..4068662 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1091,9 +1091,6 @@ bool task_in_mem_cgroup(struct task_struct *task, struct 
mem_cgroup *memcg)
return ret;
 }
 
-#define mem_cgroup_from_counter(counter, member)   \
-   container_of(counter, struct mem_cgroup, member)
-
 /**
  * mem_cgroup_margin - calculate chargeable space of a memory cgroup
  * @memcg: the memory cgroup
@@ -4138,17 +4135,6 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
kfree(memcg);
 }
 
-/*
- * Returns the parent mem_cgroup in memcgroup hierarchy with hierarchy enabled.
- */
-struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg)
-{
-   if (!memcg->memory.parent)
-   return NULL;
-   return mem_cgroup_from_counter(memcg->memory.parent, memory);
-}
-EXPORT_SYMBOL(parent_mem_cgroup);
-
 static void socket_work_func(struct work_struct *work);
 
 static struct cgroup_subsys_state * __ref
@@ -4192,6 +4178,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state 
*parent_css)
 #endif
 #ifdef CONFIG_INET
INIT_WORK(&memcg->socket_work, socket_work_func);
+   memcg->socket_pressure = jiffies;
 #endif
return &memcg->css;
 
diff --git a/mm/vmpressure.c b/mm/vmpressure.c
index 4c25e62..07e8440 100644
--- a/mm/vmpressure.c
+++ b/mm/vmpressure.c
@@ -137,14 +137,11 @@ struct vmpressure_event {
 };
 
 static bool vmpressure_event(struct vmpressure *vmpr,
-unsigned long scanned, unsigned long reclaimed)
+   

[PATCH 01/14] mm: memcontrol: export root_mem_cgroup

2015-11-12 Thread Johannes Weiner
A later patch will need this symbol in files other than memcontrol.c,
so export it now and replace mem_cgroup_root_css at the same time.

Signed-off-by: Johannes Weiner 
Acked-by: Michal Hocko 
---
 include/linux/memcontrol.h | 3 ++-
 mm/backing-dev.c   | 2 +-
 mm/memcontrol.c| 5 ++---
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index ffc5460..9a7a24a 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -275,7 +275,8 @@ struct mem_cgroup {
struct mem_cgroup_per_node *nodeinfo[0];
/* WARNING: nodeinfo must be the last member here */
 };
-extern struct cgroup_subsys_state *mem_cgroup_root_css;
+
+extern struct mem_cgroup *root_mem_cgroup;
 
 /**
  * mem_cgroup_events - count memory events against a cgroup
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 9160853..fdc6f4d 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -707,7 +707,7 @@ static int cgwb_bdi_init(struct backing_dev_info *bdi)
 
ret = wb_init(&bdi->wb, bdi, 1, GFP_KERNEL);
if (!ret) {
-   bdi->wb.memcg_css = mem_cgroup_root_css;
+   bdi->wb.memcg_css = &root_mem_cgroup->css;
bdi->wb.blkcg_css = blkcg_root_css;
}
return ret;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index fcd7c4e..a5d2586 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -76,9 +76,9 @@
 struct cgroup_subsys memory_cgrp_subsys __read_mostly;
 EXPORT_SYMBOL(memory_cgrp_subsys);
 
+struct mem_cgroup *root_mem_cgroup __read_mostly;
+
 #define MEM_CGROUP_RECLAIM_RETRIES 5
-static struct mem_cgroup *root_mem_cgroup __read_mostly;
-struct cgroup_subsys_state *mem_cgroup_root_css __read_mostly;
 
 /* Whether the swap controller is active */
 #ifdef CONFIG_MEMCG_SWAP
@@ -4211,7 +4211,6 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state 
*parent_css)
/* root ? */
if (parent_css == NULL) {
root_mem_cgroup = memcg;
-   mem_cgroup_root_css = &memcg->css;
page_counter_init(&memcg->memory, NULL);
memcg->high = PAGE_COUNTER_MAX;
memcg->soft_limit = PAGE_COUNTER_MAX;
-- 
2.6.2



[PATCH net-next RFC V3 3/3] vhost_net: basic polling support

2015-11-12 Thread Jason Wang
This patch tries to poll for newly added tx buffers or the socket
receive queue for a while at the end of tx/rx processing. The maximum
time spent on polling is specified through a new kind of vring ioctl.
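
For reference, setting the timeout from userspace would look roughly
like the sketch below. The ioctl name and the reuse of struct
vhost_vring_state are assumptions here (the uapi hunk is cut off in
this excerpt), so check include/uapi/linux/vhost.h from the patch:

  #include <sys/ioctl.h>
  #include <linux/vhost.h>

  static int set_busyloop_timeout(int vhost_fd, unsigned int index,
                                  unsigned int timeout_us)
  {
          struct vhost_vring_state s = {
                  .index = index,
                  .num   = timeout_us,    /* e.g. 50 */
          };

          return ioctl(vhost_fd, VHOST_SET_VRING_BUSYLOOP_TIMEOUT, &s);
  }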

Signed-off-by: Jason Wang 
---
 drivers/vhost/net.c| 77 +++---
 drivers/vhost/vhost.c  | 15 +
 drivers/vhost/vhost.h  |  1 +
 include/uapi/linux/vhost.h | 11 +++
 4 files changed, 99 insertions(+), 5 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 9eda69e..a38fa32 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -287,6 +287,45 @@ static void vhost_zerocopy_callback(struct ubuf_info 
*ubuf, bool success)
rcu_read_unlock_bh();
 }
 
+static inline unsigned long busy_clock(void)
+{
+   return local_clock() >> 10;
+}
+
+static bool vhost_can_busy_poll(struct vhost_dev *dev,
+   unsigned long endtime)
+{
+   return likely(!need_resched()) &&
+  likely(!time_after(busy_clock(), endtime)) &&
+  likely(!signal_pending(current)) &&
+  !vhost_has_work(dev) &&
+  single_task_running();
+}
+
+static int vhost_net_tx_get_vq_desc(struct vhost_net *net,
+   struct vhost_virtqueue *vq,
+   struct iovec iov[], unsigned int iov_size,
+   unsigned int *out_num, unsigned int *in_num)
+{
+   unsigned long uninitialized_var(endtime);
+
+   if (vq->busyloop_timeout) {
+   preempt_disable();
+   endtime = busy_clock() + vq->busyloop_timeout;
+   }
+
+   while (vq->busyloop_timeout &&
+  vhost_can_busy_poll(vq->dev, endtime) &&
+  !vhost_vq_more_avail(vq->dev, vq))
+   cpu_relax();
+
+   if (vq->busyloop_timeout)
+   preempt_enable();
+
+   return vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
+out_num, in_num, NULL, NULL);
+}
+
 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
 static void handle_tx(struct vhost_net *net)
@@ -331,10 +370,9 @@ static void handle_tx(struct vhost_net *net)
  % UIO_MAXIOV == nvq->done_idx))
break;
 
-   head = vhost_get_vq_desc(vq, vq->iov,
-ARRAY_SIZE(vq->iov),
-    &out, &in,
-NULL, NULL);
+   head = vhost_net_tx_get_vq_desc(net, vq, vq->iov,
+   ARRAY_SIZE(vq->iov),
+   &out, &in);
/* On error, stop handling until the next kick. */
if (unlikely(head < 0))
break;
@@ -435,6 +473,35 @@ static int peek_head_len(struct sock *sk)
return len;
 }
 
+static int vhost_net_peek_head_len(struct vhost_net *net, struct sock *sk)
+{
+   struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];
+   struct vhost_virtqueue *vq = &nvq->vq;
+   unsigned long uninitialized_var(endtime);
+
+   if (vq->busyloop_timeout) {
+   mutex_lock(&vq->mutex);
+   vhost_disable_notify(&net->dev, vq);
+   preempt_disable();
+   endtime = busy_clock() + vq->busyloop_timeout;
+   }
+
+   while (vq->busyloop_timeout &&
+  vhost_can_busy_poll(&net->dev, endtime) &&
+  skb_queue_empty(&sk->sk_receive_queue) &&
+  !vhost_vq_more_avail(&net->dev, vq))
+   cpu_relax();
+
+   if (vq->busyloop_timeout) {
+   preempt_enable();
+   if (vhost_enable_notify(&net->dev, vq))
+   vhost_poll_queue(&vq->poll);
+   mutex_unlock(&vq->mutex);
+   }
+
+   return peek_head_len(sk);
+}
+
 /* This is a multi-buffer version of vhost_get_desc, that works if
  * vq has read descriptors only.
  * @vq - the relevant virtqueue
@@ -553,7 +620,7 @@ static void handle_rx(struct vhost_net *net)
vq->log : NULL;
mergeable = vhost_has_feature(vq, VIRTIO_NET_F_MRG_RXBUF);
 
-   while ((sock_len = peek_head_len(sock->sk))) {
+   while ((sock_len = vhost_net_peek_head_len(net, sock->sk))) {
sock_len += sock_hlen;
vhost_len = sock_len + vhost_hlen;
headcount = get_rx_bufs(vq, vq->heads, vhost_len,
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index b86c5aa..8f9a64c 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -285,6 +285,7 @@ static void vhost_vq_reset(struct vhost_dev *dev,
vq->memory = NULL;
vq->is_le = virtio_legacy_is_little_endian();
vhost_vq_reset_user_be(vq);
+   vq->busyloop_timeout = 0;
 }
 
 static int vhost_worker(void *data)
@@ -747,6 +748,7 

[PATCH net-next RFC V3 0/3] basic busy polling support for vhost_net

2015-11-12 Thread Jason Wang
Hi all:

This series tries to add basic busy polling for vhost net. The idea is
simple: at the end of tx/rx processing, busy poll for newly added tx
descriptors and on the rx receive socket for a while. The maximum amount
of time (in us) that may be spent on busy polling is specified via ioctl.

Tests were done with:

- 50 us as busy loop timeout
- Netperf 2.6
- Two machines with back to back connected ixgbe
- Guest with 1 vcpu and 1 queue

Results:
- For stream workloads, ioexits were reduced dramatically for medium
  tx sizes (1024-2048, at most -39%) and for almost all rx sizes (at most
  -79%) as a result of polling. This more or less compensates for the
  possibly wasted cpu cycles, which is probably why we can still see
  some increase in the normalized throughput in some cases.
- Tx throughput was increased (at most 105%) except for the huge
  write (16384). And we can send more packets in that case (+tpkts were
  increased).
- Very minor rx regression in some cases.
- Improvement on TCP_RR (at most 16%).

size/session/+thu%/+normalize%/+tpkts%/+rpkts%/+ioexits%/
   64/ 1/   +9%/  -17%/   +5%/  +10%/   -2%
   64/ 2/   +8%/  -18%/   +6%/  +10%/   -1%
   64/ 4/   +4%/  -21%/   +6%/  +10%/   -1%
   64/ 8/   +9%/  -17%/   +6%/   +9%/   -2%
  256/ 1/  +20%/   -1%/  +15%/  +11%/   -9%
  256/ 2/  +15%/   -6%/  +15%/   +8%/   -8%
  256/ 4/  +17%/   -4%/  +16%/   +8%/   -8%
  256/ 8/  -61%/  -69%/  +16%/  +10%/  -10%
  512/ 1/  +15%/   -3%/  +19%/  +18%/  -11%
  512/ 2/  +19%/0%/  +19%/  +13%/  -10%
  512/ 4/  +18%/   -2%/  +18%/  +15%/  -10%
  512/ 8/  +17%/   -1%/  +18%/  +15%/  -11%
 1024/ 1/  +25%/   +4%/  +27%/  +16%/  -21%
 1024/ 2/  +28%/   +8%/  +25%/  +15%/  -22%
 1024/ 4/  +25%/   +5%/  +25%/  +14%/  -21%
 1024/ 8/  +27%/   +7%/  +25%/  +16%/  -21%
 2048/ 1/  +32%/  +12%/  +31%/  +22%/  -38%
 2048/ 2/  +33%/  +12%/  +30%/  +23%/  -36%
 2048/ 4/  +31%/  +10%/  +31%/  +24%/  -37%
 2048/ 8/ +105%/  +75%/  +33%/  +23%/  -39%
16384/ 1/0%/  -14%/   +2%/0%/  +19%
16384/ 2/0%/  -13%/  +19%/  -13%/  +17%
16384/ 4/0%/  -12%/   +3%/0%/   +2%
16384/ 8/0%/  -11%/   -2%/   +1%/   +1%
size/session/+thu%/+normalize%/+tpkts%/+rpkts%/+ioexits%/
   64/ 1/   -7%/  -23%/   +4%/   +6%/  -74%
   64/ 2/   -2%/  -12%/   +2%/   +2%/  -55%
   64/ 4/   +2%/   -5%/  +10%/   -2%/  -43%
   64/ 8/   -5%/   -5%/  +11%/  -34%/  -59%
  256/ 1/   -6%/  -16%/   +9%/  +11%/  -60%
  256/ 2/   +3%/   -4%/   +6%/   -3%/  -28%
  256/ 4/0%/   -5%/   -9%/   -9%/  -10%
  256/ 8/   -3%/   -6%/  -12%/   -9%/  -40%
  512/ 1/   -4%/  -17%/  -10%/  +21%/  -34%
  512/ 2/0%/   -9%/  -14%/   -3%/  -30%
  512/ 4/0%/   -4%/  -18%/  -12%/   -4%
  512/ 8/   -1%/   -4%/   -1%/   -5%/   +4%
 1024/ 1/0%/  -16%/  +12%/  +11%/  -10%
 1024/ 2/0%/  -11%/0%/   +5%/  -31%
 1024/ 4/0%/   -4%/   -7%/   +1%/  -22%
 1024/ 8/   -5%/   -6%/  -17%/  -29%/  -79%
 2048/ 1/0%/  -16%/   +1%/   +9%/  -10%
 2048/ 2/0%/  -12%/   +7%/   +9%/  -26%
 2048/ 4/0%/   -7%/   -4%/   +3%/  -64%
 2048/ 8/   -1%/   -5%/   -6%/   +4%/  -20%
16384/ 1/0%/  -12%/  +11%/   +7%/  -20%
16384/ 2/0%/   -7%/   +1%/   +5%/  -26%
16384/ 4/0%/   -5%/  +12%/  +22%/  -23%
16384/ 8/0%/   -1%/   -8%/   +5%/   -3%
size/session/+thu%/+normalize%/+tpkts%/+rpkts%/+ioexits%/
1/ 1/   +9%/  -29%/   +9%/   +9%/   +9%
1/25/   +6%/  -18%/   +6%/   +6%/   -1%
1/50/   +6%/  -19%/   +5%/   +5%/   -2%
1/   100/   +5%/  -19%/   +4%/   +4%/   -3%
   64/ 1/  +10%/  -28%/  +10%/  +10%/  +10%
   64/25/   +8%/  -18%/   +7%/   +7%/   -2%
   64/50/   +8%/  -17%/   +8%/   +8%/   -1%
   64/   100/   +8%/  -17%/   +8%/   +8%/   -1%
  256/ 1/  +10%/  -28%/  +10%/  +10%/  +10%
  256/25/  +15%/  -13%/  +15%/  +15%/0%
  256/50/  +16%/  -14%/  +18%/  +18%/   +2%
  256/   100/  +15%/  -13%/  +12%/  +12%/   -2%

Changes from V2:
- poll also at the end of rx handling
- factor out the polling logic and optimize the code a little bit
- add two ioctls to get and set the busy poll timeout
- test on ixgbe (which can give more stable and reproducible numbers)
  instead of mlx4.

Changes from V1:
- Add a comment for vhost_has_work() to explain why it could be
  lockless
- Add param description for busyloop_timeout
- Split out the busy polling logic into a new helper
- Check and exit the loop when there's a pending signal
- Disable preemption during busy looping to make sure local_clock() was
  correctly used.

Jason Wang (3):
  vhost: introduce vhost_has_work()
  vhost: introduce vhost_vq_more_avail()
  vhost_net: basic polling support

 drivers/vhost/net.c| 77 +++---
 drivers/vhost/vhost.c  | 48 +++--
 drivers/vhost/vhost.h  |  3 ++
 

[linux-4.4-mw] BUG: unable to handle kernel paging request ip_vs_out.constprop

2015-11-12 Thread Sander Eikelenboom

Hi All,

Just got a crash with a linux-4.4-mw kernel.
I'm using a routed bridge, and apart from the splat below I got some
other interesting messages that aren't there in 4.3 (and perhaps are of
interest for the crash as well):
[  207.033768] vif vif-1-0 vif1.0: set_features() failed (-1); wanted 
0x00044803, left 0x000400114813
[  207.033780] vif vif-1-0 vif1.0: set_features() failed (-1); wanted 
0x00044803, left 0x000400114813
[  207.245435] xen_bridge: error setting offload STP state on port 
1(vif1.0)

[  207.245442] vif vif-1-0 vif1.0: failed to set HW ageing time
[  207.245443] xen_bridge: error setting offload STP state on port 
1(vif1.0)
[  207.245491] vif vif-1-0 vif1.0: set_features() failed (-1); wanted 
0x00044803, left 0x000400114813


The commit message of the commit that introduced the "set HW ageing
time" error message doesn't tell me much about its purpose. If it's not
related, I can report it as a separate issue.


--
Sander

The crash:
[  354.328687] BUG: unable to handle kernel paging request at 
880049aa8000

[  354.350206] IP: [] ip_vs_out.constprop.25+0x47/0x60
[  354.360882] PGD 2212067 PUD 25b4067 PMD 5ffb6067 PTE 0
[  354.371587] Oops:  [#1] SMP
[  354.382143] Modules linked in:
[  354.392537] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 
4.3.0-mw-2015-linus-doflr+ #1
[  354.403105] Hardware name: MSI MS-7640/890FXA-GD70 (MS-7640)  , BIOS 
V1.8B1 09/13/2010
[  354.413666] task: 82218580 ti: 8220 task.ti: 
8220
[  354.424255] RIP: e030:[]  [] 
ip_vs_out.constprop.25+0x47/0x60

[  354.434742] RSP: e02b:88005f6034b0  EFLAGS: 00010246
[  354.445006] RAX: 0001 RBX: 88005f6034f8 RCX: 
880049aa7ce0
[  354.455262] RDX: 88003c0e5500 RSI: 0003 RDI: 
880004e0e800
[  354.465422] RBP: 88005f6034b8 R08: 0014 R09: 
0003
[  354.475508] R10: 0001 R11: 880040f394cc R12: 
88005f603528
[  354.485567] R13: 88003c0e5500 R14: 822da2e8 R15: 
88003c0e5500
[  354.495595] FS:  7f0243c2b700() GS:88005f60() 
knlGS:

[  354.505474] CS:  e033 DS:  ES:  CR0: 8005003b
[  354.515135] CR2: 880049aa8000 CR3: 59271000 CR4: 
0660

[  354.524794] Stack:
[  354.534319]  81a074fc 88005f6034e8 8199e138 
88003c0e5500
[  354.543981]  88005f603528 88003c0e5500  
88005f603518
[  354.553577]  8199e1af 880005300048 88003c0e5500 
822da2e8

[  354.563160] Call Trace:
[  354.572418]  
[  354.572480]  [] ? ip_vs_local_reply4+0x1c/0x20
[  354.590458]  [] nf_iterate+0x58/0x70
[  354.599372]  [] nf_hook_slow+0x5f/0xb0
[  354.608245]  [] __ip_local_out+0x9e/0xb0
[  354.617036]  [] ? ip_forward_options+0x1a0/0x1a0
[  354.625874]  [] ip_local_out+0x17/0x40
[  354.634383]  [] ip_build_and_send_pkt+0x148/0x1c0
[  354.642715]  [] tcp_v4_send_synack+0x56/0xa0
[  354.650893]  [] ? 
inet_csk_reqsk_queue_hash_add+0x68/0x90

[  354.659083]  [] tcp_conn_request+0x95d/0x970
[  354.667196]  [] ? __local_bh_enable_ip+0x26/0x90
[  354.675246]  [] tcp_v4_conn_request+0x47/0x50
[  354.683254]  [] tcp_rcv_state_process+0x183/0xca0
[  354.691004]  [] tcp_v4_do_rcv+0x5c/0x1f0
[  354.698533]  [] tcp_v4_rcv+0x987/0x9a0
[  354.705968]  [] ? ipv4_confirm+0x78/0xf0
[  354.713370]  [] ip_local_deliver_finish+0x84/0x120
[  354.720739]  [] ip_local_deliver+0x42/0xd0
[  354.728029]  [] ? inet_del_offload+0x40/0x40
[  354.735270]  [] ip_rcv_finish+0x106/0x320
[  354.742413]  [] ip_rcv+0x211/0x370
[  354.749268]  [] ? 
ip_local_deliver_finish+0x120/0x120
[  354.755929]  [] 
__netif_receive_skb_core+0x2cb/0x970

[  354.762535]  [] ? nf_nat_setup_info+0x7a/0x2f0
[  354.769131]  [] __netif_receive_skb+0x11/0x70
[  354.775481]  [] 
netif_receive_skb_internal+0x1e/0x80

[  354.781638]  [] ? nf_hook_slow+0x5f/0xb0
[  354.787771]  [] netif_receive_skb+0x9/0x10
[  354.793916]  [] br_handle_frame_finish+0x178/0x4b0
[  354.800077]  [] ? nf_nat_ipv4_fn+0x167/0x1e0
[  354.806260]  [] ? br_handle_local_finish+0x50/0x50
[  354.812405]  [] 
br_nf_pre_routing_finish+0x183/0x360

[  354.818574]  [] ? br_netif_receive_skb+0x10/0x10
[  354.824775]  [] br_nf_pre_routing+0x2a7/0x380
[  354.830780]  [] ? br_nf_forward_ip+0x3f0/0x3f0
[  354.836567]  [] nf_iterate+0x58/0x70
[  354.842281]  [] nf_hook_slow+0x5f/0xb0
[  354.847886]  [] br_handle_frame+0x1a2/0x290
[  354.853520]  [] ? br_netif_receive_skb+0x10/0x10
[  354.859206]  [] ? 
br_handle_frame_finish+0x4b0/0x4b0
[  354.864824]  [] 
__netif_receive_skb_core+0x12b/0x970
[  354.870350]  [] ? 
__raw_callee_save___pv_queued_spin_unlock+0x11/0x20

[  354.875880]  [] __netif_receive_skb+0x11/0x70
[  354.881293]  [] 
netif_receive_skb_internal+0x1e/0x80

[  354.886653]  [] netif_receive_skb+0x9/0x10
[  354.891918]  [] xenvif_tx_action+0x693/0x820
[  354.897170]  [] xenvif_poll+0x29/0x70
[  

[PATCH net-next RFC V3 2/3] vhost: introduce vhost_vq_more_avail()

2015-11-12 Thread Jason Wang
Signed-off-by: Jason Wang 
---
 drivers/vhost/vhost.c | 26 +-
 drivers/vhost/vhost.h |  1 +
 2 files changed, 18 insertions(+), 9 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 163b365..b86c5aa 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1633,10 +1633,25 @@ void vhost_add_used_and_signal_n(struct vhost_dev *dev,
 }
 EXPORT_SYMBOL_GPL(vhost_add_used_and_signal_n);
 
+bool vhost_vq_more_avail(struct vhost_dev *dev, struct vhost_virtqueue *vq)
+{
+   __virtio16 avail_idx;
+   int r;
+
+   r = __get_user(avail_idx, &vq->avail->idx);
+   if (r) {
+   vq_err(vq, "Failed to check avail idx at %p: %d\n",
+  &vq->avail->idx, r);
+   return false;
+   }
+
+   return vhost16_to_cpu(vq, avail_idx) != vq->avail_idx;
+}
+EXPORT_SYMBOL_GPL(vhost_vq_more_avail);
+
 /* OK, now we need to know about added descriptors. */
 bool vhost_enable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 {
-   __virtio16 avail_idx;
int r;
 
if (!(vq->used_flags & VRING_USED_F_NO_NOTIFY))
@@ -1660,14 +1675,7 @@ bool vhost_enable_notify(struct vhost_dev *dev, struct 
vhost_virtqueue *vq)
/* They could have slipped one in as we were doing that: make
 * sure it's written, then check again. */
smp_mb();
-   r = __get_user(avail_idx, &vq->avail->idx);
-   if (r) {
-   vq_err(vq, "Failed to check avail idx at %p: %d\n",
-  &vq->avail->idx, r);
-   return false;
-   }
-
-   return vhost16_to_cpu(vq, avail_idx) != vq->avail_idx;
+   return vhost_vq_more_avail(dev, vq);
 }
 EXPORT_SYMBOL_GPL(vhost_enable_notify);
 
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index ea0327d..5983a13 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -159,6 +159,7 @@ void vhost_add_used_and_signal_n(struct vhost_dev *, struct 
vhost_virtqueue *,
   struct vring_used_elem *heads, unsigned count);
 void vhost_signal(struct vhost_dev *, struct vhost_virtqueue *);
 void vhost_disable_notify(struct vhost_dev *, struct vhost_virtqueue *);
+bool vhost_vq_more_avail(struct vhost_dev *, struct vhost_virtqueue *);
 bool vhost_enable_notify(struct vhost_dev *, struct vhost_virtqueue *);
 
 int vhost_log_write(struct vhost_virtqueue *vq, struct vhost_log *log,
-- 
2.1.4



[PATCH net-next RFC V3 1/3] vhost: introduce vhost_has_work()

2015-11-12 Thread Jason Wang
This patch introduces a helper which gives a hint about whether or not
there is work queued in the work list.

Signed-off-by: Jason Wang 
---
 drivers/vhost/vhost.c | 7 +++
 drivers/vhost/vhost.h | 1 +
 2 files changed, 8 insertions(+)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index eec2f11..163b365 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -245,6 +245,13 @@ void vhost_work_queue(struct vhost_dev *dev, struct 
vhost_work *work)
 }
 EXPORT_SYMBOL_GPL(vhost_work_queue);
 
+/* A lockless hint for busy polling code to exit the loop */
+bool vhost_has_work(struct vhost_dev *dev)
+{
+   return !list_empty(&dev->work_list);
+}
+EXPORT_SYMBOL_GPL(vhost_has_work);
+
 void vhost_poll_queue(struct vhost_poll *poll)
 {
vhost_work_queue(poll->dev, &poll->work);
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 4772862..ea0327d 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -37,6 +37,7 @@ struct vhost_poll {
 
 void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn);
 void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work);
+bool vhost_has_work(struct vhost_dev *dev);
 
 void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
 unsigned long mask, struct vhost_dev *dev);
-- 
2.1.4



Re: [PATCH net-next RFC V3 0/3] basic busy polling support for vhost_net

2015-11-12 Thread Felipe Franciosi
Hi Jason,

I understand your busy loop timeout is quite conservative at 50us. Did you try 
any other values?

Also, did you measure how polling affects many VMs talking to each other (e.g. 
20 VMs on each host, perhaps with several vNICs each, transmitting to a 
corresponding VM/vNIC pair on another host)?


On a complete separate experiment (busy waiting on storage I/O rings on Xen), I 
have observed that bigger timeouts gave bigger benefits. On the other hand, all 
cases that contended for CPU were badly hurt with any sort of polling.

The cases that contended for CPU consisted of many VMs generating workload over 
very fast I/O devices (in that case, several NVMe devices on a single host). 
And the metric that got affected was aggregate throughput from all VMs.

The solution was to determine whether to poll depending on the host's overall 
CPU utilisation at that moment. That gave me the best of both worlds as polling 
made everything faster without slowing down any other metric.

Thanks,
Felipe



On 12/11/2015 10:20, "kvm-ow...@vger.kernel.org on behalf of Jason Wang" 
 wrote:

>
>
>On 11/12/2015 06:16 PM, Jason Wang wrote:
>> Hi all:
>>
>> This series tries to add basic busy polling for vhost net. The idea is
>> simple: at the end of tx/rx processing, busy polling for new tx added
>> descriptor and rx receive socket for a while. The maximum number of
>> time (in us) could be spent on busy polling was specified ioctl.
>>
>> Test were done through:
>>
>> - 50 us as busy loop timeout
>> - Netperf 2.6
>> - Two machines with back to back connected ixgbe
>> - Guest with 1 vcpu and 1 queue
>>
>> Results:
>> - For stream workload, ioexits were reduced dramatically in medium
>>   size (1024-2048) of tx (at most -39%) and almost all rx (at most
>>   -79%) as a result of polling. This compensate for the possible
>>   wasted cpu cycles more or less. That porbably why we can still see
>>   some increasing in the normalized throughput in some cases.
>> - Throughput of tx were increased (at most 105%) expect for the huge
>>   write (16384). And we can send more packets in the case (+tpkts were
>>   increased).
>> - Very minor rx regression in some cases.
>> - Improvemnt on TCP_RR (at most 16%).
>
>Forget to mention, the following test results by order are:
>
>1) Guest TX
>2) Guest RX
>3) TCP_RR
>
>> size/session/+thu%/+normalize%/+tpkts%/+rpkts%/+ioexits%/
>>64/ 1/   +9%/  -17%/   +5%/  +10%/   -2%
>>64/ 2/   +8%/  -18%/   +6%/  +10%/   -1%
>>64/ 4/   +4%/  -21%/   +6%/  +10%/   -1%
>>64/ 8/   +9%/  -17%/   +6%/   +9%/   -2%
>>   256/ 1/  +20%/   -1%/  +15%/  +11%/   -9%
>>   256/ 2/  +15%/   -6%/  +15%/   +8%/   -8%
>>   256/ 4/  +17%/   -4%/  +16%/   +8%/   -8%
>>   256/ 8/  -61%/  -69%/  +16%/  +10%/  -10%
>>   512/ 1/  +15%/   -3%/  +19%/  +18%/  -11%
>>   512/ 2/  +19%/0%/  +19%/  +13%/  -10%
>>   512/ 4/  +18%/   -2%/  +18%/  +15%/  -10%
>>   512/ 8/  +17%/   -1%/  +18%/  +15%/  -11%
>>  1024/ 1/  +25%/   +4%/  +27%/  +16%/  -21%
>>  1024/ 2/  +28%/   +8%/  +25%/  +15%/  -22%
>>  1024/ 4/  +25%/   +5%/  +25%/  +14%/  -21%
>>  1024/ 8/  +27%/   +7%/  +25%/  +16%/  -21%
>>  2048/ 1/  +32%/  +12%/  +31%/  +22%/  -38%
>>  2048/ 2/  +33%/  +12%/  +30%/  +23%/  -36%
>>  2048/ 4/  +31%/  +10%/  +31%/  +24%/  -37%
>>  2048/ 8/ +105%/  +75%/  +33%/  +23%/  -39%
>> 16384/ 1/0%/  -14%/   +2%/0%/  +19%
>> 16384/ 2/0%/  -13%/  +19%/  -13%/  +17%
>> 16384/ 4/0%/  -12%/   +3%/0%/   +2%
>> 16384/ 8/0%/  -11%/   -2%/   +1%/   +1%
>> size/session/+thu%/+normalize%/+tpkts%/+rpkts%/+ioexits%/
>>64/ 1/   -7%/  -23%/   +4%/   +6%/  -74%
>>64/ 2/   -2%/  -12%/   +2%/   +2%/  -55%
>>64/ 4/   +2%/   -5%/  +10%/   -2%/  -43%
>>64/ 8/   -5%/   -5%/  +11%/  -34%/  -59%
>>   256/ 1/   -6%/  -16%/   +9%/  +11%/  -60%
>>   256/ 2/   +3%/   -4%/   +6%/   -3%/  -28%
>>   256/ 4/0%/   -5%/   -9%/   -9%/  -10%
>>   256/ 8/   -3%/   -6%/  -12%/   -9%/  -40%
>>   512/ 1/   -4%/  -17%/  -10%/  +21%/  -34%
>>   512/ 2/0%/   -9%/  -14%/   -3%/  -30%
>>   512/ 4/0%/   -4%/  -18%/  -12%/   -4%
>>   512/ 8/   -1%/   -4%/   -1%/   -5%/   +4%
>>  1024/ 1/0%/  -16%/  +12%/  +11%/  -10%
>>  1024/ 2/0%/  -11%/0%/   +5%/  -31%
>>  1024/ 4/0%/   -4%/   -7%/   +1%/  -22%
>>  1024/ 8/   -5%/   -6%/  -17%/  -29%/  -79%
>>  2048/ 1/0%/  -16%/   +1%/   +9%/  -10%
>>  2048/ 2/0%/  -12%/   +7%/   +9%/  -26%
>>  2048/ 4/0%/   -7%/   -4%/   +3%/  -64%
>>  2048/ 8/   -1%/   -5%/   -6%/   +4%/  -20%
>> 16384/ 1/0%/  -12%/  +11%/   +7%/  -20%
>> 16384/ 2/0%/   -7%/   +1%/   +5%/  -26%
>> 16384/ 4/0%/   -5%/  +12%/  +22%/  -23%
>> 16384/ 8/0%/   -1%/   -8%/   +5%/   -3%
>> 

Re: [PATCH net-next RFC V3 0/3] basic busy polling support for vhost_net

2015-11-12 Thread Jason Wang


On 11/12/2015 06:16 PM, Jason Wang wrote:
> Hi all:
>
> This series tries to add basic busy polling for vhost net. The idea is
> simple: at the end of tx/rx processing, busy polling for new tx added
> descriptor and rx receive socket for a while. The maximum number of
> time (in us) could be spent on busy polling was specified ioctl.
>
> Test were done through:
>
> - 50 us as busy loop timeout
> - Netperf 2.6
> - Two machines with back to back connected ixgbe
> - Guest with 1 vcpu and 1 queue
>
> Results:
> - For stream workload, ioexits were reduced dramatically in medium
>   size (1024-2048) of tx (at most -39%) and almost all rx (at most
>   -79%) as a result of polling. This compensate for the possible
>   wasted cpu cycles more or less. That porbably why we can still see
>   some increasing in the normalized throughput in some cases.
> - Throughput of tx were increased (at most 105%) expect for the huge
>   write (16384). And we can send more packets in the case (+tpkts were
>   increased).
> - Very minor rx regression in some cases.
> - Improvemnt on TCP_RR (at most 16%).

Forget to mention, the following test results by order are:

1) Guest TX
2) Guest RX
3) TCP_RR

> size/session/+thu%/+normalize%/+tpkts%/+rpkts%/+ioexits%/
>64/ 1/   +9%/  -17%/   +5%/  +10%/   -2%
>64/ 2/   +8%/  -18%/   +6%/  +10%/   -1%
>64/ 4/   +4%/  -21%/   +6%/  +10%/   -1%
>64/ 8/   +9%/  -17%/   +6%/   +9%/   -2%
>   256/ 1/  +20%/   -1%/  +15%/  +11%/   -9%
>   256/ 2/  +15%/   -6%/  +15%/   +8%/   -8%
>   256/ 4/  +17%/   -4%/  +16%/   +8%/   -8%
>   256/ 8/  -61%/  -69%/  +16%/  +10%/  -10%
>   512/ 1/  +15%/   -3%/  +19%/  +18%/  -11%
>   512/ 2/  +19%/0%/  +19%/  +13%/  -10%
>   512/ 4/  +18%/   -2%/  +18%/  +15%/  -10%
>   512/ 8/  +17%/   -1%/  +18%/  +15%/  -11%
>  1024/ 1/  +25%/   +4%/  +27%/  +16%/  -21%
>  1024/ 2/  +28%/   +8%/  +25%/  +15%/  -22%
>  1024/ 4/  +25%/   +5%/  +25%/  +14%/  -21%
>  1024/ 8/  +27%/   +7%/  +25%/  +16%/  -21%
>  2048/ 1/  +32%/  +12%/  +31%/  +22%/  -38%
>  2048/ 2/  +33%/  +12%/  +30%/  +23%/  -36%
>  2048/ 4/  +31%/  +10%/  +31%/  +24%/  -37%
>  2048/ 8/ +105%/  +75%/  +33%/  +23%/  -39%
> 16384/ 1/0%/  -14%/   +2%/0%/  +19%
> 16384/ 2/0%/  -13%/  +19%/  -13%/  +17%
> 16384/ 4/0%/  -12%/   +3%/0%/   +2%
> 16384/ 8/0%/  -11%/   -2%/   +1%/   +1%
> size/session/+thu%/+normalize%/+tpkts%/+rpkts%/+ioexits%/
>64/ 1/   -7%/  -23%/   +4%/   +6%/  -74%
>64/ 2/   -2%/  -12%/   +2%/   +2%/  -55%
>64/ 4/   +2%/   -5%/  +10%/   -2%/  -43%
>64/ 8/   -5%/   -5%/  +11%/  -34%/  -59%
>   256/ 1/   -6%/  -16%/   +9%/  +11%/  -60%
>   256/ 2/   +3%/   -4%/   +6%/   -3%/  -28%
>   256/ 4/0%/   -5%/   -9%/   -9%/  -10%
>   256/ 8/   -3%/   -6%/  -12%/   -9%/  -40%
>   512/ 1/   -4%/  -17%/  -10%/  +21%/  -34%
>   512/ 2/0%/   -9%/  -14%/   -3%/  -30%
>   512/ 4/0%/   -4%/  -18%/  -12%/   -4%
>   512/ 8/   -1%/   -4%/   -1%/   -5%/   +4%
>  1024/ 1/0%/  -16%/  +12%/  +11%/  -10%
>  1024/ 2/0%/  -11%/0%/   +5%/  -31%
>  1024/ 4/0%/   -4%/   -7%/   +1%/  -22%
>  1024/ 8/   -5%/   -6%/  -17%/  -29%/  -79%
>  2048/ 1/0%/  -16%/   +1%/   +9%/  -10%
>  2048/ 2/0%/  -12%/   +7%/   +9%/  -26%
>  2048/ 4/0%/   -7%/   -4%/   +3%/  -64%
>  2048/ 8/   -1%/   -5%/   -6%/   +4%/  -20%
> 16384/ 1/0%/  -12%/  +11%/   +7%/  -20%
> 16384/ 2/0%/   -7%/   +1%/   +5%/  -26%
> 16384/ 4/0%/   -5%/  +12%/  +22%/  -23%
> 16384/ 8/0%/   -1%/   -8%/   +5%/   -3%
> size/session/+thu%/+normalize%/+tpkts%/+rpkts%/+ioexits%/
> 1/ 1/   +9%/  -29%/   +9%/   +9%/   +9%
> 1/25/   +6%/  -18%/   +6%/   +6%/   -1%
> 1/50/   +6%/  -19%/   +5%/   +5%/   -2%
> 1/   100/   +5%/  -19%/   +4%/   +4%/   -3%
>64/ 1/  +10%/  -28%/  +10%/  +10%/  +10%
>64/25/   +8%/  -18%/   +7%/   +7%/   -2%
>64/50/   +8%/  -17%/   +8%/   +8%/   -1%
>64/   100/   +8%/  -17%/   +8%/   +8%/   -1%
>   256/ 1/  +10%/  -28%/  +10%/  +10%/  +10%
>   256/25/  +15%/  -13%/  +15%/  +15%/0%
>   256/50/  +16%/  -14%/  +18%/  +18%/   +2%
>   256/   100/  +15%/  -13%/  +12%/  +12%/   -2%
>
> Changes from V2:
> - poll also at the end of rx handling
> - factor out the polling logic and optimize the code a little bit
> - add two ioctls to get and set the busy poll timeout
> - test on ixgbe (which can give more stable and reproducable numbers)
>   instead of mlx4.
>
> Changes from V1:
> - Add a comment for vhost_has_work() to explain why it could be
>   lockless
> - Add param description for busyloop_timeout
> - Split out the busy polling logic into a new helper
> - Check and exit the loop when there's a pending signal
> - Disable preemption during busy looping to make sure lock_clock() was
>   correctly 

Re: [PATCH net] sctp: translate host order to network order when setting a hmacid

2015-11-12 Thread marcelo . leitner
On Thu, Nov 12, 2015 at 01:07:07PM +0800, Xin Long wrote:
> now sctp auth cannot work well when setting a hmacid manually, which
> is caused by that we didn't use the network order for hmacid, so fix
> it by adding the transformation in sctp_auth_ep_set_hmacs.
> 
> even we set hmacid with the network order in userspace, it still
> can't work, because of this condition in sctp_auth_ep_set_hmacs():
> 
>   if (id > SCTP_AUTH_HMAC_ID_MAX)
>   return -EOPNOTSUPP;
> 
> so this wasn't working before and thus it won't break compatibility.
> 
> Signed-off-by: Xin Long 
> Signed-off-by: Marcelo Ricardo Leitner 

Fixes: 65b07e5d0d09 ("[SCTP]: API updates to suport SCTP-AUTH
extensions.")

Thanks!

> ---
>  net/sctp/auth.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/net/sctp/auth.c b/net/sctp/auth.c
> index 4f15b7d..1543e39 100644
> --- a/net/sctp/auth.c
> +++ b/net/sctp/auth.c
> @@ -809,8 +809,8 @@ int sctp_auth_ep_set_hmacs(struct sctp_endpoint *ep,
>   if (!has_sha1)
>   return -EINVAL;
>  
> - memcpy(ep->auth_hmacs_list->hmac_ids, &hmacs->shmac_idents[0],
> - hmacs->shmac_num_idents * sizeof(__u16));
> + for (i = 0; i < hmacs->shmac_num_idents; i++)
> + ep->auth_hmacs_list->hmac_ids[i] = htons(hmacs->shmac_idents[i]);
>   ep->auth_hmacs_list->param_hdr.length = htons(sizeof(sctp_paramhdr_t) +
>   hmacs->shmac_num_idents * sizeof(__u16));
>   return 0;
> -- 
> 2.1.0
> 
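To make the byte-order point concrete, a hypothetical standalone example
(plain userspace C, little-endian host assumed), separate from the kernel
patch above:

    #include <arpa/inet.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint16_t id = 1;                    /* HMAC-SHA-1 identifier */
        uint16_t wire = htons(id);
        unsigned char *b = (unsigned char *)&wire;

        /* A raw memcpy of 'id' would put 01 00 on the wire on this host;
         * the peer expects the big-endian 00 01 that htons() produces,
         * which is why the fix converts each identifier individually. */
        printf("%02x %02x\n", b[0], b[1]);  /* prints: 00 01 */
        return 0;
    }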


Re: [Intel-wired-lan] regression in ixgbe SFP detection patch

2015-11-12 Thread William Dauchy
On Nov11 22:13, Rustad, Mark D wrote:
> Just so you know, there are patches in queue that will improve this situation 
> in two ways:
> 1) When the I2C probe times out, the code assumes that the cage is empty and 
> does not retry the access until the next probe.
> 2) The driver will use its own private workqueue, so it will not affect the 
> system workqueues at all.

Thanks guys for the details,  I will have a look.
-- 
William




Re: [PATCH] vhost: move is_le setup to the backend

2015-11-12 Thread Cornelia Huck
On Fri, 30 Oct 2015 12:42:35 +0100
Greg Kurz  wrote:

> The vq->is_le field is used to fix endianness when accessing the vring via
> the cpu_to_vhost16() and vhost16_to_cpu() helpers in the following cases:
> 
> 1) host is big endian and device is modern virtio
> 
> 2) host has cross-endian support and device is legacy virtio with a different
>endianness than the host
> 
> Both cases rely on the VHOST_SET_FEATURES ioctl, but 2) also needs the
> VHOST_SET_VRING_ENDIAN ioctl to be called by userspace. Since vq->is_le
> is only needed when the backend is active, it was decided to set it at
> backend start.
> 
> This is currently done in vhost_init_used()->vhost_init_is_le() but it
> obfuscates the core vhost code. This patch moves the is_le setup to a
> dedicated function that is called from the backend code.
> 
> Note vhost_net is the only backend that can pass vq->private_data == NULL to
> vhost_init_used(), hence the "if (sock)" branch.
> 
> No behaviour change.
> 
> Signed-off-by: Greg Kurz 
> ---
>  drivers/vhost/net.c   |6 ++
>  drivers/vhost/scsi.c  |3 +++
>  drivers/vhost/test.c  |2 ++
>  drivers/vhost/vhost.c |   12 +++-
>  drivers/vhost/vhost.h |1 +
>  5 files changed, 19 insertions(+), 5 deletions(-)

Makes sense.

Reviewed-by: Cornelia Huck 
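As a rough sketch of the two cases listed in the changelog (illustrative only;
apart from is_le, the struct and helper names here are assumptions, not the
vhost code):

    struct example_vq {
        bool has_version_1;
        bool cross_endian_set;
        bool user_wants_little_endian;
        bool is_le;
    };

    static void example_backend_set_is_le(struct example_vq *vq)
    {
        if (vq->has_version_1) {
            /* Case 1: modern virtio is always little-endian on the wire. */
            vq->is_le = true;
        } else if (vq->cross_endian_set) {
            /* Case 2: legacy device whose endianness was fixed by the
             * VHOST_SET_VRING_ENDIAN ioctl. */
            vq->is_le = vq->user_wants_little_endian;
        } else {
            /* Legacy device, host-native byte order. */
            vq->is_le = example_host_is_little_endian();
        }
    }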



[PATCH] net: thunder: Fix crash upon shutdown after failed probe

2015-11-12 Thread Pavel Fedin
If device probe fails, the driver remains bound to the PCI device, but the
driver data has been reset to NULL. This causes a crash when it is
dereferenced in nicvf_remove().

Signed-off-by: Pavel Fedin 
---
 drivers/net/ethernet/cavium/thunder/nicvf_main.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_main.c 
b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
index a937772..372c39e 100644
--- a/drivers/net/ethernet/cavium/thunder/nicvf_main.c
+++ b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
@@ -1600,6 +1600,9 @@ static void nicvf_remove(struct pci_dev *pdev)
 
 static void nicvf_shutdown(struct pci_dev *pdev)
 {
+   if (!pci_get_drvdata(pdev))
+   return;
+
nicvf_remove(pdev);
 }
 
-- 
2.4.4



Re: [PATCH v2] stmmac: avoid ipq806x constant overflow warning

2015-11-12 Thread Joe Perches
On Fri, 2015-11-13 at 08:37 +0100, Geert Uytterhoeven wrote:
> On Thu, Nov 12, 2015 at 10:03 PM, Arnd Bergmann 
> wrote:
> > Building dwmac-ipq806x on a 64-bit architecture produces a harmless
> > warning from gcc:
> > 
> > stmmac/dwmac-ipq806x.c: In function 'ipq806x_gmac_probe':
> > include/linux/bitops.h:6:19: warning: overflow in implicit constant
> > conversion [-Woverflow]
> >   val = QSGMII_PHY_CDR_EN |
> > stmmac/dwmac-ipq806x.c:333:8: note: in expansion of macro
> > 'QSGMII_PHY_CDR_EN'
> >  #define QSGMII_PHY_CDR_EN   BIT(0)
> >  #define BIT(nr)   (1UL << (nr))
> > 
> > This is a result of the type conversion rules in C, when we take the
> > logical OR of multiple different types. In particular, we have an
> > unsigned long
> > 
> > QSGMII_PHY_CDR_EN == BIT(0) == (1ul << 0) == 0x0000000000000001ul
> > 
> > and a signed int
> > 
> > 0xC << QSGMII_PHY_TX_DRV_AMP_OFFSET == 0xc0000000
> > 
> > which together gives a signed long value
> > 
> > 0xffffffffc0000001l
> > 
> > and when this is passed into a function that takes an unsigned int
> > type, gcc warns about the signed overflow and the loss of the upper
> > 32 bits that are all ones.
> > 
> > This patch adds 'ul' type modifiers to the literal numbers passed in
> > here, so now the expression remains an 'unsigned long' with the upper
> > bits all zero, and that avoids the signed overflow and the warning.
> 
> FWIW, the 64-bitness of BIT() on 64-bit platforms is also causing subtle
> warnings in other places, e.g. when inverting them to create bit mask, cfr.
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=a9efeca613a8fe5281d7c91f5c8c9ea46f2312f6
> 
> Gr{oetje,eeting}s,

I still think specific length BIT macros
can be useful.

https://lkml.org/lkml/2015/10/16/852
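A tiny standalone example of the promotion being discussed, separate from the
driver (exact diagnostics vary by gcc version, so take the warning comment as
an approximation):

    static void takes_u32(unsigned int val) { (void)val; }

    #define EXAMPLE_BIT0 (1UL << 0)   /* like BIT(0) on a 64-bit build */
    #define EXAMPLE_AMP  (0xC << 28)  /* signed int with the sign bit set,
                                         mirroring the driver's expression */

    int main(void)
    {
        /* The int operand is sign-extended to unsigned long, giving
         * 0xffffffffc0000001ul; narrowing that constant to a 32-bit
         * parameter is what gcc complains about in the driver. */
        takes_u32(EXAMPLE_BIT0 | EXAMPLE_AMP);

        /* Keeping the operands unsigned, as the approaches discussed in
         * the thread do, avoids the sign extension and the warning. */
        return 0;
    }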



[PATCH RESEND] net: smsc911x: Reset PHY during initialization

2015-11-12 Thread Pavel Fedin
On certain hardware, the chip may get stuck after a software reboot and fail
to reinitialize during reset. This can be fixed by ensuring that the PHY is
reset too.

The old PHY reset method required an operational MDIO interface, so the chip
had to be set up already. To make the reset work during probe as well, it is
changed to use the PMT_CTRL register instead.

The problem was observed on an SMDK5410 board.

Signed-off-by: Pavel Fedin 
---
 drivers/net/ethernet/smsc/smsc911x.c | 17 ++---
 1 file changed, 6 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/smsc/smsc911x.c 
b/drivers/net/ethernet/smsc/smsc911x.c
index c860c90..219a99b 100644
--- a/drivers/net/ethernet/smsc/smsc911x.c
+++ b/drivers/net/ethernet/smsc/smsc911x.c
@@ -809,22 +809,17 @@ static int smsc911x_phy_check_loopbackpkt(struct 
smsc911x_data *pdata)
 
 static int smsc911x_phy_reset(struct smsc911x_data *pdata)
 {
-   struct phy_device *phy_dev = pdata->phy_dev;
unsigned int temp;
unsigned int i = 10;
 
-   BUG_ON(!phy_dev);
-   BUG_ON(!phy_dev->bus);
-
-   SMSC_TRACE(pdata, hw, "Performing PHY BCR Reset");
-   smsc911x_mii_write(phy_dev->bus, phy_dev->addr, MII_BMCR, BMCR_RESET);
+   temp = smsc911x_reg_read(pdata, PMT_CTRL);
+   smsc911x_reg_write(pdata, PMT_CTRL, temp | PMT_CTRL_PHY_RST_);
do {
msleep(1);
-   temp = smsc911x_mii_read(phy_dev->bus, phy_dev->addr,
-   MII_BMCR);
-   } while ((i--) && (temp & BMCR_RESET));
+   temp = smsc911x_reg_read(pdata, PMT_CTRL);
+   } while ((i--) && (temp & PMT_CTRL_PHY_RST_));
 
-   if (temp & BMCR_RESET) {
+   if (unlikely(temp & PMT_CTRL_PHY_RST_)) {
SMSC_WARN(pdata, hw, "PHY reset failed to complete");
return -EIO;
}
@@ -2296,7 +2291,7 @@ static int smsc911x_init(struct net_device *dev)
}
 
/* Reset the LAN911x */
-   if (smsc911x_soft_reset(pdata))
+   if (smsc911x_phy_reset(pdata) || smsc911x_soft_reset(pdata))
return -ENODEV;
 
dev->flags |= IFF_MULTICAST;
-- 
2.4.4
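Purely as an illustration of an alternative to the open-coded loop above, the
same wait-for-clear could be written with readx_poll_timeout() from
<linux/iopoll.h>; the one-argument read wrapper and the 1 ms / 100 ms values
are assumptions for the sketch:

    /* readx_poll_timeout() wants a single-argument read op. */
    static u32 example_read_pmt_ctrl(struct smsc911x_data *pdata)
    {
        return smsc911x_reg_read(pdata, PMT_CTRL);
    }

    static int example_phy_reset(struct smsc911x_data *pdata)
    {
        unsigned int temp;

        smsc911x_reg_write(pdata, PMT_CTRL,
                           smsc911x_reg_read(pdata, PMT_CTRL) |
                           PMT_CTRL_PHY_RST_);
        /* Poll roughly every 1 ms, give up after 100 ms. */
        return readx_poll_timeout(example_read_pmt_ctrl, pdata, temp,
                                  !(temp & PMT_CTRL_PHY_RST_),
                                  1000, 100000);
    }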



Re: [PATCH 08/14] net: tcp_memcontrol: sanitize tcp memory accounting callbacks

2015-11-12 Thread Johannes Weiner
On Thu, Nov 12, 2015 at 08:53:38PM -0800, Eric Dumazet wrote:
> On Thu, 2015-11-12 at 18:41 -0500, Johannes Weiner wrote:
> > @@ -711,6 +705,12 @@ static inline void mem_cgroup_wb_stats(struct 
> > bdi_writeback *wb,
> >  struct sock;
> >  void sock_update_memcg(struct sock *sk);
> >  void sock_release_memcg(struct sock *sk);
> > +bool mem_cgroup_charge_skmem(struct cg_proto *proto, unsigned int 
> > nr_pages);
> > +void mem_cgroup_uncharge_skmem(struct cg_proto *proto, unsigned int 
> > nr_pages);
> > +static inline bool mem_cgroup_under_socket_pressure(struct cg_proto *proto)
> > +{
> > +   return proto->memory_pressure;
> > +}
> >  #endif /* CONFIG_INET && CONFIG_MEMCG_KMEM */
> >  
> >  #ifdef CONFIG_MEMCG_KMEM
> > diff --git a/include/net/sock.h b/include/net/sock.h
> > index 2eefc99..8cc7613 100644
> > --- a/include/net/sock.h
> > +++ b/include/net/sock.h
> > @@ -1126,8 +1126,8 @@ static inline bool sk_under_memory_pressure(const 
> > struct sock *sk)
> > if (!sk->sk_prot->memory_pressure)
> > return false;
> >  
> > -   if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
> > -   return !!sk->sk_cgrp->memory_pressure;
> > +   if (mem_cgroup_sockets_enabled && sk->sk_cgrp &&
> > +   mem_cgroup_under_socket_pressure(sk->sk_cgrp))
> >  
> > return !!*sk->sk_prot->memory_pressure;
> >  }
> 
> 
> This looks wrong ?
> 
> if (A && B && C)
> return !!*sk->sk_prot->memory_pressure;
> 
>  as this function should not return void>

Yikes, you're right. This is missing a return true.

[ Just forced a complete rebuild and of course it warns at control
  reaching end of non-void function. ]

I'm stumped by how I could have missed it as I rebuild after every
commit with make -s, so a warning should stand out. And it should
definitely rebuild the callers frequently as most patches change
memcontrol.h. Probably a screwup in the final series polishing.
I'm going to go over this carefully one more time tomorrow.

Meanwhile, this is the missing piece and the updated patch.

Thanks Eric.

diff --git a/include/net/sock.h b/include/net/sock.h
index 8cc7613..f954e2a 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1128,6 +1128,7 @@ static inline bool sk_under_memory_pressure(const struct 
sock *sk)
 
if (mem_cgroup_sockets_enabled && sk->sk_cgrp &&
mem_cgroup_under_socket_pressure(sk->sk_cgrp))
+   return true;
 
return !!*sk->sk_prot->memory_pressure;
 }

---
>From 4a24ca67e5b0f651a68807ee99f714437ffd6109 Mon Sep 17 00:00:00 2001
From: Johannes Weiner 
Date: Tue, 10 Nov 2015 17:14:41 -0500
Subject: [PATCH v2] net: tcp_memcontrol: sanitize tcp memory accounting 
callbacks

There won't be a tcp control soft limit, so integrating the memcg code
into the global skmem limiting scheme complicates things
unnecessarily. Replace this with simple and clear charge and uncharge
calls--hidden behind a jump label--to account skb memory.

Note that this is not purely aesthetic: as a result of shoehorning the
per-memcg code into the same memory accounting functions that handle
the global level, the old code would compare the per-memcg consumption
against the smaller of the per-memcg limit and the global limit. This
allowed the total consumption of multiple sockets to exceed the global
limit, as long as the individual sockets stayed within bounds. After
this change, the code will always compare the per-memcg consumption to
the per-memcg limit, and the global consumption to the global limit,
and thus close this loophole.

Without a soft limit, the per-memcg memory pressure state in sockets
is generally questionable. However, we did it until now, so we
continue to enter it when the hard limit is hit, and packets are
dropped, to let other sockets in the cgroup know that they shouldn't
grow their transmit windows, either. However, keep it simple in the
new callback model and leave memory pressure lazily when the next
packet is accepted (as opposed to doing it synchronously when packets
are processed). When packets are dropped, network performance will
already be in the toilet, so that should be a reasonable trade-off.

As described above, consumption is now checked on the per-memcg level
and the global level separately. Likewise, memory pressure states are
maintained on both the per-memcg level and the global level, and a
socket is considered under pressure when either level asserts as much.

Signed-off-by: Johannes Weiner 
---
 include/linux/memcontrol.h | 12 -
 include/net/sock.h | 64 ++
 include/net/tcp.h  |  5 ++--
 mm/memcontrol.c| 32 +++
 net/core/sock.c| 26 +++
 net/ipv4/tcp_output.c  |  7 +++--
 6 files changed, 70 insertions(+), 76 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 96ca3d3..906dfff 100644
--- a/include/linux/memcontrol.h
+++ 
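As a rough sketch of how the charge/uncharge pair declared above might be used
from a socket accounting path; this is illustrative only, with the jump label
and the global (sk_prot-level) accounting elided, and the return-value
semantics assumed (false meaning the charge did not fit):

    static bool example_charge_sk_pages(struct sock *sk, unsigned int nr_pages)
    {
        /* Per-memcg accounting, only when the socket belongs to a cgroup. */
        if (mem_cgroup_sockets_enabled && sk->sk_cgrp &&
            !mem_cgroup_charge_skmem(sk->sk_cgrp, nr_pages))
            return false;           /* memcg limit hit: caller backs off */

        /* The global per-protocol charge would still happen separately,
         * which is the loophole-closing point made in the changelog. */
        return true;
    }

    static void example_uncharge_sk_pages(struct sock *sk, unsigned int nr_pages)
    {
        if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
            mem_cgroup_uncharge_skmem(sk->sk_cgrp, nr_pages);
    }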

Re: [PATCH v2] stmmac: avoid ipq806x constant overflow warning

2015-11-12 Thread Geert Uytterhoeven
On Thu, Nov 12, 2015 at 10:03 PM, Arnd Bergmann  wrote:
> Building dwmac-ipq806x on a 64-bit architecture produces a harmless
> warning from gcc:
>
> stmmac/dwmac-ipq806x.c: In function 'ipq806x_gmac_probe':
> include/linux/bitops.h:6:19: warning: overflow in implicit constant 
> conversion [-Woverflow]
>   val = QSGMII_PHY_CDR_EN |
> stmmac/dwmac-ipq806x.c:333:8: note: in expansion of macro 'QSGMII_PHY_CDR_EN'
>  #define QSGMII_PHY_CDR_EN   BIT(0)
>  #define BIT(nr)   (1UL << (nr))
>
> This is a result of the type conversion rules in C, when we take the
> logical OR of multiple different types. In particular, we have an
> unsigned long
>
> QSGMII_PHY_CDR_EN == BIT(0) == (1ul << 0) == 0x0000000000000001ul
>
> and a signed int
>
> 0xC << QSGMII_PHY_TX_DRV_AMP_OFFSET == 0xc0000000
>
> which together gives a signed long value
>
> 0xffffffffc0000001l
>
> and when this is passed into a function that takes an unsigned int type,
> gcc warns about the signed overflow and the loss of the upper 32-bits that
> are all ones.
>
> This patch adds 'ul' type modifiers to the literal numbers passed in
> here, so now the expression remains an 'unsigned long' with the upper
> bits all zero, and that avoids the signed overflow and the warning.

FWIW, the 64-bitness of BIT() on 64-bit platforms is also causing subtle
warnings in other places, e.g. when inverting them to create bit mask, cfr.
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=a9efeca613a8fe5281d7c91f5c8c9ea46f2312f6

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds


Re: [PATCH v5] net: ethernet: add driver for Aurora VLSI NB8800 Ethernet controller

2015-11-12 Thread Mason
On 10/11/2015 20:25, Måns Rullgård wrote:

> Mason writes:
> 
>> On 10/11/2015 17:14, Mans Rullgard wrote:
>>
>>> This adds a driver for the Aurora VLSI NB8800 Ethernet controller.
>>> It is an almost complete rewrite of a driver originally found in
>>> a Sigma Designs 2.6.22 tree.
>>>
>>> Signed-off-by: Mans Rullgard 
>>> ---
>>> Changes:
>>> - Refactored mdio access functions
>>> - Refactored register access helpers
>>> - Improved error handling in rx buffer allocation
>>> - Optimised some fifo parameters
>>> - Overhauled tx dma. Multiple packets are now chained in a single dma
>>>   operation if xmit_more is set, improving performance.
>>> - Improved rx irq handling. It's not possible to disable interrupts
>>>   entirely for napi poll, but they can be slowed down a little.
>>> - Use readx_poll_timeout in various places
>>> - Improved error detection
>>> - Improved statistics
>>> - Report hardware statistics counters through ethtool
>>> - Improved tangox-specific setup
>>> - Support for flow control using pause frames
>>> - Explanatory comments added
>>> - Various minor stylistic changes
>>> ---
>>>  drivers/net/ethernet/Kconfig |1 +
>>>  drivers/net/ethernet/Makefile|1 +
>>>  drivers/net/ethernet/aurora/Kconfig  |   20 +
>>>  drivers/net/ethernet/aurora/Makefile |1 +
>>>  drivers/net/ethernet/aurora/nb8800.c | 1530 
>>> ++
>>>  drivers/net/ethernet/aurora/nb8800.h |  314 +++
>>>  6 files changed, 1867 insertions(+)
>>
>> The code has grown much since the previous patch, despite some
>> refactoring. Is this mostly due to ethtool_ops support?
>>
>>  drivers/net/ethernet/aurora/nb8800.c | 1146 
>> ++
>>  drivers/net/ethernet/aurora/nb8800.h |  230 +++
> 
> Some of the increase is from new features, some from improvements, and
> then there are a bunch of new comments.

Sweet.

With this version, my kernel boots faster than before
(I had been using a 5 month-old version.)

Before:

[0.613623] tangox-enet 26000.ethernet: SMP86xx internal Ethernet at 0x26000
[0.623638] libphy: tangox-mii: probed
[0.686527] tangox-enet 26000.ethernet: PHY: found Atheros 8035 ethernet at 
0x4
[0.697169] tangox-enet 26000.ethernet eth0: MAC address 00:16:e8:02:08:42
...
[1.306360] Sending DHCP requests ..
[4.699969] tangox-enet 26000.ethernet eth0: Link is Up - 1Gbps/Full - flow 
control rx/tx
[8.899671] ., OK
[8.926343] IP-Config: Got DHCP answer from 172.27.200.1, my address is 
172.27.64.49
...
[8.987327] Freeing unused kernel memory: 168K (c039e000 - c03c8000)

After:

[0.623526] libphy: nb8800-mii: probed
[0.628092] nb8800 26000.ethernet eth0: MAC address 00:16:e8:02:08:42
...
[4.732948] nb8800 26000.ethernet eth0: Link is Up - 1Gbps/Full - flow 
control rx/tx
[4.752655] Sending DHCP requests ., OK
[4.782644] IP-Config: Got DHCP answer from 172.27.200.1, my address is 
172.27.64.49
...
[4.849298] Freeing unused kernel memory: 164K (c039f000 - c03c8000)


The DHCP request is sent later, but the kernel doesn't twiddle its thumbs
for 4 seconds after the link comes up. Does this come from not probing the
PHY anymore?

BTW, you're not using the PHY IRQ, right? I think I remember you saying
it didn't work reliably?

Regards.



Re: [PATCH] vhost: move is_le setup to the backend

2015-11-12 Thread Michael S. Tsirkin
On Fri, Oct 30, 2015 at 12:42:35PM +0100, Greg Kurz wrote:
> The vq->is_le field is used to fix endianness when accessing the vring via
> the cpu_to_vhost16() and vhost16_to_cpu() helpers in the following cases:
> 
> 1) host is big endian and device is modern virtio
> 
> 2) host has cross-endian support and device is legacy virtio with a different
>endianness than the host
> 
> Both cases rely on the VHOST_SET_FEATURES ioctl, but 2) also needs the
> VHOST_SET_VRING_ENDIAN ioctl to be called by userspace. Since vq->is_le
> is only needed when the backend is active, it was decided to set it at
> backend start.
> 
> This is currently done in vhost_init_used()->vhost_init_is_le() but it
> obfuscates the core vhost code. This patch moves the is_le setup to a
> dedicated function that is called from the backend code.
> 
> Note vhost_net is the only backend that can pass vq->private_data == NULL to
> vhost_init_used(), hence the "if (sock)" branch.
> 
> No behaviour change.
> 
> Signed-off-by: Greg Kurz 

I plan to look at this next week, busy with QEMU 2.5 now.

> ---
>  drivers/vhost/net.c   |6 ++
>  drivers/vhost/scsi.c  |3 +++
>  drivers/vhost/test.c  |2 ++
>  drivers/vhost/vhost.c |   12 +++-
>  drivers/vhost/vhost.h |1 +
>  5 files changed, 19 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 9eda69e40678..d6319cb2664c 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -917,6 +917,12 @@ static long vhost_net_set_backend(struct vhost_net *n, 
> unsigned index, int fd)
>  
>   vhost_net_disable_vq(n, vq);
>   vq->private_data = sock;
> +
> + if (sock)
> + vhost_set_is_le(vq);
> + else
> + vq->is_le = virtio_legacy_is_little_endian();
> +
>   r = vhost_init_used(vq);
>   if (r)
>   goto err_used;
> diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
> index e25a23692822..e2644a301fa5 100644
> --- a/drivers/vhost/scsi.c
> +++ b/drivers/vhost/scsi.c
> @@ -1276,6 +1276,9 @@ vhost_scsi_set_endpoint(struct vhost_scsi *vs,
>   vq = &vs->vqs[i].vq;
>   mutex_lock(&vq->mutex);
>   vq->private_data = vs_tpg;
> +
> + vhost_set_is_le(vq);
> +
>   vhost_init_used(vq);
>   mutex_unlock(&vq->mutex);
>   }
> diff --git a/drivers/vhost/test.c b/drivers/vhost/test.c
> index f2882ac98726..b1c7df502211 100644
> --- a/drivers/vhost/test.c
> +++ b/drivers/vhost/test.c
> @@ -196,6 +196,8 @@ static long vhost_test_run(struct vhost_test *n, int test)
>   oldpriv = vq->private_data;
>   vq->private_data = priv;
>  
> + vhost_set_is_le(vq);
> +
>   r = vhost_init_used(&n->vqs[index]);
>  
>   mutex_unlock(&vq->mutex);
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index eec2f11809ff..6be863dcbd13 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -113,6 +113,12 @@ static void vhost_init_is_le(struct vhost_virtqueue *vq)
>  }
>  #endif /* CONFIG_VHOST_CROSS_ENDIAN_LEGACY */
>  
> +void vhost_set_is_le(struct vhost_virtqueue *vq)
> +{
> + vhost_init_is_le(vq);
> +}
> +EXPORT_SYMBOL_GPL(vhost_set_is_le);
> +
>  static void vhost_poll_func(struct file *file, wait_queue_head_t *wqh,
>   poll_table *pt)
>  {
> @@ -1156,12 +1162,8 @@ int vhost_init_used(struct vhost_virtqueue *vq)
>  {
>   __virtio16 last_used_idx;
>   int r;
> - if (!vq->private_data) {
> - vq->is_le = virtio_legacy_is_little_endian();
> + if (!vq->private_data)
>   return 0;
> - }
> -
> - vhost_init_is_le(vq);
>  
>   r = vhost_update_used_flags(vq);
>   if (r)
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index 4772862b71a7..8a62041959fe 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -162,6 +162,7 @@ bool vhost_enable_notify(struct vhost_dev *, struct 
> vhost_virtqueue *);
>  
>  int vhost_log_write(struct vhost_virtqueue *vq, struct vhost_log *log,
>   unsigned int log_num, u64 len);
> +void vhost_set_is_le(struct vhost_virtqueue *vq);
>  
>  #define vq_err(vq, fmt, ...) do {  \
>   pr_debug(pr_fmt(fmt), ##__VA_ARGS__);   \


Re: [PATCH v5] net: ethernet: add driver for Aurora VLSI NB8800 Ethernet controller

2015-11-12 Thread Måns Rullgård
Mason  writes:

> On 10/11/2015 20:25, Måns Rullgård wrote:
>
>> Mason writes:
>> 
>>> On 10/11/2015 17:14, Mans Rullgard wrote:
>>>
 This adds a driver for the Aurora VLSI NB8800 Ethernet controller.
 It is an almost complete rewrite of a driver originally found in
 a Sigma Designs 2.6.22 tree.

 Signed-off-by: Mans Rullgard 
 ---
 Changes:
 - Refactored mdio access functions
 - Refactored register access helpers
 - Improved error handling in rx buffer allocation
 - Optimised some fifo parameters
 - Overhauled tx dma. Multiple packets are now chained in a single dma
   operation if xmit_more is set, improving performance.
 - Improved rx irq handling. It's not possible to disable interrupts
   entirely for napi poll, but they can be slowed down a little.
 - Use readx_poll_timeout in various places
 - Improved error detection
 - Improved statistics
 - Report hardware statistics counters through ethtool
 - Improved tangox-specific setup
 - Support for flow control using pause frames
 - Explanatory comments added
 - Various minor stylistic changes
 ---
  drivers/net/ethernet/Kconfig |1 +
  drivers/net/ethernet/Makefile|1 +
  drivers/net/ethernet/aurora/Kconfig  |   20 +
  drivers/net/ethernet/aurora/Makefile |1 +
  drivers/net/ethernet/aurora/nb8800.c | 1530 
 ++
  drivers/net/ethernet/aurora/nb8800.h |  314 +++
  6 files changed, 1867 insertions(+)
>>>
>>> The code has grown much since the previous patch, despite some
>>> refactoring. Is this mostly due to ethtool_ops support?
>>>
>>>  drivers/net/ethernet/aurora/nb8800.c | 1146 
>>> ++
>>>  drivers/net/ethernet/aurora/nb8800.h |  230 +++
>> 
>> Some of the increase is from new features, some from improvements, and
>> then there are a bunch of new comments.
>
> Sweet.
>
> With this version, my kernel boots faster than before
> (I had been using a 5 month-old version.)
>
> Before:
>
> [0.613623] tangox-enet 26000.ethernet: SMP86xx internal Ethernet at 
> 0x26000
> [0.623638] libphy: tangox-mii: probed
> [0.686527] tangox-enet 26000.ethernet: PHY: found Atheros 8035 ethernet 
> at 0x4
> [0.697169] tangox-enet 26000.ethernet eth0: MAC address 00:16:e8:02:08:42
> ...
> [1.306360] Sending DHCP requests ..
> [4.699969] tangox-enet 26000.ethernet eth0: Link is Up - 1Gbps/Full - 
> flow control rx/tx
> [8.899671] ., OK
> [8.926343] IP-Config: Got DHCP answer from 172.27.200.1, my address is 
> 172.27.64.49
> ...
> [8.987327] Freeing unused kernel memory: 168K (c039e000 - c03c8000)
>
> After:
>
> [0.623526] libphy: nb8800-mii: probed
> [0.628092] nb8800 26000.ethernet eth0: MAC address 00:16:e8:02:08:42
> ...
> [4.732948] nb8800 26000.ethernet eth0: Link is Up - 1Gbps/Full - flow 
> control rx/tx
> [4.752655] Sending DHCP requests ., OK
> [4.782644] IP-Config: Got DHCP answer from 172.27.200.1, my address is 
> 172.27.64.49
> ...
> [4.849298] Freeing unused kernel memory: 164K (c039f000 - c03c8000)
>
> The DHCP request is sent later, but the kernel doesn't twiddle its thumbs
> for 4 seconds after the link comes up. Does this come from not probing the
> PHY anymore?

No, that's from properly setting the link state initially down.
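For reference, a common way to get that behaviour (a sketch of the usual
pattern, not necessarily what nb8800 does; example_priv and its phydev field
are assumptions):

    struct example_priv {
        struct phy_device *phydev;
    };

    static int example_open(struct net_device *dev)
    {
        struct example_priv *priv = netdev_priv(dev);

        netif_carrier_off(dev);   /* report no link until the PHY confirms one */
        phy_start(priv->phydev);  /* carrier comes up from the adjust_link callback */
        netif_start_queue(dev);
        return 0;
    }

That way IP autoconfig waits for the real carrier event instead of assuming
the link is already up.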

> BTW, you're not using the PHY IRQ, right? I think I remember you saying
> it didn't work reliably?

It doesn't seem to be wired up on any of my boards, or there's some
magic required to activate it that I'm unaware of.

-- 
Måns Rullgård
m...@mansr.com


[PATCH] stmmac: avoid ipq806x constant overflow warning

2015-11-12 Thread Arnd Bergmann
Building dwmac-ipq806x on a 64-bit architecture produces a harmless
warning from gcc:

stmmac/dwmac-ipq806x.c: In function 'ipq806x_gmac_probe':
include/linux/bitops.h:6:19: warning: overflow in implicit constant conversion 
[-Woverflow]
  val = QSGMII_PHY_CDR_EN |
stmmac/dwmac-ipq806x.c:333:8: note: in expansion of macro 'QSGMII_PHY_CDR_EN'
 #define QSGMII_PHY_CDR_EN   BIT(0)
 #define BIT(nr)   (1UL << (nr))

The compiler warns about the fact that a 64-bit literal is passed
into a function that takes a 32-bit argument. I could not fully understand
why it warns despite the fact that this number is always small enough
to fit, but changing the use of BIT() macros into the equivalent hexadecimal
representation avoids the warning

Signed-off-by: Arnd Bergmann 
Fixes: b1c17215d718 ("stmmac: add ipq806x glue layer")
---
This came up on the arm64 allmodconfig build

diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-ipq806x.c 
b/drivers/net/ethernet/stmicro/stmmac/dwmac-ipq806x.c
index 9d89bdbf029f..4abd9b0b542a 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac-ipq806x.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-ipq806x.c
@@ -77,11 +77,11 @@
 /* Only GMAC1/2/3 support SGMII and their CTL register are not contiguous */
 #define QSGMII_PHY_SGMII_CTL(x)((x == 1) ? 0x134 : \
 (0x13c + (4 * (x - 2))))
-#define QSGMII_PHY_CDR_EN  BIT(0)
-#define QSGMII_PHY_RX_FRONT_EN BIT(1)
-#define QSGMII_PHY_RX_SIGNAL_DETECT_EN BIT(2)
-#define QSGMII_PHY_TX_DRIVER_ENBIT(3)
-#define QSGMII_PHY_QSGMII_EN   BIT(7)
+#define QSGMII_PHY_CDR_EN  0x01u
+#define QSGMII_PHY_RX_FRONT_EN 0x02u
+#define QSGMII_PHY_RX_SIGNAL_DETECT_EN 0x04u
+#define QSGMII_PHY_TX_DRIVER_EN0x08u
+#define QSGMII_PHY_QSGMII_EN   0x80u
 #define QSGMII_PHY_PHASE_LOOP_GAIN_OFFSET  12
 #define QSGMII_PHY_PHASE_LOOP_GAIN_MASK0x7
 #define QSGMII_PHY_RX_DC_BIAS_OFFSET   18



Re: [PATCH v2 0/6] net: dsa: mv88e6060: cleanup and fix setup

2015-11-12 Thread Neil Armstrong
On 11/10/2015 05:14 PM, Andrew Lunn wrote:
> On Tue, Nov 10, 2015 at 04:51:09PM +0100, Neil Armstrong wrote:
>> This patchset introduces some fixes and a registers addressing cleanup for
>> the mv88e6060 DSA driver.
> 
> Hi Neil
> 
> It is normal for netdev to put into the email subject of patches which
> tree these patches are for. "net" would be the latest -rcX and is for
> fixes only. "net-next" would be for new work aimed at the next merge
> window.
> 
> So long as Dave does not complain, leave them as they are now. But
> please try to follow this for your next patches.
> 
> Thanks
>Andrew
> 
Andrew,

Understood, I'll be careful for the next submissions.

Thanks,
Neil


Re: Fw: [Bug 107141] New: [1177853.192071] INFO: task accel-pppd:24263 blocked for more than 120 seconds.

2015-11-12 Thread Guillaume Nault
On Wed, Nov 11, 2015 at 08:57:15AM -0800, Stephen Hemminger wrote:
> 
> 
> Begin forwarded message:
> 
> Date: Wed, 4 Nov 2015 10:46:56 +
> From: "bugzilla-dae...@bugzilla.kernel.org" 
> 
> To: "shemmin...@linux-foundation.org" 
> Subject: [Bug 107141] New: [1177853.192071] INFO: task accel-pppd:24263 
> blocked for more than 120 seconds.
> 
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=107141
> 
> Bug ID: 107141
>Summary: [1177853.192071] INFO: task accel-pppd:24263 blocked
> for more than 120 seconds.
>Product: Networking
>Version: 2.5
> Kernel Version: 4.1.12 , 4.2.0 , 4.3
>   Hardware: All
> OS: Linux
>   Tree: Mainline
> Status: NEW
>   Severity: high
>   Priority: P1
>  Component: Other
>   Assignee: shemmin...@linux-foundation.org
>   Reporter: pstaszew...@artcom.pl
> Regression: No
> 
> Latest checked kernel where no problem exist:
> 3.13.0
> 
> 
> 
> On any kernel >4.0 I have always same problem as below:
> 
[...]

> [1177853.192196] INFO: task accel-pppd:10821 blocked for more than 120 
> seconds.
> [1177853.192196]   Not tainted 4.2.0 #1
> [1177853.192197] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> this message.
> [1177853.192198] accel-pppd  D 88021fd14680 0 10821  1
> 0x
> [1177853.192199]  8800d771fd08 0082 8800d771fcd8
> 8802161c
> [1177853.192201]  8800d772 8801fe650bc0 8800d771fce8
> 8800d772
> [1177853.192202]  8801fe650bc0 81a9d264 81a9d268
> 
> [1177853.192203] Call Trace:
> [1177853.192205]  [] schedule+0x71/0x80
> [1177853.192207]  [] schedule_preempt_disabled+0x9/0xb
> [1177853.192208]  [] __mutex_lock_slowpath+0xa6/0x104
> [1177853.192210]  [] mutex_lock+0x13/0x24
> [1177853.192212]  [] rtnl_lock+0x10/0x12
> [1177853.192214]  [] register_netdev+0x11/0x27
> [1177853.192215]  [] ppp_ioctl+0x2ee/0xb9d
> [1177853.192218]  [] do_vfs_ioctl+0x418/0x460
> [1177853.192219]  [] ? _raw_spin_lock+0x9/0xb
> [1177853.192221]  [] ? __fget+0x2a/0x69
> [1177853.19]  [] SyS_ioctl+0x4e/0x7e
> [1177853.192224]  [] entry_SYSCALL_64_fastpath+0x12/0x6a

[...]

> [1177853.192306] INFO: task accel-pppd:6838 blocked for more than 120 seconds.
> [1177853.192307]   Not tainted 4.2.0 #1
> [1177853.192308] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> this message.
> [1177853.192308] accel-pppd  D 88021fc14680 0  6838  1
> 0x
> [1177853.192310]  8800c5ca7d08 0082 000380ef
> 81a114c0
> [1177853.192311]  0f48ff3a 8800d76c3ac0 
> 8800c5ca8000
> [1177853.192313]  8800d76c3ac0 880215e3cdac 880215e3cdb0
> 
> [1177853.192314] Call Trace:
> [1177853.192316]  [] schedule+0x71/0x80
> [1177853.192317]  [] schedule_preempt_disabled+0x9/0xb
> [1177853.192319]  [] __mutex_lock_slowpath+0xa6/0x104
> [1177853.192320]  [] mutex_lock+0x13/0x24
> [1177853.192322]  [] ppp_dev_uninit+0x62/0xae
> [1177853.192324]  [] rollback_registered_many+0x19e/0x252
> [1177853.192325]  [] rollback_registered+0x29/0x38
> [1177853.192327]  [] unregister_netdevice_queue+0x6a/0x77
> [1177853.192328]  [] ppp_release+0x3f/0x73
> [1177853.192330]  [] __fput+0xdf/0x184
> [1177853.192332]  [] fput+0x9/0xb
> [1177853.192335]  [] task_work_run+0x7b/0x94
> [1177853.192336]  [] do_notify_resume+0x40/0x44
> [1177853.192338]  [] int_signal+0x12/0x17

This should be fixed by 58a89ecaca53 ("ppp: fix lockdep splat in
ppp_dev_uninit()").

Only Linux 4.2 is impacted (not counting -rcX versions). The fix has
been integrated into the -stable tree with Linux 4.2.3.

If lockups occur on Linux 4.1.12 or Linux 4.3, they most likely have
another origin.


Re: [PATCH net] sctp: translate host order to network order when setting a hmacid

2015-11-12 Thread Neil Horman
On Thu, Nov 12, 2015 at 01:07:07PM +0800, Xin Long wrote:
> now sctp auth cannot work well when setting a hmacid manually, which
> is caused by that we didn't use the network order for hmacid, so fix
> it by adding the transformation in sctp_auth_ep_set_hmacs.
> 
> even we set hmacid with the network order in userspace, it still
> can't work, because of this condition in sctp_auth_ep_set_hmacs():
> 
>   if (id > SCTP_AUTH_HMAC_ID_MAX)
>   return -EOPNOTSUPP;
> 
> so this wasn't working before and thus it won't break compatibility.
> 
> Signed-off-by: Xin Long 
> Signed-off-by: Marcelo Ricardo Leitner 
> ---
>  net/sctp/auth.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/net/sctp/auth.c b/net/sctp/auth.c
> index 4f15b7d..1543e39 100644
> --- a/net/sctp/auth.c
> +++ b/net/sctp/auth.c
> @@ -809,8 +809,8 @@ int sctp_auth_ep_set_hmacs(struct sctp_endpoint *ep,
>   if (!has_sha1)
>   return -EINVAL;
>  
> - memcpy(ep->auth_hmacs_list->hmac_ids, &hmacs->shmac_idents[0],
> - hmacs->shmac_num_idents * sizeof(__u16));
> + for (i = 0; i < hmacs->shmac_num_idents; i++)
> + ep->auth_hmacs_list->hmac_ids[i] = htons(hmacs->shmac_idents[i]);
>   ep->auth_hmacs_list->param_hdr.length = htons(sizeof(sctp_paramhdr_t) +
>   hmacs->shmac_num_idents * sizeof(__u16));
>   return 0;
> -- 
> 2.1.0
> 

Acked-by: Neil Horman 


Re: [PATCH] vhost: move is_le setup to the backend

2015-11-12 Thread Greg Kurz
On Thu, 12 Nov 2015 15:46:30 +0200
"Michael S. Tsirkin"  wrote:

> On Fri, Oct 30, 2015 at 12:42:35PM +0100, Greg Kurz wrote:
> > The vq->is_le field is used to fix endianness when accessing the vring via
> > the cpu_to_vhost16() and vhost16_to_cpu() helpers in the following cases:
> > 
> > 1) host is big endian and device is modern virtio
> > 
> > 2) host has cross-endian support and device is legacy virtio with a 
> > different
> >endianness than the host
> > 
> > Both cases rely on the VHOST_SET_FEATURES ioctl, but 2) also needs the
> > VHOST_SET_VRING_ENDIAN ioctl to be called by userspace. Since vq->is_le
> > is only needed when the backend is active, it was decided to set it at
> > backend start.
> > 
> > This is currently done in vhost_init_used()->vhost_init_is_le() but it
> > obfuscates the core vhost code. This patch moves the is_le setup to a
> > dedicated function that is called from the backend code.
> > 
> > Note vhost_net is the only backend that can pass vq->private_data == NULL to
> > vhost_init_used(), hence the "if (sock)" branch.
> > 
> > No behaviour change.
> > 
> > Signed-off-by: Greg Kurz 
> 
> I plan to look at this next week, busy with QEMU 2.5 now.
> 

I don't have any deadline for this since this is only a cleanup attempt.

Thanks.

> > ---
> >  drivers/vhost/net.c   |6 ++
> >  drivers/vhost/scsi.c  |3 +++
> >  drivers/vhost/test.c  |2 ++
> >  drivers/vhost/vhost.c |   12 +++-
> >  drivers/vhost/vhost.h |1 +
> >  5 files changed, 19 insertions(+), 5 deletions(-)
> > 
> > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> > index 9eda69e40678..d6319cb2664c 100644
> > --- a/drivers/vhost/net.c
> > +++ b/drivers/vhost/net.c
> > @@ -917,6 +917,12 @@ static long vhost_net_set_backend(struct vhost_net *n, 
> > unsigned index, int fd)
> >  
> > vhost_net_disable_vq(n, vq);
> > vq->private_data = sock;
> > +
> > +   if (sock)
> > +   vhost_set_is_le(vq);
> > +   else
> > +   vq->is_le = virtio_legacy_is_little_endian();
> > +
> > r = vhost_init_used(vq);
> > if (r)
> > goto err_used;
> > diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
> > index e25a23692822..e2644a301fa5 100644
> > --- a/drivers/vhost/scsi.c
> > +++ b/drivers/vhost/scsi.c
> > @@ -1276,6 +1276,9 @@ vhost_scsi_set_endpoint(struct vhost_scsi *vs,
> > vq = &vs->vqs[i].vq;
> > mutex_lock(&vq->mutex);
> > vq->private_data = vs_tpg;
> > +
> > +   vhost_set_is_le(vq);
> > +
> > vhost_init_used(vq);
> > mutex_unlock(&vq->mutex);
> > }
> > diff --git a/drivers/vhost/test.c b/drivers/vhost/test.c
> > index f2882ac98726..b1c7df502211 100644
> > --- a/drivers/vhost/test.c
> > +++ b/drivers/vhost/test.c
> > @@ -196,6 +196,8 @@ static long vhost_test_run(struct vhost_test *n, int 
> > test)
> > oldpriv = vq->private_data;
> > vq->private_data = priv;
> >  
> > +   vhost_set_is_le(vq);
> > +
> > r = vhost_init_used(&n->vqs[index]);
> >  
> > mutex_unlock(&vq->mutex);
> > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > index eec2f11809ff..6be863dcbd13 100644
> > --- a/drivers/vhost/vhost.c
> > +++ b/drivers/vhost/vhost.c
> > @@ -113,6 +113,12 @@ static void vhost_init_is_le(struct vhost_virtqueue 
> > *vq)
> >  }
> >  #endif /* CONFIG_VHOST_CROSS_ENDIAN_LEGACY */
> >  
> > +void vhost_set_is_le(struct vhost_virtqueue *vq)
> > +{
> > +   vhost_init_is_le(vq);
> > +}
> > +EXPORT_SYMBOL_GPL(vhost_set_is_le);
> > +
> >  static void vhost_poll_func(struct file *file, wait_queue_head_t *wqh,
> > poll_table *pt)
> >  {
> > @@ -1156,12 +1162,8 @@ int vhost_init_used(struct vhost_virtqueue *vq)
> >  {
> > __virtio16 last_used_idx;
> > int r;
> > -   if (!vq->private_data) {
> > -   vq->is_le = virtio_legacy_is_little_endian();
> > +   if (!vq->private_data)
> > return 0;
> > -   }
> > -
> > -   vhost_init_is_le(vq);
> >  
> > r = vhost_update_used_flags(vq);
> > if (r)
> > diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> > index 4772862b71a7..8a62041959fe 100644
> > --- a/drivers/vhost/vhost.h
> > +++ b/drivers/vhost/vhost.h
> > @@ -162,6 +162,7 @@ bool vhost_enable_notify(struct vhost_dev *, struct 
> > vhost_virtqueue *);
> >  
> >  int vhost_log_write(struct vhost_virtqueue *vq, struct vhost_log *log,
> > unsigned int log_num, u64 len);
> > +void vhost_set_is_le(struct vhost_virtqueue *vq);
> >  
> >  #define vq_err(vq, fmt, ...) do {  \
> > pr_debug(pr_fmt(fmt), ##__VA_ARGS__);   \
> 


Re: [linux-4.4-mw] BUG: unable to handle kernel paging request ip_vs_out.constprop

2015-11-12 Thread Eric Dumazet
On Thu, 2015-11-12 at 11:08 +0100, Sander Eikelenboom wrote:
> Hi All,
> 
> Just got a crash with a linux-4.4-mw kernel.
> I'm using a routed bridge and apart from the splat below i have got some 
> interesting other messages that aren't there in 4.3 (and perhaps are of 
> interest for the crash as well):
> [  207.033768] vif vif-1-0 vif1.0: set_features() failed (-1); wanted 
> 0x00044803, left 0x000400114813
> [  207.033780] vif vif-1-0 vif1.0: set_features() failed (-1); wanted 
> 0x00044803, left 0x000400114813
> [  207.245435] xen_bridge: error setting offload STP state on port 
> 1(vif1.0)
> [  207.245442] vif vif-1-0 vif1.0: failed to set HW ageing time
> [  207.245443] xen_bridge: error setting offload STP state on port 
> 1(vif1.0)
> [  207.245491] vif vif-1-0 vif1.0: set_features() failed (-1); wanted 
> 0x00044803, left 0x000400114813
> 
> The commit message for the commit that introduced the "set HW ageing 
> time" error message, doesn't seem to tell
> me much about its purpose. If it's not related I can report it as a
> separate issue.
> 
> --
> Sander
> 
> The crash:
> [  354.328687] BUG: unable to handle kernel paging request at 
> 880049aa8000
> [  354.350206] IP: [] ip_vs_out.constprop.25+0x47/0x60
> [  354.360882] PGD 2212067 PUD 25b4067 PMD 5ffb6067 PTE 0
> [  354.371587] Oops:  [#1] SMP
> [  354.382143] Modules linked in:
> [  354.392537] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 
> 4.3.0-mw-2015-linus-doflr+ #1
> [  354.403105] Hardware name: MSI MS-7640/890FXA-GD70 (MS-7640)  , BIOS 
> V1.8B1 09/13/2010
> [  354.413666] task: 82218580 ti: 8220 task.ti: 
> 8220
> [  354.424255] RIP: e030:[]  [] 
> ip_vs_out.constprop.25+0x47/0x60
> [  354.434742] RSP: e02b:88005f6034b0  EFLAGS: 00010246
> [  354.445006] RAX: 0001 RBX: 88005f6034f8 RCX: 
> 880049aa7ce0
> [  354.455262] RDX: 88003c0e5500 RSI: 0003 RDI: 
> 880004e0e800
> [  354.465422] RBP: 88005f6034b8 R08: 0014 R09: 
> 0003
> [  354.475508] R10: 0001 R11: 880040f394cc R12: 
> 88005f603528
> [  354.485567] R13: 88003c0e5500 R14: 822da2e8 R15: 
> 88003c0e5500
> [  354.495595] FS:  7f0243c2b700() GS:88005f60() 
> knlGS:
> [  354.505474] CS:  e033 DS:  ES:  CR0: 8005003b
> [  354.515135] CR2: 880049aa8000 CR3: 59271000 CR4: 
> 0660
> [  354.524794] Stack:
> [  354.534319]  81a074fc 88005f6034e8 8199e138 
> 88003c0e5500
> [  354.543981]  88005f603528 88003c0e5500  
> 88005f603518
> [  354.553577]  8199e1af 880005300048 88003c0e5500 
> 822da2e8
> [  354.563160] Call Trace:
> [  354.572418]  
> [  354.572480]  [] ? ip_vs_local_reply4+0x1c/0x20
> [  354.590458]  [] nf_iterate+0x58/0x70
> [  354.599372]  [] nf_hook_slow+0x5f/0xb0
> [  354.608245]  [] __ip_local_out+0x9e/0xb0
> [  354.617036]  [] ? ip_forward_options+0x1a0/0x1a0
> [  354.625874]  [] ip_local_out+0x17/0x40
> [  354.634383]  [] ip_build_and_send_pkt+0x148/0x1c0
> [  354.642715]  [] tcp_v4_send_synack+0x56/0xa0
> [  354.650893]  [] ? 
> inet_csk_reqsk_queue_hash_add+0x68/0x90
> [  354.659083]  [] tcp_conn_request+0x95d/0x970
> [  354.667196]  [] ? __local_bh_enable_ip+0x26/0x90
> [  354.675246]  [] tcp_v4_conn_request+0x47/0x50
> [  354.683254]  [] tcp_rcv_state_process+0x183/0xca0
> [  354.691004]  [] tcp_v4_do_rcv+0x5c/0x1f0
> [  354.698533]  [] tcp_v4_rcv+0x987/0x9a0
> [  354.705968]  [] ? ipv4_confirm+0x78/0xf0
> [  354.713370]  [] ip_local_deliver_finish+0x84/0x120
> [  354.720739]  [] ip_local_deliver+0x42/0xd0
> [  354.728029]  [] ? inet_del_offload+0x40/0x40
> [  354.735270]  [] ip_rcv_finish+0x106/0x320
> [  354.742413]  [] ip_rcv+0x211/0x370
> [  354.749268]  [] ? 
> ip_local_deliver_finish+0x120/0x120
> [  354.755929]  [] 
> __netif_receive_skb_core+0x2cb/0x970
> [  354.762535]  [] ? nf_nat_setup_info+0x7a/0x2f0
> [  354.769131]  [] __netif_receive_skb+0x11/0x70
> [  354.775481]  [] 
> netif_receive_skb_internal+0x1e/0x80
> [  354.781638]  [] ? nf_hook_slow+0x5f/0xb0
> [  354.787771]  [] netif_receive_skb+0x9/0x10
> [  354.793916]  [] br_handle_frame_finish+0x178/0x4b0
> [  354.800077]  [] ? nf_nat_ipv4_fn+0x167/0x1e0
> [  354.806260]  [] ? br_handle_local_finish+0x50/0x50
> [  354.812405]  [] 
> br_nf_pre_routing_finish+0x183/0x360
> [  354.818574]  [] ? br_netif_receive_skb+0x10/0x10
> [  354.824775]  [] br_nf_pre_routing+0x2a7/0x380
> [  354.830780]  [] ? br_nf_forward_ip+0x3f0/0x3f0
> [  354.836567]  [] nf_iterate+0x58/0x70
> [  354.842281]  [] nf_hook_slow+0x5f/0xb0
> [  354.847886]  [] br_handle_frame+0x1a2/0x290
> [  354.853520]  [] ? br_netif_receive_skb+0x10/0x10
> [  354.859206]  [] ? 
> br_handle_frame_finish+0x4b0/0x4b0
> [  354.864824]  [] 
> __netif_receive_skb_core+0x12b/0x970
> [  354.870350]  [] ? 
> 

Re: [PATCH 5/8] mm: memcontrol: account socket memory on unified hierarchy

2015-11-12 Thread Johannes Weiner
On Thu, Nov 12, 2015 at 06:36:20PM +, Mel Gorman wrote:
> Bottom line, there is legimate confusion over whether cgroup controllers
> are going to be enabled by default or not in the future. If they are
> enabled by default, there is a non-zero cost to that and a change in
> semantics that people may or may not be surprised by.

Thanks for elaborating, Mel.

My understanding is that this is a plain bug. I don't think anybody
wants to put costs without benefits on their users.

But I'll keep an eye on these reports, and I'll work with the systemd
people should issues with the kernel interface materialize that would
force them to enable resource control prematurely.


Re: [PATCH] unix: avoid use-after-free in ep_remove_wait_queue

2015-11-12 Thread Rainer Weikusat
Jason Baron  writes:
>> +
>> +/* Needs sk unix state lock. After recv_ready indicated not ready,
>> + * establish peer_wait connection if still needed.
>> + */
>> +static int unix_dgram_peer_wake_me(struct sock *sk, struct sock *other)
>> +{
>> +int connected;
>> +
>> +connected = unix_dgram_peer_wake_connect(sk, other);
>> +
>> +if (unix_recvq_full(other))
>> +return 1;
>> +
>> +if (connected)
>> +unix_dgram_peer_wake_disconnect(sk, other);
>> +
>> +return 0;
>> +}
>> +
>
> So the comment above this function says 'needs unix state lock', however
> the usage in unix_dgram_sendmsg() has the 'other' lock, while the usage
> in unix_dgram_poll() has the 'sk' lock. So this looks racy.

That's one thing which is broken with this patch. Judging from a 'quick'
look at the _dgram_sendmsg code, the unix_state_lock(other) will need to
be turned into a unix_state_double_lock(sk, other) and the remaining
code changed accordingly (since all of the checks must be done without
unlocking other). 

There's also something else seriously wrong with the present patch: Some
code in unix_dgram_connect presently (with this change) looks like this:

/*
 * If it was connected, reconnect.
 */
if (unix_peer(sk)) {
struct sock *old_peer = unix_peer(sk);
unix_peer(sk) = other;

if (unix_dgram_peer_wake_disconnect(sk, other))
wake_up_interruptible_poll(sk_sleep(sk),
   POLLOUT |
   POLLWRNORM |
   POLLWRBAND);

unix_state_double_unlock(sk, other);

if (other != old_peer)
unix_dgram_disconnected(sk, old_peer);
sock_put(old_peer);

and trying to disconnect from a peer the socket is just being
connected to is - of course - "flowering tomfoolery" (literal
translation of the German "bluehender Bloedsinn") --- it needs to
disconnect from old_peer instead.
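In other words, a sketch of the corrected reconnect path using only the
helpers already shown above (illustrative, not the eventual patch):

    if (unix_peer(sk)) {
        struct sock *old_peer = unix_peer(sk);

        unix_peer(sk) = other;

        /* Drop the wakeup registration on the *old* peer, not on the
         * peer we are just connecting to. */
        if (unix_dgram_peer_wake_disconnect(sk, old_peer))
            wake_up_interruptible_poll(sk_sleep(sk),
                                       POLLOUT |
                                       POLLWRNORM |
                                       POLLWRBAND);

        unix_state_double_unlock(sk, other);

        if (other != old_peer)
            unix_dgram_disconnected(sk, old_peer);
        sock_put(old_peer);
    }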

I'll address the suggestion and send an updated patch "later today" (may
become "early tomorrow"). I have some code addressing both issues but
that's part of a release of 'our' kernel fork, ie, 3.2.54-based I'll
need to do 'soon'.


Re: [PATCH] net: phy: vitesse: add support for VSC8601

2015-11-12 Thread Florian Fainelli
On 12/11/15 10:41, Mans Rullgard wrote:
> This adds support for the Vitesse VSC8601 PHY. Generic functions are
> used for everything except interrupt handling.
> 
> Signed-off-by: Mans Rullgard 

Reviewed-by: Florian Fainelli 
-- 
Florian


Re: [PATCH] net: phy: at803x: support interrupt on 8030 and 8035

2015-11-12 Thread Florian Fainelli
On 12/11/15 11:09, Måns Rullgård wrote:
> On 12 November 2015 19:06:23 GMT+00:00, Mason  wrote:
>> On 12/11/2015 18:40, Mans Rullgard wrote:
>>> Commit 77a993942 "phy/at8031: enable at8031 to work on interrupt
>> mode"
>>> added interrupt support for the 8031 PHY but left out the other two
>>> chips supported by this driver.
>>>
>>> This patch sets the .ack_interrupt and .config_intr functions for the
>>> 8030 and 8035 drivers as well.
>>>
>>> Signed-off-by: Mans Rullgard 
>>> ---
>>> I have only tested this with an 8035.  I can't find a datasheet for
>>> the 8030, but since 8031, 8032, and 8035 all have the same register
>>> layout, there's a good chance 8030 does as well.
>>> ---
>>>  drivers/net/phy/at803x.c | 4 
>>>  1 file changed, 4 insertions(+)
>>>
>>> diff --git a/drivers/net/phy/at803x.c b/drivers/net/phy/at803x.c
>>> index fabf11d..2d020a3 100644
>>> --- a/drivers/net/phy/at803x.c
>>> +++ b/drivers/net/phy/at803x.c
>>> @@ -308,6 +308,8 @@ static struct phy_driver at803x_driver[] = {
>>> .flags  = PHY_HAS_INTERRUPT,
>>> .config_aneg= genphy_config_aneg,
>>> .read_status= genphy_read_status,
>>> +   .ack_interrupt  = at803x_ack_interrupt,
>>> +   .config_intr= at803x_config_intr,
>>> .driver = {
>>> .owner = THIS_MODULE,
>>> },
>>> @@ -327,6 +329,8 @@ static struct phy_driver at803x_driver[] = {
>>> .flags  = PHY_HAS_INTERRUPT,
>>> .config_aneg= genphy_config_aneg,
>>> .read_status= genphy_read_status,
>>> +   .ack_interrupt  = at803x_ack_interrupt,
>>> +   .config_intr= at803x_config_intr,
>>> .driver = {
>>> .owner = THIS_MODULE,
>>> },
>>
>> Shouldn't we take the opportunity to clean up the duplicated register
>> definitions? (I'll send an informal patch to spur discussion.)
>>
>> Regards.
> 
> That can be done independently. Feel free to send a patch.

Agreed, that deserve a separate patch.
-- 
Florian


Re: [PATCH] net: phy: at803x: support interrupt on 8030 and 8035

2015-11-12 Thread Florian Fainelli
On 12/11/15 09:40, Mans Rullgard wrote:
> Commit 77a993942 "phy/at8031: enable at8031 to work on interrupt mode"
> added interrupt support for the 8031 PHY but left out the other two
> chips supported by this driver.
> 
> This patch sets the .ack_interrupt and .config_intr functions for the
> 8030 and 8035 drivers as well.
> 
> Signed-off-by: Mans Rullgard 

Reviewed-by: Florian Fainelli 
-- 
Florian


Re: [PATCH 00/10] Netfilter fixes for net

2015-11-12 Thread David Miller
From: Pablo Neira Ayuso 
Date: Wed, 11 Nov 2015 18:33:33 +0100

> The following patchset contains Netfilter fixes for your net tree. This
> large batch that includes fixes for ipset, netfilter ingress, nf_tables
> dynamic set instantiation and a longstanding Kconfig dependency problem.
> More specifically, they are:
 ...
> You can pull these changes from:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf.git

Pulled, thanks a lot Pablo.


Re: [PATCH 1/2] arm64: bpf: add 'store immediate' instruction

2015-11-12 Thread Shi, Yang

On 11/11/2015 4:39 AM, Will Deacon wrote:

On Wed, Nov 11, 2015 at 12:12:56PM +, Will Deacon wrote:

On Tue, Nov 10, 2015 at 06:45:39PM -0800, Z Lim wrote:

On Tue, Nov 10, 2015 at 2:41 PM, Yang Shi  wrote:

aarch64 doesn't have native store immediate instruction, such operation


Actually, aarch64 does have "STR (immediate)". For arm64 JIT, we can
consider using it as an optimization.


Yes, I'd definitely like to see that in preference to moving via a
temporary register.


Wait a second, we're both talking rubbish here :) The STR (immediate)
form is referring to the addressing mode, whereas this patch wants to
store an immediate value to memory, which does need moving to a register
first.


Yes, the immediate here refers to an immediate offset in the addressing
mode; it doesn't mean storing an immediate value to memory.


I don't think any load-store architecture has a store-immediate instruction.

Thanks,
Yang



So the original patch is fine.

Will
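So for a BPF store of an immediate, the JIT ends up emitting two instructions;
roughly, in pseudo-C (the emit helpers and register names here are
illustrative, not the actual arm64 JIT functions):

    case BPF_ST | BPF_MEM | BPF_W:
        /* No store-immediate-to-memory on AArch64: materialise the
         * immediate in a scratch register, then do a normal store. */
        emit_mov_imm(tmp_reg, insn->imm, ctx);           /* tmp = imm */
        emit_store32(tmp_reg, dst_reg, insn->off, ctx);  /* *(u32 *)(dst + off) = tmp */
        break;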





Re: [PATCH 1/2] arm64: bpf: fix JIT frame pointer setup

2015-11-12 Thread Z Lim
On Thu, Nov 12, 2015 at 1:57 PM, Yang Shi  wrote:
> BPF fp should point to the top of the BPF prog stack. The original
> implementation made it point to the bottom incorrectly.
> Move A64_SP to fp before reserve BPF prog stack space.
>
> CC: Zi Shen Lim 
> CC: Xi Wang 
> Signed-off-by: Yang Shi 
> ---

Reviewed-by: Zi Shen Lim 

Also,

Fixes: e54bcde3d69d ("arm64: eBPF JIT compiler")
Cc:  # 3.18+


Re: [net-next 04/17] drivers/net/intel: use napi_complete_done()

2015-11-12 Thread Eric Dumazet
On Thu, 2015-10-15 at 14:43 -0700, Jeff Kirsher wrote:
> From: Jesse Brandeburg 
> 
> As per Eric Dumazet's previous patches:
> (see commit (24d2e4a50737) - tg3: use napi_complete_done())
> 
> Quoting verbatim:
> Using napi_complete_done() instead of napi_complete() allows
> us to use /sys/class/net/ethX/gro_flush_timeout
> 
> GRO layer can aggregate more packets if the flush is delayed a bit,
> without having to set too big coalescing parameters that impact
> latencies.
> 
> 
> Tested
> configuration: low latency via ethtool -C ethx adaptive-rx off
>   rx-usecs 10 adaptive-tx off tx-usecs 15
> workload: streaming rx using netperf TCP_MAERTS
> 
> igb:
> MIGRATED TCP MAERTS TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.0.0.1 () 
> port 0 AF_INET : demo
> ...
> Interim result:  941.48 10^6bits/s over 1.000 seconds ending at 1440193171.589
> 
> Alignment  Offset BytesBytes   Recvs   BytesSends
> Local  Remote  Local  Remote  Xfered   Per Per
> Recv   SendRecv   Send Recv (avg)  Send (avg)
> 8   8  0   0 1176930056  1475.36797726   16384.00  71905
> 
> MIGRATED TCP MAERTS TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.0.0.1 () 
> port 0 AF_INET : demo
> ...
> Interim result:  941.49 10^6bits/s over 0.997 seconds ending at 1440193142.763
> 
> Alignment  Offset BytesBytes   Recvs   BytesSends
> Local  Remote  Local  Remote  Xfered   Per Per
> Recv   SendRecv   Send Recv (avg)  Send (avg)
> 8   8  0   0 1175182320  50476.00 23282   16384.00  71816
> 
> i40e:
> Hard to test because the traffic is incoming so fast (24Gb/s) that GRO
> always receives 87kB, even at the highest interrupt rate.
> 
> Other drivers were only compile tested.
> 
> Signed-off-by: Jesse Brandeburg 
> Tested-by: Andrew Bowers 
> Signed-off-by: Jeff Kirsher 


Hi guys

I am not sure the ixgbe part is working :

ixgbe_qv_unlock_napi() does :

/* flush any outstanding Rx frames */
if (q_vector->napi.gro_list)
napi_gro_flush(&q_vector->napi, false);

And it is called before napi_complete_done(napi, work_done);
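For comparison, the ordering the change relies on looks roughly like this in a
poll routine (a sketch; example_clean_rx() stands in for the driver's RX
cleanup):

    static int example_napi_poll(struct napi_struct *napi, int budget)
    {
        int work_done = example_clean_rx(napi, budget);

        if (work_done < budget)
            /* The GRO flush (or the gro_flush_timeout deferral) happens
             * inside napi_complete_done(), so flushing by hand just
             * before this call defeats the point of the change. */
            napi_complete_done(napi, work_done);

        return work_done;
    }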





Re: [PATCH 2/2] arm64: bpf: make BPF prologue and epilogue align with ARM64 AAPCS

2015-11-12 Thread Z Lim
On Thu, Nov 12, 2015 at 1:57 PM, Yang Shi  wrote:
>
> Save and restore FP/LR in BPF prog prologue and epilogue, save SP to FP
> in prologue in order to get the correct stack backtrace.
>
> However, the ARM64 JIT used FP (x29) as the eBPF fp register. FP is subject to
> change during function calls, so it may cause the BPF prog stack base address
> to change too.
>
> Use x25 to replace FP as the BPF stack base register (fp). Since x25 is a
> callee-saved register, it will stay intact across function calls.
> It is initialized in the BPF prog prologue every time the BPF prog starts to
> run. When the BPF prog exits, it can simply be tossed.
>
> So, the BPF stack layout looks like:
>
>  high
>  original A64_SP =>   0:+-+ BPF prologue
> | | FP/LR and callee saved registers
>  BPF fp register => -64:+-+
> | |
> | ... | BPF prog stack
> | |
> | |
>  current A64_SP/FP =>   +-+
> | |
> | ... | Function call stack
> | |
> +-+
>   low
>

Yang, for stack unwinding to work, shouldn't it be something like the following?

  | LR |
A64_FP => | FP |
  | .. |
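
(For reference, a frame-pointer unwinder on arm64 expects chained frame
records shaped like this -- sketch, struct and function names mine:)

struct frame_record {		/* x29 (FP) points at this pair */
	unsigned long prev_fp;	/* [x29]     : previous frame's record */
	unsigned long lr;	/* [x29, #8] : return address of this frame */
};

static void walk_stack(unsigned long fp)
{
	while (fp) {
		const struct frame_record *rec = (const struct frame_record *)fp;

		/* rec->lr is the return address for the current frame */
		fp = rec->prev_fp;
	}
}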


Re: [PATCH 1/2] arm64: bpf: add 'store immediate' instruction

2015-11-12 Thread Z Lim
On Thu, Nov 12, 2015 at 11:33 AM, Shi, Yang  wrote:
> On 11/11/2015 4:39 AM, Will Deacon wrote:
>>
>> Wait a second, we're both talking rubbish here :) The STR (immediate)
>> form is referring to the addressing mode, whereas this patch wants to
>> store an immediate value to memory, which does need moving to a register
>> first.
>
>
> Yes, the immediate means immediate offset for addressing index. Doesn't mean
> to store immediate to memory.
>
> I don't think any load-store architecture has store immediate instruction.
>

Indeed. Sorry for the noise.

Somehow Will caught a whiff of whatever I was smoking then :)


Re: [PATCH 08/14] net: tcp_memcontrol: sanitize tcp memory accounting callbacks

2015-11-12 Thread Eric Dumazet
On Thu, 2015-11-12 at 18:41 -0500, Johannes Weiner wrote:


> @@ -711,6 +705,12 @@ static inline void mem_cgroup_wb_stats(struct 
> bdi_writeback *wb,
>  struct sock;
>  void sock_update_memcg(struct sock *sk);
>  void sock_release_memcg(struct sock *sk);
> +bool mem_cgroup_charge_skmem(struct cg_proto *proto, unsigned int nr_pages);
> +void mem_cgroup_uncharge_skmem(struct cg_proto *proto, unsigned int 
> nr_pages);
> +static inline bool mem_cgroup_under_socket_pressure(struct cg_proto *proto)
> +{
> + return proto->memory_pressure;
> +}
>  #endif /* CONFIG_INET && CONFIG_MEMCG_KMEM */
>  
>  #ifdef CONFIG_MEMCG_KMEM
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 2eefc99..8cc7613 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -1126,8 +1126,8 @@ static inline bool sk_under_memory_pressure(const 
> struct sock *sk)
>   if (!sk->sk_prot->memory_pressure)
>   return false;
>  
> - if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
> - return !!sk->sk_cgrp->memory_pressure;
> + if (mem_cgroup_sockets_enabled && sk->sk_cgrp &&
> + mem_cgroup_under_socket_pressure(sk->sk_cgrp))
>  
>   return !!*sk->sk_prot->memory_pressure;
>  }


This looks wrong ?

if (A && B && C)
return !!*sk->sk_prot->memory_pressure;


}
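
Presumably the intent was something along these lines (sketch):

	if (mem_cgroup_sockets_enabled && sk->sk_cgrp &&
	    mem_cgroup_under_socket_pressure(sk->sk_cgrp))
		return true;

	return !!*sk->sk_prot->memory_pressure;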









[PATCH] net: hisilicon: fix binding document of mdio

2015-11-12 Thread huangdaode
This patch fixes the binding document to explain when "hisilicon,mdio"
should be used, according to Arnd's comments: it specifies that it is only
used for hip04.

First, please give your comments.

Signed-off-by: huangdaode 
---
 Documentation/devicetree/bindings/net/hisilicon-hns-mdio.txt | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/Documentation/devicetree/bindings/net/hisilicon-hns-mdio.txt 
b/Documentation/devicetree/bindings/net/hisilicon-hns-mdio.txt
index 9c23fdf..f650e78 100644
--- a/Documentation/devicetree/bindings/net/hisilicon-hns-mdio.txt
+++ b/Documentation/devicetree/bindings/net/hisilicon-hns-mdio.txt
@@ -1,7 +1,9 @@
 Hisilicon MDIO bus controller
 
 Properties:
-- compatible: "hisilicon,mdio","hisilicon,hns-mdio".
+- compatible: can be one of "hisilicon,hns-mdio","hisilicon,mdio",
+  for hip04 board, please use "hisilicon,mdio",
+  other boards, "hisilicon,hns-mdio" is OK.
 - reg: The base address of the MDIO bus controller register bank.
 - #address-cells: Must be <1>.
 - #size-cells: Must be <0>.  MDIO addresses have no size component.
-- 
1.9.1



Re: [linux-4.4-mw] BUG: unable to handle kernel paging request ip_vs_out.constprop

2015-11-12 Thread Sander Eikelenboom

On 2015-11-12 15:09, Eric Dumazet wrote:

On Thu, 2015-11-12 at 11:08 +0100, Sander Eikelenboom wrote:

Hi All,

Just got a crash with a linux-4.4-mw kernel.
I'm using a routed bridge and apart from the splat below I have got some
interesting other messages that aren't there in 4.3 (and perhaps are of
interest for the crash as well):
[  207.033768] vif vif-1-0 vif1.0: set_features() failed (-1); wanted
0x00044803, left 0x000400114813
[  207.033780] vif vif-1-0 vif1.0: set_features() failed (-1); wanted
0x00044803, left 0x000400114813
[  207.245435] xen_bridge: error setting offload STP state on port
1(vif1.0)
[  207.245442] vif vif-1-0 vif1.0: failed to set HW ageing time
[  207.245443] xen_bridge: error setting offload STP state on port
1(vif1.0)
[  207.245491] vif vif-1-0 vif1.0: set_features() failed (-1); wanted
0x00044803, left 0x000400114813

The commit message for the commit that introduced the "set HW ageing
time" error message doesn't seem to tell me much about its purpose.
If it's not related, I can report it as a separate issue.

--
Sander

The crash:
[  354.328687] BUG: unable to handle kernel paging request at
880049aa8000
[  354.350206] IP: [] 
ip_vs_out.constprop.25+0x47/0x60

[  354.360882] PGD 2212067 PUD 25b4067 PMD 5ffb6067 PTE 0
[  354.371587] Oops:  [#1] SMP
[  354.382143] Modules linked in:
[  354.392537] CPU: 0 PID: 0 Comm: swapper/0 Not tainted
4.3.0-mw-2015-linus-doflr+ #1
[  354.403105] Hardware name: MSI MS-7640/890FXA-GD70 (MS-7640)  , 
BIOS

V1.8B1 09/13/2010
[  354.413666] task: 82218580 ti: 8220 task.ti:
8220
[  354.424255] RIP: e030:[]  []
ip_vs_out.constprop.25+0x47/0x60
[  354.434742] RSP: e02b:88005f6034b0  EFLAGS: 00010246
[  354.445006] RAX: 0001 RBX: 88005f6034f8 RCX:
880049aa7ce0
[  354.455262] RDX: 88003c0e5500 RSI: 0003 RDI:
880004e0e800
[  354.465422] RBP: 88005f6034b8 R08: 0014 R09:
0003
[  354.475508] R10: 0001 R11: 880040f394cc R12:
88005f603528
[  354.485567] R13: 88003c0e5500 R14: 822da2e8 R15:
88003c0e5500
[  354.495595] FS:  7f0243c2b700() GS:88005f60()
knlGS:
[  354.505474] CS:  e033 DS:  ES:  CR0: 8005003b
[  354.515135] CR2: 880049aa8000 CR3: 59271000 CR4:
0660
[  354.524794] Stack:
[  354.534319]  81a074fc 88005f6034e8 8199e138
88003c0e5500
[  354.543981]  88005f603528 88003c0e5500 
88005f603518
[  354.553577]  8199e1af 880005300048 88003c0e5500
822da2e8
[  354.563160] Call Trace:
[  354.572418]  
[  354.572480]  [] ? ip_vs_local_reply4+0x1c/0x20
[  354.590458]  [] nf_iterate+0x58/0x70
[  354.599372]  [] nf_hook_slow+0x5f/0xb0
[  354.608245]  [] __ip_local_out+0x9e/0xb0
[  354.617036]  [] ? ip_forward_options+0x1a0/0x1a0
[  354.625874]  [] ip_local_out+0x17/0x40
[  354.634383]  [] ip_build_and_send_pkt+0x148/0x1c0
[  354.642715]  [] tcp_v4_send_synack+0x56/0xa0
[  354.650893]  [] ?
inet_csk_reqsk_queue_hash_add+0x68/0x90
[  354.659083]  [] tcp_conn_request+0x95d/0x970
[  354.667196]  [] ? __local_bh_enable_ip+0x26/0x90
[  354.675246]  [] tcp_v4_conn_request+0x47/0x50
[  354.683254]  [] tcp_rcv_state_process+0x183/0xca0
[  354.691004]  [] tcp_v4_do_rcv+0x5c/0x1f0
[  354.698533]  [] tcp_v4_rcv+0x987/0x9a0
[  354.705968]  [] ? ipv4_confirm+0x78/0xf0
[  354.713370]  [] 
ip_local_deliver_finish+0x84/0x120

[  354.720739]  [] ip_local_deliver+0x42/0xd0
[  354.728029]  [] ? inet_del_offload+0x40/0x40
[  354.735270]  [] ip_rcv_finish+0x106/0x320
[  354.742413]  [] ip_rcv+0x211/0x370
[  354.749268]  [] ?
ip_local_deliver_finish+0x120/0x120
[  354.755929]  []
__netif_receive_skb_core+0x2cb/0x970
[  354.762535]  [] ? nf_nat_setup_info+0x7a/0x2f0
[  354.769131]  [] __netif_receive_skb+0x11/0x70
[  354.775481]  []
netif_receive_skb_internal+0x1e/0x80
[  354.781638]  [] ? nf_hook_slow+0x5f/0xb0
[  354.787771]  [] netif_receive_skb+0x9/0x10
[  354.793916]  [] 
br_handle_frame_finish+0x178/0x4b0

[  354.800077]  [] ? nf_nat_ipv4_fn+0x167/0x1e0
[  354.806260]  [] ? 
br_handle_local_finish+0x50/0x50

[  354.812405]  []
br_nf_pre_routing_finish+0x183/0x360
[  354.818574]  [] ? br_netif_receive_skb+0x10/0x10
[  354.824775]  [] br_nf_pre_routing+0x2a7/0x380
[  354.830780]  [] ? br_nf_forward_ip+0x3f0/0x3f0
[  354.836567]  [] nf_iterate+0x58/0x70
[  354.842281]  [] nf_hook_slow+0x5f/0xb0
[  354.847886]  [] br_handle_frame+0x1a2/0x290
[  354.853520]  [] ? br_netif_receive_skb+0x10/0x10
[  354.859206]  [] ?
br_handle_frame_finish+0x4b0/0x4b0
[  354.864824]  []
__netif_receive_skb_core+0x12b/0x970
[  354.870350]  [] ?
__raw_callee_save___pv_queued_spin_unlock+0x11/0x20
[  354.875880]  [] __netif_receive_skb+0x11/0x70
[  354.881293]  []
netif_receive_skb_internal+0x1e/0x80
[  354.886653]  [] netif_receive_skb+0x9/0x10
[  354.891918]  [] 

[PATCH] bonding: Offloading bonds to hardware

2015-11-12 Thread Premkumar Jonnala
Packet forwarding to/from bond interfaces is done in software.

This patch enables certain platforms to bridge traffic to/from
bond interfaces in hardware.  Notifications are sent out when 
the "active" slave set for a bond interface is updated in 
software.  Platforms use the notifications to program the 
hardware accordingly.  The changes have been verified to work 
with configured and 802.3ad bond interfaces.
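
As a sketch of how a platform driver might hook these notifications (only
the ndo names come from this patch; the driver and its internals are
hypothetical):

#include <linux/netdevice.h>

/* Hypothetical switch driver glue, for illustration only. */
static int foo_sw_bond_slave_add(struct net_device *slave_dev,
				 struct net_device *bond_dev)
{
	/* program the HW LAG/trunk table: this port joins bond_dev's LAG */
	return 0;
}

static int foo_sw_bond_slave_discard(struct net_device *slave_dev,
				     struct net_device *bond_dev)
{
	/* remove the port from the LAG so forwarding falls back to software */
	return 0;
}

static const struct net_device_ops foo_sw_netdev_ops = {
	/* ... existing ndo_* hooks ... */
	.ndo_bond_slave_add	= foo_sw_bond_slave_add,
	.ndo_bond_slave_discard	= foo_sw_bond_slave_discard,
};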

Signed-off-by: Premkumar Jonnala 

---

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index b4351ca..4b53733 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -3759,6 +3759,101 @@ err:
bond_slave_arr_work_rearm(bond, 1);
 }
 
+static int slave_present(struct slave *slave, struct bond_up_slave *arr)
+{
+   int i;
+
+   if (!arr)
+   return 0;
+
+   for (i = 0; i < arr->count; i++) {
+   if (arr->arr[i] == slave)
+   return 1;
+   }
+   return 0;
+}
+
+/* Send notification to clear/remove slaves for 'bond' in 'arr' except for
+ * slaves in 'ignore_arr'.
+ */
+static int bond_slave_arr_clear_notify(struct bonding *bond,
+   struct bond_up_slave *arr,
+   struct bond_up_slave *ignore_arr)
+{
+   struct slave *slave;
+   struct net_device *slave_dev;
+   int i, rv;
+   const struct net_device_ops *ops;
+
+   if (!bond->dev || !arr)
+   return -EINVAL;
+
+   rv = 0;
+   for (i = 0; i < arr->count; i++) {
+   slave = arr->arr[i];
+   if (!slave || !slave->dev)
+   continue;
+
+   slave_dev = slave->dev;
+   if (slave_present(slave, ignore_arr)) {
+   netdev_dbg(bond->dev, "ignoring clear of slave %s\n",
+   slave_dev->name);
+   continue;
+   }
+   ops = slave_dev->netdev_ops;
+   if (!ops || !ops->ndo_bond_slave_discard) {
+   netdev_dbg(bond->dev, "No slave discard ops for %s\n",
+   slave_dev->name);
+   continue;
+   }
+   rv = ops->ndo_bond_slave_discard(slave_dev, bond->dev);
+   if (rv < 0)
+   return rv;
+   }
+   return rv;
+}
+
+/* Send notification about updated slaves for 'bond' except for slaves in
+ * 'ignore_arr'.
+ */
+static int bond_slave_arr_set_notify(struct bonding *bond,
+   struct bond_up_slave *ignore_arr)
+{
+   struct slave *slave;
+   struct net_device *slave_dev;
+   struct bond_up_slave *arr;
+   int i, rv;
+   const struct net_device_ops *ops;
+
+   if (!bond || !bond->dev)
+   return -EINVAL;
+   rv = 0;
+
+   arr = rtnl_dereference(bond->slave_arr);
+   if (!arr)
+   return -EINVAL;
+
+   for (i = 0; i < arr->count; i++) {
+   slave = arr->arr[i];
+   slave_dev = slave->dev;
+   if (slave_present(slave, ignore_arr)) {
+   netdev_dbg(bond->dev, "ignoring add of slave %s\n",
+   slave->dev->name);
+   continue;
+   }
+   ops = slave_dev->netdev_ops;
+   if (!ops || !ops->ndo_bond_slave_add) {
+   netdev_dbg(bond->dev, "No slave add ops for %s\n",
+   slave_dev->name);
+   continue;
+   }
+   rv = ops->ndo_bond_slave_add(slave_dev, bond->dev);
+   if (rv < 0)
+   return rv;
+   }
+   return rv;
+}
+
 /* Build the usable slaves array in control path for modes that use xmit-hash
  * to determine the slave interface -
  * (a) BOND_MODE_8023AD
@@ -3771,7 +3866,7 @@ int bond_update_slave_arr(struct bonding *bond, struct 
slave *skipslave)
 {
struct slave *slave;
struct list_head *iter;
-   struct bond_up_slave *new_arr, *old_arr;
+   struct bond_up_slave *new_arr, *old_arr, *discard_arr = 0;
int agg_id = 0;
int ret = 0;
 
@@ -3786,6 +3881,12 @@ int bond_update_slave_arr(struct bonding *bond, struct 
slave *skipslave)
pr_err("Failed to build slave-array.\n");
goto out;
}
+   discard_arr = kzalloc(offsetof(struct bond_up_slave, 
arr[bond->slave_cnt]),
+   GFP_KERNEL);
+   if (!discard_arr) {
+   ret = -ENOMEM;
+   goto out;
+   }
if (BOND_MODE(bond) == BOND_MODE_8023AD) {
struct ad_info ad_info;
 
@@ -3797,6 +3898,7 @@ int bond_update_slave_arr(struct bonding *bond, struct 
slave *skipslave)
 */
old_arr = rtnl_dereference(bond->slave_arr);
if 

[PATCH] ip_tunnel: disable preemption when updating per-cpu tstats

2015-11-12 Thread Jason A. Donenfeld
Drivers like vxlan use the recently introduced
udp_tunnel_xmit_skb/udp_tunnel6_xmit_skb APIs. udp_tunnel6_xmit_skb
makes use of ip6tunnel_xmit, and ip6tunnel_xmit, after sending the
packet, updates the struct stats using the usual
u64_stats_update_begin/end calls on this_cpu_ptr(dev->tstats).
udp_tunnel_xmit_skb makes use of iptunnel_xmit, which doesn't touch
tstats, so drivers like vxlan, immediately after, call
iptunnel_xmit_stats, which does the same thing - calls
u64_stats_update_begin/end on this_cpu_ptr(dev->tstats).

While vxlan is probably fine (I don't know?), calling a similar function
from, say, an unbound workqueue, on a fully preemptable kernel causes
real issues:

[  188.434537] BUG: using smp_processor_id() in preemptible [] code: 
kworker/u8:0/6
[  188.435579] caller is debug_smp_processor_id+0x17/0x20
[  188.435583] CPU: 0 PID: 6 Comm: kworker/u8:0 Not tainted 4.2.6 #2
[  188.435607] Call Trace:
[  188.435611]  [] dump_stack+0x4f/0x7b
[  188.435615]  [] check_preemption_disabled+0x19d/0x1c0
[  188.435619]  [] debug_smp_processor_id+0x17/0x20

The solution would be to protect the whole
this_cpu_ptr(dev->tstats)/u64_stats_update_begin/end blocks with
disabling preemption and then reenabling it.

Signed-off-by: Jason A. Donenfeld 
---
 include/net/ip6_tunnel.h | 5 -
 include/net/ip_tunnels.h | 6 --
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/include/net/ip6_tunnel.h b/include/net/ip6_tunnel.h
index fa915fa..67dc00d 100644
--- a/include/net/ip6_tunnel.h
+++ b/include/net/ip6_tunnel.h
@@ -90,11 +90,14 @@ static inline void ip6tunnel_xmit(struct sock *sk, struct 
sk_buff *skb,
err = ip6_local_out_sk(sk, skb);
 
if (net_xmit_eval(err) == 0) {
-   struct pcpu_sw_netstats *tstats = this_cpu_ptr(dev->tstats);
+   struct pcpu_sw_netstats *tstats;
+   preempt_disable();
+   tstats = this_cpu_ptr(dev->tstats);
u64_stats_update_begin(&tstats->syncp);
tstats->tx_bytes += pkt_len;
tstats->tx_packets++;
u64_stats_update_end(&tstats->syncp);
+   preempt_enable();
} else {
stats->tx_errors++;
stats->tx_aborted_errors++;
diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h
index f6dafec..6544955 100644
--- a/include/net/ip_tunnels.h
+++ b/include/net/ip_tunnels.h
@@ -287,12 +287,14 @@ static inline void iptunnel_xmit_stats(int err,
   struct pcpu_sw_netstats __percpu *stats)
 {
if (err > 0) {
-   struct pcpu_sw_netstats *tstats = this_cpu_ptr(stats);
-
+   struct pcpu_sw_netstats *tstats;
+   preempt_disable();
+   tstats = this_cpu_ptr(stats);
u64_stats_update_begin(&tstats->syncp);
tstats->tx_bytes += err;
tstats->tx_packets++;
u64_stats_update_end(&tstats->syncp);
+   preempt_enable();
} else if (err < 0) {
err_stats->tx_errors++;
err_stats->tx_aborted_errors++;
-- 
2.6.2



Re: [PATCH] bonding: Offloading bonds to hardware

2015-11-12 Thread Andrew Lunn
On Thu, Nov 12, 2015 at 04:02:18PM +, Premkumar Jonnala wrote:
> Packet forwarding to/from bond interfaces is done in software.
> 
> This patch enables certain platforms to bridge traffic to/from
> bond interfaces in hardware.  Notifications are sent out when 
> the "active" slave set for a bond interface is updated in 
> software.  Platforms use the notifications to program the 
> hardware accordingly.  The changes have been verified to work 
> with configured and 802.3ad bond interfaces.

Hi Premkumar

Nice to see this. Do you also have patches for a switch using these
notification? Are you targeting Starfighter 2?

Thanks
Andrew


Re: [PATCH v5] net: ethernet: add driver for Aurora VLSI NB8800 Ethernet controller

2015-11-12 Thread Måns Rullgård
Mason  writes:

> [ CCing a few knowledgeable people ]
>
> Despite the subject, this is about an Atheros 8035 PHY :-)
>
> On 12/11/2015 15:04, Måns Rullgård wrote:
>
>> Mason wrote:
>> 
>>> BTW, you're not using the PHY IRQ, right? I think I remember you saying
>>> it didn't work reliably?
>> 
>> It doesn't seem to be wired up on any of my boards, or there's some
>> magic required to activate it that I'm unaware of.
>
> Weird. The board schematics for the 1172 show Tango ETH0_MDINT# pin
> properly connected to AR8035 INT pin (pin 20).

I have a different board.

> 
>
> http://www.redeszone.net/app/uploads/2014/04/AR8035.pdf
>
> INT pin 20
> I/O, D, PD
> Interrupt Signal to System; default OD-gate, needs an external
> 10Kohm pull-up, active low; can be configured to I/O by register,
> active high.
>
> 4.1.17 Interrupt Enable
> Offset: 0x12
> Mode: Read/Write
> Hardware Reset: 0
>
> Strange... it looks like AT803X_INER and AT803X_INTR_ENABLE refer to
> the same "Interrupt Enable" register?

Seems like someone missed that it was already defined.

> In fact, AT803X_INER_INIT == 0xec00 makes sense for register 0x12:
> link success/fail, speed/duplex changed, autoneg error
>
> Looks like at803x_config_intr() is used for 8031, but not for 8035...
>
> Relevant commit:
> 77a9939426f7a "phy/at8031: enable at8031 to work on interrupt mode"
>
> If I add .config_intr and .ack_interrupt to the 8035 struct, then I get
> (also added some traces)

I tried that just now, and I get nothing.  What interrupt did you
specify in your device tree?

> Questions:
>
> Can't at803x_ack_interrupt() just return phy_read(phydev, AT803X_INSR);

No, that would return the actual value of the register.  The caller
doesn't care about the value, but should be notified if there was an
error.
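
i.e. something along these lines (sketch):

static int at803x_ack_interrupt(struct phy_device *phydev)
{
	int err;

	/* reading INSR clears the latched interrupt bits; the caller only
	 * cares whether the read itself failed
	 */
	err = phy_read(phydev, AT803X_INSR);

	return (err < 0) ? err : 0;
}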

> Can at803x_config_intr() be used with the 8035

Probably.  The person who sent the patch for 8031 probably happened to
have that model.

> What about AT803X_INER/AT803X_INTR_ENABLE and AT803X_INSR/AT803X_INTR_STATUS

Accidental duplicates.

-- 
Måns Rullgård
m...@mansr.com


Re: [PATCH v5] net: ethernet: add driver for Aurora VLSI NB8800 Ethernet controller

2015-11-12 Thread Måns Rullgård
Måns Rullgård  writes:

> Mason  writes:
>
>> [ CCing a few knowledgeable people ]
>>
>> Despite the subject, this is about an Atheros 8035 PHY :-)
>>
>> On 12/11/2015 15:04, Måns Rullgård wrote:
>>
>>> Mason wrote:
>>> 
 BTW, you're not using the PHY IRQ, right? I think I remember you saying
 it didn't work reliably?
>>> 
>>> It doesn't seem to be wired up on any of my boards, or there's some
>>> magic required to activate it that I'm unaware of.
>>
>> Weird. The board schematics for the 1172 show Tango ETH0_MDINT# pin
>> properly connected to AR8035 INT pin (pin 20).
>
> I have a different board.
>
>> 
>>
>> http://www.redeszone.net/app/uploads/2014/04/AR8035.pdf
>>
>> INT pin 20
>> I/O, D, PD
>> Interrupt Signal to System; default OD-gate, needs an external
>> 10Kohm pull-up, active low; can be configured to I/O by register,
>> active high.
>>
>> 4.1.17 Interrupt Enable
>> Offset: 0x12
>> Mode: Read/Write
>> Hardware Reset: 0
>>
>> Strange... it looks like AT803X_INER and AT803X_INTR_ENABLE refer to
>> the same "Interrupt Enable" register?
>
> Seems like someone missed that it was already defined.
>
>> In fact, AT803X_INER_INIT == 0xec00 makes sense for register 0x12:
>> link success/fail, speed/duplex changed, autoneg error
>>
>> Looks like at803x_config_intr() is used for 8031, but not for 8035...
>>
>> Relevant commit:
>> 77a9939426f7a "phy/at8031: enable at8031 to work on interrupt mode"
>>
>> If I add .config_intr and .ack_interrupt to the 8035 struct, then I get
>> (also added some traces)
>
> I tried that just now, and I get nothing.  What interrupt did you
> specify in your device tree?

It works with the interrupt set to trigger on rising edge.

-- 
Måns Rullgård
m...@mansr.com


[PATCH net 3/6] net/mlx5e: Max mtu comparison fix

2015-11-12 Thread Or Gerlitz
From: Doron Tsur 

On change mtu the driver compares between hardware queried mtu and
software requested mtu. We need to compare between software
representation of the queried mtu and the requested mtu.

Fixes: facc9699f0fe ('net/mlx5e: Fix HW MTU settings')
Signed-off-by: Doron Tsur 
Signed-off-by: Saeed Mahameed 
Signed-off-by: Or Gerlitz 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index df00175..1e52db3 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -1901,6 +1901,8 @@ static int mlx5e_change_mtu(struct net_device *netdev, 
int new_mtu)
 
mlx5_query_port_max_mtu(mdev, &max_mtu, 1);
 
+   max_mtu = MLX5E_HW2SW_MTU(max_mtu);
+
if (new_mtu > max_mtu) {
netdev_err(netdev,
   "%s: Bad MTU (%d) > (%d) Max\n",
-- 
2.3.7



[PATCH net 2/6] net/mlx5e: Added self loopback prevention

2015-11-12 Thread Or Gerlitz
From: Tariq Toukan 

Prevent outgoing multicast frames from looping back to the RX queue.

This is done by introducing a new HW capability, self_lb_en_modifiable,
which indicates support for modifying the self_lb_en bit in the modify_tir
command.

When this capability is set we can prevent TIRs from sending loopback
multicast traffic back to their own RQs, by "refreshing" the TIRs with the
modify_tir command every time new channels (SQs/RQs) are created at
device open.
This is needed since TIRs are static and only allocated once on driver
load, and the loopback decision is under their responsibility.

Fixes issues of the kind:
"IPv6: eth2: IPv6 duplicate address fe80::e61d:2dff:fe5c:f2e9 detected!"
The issue is seen because the IPv6 solicitation multicast messages are
looped back and the network stack thinks they are coming from another host.

Fixes: 5c50368f3831 ("net/mlx5e: Light-weight netdev open/stop")
Signed-off-by: Tariq Toukan 
Signed-off-by: Saeed Mahameed 
Signed-off-by: Or Gerlitz 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 48 +++
 include/linux/mlx5/mlx5_ifc.h | 24 +++-
 2 files changed, 62 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 5fc4d2d..df00175 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -1332,6 +1332,42 @@ static int mlx5e_modify_tir_lro(struct mlx5e_priv *priv, 
int tt)
return err;
 }
 
+static int mlx5e_refresh_tir_self_loopback_enable(struct mlx5_core_dev *mdev,
+ u32 tirn)
+{
+   void *in;
+   int inlen;
+   int err;
+
+   inlen = MLX5_ST_SZ_BYTES(modify_tir_in);
+   in = mlx5_vzalloc(inlen);
+   if (!in)
+   return -ENOMEM;
+
+   MLX5_SET(modify_tir_in, in, bitmask.self_lb_en, 1);
+
+   err = mlx5_core_modify_tir(mdev, tirn, in, inlen);
+
+   kvfree(in);
+
+   return err;
+}
+
+static int mlx5e_refresh_tirs_self_loopback_enable(struct mlx5e_priv *priv)
+{
+   int err;
+   int i;
+
+   for (i = 0; i < MLX5E_NUM_TT; i++) {
+   err = mlx5e_refresh_tir_self_loopback_enable(priv->mdev,
+priv->tirn[i]);
+   if (err)
+   return err;
+   }
+
+   return 0;
+}
+
 static int mlx5e_set_dev_port_mtu(struct net_device *netdev)
 {
struct mlx5e_priv *priv = netdev_priv(netdev);
@@ -1376,6 +1412,13 @@ int mlx5e_open_locked(struct net_device *netdev)
goto err_clear_state_opened_flag;
}
 
+   err = mlx5e_refresh_tirs_self_loopback_enable(priv);
+   if (err) {
+   netdev_err(netdev, "%s: mlx5e_refresh_tirs_self_loopback_enable 
failed, %d\n",
+  __func__, err);
+   goto err_close_channels;
+   }
+
mlx5e_update_carrier(priv);
mlx5e_redirect_rqts(priv);
 
@@ -1383,6 +1426,8 @@ int mlx5e_open_locked(struct net_device *netdev)
 
return 0;
 
+err_close_channels:
+   mlx5e_close_channels(priv);
 err_clear_state_opened_flag:
clear_bit(MLX5E_STATE_OPENED, &priv->state);
return err;
@@ -1909,6 +1954,9 @@ static int mlx5e_check_required_hca_cap(struct 
mlx5_core_dev *mdev)
   "Not creating net device, some required device 
capabilities are missing\n");
return -ENOTSUPP;
}
+   if (!MLX5_CAP_ETH(mdev, self_lb_en_modifiable))
+   mlx5_core_warn(mdev, "Self loop back prevention is not 
supported\n");
+
return 0;
 }
 
diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h
index dd20974..1565324 100644
--- a/include/linux/mlx5/mlx5_ifc.h
+++ b/include/linux/mlx5/mlx5_ifc.h
@@ -453,26 +453,28 @@ struct mlx5_ifc_per_protocol_networking_offload_caps_bits 
{
u8 lro_cap[0x1];
u8 lro_psh_flag[0x1];
u8 lro_time_stamp[0x1];
-   u8 reserved_0[0x6];
+   u8 reserved_0[0x3];
+   u8 self_lb_en_modifiable[0x1];
+   u8 reserved_1[0x2];
u8 max_lso_cap[0x5];
-   u8 reserved_1[0x4];
+   u8 reserved_2[0x4];
u8 rss_ind_tbl_cap[0x4];
-   u8 reserved_2[0x3];
+   u8 reserved_3[0x3];
u8 tunnel_lso_const_out_ip_id[0x1];
-   u8 reserved_3[0x2];
+   u8 reserved_4[0x2];
u8 tunnel_statless_gre[0x1];
u8 tunnel_stateless_vxlan[0x1];
 
-   u8 reserved_4[0x20];
+   u8 reserved_5[0x20];
 
-   u8 reserved_5[0x10];
+   u8 reserved_6[0x10];
u8 lro_min_mss_size[0x10];
 
-   u8 

[PATCH] net: phy: at803x: support interrupt on 8030 and 8035

2015-11-12 Thread Mans Rullgard
Commit 77a993942 "phy/at8031: enable at8031 to work on interrupt mode"
added interrupt support for the 8031 PHY but left out the other two
chips supported by this driver.

This patch sets the .ack_interrupt and .config_intr functions for the
8030 and 8035 drivers as well.

Signed-off-by: Mans Rullgard 
---
I have only tested this with an 8035.  I can't find a datasheet for
the 8030, but since 8031, 8032, and 8035 all have the same register
layout, there's a good chance 8030 does as well.
---
 drivers/net/phy/at803x.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/net/phy/at803x.c b/drivers/net/phy/at803x.c
index fabf11d..2d020a3 100644
--- a/drivers/net/phy/at803x.c
+++ b/drivers/net/phy/at803x.c
@@ -308,6 +308,8 @@ static struct phy_driver at803x_driver[] = {
.flags  = PHY_HAS_INTERRUPT,
.config_aneg= genphy_config_aneg,
.read_status= genphy_read_status,
+   .ack_interrupt  = at803x_ack_interrupt,
+   .config_intr= at803x_config_intr,
.driver = {
.owner = THIS_MODULE,
},
@@ -327,6 +329,8 @@ static struct phy_driver at803x_driver[] = {
.flags  = PHY_HAS_INTERRUPT,
.config_aneg= genphy_config_aneg,
.read_status= genphy_read_status,
+   .ack_interrupt  = at803x_ack_interrupt,
+   .config_intr= at803x_config_intr,
.driver = {
.owner = THIS_MODULE,
},
-- 
2.6.3



[PATCH net] ipvs: use skb_to_full_sk() helper

2015-11-12 Thread Eric Dumazet
From: Eric Dumazet 

SYNACK packets might be attached to request sockets.

Use skb_to_full_sk() helper to avoid illegal accesses to
inet_sk(skb->sk)

Fixes: ca6fb0651883 ("tcp: attach SYNACK messages to request sockets instead of 
listener")
Signed-off-by: Eric Dumazet 
Reported-by: Sander Eikelenboom 
---
 net/netfilter/ipvs/ip_vs_core.c |   16 
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c
index 1e24fff53e4b..f57b4dcdb233 100644
--- a/net/netfilter/ipvs/ip_vs_core.c
+++ b/net/netfilter/ipvs/ip_vs_core.c
@@ -1176,6 +1176,7 @@ ip_vs_out(struct netns_ipvs *ipvs, unsigned int hooknum, 
struct sk_buff *skb, in
struct ip_vs_protocol *pp;
struct ip_vs_proto_data *pd;
struct ip_vs_conn *cp;
+   struct sock *sk;
 
EnterFunction(11);
 
@@ -1183,13 +1184,12 @@ ip_vs_out(struct netns_ipvs *ipvs, unsigned int 
hooknum, struct sk_buff *skb, in
if (skb->ipvs_property)
return NF_ACCEPT;
 
+   sk = skb_to_full_sk(skb);
/* Bad... Do not break raw sockets */
-   if (unlikely(skb->sk != NULL && hooknum == NF_INET_LOCAL_OUT &&
+   if (unlikely(sk && hooknum == NF_INET_LOCAL_OUT &&
 af == AF_INET)) {
-   struct sock *sk = skb->sk;
-   struct inet_sock *inet = inet_sk(skb->sk);
 
-   if (inet && sk->sk_family == PF_INET && inet->nodefrag)
+   if (sk->sk_family == PF_INET && inet_sk(sk)->nodefrag)
return NF_ACCEPT;
}
 
@@ -1681,6 +1681,7 @@ ip_vs_in(struct netns_ipvs *ipvs, unsigned int hooknum, 
struct sk_buff *skb, int
struct ip_vs_conn *cp;
int ret, pkts;
int conn_reuse_mode;
+   struct sock *sk;
 
/* Already marked as IPVS request or reply? */
if (skb->ipvs_property)
@@ -1708,12 +1709,11 @@ ip_vs_in(struct netns_ipvs *ipvs, unsigned int hooknum, 
struct sk_buff *skb, int
ip_vs_fill_iph_skb(af, skb, false, &iph);
 
/* Bad... Do not break raw sockets */
-   if (unlikely(skb->sk != NULL && hooknum == NF_INET_LOCAL_OUT &&
+   sk = skb_to_full_sk(skb);
+   if (unlikely(sk && hooknum == NF_INET_LOCAL_OUT &&
 af == AF_INET)) {
-   struct sock *sk = skb->sk;
-   struct inet_sock *inet = inet_sk(skb->sk);
 
-   if (inet && sk->sk_family == PF_INET && inet->nodefrag)
+   if (sk->sk_family == PF_INET && inet_sk(sk)->nodefrag)
return NF_ACCEPT;
}
 




Re: [linux-4.4-mw] BUG: unable to handle kernel paging request ip_vs_out.constprop

2015-11-12 Thread Sander Eikelenboom

On 2015-11-12 17:52, Eric Dumazet wrote:

On Thu, 2015-11-12 at 16:16 +0100, Sander Eikelenboom wrote:


> Thanks for the report, please try following patch :

Hi Eric,

Thanks for the patch!
Got it up and running at the moment, but since I don't have a clear trigger
it will take 1 or 2 days before I can report something back.


Don't worry, I have a pretty good picture of the bug and patch must fix
it.

I'll submit it formally asap.


Ok.

Do you know what these new warnings are for?
(apparently all networking including bridging works fine, so is this 
just too verbose logging ?)


[  207.033768] vif vif-1-0 vif1.0: set_features() failed (-1); wanted
0x00044803, left 0x000400114813
[  207.033780] vif vif-1-0 vif1.0: set_features() failed (-1); wanted
0x00044803, left 0x000400114813
[  207.245435] xen_bridge: error setting offload STP state on port
1(vif1.0)
[  207.245442] vif vif-1-0 vif1.0: failed to set HW ageing time
[  207.245443] xen_bridge: error setting offload STP state on port
1(vif1.0)
[  207.245491] vif vif-1-0 vif1.0: set_features() failed (-1); wanted
0x00044803, left 0x000400114813

--
Sander


Re: [PATCH] stmmac: avoid ipq806x constant overflow warning

2015-11-12 Thread David Miller
From: Arnd Bergmann 
Date: Thu, 12 Nov 2015 15:12:48 +0100

> Building dwmac-ipq806x on a 64-bit architecture produces a harmless
> warning from gcc:
> 
> stmmac/dwmac-ipq806x.c: In function 'ipq806x_gmac_probe':
> include/linux/bitops.h:6:19: warning: overflow in implicit constant 
> conversion [-Woverflow]
>   val = QSGMII_PHY_CDR_EN |
> stmmac/dwmac-ipq806x.c:333:8: note: in expansion of macro 'QSGMII_PHY_CDR_EN'
>  #define QSGMII_PHY_CDR_EN   BIT(0)
>  #define BIT(nr)   (1UL << (nr))
> 
> The compiler warns about the fact that a 64-bit literal is passed
> into a function that takes a 32-bit argument. I could not fully understand
> why it warns despite the fact that this number is always small enough
> to fit, but changing the use of BIT() macros into the equivalent hexadecimal
> representation avoids the warning
> 
> Signed-off-by: Arnd Bergmann 
> Fixes: b1c17215d718 ("stmmac: add ipq806x glue layer")

I've seen this warning too on x86_64 and had been meaning to look
into it, thanks for taking the initiative. :)

Moving away from using BIT() is somewhat disappointing, because we
want to encourage people to use these macros.

I think it's also easier from a driver author and auditing
perspective, you can see that something is being defined as bit X and
then check the documentation for the chip to see if bit X is correct
or not.

With the hex values there is more mental work and room for... mistakes.

Also I don't even understand the compiler's behavior, it's warning
about QSGMII_PHY_CDR_EN but if you define only that to "0x1u" it still
warns about QSGMII_PHY_CDR_EN.

The warning goes away only if you change all 5 BIT() uses.

This makes me like the change even less, something foul is going on
here and I'd rather figure out what that is than install this patch.

Thanks.


Re: [PATCH stable <= 3.18] net: add length argument to skb_copy_and_csum_datagram_iovec

2015-11-12 Thread Eric Dumazet
On Thu, 2015-11-12 at 10:48 +0100, Sabrina Dubroca wrote:
> 2015-11-10, 16:03:52 -0800, Greg Kroah-Hartman wrote:
> > On Tue, Nov 10, 2015 at 05:59:26PM -0600, Josh Hunt wrote:
> > > On Thu, Oct 29, 2015 at 5:00 AM, Sabrina Dubroca  
> > > wrote:
> > > > 2015-10-15, 14:25:03 +0200, Sabrina Dubroca wrote:
> > > >> Without this length argument, we can read past the end of the iovec in
> > > >> memcpy_toiovec because we have no way of knowing the total length of 
> > > >> the
> > > >> iovec's buffers.
> > > >>
> > > >> This is needed for stable kernels where 89c22d8c3b27 ("net: Fix skb
> > > >> csum races when peeking") has been backported but that don't have the
> > > >> ioviter conversion, which is almost all the stable trees <= 3.18.
> > > >>
> > > >> This also fixes a kernel crash for NFS servers when the client uses
> > > >>  -onfsvers=3,proto=udp to mount the export.
> > > >>
> > > >> Signed-off-by: Sabrina Dubroca 
> > > >> Reviewed-by: Hannes Frederic Sowa 
> > > >
> > > > Fixes CVE-2015-8019.
> > > > http://www.openwall.com/lists/oss-security/2015/10/29/1
> > > >
> > > > --
> > > > Sabrina
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe netdev" in
> > > > the body of a message to majord...@vger.kernel.org
> > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > 
> > > Greg
> > > 
> > > Do you have this in your queue? I saw a few other stables pick this
> > > up, but haven't seen it in 3.14 or 3.18 yet. It wasn't clear to me if
> > > this had been fully reviewed yet.
> > 
> > I rely on Dave to package up networking stable patches and forward them
> > on to me, that's why you haven't seen it be picked up yet.
> > 
> > thanks,
> > 
> > greg k-h
> 
> David, can you queue this up?
> 

Note that the following patch (and the corresponding part for ipv6) might
also have solved the issue?

This would supposedly save some cycles when MSG_PEEK is used and user
provides short buffers.

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 24ec14f9825c..387acab1ab5c 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1272,6 +1272,7 @@ int udp_recvmsg(struct sock *sk, struct msghdr *msg, 
size_t len, int noblock,
int err;
int is_udplite = IS_UDPLITE(sk);
bool slow;
+   bool checksum_valid = false;
 
if (flags & MSG_ERRQUEUE)
return ip_recv_error(sk, msg, len, addr_len);
@@ -1296,11 +1297,12 @@ try_again:
 */
 
if (copied < ulen || UDP_SKB_CB(skb)->partial_cov) {
-   if (udp_lib_checksum_complete(skb))
+   checksum_valid = !udp_lib_checksum_complete(skb);
+   if (!checksum_valid)
goto csum_copy_err;
}
 
-   if (skb_csum_unnecessary(skb))
+   if (checksum_valid || skb_csum_unnecessary(skb))
err = skb_copy_datagram_msg(skb, sizeof(struct udphdr),
msg, copied);
else {




[PATCH net 4/6] net/mlx5e: Use the right DMA free function on TX path

2015-11-12 Thread Or Gerlitz
From: Achiad Shochat 

On xmit path we use skb_frag_dma_map() which is using dma_map_page(),
while upon completion we dma-unmap the skb fragments using
dma_unmap_single() rather than dma_unmap_page().

To fix this, we now save the dma map type on xmit path and use this
info to call the right dma unmap method upon TX completion.

Signed-off-by: Achiad Shochat 
Signed-off-by: Saeed Mahameed 
Signed-off-by: Or Gerlitz 
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h| 10 +++-
 drivers/net/ethernet/mellanox/mlx5/core/en_tx.c | 65 +
 2 files changed, 43 insertions(+), 32 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index f2ae62d..22e72bf 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -334,9 +334,15 @@ struct mlx5e_tx_skb_cb {
 
 #define MLX5E_TX_SKB_CB(__skb) ((struct mlx5e_tx_skb_cb *)__skb->cb)
 
+enum mlx5e_dma_map_type {
+   MLX5E_DMA_MAP_SINGLE,
+   MLX5E_DMA_MAP_PAGE
+};
+
 struct mlx5e_sq_dma {
-   dma_addr_t addr;
-   u32size;
+   dma_addr_t  addr;
+   u32 size;
+   enum mlx5e_dma_map_type type;
 };
 
 enum {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
index f687ebf..1341b1d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
@@ -61,39 +61,47 @@ void mlx5e_send_nop(struct mlx5e_sq *sq, bool notify_hw)
}
 }
 
-static void mlx5e_dma_pop_last_pushed(struct mlx5e_sq *sq, dma_addr_t *addr,
- u32 *size)
+static inline void mlx5e_tx_dma_unmap(struct device *pdev,
+ struct mlx5e_sq_dma *dma)
 {
-   sq->dma_fifo_pc--;
-   *addr = sq->dma_fifo[sq->dma_fifo_pc & sq->dma_fifo_mask].addr;
-   *size = sq->dma_fifo[sq->dma_fifo_pc & sq->dma_fifo_mask].size;
-}
-
-static void mlx5e_dma_unmap_wqe_err(struct mlx5e_sq *sq, struct sk_buff *skb)
-{
-   dma_addr_t addr;
-   u32 size;
-   int i;
-
-   for (i = 0; i < MLX5E_TX_SKB_CB(skb)->num_dma; i++) {
-   mlx5e_dma_pop_last_pushed(sq, &addr, &size);
-   dma_unmap_single(sq->pdev, addr, size, DMA_TO_DEVICE);
+   switch (dma->type) {
+   case MLX5E_DMA_MAP_SINGLE:
+   dma_unmap_single(pdev, dma->addr, dma->size, DMA_TO_DEVICE);
+   break;
+   case MLX5E_DMA_MAP_PAGE:
+   dma_unmap_page(pdev, dma->addr, dma->size, DMA_TO_DEVICE);
+   break;
+   default:
+   WARN_ONCE(true, "mlx5e_tx_dma_unmap unknown DMA type!\n");
}
 }
 
-static inline void mlx5e_dma_push(struct mlx5e_sq *sq, dma_addr_t addr,
- u32 size)
+static inline void mlx5e_dma_push(struct mlx5e_sq *sq,
+ dma_addr_t addr,
+ u32 size,
+ enum mlx5e_dma_map_type map_type)
 {
sq->dma_fifo[sq->dma_fifo_pc & sq->dma_fifo_mask].addr = addr;
sq->dma_fifo[sq->dma_fifo_pc & sq->dma_fifo_mask].size = size;
+   sq->dma_fifo[sq->dma_fifo_pc & sq->dma_fifo_mask].type = map_type;
sq->dma_fifo_pc++;
 }
 
-static inline void mlx5e_dma_get(struct mlx5e_sq *sq, u32 i, dma_addr_t *addr,
-u32 *size)
+static inline struct mlx5e_sq_dma *mlx5e_dma_get(struct mlx5e_sq *sq, u32 i)
+{
+   return &sq->dma_fifo[i & sq->dma_fifo_mask];
+}
+
+static void mlx5e_dma_unmap_wqe_err(struct mlx5e_sq *sq, struct sk_buff *skb)
 {
-   *addr = sq->dma_fifo[i & sq->dma_fifo_mask].addr;
-   *size = sq->dma_fifo[i & sq->dma_fifo_mask].size;
+   int i;
+
+   for (i = 0; i < MLX5E_TX_SKB_CB(skb)->num_dma; i++) {
+   struct mlx5e_sq_dma *last_pushed_dma =
+   mlx5e_dma_get(sq, --sq->dma_fifo_pc);
+
+   mlx5e_tx_dma_unmap(sq->pdev, last_pushed_dma);
+   }
 }
 
 u16 mlx5e_select_queue(struct net_device *dev, struct sk_buff *skb,
@@ -225,7 +233,7 @@ static netdev_tx_t mlx5e_sq_xmit(struct mlx5e_sq *sq, 
struct sk_buff *skb)
dseg->lkey   = sq->mkey_be;
dseg->byte_count = cpu_to_be32(headlen);
 
-   mlx5e_dma_push(sq, dma_addr, headlen);
+   mlx5e_dma_push(sq, dma_addr, headlen, MLX5E_DMA_MAP_SINGLE);
MLX5E_TX_SKB_CB(skb)->num_dma++;
 
dseg++;
@@ -244,7 +252,7 @@ static netdev_tx_t mlx5e_sq_xmit(struct mlx5e_sq *sq, 
struct sk_buff *skb)
dseg->lkey   = sq->mkey_be;
dseg->byte_count = cpu_to_be32(fsz);
 
-   mlx5e_dma_push(sq, dma_addr, fsz);
+   mlx5e_dma_push(sq, dma_addr, fsz, MLX5E_DMA_MAP_PAGE);

[PATCH net 5/6] net/mlx4_core: Fix sleeping while holding spinlock at rem_slave_counters

2015-11-12 Thread Or Gerlitz
From: Eran Ben Elisha 

When cleaning slave's counter resources, we hold a spinlock that
protects the slave's counters list. As part of the clean, we call
__mlx4_clear_if_stat which calls mlx4_alloc_cmd_mailbox which is a
sleepable function.

In order to fix this issue, hold the spinlock, and copy all counter
indices into a temporary array, and release the spinlock. Afterwards,
iterate over this array and free every counter. Repeat this scenario
until the original list is empty (a new counter might have been added
while releasing the counters from the temporary array).

Fixes: b72ca7e96acf ("net/mlx4_core: Reset counters data when freed")
Reported-by: Moni Shoua 
Tested-by: Moni Shoua 
Signed-off-by: Jack Morgenstein 
Signed-off-by: Eran Ben Elisha 
Signed-off-by: Or Gerlitz 
---
 .../net/ethernet/mellanox/mlx4/resource_tracker.c  | 39 +++---
 1 file changed, 27 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c 
b/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c
index 9813d34..6fec3e9 100644
--- a/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c
+++ b/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c
@@ -4952,26 +4952,41 @@ static void rem_slave_counters(struct mlx4_dev *dev, 
int slave)
struct res_counter *counter;
struct res_counter *tmp;
int err;
-   int index;
+   int *counters_arr = NULL;
+   int i, j;
 
err = move_all_busy(dev, slave, RES_COUNTER);
if (err)
mlx4_warn(dev, "rem_slave_counters: Could not move all counters 
- too busy for slave %d\n",
  slave);
 
-   spin_lock_irq(mlx4_tlock(dev));
-   list_for_each_entry_safe(counter, tmp, counter_list, com.list) {
-   if (counter->com.owner == slave) {
-   index = counter->com.res_id;
-   rb_erase(&counter->com.node,
-    &tracker->res_tree[RES_COUNTER]);
-   list_del(&counter->com.list);
-   kfree(counter);
-   __mlx4_counter_free(dev, index);
+   counters_arr = kmalloc_array(dev->caps.max_counters,
+sizeof(*counters_arr), GFP_KERNEL);
+   if (!counters_arr)
+   return;
+
+   do {
+   i = 0;
+   j = 0;
+   spin_lock_irq(mlx4_tlock(dev));
+   list_for_each_entry_safe(counter, tmp, counter_list, com.list) {
+   if (counter->com.owner == slave) {
+   counters_arr[i++] = counter->com.res_id;
+   rb_erase(&counter->com.node,
+    &tracker->res_tree[RES_COUNTER]);
+   list_del(&counter->com.list);
+   kfree(counter);
+   }
+   }
+   spin_unlock_irq(mlx4_tlock(dev));
+
+   while (j < i) {
+   __mlx4_counter_free(dev, counters_arr[j++]);
mlx4_release_resource(dev, slave, RES_COUNTER, 1, 0);
}
-   }
-   spin_unlock_irq(mlx4_tlock(dev));
+   } while (i);
+
+   kfree(counters_arr);
 }
 
 static void rem_slave_xrcdns(struct mlx4_dev *dev, int slave)
-- 
2.3.7



[PATCH net 1/6] net/mlx5e: Fix inline header size calculation

2015-11-12 Thread Or Gerlitz
From: Saeed Mahameed 

mlx5e_get_inline_hdr_size didn't take into account the vlan insertion
into the inline WQE segment.
This could lead to max inline violation in cases where
skb_headlen(skb) + VLAN_HLEN >= sq->max_inline.

Fixes: 3ea4891db8d0 ("net/mlx5e: Fix LSO vlan insertion")
Signed-off-by: Saeed Mahameed 
Signed-off-by: Achiad Shochat 
Signed-off-by: Or Gerlitz 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_tx.c | 11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
index cd8f85a..f687ebf 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
@@ -118,8 +118,15 @@ static inline u16 mlx5e_get_inline_hdr_size(struct 
mlx5e_sq *sq,
 */
 #define MLX5E_MIN_INLINE ETH_HLEN
 
-   if (bf && (skb_headlen(skb) <= sq->max_inline))
-   return skb_headlen(skb);
+   if (bf) {
+   u16 ihs = skb_headlen(skb);
+
+   if (skb_vlan_tag_present(skb))
+   ihs += VLAN_HLEN;
+
+   if (ihs <= sq->max_inline)
+   return skb_headlen(skb);
+   }
 
return MLX5E_MIN_INLINE;
 }
-- 
2.3.7



[PATCH net 0/6] Mellanox NIC driver update, Nov 12, 2015

2015-11-12 Thread Or Gerlitz
Hi Dave,

A few small mlx5 and mlx4 fixes from the team... done over
net commit c5a3788 "Merge branch 'akpm' (patches from Andrew)"

Eran's patch needs to go to 4.2 and 4.3 stable kernels.

Tariq's patch needs to go to 4.3 stable too.

Or.

Achiad Shochat (1):
  net/mlx5e: Use the right DMA free function on TX path

Doron Tsur (1):
  net/mlx5e: Max mtu comparison fix

Eran Ben Elisha (1):
  net/mlx4_core: Fix sleeping while holding spinlock at rem_slave_counters

Noa Osherovich (1):
  net/mlx4_core: Avoid returning success in case of an error flow

Saeed Mahameed (1):
  net/mlx5e: Fix inline header size calculation

Tariq Toukan (1):
  net/mlx5e: Added self loopback prevention

 drivers/net/ethernet/mellanox/mlx4/main.c  |  8 ++-
 .../net/ethernet/mellanox/mlx4/resource_tracker.c  | 39 +++
 drivers/net/ethernet/mellanox/mlx5/core/en.h   | 10 ++-
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  | 50 ++
 drivers/net/ethernet/mellanox/mlx5/core/en_tx.c| 76 +-
 include/linux/mlx5/mlx5_ifc.h  | 24 ---
 6 files changed, 148 insertions(+), 59 deletions(-)

-- 
2.3.7



Re: [PATCH v5] net: ethernet: add driver for Aurora VLSI NB8800 Ethernet controller

2015-11-12 Thread Mason
[ CCing a few knowledgeable people ]

Despite the subject, this is about an Atheros 8035 PHY :-)

On 12/11/2015 15:04, Måns Rullgård wrote:

> Mason wrote:
> 
>> BTW, you're not using the PHY IRQ, right? I think I remember you saying
>> it didn't work reliably?
> 
> It doesn't seem to be wired up on any of my boards, or there's some
> magic required to activate it that I'm unaware of.

Weird. The board schematics for the 1172 show Tango ETH0_MDINT# pin
properly connected to AR8035 INT pin (pin 20).



http://www.redeszone.net/app/uploads/2014/04/AR8035.pdf

INT pin 20
I/O, D, PD
Interrupt Signal to System; default OD-gate, needs an external
10Kohm pull-up, active low; can be configured to I/O by register,
active high.


4.1.17 Interrupt Enable
Offset: 0x12
Mode: Read/Write
Hardware Reset: 0


Strange... it looks like AT803X_INER and AT803X_INTR_ENABLE refer to
the same "Interrupt Enable" register?

In fact, AT803X_INER_INIT == 0xec00 makes sense for register 0x12:
link success/fail, speed/duplex changed, autoneg error

Looks like at803x_config_intr() is used for 8031, but not for 8035...

Relevant commit:
77a9939426f7a "phy/at8031: enable at8031 to work on interrupt mode"

If I add .config_intr and .ack_interrupt to the 8035 struct, then I get
(also added some traces)

[0.883517] *** at803x_config_intr: ENABLE
[1.576108] *** at803x_config_intr: DISABLE
[1.580467] *** at803x_config_intr: ENABLE
[1.584959] *** at803x_config_intr: DISABLE
[1.589297] *** at803x_config_intr: ENABLE
[4.321722] *** at803x_config_intr: DISABLE
[4.326054] *** at803x_config_intr: ENABLE
[4.330489] nb8800 26000.ethernet eth0: Link is Up - 1Gbps/Full - flow 
control rx/tx
[4.338335] *** at803x_config_intr: ENABLE

(Are all the ENABLE/DISABLE events expected?)

And if I unplug/replug the Ethernet cable,

[   71.903051] *** at803x_config_intr: DISABLE
[   71.907410] *** at803x_config_intr: ENABLE
[   71.912232] nb8800 26000.ethernet eth0: Link is Down
[   71.917309] *** at803x_config_intr: ENABLE
[   78.008972] *** at803x_config_intr: DISABLE
[   78.013375] *** at803x_config_intr: ENABLE
[   78.017797] nb8800 26000.ethernet eth0: Link is Up - 1Gbps/Full - flow 
control rx/tx
[   78.025702] *** at803x_config_intr: ENABLE

(Are all the ENABLE/DISABLE events expected there too?)

# cat /proc/interrupts 
CPU0   CPU1   
 18:107  0  irq0   1 Level serial
 54:  5  0  irq0  37 Edge  phy_interrupt
 55:   4953  0  irq0  38 Level eth0
211:   1147254   GIC  29 Edge  twd


Questions:

Can't at803x_ack_interrupt() just return phy_read(phydev, AT803X_INSR);
Can at803x_config_intr() be used with the 8035
What about AT803X_INER/AT803X_INTR_ENABLE and AT803X_INSR/AT803X_INTR_STATUS

Regards.



Re: Is ndo_do_ioctl still acceptable?

2015-11-12 Thread Jason A. Donenfeld
Hi Stephen,

Thanks for your response.

On Thu, Nov 12, 2015 at 5:34 PM, Stephen Hemminger
 wrote:
> The problem is ioctl's are device specific, and therefore create dependency
> on the unique features supported by your device.
> The question always comes up, why is this new API not something general?

In this case, it really is for unique features of my device. My device
has its own unique notion of a "peer" based on a particular elliptic
curve point and some other interesting things. It's not something
generalizable to other devices. The thing that makes my particular
device special is these attributes that I need to make configurable. I
think then, by your criteria, ioctl would actually be perfect. In
other words, I interpret what you wrote to mean "generalizable:
netlink. device-specific: ioctl." If that's a decent summary, then
ioctl is certainly good for me.

> And if you are dumping such a huge mound of information that only your driver
> could love, then why are you doing it? Is there anything in there that
> really matters?

There is. There's a nice userspace program used for configuring and
displaying the particular attributes of this device.

> If all you are really doing is dumping statistics then look at ethtool.

It's more than stats, unfortunately.

> If you are dealing with lots of virtual function devices, look how existing
> netlink info is trimmed.

Could you elaborate on this? What do you mean "trimmed"? And where do I look?

Regards,
Jason


Re: PING: [PATCH] net: smsc911x: Reset PHY during initialization

2015-11-12 Thread David Miller
From: Pavel Fedin 
Date: Thu, 12 Nov 2015 10:04:39 +0300

> Or is it just a formal requirement to RESEND?

Yes, this is always what I ask people to do.


Re: [PATCH] ip_tunnel: disable preemption when updating per-cpu tstats

2015-11-12 Thread Hannes Frederic Sowa
On Thu, Nov 12, 2015, at 16:30, Jason A. Donenfeld wrote:
>   if (err > 0) {
> -   struct pcpu_sw_netstats *tstats = this_cpu_ptr(stats);
> -
> +   struct pcpu_sw_netstats *tstats;
> +   preempt_disable();
> +   tstats = this_cpu_ptr(stats);

The canonical way is get_cpu_ptr(stats) / put_cpu_ptr.
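
That is, something like (sketch; the helper name is just for illustration):

static void tunnel_update_tx_stats(struct net_device *dev, int pkt_len)
{
	struct pcpu_sw_netstats *tstats = get_cpu_ptr(dev->tstats); /* disables preemption */

	u64_stats_update_begin(&tstats->syncp);
	tstats->tx_bytes += pkt_len;
	tstats->tx_packets++;
	u64_stats_update_end(&tstats->syncp);

	put_cpu_ptr(tstats);	/* re-enables preemption */
}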

Bye,
Hannes


[PATCH v2] ip_tunnel: disable preemption when updating per-cpu tstats

2015-11-12 Thread Jason A. Donenfeld
Drivers like vxlan use the recently introduced
udp_tunnel_xmit_skb/udp_tunnel6_xmit_skb APIs. udp_tunnel6_xmit_skb
makes use of ip6tunnel_xmit, and ip6tunnel_xmit, after sending the
packet, updates the struct stats using the usual
u64_stats_update_begin/end calls on this_cpu_ptr(dev->tstats).
udp_tunnel_xmit_skb makes use of iptunnel_xmit, which doesn't touch
tstats, so drivers like vxlan, immediately after, call
iptunnel_xmit_stats, which does the same thing - calls
u64_stats_update_begin/end on this_cpu_ptr(dev->tstats).

While vxlan is probably fine (I don't know?), calling a similar function
from, say, an unbound workqueue, on a fully preemptable kernel causes
real issues:

[  188.434537] BUG: using smp_processor_id() in preemptible [] code: 
kworker/u8:0/6
[  188.435579] caller is debug_smp_processor_id+0x17/0x20
[  188.435583] CPU: 0 PID: 6 Comm: kworker/u8:0 Not tainted 4.2.6 #2
[  188.435607] Call Trace:
[  188.435611]  [] dump_stack+0x4f/0x7b
[  188.435615]  [] check_preemption_disabled+0x19d/0x1c0
[  188.435619]  [] debug_smp_processor_id+0x17/0x20

The solution would be to protect the whole
this_cpu_ptr(dev->tstats)/u64_stats_update_begin/end blocks with
disabling preemption and then reenabling it.

Signed-off-by: Jason A. Donenfeld 
---
 include/net/ip6_tunnel.h | 3 ++-
 include/net/ip_tunnels.h | 3 ++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/include/net/ip6_tunnel.h b/include/net/ip6_tunnel.h
index fa915fa..d49a8f8 100644
--- a/include/net/ip6_tunnel.h
+++ b/include/net/ip6_tunnel.h
@@ -90,11 +90,12 @@ static inline void ip6tunnel_xmit(struct sock *sk, struct 
sk_buff *skb,
err = ip6_local_out_sk(sk, skb);
 
if (net_xmit_eval(err) == 0) {
-   struct pcpu_sw_netstats *tstats = this_cpu_ptr(dev->tstats);
+   struct pcpu_sw_netstats *tstats = get_cpu_ptr(dev->tstats);
u64_stats_update_begin(&tstats->syncp);
tstats->tx_bytes += pkt_len;
tstats->tx_packets++;
u64_stats_update_end(&tstats->syncp);
+   put_cpu_ptr(tstats);
} else {
stats->tx_errors++;
stats->tx_aborted_errors++;
diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h
index f6dafec..62a750a 100644
--- a/include/net/ip_tunnels.h
+++ b/include/net/ip_tunnels.h
@@ -287,12 +287,13 @@ static inline void iptunnel_xmit_stats(int err,
   struct pcpu_sw_netstats __percpu *stats)
 {
if (err > 0) {
-   struct pcpu_sw_netstats *tstats = this_cpu_ptr(stats);
+   struct pcpu_sw_netstats *tstats = get_cpu_ptr(stats);
 
u64_stats_update_begin(&tstats->syncp);
tstats->tx_bytes += err;
tstats->tx_packets++;
u64_stats_update_end(&tstats->syncp);
+   put_cpu_ptr(tstats);
} else if (err < 0) {
err_stats->tx_errors++;
err_stats->tx_aborted_errors++;
-- 
2.6.2



Re: [PATCH] ip_tunnel: disable preemption when updating per-cpu tstats

2015-11-12 Thread Jason A. Donenfeld
On Thu, Nov 12, 2015 at 5:25 PM, Hannes Frederic Sowa
 wrote:
> The canonical way is get_cpu_ptr(stats) / put_cpu_ptr.

Thanks for the pointer. Fixed in v2.


[PATCH net] tcp: ensure proper barriers in lockless contexts

2015-11-12 Thread Eric Dumazet
From: Eric Dumazet 

Some functions access TCP sockets without holding a lock and
might output inconsistent data, depending on compiler and/or
architecture.

tcp_diag_get_info(), tcp_get_info(), tcp_poll(), get_tcp4_sock() ...

Introduce sk_state_load() and sk_state_store() to fix the issues,
and more clearly document where this lack of locking is happening.

Signed-off-by: Eric Dumazet 
---
 include/net/sock.h  |   25 +
 net/ipv4/inet_connection_sock.c |4 ++--
 net/ipv4/tcp.c  |   21 +++--
 net/ipv4/tcp_diag.c |2 +-
 net/ipv4/tcp_ipv4.c |   14 --
 net/ipv6/tcp_ipv6.c |   19 +++
 6 files changed, 62 insertions(+), 23 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index bbf7c2cf15b4..7f89e4ba18d1 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2226,6 +2226,31 @@ static inline bool sk_listener(const struct sock *sk)
return (1 << sk->sk_state) & (TCPF_LISTEN | TCPF_NEW_SYN_RECV);
 }
 
+/**
+ * sk_state_load - read sk->sk_state for lockless contexts
+ * @sk: socket pointer
+ *
+ * Paired with sk_state_store(). Used in places we do not hold socket lock :
+ * tcp_diag_get_info(), tcp_get_info(), tcp_poll(), get_tcp4_sock() ...
+ */
+static inline int sk_state_load(const struct sock *sk)
+{
return smp_load_acquire(&sk->sk_state);
+}
+
+/**
+ * sk_state_store - update sk->sk_state
+ * @sk: socket pointer
+ * @newstate: new state
+ *
+ * Paired with sk_state_load(). Should be used in contexts where
+ * state change might impact lockless readers.
+ */
+static inline void sk_state_store(struct sock *sk, int newstate)
+{
smp_store_release(&sk->sk_state, newstate);
+}
+
 void sock_enable_timestamp(struct sock *sk, int flag);
 int sock_get_timestamp(struct sock *, struct timeval __user *);
 int sock_get_timestampns(struct sock *, struct timespec __user *);
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 1feb15f23de8..46b9c887bede 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -563,7 +563,7 @@ static void reqsk_timer_handler(unsigned long data)
int max_retries, thresh;
u8 defer_accept;
 
-   if (sk_listener->sk_state != TCP_LISTEN)
+   if (sk_state_load(sk_listener) != TCP_LISTEN)
goto drop;
 
max_retries = icsk->icsk_syn_retries ? : sysctl_tcp_synack_retries;
@@ -749,7 +749,7 @@ int inet_csk_listen_start(struct sock *sk, int backlog)
 * It is OK, because this socket enters to hash table only
 * after validation is complete.
 */
-   sk->sk_state = TCP_LISTEN;
+   sk_state_store(sk, TCP_LISTEN);
if (!sk->sk_prot->get_port(sk, inet->inet_num)) {
inet->inet_sport = htons(inet->inet_num);
 
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 0cfa7c0c1e80..c1728771cf89 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -451,11 +451,14 @@ unsigned int tcp_poll(struct file *file, struct socket *sock, poll_table *wait)
unsigned int mask;
struct sock *sk = sock->sk;
const struct tcp_sock *tp = tcp_sk(sk);
+   int state;
 
sock_rps_record_flow(sk);
 
sock_poll_wait(file, sk_sleep(sk), wait);
-   if (sk->sk_state == TCP_LISTEN)
+
+   state = sk_state_load(sk);
+   if (state == TCP_LISTEN)
return inet_csk_listen_poll(sk);
 
/* Socket is not locked. We are protected from async events
@@ -492,14 +495,14 @@ unsigned int tcp_poll(struct file *file, struct socket *sock, poll_table *wait)
 * NOTE. Check for TCP_CLOSE is added. The goal is to prevent
 * blocking on fresh not-connected or disconnected socket. --ANK
 */
-   if (sk->sk_shutdown == SHUTDOWN_MASK || sk->sk_state == TCP_CLOSE)
+   if (sk->sk_shutdown == SHUTDOWN_MASK || state == TCP_CLOSE)
mask |= POLLHUP;
if (sk->sk_shutdown & RCV_SHUTDOWN)
mask |= POLLIN | POLLRDNORM | POLLRDHUP;
 
/* Connected or passive Fast Open socket? */
-   if (sk->sk_state != TCP_SYN_SENT &&
-   (sk->sk_state != TCP_SYN_RECV || tp->fastopen_rsk)) {
+   if (state != TCP_SYN_SENT &&
+   (state != TCP_SYN_RECV || tp->fastopen_rsk)) {
int target = sock_rcvlowat(sk, 0, INT_MAX);
 
if (tp->urg_seq == tp->copied_seq &&
@@ -507,9 +510,6 @@ unsigned int tcp_poll(struct file *file, struct socket *sock, poll_table *wait)
tp->urg_data)
target++;
 
-   /* Potential race condition. If read of tp below will
-* escape above sk->sk_state, we can be illegally awaken
-* in SYN_* states. */
if (tp->rcv_nxt - tp->copied_seq >= target)
mask |= POLLIN | POLLRDNORM;
 
@@ -1934,7 
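
Assuming the helpers above are applied, a minimal sketch of what a lockless
reader looks like; the function name is illustrative:

#include <net/sock.h>
#include <net/tcp_states.h>

/* Read sk_state once with acquire semantics and work on the snapshot,
 * rather than re-reading sk->sk_state several times; pairs with the
 * sk_state_store() done at state-change time.
 */
static bool example_sk_is_established(const struct sock *sk)
{
        return sk_state_load(sk) == TCP_ESTABLISHED;
}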

Re: [linux-4.4-mw] BUG: unable to handle kernel paging request ip_vs_out.constprop

2015-11-12 Thread Eric Dumazet
On Thu, 2015-11-12 at 16:16 +0100, Sander Eikelenboom wrote:

> > Thanks for the report, please try following patch :
> 
> Hi Eric,
> 
> Thanks for the patch!
> Got it up and running at the moment, but since i don't have a clear 
> trigger it
> will take 1 or 2 days before i can report something back.

Don't worry, I have a pretty good picture of the bug, and the patch
should fix it.

I'll submit it formally asap.




Re: [PATCH] ip_tunnel: disable preemption when updating per-cpu tstats

2015-11-12 Thread Jason A. Donenfeld
By the way, in case anybody is interested, I've done a little bit of
historical digging. The functions in question date back to commit
aa0010f8 from 2012. Before that commit, statistics structures would be
incremented after each tunnel's driver itself dereferenced the per-cpu
variable. When this got factored out into iptunnel_xmit_stats, the
author of aa0010f8 simply took the code from the tunnel drivers, which
included the use of "this_cpu_ptr" instead of "get_cpu_ptr", because
presumably each driver was able to ensure preemption was disabled in
its codepath. With the generalization of this functionality into the
globally useful iptunnel_xmit_stats, we'll need to be using
"get_cpu_ptr" instead.


Re: Is ndo_do_ioctl still acceptable?

2015-11-12 Thread Stephen Hemminger
On Thu, 12 Nov 2015 05:59:03 +0100
"Jason A. Donenfeld"  wrote:

> Hi David & Folks,
> 
> Soon I will submit a virtual tunnel device driver to LKML for review.
> It uses rtnl_link_register to create a virtual network interface,
> which then handles encryption, authentication, and some other things,
> amongst various configured peers.
> 
> Right now the device is configurable via Netlink. It receives new
> peers and configuration via a rtnl_link_ops->changelink function, and
> it reports information back to userspace via a
> rtnl_link_ops->fill_info function.
> 
> Configuration works fine, though it is rather cumbersome to do this
> all via Netlink.
> 
> Reporting information back to userspace does not work fine. The reason
> is that sometimes there's too much information to report back to
> userspace than what can fit in a single preallocated Netlink skb. And
> since rtnl_link_ops->fill_info doesn't receive any information from
> userspace, I'm unable to use it to send back information in smaller
> pieces.
> 
> I realize I could register a whole new rtnl packet family and related
> set of functions with rtnl_register, such as what's done at the bottom
> of `net/core/rtnetlink.c`. This is extremely cumbersome and invasive
> though. It would require adding a new protocol family (like the
> already existing rtnl_register-ified functions for PF_UNSPEC and
> PF_BRIDGE), and I don't have enough clout to confidently submit a
> patch that augments `include/linux/socket.h` with a new PF/AF define.
> This seems very invasive and not appropriate for my driver.
> 
> What I'd really like to do is just implement ndo_do_ioctl. It seems to
> me that this gives me a precise interface to do exactly what I want in
> the cleanest and easiest to read possible way. I could have differing
> ioctls for differing things. I could write memory back to userspace in
> proper chunks, with the proper size. It's clear and straightforward
> how to do it, and what the completed result looks like. It doesn't
> require invasive changes into other parts of the kernel, as this would
> be self-contained. It's hard to imagine a better interface to use than
> ndo_do_ioctl.
> 
> But. But the word on the street is that kernel hipsters hate ioctls
> and espouse the use of netlink everywhere with religious fervor, and
> will burn at the stake any submissions I might send that go anywhere
> near using ndo_do_ioctl rather than (the most likely ill-fitting for
> the task) netlink. That, and the maintainers of the `ip` tool will be
> upset too (even though they do already make use of several ioctls
> instead of netlink). I'm told everybody will leer and jeer at me if I
> use ndo_do_ioctl instead of netlink.
> 
> Except ndo_do_ioctl is *so* perfectly fitting here for my use case!
> 
> So what's the verdict on this? Do these aforementioned kernel hipsters
> not really matter so much, and ndo_do_ioctl is actually perfectly
> fine? Or must I really affix netlink onto my forthcoming submission?

The problem is that ioctls are device-specific, and therefore create a
dependency on the unique features supported by your device.
Also, doing security validation on ioctls is much harder.

The question always comes up: why is this new API not something general?
And if you are dumping such a huge mound of information that only your
driver could love, then why are you doing it? Is there anything in there
that really matters?

If all you are really doing is dumping statistics, then look at ethtool.
If you are dealing with lots of virtual function devices, look at how
existing netlink info is trimmed.

Don't overanalyze this.
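
On the "look at ethtool" point: a minimal sketch of how a driver exposes its
own counters through the standard ethtool_ops hooks, so userspace can read
them with "ethtool -S". The my_-prefixed names and the two counters are
invented placeholders:

#include <linux/ethtool.h>
#include <linux/kernel.h>
#include <linux/netdevice.h>
#include <linux/string.h>

static const char my_stat_names[][ETH_GSTRING_LEN] = {
        "peer_rx_bytes",
        "peer_tx_bytes",
};

static int my_get_sset_count(struct net_device *dev, int sset)
{
        return sset == ETH_SS_STATS ? ARRAY_SIZE(my_stat_names) : -EOPNOTSUPP;
}

static void my_get_strings(struct net_device *dev, u32 sset, u8 *data)
{
        if (sset == ETH_SS_STATS)
                memcpy(data, my_stat_names, sizeof(my_stat_names));
}

static void my_get_ethtool_stats(struct net_device *dev,
                                 struct ethtool_stats *stats, u64 *data)
{
        /* Fill data[] in the same order as my_stat_names. */
        data[0] = 0;    /* placeholder for the real counter */
        data[1] = 0;
}

static const struct ethtool_ops my_ethtool_ops = {
        .get_sset_count         = my_get_sset_count,
        .get_strings            = my_get_strings,
        .get_ethtool_stats      = my_get_ethtool_stats,
};

The driver then points dev->ethtool_ops at this structure in its setup path.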


Re: kasan r8169 use-after-free trace.

2015-11-12 Thread Dave Jones
On Wed, Nov 11, 2015 at 10:19:28AM +0100, Francois Romieu wrote:
 > Dave Jones  :
 > > This happens during boot, (and then there's a flood of traces that happen 
 > > so fast
 > > afterwards it completely overwhelms serial console; not sure if they're the
 > > same/related or not).
 > > 
 > > ==
 > > BUG: KASAN: use-after-free in rtl8169_poll+0x4b6/0xb70 at addr 
 > > 8801d43b3288
 > > Read of size 1 by task kworker/0:3/188
 > > =
 > > BUG kmalloc-256 (Not tainted): kasan: bad access detected
 > > -
 > > 
 > > Disabling lock debugging due to kernel taint
 > > INFO: Slab 0xea000750ecc0 objects=16 used=16 fp=0x  (null) 
 > > flags=0x8080
 > > INFO: Object 0x8801d43b3200 @offset=512 fp=0x8801d43b3800
 > > 
 > > Bytes b4 8801d43b31f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
 > >  
 > > Object 8801d43b3200: 00 38 3b d4 01 88 ff ff 00 00 00 00 00 00 00 00  
 > > .8;.
 > 
 > Does the patch below cure it ?

It did, thanks for the quick turnaround!  It also turns out this was responsible
for the flood of spew afterwards. It's completely silent when I apply your diff.

Dave


Re: [PATCH net 1/1] r8169: fix kasan reported skb use-after-free.

2015-11-12 Thread David Miller
From: Francois Romieu 
Date: Wed, 11 Nov 2015 23:35:18 +0100

> Signed-off-by: Francois Romieu 
> Reported-by: Dave Jones 
> Fixes: d7d2d89d4b0af ("r8169: Add software counter for multicast packages")
> Acked-by: Eric Dumazet 
> Acked-by: Corinna Vinschen 
> ---
> 
>  Applies to davem's net as of c5a37883f42be712a989e54d5d6c0159b0e56599
>  ("Merge branch 'akpm' (patches from Andrew)")
> 
>  4.2 needs it as well.

Applied and queued up for -stable, thanks!


Re: [PATCH] net: phy: at803x: support interrupt on 8030 and 8035

2015-11-12 Thread Mason
On 12/11/2015 18:40, Mans Rullgard wrote:
> Commit 77a993942 "phy/at8031: enable at8031 to work on interrupt mode"
> added interrupt support for the 8031 PHY but left out the other two
> chips supported by this driver.
> 
> This patch sets the .ack_interrupt and .config_intr functions for the
> 8030 and 8035 drivers as well.
> 
> Signed-off-by: Mans Rullgard 
> ---
> I have only tested this with an 8035.  I can't find a datasheet for
> the 8030, but since 8031, 8032, and 8035 all have the same register
> layout, there's a good chance 8030 does as well.
> ---
>  drivers/net/phy/at803x.c | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/drivers/net/phy/at803x.c b/drivers/net/phy/at803x.c
> index fabf11d..2d020a3 100644
> --- a/drivers/net/phy/at803x.c
> +++ b/drivers/net/phy/at803x.c
> @@ -308,6 +308,8 @@ static struct phy_driver at803x_driver[] = {
>   .flags  = PHY_HAS_INTERRUPT,
>   .config_aneg= genphy_config_aneg,
>   .read_status= genphy_read_status,
> + .ack_interrupt  = at803x_ack_interrupt,
> + .config_intr= at803x_config_intr,
>   .driver = {
>   .owner = THIS_MODULE,
>   },
> @@ -327,6 +329,8 @@ static struct phy_driver at803x_driver[] = {
>   .flags  = PHY_HAS_INTERRUPT,
>   .config_aneg= genphy_config_aneg,
>   .read_status= genphy_read_status,
> + .ack_interrupt  = at803x_ack_interrupt,
> + .config_intr= at803x_config_intr,
>   .driver = {
>   .owner = THIS_MODULE,
>   },

Shouldn't we take the opportunity to clean up the duplicated register
definitions? (I'll send an informal patch to spur discussion.)

Regards.
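
For anyone wondering what the two callbacks being wired up actually do, a
generic sketch of a PHY driver's interrupt hooks follows. The register
numbers and mask below are invented placeholders, not the at803x
definitions:

#include <linux/phy.h>

#define MY_PHY_INTR_ENABLE      0x12    /* placeholder register numbers */
#define MY_PHY_INTR_STATUS      0x13
#define MY_PHY_INTR_MASK        0xffff  /* placeholder "all sources" mask */

/* Reading the latched status register acknowledges pending interrupts. */
static int my_phy_ack_interrupt(struct phy_device *phydev)
{
        int err = phy_read(phydev, MY_PHY_INTR_STATUS);

        return err < 0 ? err : 0;
}

/* Enable or disable interrupt generation based on phydev->interrupts,
 * which the PHY core sets before calling this hook.
 */
static int my_phy_config_intr(struct phy_device *phydev)
{
        if (phydev->interrupts == PHY_INTERRUPT_ENABLED)
                return phy_write(phydev, MY_PHY_INTR_ENABLE, MY_PHY_INTR_MASK);

        return phy_write(phydev, MY_PHY_INTR_ENABLE, 0);
}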



Re: [PATCH] net: phy: at803x: support interrupt on 8030 and 8035

2015-11-12 Thread Måns Rullgård
On 12 November 2015 19:06:23 GMT+00:00, Mason  wrote:
>On 12/11/2015 18:40, Mans Rullgard wrote:
>> Commit 77a993942 "phy/at8031: enable at8031 to work on interrupt
>mode"
>> added interrupt support for the 8031 PHY but left out the other two
>> chips supported by this driver.
>> 
>> This patch sets the .ack_interrupt and .config_intr functions for the
>> 8030 and 8035 drivers as well.
>> 
>> Signed-off-by: Mans Rullgard 
>> ---
>> I have only tested this with an 8035.  I can't find a datasheet for
>> the 8030, but since 8031, 8032, and 8035 all have the same register
>> layout, there's a good chance 8030 does as well.
>> ---
>>  drivers/net/phy/at803x.c | 4 
>>  1 file changed, 4 insertions(+)
>> 
>> diff --git a/drivers/net/phy/at803x.c b/drivers/net/phy/at803x.c
>> index fabf11d..2d020a3 100644
>> --- a/drivers/net/phy/at803x.c
>> +++ b/drivers/net/phy/at803x.c
>> @@ -308,6 +308,8 @@ static struct phy_driver at803x_driver[] = {
>>  .flags  = PHY_HAS_INTERRUPT,
>>  .config_aneg= genphy_config_aneg,
>>  .read_status= genphy_read_status,
>> +.ack_interrupt  = at803x_ack_interrupt,
>> +.config_intr= at803x_config_intr,
>>  .driver = {
>>  .owner = THIS_MODULE,
>>  },
>> @@ -327,6 +329,8 @@ static struct phy_driver at803x_driver[] = {
>>  .flags  = PHY_HAS_INTERRUPT,
>>  .config_aneg= genphy_config_aneg,
>>  .read_status= genphy_read_status,
>> +.ack_interrupt  = at803x_ack_interrupt,
>> +.config_intr= at803x_config_intr,
>>  .driver = {
>>  .owner = THIS_MODULE,
>>  },
>
>Shouldn't we take the opportunity to clean up the duplicated register
>definitions? (I'll send an informal patch to spur discussion.)
>
>Regards.

That can be done independently. Feel free to send a patch.
-- 
Måns Rullgård

