Re: [PATCH net-next v3.16] r8169: Correct value from speed 10 on MII_BMCR
From: Francois Romieu
Date: Fri, 25 Mar 2016 23:53:25 +0100

> Been there. Such requests are usually left unanswered. :o(

Due to this patch submitter's continued anti-social and anti-community
behavior, I have been completely ignoring their patches. I will continue
to mark all of their patch submissions as "REJECTED" in patchwork until
they start to behave like proper community members and actually respond
to, and act upon, the feedback they are given.

You should also feel free to ignore their work; your time is valuable,
don't waste it on someone who ignores everyone's feedback.
Re: [PATCH 1/2] net: dsa: mv88e6xxx: Introduce _mv88e6xxx_phy_page_{read,write}
On 03/25/2016 06:58 PM, Patrick Uiterwijk wrote:
> On Sat, Mar 26, 2016 at 2:45 AM, Guenter Roeck wrote:
>> On 03/25/2016 05:10 PM, Patrick Uiterwijk wrote:
>>>
>>> Add versions of the phy_page_read and _write functions to
>>> be used in a context where the SMI mutex is held.
>>>
>>> Signed-off-by: Patrick Uiterwijk
>>> ---
>>>  drivers/net/dsa/mv88e6xxx.c | 42 ++++++++++++++++++++++++++++++------------
>>>  1 file changed, 32 insertions(+), 10 deletions(-)
>>>
>>> diff --git a/drivers/net/dsa/mv88e6xxx.c b/drivers/net/dsa/mv88e6xxx.c
>>> index fa086e0..13db5d8 100644
>>> --- a/drivers/net/dsa/mv88e6xxx.c
>>> +++ b/drivers/net/dsa/mv88e6xxx.c
>>> @@ -2708,37 +2708,59 @@ int mv88e6xxx_switch_reset(struct dsa_switch *ds, bool ppu_active)
>>>  	return 0;
>>>  }
>>>
>>> -int mv88e6xxx_phy_page_read(struct dsa_switch *ds, int port, int page, int reg)
>>> +static int _mv88e6xxx_phy_page_read(struct dsa_switch *ds, int port, int page,
>>> +				    int reg)
>>>  {
>>> -	struct mv88e6xxx_priv_state *ps = ds_to_priv(ds);
>>>  	int ret;
>>>
>>> -	mutex_lock(&ps->smi_mutex);
>>>  	ret = _mv88e6xxx_phy_write_indirect(ds, port, 0x16, page);
>>>  	if (ret < 0)
>>> -		goto error;
>>> +		goto clear;
>>>  	ret = _mv88e6xxx_phy_read_indirect(ds, port, reg);
>>> -error:
>>> +clear:
>>
>> Is there some good reason for changing the name of those labels ?
>
> Vivien suggested to rename this since it makes more clear that this
> write is meant to return to page 0 to make sure that phylib doesn't
> get confused about the currently active page.

And "clear:" accomplishes that ? I would not have guessed. Wonder if
anyone else does. I would have used a comment.

	/* Try to return to page 0 even after an error */

or something like that.

Guenter

> Patrick
>
>> Guenter
>>
>>>  	_mv88e6xxx_phy_write_indirect(ds, port, 0x16, 0x0);
>>> -	mutex_unlock(&ps->smi_mutex);
>>> +
>>>  	return ret;
>>>  }
>>>
>>> -int mv88e6xxx_phy_page_write(struct dsa_switch *ds, int port, int page,
>>> -			     int reg, int val)
>>> +int mv88e6xxx_phy_page_read(struct dsa_switch *ds, int port, int page, int reg)
>>>  {
>>>  	struct mv88e6xxx_priv_state *ps = ds_to_priv(ds);
>>>  	int ret;
>>>
>>>  	mutex_lock(&ps->smi_mutex);
>>> +	ret = _mv88e6xxx_phy_page_read(ds, port, page, reg);
>>> +	mutex_unlock(&ps->smi_mutex);
>>> +
>>> +	return ret;
>>> +}
>>> +
>>> +static int _mv88e6xxx_phy_page_write(struct dsa_switch *ds, int port, int page,
>>> +				     int reg, int val)
>>> +{
>>> +	int ret;
>>> +
>>>  	ret = _mv88e6xxx_phy_write_indirect(ds, port, 0x16, page);
>>>  	if (ret < 0)
>>> -		goto error;
>>> +		goto clear;
>>>
>>>  	ret = _mv88e6xxx_phy_write_indirect(ds, port, reg, val);
>>> -error:
>>> +clear:
>>>  	_mv88e6xxx_phy_write_indirect(ds, port, 0x16, 0x0);
>>> +
>>> +	return ret;
>>> +}
>>> +
>>> +int mv88e6xxx_phy_page_write(struct dsa_switch *ds, int port, int page,
>>> +			     int reg, int val)
>>> +{
>>> +	struct mv88e6xxx_priv_state *ps = ds_to_priv(ds);
>>> +	int ret;
>>> +
>>> +	mutex_lock(&ps->smi_mutex);
>>> +	ret = _mv88e6xxx_phy_page_write(ds, port, page, reg, val);
>>>  	mutex_unlock(&ps->smi_mutex);
>>> +
>>>  	return ret;
>>>  }
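For readers following the thread, the pattern being debated can be sketched outside the kernel like this (the stub `phy_read`/`phy_write` register model and the `PAGE_REG` name are assumptions of this sketch, not the driver's real API); it keeps a cleanup label but carries the intent in the comment Guenter suggested:

```c
#include <assert.h>

/* Toy model of the paged-register access pattern under discussion.
 * regs[] stands in for a PHY register file and PAGE_REG (0x16) for the
 * Marvell page-select register; both are illustrative assumptions. */
#define PAGE_REG 0x16

static int regs[32];

static int phy_write(int reg, int val)
{
	regs[reg] = val;
	return 0;
}

static int phy_read(int reg)
{
	return regs[reg];
}

/* Read a register from a given page, always restoring page 0 on the
 * way out.  The comment on the label carries the intent that the label
 * name alone may not. */
static int phy_page_read(int page, int reg)
{
	int ret;

	ret = phy_write(PAGE_REG, page);
	if (ret < 0)
		goto restore;
	ret = phy_read(reg);
restore:
	/* Try to return to page 0 even after an error */
	phy_write(PAGE_REG, 0x0);
	return ret;
}
```

Whatever the label is called, the key property is the one exercised here: the page-select register is written back to 0 on every exit path.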
Re: [PATCH 1/2] net: dsa: mv88e6xxx: Introduce _mv88e6xxx_phy_page_{read,write}
On Sat, Mar 26, 2016 at 2:45 AM, Guenter Roeck wrote:
> On 03/25/2016 05:10 PM, Patrick Uiterwijk wrote:
>>
>> Add versions of the phy_page_read and _write functions to
>> be used in a context where the SMI mutex is held.
>>
>> Signed-off-by: Patrick Uiterwijk
>> ---
>>  drivers/net/dsa/mv88e6xxx.c | 42 ++++++++++++++++++++++++++++++------------
>>  1 file changed, 32 insertions(+), 10 deletions(-)
>>
>> diff --git a/drivers/net/dsa/mv88e6xxx.c b/drivers/net/dsa/mv88e6xxx.c
>> index fa086e0..13db5d8 100644
>> --- a/drivers/net/dsa/mv88e6xxx.c
>> +++ b/drivers/net/dsa/mv88e6xxx.c
>> @@ -2708,37 +2708,59 @@ int mv88e6xxx_switch_reset(struct dsa_switch *ds, bool ppu_active)
>>  	return 0;
>>  }
>>
>> -int mv88e6xxx_phy_page_read(struct dsa_switch *ds, int port, int page, int reg)
>> +static int _mv88e6xxx_phy_page_read(struct dsa_switch *ds, int port, int page,
>> +				    int reg)
>>  {
>> -	struct mv88e6xxx_priv_state *ps = ds_to_priv(ds);
>>  	int ret;
>>
>> -	mutex_lock(&ps->smi_mutex);
>>  	ret = _mv88e6xxx_phy_write_indirect(ds, port, 0x16, page);
>>  	if (ret < 0)
>> -		goto error;
>> +		goto clear;
>>  	ret = _mv88e6xxx_phy_read_indirect(ds, port, reg);
>> -error:
>> +clear:
>
> Is there some good reason for changing the name of those labels ?

Vivien suggested to rename this since it makes more clear that this
write is meant to return to page 0 to make sure that phylib doesn't
get confused about the currently active page.

Patrick

> Guenter
>
>>  	_mv88e6xxx_phy_write_indirect(ds, port, 0x16, 0x0);
>> -	mutex_unlock(&ps->smi_mutex);
>> +
>>  	return ret;
>>  }
>>
>> -int mv88e6xxx_phy_page_write(struct dsa_switch *ds, int port, int page,
>> -			     int reg, int val)
>> +int mv88e6xxx_phy_page_read(struct dsa_switch *ds, int port, int page, int reg)
>>  {
>>  	struct mv88e6xxx_priv_state *ps = ds_to_priv(ds);
>>  	int ret;
>>
>>  	mutex_lock(&ps->smi_mutex);
>> +	ret = _mv88e6xxx_phy_page_read(ds, port, page, reg);
>> +	mutex_unlock(&ps->smi_mutex);
>> +
>> +	return ret;
>> +}
>> +
>> +static int _mv88e6xxx_phy_page_write(struct dsa_switch *ds, int port, int page,
>> +				     int reg, int val)
>> +{
>> +	int ret;
>> +
>>  	ret = _mv88e6xxx_phy_write_indirect(ds, port, 0x16, page);
>>  	if (ret < 0)
>> -		goto error;
>> +		goto clear;
>>
>>  	ret = _mv88e6xxx_phy_write_indirect(ds, port, reg, val);
>> -error:
>> +clear:
>>  	_mv88e6xxx_phy_write_indirect(ds, port, 0x16, 0x0);
>> +
>> +	return ret;
>> +}
>> +
>> +int mv88e6xxx_phy_page_write(struct dsa_switch *ds, int port, int page,
>> +			     int reg, int val)
>> +{
>> +	struct mv88e6xxx_priv_state *ps = ds_to_priv(ds);
>> +	int ret;
>> +
>> +	mutex_lock(&ps->smi_mutex);
>> +	ret = _mv88e6xxx_phy_page_write(ds, port, page, reg, val);
>>  	mutex_unlock(&ps->smi_mutex);
>> +
>>  	return ret;
>>  }
Re: [RFC net-next 2/2] udp: No longer use SLAB_DESTROY_BY_RCU
On Fri, Mar 25, 2016 at 03:29:24PM -0700, Eric Dumazet wrote:
> Tom Herbert would like to avoid touching UDP socket refcnt for encapsulated
> traffic. For this to happen, we need to use normal RCU rules, with a grace
> period before freeing a socket. UDP sockets are not short lived in the
> high usage case, so the added cost of call_rcu() should not be a concern.
>
> This actually removes a lot of complexity in UDP stack.
>
> Multicast receives no longer need to hold a bucket lock.
>
> Note that ip early demux still needs to take a reference on the socket.
>
> Same remark for functions used by xt_socket and xt_PROXY netfilter modules,
> but this might be changed later.
>
> Signed-off-by: Eric Dumazet
> Cc: Tom Herbert
> ---
>  include/linux/udp.h |   8 +-
>  include/net/sock.h  |  12 +--
>  include/net/udp.h   |   2 +-
>  net/ipv4/udp.c      | 290 ++++++++++++++-----------------------------
>  net/ipv4/udp_diag.c |  18 ++--
>  net/ipv6/udp.c      | 194 ++++++++++++-----------------
>  6 files changed, 162 insertions(+), 362 deletions(-)

great stuff!
Re: [PATCH 1/2] net: dsa: mv88e6xxx: Introduce _mv88e6xxx_phy_page_{read,write}
On 03/25/2016 05:10 PM, Patrick Uiterwijk wrote:
> Add versions of the phy_page_read and _write functions to
> be used in a context where the SMI mutex is held.
>
> Signed-off-by: Patrick Uiterwijk
> ---
>  drivers/net/dsa/mv88e6xxx.c | 42 ++++++++++++++++++++++++++++++------------
>  1 file changed, 32 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/net/dsa/mv88e6xxx.c b/drivers/net/dsa/mv88e6xxx.c
> index fa086e0..13db5d8 100644
> --- a/drivers/net/dsa/mv88e6xxx.c
> +++ b/drivers/net/dsa/mv88e6xxx.c
> @@ -2708,37 +2708,59 @@ int mv88e6xxx_switch_reset(struct dsa_switch *ds, bool ppu_active)
>  	return 0;
>  }
>
> -int mv88e6xxx_phy_page_read(struct dsa_switch *ds, int port, int page, int reg)
> +static int _mv88e6xxx_phy_page_read(struct dsa_switch *ds, int port, int page,
> +				    int reg)
>  {
> -	struct mv88e6xxx_priv_state *ps = ds_to_priv(ds);
>  	int ret;
>
> -	mutex_lock(&ps->smi_mutex);
>  	ret = _mv88e6xxx_phy_write_indirect(ds, port, 0x16, page);
>  	if (ret < 0)
> -		goto error;
> +		goto clear;
>  	ret = _mv88e6xxx_phy_read_indirect(ds, port, reg);
> -error:
> +clear:

Is there some good reason for changing the name of those labels ?

Guenter

>  	_mv88e6xxx_phy_write_indirect(ds, port, 0x16, 0x0);
> -	mutex_unlock(&ps->smi_mutex);
> +
>  	return ret;
>  }
>
> -int mv88e6xxx_phy_page_write(struct dsa_switch *ds, int port, int page,
> -			     int reg, int val)
> +int mv88e6xxx_phy_page_read(struct dsa_switch *ds, int port, int page, int reg)
>  {
>  	struct mv88e6xxx_priv_state *ps = ds_to_priv(ds);
>  	int ret;
>
>  	mutex_lock(&ps->smi_mutex);
> +	ret = _mv88e6xxx_phy_page_read(ds, port, page, reg);
> +	mutex_unlock(&ps->smi_mutex);
> +
> +	return ret;
> +}
> +
> +static int _mv88e6xxx_phy_page_write(struct dsa_switch *ds, int port, int page,
> +				     int reg, int val)
> +{
> +	int ret;
> +
>  	ret = _mv88e6xxx_phy_write_indirect(ds, port, 0x16, page);
>  	if (ret < 0)
> -		goto error;
> +		goto clear;
>
>  	ret = _mv88e6xxx_phy_write_indirect(ds, port, reg, val);
> -error:
> +clear:
>  	_mv88e6xxx_phy_write_indirect(ds, port, 0x16, 0x0);
> +
> +	return ret;
> +}
> +
> +int mv88e6xxx_phy_page_write(struct dsa_switch *ds, int port, int page,
> +			     int reg, int val)
> +{
> +	struct mv88e6xxx_priv_state *ps = ds_to_priv(ds);
> +	int ret;
> +
> +	mutex_lock(&ps->smi_mutex);
> +	ret = _mv88e6xxx_phy_page_write(ds, port, page, reg, val);
>  	mutex_unlock(&ps->smi_mutex);
> +
>  	return ret;
>  }
[PATCH] Drivers: isdn: hisax: isac.c: Fix assignment and check into one expression.
Fix variable assignment inside if statement. It is error-prone and
hard to read.

Signed-off-by: Cosmin-Gabriel Samoila
---
 drivers/isdn/hisax/isac.c | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/drivers/isdn/hisax/isac.c b/drivers/isdn/hisax/isac.c
index 7fdf78f..df7e05c 100644
--- a/drivers/isdn/hisax/isac.c
+++ b/drivers/isdn/hisax/isac.c
@@ -215,9 +215,11 @@ isac_interrupt(struct IsdnCardState *cs, u_char val)
 		if (count == 0)
 			count = 32;
 		isac_empty_fifo(cs, count);
-		if ((count = cs->rcvidx) > 0) {
+		count = cs->rcvidx;
+		if (count > 0) {
 			cs->rcvidx = 0;
-			if (!(skb = alloc_skb(count, GFP_ATOMIC)))
+			skb = alloc_skb(count, GFP_ATOMIC);
+			if (!skb)
 				printk(KERN_WARNING "HiSax: D receive out of memory\n");
 			else {
 				memcpy(skb_put(skb, count), cs->rcvbuf, count);
@@ -251,7 +253,8 @@ isac_interrupt(struct IsdnCardState *cs, u_char val)
 				cs->tx_skb = NULL;
 			}
 		}
-		if ((cs->tx_skb = skb_dequeue(&cs->sq))) {
+		cs->tx_skb = skb_dequeue(&cs->sq);
+		if (cs->tx_skb) {
 			cs->tx_cnt = 0;
 			isac_fill_fifo(cs);
 		} else
@@ -313,7 +316,8 @@ afterXPR:
 #if ARCOFI_USE
 			if (v1 & 0x08) {
 				if (!cs->dc.isac.mon_rx) {
-					if (!(cs->dc.isac.mon_rx = kmalloc(MAX_MON_FRAME, GFP_ATOMIC))) {
+					cs->dc.isac.mon_rx = kmalloc(MAX_MON_FRAME, GFP_ATOMIC);
+					if (!cs->dc.isac.mon_rx) {
 						if (cs->debug & L1_DEB_WARN)
 							debugl1(cs, "ISAC MON RX out of memory!");
 						cs->dc.isac.mocr &= 0xf0;
@@ -343,7 +347,8 @@ afterXPR:
 afterMONR0:
 			if (v1 & 0x80) {
 				if (!cs->dc.isac.mon_rx) {
-					if (!(cs->dc.isac.mon_rx = kmalloc(MAX_MON_FRAME, GFP_ATOMIC))) {
+					cs->dc.isac.mon_rx = kmalloc(MAX_MON_FRAME, GFP_ATOMIC);
+					if (!cs->dc.isac.mon_rx) {
 						if (cs->debug & L1_DEB_WARN)
 							debugl1(cs, "ISAC MON RX out of memory!");
 						cs->dc.isac.mocr &= 0x0f;
--
1.9.1
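The transformation this patch applies is purely mechanical. A minimal standalone illustration of the before/after styles (with toy `next_item()` helpers standing in for `skb_dequeue()`/`kmalloc()`-style calls that return 0 when nothing is available; the names are invented for this sketch) shows the two forms behave identically:

```c
/* Shared toy data source; 0 terminates, mimicking a NULL return. */
static int pool[] = { 3, 7, 0 };
static int pos_a, pos_b;

static int next_item_a(void) { return pool[pos_a++]; }
static int next_item_b(void) { return pool[pos_b++]; }

/* Before: the assignment is buried inside the loop condition. */
static int count_before(void)
{
	int v, n = 0;

	while ((v = next_item_a()))
		n += v;
	return n;
}

/* After: the style the patch converts to -- assign first, then test.
 * Same behavior, but the side effect is no longer hidden in the
 * condition, which is what checkpatch complains about. */
static int count_after(void)
{
	int v, n = 0;

	for (;;) {
		v = next_item_b();
		if (!v)
			break;
		n += v;
	}
	return n;
}
```

Both functions consume the same sequence and produce the same sum; only the readability of the control flow differs.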
Re: [PATCH] ipv6: Fix the pmtu path for connected UDP socket
On Fri, Mar 25, 2016 at 04:55:27PM -0700, Martin KaFai Lau wrote:
>  void ip6_sk_update_pmtu(struct sk_buff *skb, struct sock *sk, __be32 mtu)
>  {
> +	struct dst_entry *odst;
> +
> +	odst = sk_dst_get(sk);
> +
>  	ip6_update_pmtu(skb, sock_net(sk), mtu,
>  			sk->sk_bound_dev_if, sk->sk_mark);
> +
> +	if (odst && !odst->error &&
> +	    !ip6_dst_check(odst, inet6_sk(sk)->dst_cookie)) {
> +		struct dst_entry *ndst;
> +		struct flowi6 fl6;
> +
> +		build_skb_flow_key(&fl6, skb, sock_net(sk),
> +				   sk->sk_bound_dev_if, sk->sk_mark);
> +		ndst = ip6_route_output(sock_net(sk), NULL, &fl6);
> +		if (!ndst->error)
> +			ip6_dst_store(sk, ndst, NULL, NULL);

oops...missed:

		else
			dst_release(ndst);

> +	}
> +
> +	dst_release(odst);
>  }
>  EXPORT_SYMBOL_GPL(ip6_sk_update_pmtu);
>
> --
> 2.5.1
[PATCH 1/2] net: dsa: mv88e6xxx: Introduce _mv88e6xxx_phy_page_{read,write}
Add versions of the phy_page_read and _write functions to
be used in a context where the SMI mutex is held.

Signed-off-by: Patrick Uiterwijk
---
 drivers/net/dsa/mv88e6xxx.c | 42 ++++++++++++++++++++++++++++++------------
 1 file changed, 32 insertions(+), 10 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx.c b/drivers/net/dsa/mv88e6xxx.c
index fa086e0..13db5d8 100644
--- a/drivers/net/dsa/mv88e6xxx.c
+++ b/drivers/net/dsa/mv88e6xxx.c
@@ -2708,37 +2708,59 @@ int mv88e6xxx_switch_reset(struct dsa_switch *ds, bool ppu_active)
 	return 0;
 }
 
-int mv88e6xxx_phy_page_read(struct dsa_switch *ds, int port, int page, int reg)
+static int _mv88e6xxx_phy_page_read(struct dsa_switch *ds, int port, int page,
+				    int reg)
 {
-	struct mv88e6xxx_priv_state *ps = ds_to_priv(ds);
 	int ret;
 
-	mutex_lock(&ps->smi_mutex);
 	ret = _mv88e6xxx_phy_write_indirect(ds, port, 0x16, page);
 	if (ret < 0)
-		goto error;
+		goto clear;
 	ret = _mv88e6xxx_phy_read_indirect(ds, port, reg);
-error:
+clear:
 	_mv88e6xxx_phy_write_indirect(ds, port, 0x16, 0x0);
-	mutex_unlock(&ps->smi_mutex);
+
 	return ret;
 }
 
-int mv88e6xxx_phy_page_write(struct dsa_switch *ds, int port, int page,
-			     int reg, int val)
+int mv88e6xxx_phy_page_read(struct dsa_switch *ds, int port, int page, int reg)
 {
 	struct mv88e6xxx_priv_state *ps = ds_to_priv(ds);
 	int ret;
 
 	mutex_lock(&ps->smi_mutex);
+	ret = _mv88e6xxx_phy_page_read(ds, port, page, reg);
+	mutex_unlock(&ps->smi_mutex);
+
+	return ret;
+}
+
+static int _mv88e6xxx_phy_page_write(struct dsa_switch *ds, int port, int page,
+				     int reg, int val)
+{
+	int ret;
+
 	ret = _mv88e6xxx_phy_write_indirect(ds, port, 0x16, page);
 	if (ret < 0)
-		goto error;
+		goto clear;
 
 	ret = _mv88e6xxx_phy_write_indirect(ds, port, reg, val);
-error:
+clear:
 	_mv88e6xxx_phy_write_indirect(ds, port, 0x16, 0x0);
+
+	return ret;
+}
+
+int mv88e6xxx_phy_page_write(struct dsa_switch *ds, int port, int page,
+			     int reg, int val)
+{
+	struct mv88e6xxx_priv_state *ps = ds_to_priv(ds);
+	int ret;
+
+	mutex_lock(&ps->smi_mutex);
+	ret = _mv88e6xxx_phy_page_write(ds, port, page, reg, val);
 	mutex_unlock(&ps->smi_mutex);
+
 	return ret;
 }
--
2.7.4
[PATCH 2/2] net: dsa: mv88e6xxx: Clear the PDOWN bit on setup
Some of the vendor-specific bootloaders set up this part of the
initialization for us, so this was never added. However, since upstream
bootloaders don't initialize the chip specifically, they leave the
fiber MII's PDOWN flag set, which means that the CPU port doesn't
connect.

This patch checks whether this flag has already been cleared by
something else, and if not, clears it.

Signed-off-by: Patrick Uiterwijk
---
 drivers/net/dsa/mv88e6xxx.c | 99 ++++++++++++++++++++++++++-------------
 drivers/net/dsa/mv88e6xxx.h |  8 ++++
 2 files changed, 76 insertions(+), 31 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx.c b/drivers/net/dsa/mv88e6xxx.c
index 13db5d8..05b2efb 100644
--- a/drivers/net/dsa/mv88e6xxx.c
+++ b/drivers/net/dsa/mv88e6xxx.c
@@ -2264,6 +2264,57 @@ static void mv88e6xxx_bridge_work(struct work_struct *work)
 	mutex_unlock(&ps->smi_mutex);
 }
 
+static int _mv88e6xxx_phy_page_write(struct dsa_switch *ds, int port, int page,
+				     int reg, int val)
+{
+	int ret;
+
+	ret = _mv88e6xxx_phy_write_indirect(ds, port, 0x16, page);
+	if (ret < 0)
+		goto clear;
+
+	ret = _mv88e6xxx_phy_write_indirect(ds, port, reg, val);
+clear:
+	_mv88e6xxx_phy_write_indirect(ds, port, 0x16, 0x0);
+
+	return ret;
+}
+
+static int _mv88e6xxx_phy_page_read(struct dsa_switch *ds, int port, int page,
+				    int reg)
+{
+	int ret;
+
+	ret = _mv88e6xxx_phy_write_indirect(ds, port, 0x16, page);
+	if (ret < 0)
+		goto clear;
+
+	ret = _mv88e6xxx_phy_read_indirect(ds, port, reg);
+clear:
+	_mv88e6xxx_phy_write_indirect(ds, port, 0x16, 0x0);
+
+	return ret;
+}
+
+static int mv88e6xxx_power_on_serdes(struct dsa_switch *ds)
+{
+	int ret;
+
+	ret = _mv88e6xxx_phy_page_read(ds, REG_FIBER_SERDES, PAGE_FIBER_SERDES,
+				       MII_BMCR);
+	if (ret < 0)
+		return ret;
+
+	if (ret & BMCR_PDOWN) {
+		ret = ret & ~BMCR_PDOWN;
+		ret = _mv88e6xxx_phy_page_write(ds, REG_FIBER_SERDES,
+						PAGE_FIBER_SERDES, MII_BMCR,
+						ret);
+	}
+
+	return ret;
+}
+
 static int mv88e6xxx_setup_port(struct dsa_switch *ds, int port)
 {
 	struct mv88e6xxx_priv_state *ps = ds_to_priv(ds);
@@ -2367,6 +2418,23 @@ static int mv88e6xxx_setup_port(struct dsa_switch *ds, int port)
 			goto abort;
 	}
 
+	/* If this port is connected to a SerDes, make sure the SerDes is not
+	 * powered down.
+	 */
+	if (mv88e6xxx_6352_family(ds)) {
+		ret = _mv88e6xxx_reg_read(ds, REG_PORT(port), PORT_STATUS);
+		if (ret < 0)
+			goto abort;
+		ret &= PORT_STATUS_CMODE_MASK;
+		if ((ret == PORT_STATUS_CMODE_100BASE_X) ||
+		    (ret == PORT_STATUS_CMODE_1000BASE_X) ||
+		    (ret == PORT_STATUS_CMODE_SGMII)) {
+			ret = mv88e6xxx_power_on_serdes(ds);
+			if (ret < 0)
+				goto abort;
+		}
+	}
+
 	/* Port Control 2: don't force a good FCS, set the maximum frame size to
 	 * 10240 bytes, disable 802.1q tags checking, don't discard tagged or
 	 * untagged frames on this port, do a destination address lookup on all
@@ -2708,21 +2776,6 @@ int mv88e6xxx_switch_reset(struct dsa_switch *ds, bool ppu_active)
 	return 0;
 }
 
-static int _mv88e6xxx_phy_page_read(struct dsa_switch *ds, int port, int page,
-				    int reg)
-{
-	int ret;
-
-	ret = _mv88e6xxx_phy_write_indirect(ds, port, 0x16, page);
-	if (ret < 0)
-		goto clear;
-	ret = _mv88e6xxx_phy_read_indirect(ds, port, reg);
-clear:
-	_mv88e6xxx_phy_write_indirect(ds, port, 0x16, 0x0);
-
-	return ret;
-}
-
 int mv88e6xxx_phy_page_read(struct dsa_switch *ds, int port, int page, int reg)
 {
 	struct mv88e6xxx_priv_state *ps = ds_to_priv(ds);
@@ -2735,22 +2788,6 @@ int mv88e6xxx_phy_page_read(struct dsa_switch *ds, int port, int page, int reg)
 	return ret;
 }
 
-static int _mv88e6xxx_phy_page_write(struct dsa_switch *ds, int port, int page,
-				     int reg, int val)
-{
-	int ret;
-
-	ret = _mv88e6xxx_phy_write_indirect(ds, port, 0x16, page);
-	if (ret < 0)
-		goto clear;
-
-	ret = _mv88e6xxx_phy_write_indirect(ds, port, reg, val);
-clear:
-	_mv88e6xxx_phy_write_indirect(ds, port, 0x16, 0x0);
-
-	return ret;
-}
-
 int mv88e6xxx_phy_page_write(struct dsa_switch *ds, int port, int page,
 			     int reg, int val)
 {
diff --git a/drivers/net/dsa/mv88e6xxx.h b/drivers/net/dsa/mv88e6xxx.h
index 9a038ab..26a424a 100644
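The read-modify-write at the heart of `mv88e6xxx_power_on_serdes()` can be modeled in isolation as follows. The `serdes_bmcr` variable and the `bmcr_read`/`bmcr_write` helpers are stand-ins invented for this sketch; only the `BMCR_PDOWN` value is the real `<uapi/linux/mii.h>` definition:

```c
/* Toy model of clearing the SerDes power-down bit. */
#define BMCR_PDOWN 0x0800	/* Enable low power state */

static int serdes_bmcr = 0x0940;	/* example value with PDOWN set */

static int bmcr_read(void) { return serdes_bmcr; }
static int bmcr_write(int val) { serdes_bmcr = val; return 0; }

/* Mirror the patch's logic: read BMCR, and only if PDOWN is set,
 * write it back with the bit masked off. */
static int power_on_serdes(void)
{
	int ret = bmcr_read();

	if (ret < 0)
		return ret;
	if (ret & BMCR_PDOWN)
		ret = bmcr_write(ret & ~BMCR_PDOWN);
	return ret;
}
```

The conditional write is the point of the patch: a bootloader that already cleared PDOWN results in no register write at all.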
Re: [RFC net-next 1/2] net: add SOCK_RCU_FREE socket flag
On Fri, Mar 25, 2016 at 3:29 PM, Eric Dumazet wrote:
> We want a generic way to insert an RCU grace period before socket
> freeing for cases where RCU_SLAB_DESTROY_BY_RCU is adding too
> much overhead.
>
> SLAB_DESTROY_BY_RCU strict rules force us to take a reference
> on the socket sk_refcnt, and it is a performance problem for UDP
> encapsulation, or TCP synflood behavior, as many CPUs might
> attempt the atomic operations on a shared sk_refcnt
>
> UDP sockets and TCP listeners can set SOCK_RCU_FREE so that their
> lookup can use traditional RCU rules, without refcount changes.
> They can set the flag only once hashed and visible by other cpus.
>
> Signed-off-by: Eric Dumazet
> Cc: Tom Herbert
> ---
>  include/net/sock.h |  2 ++
>  net/core/sock.c    | 14 +++++++++++++-
>  2 files changed, 15 insertions(+), 1 deletion(-)
>
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 255d3e03727b..c88785a3e76c 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -438,6 +438,7 @@ struct sock {
>  					struct sk_buff *skb);
>  	void			(*sk_destruct)(struct sock *sk);
>  	struct sock_reuseport __rcu	*sk_reuseport_cb;
> +	struct rcu_head		sk_rcu;
>  };
>
>  #define __sk_user_data(sk) ((*((void __rcu **)&(sk)->sk_user_data)))
> @@ -720,6 +721,7 @@ enum sock_flags {
>  		     */
>  	SOCK_FILTER_LOCKED, /* Filter cannot be changed anymore */
>  	SOCK_SELECT_ERR_QUEUE, /* Wake select on error queue */
> +	SOCK_RCU_FREE, /* wait rcu grace period in sk_destruct() */
>  };
>
>  #define SK_FLAGS_TIMESTAMP ((1UL << SOCK_TIMESTAMP) | (1UL << SOCK_TIMESTAMPING_RX_SOFTWARE))
>
> diff --git a/net/core/sock.c b/net/core/sock.c
> index b67b9aedb230..238a94f879ca 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -1418,8 +1418,12 @@ struct sock *sk_alloc(struct net *net, int family, gfp_t priority,
>  }
>  EXPORT_SYMBOL(sk_alloc);
>
> -void sk_destruct(struct sock *sk)
> +/* Sockets having SOCK_RCU_FREE will call this function after one RCU
> + * grace period. This is the case for UDP sockets and TCP listeners.
> + */
> +static void __sk_destruct(struct rcu_head *head)
>  {
> +	struct sock *sk = container_of(head, struct sock, sk_rcu);
>  	struct sk_filter *filter;
>
>  	if (sk->sk_destruct)
> @@ -1448,6 +1452,14 @@ void sk_destruct(struct sock *sk)
>  	sk_prot_free(sk->sk_prot_creator, sk);
>  }
>
> +void sk_destruct(struct sock *sk)
> +{
> +	if (sock_flag(sk, SOCK_RCU_FREE))
> +		call_rcu(&sk->sk_rcu, __sk_destruct);
> +	else
> +		__sk_destruct(&sk->sk_rcu);
> +}
> +

Very nice!

> static void __sk_free(struct sock *sk)
> {
> 	if (unlikely(sock_diag_has_destroy_listeners(sk) && sk->sk_net_refcnt))
> --
> 2.8.0.rc3.226.g39d4020
>
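The flag-based dispatch the patch adds to `sk_destruct()` can be modeled in plain C like this. The one-slot `pending` queue standing in for `call_rcu()`'s grace period, and the `fake_sock` type, are purely assumptions of this sketch:

```c
#include <stddef.h>

/* Minimal model of the sk_destruct() dispatch: objects flagged for
 * RCU freeing have their teardown deferred, others are torn down
 * immediately. */
struct fake_sock {
	int rcu_free;		/* models sock_flag(sk, SOCK_RCU_FREE) */
	int destructed;
};

static struct fake_sock *pending;

static void __destruct(struct fake_sock *sk)
{
	sk->destructed = 1;
}

static void destruct(struct fake_sock *sk)
{
	if (sk->rcu_free)
		pending = sk;	/* deferred, like call_rcu() */
	else
		__destruct(sk);	/* immediate */
}

/* Stand-in for the end of an RCU grace period. */
static void run_grace_period(void)
{
	if (pending) {
		__destruct(pending);
		pending = NULL;
	}
}
```

The point of the real patch is exactly this split: readers doing lockless lookups never see a flagged socket freed under them, because the actual free only happens after the grace period.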
Re: [RFC net-next 2/2] udp: No longer use SLAB_DESTROY_BY_RCU
On Fri, Mar 25, 2016 at 3:29 PM, Eric Dumazet wrote:
> Tom Herbert would like to avoid touching UDP socket refcnt for encapsulated
> traffic. For this to happen, we need to use normal RCU rules, with a grace
> period before freeing a socket. UDP sockets are not short lived in the
> high usage case, so the added cost of call_rcu() should not be a concern.
>
> This actually removes a lot of complexity in UDP stack.
>
> Multicast receives no longer need to hold a bucket lock.
>
> Note that ip early demux still needs to take a reference on the socket.
>
> Same remark for functions used by xt_socket and xt_PROXY netfilter modules,
> but this might be changed later.
>
> Signed-off-by: Eric Dumazet
> Cc: Tom Herbert
> ---
>  include/linux/udp.h |   8 +-
>  include/net/sock.h  |  12 +--
>  include/net/udp.h   |   2 +-
>  net/ipv4/udp.c      | 290 ++++++++++++++-----------------------------
>  net/ipv4/udp_diag.c |  18 ++--
>  net/ipv6/udp.c      | 194 ++++++++++++-----------------
>  6 files changed, 162 insertions(+), 362 deletions(-)
>
> diff --git a/include/linux/udp.h b/include/linux/udp.h
> index 87c094961bd5..32342754643a 100644
> --- a/include/linux/udp.h
> +++ b/include/linux/udp.h
> @@ -98,11 +98,11 @@ static inline bool udp_get_no_check6_rx(struct sock *sk)
>  	return udp_sk(sk)->no_check6_rx;
>  }
>
> -#define udp_portaddr_for_each_entry(__sk, node, list) \
> -	hlist_nulls_for_each_entry(__sk, node, list, __sk_common.skc_portaddr_node)
> +#define udp_portaddr_for_each_entry(__sk, list) \
> +	hlist_for_each_entry(__sk, list, __sk_common.skc_portaddr_node)
>
> -#define udp_portaddr_for_each_entry_rcu(__sk, node, list) \
> -	hlist_nulls_for_each_entry_rcu(__sk, node, list, __sk_common.skc_portaddr_node)
> +#define udp_portaddr_for_each_entry_rcu(__sk, list) \
> +	hlist_for_each_entry_rcu(__sk, list, __sk_common.skc_portaddr_node)
>
>  #define IS_UDPLITE(__sk) (udp_sk(__sk)->pcflag)
>
> diff --git a/include/net/sock.h b/include/net/sock.h
> index c88785a3e76c..5b9562bc478e 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -178,7 +178,7 @@ struct sock_common {
>  	int			skc_bound_dev_if;
>  	union {
>  		struct hlist_node	skc_bind_node;
> -		struct hlist_nulls_node	skc_portaddr_node;
> +		struct hlist_node	skc_portaddr_node;
>  	};
>  	struct proto		*skc_prot;
>  	possible_net_t		skc_net;
> @@ -670,18 +670,18 @@ static inline void sk_add_bind_node(struct sock *sk,
>  	hlist_for_each_entry(__sk, list, sk_bind_node)
>
>  /**
> - * sk_nulls_for_each_entry_offset - iterate over a list at a given struct offset
> + * sk_for_each_entry_offset - iterate over a list at a given struct offset
>  * @tpos:	the type * to use as a loop cursor.
>  * @pos:	the hlist_node to use as a loop cursor.
>  * @head:	the head for your list.
>  * @offset:	offset of hlist_node within the struct.
>  *
>  */
> -#define sk_nulls_for_each_entry_offset(tpos, pos, head, offset)	       \
> -	for (pos = (head)->first;					       \
> -	     (!is_a_nulls(pos)) &&					       \
> +#define sk_for_each_entry_offset_rcu(tpos, pos, head, offset)	       \
> +	for (pos = rcu_dereference((head)->first);			       \
> +	     pos != NULL &&						       \
>  	     ({ tpos = (typeof(*tpos) *)((void *)pos - offset); 1;});	       \
> -	     pos = pos->next)
> +	     pos = rcu_dereference(pos->next))
>
>  static inline struct user_namespace *sk_user_ns(struct sock *sk)
>  {
> diff --git a/include/net/udp.h b/include/net/udp.h
> index 92927f729ac8..d870ec1611c4 100644
> --- a/include/net/udp.h
> +++ b/include/net/udp.h
> @@ -59,7 +59,7 @@ struct udp_skb_cb {
>  *	@lock:	spinlock protecting changes to head/count
>  */
>  struct udp_hslot {
> -	struct hlist_nulls_head	head;
> +	struct hlist_head	head;
>  	int			count;
>  	spinlock_t		lock;
>  } __attribute__((aligned(2 * sizeof(long))));
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index 08eed5e16df0..3ebca8445d35 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -143,10 +143,9 @@ static int udp_lib_lport_inuse(struct net *net, __u16 num,
>  			       unsigned int log)
>  {
>  	struct sock *sk2;
> -	struct hlist_nulls_node *node;
>  	kuid_t uid = sock_i_uid(sk);
>
> -	sk_nulls_for_each(sk2, node, &hslot->head) {
> +	sk_for_each(sk2, &hslot->head) {
>  		if (net_eq(sock_net(sk2), net) &&
>  		    sk2 != sk &&
>  		    (bitmap || udp_sk(sk2)->udp_port_hash == num) &&
Re: [PATCH] ipv6: Fix the pmtu path for connected UDP socket
On Wed, Mar 23, 2016 at 04:57:22PM -0700, Wei Wang wrote:
> What about something like this:
>
> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> index ed44663..21b4102 100644
> --- a/net/ipv6/route.c
> +++ b/net/ipv6/route.c
> @@ -1394,6 +1394,19 @@ static void ip6_rt_update_pmtu(struct dst_entry *dst, struct sock *sk,
>  	__ip6_rt_update_pmtu(dst, sk, skb ? ipv6_hdr(skb) : NULL, mtu);
>  }
>
> +static void ip6_fill_in_flow(struct flowi6 *fl6, struct net *net,
> +			     struct sk_buff *skb, int oif, u32 mark)
> +{
> +	const struct ipv6hdr *iph = (struct ipv6hdr *) skb->data;
> +
> +	memset(fl6, 0, sizeof(fl6));
> +	fl6->flowi6_oif = oif;
> +	fl6->flowi6_mark = mark ? mark : IP6_REPLY_MARK(net, skb->mark);
> +	fl6->daddr = iph->daddr;
> +	fl6->saddr = iph->saddr;
> +	fl6->flowlabel = ip6_flowinfo(iph);
> +}
> +
>  void ip6_update_pmtu(struct sk_buff *skb, struct net *net, __be32 mtu,
>  		     int oif, u32 mark)
>  {
> @@ -1401,13 +1414,7 @@ void ip6_update_pmtu(struct sk_buff *skb, struct net *net, __be32 mtu,
>  	struct dst_entry *dst;
>  	struct flowi6 fl6;
>
> -	memset(&fl6, 0, sizeof(fl6));
> -	fl6.flowi6_oif = oif;
> -	fl6.flowi6_mark = mark ? mark : IP6_REPLY_MARK(net, skb->mark);
> -	fl6.daddr = iph->daddr;
> -	fl6.saddr = iph->saddr;
> -	fl6.flowlabel = ip6_flowinfo(iph);
> -
> +	ip6_fill_in_flow(&fl6, net, skb, oif, mark);
>  	dst = ip6_route_output(net, NULL, &fl6);
>  	if (!dst->error)
>  		__ip6_rt_update_pmtu(dst, NULL, iph, ntohl(mtu));
> @@ -1417,8 +1424,22 @@ EXPORT_SYMBOL_GPL(ip6_update_pmtu);
>
>  void ip6_sk_update_pmtu(struct sk_buff *skb, struct sock *sk, __be32 mtu)
>  {
> -	ip6_update_pmtu(skb, sock_net(sk), mtu,
> +	struct ipv6_pinfo *np = inet6_sk(sk);
> +	struct dst_entry *dst_new;
> +	struct flowi6 fl6;
> +	struct net *net = sock_net(sk);
> +
> +	ip6_update_pmtu(skb, net, mtu,
>  			sk->sk_bound_dev_if, sk->sk_mark);
> +
> +	if (sk->sk_state == TCP_ESTABLISHED &&
> +	    !sk_dst_check(sk, np->dst_cookie)) {
> +		ip6_fill_in_flow(&fl6, net, skb,
> +				 sk->sk_bound_dev_if, sk->sk_mark);
> +		dst_new = ip6_route_output(net, NULL, &fl6);
> +		if (!IS_ERR(dst_new))
> +			ip6_dst_store(sk, dst_new, NULL, NULL);
> +	}
>  }
>  EXPORT_SYMBOL_GPL(ip6_sk_update_pmtu);

Here is the change I come up with (based on your flow key buildup
function with one fix), taking a little different approach in
ip6_sk_update_pmtu() and with some explanation on ip6_dst_store() after
looking at ip6_rt_check(). I have not included the sk lock yet. There
are still a few things I don't fully understand in another discussion
thread.

[PATCH] ipv6: Update sk->sk_dst_cache in ip6_sk_update_pmtu()

There is a case in connected UDP socket such that getsockopt(IPV6_MTU)
will return a stale MTU value. The reproducible sequence could be the
following:

1. Create a connected UDP socket
2. Send some datagrams out
3. Receive a ICMPV6_PKT_TOOBIG
4. No new outgoing datagrams to trigger the sk_dst_check() logic to
   update the sk->sk_dst_cache.
5. getsockopt(IPV6_MTU) returns the mtu from the invalid
   sk->sk_dst_cache instead of the newly created RTF_CACHE clone which
   contains the MTU value instructed by the ICMPV6_PKT_TOOBIG message.

One observation is that the __udp6_lib_err() code path does not do a
sk_dst_check() and route relookup (if needed) after doing the pmtu
update (while TCP does).

This patch does a ip6_dst_check() in ip6_sk_update_pmtu(), which is
used by UDP to do the pmtu update, and updates sk->sk_dst_cache if it
is needed.

ip6_dst_store(sk, ndst, NULL, NULL) is used. NULL(s) are passed to the
daddr and saddr parameters for simplicity reason. The current use case
is only for the UDP lookup. The details are in ip6_rt_check(). At the
point where ip6_dst_store(sk, ndst, NULL, NULL) is called in this
patch, ndst should be a RTF_CACHE clone (which implies
((struct rt6_info *)ndst)->rt6i_dst.plen == 128) and ip6_rt_check()
will pass without even considering np->d/saddr_cache. Even if it has to
use np->d/saddr_cache, the worst case is that ip6_rt_check() fails and
causes one more route lookup.

Test:
Server (Connected UDP Socket):
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Route Details:
[root@arch-fb-vm1 ~]# ip -6 r show | egrep '2fac'
2fac::/64 dev eth0 proto kernel metric 256 pref medium
2fac:face::/64 via 2fac::face dev eth0 metric 1024 pref medium

A crappy python code to create a connected UDP socket:

import socket
import errno

HOST = '2fac::1'
PORT = 8080
s = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM)
s.bind((HOST, PORT))
s.connect(('2fac:face::face', 53))
print("connected")
while True:
    try:
        data = s.recv(1024)
    except socket.error as se:
        if se.errno == errno.EMSGSIZE:
            pmtu = s.getsockopt(41, 24)
            print("PMTU:%d" % pmtu)
            break
s.close()

Python program output after getting a ICMPV6_PKT_TOOBIG:
[root@arch-fb-vm1 ~]# python2
Re: [PATCH net-next v3.16] r8169: Correct value from speed 10 on MII_BMCR
On Fri, Mar 25, 2016 at 11:53:25PM +0100, Francois Romieu wrote:
> Phil Sutter:
> [...]
> > Your patch submissions are getting better, also good to see you're
> > finally using git-send-email. A few things need to be corrected though:
>
> #define BMCR_RESV	0x003f	/* Unused...                   */
> #define BMCR_SPEED1000	0x0040	/* MSB of Speed (1000)         */
> #define BMCR_CTST	0x0080	/* Collision test              */
> #define BMCR_FULLDPLX	0x0100	/* Full duplex                 */
> #define BMCR_ANRESTART	0x0200	/* Auto negotiation restart    */
> #define BMCR_ISOLATE	0x0400	/* Isolate data paths from MII */
> #define BMCR_PDOWN	0x0800	/* Enable low power state      */
> #define BMCR_ANENABLE	0x1000	/* Enable auto negotiation     */
> #define BMCR_SPEED100	0x2000	/* Select 100Mbps              */
> #define BMCR_LOOPBACK	0x4000	/* TXD loopback bits           */
>
> BMCR_SPEED100 apart, *all* these bits are now set.
>
> It does not make much sense.

I presumed that already, but didn't care to check myself so instead
ignored the actual code change. Thanks for pointing out the futility
of the whole thing. :)

> > Also detailed instructions on how to trigger the problem you are fixing
> > for would be good. In detail: Which specific hardware was used, in which
> > situation did the problem occur, how did it behave in that situation and
> > what was the expected behaviour?
>
> Been there. Such requests are usually left unanswered. :o(

That's my impression from following the (somewhat amusing) former
threads, too. I was merely impressed by the sheer quality of this patch
in comparison to previous ones. Speaking of which, the presented
tenacity certainly earns some respect.

Cheers, Phil
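Francois's observation can be checked mechanically. The exact value from the rejected patch is not quoted in this excerpt, so `0x5fff` below is purely an illustrative value consistent with "all of these bits except BMCR_SPEED100 set"; the defines are the ones he quotes from mii.h:

```c
/* BMCR bit definitions as quoted in the thread. */
#define BMCR_RESV	0x003f	/* Unused... */
#define BMCR_SPEED1000	0x0040	/* MSB of Speed (1000) */
#define BMCR_CTST	0x0080	/* Collision test */
#define BMCR_FULLDPLX	0x0100	/* Full duplex */
#define BMCR_ANRESTART	0x0200	/* Auto negotiation restart */
#define BMCR_ISOLATE	0x0400	/* Isolate data paths from MII */
#define BMCR_PDOWN	0x0800	/* Enable low power state */
#define BMCR_ANENABLE	0x1000	/* Enable auto negotiation */
#define BMCR_SPEED100	0x2000	/* Select 100Mbps */
#define BMCR_LOOPBACK	0x4000	/* TXD loopback bits */

#define BMCR_ALL_QUOTED	(BMCR_RESV | BMCR_SPEED1000 | BMCR_CTST |	\
			 BMCR_FULLDPLX | BMCR_ANRESTART | BMCR_ISOLATE |\
			 BMCR_PDOWN | BMCR_ANENABLE | BMCR_SPEED100 |	\
			 BMCR_LOOPBACK)

/* Return which of the quoted bits are set in a BMCR value. */
static int bmcr_set_bits(int val)
{
	return val & BMCR_ALL_QUOTED;
}
```

Writing reserved, isolate, power-down, collision-test and loopback bits all at once is why the value "does not make much sense" for selecting a speed.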
Re: veth regression with "don’t modify ip_summed; doing so treats packets with bad checksums as good."
On 03/25/2016 04:03 PM, Vijay Pandurangan wrote: On Fri, Mar 25, 2016 at 6:23 PM, Ben Greearwrote: On 03/25/2016 02:59 PM, Vijay Pandurangan wrote: consider two scenarios, where process a sends raw ethernet frames containing UDP packets to b I) process a --> veth --> process b II) process a -> eth -> wire -> eth -> process b I believe (I) is the simplest setup we can create that will replicate this bug. If process a sends frames that contain UDP packets to process b, what is the behaviour we want if the UDP packet *has an incorrect checksum*? It seems to me that I and II should have identical behaviour, and I would think that (II) would not deliver the packets to the application. In (I) with Cong's patch would we be delivering corrupt UDP packets to process b despite an incorrect checksum in (I)? If so, I would argue that this patch isn't right. Checksums are normally used to deal with flaky transport mechanisms, and once a machine receives the frame, we do not keep re-calculating checksums as we move it through various drivers and subsystems. In particular, checksums are NOT a security mechanism and can be easily faked. Since packets sent on one veth never actually hit any unreliable transport before they are received on the peer veth, then there should be no need to checksum packets whose origin is known to be on the local machine. That's a good argument. I'm trying to figure out how to reconcile your thoughts with the argument that virtual ethernet devices are an abstraction that should behave identically to perfectly-functional physical ethernet devices when connected with a wire. In my view, the invariant must be identical functionality, and if I were writing a regression test for this system, that's what I would test. I think optimizations for eliding checksums should be implemented only if they don't alter this functionality. There must be a way to structure / write this code so that we can optimize veths without causing different behaviour ... 
A real NIC can either do hardware checksums, or it cannot. If it cannot, then the host must do it on the CPU for both transmit and receive. Veth is not a real NIC, and it cannot do hardware checksum offloading. So, we either lie and pretend it does, or we eat massive amounts of CPU usage to calculate and check checksums when sending across a veth pair. Any frame sent from a socket can be considered to be a local packet in my opinion. I'm not sure that's totally right. Your bridge is adding a delay to your packets; it could just as easily be simulating corruption by corrupting 5% of packets going through it. If this change allows corrupt packets to be delivered to an application when they could not be delivered if the packets were routed via physical eths, I think that is a bug. I actually do support corrupting the frame, but what I normally do is corrupt the contents of the packet, and then recalculate the IP checksum (and TCP if it applies) and send it on its way. The receiving NIC and stack will pass the frame up to the application since the checksums match, and it would be up to the application to deal with it. So, I can easily cause an application to receive corrupted frames over physical eths. I can also corrupt without updating the checksums in case you want to test another system's NIC and/or stack. But, if I am purposely corrupting a frame destined for veth, then the only reason I would want the stack to check the checksums is if I were testing my own stack's checksum logic, and that seems to be a pretty limited use. Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com
Re: [RFC PATCH] tcp: Add SOF_TIMESTAMPING_TX_EOR and allow MSG_EOR in tcp_sendmsg
On Thu, Mar 24, 2016 at 6:39 PM, Willem de Bruijn wrote: > > > This patch allows the user process to use MSG_EOR during > > tcp_sendmsg to tell the kernel that it is the last byte > > of an application response message. > > > > The user process can use the new SOF_TIMESTAMPING_TX_EOR to > > ask the kernel to only track timestamp of the MSG_EOR byte. > > Selective timestamp requests are a useful addition. Soheil (cc-ed) also > happens to be looking at this. That approach uses cmsg to selectively > tag send calls, avoiding the need to define a new SOF_ flag. > > > The current SOF_TIMESTAMPING_TX_ACK is tracking the last > > byte appended to a skb during the tcp_sendmsg. It may track > > multiple bytes if the response spans across multiple skbs. > > It only tracks the last byte of the buffer passed in sendmsg. If a > sendmsg results in multiple skbuffs, only the last skb is tracked. It > is, however, possible that that skbuff is appended to in a follow-on > sendmsg call. If multiple calls enable recording on an skbuff, only > the last record request on an skb is kept. > > > While it is enough to measure the response latency for application > > protocol with a single request/response at a time (like HTTP 1.1 > > without pipeline), it does not work well for application protocol > > with >1 pipeline responses (like HTTP2). Looks like an interesting and useful patch. Since HTTP2 allows multiplexing data stream frames from multiple logical streams on a single socket, how would you instrument to measure the latency of each stream? e.g., sendmsg of data_frame_1_of_stream_a sendmsg of data_frame_1_of_stream_b sendmsg of data_frame_2_of_stream_a sendmsg of data_frame_1_of_stream_c sendmsg of data_frame_2_of_stream_b > > > > Each skb can only track one tskey (which is the seq number of > > the last byte of the message). 
To allow tracking the > > last byte of multiple response messages, this patch takes an approach > > by not appending to the last skb during tcp_sendmsg if the last skb's > > tskey will be overwritten. A similar case also happens during retransmit. > > > > This approach avoids introducing another list to track the tskey. The > > downside is that it will have less gso benefit and/or more outgoing > > packets. Practically, due to the amount of measurement data generated, > > sampling is usually used in production. (i.e. not every connection is > > tracked). > > Agreed. This is the simplest approach to avoiding timestamp request > overwrites. A list-based approach quickly becomes complex as skbuffs > can be split and merged at various points. > > Since this use is rare, I would suggest making the code even simpler > by just jumping to new_segment on a call with this MSG option (or > cmsg) set, avoiding tcp_tx_ts_noappend_skb() on each new segment. +1 > > > One of our use cases is at the webserver. The webserver tracks > > the HTTP2 response latency by measuring when the webserver > > sends the first byte to the socket till the TCP ACK of the last byte > > is received. In the cases where we don't have client side > > measurement, measuring from the server side is the only option. > > In the cases we have the client side measurement, the server side > > data can also be used to justify/cross-check-with the client > > side data (e.g. is there slowness at the layer above the client's > > TCP stack). > > > > The TCP PRR paper [1] also measures a similar metric: > > "The TCP latency of a HTTP response when the server sends the first > > byte until it receives the acknowledgment (ACK) for the last byte." 
> > > > [1] Proportional Rate Reduction for TCP: > > http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/37486.pdf > > > > Signed-off-by: Martin KaFai Lau > > Cc: Eric Dumazet > > Cc: Neal Cardwell > > Cc: Willem de Bruijn > > Cc: Yuchung Cheng > > --- > > include/uapi/linux/net_tstamp.h | 3 ++- > > net/ipv4/tcp.c | 23 ++- > > net/ipv4/tcp_output.c | 9 +++-- > > 3 files changed, 27 insertions(+), 8 deletions(-) > > > > diff --git a/include/uapi/linux/net_tstamp.h > > b/include/uapi/linux/net_tstamp.h > > index 6d1abea..5376569 100644 > > --- a/include/uapi/linux/net_tstamp.h > > +++ b/include/uapi/linux/net_tstamp.h > > @@ -25,8 +25,9 @@ enum { > > SOF_TIMESTAMPING_TX_ACK = (1<<9), > > SOF_TIMESTAMPING_OPT_CMSG = (1<<10), > > SOF_TIMESTAMPING_OPT_TSONLY = (1<<11), > > + SOF_TIMESTAMPING_TX_EOR = (1<<12), > > > > - SOF_TIMESTAMPING_LAST = SOF_TIMESTAMPING_OPT_TSONLY, > > + SOF_TIMESTAMPING_LAST = SOF_TIMESTAMPING_TX_EOR, > > SOF_TIMESTAMPING_MASK = (SOF_TIMESTAMPING_LAST - 1) | > > SOF_TIMESTAMPING_LAST > > }; > > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c > > index 08b8b96..7de96eb 100644 > > ---
Re: veth regression with "don’t modify ip_summed; doing so treats packets with bad checksums as good."
On Fri, Mar 25, 2016 at 6:23 PM, Ben Greear wrote: > On 03/25/2016 02:59 PM, Vijay Pandurangan wrote: >> >> consider two scenarios, where process a sends raw ethernet frames >> containing UDP packets to b >> >> I) process a --> veth --> process b >> >> II) process a -> eth -> wire -> eth -> process b >> >> I believe (I) is the simplest setup we can create that will replicate this >> bug. >> >> If process a sends frames that contain UDP packets to process b, what >> is the behaviour we want if the UDP packet *has an incorrect >> checksum*? >> >> It seems to me that I and II should have identical behaviour, and I >> would think that (II) would not deliver the packets to the >> application. >> >> In (I) with Cong's patch would we be delivering corrupt UDP packets to >> process b despite an incorrect checksum in (I)? >> >> If so, I would argue that this patch isn't right. > > > Checksums are normally used to deal with flaky transport mechanisms, > and once a machine receives the frame, we do not keep re-calculating > checksums > as we move it through various drivers and subsystems. > > In particular, checksums are NOT a security mechanism and can be easily > faked. > > Since packets sent on one veth never actually hit any unreliable transport > before they are received on the peer veth, then there should be no need to > checksum packets whose origin is known to be on the local machine. That's a good argument. I'm trying to figure out how to reconcile your thoughts with the argument that virtual ethernet devices are an abstraction that should behave identically to perfectly-functional physical ethernet devices when connected with a wire. In my view, the invariant must be identical functionality, and if I were writing a regression test for this system, that's what I would test. I think optimizations for eliding checksums should be implemented only if they don't alter this functionality. 
There must be a way to structure / write this code so that we can optimize veths without causing different behaviour ... > > Any frame sent from a socket can be considered to be a local packet in my > opinion. I'm not sure that's totally right. Your bridge is adding a delay to your packets; it could just as easily be simulating corruption by corrupting 5% of packets going through it. If this change allows corrupt packets to be delivered to an application when they could not be delivered if the packets were routed via physical eths, I think that is a bug. > > That is what Cong's patch does as far as I can tell. > > > Thanks, > Ben > > > -- > Ben Greear > Candela Technologies Inc http://www.candelatech.com >
Re: [PATCH net-next v3.16]r8169: Correct value from speed 10 on MII_BMCR
Phil Sutter: [...] > Your patch submissions are getting better, also good to see you're > finally using git-send-email. A few things need to be corrected though: > #define BMCR_RESV 0x003f /* Unused... */ #define BMCR_SPEED1000 0x0040 /* MSB of Speed (1000) */ #define BMCR_CTST 0x0080 /* Collision test */ #define BMCR_FULLDPLX 0x0100 /* Full duplex */ #define BMCR_ANRESTART 0x0200 /* Auto negotiation restart */ #define BMCR_ISOLATE 0x0400 /* Isolate data paths from MII */ #define BMCR_PDOWN 0x0800 /* Enable low power state */ #define BMCR_ANENABLE 0x1000 /* Enable auto negotiation */ #define BMCR_SPEED100 0x2000 /* Select 100Mbps */ #define BMCR_LOOPBACK 0x4000 /* TXD loopback bits */ BMCR_SPEED100 apart, *all* these bits are now set. It does not make much sense. > Also detailed instructions on how to trigger the problem you are fixing > for would be good. In detail: Which specific hardware was used, in which > situation did the problem occur, how did it behave in that situation and > what was the expected behaviour? Been there. Such requests are usually left unanswered. :o( Btw, this stuff targets 3.16 (...) and net-next is still closed. -- Ueimor
Re: [PATCH RFC 3/9] net: Add fast receive encapsulation
On Fri, 2016-03-25 at 16:40 -0400, David Miller wrote: > From: Tom Herbert > Date: Wed, 23 Mar 2016 15:36:52 -0700 > > > +{ > > + struct udp_sock *up = udp_sk(sk); > > + int is_udplite = IS_UDPLITE(sk); > > + > > + int (*encap_rcv)(struct sock *sk, struct sk_buff *skb); > > + > > Small nit, please put this encap_rcv function pointer declaration at > the top of the local variable list. It might also be nice to remove the equivalent typedef and use the same form in udp_tunnel.h --- include/net/udp_tunnel.h | 7 ++- 1 file changed, 2 insertions(+), 5 deletions(-) diff --git a/include/net/udp_tunnel.h b/include/net/udp_tunnel.h index b831140..71885b1 100644 --- a/include/net/udp_tunnel.h +++ b/include/net/udp_tunnel.h @@ -62,15 +62,12 @@ static inline int udp_sock_create(struct net *net, return -EPFNOSUPPORT; } -typedef int (*udp_tunnel_encap_rcv_t)(struct sock *sk, struct sk_buff *skb); -typedef void (*udp_tunnel_encap_destroy_t)(struct sock *sk); - struct udp_tunnel_sock_cfg { void *sk_user_data; /* user data used by encap_rcv call back */ /* Used for setting up udp_sock fields, see udp.h for details */ __u8 encap_type; - udp_tunnel_encap_rcv_t encap_rcv; - udp_tunnel_encap_destroy_t encap_destroy; + int (*encap_rcv)(struct sock *sk, struct sk_buff *skb); + void (*encap_destroy)(struct sock *sk); }; /* Setup the given (UDP) sock to receive UDP encapsulated packets */
[RFC net-next 2/2] udp: No longer use SLAB_DESTROY_BY_RCU
Tom Herbert would like to avoid touching UDP socket refcnt for encapsulated traffic. For this to happen, we need to use normal RCU rules, with a grace period before freeing a socket. UDP sockets are not short lived in the high usage case, so the added cost of call_rcu() should not be a concern. This actually removes a lot of complexity in UDP stack. Multicast receives no longer need to hold a bucket lock. Note that ip early demux still needs to take a reference on the socket. Same remark for functions used by xt_socket and xt_PROXY netfilter modules, but this might be changed later. Signed-off-by: Eric Dumazet Cc: Tom Herbert --- include/linux/udp.h | 8 +- include/net/sock.h | 12 +-- include/net/udp.h | 2 +- net/ipv4/udp.c | 290 +++- net/ipv4/udp_diag.c | 18 ++-- net/ipv6/udp.c | 194 +++ 6 files changed, 162 insertions(+), 362 deletions(-) diff --git a/include/linux/udp.h b/include/linux/udp.h index 87c094961bd5..32342754643a 100644 --- a/include/linux/udp.h +++ b/include/linux/udp.h @@ -98,11 +98,11 @@ static inline bool udp_get_no_check6_rx(struct sock *sk) return udp_sk(sk)->no_check6_rx; } -#define udp_portaddr_for_each_entry(__sk, node, list) \ - hlist_nulls_for_each_entry(__sk, node, list, __sk_common.skc_portaddr_node) +#define udp_portaddr_for_each_entry(__sk, list) \ + hlist_for_each_entry(__sk, list, __sk_common.skc_portaddr_node) -#define udp_portaddr_for_each_entry_rcu(__sk, node, list) \ - hlist_nulls_for_each_entry_rcu(__sk, node, list, __sk_common.skc_portaddr_node) +#define udp_portaddr_for_each_entry_rcu(__sk, list) \ + hlist_for_each_entry_rcu(__sk, list, __sk_common.skc_portaddr_node) #define IS_UDPLITE(__sk) (udp_sk(__sk)->pcflag) diff --git a/include/net/sock.h b/include/net/sock.h index c88785a3e76c..5b9562bc478e 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -178,7 +178,7 @@ struct sock_common { int skc_bound_dev_if; union { struct hlist_node skc_bind_node; - struct hlist_nulls_node skc_portaddr_node; + struct hlist_node 
skc_portaddr_node; }; struct proto *skc_prot; possible_net_t skc_net; @@ -670,18 +670,18 @@ static inline void sk_add_bind_node(struct sock *sk, hlist_for_each_entry(__sk, list, sk_bind_node) /** - * sk_nulls_for_each_entry_offset - iterate over a list at a given struct offset + * sk_for_each_entry_offset - iterate over a list at a given struct offset * @tpos: the type * to use as a loop cursor. * @pos: the hlist_node to use as a loop cursor. * @head: the head for your list. * @offset: offset of hlist_node within the struct. * */ -#define sk_nulls_for_each_entry_offset(tpos, pos, head, offset) \ - for (pos = (head)->first; \ -(!is_a_nulls(pos)) && \ +#define sk_for_each_entry_offset_rcu(tpos, pos, head, offset) \ + for (pos = rcu_dereference((head)->first); \ +pos != NULL &&\ ({ tpos = (typeof(*tpos) *)((void *)pos - offset); 1;}); \ -pos = pos->next) +pos = rcu_dereference(pos->next)) static inline struct user_namespace *sk_user_ns(struct sock *sk) { diff --git a/include/net/udp.h b/include/net/udp.h index 92927f729ac8..d870ec1611c4 100644 --- a/include/net/udp.h +++ b/include/net/udp.h @@ -59,7 +59,7 @@ struct udp_skb_cb { * @lock: spinlock protecting changes to head/count */ struct udp_hslot { - struct hlist_nulls_head head; + struct hlist_head head; int count; spinlock_t lock; } __attribute__((aligned(2 * sizeof(long)))); diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c index 08eed5e16df0..3ebca8445d35 100644 --- a/net/ipv4/udp.c +++ b/net/ipv4/udp.c @@ -143,10 +143,9 @@ static int udp_lib_lport_inuse(struct net *net, __u16 num, unsigned int log) { struct sock *sk2; - struct hlist_nulls_node *node; kuid_t uid = sock_i_uid(sk); - sk_nulls_for_each(sk2, node, &hslot->head) { + sk_for_each(sk2, &hslot->head) { if (net_eq(sock_net(sk2), net) && sk2 != sk && (bitmap || udp_sk(sk2)->udp_port_hash == num) && @@ -177,12 +176,11 @@ static int udp_lib_lport_inuse2(struct net *net, __u16 num, bool match_wildcard)) { struct sock *sk2; - struct hlist_nulls_node *node; kuid_t uid = sock_i_uid(sk); 
int res = 0; spin_lock(&hslot2->lock); -
[RFC net-next 1/2] net: add SOCK_RCU_FREE socket flag
We want a generic way to insert an RCU grace period before socket freeing for cases where SLAB_DESTROY_BY_RCU is adding too much overhead. SLAB_DESTROY_BY_RCU strict rules force us to take a reference on the socket sk_refcnt, and it is a performance problem for UDP encapsulation, or TCP synflood behavior, as many CPUs might attempt the atomic operations on a shared sk_refcnt. UDP sockets and TCP listeners can set SOCK_RCU_FREE so that their lookup can use traditional RCU rules, without refcount changes. They can set the flag only once hashed and visible by other cpus. Signed-off-by: Eric Dumazet Cc: Tom Herbert --- include/net/sock.h | 2 ++ net/core/sock.c | 14 +- 2 files changed, 15 insertions(+), 1 deletion(-) diff --git a/include/net/sock.h b/include/net/sock.h index 255d3e03727b..c88785a3e76c 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -438,6 +438,7 @@ struct sock { struct sk_buff *skb); void(*sk_destruct)(struct sock *sk); struct sock_reuseport __rcu *sk_reuseport_cb; + struct rcu_head sk_rcu; }; #define __sk_user_data(sk) ((*((void __rcu **)&(sk)->sk_user_data))) @@ -720,6 +721,7 @@ enum sock_flags { */ SOCK_FILTER_LOCKED, /* Filter cannot be changed anymore */ SOCK_SELECT_ERR_QUEUE, /* Wake select on error queue */ + SOCK_RCU_FREE, /* wait rcu grace period in sk_destruct() */ }; #define SK_FLAGS_TIMESTAMP ((1UL << SOCK_TIMESTAMP) | (1UL << SOCK_TIMESTAMPING_RX_SOFTWARE)) diff --git a/net/core/sock.c b/net/core/sock.c index b67b9aedb230..238a94f879ca 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -1418,8 +1418,12 @@ struct sock *sk_alloc(struct net *net, int family, gfp_t priority, } EXPORT_SYMBOL(sk_alloc); -void sk_destruct(struct sock *sk) +/* Sockets having SOCK_RCU_FREE will call this function after one RCU + * grace period. This is the case for UDP sockets and TCP listeners. 
+ */ +static void __sk_destruct(struct rcu_head *head) { + struct sock *sk = container_of(head, struct sock, sk_rcu); struct sk_filter *filter; if (sk->sk_destruct) @@ -1448,6 +1452,14 @@ void sk_destruct(struct sock *sk) sk_prot_free(sk->sk_prot_creator, sk); } +void sk_destruct(struct sock *sk) +{ + if (sock_flag(sk, SOCK_RCU_FREE)) + call_rcu(&sk->sk_rcu, __sk_destruct); + else + __sk_destruct(&sk->sk_rcu); +} + static void __sk_free(struct sock *sk) { if (unlikely(sock_diag_has_destroy_listeners(sk) && sk->sk_net_refcnt)) -- 2.8.0.rc3.226.g39d4020
[RFC net-next 0/2] udp: use standard RCU rules
Add a generic facility for sockets to be freed after an RCU grace period. Then UDP is changed to no longer use SLAB_DESTROY_BY_RCU, in order to speed up rx processing for traffic encapsulated in UDP. I prepared a patch to convert TCP listeners to this infrastructure, but will post it later, since Tom was mostly interested in UDP. Eric Dumazet (2): net: add SOCK_RCU_FREE socket flag udp: No longer use SLAB_DESTROY_BY_RCU include/linux/udp.h | 8 +- include/net/sock.h | 14 +-- include/net/udp.h | 2 +- net/core/sock.c | 14 ++- net/ipv4/udp.c | 290 +++- net/ipv4/udp_diag.c | 18 ++-- net/ipv6/udp.c | 194 +++ 7 files changed, 177 insertions(+), 363 deletions(-) -- 2.8.0.rc3.226.g39d4020
Re: [net-next RFC 0/4] SO_BINDTOSUBNET
On 03/25/2016 12:25 AM, Tom Herbert wrote: > On Wed, Mar 16, 2016 at 6:19 AM, Gilberto Bertin wrote: >> This is my second attempt to submit an RFC for this patch. >> >> Some arguments for and against it since the first submission: >> * SO_BINDTOSUBNET is an arbitrary option and can be seen as another use >> * case of the SO_REUSEPORT BPF patch >> * but at the same time using BPF requires more work/code on the server >> and since the bind to subnet use case could potentially become a >> common one maybe there is some value in having it as an option instead >> of having to code (either manually or with clang) an eBPF program that >> would do the same > > Gilberto, I'm not sure I understand this argument. Have you > implemented the BPF bind solution? > > Thanks, > Tom Yes, I wrote up a very basic draft for this feature (I didn't know there was already some work going on with SO_ATTACH_REUSEPORT_[CE]BPF). Thanks, Gilberto
Re: veth regression with "don’t modify ip_summed; doing so treats packets with bad checksums as good."
On 03/25/2016 02:59 PM, Vijay Pandurangan wrote: consider two scenarios, where process a sends raw ethernet frames containing UDP packets to b I) process a --> veth --> process b II) process a -> eth -> wire -> eth -> process b I believe (I) is the simplest setup we can create that will replicate this bug. If process a sends frames that contain UDP packets to process b, what is the behaviour we want if the UDP packet *has an incorrect checksum*? It seems to me that I and II should have identical behaviour, and I would think that (II) would not deliver the packets to the application. In (I) with Cong's patch would we be delivering corrupt UDP packets to process b despite an incorrect checksum in (I)? If so, I would argue that this patch isn't right. Checksums are normally used to deal with flaky transport mechanisms, and once a machine receives the frame, we do not keep re-calculating checksums as we move it through various drivers and subsystems. In particular, checksums are NOT a security mechanism and can be easily faked. Since packets sent on one veth never actually hit any unreliable transport before they are received on the peer veth, then there should be no need to checksum packets whose origin is known to be on the local machine. Any frame sent from a socket can be considered to be a local packet in my opinion. That is what Cong's patch does as far as I can tell. Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com
Re: veth regression with "don’t modify ip_summed; doing so treats packets with bad checksums as good."
On Fri, Mar 25, 2016 at 2:59 PM, Vijay Pandurangan wrote: > consider two scenarios, where process a sends raw ethernet frames > containing UDP packets to b > > I) process a --> veth --> process b > > II) process a -> eth -> wire -> eth -> process b > > I believe (I) is the simplest setup we can create that will replicate this > bug. > > If process a sends frames that contain UDP packets to process b, what > is the behaviour we want if the UDP packet *has an incorrect > checksum*? > > It seems to me that I and II should have identical behaviour, and I > would think that (II) would not deliver the packets to the > application. > > In (I) with Cong's patch would we be delivering corrupt UDP packets to > process b despite an incorrect checksum in (I)? > Right, I thought packet socket does the checksum by itself, so the problem is: if user-space does the checksum like packet socket, its checksum could be wrong therefore we can not trust it on RX path once it loops back. Let me think about it again.
Re: veth regression with "don’t modify ip_summed; doing so treats packets with bad checksums as good."
On Fri, Mar 25, 2016 at 1:56 PM, Ben Greear wrote: > On 03/24/2016 10:33 PM, Cong Wang wrote: > >> Here we go: >> >> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c >> index 1ecfa71..ab66080 100644 >> --- a/net/packet/af_packet.c >> +++ b/net/packet/af_packet.c >> @@ -1925,6 +1925,7 @@ static int packet_sendmsg_spkt(struct socket >> *sock, struct msghdr *msg, >> goto out_unlock; >> } >> >> + skb->ip_summed = CHECKSUM_UNNECESSARY; >> skb->protocol = proto; >> skb->dev = dev; >> skb->priority = sk->sk_priority; >> @@ -2496,6 +2497,7 @@ static int tpacket_fill_skb(struct packet_sock >> *po, struct sk_buff *skb, >> >> ph.raw = frame; >> >> + skb->ip_summed = CHECKSUM_UNNECESSARY; >> skb->protocol = proto; >> skb->dev = dev; >> skb->priority = po->sk.sk_priority; >> @@ -2805,6 +2807,7 @@ static struct sk_buff *packet_alloc_skb(struct >> sock *sk, size_t prepad, >> skb_put(skb, linear); >> skb->data_len = len - linear; >> skb->len += len - linear; >> + skb->ip_summed = CHECKSUM_UNNECESSARY; >> >> return skb; >> } > > > I have tested UDP, TCP, TCPv6 and custom Ethernet frames across a veth pair. > > And, UDP, TCP, and pktgen across a pair of veth pairs > bridged by my user-space packet filter. > > All of these tests work fine with your patch as far as I can tell. > > So, you can add: > > Tested-by: Ben Greear Thanks for testing! I will send it out formally after I audit more drivers, also let's give Tom some time to respond. > > That said, it could easily break some drivers and/or other scenarios that I > have not tested, so at the least it should cook a while upstream before > going into the > stable tree. Yeah. Thanks.
[PATCH net] inet: add proper locking in __inet{6}_lookup()
From: Eric Dumazet Blocking BH in __inet{6}_lookup() is not correct, as the lookups are done using RCU protection. It matters only starting from Lorenzo Colitti patches to destroy a TCP socket, since rcu_read_lock() is already held by other users of these functions. This can be backported to all known stable versions, since TCP got RCU lookups back in 2.6.29 : Even if iproute2 contained no code to trigger the bug, some user programs could have used the API. Signed-off-by: Eric Dumazet Cc: Lorenzo Colitti --- include/net/inet_hashtables.h |5 ++--- net/ipv6/inet6_hashtables.c |5 ++--- 2 files changed, 4 insertions(+), 6 deletions(-) diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h index 50f635c2c536..4575d4f9509a 100644 --- a/include/net/inet_hashtables.h +++ b/include/net/inet_hashtables.h @@ -284,7 +284,6 @@ static inline struct sock *inet_lookup_listener(struct net *net, * Sockets in TCP_CLOSE state are _always_ taken out of the hash, so we need * not check it for lookups anymore, thanks Alexey. -DaveM * - * Local BH must be disabled here. */ struct sock *__inet_lookup_established(struct net *net, struct inet_hashinfo *hashinfo, @@ -326,10 +325,10 @@ static inline struct sock *inet_lookup(struct net *net, { struct sock *sk; - local_bh_disable(); + rcu_read_lock(); sk = __inet_lookup(net, hashinfo, skb, doff, saddr, sport, daddr, dport, dif); - local_bh_enable(); + rcu_read_unlock(); return sk; } diff --git a/net/ipv6/inet6_hashtables.c b/net/ipv6/inet6_hashtables.c index 70f2628be6fa..9dcd0481fae2 100644 --- a/net/ipv6/inet6_hashtables.c +++ b/net/ipv6/inet6_hashtables.c @@ -200,11 +200,10 @@ struct sock *inet6_lookup(struct net *net, struct inet_hashinfo *hashinfo, { struct sock *sk; - local_bh_disable(); + rcu_read_lock(); sk = __inet6_lookup(net, hashinfo, skb, doff, saddr, sport, daddr, ntohs(dport), dif); - local_bh_enable(); - + rcu_read_unlock(); return sk; } EXPORT_SYMBOL_GPL(inet6_lookup);
Re: RESEND: Easily reproducible kernel panic due to netdev all_adj_list refcnt handling
On 03/25/2016 02:43 PM, Matthias Schiffer wrote: We've tried your patch, and it changes the symptoms a bit, but doesn't fix the panic. I've attached kernel logs of the crash both before and after applying the patch. One note: I did not reproduce this issue myself, it was first reported in [1], and then forwarded to the batman-adv issue tracker [2] by me. Regards, Matthias [1] https://github.com/freifunk-gluon/gluon/issues/680 [2] https://www.open-mesh.org/issues/247 On the off chance it helps, the version of the patch I integrated locally takes a somewhat different approach than the one I sent to the mailing list (propagates adj_list refcnts). I've attached it in case it's useful. I haven't submitted this upstream yet as it's still rather ugly. I'm of the opinion that the whole "every device knows every upperdev and lowerdev in its tree" model is rather broken, and the patch is just working around a design that needs some rework. Thanks, Andrew Collins commit df318544e282c6ab5bdc4595658fc1cf8739d091 Author: Andrew Collins Date: Fri Mar 25 16:04:59 2016 -0600 This fixes a relatively easily reproducible kernel panic related to the all_adj_list handling for netdevs in recent kernels. The following sequence of commands will reproduce the issue: ip link add link eth0 name eth0.100 type vlan id 100 ip link add link eth0 name eth0.200 type vlan id 200 ip link add name testbr type bridge ip link set eth0.100 master testbr ip link set eth0.200 master testbr ip link add link testbr mac0 type macvlan ip link delete dev testbr This creates an upper/lower tree of (excuse the poor ASCII art): /---eth0.100-eth0 mac0-testbr- \---eth0.200-eth0 When testbr is deleted, the all_adj_lists are walked, and eth0 is deleted twice from the mac0 list. Unfortunately, during setup in __netdev_upper_dev_link, only one reference to eth0 is added, so this results in a panic. This change adds reference count propagation so things are handled properly. 
diff --git a/net/core/dev.c b/net/core/dev.c index b9bcbe7..4b4ef6b 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -5471,6 +5471,7 @@ static inline bool netdev_adjacent_is_neigh_list(struct net_device *dev, static int __netdev_adjacent_dev_insert(struct net_device *dev, struct net_device *adj_dev, + u16 ref_nr, struct list_head *dev_list, void *private, bool master) { @@ -5480,7 +5481,7 @@ static int __netdev_adjacent_dev_insert(struct net_device *dev, adj = __netdev_find_adj(adj_dev, dev_list); if (adj) { - adj->ref_nr++; + adj->ref_nr += ref_nr; return 0; } @@ -5490,7 +5491,7 @@ static int __netdev_adjacent_dev_insert(struct net_device *dev, adj->dev = adj_dev; adj->master = master; - adj->ref_nr = 1; + adj->ref_nr = ref_nr; adj->private = private; dev_hold(adj_dev); @@ -5529,6 +5530,7 @@ free_adj: static void __netdev_adjacent_dev_remove(struct net_device *dev, struct net_device *adj_dev, + u16 ref_nr, struct list_head *dev_list) { struct netdev_adjacent *adj; @@ -5541,10 +5543,10 @@ static void __netdev_adjacent_dev_remove(struct net_device *dev, BUG(); } - if (adj->ref_nr > 1) { - pr_debug("%s to %s ref_nr-- = %d\n", dev->name, adj_dev->name, - adj->ref_nr-1); - adj->ref_nr--; + if (adj->ref_nr > ref_nr) { + pr_debug("%s to %s ref_nr-%d = %d\n", dev->name, adj_dev->name, + ref_nr, adj->ref_nr-ref_nr); + adj->ref_nr -= ref_nr; return; } @@ -5563,21 +5565,22 @@ static void __netdev_adjacent_dev_remove(struct net_device *dev, static int __netdev_adjacent_dev_link_lists(struct net_device *dev, struct net_device *upper_dev, + u16 ref_nr, struct list_head *up_list, struct list_head *down_list, void *private, bool master) { int ret; - ret = __netdev_adjacent_dev_insert(dev, upper_dev, up_list, private, - master); + ret = __netdev_adjacent_dev_insert(dev, upper_dev, ref_nr, up_list, + private, master); if (ret) return ret; - ret = __netdev_adjacent_dev_insert(upper_dev, dev, down_list, private, - false); + ret = __netdev_adjacent_dev_insert(upper_dev, dev, 
ref_nr, down_list, + private, false); if (ret) { - __netdev_adjacent_dev_remove(dev, upper_dev, up_list); + __netdev_adjacent_dev_remove(dev, upper_dev, ref_nr, up_list); return ret; } @@ -5585,9 +5588,10 @@ static int __netdev_adjacent_dev_link_lists(struct net_device *dev, } static int __netdev_adjacent_dev_link(struct net_device *dev, - struct net_device *upper_dev) + struct net_device *upper_dev, + u16 ref_nr) { - return __netdev_adjacent_dev_link_lists(dev, upper_dev, + return __netdev_adjacent_dev_link_lists(dev, upper_dev, ref_nr, &dev->all_adj_list.upper, &upper_dev->all_adj_list.lower, NULL, false); @@ -5595,17
[PATCH-RFC] drivers: net: ethernet: intel: e1000e: fix ethtool autoneg off for fiber
From: Steve Shih This patch fixes the issues for disabling auto-negotiation and forcing speed and duplex settings for the fiber media. For fiber media, e1000_get_settings should return ETH_TP_MDI_INVALID for eth_tp_mdix_ctrl instead of ETH_TP_MDI_AUTO so subsequent e1000_set_settings call would not fail with -EOPNOTSUPP. e1000_set_spd_dplx should not automatically turn autoneg back on for forced 1000 Mbps full duplex settings. Cc: danie...@fifo99.com Cc: xe-ker...@external.cisco.com Signed-off-by: Steve Shih --- drivers/net/ethernet/intel/e1000e/ethtool.c | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/intel/e1000e/ethtool.c b/drivers/net/ethernet/intel/e1000e/ethtool.c index 6cab1f3..cd03dcd 100644 --- a/drivers/net/ethernet/intel/e1000e/ethtool.c +++ b/drivers/net/ethernet/intel/e1000e/ethtool.c @@ -201,6 +201,9 @@ static int e1000_get_settings(struct net_device *netdev, else ecmd->eth_tp_mdix_ctrl = hw->phy.mdix; + if (hw->phy.media_type != e1000_media_type_copper) + ecmd->eth_tp_mdix_ctrl = ETH_TP_MDI_INVALID; + return 0; } @@ -236,8 +239,7 @@ static int e1000_set_spd_dplx(struct e1000_adapter *adapter, u32 spd, u8 dplx) mac->forced_speed_duplex = ADVERTISE_100_FULL; break; case SPEED_1000 + DUPLEX_FULL: - mac->autoneg = 1; - adapter->hw.phy.autoneg_advertised = ADVERTISE_1000_FULL; + mac->forced_speed_duplex = ADVERTISE_1000_FULL; break; case SPEED_1000 + DUPLEX_HALF: /* not supported */ default: -- 2.5.0
Re: veth regression with "don’t modify ip_summed; doing so treats packets with bad checksums as good."
Consider two scenarios, where process a sends raw ethernet frames
containing UDP packets to b:

 I)  process a --> veth --> process b
 II) process a -> eth -> wire -> eth -> process b

I believe (I) is the simplest setup we can create that will replicate
this bug. If process a sends frames that contain UDP packets to process
b, what is the behaviour we want if the UDP packet *has an incorrect
checksum*?

It seems to me that (I) and (II) should have identical behaviour, and I
would think that (II) would not deliver the packets to the application.

With Cong's patch, would we be delivering corrupt UDP packets to
process b despite an incorrect checksum in (I)? If so, I would argue
that this patch isn't right.

On Fri, Mar 25, 2016 at 4:56 PM, Ben Greear wrote:
> On 03/24/2016 10:33 PM, Cong Wang wrote:
>
>> Here we go:
>>
>> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
>> index 1ecfa71..ab66080 100644
>> --- a/net/packet/af_packet.c
>> +++ b/net/packet/af_packet.c
>> @@ -1925,6 +1925,7 @@ static int packet_sendmsg_spkt(struct socket
>> *sock, struct msghdr *msg,
>> 		goto out_unlock;
>> 	}
>>
>> +	skb->ip_summed = CHECKSUM_UNNECESSARY;
>> 	skb->protocol = proto;
>> 	skb->dev = dev;
>> 	skb->priority = sk->sk_priority;
>> @@ -2496,6 +2497,7 @@ static int tpacket_fill_skb(struct packet_sock
>> *po, struct sk_buff *skb,
>>
>> 	ph.raw = frame;
>>
>> +	skb->ip_summed = CHECKSUM_UNNECESSARY;
>> 	skb->protocol = proto;
>> 	skb->dev = dev;
>> 	skb->priority = po->sk.sk_priority;
>> @@ -2805,6 +2807,7 @@ static struct sk_buff *packet_alloc_skb(struct
>> sock *sk, size_t prepad,
>> 	skb_put(skb, linear);
>> 	skb->data_len = len - linear;
>> 	skb->len += len - linear;
>> +	skb->ip_summed = CHECKSUM_UNNECESSARY;
>>
>> 	return skb;
>> }
>
> I have tested UDP, TCP, TCPv6 and custom Ethernet frames across a veth pair.
>
> And, UDP, TCP, and pktgen across a pair of veth pairs
> bridged by my user-space packet filter.
>
> All of these tests work fine with your patch as far as I can tell.
>
> So, you can add:
>
> Tested-by: Ben Greear
>
> That said, it could easily break some drivers and/or other scenarios that I
> have not tested, so at the least it should cook a while upstream before
> going into the stable tree.
>
> Thanks,
> Ben
>
> --
> Ben Greear
> Candela Technologies Inc  http://www.candelatech.com
>
Re: veth regression with "don’t modify ip_summed; doing so treats packets with bad checksums as good."
On Fri, Mar 25, 2016 at 10:35 AM, Ben Greear wrote:
>
> On 03/24/2016 10:24 PM, Vijay Pandurangan wrote:
>>
>> On Fri, Mar 25, 2016 at 1:07 AM, Ben Greear wrote:
>>>
>>> On 03/24/2016 09:45 PM, Vijay Pandurangan wrote:
>>>> Actually, maybe they should be set to CHECKSUM_PARTIAL if we want
>>>> veth to drop the packets if they have bad checksums before they hit
>>>> the application level.
>>>
>>> VETH is pretty special in that when you transmit a frame on one
>>> device, its pair receives it, and unless there is RAM corruption
>>> or bugs in the kernel, then it cannot be corrupted.
>>
>> Yeah, you're right that that's an optimization. However, I think that
>> we should first ensure that
>>
>> a->veth->b
>>
>> operates exactly like:
>>
>> a->physical eth 1 -> physical eth 2->b
>>
>> in all cases. Once we have that working everywhere we could think
>> about optimizations.
>>
>> If we're willing to refactor, we could implement the optimization by
>> allowing veth devices to know what their immediate peer is. If a
>> veth knows it's talking to another veth, it could under some
>> circumstances elide checksum calculation and verification. I'm not
>> sure what abstractions that would break, though. What do you guys
>> think?
>
> veth ALWAYS transmits to another VETH. The problem is that when veth is
> given a packet to transmit, it is difficult to know where that packet
> came from.

Yeah, you're totally right -- I guess what I was trying to express (but
failed at) was that we might need to be able to track the original
source of the packet for optimizations.

> And, adding software checksumming to veth for every frame would be a huge
> performance hit.
>
> Thanks,
> Ben
>
> --
> Ben Greear
> Candela Technologies Inc  http://www.candelatech.com
Re: veth regression with "don’t modify ip_summed; doing so treats packets with bad checksums as good."
On 03/24/2016 10:33 PM, Cong Wang wrote:

Here we go:

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 1ecfa71..ab66080 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -1925,6 +1925,7 @@ static int packet_sendmsg_spkt(struct socket *sock, struct msghdr *msg,
 		goto out_unlock;
 	}

+	skb->ip_summed = CHECKSUM_UNNECESSARY;
 	skb->protocol = proto;
 	skb->dev = dev;
 	skb->priority = sk->sk_priority;
@@ -2496,6 +2497,7 @@ static int tpacket_fill_skb(struct packet_sock *po, struct sk_buff *skb,

 	ph.raw = frame;

+	skb->ip_summed = CHECKSUM_UNNECESSARY;
 	skb->protocol = proto;
 	skb->dev = dev;
 	skb->priority = po->sk.sk_priority;
@@ -2805,6 +2807,7 @@ static struct sk_buff *packet_alloc_skb(struct sock *sk, size_t prepad,
 	skb_put(skb, linear);
 	skb->data_len = len - linear;
 	skb->len += len - linear;
+	skb->ip_summed = CHECKSUM_UNNECESSARY;

 	return skb;
 }

I have tested UDP, TCP, TCPv6 and custom Ethernet frames across a veth
pair.

And, UDP, TCP, and pktgen across a pair of veth pairs bridged by my
user-space packet filter.

All of these tests work fine with your patch as far as I can tell.

So, you can add:

Tested-by: Ben Greear

That said, it could easily break some drivers and/or other scenarios
that I have not tested, so at the least it should cook a while upstream
before going into the stable tree.

Thanks,
Ben

--
Ben Greear
Candela Technologies Inc  http://www.candelatech.com
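For readers following the debate: marking an skb CHECKSUM_UNNECESSARY tells the stack the checksum has already been verified, so a datagram whose checksum field disagrees with the RFC 1071 one's-complement sum would still be delivered. The userspace sketch below (not kernel code) shows that computation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* RFC 1071 Internet checksum over a byte buffer, userspace sketch.
 * This is the verification that CHECKSUM_UNNECESSARY tells the stack
 * to skip -- the crux of the veth/AF_PACKET discussion above. */
static uint16_t inet_csum(const void *data, size_t len)
{
	const uint8_t *p = data;
	uint32_t sum = 0;

	while (len > 1) {		/* sum 16-bit big-endian words */
		sum += ((uint32_t)p[0] << 8) | p[1];
		p += 2;
		len -= 2;
	}
	if (len)			/* pad a trailing odd byte with zero */
		sum += (uint32_t)p[0] << 8;
	while (sum >> 16)		/* fold carries back into 16 bits */
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;		/* one's complement of the sum */
}
```

With the RFC 1071 sample bytes 00 01 f2 03 f4 f5 f6 f7 this yields 0x220d, and re-summing the data with that checksum appended folds to zero, which is how a receiver validates a packet.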
Re: RESEND: Easily reproducible kernel panic due to netdev all_adj_list refcnt handling
On 02/23/2016 11:29 PM, Andrew Collins wrote:
> I'm running into a relatively easily reproducible kernel panic related to
> the all_adj_list handling for netdevs in recent kernels.
>
> The following sequence of commands will reproduce the issue:
>
> ip link add link eth0 name eth0.100 type vlan id 100
> ip link add link eth0 name eth0.200 type vlan id 200
> ip link add name testbr type bridge
> ip link set eth0.100 master testbr
> ip link set eth0.200 master testbr
> ip link add link testbr mac0 type macvlan
> ip link delete dev testbr
>
> This creates an upper/lower tree of (excuse the poor ASCII art):
>
>              /---eth0.100---eth0
> mac0---testbr
>              \---eth0.200---eth0
>
> When testbr is deleted, the all_adj_lists are walked, and eth0 is deleted
> twice from the mac0 list. Unfortunately, during setup in
> __netdev_upper_dev_link, only one reference to eth0 is added, so this
> results in the following panic trace:
>
> [68235.234564] tried to remove device eth0 from mac0

Hi,

I got a similar report which looks like the same issue. Our setup is a
bit more complicated, it also involves batman-adv:

* 5 VLANs (eth0.2, eth0.3, eth0.100, eth0.101, eth0.102)
* batman-adv device bat0 is master of eth0.100, eth0.101, eth0.102
* Bridge br-wan is master of eth0.2
* Bridge br-client is master of bat0 and eth0.3
* macvlan device local-node on top of br-client

The setup is OpenWrt-based, which has a network config daemon which will
in some cases remove bridge ports/slaves when the corresponding devices
lose carrier (and I think br-client and bat0 get deleted in this case,
not completely sure about this). The crash occurs when eth0 goes down.

We've tried your patch, and it changes the symptoms a bit, but doesn't
fix the panic. I've attached kernel logs of the crash both before and
after applying the patch.

One note: I did not reproduce this issue myself; it was first reported
in [1], and then forwarded to the batman-adv issue tracker [2] by me.
Regards,
Matthias

[1] https://github.com/freifunk-gluon/gluon/issues/680
[2] https://www.open-mesh.org/issues/247

[attachment: U-Boot SPL 2015.01 / U-Boot 2015.01 boot log from a LeMaker
Banana Pi (Allwinner A20, 1024 MiB DRAM), booting the OpenWrt ARM
Linux-3.18.27 uImage from SD card]
[attachment: kernel boot log -- Linux version 3.18.27 (OpenWrt/Linaro GCC
4.8-2014.04), #1 SMP PREEMPT Fri Mar 18 19:27:10 CET 2016, LeMaker Banana
Pi, ARMv7 processor 410fc074; the attached log is truncated before the
crash]
Re: [PATCH RFC 3/9] net: Add fast receive encapsulation
From: Tom Herbert
Date: Wed, 23 Mar 2016 15:36:52 -0700

> +{
> +	struct udp_sock *up = udp_sk(sk);
> +	int is_udplite = IS_UDPLITE(sk);
> +
> +	int (*encap_rcv)(struct sock *sk, struct sk_buff *skb);
> +

Small nit, please put this encap_rcv function pointer declaration at the
top of the local variable list.
Re: [PATCH] ip.7: Fix incorrect sockopt name
Hello Benjamin,

On 03/22/2016 09:28 AM, Benjamin Poirier wrote:
> "IP_LEAVE_GROUP" does not exist. It was perhaps a confusion with
> MCAST_LEAVE_GROUP. Change the text to IP_DROP_MEMBERSHIP which has the same
> function as MCAST_LEAVE_GROUP and is documented in the ip.7 man page.
>
> Reference:
> Linux kernel net/ipv4/ip_sockglue.c do_ip_setsockopt()

Thanks! Applied.

Cheers,
Michael

> Cc: Radek Pazdera
> Signed-off-by: Benjamin Poirier
> ---
>  man7/ip.7 | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/man7/ip.7 b/man7/ip.7
> index 3905573..37e2c86 100644
> --- a/man7/ip.7
> +++ b/man7/ip.7
> @@ -376,7 +376,7 @@ a given multicast group that come from a given source.
>  If the application has subscribed to multiple sources within
>  the same group, data from the remaining sources will still be delivered.
>  To stop receiving data from all sources at once, use
> -.BR IP_LEAVE_GROUP .
> +.BR IP_DROP_MEMBERSHIP .
>  .IP
>  Argument is an
>  .I ip_mreq_source

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
Re: [PATCH v2] netpoll: Fix extra refcount release in netpoll_cleanup()
From: Bjorn Helgaas
Date: Fri, 25 Mar 2016 11:46:39 -0500

> You're right, there is an issue here. I reproduced a problem with a
> bond device. bond_netpoll_setup() calls __netpoll_setup() directly
> (not netpoll_setup()). I'll debug it more; just wanted to let you
> know there *is* a problem with this patch.

I bet that's why the assignment to np->dev and the reference counting
were separated in the first place :-/

Indeed, commit 30fdd8a082a00126a6feec994e43e8dc12f5bccb:

commit 30fdd8a082a00126a6feec994e43e8dc12f5bccb
Author: Jiri Pirko
Date:   Tue Jul 17 05:22:35 2012 +

    netpoll: move np->dev and np->dev_name init into __netpoll_setup()

    Signed-off-by: Jiri Pirko
    Signed-off-by: David S. Miller
Re: [PATCH net-next] net: dsa: mv88e6xxx: fix 6185 hardware bridging
From: Vivien Didelot
Date: Fri, 25 Mar 2016 13:20:30 -0400

> David, please ignore this patch for the moment.

Ok.
Re: veth regression with "don’t modify ip_summed; doing so treats packets with bad checksums as good."
From: Ben Greear
Date: Fri, 25 Mar 2016 10:14:46 -0700

> Anyway, you know the stack and drivers better than me, so if you
> think Cong's patch is valid, then I'll test it and make sure it
> works in my setups at least.

It probably is, I'm just waiting to see if Tom Herbert will give some
feedback or not as this is an area he understands well.
Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as drain mode
On Fri, Mar 25, 2016 at 1:00 PM, Eric Dumazet wrote:
> On Fri, 2016-03-25 at 12:31 -0400, Craig Gallek wrote:
>
>> I believe the issue here is that closing the listen sockets will drop
>> any connections that are in the listen queue but have not been
>> accepted yet. In the case of reuseport, you could in theory drain
>> those queues into the non-closed sockets, but that probably has some
>> interesting consequences...
>
> It is more complicated than this.
>
> Ideally, no TCP connection should be dropped during a server change.
>
> The idea is to let the old program run as long as:
> 1) It has established TCP sessions
> 2) Some SYN_RECV pseudo requests are still around
>
> Once 3WHS completes for these SYN_RECV, children are queued into
> listener accept queues.
>
> But the idea is to direct all new SYN packets to the 'new' process and
> its listeners. (New SYN_RECV should be created on behalf of the new
> listeners only)
>
> In some environments, the listeners are simply transferred via FD
> passing, from the 'old process' to the new one.

Right. Comparatively, one of the nice features of the BPF variant is
that the sockets in the old process can passively enter listen_off state
solely with changes initiated by the new process (change the bpf filter
for the group).

By the way, if I read correctly, the listen_off feature was already
possible without kernel changes prior to fast reuseport by changing
SO_BINDTODEVICE on the old process's sockets to effectively segment them
into a separate reuseport group. With fast reuseport, sk_bound_dev_if
state equivalence is checked on joining a group, but the socket is not
removed from the array when that syscall is made, so this does not work.
Re: netpoll rtnl_dereference() usage
On Fri, Mar 25, 2016 at 12:30:32PM -0500, Bjorn Helgaas wrote:
> Hi Neil,
>
> Since we're looking at netpoll, here's another question (or two).
> 0790bbb68f9d ("netpoll: cleanup sparse warnings") adds this:
>
> @@ -1236,7 +1236,11 @@ void __netpoll_cleanup(struct netpoll *np)
>  	struct netpoll_info *npinfo;
>  	unsigned long flags;
>
> -	npinfo = np->dev->npinfo;
> +	/* rtnl_dereference would be preferable here but
> +	 * rcu_cleanup_netpoll path can put us in here safely without
> +	 * holding the rtnl, so plain rcu_dereference it is
> +	 */
> +	npinfo = rtnl_dereference(np->dev->npinfo);
>  	if (!npinfo)
>  		return;
>
> The comment seems to contradict the code: the comment says "we would
> like to use rtnl_dereference(), but we have to use rcu_dereference()."
> But the code in fact *does* use rtnl_dereference().

It's the comment that went awry. I remember writing that patch; I
initially thought we had to use rcu_dereference there, but I wound up
finding a way to keep the rtnl lock held, so rtnl_deref should be ok. I
must have just forgotten to fix up the comment.

> Also, "rcu_cleanup_netpoll" doesn't exist; maybe it's a typo for
> rcu_cleanup_netpoll_info()? I don't see the path that leads from
> rcu_cleanup_netpoll_info() to __netpoll_cleanup(), but I don't claim
> to understand the netpoll async subtleties.

Correct again. It's the rcu callback rcu_cleanup_netpoll_info that I'm
referring to there, and the comment was written initially when
rcu_cleanup_netpoll_info was called cleanup_netpoll_info and called
forward into __netpoll_cleanup (in my development patch versions). That
comment should really just be re-written. I'm happy to do so if you
like.

Best
Neil

> Bjorn
Re: [RFT Patch net 1/2] ipv6: invalidate the socket cached route on pmtu events if possible
On Fri, 2016-03-25 at 10:17 -0700, Cong Wang wrote:
> 1) sock lock protects the whole update: the whole check, update, recheck,
> set logic, to make sure another CPU will not do the same to the same socket
> at the same time.
>
> 2) the dst itself is safe, because it is always refcounted, and we use xchg()
> to change the pointer in sk_dst_cache.
>
> Or am I still missing anything here?

As TCP always locks the socket before doing its heavy stuff, it can use
a variant of the sk_dst_cache manipulations that does not use extra
atomic operations.

But UDP gets xchg() to safely exchange sk_dst_cache, because we do not
feel locking the socket is needed for UDP for typical uses (!cork).

If you hold the socket lock in the ICMP handler, then it would be
inconsistent with udp sendmsg() where we do not hold the socket lock.

Since I believe udp sendmsg() is fine, I do believe you do not need to
lock the socket, nor care about the socket being owned by the user.
netpoll rtnl_dereference() usage
Hi Neil,

Since we're looking at netpoll, here's another question (or two).
0790bbb68f9d ("netpoll: cleanup sparse warnings") adds this:

@@ -1236,7 +1236,11 @@ void __netpoll_cleanup(struct netpoll *np)
 	struct netpoll_info *npinfo;
 	unsigned long flags;

-	npinfo = np->dev->npinfo;
+	/* rtnl_dereference would be preferable here but
+	 * rcu_cleanup_netpoll path can put us in here safely without
+	 * holding the rtnl, so plain rcu_dereference it is
+	 */
+	npinfo = rtnl_dereference(np->dev->npinfo);
 	if (!npinfo)
 		return;

The comment seems to contradict the code: the comment says "we would
like to use rtnl_dereference(), but we have to use rcu_dereference()."
But the code in fact *does* use rtnl_dereference().

Also, "rcu_cleanup_netpoll" doesn't exist; maybe it's a typo for
rcu_cleanup_netpoll_info()? I don't see the path that leads from
rcu_cleanup_netpoll_info() to __netpoll_cleanup(), but I don't claim
to understand the netpoll async subtleties.

Bjorn
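For context on why the distinction matters: rtnl_dereference() is only legal while the RTNL mutex is held, and lockdep-enabled kernels check exactly that, whereas rcu_dereference() only requires an RCU read-side section. The sketch below is a toy userspace model of that check (illustrative names, not the kernel implementation); a plain bool stands in for the lock state.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for rtnl_lock()/rtnl_unlock(). */
static bool rtnl_held_sim;

static void rtnl_lock_sim(void)   { rtnl_held_sim = true; }
static void rtnl_unlock_sim(void) { rtnl_held_sim = false; }

/* Return the pointer, but insist the "RTNL" is held -- mimicking the
 * proving that rtnl_dereference() does under CONFIG_PROVE_RCU. Calling
 * this from a path that cannot guarantee the lock (as the quoted
 * comment claims about the rcu cleanup path) would trip the check. */
static void *rtnl_dereference_sim(void *p)
{
	assert(rtnl_held_sim && "rtnl_dereference outside RTNL");
	return p;
}
```

So the code and the comment cannot both be right: if the cleanup path truly runs without RTNL, the rtnl_dereference() there would be flagged by lockdep, which matches Neil's conclusion that the comment is stale.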
Re: [PATCH net-next] net: dsa: mv88e6xxx: fix 6185 hardware bridging
Hi, Vivien Didelotwrites: > The 88E6185 switch also has a MapDA bit in its Port Control 2 register. > When this bit is cleared, all frames are sent out to the CPU port. > > Set this bit to rely on ATU hits and direct the frame out of the correct > ports, and thus enable hardware bridging. > > Signed-off-by: Vivien Didelot Andrew, I sent this patch too fast. It requires a proper support of the ATU in order to populate the FDB correctly. I send that soon. David, please ignore this patch for the moment. Thanks, Vivien
Re: [RFT Patch net 1/2] ipv6: invalidate the socket cached route on pmtu events if possible
On Thu, Mar 24, 2016 at 6:51 PM, Eric Dumazet wrote:
> On Thu, 2016-03-24 at 17:15 -0700, Cong Wang wrote:
>
>> My understanding is that bh_lock_sock() prevents concurrent
>> access to the sock struct. Since this is in softirq context, multiple
>> CPUs could call this function concurrently, so the whole pmtu
>> update needs to be done atomically.
>>
>> UDP, on the other hand, doesn't do this logic, it just looks up
>> the dst and saves it in sk_dst_cache.
>
> Two ICMP messages processed on two different cpus can still get two
> different sockets pointing to the same dst.

Yeah, and dst is refcounted so a shared dst will not be freed by both.

> I do not see how dst pmtu update could be protected by a lock on each
> socket. It would require a dst lock, or some simple memory writes that
> do not require special synchro.

I am lost here because you removed the dst lock in ipv4_sk_update_pmtu():

commit 7f502361531e9eecb396cf99bdc9e9a59f7ebd7f
Author: Eric Dumazet
Date:   Mon Jun 30 01:26:23 2014 -0700

    ipv4: irq safe sk_dst_[re]set() and ipv4_sk_update_pmtu() fix

    We have two different ways to handle changes to sk->sk_dst

    First way (used by TCP) assumes socket lock is owned by caller, and
    uses no extra lock: __sk_dst_set() & __sk_dst_reset()

    Another way (used by UDP) uses sk_dst_lock because socket lock is
    not always taken. Note that sk_dst_lock is not softirq safe.

    These ways are not interchangeable for a given socket type.

    ipv4_sk_update_pmtu(), added in linux-3.8, added a race, as it used
    the socket lock as synchronization, but users might be UDP sockets.

    Instead of converting sk_dst_lock to a softirq safe version, use
    xchg() as we did for sk_rx_dst in commit e47eb5dfb296b ("udp: ipv4:
    do not use sk_dst_lock from softirq context")

    In a follow up patch, we probably can remove sk_dst_lock, as it is
    only used in IPv6.
My understanding is that:

1) sock lock protects the whole update: the whole check, update, recheck,
set logic, to make sure another CPU will not do the same to the same
socket at the same time.

2) the dst itself is safe, because it is always refcounted, and we use
xchg() to change the pointer in sk_dst_cache.

Or am I still missing anything here?

Thanks.
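Point 2) is the key to why no socket lock is needed: the atomic exchange hands exactly one thread the old pointer, so only that thread drops the old reference. The sketch below is a toy userspace model of the pattern (illustrative names, not kernel code), using C11 atomics in place of the kernel's xchg().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Toy dst with a refcount, standing in for struct dst_entry. */
struct toy_dst {
	atomic_int refcnt;
};

/* Stand-in for sk->sk_dst_cache on one socket. */
static _Atomic(struct toy_dst *) sk_dst_cache;

static void toy_dst_release(struct toy_dst *dst)
{
	atomic_fetch_sub(&dst->refcnt, 1);
}

/* Install @dst (the caller already holds a reference on it) and drop
 * the reference on whatever was cached before. Because the exchange is
 * atomic, two CPUs racing here each receive a distinct "old" pointer,
 * so no reference is ever dropped twice -- the property that makes the
 * lockless UDP path safe. */
static void toy_sk_dst_set(struct toy_dst *dst)
{
	struct toy_dst *old = atomic_exchange(&sk_dst_cache, dst);

	if (old)
		toy_dst_release(old);
}
```

Installing a second dst releases exactly the first one's reference and leaves the new one held, mirroring the sk_dst_cache behaviour described above.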
Re: veth regression with "don’t modify ip_summed; doing so treats packets with bad checksums as good."
On 03/25/2016 09:44 AM, David Miller wrote:
> From: Ben Greear
> Date: Fri, 25 Mar 2016 09:10:58 -0700
>
>> I am suspicious that this will break at least some drivers. I
>> grepped around for ip_summed, and found this, for instance:
>>
>> davicom/dm9000.c
>>
>> 	/* The DM9000 is not smart enough to leave fragmented packets
>
> A really old (circa 1997), not-oft-used driver such as this is not
> the place to be looking for correct usage of skb->ip_summed semantics.
>
> I would never let whatever a driver like this does influence whether
> I apply a bug fix or not.

Point is, it took me 5 minutes to find that, and I did not look hard at
many other drivers. e1000e and igb appear to be fine, and maybe the rest
of them are too. Lord knows what other strange setups might be affected
by the ip_summed change.

Anyway, you know the stack and drivers better than me, so if you think
Cong's patch is valid, then I'll test it and make sure it works in my
setups at least.

Thanks,
Ben

--
Ben Greear
Candela Technologies Inc  http://www.candelatech.com
[PATCH net-next] net: dsa: mv88e6xxx: fix 6185 hardware bridging
The 88E6185 switch also has a MapDA bit in its Port Control 2 register.
When this bit is cleared, all frames are sent out to the CPU port.

Set this bit to rely on ATU hits and direct the frame out of the correct
ports, and thus enable hardware bridging.

Signed-off-by: Vivien Didelot
---
 drivers/net/dsa/mv88e6xxx.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/dsa/mv88e6xxx.c b/drivers/net/dsa/mv88e6xxx.c
index 4332b15..85ed3e2 100644
--- a/drivers/net/dsa/mv88e6xxx.c
+++ b/drivers/net/dsa/mv88e6xxx.c
@@ -2586,7 +2586,8 @@ static int mv88e6xxx_setup_port(struct dsa_switch *ds, int port)
 	reg = 0;
 	if (mv88e6xxx_6352_family(ds) || mv88e6xxx_6351_family(ds) ||
 	    mv88e6xxx_6165_family(ds) || mv88e6xxx_6097_family(ds) ||
-	    mv88e6xxx_6095_family(ds) || mv88e6xxx_6320_family(ds))
+	    mv88e6xxx_6095_family(ds) || mv88e6xxx_6320_family(ds) ||
+	    mv88e6xxx_6185_family(ds))
 		reg = PORT_CONTROL_2_MAP_DA;

 	if (mv88e6xxx_6352_family(ds) || mv88e6xxx_6351_family(ds) ||
--
2.7.4
Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as drain mode
On Fri, 2016-03-25 at 12:31 -0400, Craig Gallek wrote:
> I believe the issue here is that closing the listen sockets will drop
> any connections that are in the listen queue but have not been
> accepted yet. In the case of reuseport, you could in theory drain
> those queues into the non-closed sockets, but that probably has some
> interesting consequences...

It is more complicated than this.

Ideally, no TCP connection should be dropped during a server change.

The idea is to let the old program run as long as:

1) It has established TCP sessions
2) Some SYN_RECV pseudo requests are still around

Once 3WHS completes for these SYN_RECV, children are queued into
listener accept queues.

But the idea is to direct all new SYN packets to the 'new' process and
its listeners. (New SYN_RECV should be created on behalf of the new
listeners only)

In some environments, the listeners are simply transferred via FD
passing, from the 'old process' to the new one.
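The FD-passing handover mentioned at the end works because SCM_RIGHTS makes the kernel install a duplicate descriptor for the same listening socket in the receiving process. Below is a minimal userspace sketch of that mechanism; for brevity both ends live in one process over a socketpair(), and the helper names are ours, not from any cited code.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send one fd over an AF_UNIX socket as SCM_RIGHTS ancillary data. */
static int send_fd(int chan, int fd)
{
	char c = 'x';
	struct iovec iov = { .iov_base = &c, .iov_len = 1 };
	union { struct cmsghdr hdr; char buf[CMSG_SPACE(sizeof(int))]; } u;
	struct msghdr msg = {
		.msg_iov = &iov, .msg_iovlen = 1,
		.msg_control = u.buf, .msg_controllen = sizeof(u.buf),
	};
	struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);

	cm->cmsg_level = SOL_SOCKET;
	cm->cmsg_type = SCM_RIGHTS;	/* kernel dup()s the fd for the peer */
	cm->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cm), &fd, sizeof(int));
	return sendmsg(chan, &msg, 0) == 1 ? 0 : -1;
}

/* Receive an fd sent with send_fd(); returns the new descriptor or -1. */
static int recv_fd(int chan)
{
	char c;
	struct iovec iov = { .iov_base = &c, .iov_len = 1 };
	union { struct cmsghdr hdr; char buf[CMSG_SPACE(sizeof(int))]; } u;
	struct msghdr msg = {
		.msg_iov = &iov, .msg_iovlen = 1,
		.msg_control = u.buf, .msg_controllen = sizeof(u.buf),
	};
	int fd = -1;
	struct cmsghdr *cm;

	if (recvmsg(chan, &msg, 0) != 1)
		return -1;
	cm = CMSG_FIRSTHDR(&msg);
	if (cm && cm->cmsg_level == SOL_SOCKET && cm->cmsg_type == SCM_RIGHTS)
		memcpy(&fd, CMSG_DATA(cm), sizeof(int));
	return fd;	/* new descriptor referring to the same socket */
}
```

In the real handover the two ends are the old and new server processes, and the fd being passed is the listening socket, so the new process keeps accepting without dropping queued connections.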
Re: [PATCH v2] netpoll: Fix extra refcount release in netpoll_cleanup()
On Fri, Mar 25, 2016 at 07:33:42AM -0400, Neil Horman wrote:
> On Thu, Mar 24, 2016 at 09:56:21PM -0500, Bjorn Helgaas wrote:
> > netpoll_setup() does a dev_hold() on np->dev, the netpoll device. If it
> > fails, it correctly does a dev_put() but leaves np->dev set. If we call
> > netpoll_cleanup() after the failure, np->dev is still set so we do another
> > dev_put(), which decrements the refcount an extra time.
> >
> > It's questionable to call netpoll_cleanup() after netpoll_setup() fails,
> > but it can be difficult to find the problem, and we can easily avoid it in
> > this case. The extra decrements can lead to hangs like this:
> >
> >     unregister_netdevice: waiting for bond0 to become free. Usage count = -3
> >
> > Set and clear np->dev at the points where we dev_hold() and dev_put() the
> > device.
> >
> > Signed-off-by: Bjorn Helgaas
> > ---
> >  net/core/netpoll.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/net/core/netpoll.c b/net/core/netpoll.c
> > index 94acfc8..a57bd17 100644
> > --- a/net/core/netpoll.c
> > +++ b/net/core/netpoll.c
> > @@ -603,7 +603,6 @@ int __netpoll_setup(struct netpoll *np, struct net_device *ndev)
> >  	const struct net_device_ops *ops;
> >  	int err;
> >
> > -	np->dev = ndev;
> >  	strlcpy(np->dev_name, ndev->name, IFNAMSIZ);
> >  	INIT_WORK(&np->cleanup_work, netpoll_async_cleanup);
> >
> > @@ -670,6 +669,7 @@ int netpoll_setup(struct netpoll *np)
> >  		goto unlock;
> >  	}
> >  	dev_hold(ndev);
> > +	np->dev = ndev;
> >
> >  	if (netdev_master_upper_dev_get(ndev)) {
> >  		np_err(np, "%s is a slave device, aborting\n", np->dev_name);
> > @@ -770,6 +770,7 @@ int netpoll_setup(struct netpoll *np)
> >  	return 0;
> >
> >  put:
> > +	np->dev = NULL;
> >  	dev_put(ndev);
> >  unlock:
> >  	rtnl_unlock();
>
> Is this safe for stacked devices?
> It makes good sense for the typical case, but if you attempt to set up
> a netpoll client on a bridge/bond/vlan, etc, the lower device will get
> its own netpoll struct registered and have no associated np->dev
> pointer. It may not be a real problem in practice, but you probably
> want to check to make sure that stacked devices which recursively call
> the netpoll api don't do anything with the np->dev pointer.

You're right, there is an issue here. I reproduced a problem with a
bond device. bond_netpoll_setup() calls __netpoll_setup() directly
(not netpoll_setup()). I'll debug it more; just wanted to let you
know there *is* a problem with this patch.

Bjorn
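The bug the patch targets reduces to an unpaired hold/put: the failure path dropped the reference but left np->dev set, so a later cleanup dropped it again and the count went negative ("Usage count = -3"). The toy model below (illustrative names, not kernel code) shows how clearing np->dev alongside the put makes a subsequent cleanup a no-op.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for struct net_device and struct netpoll. */
struct toy_dev { int refcnt; };
struct toy_np  { struct toy_dev *dev; };

static void dev_hold_sim(struct toy_dev *d) { d->refcnt++; }
static void dev_put_sim(struct toy_dev *d)  { d->refcnt--; }

/* Setup that fails partway through, with the fix applied: the put and
 * the clearing of np->dev happen at the same point, keeping them paired. */
static void setup_fail_sim(struct toy_np *np, struct toy_dev *d)
{
	dev_hold_sim(d);
	np->dev = d;
	/* ... a later setup step fails ... */
	np->dev = NULL;		/* the fix: clear alongside dev_put() */
	dev_put_sim(d);
}

/* Cleanup only puts the device if np->dev is still set, so calling it
 * after a failed setup no longer double-decrements the refcount. */
static void cleanup_sim(struct toy_np *np)
{
	if (np->dev) {
		dev_put_sim(np->dev);
		np->dev = NULL;
	}
}
```

Without the `np->dev = NULL` in the failure path, the cleanup would run its put a second time and leave the count one below where it started, which is exactly the hang signature quoted above.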
Re: veth regression with "don’t modify ip_summed; doing so treats packets with bad checksums as good."
From: Cong Wang
Date: Fri, 25 Mar 2016 09:32:23 -0700

> So I believe dm9000 needs to be fixed.

+1
Re: veth regression with "don’t modify ip_summed; doing so treats packets with bad checksums as good."
From: Ben Greear
Date: Fri, 25 Mar 2016 09:10:58 -0700

> I am suspicious that this will break at least some drivers. I
> grepped around for ip_summed, and found this, for instance:
>
> davicom/dm9000.c
>
> 	/* The DM9000 is not smart enough to leave fragmented packets

A really old (circa 1997), not-oft-used driver such as this is not the
place to be looking for correct usage of skb->ip_summed semantics.

I would never let whatever a driver like this does influence whether I
apply a bug fix or not.
Re: veth regression with "don’t modify ip_summed; doing so treats packets with bad checksums as good."
On Fri, Mar 25, 2016 at 9:10 AM, Ben Greear wrote:
> I am suspicious that this will break at least some drivers. I grepped
> around for ip_summed, and found this, for instance:
>
> davicom/dm9000.c
>
> 	/* The DM9000 is not smart enough to leave fragmented packets alone. */
> 	if (dm->ip_summed != ip_summed) {
> 		if (ip_summed == CHECKSUM_NONE)
> 			iow(dm, DM9000_TCCR, 0);
> 		else
> 			iow(dm, DM9000_TCCR, TCCR_IP | TCCR_UDP | TCCR_TCP);
> 		dm->ip_summed = ip_summed;
> 	}
>
> It is taking action based on ip_summed == CHECKSUM_NONE, and your change
> will probably break that.
>
> I would suggest that we try to make any fix specific only to veth,
> at least for now. A tree-wide audit of drivers is probably required
> to safely make the kind of change you propose above.
>
> So, unless you can explain why your change is safe, then I do not plan
> to test it.

I just blindly trust the comments there:

 * CHECKSUM_UNNECESSARY:
 *
 *   This has the same meaning as CHECKSUM_NONE for checksum offload on
 *   output.

Let's Cc Tom who wrote this comment.

On the other hand, hyperv got this correctly:

	if ((skb->ip_summed == CHECKSUM_NONE) ||
	    (skb->ip_summed == CHECKSUM_UNNECESSARY))
		goto do_send;

So I believe dm9000 needs to be fixed.

Thanks.
Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as drain mode
On Fri, Mar 25, 2016 at 12:21 PM, Alexei Starovoitov wrote:
> On Fri, Mar 25, 2016 at 11:29:10AM -0400, Craig Gallek wrote:
>> On Thu, Mar 24, 2016 at 2:00 PM, Willy Tarreau wrote:
>>> The pattern is :
>>>
>>> t0 : unprivileged processes 1 and 2 are listening to the same port
>>>      (sock1@pid1) (sock2@pid2)
>>>      <-- listening -->
>>>
>>> t1 : new processes are started to replace the old ones
>>>      (sock1@pid1) (sock2@pid2) (sock3@pid3) (sock4@pid4)
>>>      <-- listening -->         <-- listening -->
>>>
>>> t2 : new processes signal the old ones they must stop
>>>      (sock1@pid1) (sock2@pid2) (sock3@pid3) (sock4@pid4)
>>>      <--- draining -->         <-- listening -->
>>>
>>> t3 : pids 1 and 2 have finished, they go away
>>>                               (sock3@pid3) (sock4@pid4)
>>>      <-- gone ->              <-- listening -->
> ...
>> t3: Close the first two sockets and only use the last two. This is
>> the tricky step. Before this point, the sockets are numbered 0
>> through 3 from the perspective of the BPF program (in the order
>> listen() was called). As soon as socket 0 is closed, the last socket
>> in the list replaces it (what was 3 becomes 0). When socket 1 is
>> closed, socket 2 moves into that position. The assumptions about the
>> socket indexes in the BPF program need to change as the indexes change
>> as a result of closing them.
>
> yeah, the way reuseport_detach_sock() was done makes it hard to manage
> such transitions from bpf program, but I don't see yet what stops
> pid1 and pid2 at stage t2 to just close their sockets.
> If these 'draining' pids don't want to receive packets, they should
> close their sockets. Complicating the bpf side to redistribute spraying
> to sock3 and sock4 only (while sock1 and sock2 are still open) is
> possible, but looks unnecessarily complex to me.
> Just close sock1 and sock2 at t2 time and then exit pid1, pid2 later.
> If they are tcp sockets with rpc protocol on top and have a problem of
> partial messages, then kcm can solve that and it will simplify
> the user space side as well.

I believe the issue here is that closing the listen sockets will drop
any connections that are in the listen queue but have not been
accepted yet. In the case of reuseport, you could in theory drain
those queues into the non-closed sockets, but that probably has some
interesting consequences...
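The index-shifting Craig describes is a swap-with-last removal. The sketch below is a toy model of that array behaviour (illustrative, not the kernel's struct sock_reuseport); plain ints stand in for sockets.

```c
#include <assert.h>

/* Toy reuseport group: a flat array of "sockets" plus a count. */
struct toy_reuseport {
	int socks[16];
	int num;
};

/* Detach the socket at @idx by moving the *last* entry into the vacated
 * slot -- the behaviour attributed to reuseport_detach_sock() above,
 * and the reason index assumptions baked into a BPF program shift as
 * sockets close. */
static void toy_detach(struct toy_reuseport *r, int idx)
{
	r->socks[idx] = r->socks[r->num - 1];
	r->num--;
}
```

Starting from four sockets numbered 0..3, closing index 0 moves the old index 3 into slot 0, and then closing index 1 moves the old index 2 into slot 1, matching the transitions quoted in the thread.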
[ethtool PATCH v7 0/2] add support for new ETHTOOL_xLINKSETTINGS ioctls
From: David Decotigny

History:
  v7: added ref to related kernel commit in netlink ioctl patch description
  v6: re-added last patch, to use AF_NETLINK when AF_INET not available
  v5: rebased main patch, removed last patch "use AF_LOCAL when AF_INET not available"
  v4: review Ben Hutchings:
      using AF_UNIX instead of INET6 in the absence of v4 sockets
      use stdbool.h
      do_seeprom always fails when offset/length out of bounds
      sync to latest ethtool.h + kernel.h from net-next
      __SANE_USERSPACE_TYPES__ always defined
      cosmetic updates for var == const tests
      cosmetic updates for associativity in tests
  v3: TRUE/FALSE obvious-ification
  v2: added do_seeprom patch
      added netdev as recipient
  v1: initial submission

# Patch Set Summary:

David Decotigny (2):
  ethtool.c: add support for ETHTOOL_xLINKSETTINGS ioctls
  ethtool: use netlink socket when AF_INET not available

 configure.ac   |   2 +-
 ethtool.c      | 688 +-
 internal.h     |  67 ++
 test-cmdline.c |  13 ++
 4 files changed, 611 insertions(+), 159 deletions(-)

--
2.8.0.rc3.226.g39d4020
[ethtool PATCH v7 2/2] ethtool: use netlink socket when AF_INET not available
From: David Decotigny

To benefit from this, kernel commit 025c68186e07 ("netlink: add support for NIC driver ioctls") is needed.

Signed-off-by: David Decotigny
---
 configure.ac | 2 +-
 ethtool.c    | 7 +++
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/configure.ac b/configure.ac
index 3105415..47d2a0f 100644
--- a/configure.ac
+++ b/configure.ac
@@ -15,7 +15,7 @@ AM_PROG_CC_C_O
 dnl Checks for libraries.

 dnl Checks for header files.
-AC_CHECK_HEADERS(sys/ioctl.h)
+AC_CHECK_HEADERS(sys/ioctl.h linux/netlink.h)

 dnl Checks for typedefs, structures, and compiler characteristics.
 AC_MSG_CHECKING([whether defines big-endian types])
diff --git a/ethtool.c b/ethtool.c
index cb3d971..314b1b8 100644
--- a/ethtool.c
+++ b/ethtool.c
@@ -42,6 +42,9 @@
 #include
 #include
+#ifdef HAVE_LINUX_NETLINK_H
+# include <linux/netlink.h>
+#endif

 #ifndef MAX_ADDR_LEN
 #define MAX_ADDR_LEN 32
@@ -4645,6 +4648,10 @@ opt_found:

 	/* Open control socket. */
 	ctx.fd = socket(AF_INET, SOCK_DGRAM, 0);
+#ifdef HAVE_LINUX_NETLINK_H
+	if (ctx.fd < 0)
+		ctx.fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_GENERIC);
+#endif
 	if (ctx.fd < 0) {
 		perror("Cannot get control socket");
 		return 70;
--
2.8.0.rc3.226.g39d4020
Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as drain mode
On Fri, Mar 25, 2016 at 11:29:10AM -0400, Craig Gallek wrote: > On Thu, Mar 24, 2016 at 2:00 PM, Willy Tarreauwrote: > > The pattern is : > > > > t0 : unprivileged processes 1 and 2 are listening to the same port > >(sock1@pid1) (sock2@pid2) > ><-- listening --> > > > > t1 : new processes are started to replace the old ones > >(sock1@pid1) (sock2@pid2) (sock3@pid3) (sock4@pid4) > ><-- listening --> <-- listening --> > > > > t2 : new processes signal the old ones they must stop > >(sock1@pid1) (sock2@pid2) (sock3@pid3) (sock4@pid4) > ><--- draining --> <-- listening --> > > > > t3 : pids 1 and 2 have finished, they go away > > (sock3@pid3) (sock4@pid4) > > <-- gone -> <-- listening --> ... > t3: Close the first two sockets and only use the last two. This is > the tricky step. Before this point, the sockets are numbered 0 > through 3 from the perspective of the BPF program (in the order > listen() was called). As soon as socket 0 is closed, the last socket > in the list replaces it (what was 3 becomes 0). When socket 1 is > closed, socket 2 moves into that position. The assumptions about the > socket indexes in the BPF program need to change as the indexes change > as a result of closing them. yeah, the way reuseport_detach_sock() was done makes it hard to manage such transitions from bpf program, but I don't see yet what stops pid1 an pid2 at stage t2 to just close their sockets. If these 'draining' pids don't want to receive packets, they should close their sockets. Complicating bpf side to redistribute spraying to sock3 and sock4 only (while sock1 and sock2 are still open) is possible, but looks unnecessary complex to me. Just close sock1 and sock2 at t2 time and then exit pid1, pid2 later. If they are tcp sockets with rpc protocol on top and have a problem of partial messages, then kcm can solve that and it will simplify the user space side as well.
[ethtool PATCH v7 1/2] ethtool.c: add support for ETHTOOL_xLINKSETTINGS ioctls
From: David Decotigny

More info with kernel commit 8d3f2806f8fb ("Merge branch 'ethtool-ksettings'").

Note: The new features implemented in this patch depend on kernel commit 793cf87de9d1 ("Set cmd field in ETHTOOL_GLINKSETTINGS response to wrong nwords").

Signed-off-by: David Decotigny
---
 ethtool.c      | 681 +-
 internal.h     |  67 ++
 test-cmdline.c |  13 ++
 3 files changed, 603 insertions(+), 158 deletions(-)

diff --git a/ethtool.c b/ethtool.c
index 0cd0d4f..cb3d971 100644
--- a/ethtool.c
+++ b/ethtool.c
@@ -47,42 +47,6 @@
 #define MAX_ADDR_LEN 32
 #endif

-#define ALL_ADVERTISED_MODES \
-	(ADVERTISED_10baseT_Half | \
-	 ADVERTISED_10baseT_Full | \
-	 ADVERTISED_100baseT_Half | \
-	 ADVERTISED_100baseT_Full | \
-	 ADVERTISED_1000baseT_Half | \
-	 ADVERTISED_1000baseT_Full | \
-	 ADVERTISED_1000baseKX_Full | \
-	 ADVERTISED_2500baseX_Full | \
-	 ADVERTISED_10000baseT_Full | \
-	 ADVERTISED_10000baseKX4_Full | \
-	 ADVERTISED_10000baseKR_Full | \
-	 ADVERTISED_10000baseR_FEC | \
-	 ADVERTISED_20000baseMLD2_Full | \
-	 ADVERTISED_20000baseKR2_Full | \
-	 ADVERTISED_40000baseKR4_Full | \
-	 ADVERTISED_40000baseCR4_Full | \
-	 ADVERTISED_40000baseSR4_Full | \
-	 ADVERTISED_40000baseLR4_Full | \
-	 ADVERTISED_56000baseKR4_Full | \
-	 ADVERTISED_56000baseCR4_Full | \
-	 ADVERTISED_56000baseSR4_Full | \
-	 ADVERTISED_56000baseLR4_Full)
-
-#define ALL_ADVERTISED_FLAGS \
-	(ADVERTISED_Autoneg | \
-	 ADVERTISED_TP | \
-	 ADVERTISED_AUI | \
-	 ADVERTISED_MII | \
-	 ADVERTISED_FIBRE | \
-	 ADVERTISED_BNC | \
-	 ADVERTISED_Pause | \
-	 ADVERTISED_Asym_Pause | \
-	 ADVERTISED_Backplane | \
-	 ALL_ADVERTISED_MODES)
-
 #ifndef HAVE_NETIF_MSG
 enum {
 	NETIF_MSG_DRV = 0x0001,
@@ -293,6 +257,43 @@ static void get_mac_addr(char *src, unsigned char *dest)
 	}
 }

+static int parse_hex_u32_bitmap(const char *s,
+				unsigned int nbits, u32 *result)
+{
+	const unsigned int nwords = __KERNEL_DIV_ROUND_UP(nbits, 32);
+	size_t slen = strlen(s);
+	size_t i;
+
+	/* ignore optional '0x' prefix */
+	if ((slen > 2) && (strncasecmp(s, "0x", 2) == 0)) {
+		slen -= 2;
+		s += 2;
+	}
+
+	if (slen > 8 * nwords) /* up to 2 digits per byte */
+		return -1;
+
+	memset(result, 0, 4 * nwords);
+	for (i = 0 ; i < slen ; ++i) {
+		const unsigned int shift = (slen - 1 - i) * 4;
+		u32 *dest = &result[shift / 32];
+		u32 nibble;
+
+		if ('a' <= s[i] && s[i] <= 'f')
+			nibble = 0xa + (s[i] - 'a');
+		else if ('A' <= s[i] && s[i] <= 'F')
+			nibble = 0xa + (s[i] - 'A');
+		else if ('0' <= s[i] && s[i] <= '9')
+			nibble = (s[i] - '0');
+		else
+			return -1;
+
+		*dest |= (nibble << (shift % 32));
+	}
+
+	return 0;
+}
+
 static void parse_generic_cmdline(struct cmd_context *ctx,
 				  int *changed,
 				  struct cmdline_info *info,
@@ -472,64 +473,157 @@ static int do_version(struct cmd_context *ctx)
 	return 0;
 }

-static void dump_link_caps(const char *prefix, const char *an_prefix, u32 mask,
-			   int link_mode_only);
+/* link mode routines */

-static void dump_supported(struct ethtool_cmd *ep)
+static __ETHTOOL_DECLARE_LINK_MODE_MASK(all_advertised_modes);
+static __ETHTOOL_DECLARE_LINK_MODE_MASK(all_advertised_flags);
+
+static void init_global_link_mode_masks(void)
 {
-	u32 mask = ep->supported;
+	static const enum ethtool_link_mode_bit_indices
+		all_advertised_modes_bits[] = {
+		ETHTOOL_LINK_MODE_10baseT_Half_BIT,
+		ETHTOOL_LINK_MODE_10baseT_Full_BIT,
+		ETHTOOL_LINK_MODE_100baseT_Half_BIT,
+		ETHTOOL_LINK_MODE_100baseT_Full_BIT,
+		ETHTOOL_LINK_MODE_1000baseT_Half_BIT,
+		ETHTOOL_LINK_MODE_1000baseT_Full_BIT,
+		ETHTOOL_LINK_MODE_1000baseKX_Full_BIT,
+		ETHTOOL_LINK_MODE_2500baseX_Full_BIT,
+		ETHTOOL_LINK_MODE_10000baseT_Full_BIT,
Re: veth regression with "don’t modify ip_summed; doing so treats packets with bad checksums as good."
On 03/24/2016 10:33 PM, Cong Wang wrote: On Thu, Mar 24, 2016 at 10:13 PM, Ben Greearwrote: On 03/24/2016 10:06 PM, Cong Wang wrote: On Thu, Mar 24, 2016 at 9:34 PM, Ben Greear wrote: On 03/24/2016 06:44 PM, Vijay Pandurangan wrote: Oops, I think my last email didn't go through due to an inadvertent html attachment from my phone mail client. Can you send us a copy of a packet you're sending and/or confirm that the IP and UDP4 checksums are set correctly in the packet? If those are set right, I think we need to read through the networking code again to see why this is broken... Wireshark decodes the packet as having no checksum errors. I think the contents of the packet is correct, but the 'ip_summed' field is set incorrectly to 'NONE' when transmitting on a raw packet socket. Yeah, these bugs are all due to the different interpretations of ip_summed on TX path and RX path. I think the following patch should work, if the comments don't mislead me. Could you give it a try? For the long term, we need to unify the meaning of ip_summed on TX path and RX path, or at least translate it in skb_scrub_packet(). I can test this tomorrow, but I think it will not work. I'm not sending raw IP frames, I'm sending full ethernet frames. Socket is PF_PACKET, SOCK_RAW. Your patch may still be useful for others though? 
Here we go: diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c index 1ecfa71..ab66080 100644 --- a/net/packet/af_packet.c +++ b/net/packet/af_packet.c @@ -1925,6 +1925,7 @@ static int packet_sendmsg_spkt(struct socket *sock, struct msghdr *msg, goto out_unlock; } + skb->ip_summed = CHECKSUM_UNNECESSARY; skb->protocol = proto; skb->dev = dev; skb->priority = sk->sk_priority; @@ -2496,6 +2497,7 @@ static int tpacket_fill_skb(struct packet_sock *po, struct sk_buff *skb, ph.raw = frame; + skb->ip_summed = CHECKSUM_UNNECESSARY; skb->protocol = proto; skb->dev = dev; skb->priority = po->sk.sk_priority; @@ -2805,6 +2807,7 @@ static struct sk_buff *packet_alloc_skb(struct sock *sk, size_t prepad, skb_put(skb, linear); skb->data_len = len - linear; skb->len += len - linear; + skb->ip_summed = CHECKSUM_UNNECESSARY; return skb; } I am suspicious that this will break at least some drivers. I grepped around for ip_summed, and found this, for instance: davicom/dm9000.c /* The DM9000 is not smart enough to leave fragmented packets alone. */ if (dm->ip_summed != ip_summed) { if (ip_summed == CHECKSUM_NONE) iow(dm, DM9000_TCCR, 0); else iow(dm, DM9000_TCCR, TCCR_IP | TCCR_UDP | TCCR_TCP); dm->ip_summed = ip_summed; } It is taking action based on ip_summed == CHECKSUM_NONE, and your change will probably break that. I would suggest that we try to make any fix specific only to veth, at least for now. A tree-wide audit of drivers is probably required to safely make the kind of change you propose above. So, unless you can explain why your change is safe, then I do not plan to test it. Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com
Re: [PATCH net-next v3.16]r8169: Add support for half duplex on MDI
On Fri, Mar 25, 2016 at 04:29:52PM +0200, Corcodel Marian wrote: > This patch add support for half duplex on MDI chips. > > Signed-off-by: Corcodel Marian> --- > drivers/net/ethernet/realtek/r8169.c | 7 ++- > 1 file changed, 6 insertions(+), 1 deletion(-) > > diff --git a/drivers/net/ethernet/realtek/r8169.c > b/drivers/net/ethernet/realtek/r8169.c > index 77c5efb..7f6fb1f 100644 > --- a/drivers/net/ethernet/realtek/r8169.c > +++ b/drivers/net/ethernet/realtek/r8169.c > @@ -1713,8 +1713,13 @@ static int rtl8169_set_speed_xmii(struct net_device > *dev, > else > goto out; > > - if (duplex == DUPLEX_FULL) > + if (duplex == DUPLEX_FULL) { > bmcr |= BMCR_FULLDPLX; > + } else if (duplex == DUPLEX_HALF) { > + bmcr |= ~BMCR_FULLDPLX; > + } else { > + goto out; > + } Hi Corcodel What PHY is actually being used? One that is supported by drivers/net/phy/realtek.c? It looks like the bugs you are finding are all handling stuff which is generic, and well tested code in drivers/net/phy/phy_device.c would do for you, with less bugs. Maybe you want to consider throwing away the PHY code in r8169 and just implement an standard mdio bus, use phylib and the realtek.c phy driver? Andrew
Re: [PATCH v2 1/1] net: macb: remove BUG_ON() and reset the queue to handle RX errors
From: Cyrille Pitchen
Date: Fri, 25 Mar 2016 10:37:34 +0100

> This patch removes two BUG_ON() used to notify about RX queue corruptions
> on macb (not gem) hardware without actually handling the error.
>
> The new code skips corrupted frames but still processes faultless frames.
> Then it resets the RX queue before restarting the reception from a clean
> state.
>
> This patch is a rework of an older patch proposed by Neil Armstrong:
> http://patchwork.ozlabs.org/patch/371525/
>
> Signed-off-by: Cyrille Pitchen
> Acked-by: Neil Armstrong
> Acked-by: Nicolas Ferre

Applied, thank you.
Re: [PATCH net-next 1/1] qlge: Update version to 1.00.00.35
From: Manish Chopra
Date: Fri, 25 Mar 2016 07:14:09 -0400

> Just updating version as many fixes got
> accumulated over 1.00.00.34
>
> Signed-off-by: Manish Chopra

Applied, but it is the 'net' tree that fixes et al. like this should be targeting, as 'net-next' is closed.
Re: [PATCH] net: phy: bcm7xxx: Add entries for Broadcom BCM7346 and BCM7362
From: Jaedon Shin
Date: Fri, 25 Mar 2016 12:46:54 +0900

> Add PHY entries for the Broadcom BCM7346 and BCM7362 chips, these are
> 40nm generation Ethernet PHY.
>
> Fixes: 815717d1473e ("net: phy: bcm7xxx: Remove wildcard entries")
> Signed-off-by: Jaedon Shin

Applied, thank you.
Re: [PATCH net] bpf: add missing map_flags to bpf_map_show_fdinfo
From: Daniel Borkmann
Date: Fri, 25 Mar 2016 00:30:25 +0100

> Add map_flags attribute to bpf_map_show_fdinfo(), so that tools like
> tc can check for them when loading objects from a pinned entry, e.g.
> if user intent wrt allocation (BPF_F_NO_PREALLOC) is different to the
> pinned object, it can bail out. Follow-up to 6c9059817432 ("bpf:
> pre-allocate hash map elements"), so that tc can still support this
> with v4.6.
>
> Signed-off-by: Daniel Borkmann
> Acked-by: Alexei Starovoitov

Applied, thanks Daniel.
Re: dynamically set number of queues for 82598 devices in -stable
Hi Jeff,

Any chance to get an opinion on that?

Thanks,

On Nov 27 11:05, William Dauchy wrote:
> Hello Jeff,
>
> I faced the problem described in commit 7e3f5c8:
> ixgbe: fix bounds checking in ixgbe_setup_tc for 82598
>
> This patch resolves an issue where users were not able to dynamically
> set number of queues for 82598 via ethtool -L
>
> I backported it for my v4.1.x build but I was wondering if this could be
> a candidate for a backport in stable tree at least for v4.1.x.

--
William
Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as drain mode
On Thu, Mar 24, 2016 at 2:00 PM, Willy Tarreauwrote: > The pattern is : > > t0 : unprivileged processes 1 and 2 are listening to the same port >(sock1@pid1) (sock2@pid2) ><-- listening --> > > t1 : new processes are started to replace the old ones >(sock1@pid1) (sock2@pid2) (sock3@pid3) (sock4@pid4) ><-- listening --> <-- listening --> > > t2 : new processes signal the old ones they must stop >(sock1@pid1) (sock2@pid2) (sock3@pid3) (sock4@pid4) ><--- draining --> <-- listening --> > > t3 : pids 1 and 2 have finished, they go away > (sock3@pid3) (sock4@pid4) > <-- gone -> <-- listening --> To address the documentation issues, I'd like to reference the following: - The filter.txt document in the kernel tree: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/networking/filter.txt - It uses (and extends) the BPF instruction set defined in the original BSD BPF paper: http://www.tcpdump.org/papers/bpf-usenix93.pdf - The kernel headers define all of the user-space structures used: * https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/include/uapi/linux/filter.h * https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/include/uapi/linux/bpf.h I've been trying to come up with an example BPF program for use in the example Willy gave earlier in this thread (using 4 points in time and describing one process with two listening sockets replacing another with two listening sockets). Everything except the last step is pretty straight forward using what is currently available in the kernel. I'm using random distribution for simplicity, but you could easily do something smarter using more information about the specific hardware: t0: Evenly distrubute load to two SO_REUSEPORT sockets in a single process: ld rand mod #2 ret a t1: Fork a new process, create two new listening sockets in the same group. Even after calling listen(), but before updating the BPF program, only the first two sockets will see new connections. 
The program is trivially modified to use all 4. ld rand mod #4 ret a t2: Stop sending new connections to the first two sockets (draining) ld rand mod #2 add #2 ret a t3: Close the first two sockets and only use the last two. This is the tricky step. Before this point, the sockets are numbered 0 through 3 from the perspective of the BPF program (in the order listen() was called). As soon as socket 0 is closed, the last socket in the list replaces it (what was 3 becomes 0). When socket 1 is closed, socket 2 moves into that position. The assumptions about the socket indexes in the BPF program need to change as the indexes change as a result of closing them. Even if you use an EBPF map as a level of indirection here, you still have the issue that the socket indexes change as a result of some of them leaving the group. I'm not sure yet how to properly fix this, but it will probably mean changing the way the socket indexing works... The current scheme is really an implementation detail optimized for efficiency. It may be worth modifying or creating a mode which results in a stable mapping. This will probably be necessary for any scheme which expects sockets to regularly enter or leave the group.
Re: [PATCH v2] netpoll: Fix extra refcount release in netpoll_cleanup()
From: Bjorn Helgaas
Date: Thu, 24 Mar 2016 21:56:21 -0500

> netpoll_setup() does a dev_hold() on np->dev, the netpoll device. If it
> fails, it correctly does a dev_put() but leaves np->dev set. If we call
> netpoll_cleanup() after the failure, np->dev is still set so we do another
> dev_put(), which decrements the refcount an extra time.
>
> It's questionable to call netpoll_cleanup() after netpoll_setup() fails,
> but it can be difficult to find the problem, and we can easily avoid it in
> this case. The extra decrements can lead to hangs like this:
>
>   unregister_netdevice: waiting for bond0 to become free. Usage count = -3
>
> Set and clear np->dev at the points where we dev_hold() and dev_put() the
> device.
>
> Signed-off-by: Bjorn Helgaas

Looks good, applied and queued up for -stable. But you probably do want to look into the stacked device issue Neil pointed out.

Thanks!
Re: [PATCH 00/13] Enhance stmmac driver to support GMAC4.x IP
Hi, On 03/25/2016 04:11 PM, David Miller wrote: It is absolutely not appropriate to submit new feature patches at this time. Please resubmit this after the net-next tree opens back out. No pb, I will wait and resend. Regards Alex Thank you.
Re: [PATCH 00/13] Enhance stmmac driver to support GMAC4.x IP
It is absolutely not appropriate to submit new feature patches at this time. Please resubmit this after the net-next tree opens back out. Thank you.
Re: veth regression with "don’t modify ip_summed; doing so treats packets with bad checksums as good."
On 03/24/2016 10:24 PM, Vijay Pandurangan wrote:

On Fri, Mar 25, 2016 at 1:07 AM, Ben Greear wrote:

On 03/24/2016 09:45 PM, Vijay Pandurangan wrote:

Actually, maybe they should be set to CHECKSUM_PARTIAL if we want veth to drop the packets if they have bad checksums before they hit the application level.

VETH is pretty special in that when you transmit a frame on one device, its pair receives it, and unless there is RAM corruption or bugs in the kernel, then it cannot be corrupted.

Yeah, you're right that that's an optimization. However, I think that we should first ensure that a->veth->b operates exactly like: a->physical eth 1 -> physical eth 2->b in all cases. Once we have that working everywhere we could think about optimizations. If we're willing to refactor, we could implement the optimization by allowing veth devices to know what their immediate peer is. If a veth knows it's talking to another veth, it could under some circumstances elide checksum calculation and verification. I'm not sure what abstractions that would break, though. What do you guys think?

veth ALWAYS transmits to another VETH. The problem is that when veth is given a packet to transmit, it is difficult to know where that packet came from. And, adding software checksumming to veth for every frame would be a huge performance hit.

Thanks,
Ben

--
Ben Greear
Candela Technologies Inc
http://www.candelatech.com
[PATCH net-next v3.16]r8169: Add support for half duplex on MDI
This patch add support for half duplex on MDI chips.

Signed-off-by: Corcodel Marian
---
 drivers/net/ethernet/realtek/r8169.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index 77c5efb..7f6fb1f 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -1713,8 +1713,13 @@ static int rtl8169_set_speed_xmii(struct net_device *dev,
 		else
 			goto out;

-		if (duplex == DUPLEX_FULL)
+		if (duplex == DUPLEX_FULL) {
 			bmcr |= BMCR_FULLDPLX;
+		} else if (duplex == DUPLEX_HALF) {
+			bmcr |= ~BMCR_FULLDPLX;
+		} else {
+			goto out;
+}
 	}

 	rtl_writephy(tp, MII_BMCR, bmcr);
--
2.1.4
Re: [PATCH net-next v3.16]r8169: Correct value from speed 10 on MII_BMCR
On Fri, 25 Mar 2016 14:33:02 +0100 Phil Sutterwrote: > Hi, > > Your patch submissions are getting better, also good to see you're > finally using git-send-email. A few things need to be corrected > though: > > Subject line: > - The 'vNN' part in brackets is supposed to be the reroll-count (see > 'git format-patch -h' for details), not kernel version. You're > supposed to submit your patches for either net or net-next, not a > specific kernel version like 3.16. > - Missing space after the closing bracket. > - Extra space after 'r8169:'. > > On Fri, Mar 25, 2016 at 03:01:06PM +0200, Corcodel Marian wrote: > > This patch correct value on MII_BMCR register ald value 0 > > have target on reserved register first 2 bytes from MII_BMCR > > speed 10 is flipped value on BMCR_SPEED100 > > This is very hard to understand. I *guess* you want to say one should > not write 0 to MII_BMCR since it overwrites reserved bits, but it's > really not clear. Don't you know someone who can properly translate > from Romanian to English? > > Also detailed instructions on how to trigger the problem you are > fixing for would be good. In detail: Which specific hardware was > used, in which situation did the problem occur, how did it behave in > that situation and what was the expected behaviour? > > Cheers, Phil > This is very hard to understand. I *guess* you want to say one should > not write 0 to MII_BMCR since it overwrites reserved bits, but it's > really not clear. Don't you know someone who can properly translate > from Romanian to English? >Yes this it.
Re: [PATCH net-next v3.16]r8169: Correct value from speed 10 on MII_BMCR
Hi, Your patch submissions are getting better, also good to see you're finally using git-send-email. A few things need to be corrected though: Subject line: - The 'vNN' part in brackets is supposed to be the reroll-count (see 'git format-patch -h' for details), not kernel version. You're supposed to submit your patches for either net or net-next, not a specific kernel version like 3.16. - Missing space after the closing bracket. - Extra space after 'r8169:'. On Fri, Mar 25, 2016 at 03:01:06PM +0200, Corcodel Marian wrote: > This patch correct value on MII_BMCR register ald value 0 > have target on reserved register first 2 bytes from MII_BMCR > speed 10 is flipped value on BMCR_SPEED100 This is very hard to understand. I *guess* you want to say one should not write 0 to MII_BMCR since it overwrites reserved bits, but it's really not clear. Don't you know someone who can properly translate from Romanian to English? Also detailed instructions on how to trigger the problem you are fixing for would be good. In detail: Which specific hardware was used, in which situation did the problem occur, how did it behave in that situation and what was the expected behaviour? Cheers, Phil
Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as drain mode
On Fri, 2016-03-25 at 12:21 +0100, Yann Ylavic wrote:
> Not my intention, you guys know what's the better for the kernel and its APIs.
> My concern is (was) admittedly due to my own lack of knowledge of
> (e)BPF, hence how much of kernel internals I'd need to know to make
> the SO_REUSEPORT work in my case.
>
> But sure, any of the help suggested above would make it go away, I'll
> stay tuned.

Most of the time, the BPF filter can be trivial.

If your server has a multi queue NIC with 16 queues on a 16 cpus host, it would be natural to use 16 listeners, to properly use the NIC features (RSS). BPF program would be:

A = QUEUE_NUMBER(skb);
RETURN A;

If you chose to have 4 listeners instead, for whatever reasons:

A = QUEUE_NUMBER(skb);
A = A & 3;
RETURN A;

You also can use:

A = CURRENT_CPU;
RETURN A;
[PATCH net-next v3.16]r8169: Correct value from speed 10 on MII_BMCR
This patch correct value on MII_BMCR register ald value 0
have target on reserved register first 2 bytes from MII_BMCR
speed 10 is flipped value on BMCR_SPEED100

Signed-off-by: Corcodel Marian
---
 drivers/net/ethernet/realtek/r8169.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index a450656..77c5efb 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -1707,7 +1707,7 @@ static int rtl8169_set_speed_xmii(struct net_device *dev,
 		giga_ctrl = 0;

 		if (speed == SPEED_10)
-			bmcr = 0;
+			bmcr = ~BMCR_SPEED100;
 		else if (speed == SPEED_100)
 			bmcr = BMCR_SPEED100;
 		else
--
2.1.4
[PATCH net-next 1/1] qlge: Update version to 1.00.00.35
Just updating version as many fixes got
accumulated over 1.00.00.34

Signed-off-by: Manish Chopra
---
 drivers/net/ethernet/qlogic/qlge/qlge.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/qlogic/qlge/qlge.h b/drivers/net/ethernet/qlogic/qlge/qlge.h
index ef33270..6d31f92 100644
--- a/drivers/net/ethernet/qlogic/qlge/qlge.h
+++ b/drivers/net/ethernet/qlogic/qlge/qlge.h
@@ -18,7 +18,7 @@
  */
 #define DRV_NAME "qlge"
 #define DRV_STRING "QLogic 10 Gigabit PCI-E Ethernet Driver "
-#define DRV_VERSION	"1.00.00.34"
+#define DRV_VERSION	"1.00.00.35"

 #define WQ_ADDR_ALIGN	0x3	/* 4 byte alignment */
--
2.7.2
Re: [PATCH net-next v3.16]r8169: Implement definitions from linux/mii.h header
Hello. On 3/25/2016 1:25 PM, Corcodel Marian wrote: Add definitions from mii and inhibit 0 to advertise. Signed-off-by: Corcodel Marian--- drivers/net/ethernet/realtek/r8169.c | 22 ++ 1 file changed, 6 insertions(+), 16 deletions(-) diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c index a57b650..a450656 100644 --- a/drivers/net/ethernet/realtek/r8169.c +++ b/drivers/net/ethernet/realtek/r8169.c @@ -1679,29 +1679,19 @@ static int rtl8169_set_speed_xmii(struct net_device *dev, int auto_nego; auto_nego = rtl_readphy(tp, MII_ADVERTISE); - auto_nego &= ~(ADVERTISE_10HALF | ADVERTISE_10FULL | - ADVERTISE_100HALF | ADVERTISE_100FULL); - - if (adv & ADVERTISED_10baseT_Half) - auto_nego |= ADVERTISE_10HALF; - if (adv & ADVERTISED_10baseT_Full) - auto_nego |= ADVERTISE_10FULL; - if (adv & ADVERTISED_100baseT_Half) - auto_nego |= ADVERTISE_100HALF; - if (adv & ADVERTISED_100baseT_Full) - auto_nego |= ADVERTISE_100FULL; + auto_nego &= ~(0 | ADVERTISE_10HALF | ADVERTISE_10FULL | ORing with 0 doesn't change the result. + ADVERTISE_100HALF | ADVERTISE_100FULL | + ADVERTISE_PAUSE_CAP | ADVERTISE_PAUSE_ASYM); + auto_nego |= ethtool_adv_to_mii_adv_t(adv); auto_nego |= ADVERTISE_PAUSE_CAP | ADVERTISE_PAUSE_ASYM; giga_ctrl = rtl_readphy(tp, MII_CTRL1000); - giga_ctrl &= ~(ADVERTISE_1000FULL | ADVERTISE_1000HALF); + giga_ctrl &= ~(0 | ADVERTISE_1000FULL | ADVERTISE_1000HALF); Likewise. [...] MBR, Sergei
tg3 link autoneg on one port results in packets loss on other port
Hi all,

I am using an HP DL360p server with a Broadcom BCM5719 ethernet card handled by the tg3 driver. When the link is renegotiated on one ethernet port, outgoing packets get lost on the other ethernet ports. The problem is easily reproducible by looping on switch port up/down.

Instrumenting the driver, I observed that tg3_setup_phy() may last more than 40 msec while holding the spin lock.

Is it possible to modify the tg3 driver so as to minimize the link renegotiation impact?

Thanks in advance for your help.
Re: [PATCH v2] netpoll: Fix extra refcount release in netpoll_cleanup()
On Thu, Mar 24, 2016 at 09:56:21PM -0500, Bjorn Helgaas wrote: > netpoll_setup() does a dev_hold() on np->dev, the netpoll device. If it > fails, it correctly does a dev_put() but leaves np->dev set. If we call > netpoll_cleanup() after the failure, np->dev is still set so we do another > dev_put(), which decrements the refcount an extra time. > > It's questionable to call netpoll_cleanup() after netpoll_setup() fails, > but it can be difficult to find the problem, and we can easily avoid it in > this case. The extra decrements can lead to hangs like this: > > unregister_netdevice: waiting for bond0 to become free. Usage count = -3 > > Set and clear np->dev at the points where we dev_hold() and dev_put() the > device. > > Signed-off-by: Bjorn Helgaas> --- > net/core/netpoll.c |3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/net/core/netpoll.c b/net/core/netpoll.c > index 94acfc8..a57bd17 100644 > --- a/net/core/netpoll.c > +++ b/net/core/netpoll.c > @@ -603,7 +603,6 @@ int __netpoll_setup(struct netpoll *np, struct net_device > *ndev) > const struct net_device_ops *ops; > int err; > > - np->dev = ndev; > strlcpy(np->dev_name, ndev->name, IFNAMSIZ); > INIT_WORK(>cleanup_work, netpoll_async_cleanup); > > @@ -670,6 +669,7 @@ int netpoll_setup(struct netpoll *np) > goto unlock; > } > dev_hold(ndev); > + np->dev = ndev; > > if (netdev_master_upper_dev_get(ndev)) { > np_err(np, "%s is a slave device, aborting\n", np->dev_name); > @@ -770,6 +770,7 @@ int netpoll_setup(struct netpoll *np) > return 0; > > put: > + np->dev = NULL; > dev_put(ndev); > unlock: > rtnl_unlock(); > Is this safe for stacked devices? It makes good sense for the typical case, but if you attempt to setup a netpoll client on a bridge/bond/vlan, etc, the lower device will get its own netpoll struct registered and have no associated np->dev pointer. 
It may not be a real problem in practice, but you probably want to check to make sure that stacked devices which recursively call the netpoll API don't do anything with the np->dev pointer.

Regards,
Neil
Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as drain mode
On Fri, Mar 25, 2016 at 9:53 AM, Willy Tarreau wrote: > > On Thu, Mar 24, 2016 at 11:49:41PM -0700, Eric Dumazet wrote: >> Everything is possible, but do not complain because BPF went in the >> kernel before your changes. > > Don't get me wrong, I'm not complaining, I'm more asking for help to > try to elaborate the alternate solution. I understood well what my > proposal did because it was pretty straightforward, and the new one > I'll have to do is of an immense complexity for me by now, since it > will require learning a new language and finding doc on how all this > works together. But as I said I'm totally sold to the benefits it can > provide for large scale deployments and I'm well aware of the ugly > socket scan there was in the previous one. +1 (all the work done on the listeners, including SO_REUSEPORT, and until lately very much appreciated in any case!) > > If you could share a pointer to some good starter documentation for this, > I would appreciate it. I really have no idea where to start from and the > only elements I found on the net didn't give a single hint regarding all > this :-/ +1 On Fri, Mar 25, 2016 at 1:24 AM, David Miller wrote: > > If we encapsulate this into libraries and helper wrappers, there is > no reason web server developers should be looking at these details > anyways. ++1 > > Please don't make a mountain out of a mole-hill. Not my intention, you guys know what's the better for the kernel and its APIs. My concern is (was) admittedly due to my own lack of knowledge of (e)BPF, hence how much of kernel internals I'd need to know to make the SO_REUSEPORT work in my case. But sure, any of the help suggested above would make it go away, I'll stay tuned.
[PATCH net-next v3.16] r8169: Implement definitions from linux/mii.h header
Add definitions from mii and inhibit 0 to advertise. Signed-off-by: Corcodel Marian--- drivers/net/ethernet/realtek/r8169.c | 22 ++ 1 file changed, 6 insertions(+), 16 deletions(-) diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c index a57b650..a450656 100644 --- a/drivers/net/ethernet/realtek/r8169.c +++ b/drivers/net/ethernet/realtek/r8169.c @@ -1679,29 +1679,19 @@ static int rtl8169_set_speed_xmii(struct net_device *dev, int auto_nego; auto_nego = rtl_readphy(tp, MII_ADVERTISE); - auto_nego &= ~(ADVERTISE_10HALF | ADVERTISE_10FULL | - ADVERTISE_100HALF | ADVERTISE_100FULL); - - if (adv & ADVERTISED_10baseT_Half) - auto_nego |= ADVERTISE_10HALF; - if (adv & ADVERTISED_10baseT_Full) - auto_nego |= ADVERTISE_10FULL; - if (adv & ADVERTISED_100baseT_Half) - auto_nego |= ADVERTISE_100HALF; - if (adv & ADVERTISED_100baseT_Full) - auto_nego |= ADVERTISE_100FULL; + auto_nego &= ~(0 | ADVERTISE_10HALF | ADVERTISE_10FULL | + ADVERTISE_100HALF | ADVERTISE_100FULL | + ADVERTISE_PAUSE_CAP | ADVERTISE_PAUSE_ASYM); + auto_nego |= ethtool_adv_to_mii_adv_t(adv); auto_nego |= ADVERTISE_PAUSE_CAP | ADVERTISE_PAUSE_ASYM; giga_ctrl = rtl_readphy(tp, MII_CTRL1000); - giga_ctrl &= ~(ADVERTISE_1000FULL | ADVERTISE_1000HALF); + giga_ctrl &= ~(0 | ADVERTISE_1000FULL | ADVERTISE_1000HALF); /* The 8100e/8101e/8102e do Fast Ethernet only. */ if (tp->mii.supports_gmii) { - if (adv & ADVERTISED_1000baseT_Half) - giga_ctrl |= ADVERTISE_1000HALF; - if (adv & ADVERTISED_1000baseT_Full) - giga_ctrl |= ADVERTISE_1000FULL; + giga_ctrl |= ethtool_adv_to_mii_ctrl1000_t(adv); } else if (adv & (ADVERTISED_1000baseT_Half | ADVERTISED_1000baseT_Full)) { netif_info(tp, link, dev, -- 2.1.4
[PATCH v2 1/1] net: macb: remove BUG_ON() and reset the queue to handle RX errors
This patch removes two BUG_ON() used to notify about RX queue corruptions on macb (not gem) hardware without actually handling the error. The new code skips corrupted frames but still processes faultless frames. Then it resets the RX queue before restarting the reception from a clean state. This patch is a rework of an older patch proposed by Neil Armstrong: http://patchwork.ozlabs.org/patch/371525/ Signed-off-by: Cyrille PitchenAcked-by: Neil Armstrong Acked-by: Nicolas Ferre --- ChangeLog v1 -> v2 - add Acked-by Neil Armstrong and Nicolas Ferre - use bool type instead of int for the 'reset_rx_queue' variable - move longer declarations before shorter declarations drivers/net/ethernet/cadence/macb.c | 59 ++--- 1 file changed, 49 insertions(+), 10 deletions(-) diff --git a/drivers/net/ethernet/cadence/macb.c b/drivers/net/ethernet/cadence/macb.c index f715352a9b85..6c3dc27cb100 100644 --- a/drivers/net/ethernet/cadence/macb.c +++ b/drivers/net/ethernet/cadence/macb.c @@ -917,7 +917,10 @@ static int macb_rx_frame(struct macb *bp, unsigned int first_frag, unsigned int frag_len = bp->rx_buffer_size; if (offset + frag_len > len) { - BUG_ON(frag != last_frag); + if (unlikely(frag != last_frag)) { + dev_kfree_skb_any(skb); + return -1; + } frag_len = len - offset; } skb_copy_to_linear_data_offset(skb, offset, @@ -945,8 +948,23 @@ static int macb_rx_frame(struct macb *bp, unsigned int first_frag, return 0; } +static inline void macb_init_rx_ring(struct macb *bp) +{ + dma_addr_t addr; + int i; + + addr = bp->rx_buffers_dma; + for (i = 0; i < RX_RING_SIZE; i++) { + bp->rx_ring[i].addr = addr; + bp->rx_ring[i].ctrl = 0; + addr += bp->rx_buffer_size; + } + bp->rx_ring[RX_RING_SIZE - 1].addr |= MACB_BIT(RX_WRAP); +} + static int macb_rx(struct macb *bp, int budget) { + bool reset_rx_queue = false; int received = 0; unsigned int tail; int first_frag = -1; @@ -972,10 +990,18 @@ static int macb_rx(struct macb *bp, int budget) if (ctrl & MACB_BIT(RX_EOF)) { int dropped; - 
BUG_ON(first_frag == -1); + + if (unlikely(first_frag == -1)) { + reset_rx_queue = true; + continue; + } dropped = macb_rx_frame(bp, first_frag, tail); first_frag = -1; + if (unlikely(dropped < 0)) { + reset_rx_queue = true; + continue; + } if (!dropped) { received++; budget--; @@ -983,6 +1009,26 @@ } } + if (unlikely(reset_rx_queue)) { + unsigned long flags; + u32 ctrl; + + netdev_err(bp->dev, "RX queue corruption: reset it\n"); + + spin_lock_irqsave(&bp->lock, flags); + + ctrl = macb_readl(bp, NCR); + macb_writel(bp, NCR, ctrl & ~MACB_BIT(RE)); + + macb_init_rx_ring(bp); + macb_writel(bp, RBQP, bp->rx_ring_dma); + + macb_writel(bp, NCR, ctrl | MACB_BIT(RE)); + + spin_unlock_irqrestore(&bp->lock, flags); + return received; + } + if (first_frag != -1) bp->rx_tail = first_frag; else @@ -1523,15 +1569,8 @@ static void gem_init_rings(struct macb *bp) static void macb_init_rings(struct macb *bp) { int i; - dma_addr_t addr; - addr = bp->rx_buffers_dma; - for (i = 0; i < RX_RING_SIZE; i++) { - bp->rx_ring[i].addr = addr; - bp->rx_ring[i].ctrl = 0; - addr += bp->rx_buffer_size; - } - bp->rx_ring[RX_RING_SIZE - 1].addr |= MACB_BIT(RX_WRAP); + macb_init_rx_ring(bp); for (i = 0; i < TX_RING_SIZE; i++) { bp->queues[0].tx_ring[i].addr = 0; -- 1.8.2.2
[PATCH 13/13] stmmac: update MAINTAINERS
Signed-off-by: Alexandre TORGUE diff --git a/MAINTAINERS b/MAINTAINERS index b70294e..394e233 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -3293,6 +3293,7 @@ F:Documentation/powerpc/cxlflash.txt STMMAC ETHERNET DRIVER M: Giuseppe Cavallaro +M: Alexandre Torgue L: netdev@vger.kernel.org W: http://www.stlinux.com S: Supported -- 1.9.1
[PATCH 02/13] stmmac: rework the routines to show the ring status
To avoid lot of check in stmmac_main for display ring management and support the GMAC4 chip, the display_ring function is moved into dedicated descriptor file. Signed-off-by: Alexandre TORGUESigned-off-by: Giuseppe Cavallaro diff --git a/drivers/net/ethernet/stmicro/stmmac/common.h b/drivers/net/ethernet/stmicro/stmmac/common.h index 797a913..6cea61b 100644 --- a/drivers/net/ethernet/stmicro/stmmac/common.h +++ b/drivers/net/ethernet/stmicro/stmmac/common.h @@ -380,6 +380,8 @@ struct stmmac_desc_ops { u64(*get_timestamp) (void *desc, u32 ats); /* get rx timestamp status */ int (*get_rx_timestamp_status) (void *desc, u32 ats); + /* Display ring */ + void (*display_ring)(void *head, unsigned int size, bool rx); }; extern const struct stmmac_desc_ops enh_desc_ops; diff --git a/drivers/net/ethernet/stmicro/stmmac/enh_desc.c b/drivers/net/ethernet/stmicro/stmmac/enh_desc.c index cfb018c..3e1b249 100644 --- a/drivers/net/ethernet/stmicro/stmmac/enh_desc.c +++ b/drivers/net/ethernet/stmicro/stmmac/enh_desc.c @@ -411,6 +411,26 @@ static int enh_desc_get_rx_timestamp_status(void *desc, u32 ats) } } +static void enh_desc_display_ring(void *head, unsigned size, bool rx) +{ + struct dma_extended_desc *ep = (struct dma_extended_desc *)head; + int i; + + pr_info("Extended %s descriptor ring:\n", rx ? 
"RX" : "TX"); + + for (i = 0; i < size; i++) { + u64 x; + + x = *(u64 *)ep; + pr_info("%d [0x%x]: 0x%x 0x%x 0x%x 0x%x\n", + i, (unsigned int)virt_to_phys(ep), + (unsigned int)x, (unsigned int)(x >> 32), + ep->basic.des2, ep->basic.des3); + ep++; + } + pr_info("\n"); +} + const struct stmmac_desc_ops enh_desc_ops = { .tx_status = enh_desc_get_tx_status, .rx_status = enh_desc_get_rx_status, @@ -430,4 +450,5 @@ const struct stmmac_desc_ops enh_desc_ops = { .get_tx_timestamp_status = enh_desc_get_tx_timestamp_status, .get_timestamp = enh_desc_get_timestamp, .get_rx_timestamp_status = enh_desc_get_rx_timestamp_status, + .display_ring = enh_desc_display_ring, }; diff --git a/drivers/net/ethernet/stmicro/stmmac/norm_desc.c b/drivers/net/ethernet/stmicro/stmmac/norm_desc.c index e13228f..d93323d 100644 --- a/drivers/net/ethernet/stmicro/stmmac/norm_desc.c +++ b/drivers/net/ethernet/stmicro/stmmac/norm_desc.c @@ -279,6 +279,26 @@ static int ndesc_get_rx_timestamp_status(void *desc, u32 ats) return 1; } +static void ndesc_display_ring(void *head, unsigned size, bool rx) +{ + struct dma_desc *p = (struct dma_desc *)head; + int i; + + pr_info("%s descriptor ring:\n", rx ? 
"RX" : "TX"); + + for (i = 0; i < size; i++) { + u64 x; + + x = *(u64 *)p; + pr_info("%d [0x%x]: 0x%x 0x%x 0x%x 0x%x", + i, (unsigned int)virt_to_phys(p), + (unsigned int)x, (unsigned int)(x >> 32), + p->des2, p->des3); + p++; + } + pr_info("\n"); +} + const struct stmmac_desc_ops ndesc_ops = { .tx_status = ndesc_get_tx_status, .rx_status = ndesc_get_rx_status, @@ -297,4 +317,5 @@ const struct stmmac_desc_ops ndesc_ops = { .get_tx_timestamp_status = ndesc_get_tx_timestamp_status, .get_timestamp = ndesc_get_timestamp, .get_rx_timestamp_status = ndesc_get_rx_timestamp_status, + .display_ring = ndesc_display_ring, }; diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c index d3ebfea..8103527 100644 --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c @@ -883,53 +883,22 @@ static int stmmac_init_phy(struct net_device *dev) return 0; } -/** - * stmmac_display_ring - display ring - * @head: pointer to the head of the ring passed. - * @size: size of the ring. - * @extend_desc: to verify if extended descriptors are used. - * Description: display the control/status and buffer descriptors. - */ -static void stmmac_display_ring(void *head, int size, int extend_desc) -{ - int i; - struct dma_extended_desc *ep = (struct dma_extended_desc *)head; - struct dma_desc *p = (struct dma_desc *)head; - - for (i = 0; i < size; i++) { - u64 x; - if (extend_desc) { - x = *(u64 *) ep; - pr_info("%d [0x%x]: 0x%x 0x%x 0x%x 0x%x\n", - i, (unsigned int)virt_to_phys(ep), - (unsigned int)x, (unsigned int)(x >> 32), - ep->basic.des2, ep->basic.des3); - ep++; - } else { - x = *(u64 *) p; -
[PATCH 06/13] stmmac: add DMA support for GMAC 4.xx
DMA behavior is linked to descriptor management: -descriptor mechanism (Tx for example, but it is exactly the same for RX): -useful registers: -DMA_CH#_TxDesc_Ring_Len: length of the transmit descriptor ring -DMA_CH#_TxDesc_List_Address: start address of the ring -DMA_CH#_TxDesc_Tail_Pointer: address of the last descriptor to send + 1. -DMA_CH#_TxDesc_Current_App_TxDesc: address of the current descriptor -The descriptor Tail Pointer register contains the pointer to the descriptor address (N). The base address and the current descriptor together determine the address of the current descriptor that the DMA can process. The descriptors up to one location less than the one indicated by the descriptor tail pointer (N-1) are owned by the DMA. The DMA continues to process the descriptors until the following condition occurs: "current descriptor pointer == Descriptor Tail pointer" Then the DMA goes into suspend mode. To start a new transfer, the application must write the descriptor tail pointer register and update the tail pointer so that the following condition holds: "current descriptor pointer < Descriptor tail pointer" The DMA automatically wraps around the base address when the end of the ring is reached. Up to 8 DMA channels can be used, but currently we only use one (channel 0). Signed-off-by: Alexandre TORGUE Signed-off-by: Giuseppe Cavallaro diff --git a/drivers/net/ethernet/stmicro/stmmac/Makefile b/drivers/net/ethernet/stmicro/stmmac/Makefile index fa000fd..9398ace 100644 --- a/drivers/net/ethernet/stmicro/stmmac/Makefile +++ b/drivers/net/ethernet/stmicro/stmmac/Makefile @@ -3,7 +3,7 @@ stmmac-objs:= stmmac_main.o stmmac_ethtool.o stmmac_mdio.o ring_mode.o \ chain_mode.o dwmac_lib.o dwmac1000_core.o dwmac1000_dma.o \ dwmac100_core.o dwmac100_dma.o enh_desc.o norm_desc.o \ mmc_core.o stmmac_hwtstamp.o stmmac_ptp.o dwmac4_descs.o \ - $(stmmac-y) + dwmac4_dma.o dwmac4_lib.o $(stmmac-y) # Ordering matters. Generic driver must be last. 
obj-$(CONFIG_STMMAC_PLATFORM) += stmmac-platform.o diff --git a/drivers/net/ethernet/stmicro/stmmac/common.h b/drivers/net/ethernet/stmicro/stmmac/common.h index ea7eb0d..2a5126e 100644 --- a/drivers/net/ethernet/stmicro/stmmac/common.h +++ b/drivers/net/ethernet/stmicro/stmmac/common.h @@ -41,6 +41,8 @@ /* Synopsys Core versions */ #defineDWMAC_CORE_3_40 0x34 #defineDWMAC_CORE_3_50 0x35 +#defineDWMAC_CORE_4_00 0x40 +#define STMMAC_CHAN0 0 /* Always supported and default for all chips */ #define DMA_TX_SIZE 512 #define DMA_RX_SIZE 512 @@ -270,6 +272,7 @@ enum dma_irq_status { #defineCORE_PCS_ANE_COMPLETE (1 << 5) #defineCORE_PCS_LINK_STATUS(1 << 6) #defineCORE_RGMII_IRQ (1 << 7) +#define CORE_IRQ_MTL_RX_OVERFLOW BIT(8) /* Physical Coding Sublayer */ struct rgmii_adv { @@ -301,8 +304,10 @@ struct dma_features { /* 802.3az - Energy-Efficient Ethernet (EEE) */ unsigned int eee; unsigned int av; + unsigned int tsoen; /* TX and RX csum */ unsigned int tx_coe; + unsigned int rx_coe; unsigned int rx_coe_type1; unsigned int rx_coe_type2; unsigned int rxfifo_over_2048; @@ -425,6 +430,11 @@ struct stmmac_dma_ops { struct dma_features *dma_cap); /* Program the HW RX Watchdog */ void (*rx_watchdog) (void __iomem *ioaddr, u32 riwt); + void (*set_tx_ring_len)(void __iomem *ioaddr, u32 len); + void (*set_rx_ring_len)(void __iomem *ioaddr, u32 len); + void (*set_rx_tail_ptr)(void __iomem *ioaddr, u32 tail_ptr, u32 chan); + void (*set_tx_tail_ptr)(void __iomem *ioaddr, u32 tail_ptr, u32 chan); + void (*enable_tso)(void __iomem *ioaddr, bool en, u32 chan); }; struct mac_device_info; @@ -473,6 +483,7 @@ struct stmmac_hwtimestamp { }; extern const struct stmmac_hwtimestamp stmmac_ptp; +extern const struct stmmac_mode_ops dwmac4_ring_mode_ops; struct mac_link { int port; diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.c b/drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.c new file mode 100644 index 000..116151c --- /dev/null +++ 
b/drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.c @@ -0,0 +1,354 @@ +/* + * This is the driver for the GMAC on-chip Ethernet controller for ST SoCs. + * DWC Ether MAC version 4.xx has been used for developing this code. + * + * This contains the functions to handle the dma. + * + * Copyright (C) 2015 STMicroelectronics Ltd + * + * This program is free software; you can redistribute it and/or modify it + * under the terms and conditions of the GNU General Public License, + * version 2, as published by the Free Software Foundation. + * + *
[PATCH 08/13] stmmac: enhance mmc counter management
For gmac3, the MMC addr map is: 0x100 - 0x2fc For gmac4, the MMC addr map is: 0x700 - 0x8fc So instead of adding 0x600 to the IO address when setup the mmc, the RMON base address is saved inside the private structure and then used to manage the counters. Signed-off-by: Giuseppe CavallaroSigned-off-by: Alexandre TORGUE diff --git a/drivers/net/ethernet/stmicro/stmmac/mmc.h b/drivers/net/ethernet/stmicro/stmmac/mmc.h index 192c249..38a1a56 100644 --- a/drivers/net/ethernet/stmicro/stmmac/mmc.h +++ b/drivers/net/ethernet/stmicro/stmmac/mmc.h @@ -35,6 +35,10 @@ * current value.*/ #define MMC_CNTRL_PRESET 0x10 #define MMC_CNTRL_FULL_HALF_PRESET 0x20 + +#define MMC_GMAC4_OFFSET 0x700 +#define MMC_GMAC3_X_OFFSET 0x100 + struct stmmac_counters { unsigned int mmc_tx_octetcount_gb; unsigned int mmc_tx_framecount_gb; diff --git a/drivers/net/ethernet/stmicro/stmmac/mmc_core.c b/drivers/net/ethernet/stmicro/stmmac/mmc_core.c index 3f20bb1..ce9aa79 100644 --- a/drivers/net/ethernet/stmicro/stmmac/mmc_core.c +++ b/drivers/net/ethernet/stmicro/stmmac/mmc_core.c @@ -28,12 +28,12 @@ /* MAC Management Counters register offset */ -#define MMC_CNTRL 0x0100 /* MMC Control */ -#define MMC_RX_INTR0x0104 /* MMC RX Interrupt */ -#define MMC_TX_INTR0x0108 /* MMC TX Interrupt */ -#define MMC_RX_INTR_MASK 0x010c /* MMC Interrupt Mask */ -#define MMC_TX_INTR_MASK 0x0110 /* MMC Interrupt Mask */ -#define MMC_DEFAULT_MASK 0x +#define MMC_CNTRL 0x00/* MMC Control */ +#define MMC_RX_INTR0x04/* MMC RX Interrupt */ +#define MMC_TX_INTR0x08/* MMC TX Interrupt */ +#define MMC_RX_INTR_MASK 0x0c/* MMC Interrupt Mask */ +#define MMC_TX_INTR_MASK 0x10/* MMC Interrupt Mask */ +#define MMC_DEFAULT_MASK 0x /* MMC TX counter registers */ @@ -41,115 +41,115 @@ * _GB register stands for good and bad frames * _G is for good only. 
*/ -#define MMC_TX_OCTETCOUNT_GB 0x0114 -#define MMC_TX_FRAMECOUNT_GB 0x0118 -#define MMC_TX_BROADCASTFRAME_G0x011c -#define MMC_TX_MULTICASTFRAME_G0x0120 -#define MMC_TX_64_OCTETS_GB0x0124 -#define MMC_TX_65_TO_127_OCTETS_GB 0x0128 -#define MMC_TX_128_TO_255_OCTETS_GB0x012c -#define MMC_TX_256_TO_511_OCTETS_GB0x0130 -#define MMC_TX_512_TO_1023_OCTETS_GB 0x0134 -#define MMC_TX_1024_TO_MAX_OCTETS_GB 0x0138 -#define MMC_TX_UNICAST_GB 0x013c -#define MMC_TX_MULTICAST_GB0x0140 -#define MMC_TX_BROADCAST_GB0x0144 -#define MMC_TX_UNDERFLOW_ERROR 0x0148 -#define MMC_TX_SINGLECOL_G 0x014c -#define MMC_TX_MULTICOL_G 0x0150 -#define MMC_TX_DEFERRED0x0154 -#define MMC_TX_LATECOL 0x0158 -#define MMC_TX_EXESSCOL0x015c -#define MMC_TX_CARRIER_ERROR 0x0160 -#define MMC_TX_OCTETCOUNT_G0x0164 -#define MMC_TX_FRAMECOUNT_G0x0168 -#define MMC_TX_EXCESSDEF 0x016c -#define MMC_TX_PAUSE_FRAME 0x0170 -#define MMC_TX_VLAN_FRAME_G0x0174 +#define MMC_TX_OCTETCOUNT_GB 0x14 +#define MMC_TX_FRAMECOUNT_GB 0x18 +#define MMC_TX_BROADCASTFRAME_G0x1c +#define MMC_TX_MULTICASTFRAME_G0x20 +#define MMC_TX_64_OCTETS_GB0x24 +#define MMC_TX_65_TO_127_OCTETS_GB 0x28 +#define MMC_TX_128_TO_255_OCTETS_GB0x2c +#define MMC_TX_256_TO_511_OCTETS_GB0x30 +#define MMC_TX_512_TO_1023_OCTETS_GB 0x34 +#define MMC_TX_1024_TO_MAX_OCTETS_GB 0x38 +#define MMC_TX_UNICAST_GB 0x3c +#define MMC_TX_MULTICAST_GB0x40 +#define MMC_TX_BROADCAST_GB0x44 +#define MMC_TX_UNDERFLOW_ERROR 0x48 +#define MMC_TX_SINGLECOL_G 0x4c +#define MMC_TX_MULTICOL_G 0x50 +#define MMC_TX_DEFERRED0x54 +#define MMC_TX_LATECOL 0x58 +#define MMC_TX_EXESSCOL0x5c +#define MMC_TX_CARRIER_ERROR 0x60 +#define MMC_TX_OCTETCOUNT_G0x64 +#define MMC_TX_FRAMECOUNT_G0x68 +#define MMC_TX_EXCESSDEF 0x6c +#define MMC_TX_PAUSE_FRAME 0x70 +#define MMC_TX_VLAN_FRAME_G0x74 /* MMC RX counter registers */ -#define MMC_RX_FRAMECOUNT_GB 0x0180 -#define MMC_RX_OCTETCOUNT_GB 0x0184 -#define MMC_RX_OCTETCOUNT_G0x0188 -#define MMC_RX_BROADCASTFRAME_G0x018c -#define 
MMC_RX_MULTICASTFRAME_G0x0190 -#define MMC_RX_CRC_ERROR
[PATCH 10/13] stmmac: support new GMAC4
This patch adds the whole GMAC4 support inside the stmmac d.d. now able to use the new HW and some new features i.e.: TSO. It is missing the multi-queue and split Header support at this stage. This patch also updates the driver version and the stmmac.txt. Signed-off-by: Alexandre TORGUESigned-off-by: Giuseppe Cavallaro diff --git a/drivers/net/ethernet/stmicro/stmmac/common.h b/drivers/net/ethernet/stmicro/stmmac/common.h index eabe86b..fc60368 100644 --- a/drivers/net/ethernet/stmicro/stmmac/common.h +++ b/drivers/net/ethernet/stmicro/stmmac/common.h @@ -169,6 +169,9 @@ struct stmmac_extra_stats { unsigned long mtl_rx_fifo_ctrl_active; unsigned long mac_rx_frame_ctrl_fifo; unsigned long mac_gmii_rx_proto_engine; + /* TSO */ + unsigned long tx_tso_frames; + unsigned long tx_tso_nfrags; }; /* CSR Frequency Access Defines*/ @@ -545,6 +548,7 @@ void stmmac_dwmac4_set_mac(void __iomem *ioaddr, bool enable); void dwmac_dma_flush_tx_fifo(void __iomem *ioaddr); extern const struct stmmac_mode_ops ring_mode_ops; extern const struct stmmac_mode_ops chain_mode_ops; +extern const struct stmmac_desc_ops dwmac4_desc_ops; /** * stmmac_get_synopsys_id - return the SYINID. 
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac.h b/drivers/net/ethernet/stmicro/stmmac/stmmac.h index 26fb855..317ce35 100644 --- a/drivers/net/ethernet/stmicro/stmmac/stmmac.h +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac.h @@ -24,7 +24,7 @@ #define __STMMAC_H__ #define STMMAC_RESOURCE_NAME "stmmaceth" -#define DRV_MODULE_VERSION "Oct_2015" +#define DRV_MODULE_VERSION "Dec_2015" #include #include @@ -67,6 +67,7 @@ struct stmmac_priv { spinlock_t tx_lock; bool tx_path_in_lpi_mode; struct timer_list txtimer; + bool tso; struct dma_desc *dma_rx cacheline_aligned_in_smp; struct dma_extended_desc *dma_erx; @@ -129,6 +130,9 @@ struct stmmac_priv { int irq_wake; spinlock_t ptp_lock; void __iomem *mmcaddr; + u32 rx_tail_addr; + u32 tx_tail_addr; + u32 mss; #ifdef CONFIG_DEBUG_FS struct dentry *dbgfs_dir; diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_ethtool.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_ethtool.c index fb2e7fc85..e2b98b0 100644 --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_ethtool.c +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_ethtool.c @@ -161,6 +161,9 @@ static const struct stmmac_stats stmmac_gstrings_stats[] = { STMMAC_STAT(mtl_rx_fifo_ctrl_active), STMMAC_STAT(mac_rx_frame_ctrl_fifo), STMMAC_STAT(mac_gmii_rx_proto_engine), + /* TSO */ + STMMAC_STAT(tx_tso_frames), + STMMAC_STAT(tx_tso_nfrags), }; #define STMMAC_STATS_LEN ARRAY_SIZE(stmmac_gstrings_stats) @@ -499,7 +502,7 @@ static void stmmac_get_ethtool_stats(struct net_device *dev, int i, j = 0; /* Update the DMA HW counters for dwmac10/100 */ - if (!priv->plat->has_gmac) + if (priv->hw->dma->dma_diagnostic_fr) priv->hw->dma->dma_diagnostic_fr(>stats, (void *) >xstats, priv->ioaddr); diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c index 977487a..cb21884 100644 --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c @@ -56,6 +56,7 @@ #include 
"dwmac1000.h" #define STMMAC_ALIGN(x)L1_CACHE_ALIGN(x) +#defineTSO_MAX_BUFF_SIZE (SZ_16K - 1) /* Module parameters */ #define TX_TIMEO 5000 @@ -726,13 +727,15 @@ static void stmmac_adjust_link(struct net_device *dev) new_state = 1; switch (phydev->speed) { case 1000: - if (likely(priv->plat->has_gmac)) + if (likely((priv->plat->has_gmac) || + (priv->plat->has_gmac4))) ctrl &= ~priv->hw->link.port; stmmac_hw_fix_mac_speed(priv); break; case 100: case 10: - if (priv->plat->has_gmac) { + if (likely((priv->plat->has_gmac) || + (priv->plat->has_gmac4))) { ctrl |= priv->hw->link.port; if (phydev->speed == SPEED_100) { ctrl |= priv->hw->link.speed; @@ -977,7 +980,10 @@ static int stmmac_init_rx_buffers(struct stmmac_priv *priv, struct dma_desc *p, return -EINVAL; } - p->des2
[PATCH 09/13] stmmac: add new DT platform entries for GMAC4
This is to support the snps,dwmac-4.00 and snps,dwmac-4.10a and related features on the platform driver. See binding doc for further details. Signed-off-by: Giuseppe CavallaroSigned-off-by: Alexandre TORGUE diff --git a/Documentation/devicetree/bindings/net/stmmac.txt b/Documentation/devicetree/bindings/net/stmmac.txt index 6605d19..4d302db 100644 --- a/Documentation/devicetree/bindings/net/stmmac.txt +++ b/Documentation/devicetree/bindings/net/stmmac.txt @@ -59,6 +59,8 @@ Optional properties: - snps,fb: fixed-burst - snps,mb: mixed-burst - snps,rb: rebuild INCRx Burst + - snps,tso: this enables the TSO feature otherwise it will be managed by + MAC HW capability register. - mdio: with compatible = "snps,dwmac-mdio", create and register mdio bus. Examples: diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c index dcbd2a1..6ca32f7 100644 --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c @@ -243,6 +243,13 @@ stmmac_probe_config_dt(struct platform_device *pdev, const char **mac) plat->pmt = 1; } + if (of_device_is_compatible(np, "snps,dwmac-4.00") || + of_device_is_compatible(np, "snps,dwmac-4.10a")) { + plat->has_gmac4 = 1; + plat->pmt = 1; + plat->tso_en = of_property_read_bool(np, "snps,tso"); + } + if (of_device_is_compatible(np, "snps,dwmac-3.610") || of_device_is_compatible(np, "snps,dwmac-3.710")) { plat->enh_desc = 1; diff --git a/include/linux/stmmac.h b/include/linux/stmmac.h index 4bcf5a6..3aa1870 100644 --- a/include/linux/stmmac.h +++ b/include/linux/stmmac.h @@ -138,5 +138,7 @@ struct plat_stmmacenet_data { void (*exit)(struct platform_device *pdev, void *priv); void *bsp_priv; struct stmmac_axi *axi; + int has_gmac4; + bool tso_en; }; #endif -- 1.9.1
[PATCH 04/13] stmmac: add descriptors function for GMAC 4.xx
One of the main changes in the GMAC 4.xx IP is descriptor management. -descriptors are only used in ring mode. -A descriptor is composed of 4 32-bit words (no more extended descriptors). -descriptor mechanism (Tx for example, but it is exactly the same for RX): -useful registers: -DMA_CH#_TxDesc_Ring_Len: length of the transmit descriptor ring -DMA_CH#_TxDesc_List_Address: start address of the ring -DMA_CH#_TxDesc_Tail_Pointer: address of the last descriptor to send + 1. -DMA_CH#_TxDesc_Current_App_TxDesc: address of the current descriptor -The descriptor Tail Pointer register contains the pointer to the descriptor address (N). The base address and the current descriptor together determine the address of the current descriptor that the DMA can process. The descriptors up to one location less than the one indicated by the descriptor tail pointer (N-1) are owned by the DMA. The DMA continues to process the descriptors until the following condition occurs: "current descriptor pointer == Descriptor Tail pointer" Then the DMA goes into suspend mode. To start a new transfer, the application must write the descriptor tail pointer register and update the tail pointer so that the following condition holds: "current descriptor pointer < Descriptor tail pointer" The DMA automatically wraps around the base address when the end of the ring is reached. 
-New features are available on IP: -TSO (TCP Segmentation Offload) for TX only -Split header: to have header and payload in 2 different buffers Signed-off-by: Alexandre TORGUESigned-off-by: Giuseppe Cavallaro diff --git a/drivers/net/ethernet/stmicro/stmmac/Makefile b/drivers/net/ethernet/stmicro/stmmac/Makefile index b390161..fa000fd 100644 --- a/drivers/net/ethernet/stmicro/stmmac/Makefile +++ b/drivers/net/ethernet/stmicro/stmmac/Makefile @@ -2,7 +2,8 @@ obj-$(CONFIG_STMMAC_ETH) += stmmac.o stmmac-objs:= stmmac_main.o stmmac_ethtool.o stmmac_mdio.o ring_mode.o \ chain_mode.o dwmac_lib.o dwmac1000_core.o dwmac1000_dma.o \ dwmac100_core.o dwmac100_dma.o enh_desc.o norm_desc.o \ - mmc_core.o stmmac_hwtstamp.o stmmac_ptp.o $(stmmac-y) + mmc_core.o stmmac_hwtstamp.o stmmac_ptp.o dwmac4_descs.o \ + $(stmmac-y) # Ordering matters. Generic driver must be last. obj-$(CONFIG_STMMAC_PLATFORM) += stmmac-platform.o diff --git a/drivers/net/ethernet/stmicro/stmmac/common.h b/drivers/net/ethernet/stmicro/stmmac/common.h index 66e132f..ea7eb0d 100644 --- a/drivers/net/ethernet/stmicro/stmmac/common.h +++ b/drivers/net/ethernet/stmicro/stmmac/common.h @@ -243,6 +243,7 @@ enum rx_frame_status { csum_none = 0x2, llc_snap = 0x4, dma_own = 0x8, + rx_not_ls = 0x10, }; /* Tx status */ @@ -348,6 +349,10 @@ struct stmmac_desc_ops { void (*prepare_tx_desc) (struct dma_desc *p, int is_fs, int len, bool csum_flag, int mode, bool tx_own, bool ls); + void (*prepare_tso_tx_desc)(struct dma_desc *p, int is_fs, int len1, + int len2, bool tx_own, bool ls, + unsigned int tcphdrlen, + unsigned int tcppayloadlen); /* Set/get the owner of the descriptor */ void (*set_tx_owner) (struct dma_desc *p); int (*get_tx_owner) (struct dma_desc *p); @@ -382,6 +387,8 @@ struct stmmac_desc_ops { int (*get_rx_timestamp_status) (void *desc, u32 ats); /* Display ring */ void (*display_ring)(void *head, unsigned int size, bool rx); + /* set MSS via context descriptor */ + void (*set_mss)(struct dma_desc *p, 
unsigned int mss); }; extern const struct stmmac_desc_ops enh_desc_ops; diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac4_descs.c b/drivers/net/ethernet/stmicro/stmmac/dwmac4_descs.c new file mode 100644 index 000..33cbec3 --- /dev/null +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac4_descs.c @@ -0,0 +1,396 @@ +/* + * This contains the functions to handle the descriptors for DesignWare databook + * 4.xx. + * + * Copyright (C) 2015 STMicroelectronics Ltd + * + * This program is free software; you can redistribute it and/or modify it + * under the terms and conditions of the GNU General Public License, + * version 2, as published by the Free Software Foundation. + * + * Author: Alexandre Torgue + */ + +#include +#include "common.h" +#include "dwmac4_descs.h" + +static int dwmac4_wrback_get_tx_status(void *data, struct stmmac_extra_stats *x, + struct dma_desc *p, + void __iomem *ioaddr) +{ + struct net_device_stats *stats = (struct
[PATCH 11/13] Documentation: networking: update stmmac
Update stmmac driver documentation according to new GMAC 4.x family. Signed-off-by: Alexandre TORGUEdiff --git a/Documentation/networking/stmmac.txt b/Documentation/networking/stmmac.txt index d64a147..671fe3d 100644 --- a/Documentation/networking/stmmac.txt +++ b/Documentation/networking/stmmac.txt @@ -1,6 +1,6 @@ STMicroelectronics 10/100/1000 Synopsys Ethernet driver -Copyright (C) 2007-2014 STMicroelectronics Ltd +Copyright (C) 2007-2015 STMicroelectronics Ltd Author: Giuseppe Cavallaro This is the driver for the MAC 10/100/1000 on-chip Ethernet controllers @@ -138,6 +138,8 @@ struct plat_stmmacenet_data { int (*init)(struct platform_device *pdev, void *priv); void (*exit)(struct platform_device *pdev, void *priv); void *bsp_priv; + int has_gmac4; + bool tso_en; }; Where: @@ -181,6 +183,8 @@ Where: registers. init/exit callbacks should not use or modify platform data. o bsp_priv: another private pointer. + o has_gmac4: uses GMAC4 core. + o tso_en: Enables TSO (TCP Segmentation Offload) feature. For MDIO bus The we have: @@ -278,6 +282,13 @@ Please see the following document: o stmmac_ethtool.c: to implement the ethtool support; o stmmac.h: private driver structure; o common.h: common definitions and VFTs; + o mmc_core.c/mmc.h: Management MAC Counters; + o stmmac_hwtstamp.c: HW timestamp support for PTP; + o stmmac_ptp.c: PTP 1588 clock; + o dwmac-.c: these are for the platform glue-logic file; e.g. dwmac-sti.c + for STMicroelectronics SoCs. 
+ +- GMAC 3.x o descs.h: descriptor structure definitions; o dwmac1000_core.c: dwmac GiGa core functions; o dwmac1000_dma.c: dma functions for the GMAC chip; @@ -289,11 +300,32 @@ Please see the following document: o enh_desc.c: functions for handling enhanced descriptors; o norm_desc.c: functions for handling normal descriptors; o chain_mode.c/ring_mode.c:: functions to manage RING/CHAINED modes; - o mmc_core.c/mmc.h: Management MAC Counters; - o stmmac_hwtstamp.c: HW timestamp support for PTP; - o stmmac_ptp.c: PTP 1588 clock; - o dwmac-.c: these are for the platform glue-logic file; e.g. dwmac-sti.c - for STMicroelectronics SoCs. + +- GMAC4.x generation + o dwmac4_core.c: dwmac GMAC4.x core functions; + o dwmac4_desc.c: functions for handling GMAC4.x descriptors; + o dwmac4_descs.h: descriptor definitions; + o dwmac4_dma.c: dma functions for the GMAC4.x chip; + o dwmac4_dma.h: dma definitions for the GMAC4.x chip; + o dwmac4.h: core definitions for the GMAC4.x chip; + o dwmac4_lib.c: generic GMAC4.x functions; + +4.12) TSO support (GMAC4.x) + +TSO (Tcp Segmentation Offload) feature is supported by GMAC 4.x chip family. +When a packet is sent through TCP protocol, the TCP stack ensures that +the SKB provided to the low level driver (stmmac in our case) matches with +the maximum frame len (IP header + TCP header + payload <= 1500 bytes (for +MTU set to 1500)). It means that if an application using TCP want to send a +packet which will have a length (after adding headers) > 1514 the packet +will be split in several TCP packets: The data payload is split and headers +(TCP/IP ..) are added. It is done by software. + +When TSO is enabled, the TCP stack doesn't care about the maximum frame +length and provide SKB packet to stmmac as it is. The GMAC IP will have to +perform the segmentation by it self to match with maximum frame length. + +This feature can be enabled in device tree through "snps,tso" entry. 5) Debug Information -- 1.9.1
[PATCH 12/13] stmmac: update version to Jan_2016
This patch just updates the driver to the version fully tested on STi platforms. This version is Jan_2016. Signed-off-by: Alexandre TORGUE diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac.h b/drivers/net/ethernet/stmicro/stmmac/stmmac.h index 317ce35..ff67506 100644 --- a/drivers/net/ethernet/stmicro/stmmac/stmmac.h +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac.h @@ -24,7 +24,7 @@ #define __STMMAC_H__ #define STMMAC_RESOURCE_NAME "stmmaceth" -#define DRV_MODULE_VERSION "Dec_2015" +#define DRV_MODULE_VERSION "Jan_2016" #include #include -- 1.9.1
[PATCH 05/13] stmmac: add GMAC4 DMA/CORE Header File
This is the main header file to define all the macro used for GMAC4 DMA and CORE parts. Signed-off-by: Alexandre TORGUESigned-off-by: Giuseppe Cavallaro diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac4.h b/drivers/net/ethernet/stmicro/stmmac/dwmac4.h new file mode 100644 index 000..c12f15c --- /dev/null +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac4.h @@ -0,0 +1,224 @@ +/* + * DWMAC4 Header file. + * + * Copyright (C) 2015 STMicroelectronics Ltd + * + * This program is free software; you can redistribute it and/or modify it + * under the terms and conditions of the GNU General Public License, + * version 2, as published by the Free Software Foundation. + * + * Author: Alexandre Torgue + */ + +#ifndef __DWMAC4_H__ +#define __DWMAC4_H__ + +#include "common.h" + +/* MAC registers */ +#define GMAC_CONFIG0x +#define GMAC_PACKET_FILTER 0x0008 +#define GMAC_HASH_TAB_0_31 0x0010 +#define GMAC_HASH_TAB_32_630x0014 +#define GMAC_RX_FLOW_CTRL 0x0090 +#define GMAC_QX_TX_FLOW_CTRL(x)(0x70 + x * 4) +#define GMAC_INT_STATUS0x00b0 +#define GMAC_INT_EN0x00b4 +#define GMAC_AN_CTRL 0x00e0 +#define GMAC_AN_STATUS 0x00e4 +#define GMAC_AN_ADV0x00e8 +#define GMAC_AN_LPA0x00ec +#define GMAC_PMT 0x00c0 +#define GMAC_VERSION 0x0110 +#define GMAC_DEBUG 0x0114 +#define GMAC_HW_FEATURE0 0x011c +#define GMAC_HW_FEATURE1 0x0120 +#define GMAC_HW_FEATURE2 0x0124 +#define GMAC_MDIO_ADDR 0x0200 +#define GMAC_MDIO_DATA 0x0204 +#define GMAC_ADDR_HIGH(reg)(0x300 + reg * 8) +#define GMAC_ADDR_LOW(reg) (0x304 + reg * 8) + +/* MAC Packet Filtering */ +#define GMAC_PACKET_FILTER_PR BIT(0) +#define GMAC_PACKET_FILTER_HMC BIT(2) +#define GMAC_PACKET_FILTER_PM BIT(4) + +#define GMAC_MAX_PERFECT_ADDRESSES 128 + +/* MAC Flow Control RX */ +#define GMAC_RX_FLOW_CTRL_RFE BIT(0) + +/* MAC Flow Control TX */ +#define GMAC_TX_FLOW_CTRL_TFE BIT(1) +#define GMAC_TX_FLOW_CTRL_PT_SHIFT 16 + +/* MAC Interrupt bitmap*/ +#define GMAC_INT_PMT_ENBIT(4) +#define GMAC_INT_LPI_ENBIT(5) + +enum dwmac4_irq_status { + 
time_stamp_irq = 0x1000, + mmc_rx_csum_offload_irq = 0x0800, + mmc_tx_irq = 0x0400, + mmc_rx_irq = 0x0200, + mmc_irq = 0x0100, + pmt_irq = 0x0010, + pcs_ane_irq = 0x0004, + pcs_link_irq = 0x0002, +}; + +/* MAC Auto-Neg bitmap*/ +#defineGMAC_AN_CTRL_RANBIT(9) +#defineGMAC_AN_CTRL_ANEBIT(12) +#define GMAC_AN_CTRL_ELE BIT(14) +#define GMAC_AN_FD BIT(5) +#define GMAC_AN_HD BIT(6) +#define GMAC_AN_PSE_MASK GENMASK(8, 7) +#define GMAC_AN_PSE_SHIFT 7 + +/* MAC PMT bitmap */ +enum power_event { + pointer_reset = 0x8000, + global_unicast = 0x0200, + wake_up_rx_frame = 0x0040, + magic_frame = 0x0020, + wake_up_frame_en = 0x0004, + magic_pkt_en = 0x0002, + power_down = 0x0001, +}; + +/* MAC Debug bitmap */ +#define GMAC_DEBUG_TFCSTS_MASK GENMASK(18, 17) +#define GMAC_DEBUG_TFCSTS_SHIFT17 +#define GMAC_DEBUG_TFCSTS_IDLE 0 +#define GMAC_DEBUG_TFCSTS_WAIT 1 +#define GMAC_DEBUG_TFCSTS_GEN_PAUSE2 +#define GMAC_DEBUG_TFCSTS_XFER 3 +#define GMAC_DEBUG_TPESTS BIT(16) +#define GMAC_DEBUG_RFCFCSTS_MASK GENMASK(2, 1) +#define GMAC_DEBUG_RFCFCSTS_SHIFT 1 +#define GMAC_DEBUG_RPESTS BIT(0) + +/* MAC config */ +#define GMAC_CONFIG_IPCBIT(27) +#define GMAC_CONFIG_2K BIT(22) +#define GMAC_CONFIG_ACSBIT(20) +#define GMAC_CONFIG_BE BIT(18) +#define GMAC_CONFIG_JD BIT(17) +#define GMAC_CONFIG_JE BIT(16) +#define GMAC_CONFIG_PS BIT(15) +#define GMAC_CONFIG_FESBIT(14) +#define GMAC_CONFIG_DM BIT(13) +#define GMAC_CONFIG_DCRS BIT(9) +#define GMAC_CONFIG_TE BIT(1) +#define GMAC_CONFIG_RE BIT(0) + +/* MAC HW features0 bitmap */ +#define GMAC_HW_FEAT_ADDMACBIT(18) +#define GMAC_HW_FEAT_RXCOESEL BIT(16) +#define GMAC_HW_FEAT_TXCOSEL BIT(14) +#define GMAC_HW_FEAT_EEESELBIT(13) +#define GMAC_HW_FEAT_TSSEL BIT(12) +#define GMAC_HW_FEAT_MMCSEL
[PATCH 07/13] stmmac: add GMAC4 core support
This is the initial support for GMAC4 that includes the main callbacks to setup the core module: including Csum, basic filtering, mac address and interrupt (MMC, MTL, PMT) No LPI added. Signed-off-by: Alexandre TORGUESigned-off-by: Giuseppe Cavallaro diff --git a/drivers/net/ethernet/stmicro/stmmac/Makefile b/drivers/net/ethernet/stmicro/stmmac/Makefile index 9398ace..0fb362d 100644 --- a/drivers/net/ethernet/stmicro/stmmac/Makefile +++ b/drivers/net/ethernet/stmicro/stmmac/Makefile @@ -3,7 +3,7 @@ stmmac-objs:= stmmac_main.o stmmac_ethtool.o stmmac_mdio.o ring_mode.o \ chain_mode.o dwmac_lib.o dwmac1000_core.o dwmac1000_dma.o \ dwmac100_core.o dwmac100_dma.o enh_desc.o norm_desc.o \ mmc_core.o stmmac_hwtstamp.o stmmac_ptp.o dwmac4_descs.o \ - dwmac4_dma.o dwmac4_lib.o $(stmmac-y) + dwmac4_dma.o dwmac4_lib.o dwmac4_core.o $(stmmac-y) # Ordering matters. Generic driver must be last. obj-$(CONFIG_STMMAC_PLATFORM) += stmmac-platform.o diff --git a/drivers/net/ethernet/stmicro/stmmac/common.h b/drivers/net/ethernet/stmicro/stmmac/common.h index 2a5126e..eabe86b 100644 --- a/drivers/net/ethernet/stmicro/stmmac/common.h +++ b/drivers/net/ethernet/stmicro/stmmac/common.h @@ -527,15 +527,21 @@ struct mac_device_info *dwmac1000_setup(void __iomem *ioaddr, int mcbins, int perfect_uc_entries, int *synopsys_id); struct mac_device_info *dwmac100_setup(void __iomem *ioaddr, int *synopsys_id); - +struct mac_device_info *dwmac4_setup(void __iomem *ioaddr, int mcbins, +int perfect_uc_entries, int *synopsys_id); void stmmac_set_mac_addr(void __iomem *ioaddr, u8 addr[6], unsigned int high, unsigned int low); void stmmac_get_mac_addr(void __iomem *ioaddr, unsigned char *addr, unsigned int high, unsigned int low); - void stmmac_set_mac(void __iomem *ioaddr, bool enable); +void stmmac_dwmac4_set_mac_addr(void __iomem *ioaddr, u8 addr[6], + unsigned int high, unsigned int low); +void stmmac_dwmac4_get_mac_addr(void __iomem *ioaddr, unsigned char *addr, + unsigned int high, unsigned int 
low); +void stmmac_dwmac4_set_mac(void __iomem *ioaddr, bool enable); + void dwmac_dma_flush_tx_fifo(void __iomem *ioaddr); extern const struct stmmac_mode_ops ring_mode_ops; extern const struct stmmac_mode_ops chain_mode_ops; diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac4.h b/drivers/net/ethernet/stmicro/stmmac/dwmac4.h index c12f15c..bc50952 100644 --- a/drivers/net/ethernet/stmicro/stmmac/dwmac4.h +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac4.h @@ -221,4 +221,35 @@ enum power_event { /* To dump the core regs excluding the Address Registers */ #defineGMAC_REG_NUM132 +/* MTL debug */ +#define MTL_DEBUG_TXSTSFSTSBIT(5) +#define MTL_DEBUG_TXFSTS BIT(4) +#define MTL_DEBUG_TWCSTS BIT(3) + +/* MTL debug: Tx FIFO Read Controller Status */ +#define MTL_DEBUG_TRCSTS_MASK GENMASK(2, 1) +#define MTL_DEBUG_TRCSTS_SHIFT 1 +#define MTL_DEBUG_TRCSTS_IDLE 0 +#define MTL_DEBUG_TRCSTS_READ 1 +#define MTL_DEBUG_TRCSTS_TXW 2 +#define MTL_DEBUG_TRCSTS_WRITE 3 +#define MTL_DEBUG_TXPAUSED BIT(0) + +/* MAC debug: GMII or MII Transmit Protocol Engine Status */ +#define MTL_DEBUG_RXFSTS_MASK GENMASK(5, 4) +#define MTL_DEBUG_RXFSTS_SHIFT 4 +#define MTL_DEBUG_RXFSTS_EMPTY 0 +#define MTL_DEBUG_RXFSTS_BT1 +#define MTL_DEBUG_RXFSTS_AT2 +#define MTL_DEBUG_RXFSTS_FULL 3 +#define MTL_DEBUG_RRCSTS_MASK GENMASK(2, 1) +#define MTL_DEBUG_RRCSTS_SHIFT 1 +#define MTL_DEBUG_RRCSTS_IDLE 0 +#define MTL_DEBUG_RRCSTS_RDATA 1 +#define MTL_DEBUG_RRCSTS_RSTAT 2 +#define MTL_DEBUG_RRCSTS_FLUSH 3 +#define MTL_DEBUG_RWCSTS BIT(0) + +extern const struct stmmac_dma_ops dwmac4_dma_ops; +extern const struct stmmac_dma_ops dwmac410_dma_ops; #endif /* __DWMAC4_H__ */ diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac4_core.c b/drivers/net/ethernet/stmicro/stmmac/dwmac4_core.c new file mode 100644 index 000..4f7283d --- /dev/null +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac4_core.c @@ -0,0 +1,407 @@ +/* + * This is the driver for the GMAC on-chip Ethernet controller for ST SoCs. 
+ * DWC Ether MAC version 4.00 has been used for developing this code. + * + * This only implements the mac core functions for this chip. + * + * Copyright (C) 2015 STMicroelectronics Ltd + * + * This program is free software; you can redistribute it and/or modify it + * under the terms and conditions of the GNU General Public License, + * version 2, as published by the Free Software
[PATCH 03/13] stmmac: rework synopsys id read, moved to dwmac setup
synopsys_uid is only used once after setup, to get synopsys_id by using a shift/mask operation. It's no longer used after that. So, remove this temporary variable and directly compute synopsys_id from the setup routine. Acked-by: Giuseppe Cavallaro Signed-off-by: Fabrice Gasnier Signed-off-by: Alexandre TORGUE diff --git a/drivers/net/ethernet/stmicro/stmmac/common.h b/drivers/net/ethernet/stmicro/stmmac/common.h index 6cea61b..66e132f 100644 --- a/drivers/net/ethernet/stmicro/stmmac/common.h +++ b/drivers/net/ethernet/stmicro/stmmac/common.h @@ -498,7 +498,6 @@ struct mac_device_info { const struct stmmac_hwtimestamp *ptp; struct mii_regs mii;/* MII register Addresses */ struct mac_link link; - unsigned int synopsys_uid; void __iomem *pcsr; /* vpointer to device CSRs */ int multicast_filter_bins; int unicast_filter_entries; @@ -507,8 +506,10 @@ struct mac_device_info { }; struct mac_device_info *dwmac1000_setup(void __iomem *ioaddr, int mcbins, - int perfect_uc_entries); -struct mac_device_info *dwmac100_setup(void __iomem *ioaddr); + int perfect_uc_entries, + int *synopsys_id); +struct mac_device_info *dwmac100_setup(void __iomem *ioaddr, int *synopsys_id); + void stmmac_set_mac_addr(void __iomem *ioaddr, u8 addr[6], unsigned int high, unsigned int low); @@ -521,4 +522,24 @@ void dwmac_dma_flush_tx_fifo(void __iomem *ioaddr); extern const struct stmmac_mode_ops ring_mode_ops; extern const struct stmmac_mode_ops chain_mode_ops; +/** + * stmmac_get_synopsys_id - return the SYINID. + * @priv: driver private structure + * Description: this simple function is to decode and return the SYINID + * starting from the HW core register. 
+ */ +static inline u32 stmmac_get_synopsys_id(u32 hwid) +{ + /* Check Synopsys Id (not available on old chips) */ + if (likely(hwid)) { + u32 uid = ((hwid & 0xff00) >> 8); + u32 synid = (hwid & 0x00ff); + + pr_info("stmmac - user ID: 0x%x, Synopsys ID: 0x%x\n", + uid, synid); + + return synid; + } + return 0; +} #endif /* __COMMON_H__ */ diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac1000_core.c b/drivers/net/ethernet/stmicro/stmmac/dwmac1000_core.c index c294117..fb1eb57 100644 --- a/drivers/net/ethernet/stmicro/stmmac/dwmac1000_core.c +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac1000_core.c @@ -491,7 +491,8 @@ static const struct stmmac_ops dwmac1000_ops = { }; struct mac_device_info *dwmac1000_setup(void __iomem *ioaddr, int mcbins, - int perfect_uc_entries) + int perfect_uc_entries, + int *synopsys_id) { struct mac_device_info *mac; u32 hwid = readl(ioaddr + GMAC_VERSION); @@ -516,7 +517,9 @@ struct mac_device_info *dwmac1000_setup(void __iomem *ioaddr, int mcbins, mac->link.speed = GMAC_CONTROL_FES; mac->mii.addr = GMAC_MII_ADDR; mac->mii.data = GMAC_MII_DATA; - mac->synopsys_uid = hwid; + + /* Get and dump the chip ID */ + *synopsys_id = stmmac_get_synopsys_id(hwid); return mac; } diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac100_core.c b/drivers/net/ethernet/stmicro/stmmac/dwmac100_core.c index f8dd773..6418b2e 100644 --- a/drivers/net/ethernet/stmicro/stmmac/dwmac100_core.c +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac100_core.c @@ -173,7 +173,7 @@ static const struct stmmac_ops dwmac100_ops = { .get_umac_addr = dwmac100_get_umac_addr, }; -struct mac_device_info *dwmac100_setup(void __iomem *ioaddr) +struct mac_device_info *dwmac100_setup(void __iomem *ioaddr, int *synopsys_id) { struct mac_device_info *mac; @@ -192,7 +192,8 @@ struct mac_device_info *dwmac100_setup(void __iomem *ioaddr) mac->link.speed = 0; mac->mii.addr = MAC_MII_ADDR; mac->mii.data = MAC_MII_DATA; - mac->synopsys_uid = 0; + /* Synopsys Id is not available on old chips 
*/ + *synopsys_id = 0; return mac; } diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c index 8103527..2ebee81 100644 --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c @@ -1468,29 +1468,6 @@ static void stmmac_mmc_setup(struct stmmac_priv *priv) } /** - * stmmac_get_synopsys_id - return the SYINID. - * @priv: driver private structure - * Description: this simple function is to decode and return the SYINID - * starting from the HW core register. - */ -static u32 stmmac_get_synopsys_id(struct stmmac_priv *priv) -{ - u32 hwid =
[PATCH 00/13] Enhance stmmac driver to support GMAC4.x IP
This is a series of patches to enhance the current stmmac driver to support the new GMAC4.x chips. A new set of callbacks is defined to support this new family: descriptors, dma, core.

One of the main changes of the GMAC 4.xx IP is descriptor management:
- descriptors are only used in ring mode;
- a descriptor is composed of 4 32-bit registers (no more extended descriptors);
- descriptor mechanism (Tx for example, but it is exactly the same for Rx); useful registers:
  - DMA_CH#_TxDesc_Ring_Len: length of the transmit descriptor ring
  - DMA_CH#_TxDesc_List_Address: start address of the ring
  - DMA_CH#_TxDesc_Tail_Pointer: address of the last descriptor to send + 1
  - DMA_CH#_TxDesc_Current_App_TxDesc: address of the current descriptor

The descriptor Tail Pointer register contains the pointer to the descriptor address (N). The base address and the current descriptor decide the address of the current descriptor that the DMA can process. The descriptors up to one location less than the one indicated by the descriptor tail pointer (N-1) are owned by the DMA. The DMA continues to process the descriptors until the following condition occurs: "current descriptor pointer == Descriptor Tail pointer". Then the DMA goes into suspend mode. To start a new transfer, the application must perform a write to the descriptor tail pointer register and update the tail pointer so that the following condition holds: "current descriptor pointer < Descriptor tail pointer". The DMA automatically wraps around the base address when the end of the ring is reached.

New features available on the IP:
- TSO (TCP Segmentation Offload) for Tx only
- Split header: to have header and payload in 2 different buffers (not yet implemented)

Below are some throughput figures obtained on some boxes:

iperf (Mbps):
            tcp          udp
          tx    rx     tx    rx
GMAC4.x  935   930    750   800

Note: There is a change in the 4.10a databook on the bitfield mapping of the DMA_CHANx_INTR_ENA register. This requires a different set of callbacks between IP 4.00a and 4.10a. 
Best regards Alex Alexandre TORGUE (13): stmmac: rework get_hw_feature function stmmac: rework the routines to show the ring status stmmac: rework synopsys id read, moved to dwmac setup stmmac: add descriptors function for GMAC 4.xx stmmac: add GMAC4 DMA/CORE Header File stmmac: add DMA support for GMAC 4.xx stmmac: add GMAC4 core support stmmac: enhance mmc counter management stmmac: add new DT platform entries for GMAC4 stmmac: support new GMAC4 Documentation: networking: update stmmac stmmac: update version to Jan_2016 stmmac: update MAINTAINERS Documentation/devicetree/bindings/net/stmmac.txt | 2 + Documentation/networking/stmmac.txt| 44 +- MAINTAINERS| 1 + drivers/net/ethernet/stmicro/stmmac/Makefile | 3 +- drivers/net/ethernet/stmicro/stmmac/common.h | 64 +- .../net/ethernet/stmicro/stmmac/dwmac1000_core.c | 7 +- .../net/ethernet/stmicro/stmmac/dwmac1000_dma.c| 35 +- .../net/ethernet/stmicro/stmmac/dwmac100_core.c| 5 +- drivers/net/ethernet/stmicro/stmmac/dwmac4.h | 255 drivers/net/ethernet/stmicro/stmmac/dwmac4_core.c | 407 + drivers/net/ethernet/stmicro/stmmac/dwmac4_descs.c | 396 + drivers/net/ethernet/stmicro/stmmac/dwmac4_descs.h | 129 + drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.c | 354 drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.h | 202 +++ drivers/net/ethernet/stmicro/stmmac/dwmac4_lib.c | 225 +++ drivers/net/ethernet/stmicro/stmmac/enh_desc.c | 21 + drivers/net/ethernet/stmicro/stmmac/mmc.h | 4 + drivers/net/ethernet/stmicro/stmmac/mmc_core.c | 349 +-- drivers/net/ethernet/stmicro/stmmac/norm_desc.c| 21 + drivers/net/ethernet/stmicro/stmmac/stmmac.h | 7 +- .../net/ethernet/stmicro/stmmac/stmmac_ethtool.c | 7 +- drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 643 +++-- .../net/ethernet/stmicro/stmmac/stmmac_platform.c | 7 + include/linux/stmmac.h | 2 + 24 files changed, 2821 insertions(+), 369 deletions(-) create mode 100644 drivers/net/ethernet/stmicro/stmmac/dwmac4.h create mode 100644 drivers/net/ethernet/stmicro/stmmac/dwmac4_core.c 
create mode 100644 drivers/net/ethernet/stmicro/stmmac/dwmac4_descs.c create mode 100644 drivers/net/ethernet/stmicro/stmmac/dwmac4_descs.h create mode 100644 drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.c create mode 100644 drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.h create mode 100644 drivers/net/ethernet/stmicro/stmmac/dwmac4_lib.c -- 1.9.1
[PATCH 01/13] stmmac: rework get_hw_feature function
On the next GMAC IP generation (4.xx), the way to get hw features is not the same as on the previous 3.xx. As it is hardware dependent, the way to get hw capabilities should be defined in the dma ops of each MAC IP. It will also avoid a huge computation of hw capabilities in stmmac_main. Signed-off-by: Alexandre TORGUE Signed-off-by: Giuseppe Cavallaro diff --git a/drivers/net/ethernet/stmicro/stmmac/common.h b/drivers/net/ethernet/stmicro/stmmac/common.h index f96d257..797a913 100644 --- a/drivers/net/ethernet/stmicro/stmmac/common.h +++ b/drivers/net/ethernet/stmicro/stmmac/common.h @@ -412,7 +412,8 @@ struct stmmac_dma_ops { int (*dma_interrupt) (void __iomem *ioaddr, struct stmmac_extra_stats *x); /* If supported then get the optional core features */ - unsigned int (*get_hw_feature) (void __iomem *ioaddr); + void (*get_hw_feature)(void __iomem *ioaddr, + struct dma_features *dma_cap); /* Program the HW RX Watchdog */ void (*rx_watchdog) (void __iomem *ioaddr, u32 riwt); }; diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac1000_dma.c b/drivers/net/ethernet/stmicro/stmmac/dwmac1000_dma.c index da32d60..99074695 100644 --- a/drivers/net/ethernet/stmicro/stmmac/dwmac1000_dma.c +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac1000_dma.c @@ -215,9 +215,40 @@ static void dwmac1000_dump_dma_regs(void __iomem *ioaddr) } } -static unsigned int dwmac1000_get_hw_feature(void __iomem *ioaddr) +static void dwmac1000_get_hw_feature(void __iomem *ioaddr, +struct dma_features *dma_cap) { - return readl(ioaddr + DMA_HW_FEATURE); + u32 hw_cap = readl(ioaddr + DMA_HW_FEATURE); + + dma_cap->mbps_10_100 = (hw_cap & DMA_HW_FEAT_MIISEL); + dma_cap->mbps_1000 = (hw_cap & DMA_HW_FEAT_GMIISEL) >> 1; + dma_cap->half_duplex = (hw_cap & DMA_HW_FEAT_HDSEL) >> 2; + dma_cap->hash_filter = (hw_cap & DMA_HW_FEAT_HASHSEL) >> 4; + dma_cap->multi_addr = (hw_cap & DMA_HW_FEAT_ADDMAC) >> 5; + dma_cap->pcs = (hw_cap & DMA_HW_FEAT_PCSSEL) >> 6; + dma_cap->sma_mdio = (hw_cap & DMA_HW_FEAT_SMASEL) >> 8; + 
dma_cap->pmt_remote_wake_up = (hw_cap & DMA_HW_FEAT_RWKSEL) >> 9; + dma_cap->pmt_magic_frame = (hw_cap & DMA_HW_FEAT_MGKSEL) >> 10; + /* MMC */ + dma_cap->rmon = (hw_cap & DMA_HW_FEAT_MMCSEL) >> 11; + /* IEEE 1588-2002 */ + dma_cap->time_stamp = + (hw_cap & DMA_HW_FEAT_TSVER1SEL) >> 12; + /* IEEE 1588-2008 */ + dma_cap->atime_stamp = (hw_cap & DMA_HW_FEAT_TSVER2SEL) >> 13; + /* 802.3az - Energy-Efficient Ethernet (EEE) */ + dma_cap->eee = (hw_cap & DMA_HW_FEAT_EEESEL) >> 14; + dma_cap->av = (hw_cap & DMA_HW_FEAT_AVSEL) >> 15; + /* TX and RX csum */ + dma_cap->tx_coe = (hw_cap & DMA_HW_FEAT_TXCOESEL) >> 16; + dma_cap->rx_coe_type1 = (hw_cap & DMA_HW_FEAT_RXTYP1COE) >> 17; + dma_cap->rx_coe_type2 = (hw_cap & DMA_HW_FEAT_RXTYP2COE) >> 18; + dma_cap->rxfifo_over_2048 = (hw_cap & DMA_HW_FEAT_RXFIFOSIZE) >> 19; + /* TX and RX number of channels */ + dma_cap->number_rx_channel = (hw_cap & DMA_HW_FEAT_RXCHCNT) >> 20; + dma_cap->number_tx_channel = (hw_cap & DMA_HW_FEAT_TXCHCNT) >> 22; + /* Alternate (enhanced) DESC mode */ + dma_cap->enh_desc = (hw_cap & DMA_HW_FEAT_ENHDESSEL) >> 24; } static void dwmac1000_rx_watchdog(void __iomem *ioaddr, u32 riwt) diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c index 4c5ce98..d3ebfea 100644 --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c @@ -1558,51 +1558,15 @@ static void stmmac_selec_desc_mode(struct stmmac_priv *priv) */ static int stmmac_get_hw_features(struct stmmac_priv *priv) { - u32 hw_cap = 0; + u32 ret = 0; if (priv->hw->dma->get_hw_feature) { - hw_cap = priv->hw->dma->get_hw_feature(priv->ioaddr); - - priv->dma_cap.mbps_10_100 = (hw_cap & DMA_HW_FEAT_MIISEL); - priv->dma_cap.mbps_1000 = (hw_cap & DMA_HW_FEAT_GMIISEL) >> 1; - priv->dma_cap.half_duplex = (hw_cap & DMA_HW_FEAT_HDSEL) >> 2; - priv->dma_cap.hash_filter = (hw_cap & DMA_HW_FEAT_HASHSEL) >> 4; - priv->dma_cap.multi_addr = (hw_cap & 
DMA_HW_FEAT_ADDMAC) >> 5; - priv->dma_cap.pcs = (hw_cap & DMA_HW_FEAT_PCSSEL) >> 6; - priv->dma_cap.sma_mdio = (hw_cap & DMA_HW_FEAT_SMASEL) >> 8; - priv->dma_cap.pmt_remote_wake_up = - (hw_cap & DMA_HW_FEAT_RWKSEL) >> 9; - priv->dma_cap.pmt_magic_frame = - (hw_cap & DMA_HW_FEAT_MGKSEL) >> 10; - /* MMC */ - priv->dma_cap.rmon = (hw_cap & DMA_HW_FEAT_MMCSEL) >> 11; -
Re: [PATCH net] net: ipv4: Multipath needs to handle unreachable nexthops
Hello, On Thu, 24 Mar 2016, David Ahern wrote: > On 3/24/16 4:33 PM, Julian Anastasov wrote: > > But for multipath routes we can also consider the > > nexthops as "alternatives", so it depends on how one uses > > the multipath mechanism. The ability to fallback to > > another nexthop assumes one connection is allowed to > > move from one ISP to another. What if the second ISP > > decides to reject the connection? What we have is a > > broken connection just because the retransmits > > were diverted to wrong place in the hurry. So, the > > nexthops can be compatible or incompatible. For your > > setup they are, for others they are not. > > I am not sure I completely understand your point. Are you saying that within a > single multipath route some connections to nexthops are allowed and others are > not? > > So to put that paragraph into an example > > 15.0.0.0/16 > nexthop via 12.0.0.2 dev swp1 weight 1 > nexthop via 12.0.0.3 dev swp1 weight 1 > > Hosts from 15.0/16 could have TCP connections use 12.0.0.2, but not 12.0.0.3 > because 12.0.0.3 could be a different ISP and not allow TCP connections from > this address space? Yes. Two cases are possible: 1. ISP2 filters saddr, traffic with saddr from ISP1 is dropped. 2. ISP2 allows any saddr. But how the responses from world with daddr=IP_from_ISP1 will come from ISP2 link? If the nexthops are for different ISP the connection can survive only if sticks to its ISP. An ISP will not work as a backup link for another ISP. The hash-based algorithm does not move connections from one nexthop to another. If you rearrange the nexthops on failure, the binding is lost and connections can break. A fallback from fragile wifi link can upset users and any redirects will lead to bad experience with broken conns. So only CONNMARK can help to stick connection to some path. 
If this path has multiple simultaneous alternatives you can again use multipath route reached from 'ip rule from PUBIP2 table ISP2' but then we again are restricted from the hash-based alg which is suitable only for default routes hit by saddr=0.0.0.0 lookups. In the common case when connection is created there are two lookups: 1. TCP selects nexthop with a saddr=0.0.0.0 lookup. ISP is selected based on the mp alg. 2. If one is lucky to reach ip_route_me_harder the hash-based lookup is defeated because now lookup uses saddr=iph->saddr, so it selects different nexthop. It works while all packets for the connection reach ip_route_me_harder. > > So, if the kernel used a random selection > > your fallback algorithm should help. But it is fragile > > for the simple setup with single default multipath route. > > May be what we miss is the ability to choose between > > random and hash-based selection. Then your patch may be > > useful but only for setup 2 (multipath route hit only by > > first packet). So, your patch may come with a sysctl var > > that explains your current patch logic: "avoid failed nexthops, > > never probe them, wait their failed entry to be expired by > > neigh_periodic_work and just then we can use the nexthop > > by creating new entry to probe the GW". Who will trigger > > probes often enough to maintain fresh state? > > First packet out does the probe -- neigh lookup fails, kernel has no knowledge > so can't reject the nexthop based on neighbor information. > > After that if it has information that says that a nexthop is dead, why would > it continue to try to probe? Any traffic that selects that nh is dead. That to If entry becomes FAILED this state is preserved if we do not direct traffic to this entry. If there was a single connection that was rejected after 3 failed probes the next connection (with your patch) will fallback to another neigh and the first entry will remain in FAILED state until expiration. 
If one wants to refresh the state often, a script/tool that pings all GWs is needed, so that you can notice the available or failed paths faster.

> me defies the basis of having multiple paths.

We do not know how long the outage is. Long-lived connections may prefer to survive with retransmits. Say you are using SSH via a wifi link doing important work. Do you want your connection to break just because the link was down for a while? Fallback to the other ISP can be unwanted.

If we do not know whether a multipath route is used per-packet or per-connection, we cannot just apply a fallback to other nexthops. We can do it only if the user can configure the different usages per multipath route:
1. hash-based or random
2. allow fallback or not

This config is not a MUST: if users can select random mode, it can imply that fallback is allowed.

Regards

--
Julian Anastasov
Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as drain mode
Hi Eric, On Thu, Mar 24, 2016 at 11:49:41PM -0700, Eric Dumazet wrote: > Everything is possible, but do not complain because BPF went in the > kernel before your changes. Don't get me wrong, I'm not complaining, I'm more asking for help to try to elaborate the alternate solution. I understood well what my proposal did because it was pretty straightforward, and the new one I'll have to do is of an immense complexity for me by now, since it will require learning a new language and finding doc on how all this works together. But as I said I'm totally sold to the benefits it can provide for large scale deployments and I'm well aware of the ugly socket scan there was in the previous one. > Just rework your patch. > > Supporting multiple SO_REUSEPORT groups on the same port should not be > too hard really. Making sure BPF still work for them is feasible. > > But the semantic of the socket option would be really different. I don't care much about the socket options themselves as long as I can continue to support seamless reloads. I could even get rid of SO_REUSEPORT on Linux to use something else instead if I have a reliable way to detect that the alternative will work. > You need to not control an individual listener, but a group of listener. > > Your dying haproxy would issue a single system call to tell the kernel : > My SO_REUSEPORT group should not accept new SYN packets, so that the new > haproxy can setup a working new SO_REUSEPORT group. Normally it's the other way around :-) The new process first grabs the socket, there's a tiny window where both are attached, and only then the old process leaves. That's the only way to ensure there's no loss nor added latency in the processing. If you could share a pointer to some good starter documentation for this, I would appreciate it. I really have no idea where to start from and the only elements I found on the net didn't give a single hint regarding all this :-/ Thanks, Willy
Re: [PATCH] stmmac: Fix phy without MDIO subnode
On 3/24/2016 6:01 PM, John Keeping wrote: On Thu, 24 Mar 2016 13:56:02 +0100, Giuseppe CAVALLARO wrote: This should be fixed by some work done some days ago and not yet committed. Pls see "stmmac: MDIO fixes" patch-set and let me know if ok on your side. Yes, that works for me. thx John for having tested it on your side. peppe Thanks, John On 3/24/2016 11:56 AM, John Keeping wrote: Since commit 88f8b1bb41c6 ("stmmac: Fix 'eth0: No PHY found' regression") we no longer allocate mdio_bus_data unless there is a MDIO subnode. This breaks the ethernet on the Radxa Rock2 (using rk3288-rock2-square.dts) which does not have an MDIO subnode. That commit was correct that the phy_bus_name test is unhelpful since we allocate "plat" in the same function and never set phy_bus_name so let's just drop the test which restores the previous behaviour. Fixes: 88f8b1bb41c6 ("stmmac: Fix 'eth0: No PHY found' regression") Signed-off-by: John Keeping--- drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c index dcbd2a1..e0fa060 100644 --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c @@ -189,7 +189,7 @@ stmmac_probe_config_dt(struct platform_device *pdev, const char **mac) if (of_property_read_u32(np, "snps,phy-addr", >phy_addr) == 0) dev_warn(>dev, "snps,phy-addr property is deprecated\n"); - if ((plat->phy_node && !of_phy_is_fixed_link(np)) || !plat->mdio_node) + if ((plat->phy_node && !of_phy_is_fixed_link(np))) plat->mdio_bus_data = NULL; else plat->mdio_bus_data =
RE: [v6, 2/5] soc: fsl: add GUTS driver for QorIQ platforms
> -Original Message- > From: Scott Wood > Sent: Saturday, March 19, 2016 6:55 AM > To: Yangbo Lu; devicet...@vger.kernel.org; linux-arm- > ker...@lists.infradead.org; linux-ker...@vger.kernel.org; linuxppc- > d...@lists.ozlabs.org; linux-...@vger.kernel.org; linux- > i...@vger.kernel.org; io...@lists.linux-foundation.org; > netdev@vger.kernel.org; linux-...@vger.kernel.org > Cc: Rob Herring; Russell King; Jochen Friedrich; Joerg Roedel; Claudiu > Manoil; ulf.hans...@linaro.org; Bhupesh Sharma; Zhao Qiang; Kumar Gala; > Santosh Shilimkar; Yang-Leo Li; Xiaobo Xie > Subject: Re: [v6, 2/5] soc: fsl: add GUTS driver for QorIQ platforms > > On 03/09/2016 04:18 AM, Yangbo Lu wrote: > > +#ifdef CONFIG_FSL_GUTS > > +u32 fsl_guts_get_svr(void); > > +int fsl_guts_init(void); > > +#endif > > Don't ifdef prototypes (when not providing a stub alternative). > > -Scott [Lu Yangbo-B47093] Ok, will remove the ifdef. Thank you very much.
Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as drain mode
On Fri, 2016-03-25 at 06:28 +0100, Willy Tarreau wrote: > On Thu, Mar 24, 2016 at 04:54:03PM -0700, Tom Herbert wrote: > > On Thu, Mar 24, 2016 at 4:40 PM, Yann Ylavicwrote: > > > I'll learn how to do this to get the best performances from the > > > server, but having to do so to work around what looks like a defect > > > (for simple/default SMP configurations at least, no NUMA or clever > > > CPU-affinity or queuing policy involved) seems odd in the first place. > > > > > I disagree with your assessment that there is a defect. SO_REUSEPORT > > is designed to spread packets amongst _equivalent_ connections. In the > > server draining case sockets are no longer equivalent, but that is a > > special case. > > I partially disagree with you here Tom. Initially SO_REUSEPORT was not > used to spread packets but to allow soft restart in some applications. > I've been using it since 2001 in haproxy on *BSD and linux 2.2. It was > removed during 2.3 and I used to keep a patch to reimplement it in 2.4 > (basically 2 or 3 lines, the infrastructure was still present), but the > patch was not accepted. The same patch worked for 2.6 and 3.x, allowing > me to continue to perform soft-restarts on Linux just like I used to do > on *BSD. When SO_REUSEPORT was reimplemented in 3.9 with load balancing, > I was happy because it at last allowed me to drop my patch and I got > the extra benefit of better load balancing of incoming connections. > > But the main use we have for it (at least historically) is for soft > restarts, where one process replaces another one. Very few people use > more than one process in our case. > > However given the benefits of the load spreading for extreme loads, > I'm willing to find how to achieve the same with BPF, but it's pretty > clear that at this point I have no idea where to start from and that > for a single process replacing a single one, it looks quite complicated. 
> For me quite frankly the special case is the load balancing, which is
> a side benefit (and a nice one, don't get me wrong).
>
> That's why I would have found it nice to "fix" the process replacement
> to avoid dropping incoming connections, though I don't want it to
> become a problem for future improvements on BPF. I don't think the two
> lines I proposed could become an issue, but I'll live without them (or
> continue to apply this patch).
>
> BTW, I have no problem with having to write a little bit of assembly
> for fast interfaces if it remains untouched for years; we already have
> a bit in haproxy. It's just a long-term investment.

Everything is possible, but do not complain because BPF went into the
kernel before your changes. Just rework your patch.

Supporting multiple SO_REUSEPORT groups on the same port should not be
too hard really. Making sure BPF still works for them is feasible.

But the semantics of the socket option would be really different. You
need to control not an individual listener, but a group of listeners.
Your dying haproxy would issue a single system call to tell the kernel:
my SO_REUSEPORT group should not accept new SYN packets, so that the new
haproxy can set up a working new SO_REUSEPORT group.
RE: [v6, 5/5] mmc: sdhci-of-esdhc: fix host version for T4240-R1.0-R2.0
> -----Original Message-----
> From: Scott Wood
> Sent: Saturday, March 19, 2016 2:28 AM
> To: Arnd Bergmann; Rob Herring
> Cc: Yangbo Lu; linuxppc-...@lists.ozlabs.org; devicet...@vger.kernel.org; ulf.hans...@linaro.org; Zhao Qiang; Russell King; Bhupesh Sharma; netdev@vger.kernel.org; Joerg Roedel; Kumar Gala; linux-m...@vger.kernel.org; linux-ker...@vger.kernel.org; Yang-Leo Li; io...@lists.linux-foundation.org; linux-...@vger.kernel.org; Claudiu Manoil; Santosh Shilimkar; Xiaobo Xie; linux-...@vger.kernel.org; linux-arm-ker...@lists.infradead.org
> Subject: Re: [v6, 5/5] mmc: sdhci-of-esdhc: fix host version for T4240-R1.0-R2.0
>
> On 03/17/2016 12:06 PM, Arnd Bergmann wrote:
> > On Thursday 17 March 2016 12:01:01 Rob Herring wrote:
> >> On Mon, Mar 14, 2016 at 05:45:43PM +, Scott Wood wrote:
> >>>> This makes the driver non-portable. Better identify the specific
> >>>> workarounds based on the compatible string for this device, or add
> >>>> a boolean DT property for the quirk.
> >>>>
> >>>> Arnd
> >>>
> >>> [Lu Yangbo-B47093] Hi Arnd, we did have a discussion about using
> >>> DTS in v1 before.
> >>> https://patchwork.kernel.org/patch/6834221/
> >>>
> >>> We don't have a separate DTS file for each revision of an SOC and
> >>> if we did, we'd constantly have people using the wrong one.
> >>> In addition, the device tree is stable ABI and errata are often
> >>> discovered after device trees are deployed.
> >>> See the link for details.
> >>>
> >>> So we decided to read SVR from the device-config/guts MMIO block
> >>> rather than using DTS.
> >>> Thanks.
> >>>
> >>> Also note that this driver is already only for fsl-specific
> >>> hardware, and it will still work even if fsl_guts doesn't find
> >>> anything to bind to -- it just wouldn't be able to detect errata
> >>> based on SVR in that case.
> >>
> >> IIRC, it is the same IP block as i.MX and Arnd's point is this won't
> >> even compile on !PPC. It is things like this that prevent sharing
> >> the driver.
> The whole point of using the MMIO SVR instead of the PPC SPR is so that
> it will work on ARM... The guts driver should build on any platform as
> long as OF is enabled, and if it doesn't find a node to bind to it will
> return 0 for SVR, and the eSDHC driver will continue (after printing an
> error that should be removed) without the ability to test for errata
> based on SVR.
>
> > I think the first four patches take care of building for ARM, but the
> > problem remains if you want to enable COMPILE_TEST as we need for
> > certain automated checking.
>
> What specific problem is there with COMPILE_TEST?
>
> >> Dealing with Si revs is a common problem. We should have a common
> >> solution. There is soc_device for this purpose.
> >
> > Exactly. The last time this came up, I think we agreed to implement a
> > helper using glob_match() on the soc_device strings. Unfortunately
> > this hasn't happened then, but I'd still prefer that over yet another
> > vendor-specific way of dealing with the generic issue.
>
> soc_device would require encoding the SVR as a string and then decoding
> the string, which is more complicated and error prone than having
> platform-specific code test a platform-specific number. And when would
> it get registered on arm64, which doesn't have platform code?
>
> -Scott

[Lu Yangbo-B47093] Hi Arnd, could you answer Scott's questions? If you don't oppose this patch, I'd like to rework a new version for merging. Thanks. :)
RE: [v6, 3/5] dt: move guts devicetree doc out of powerpc directory
> -----Original Message-----
> From: Scott Wood
> Sent: Saturday, March 19, 2016 2:16 AM
> To: Rob Herring; Yangbo Lu
> Cc: devicet...@vger.kernel.org; linux-arm-ker...@lists.infradead.org; linux-ker...@vger.kernel.org; linuxppc-...@lists.ozlabs.org; linux-c...@vger.kernel.org; linux-...@vger.kernel.org; iommu@lists.linux-foundation.org; netdev@vger.kernel.org; linux-...@vger.kernel.org; Russell King; Jochen Friedrich; Joerg Roedel; Claudiu Manoil; ulf.hans...@linaro.org; Bhupesh Sharma; Zhao Qiang; Kumar Gala; Santosh Shilimkar; Yang-Leo Li; Xiaobo Xie
> Subject: Re: [v6, 3/5] dt: move guts devicetree doc out of powerpc directory
>
> On 03/17/2016 12:06 PM, Rob Herring wrote:
> > On Wed, Mar 09, 2016 at 06:08:49PM +0800, Yangbo Lu wrote:
> >> Move guts devicetree doc to
> >> Documentation/devicetree/bindings/soc/fsl/
> >> since it's used by not only PowerPC but also ARM. And add a
> >> specification for 'little-endian' property.
> >>
> >> Signed-off-by: Yangbo Lu
> >> ---
> >> Changes for v2:
> >>	- None
> >> Changes for v3:
> >>	- None
> >> Changes for v4:
> >>	- Added this patch
> >> Changes for v5:
> >>	- Modified the description for little-endian property
> >> Changes for v6:
> >>	- None
> >> ---
> >>  Documentation/devicetree/bindings/{powerpc => soc}/fsl/guts.txt | 3 +++
> >>  1 file changed, 3 insertions(+)
> >>  rename Documentation/devicetree/bindings/{powerpc => soc}/fsl/guts.txt (91%)
> >>
> >> diff --git a/Documentation/devicetree/bindings/powerpc/fsl/guts.txt b/Documentation/devicetree/bindings/soc/fsl/guts.txt
> >> similarity index 91%
> >> rename from Documentation/devicetree/bindings/powerpc/fsl/guts.txt
> >> rename to Documentation/devicetree/bindings/soc/fsl/guts.txt
> >> index b71b203..07adca9 100644
> >> --- a/Documentation/devicetree/bindings/powerpc/fsl/guts.txt
> >> +++ b/Documentation/devicetree/bindings/soc/fsl/guts.txt
> >> @@ -25,6 +25,9 @@ Recommended properties:
> >>  - fsl,liodn-bits : Indicates the number of defined bits in the LIODN
> >>    registers, for those SOCs that have a PAMU device.
> >>
> >> + - little-endian : Indicates that the global utilities block is little
> >> +   endian. The default is big endian.
>
> > The default is "the native endianness of the system". So absence on an
> > ARM system would be LE.
>
> No. For this binding, the default is big-endian, because that's what
> existed for this device before an endian property was added.
>
> "endianness of the system" is not a well-defined concept.
>
> > This property is valid for any simple-bus device,
>
> Since when does simple-bus mean anything more than that the nodes
> underneath it can be used without bus-specific knowledge?
>
> > so it isn't really required to document per device. You can, but your
> > description had better match the documented behaviour.
>
> Documented where?
>
> In fact, Documentation/devicetree/bindings/common-properties.txt
> explicitly says of the endian properties, "If a binding supports these
> properties, then the binding should also specify the default behavior if
> none of these properties are present."
>
> -Scott

[Lu Yangbo-B47093] So, Rob, could you accept this patch after so much discussion? :)
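To make the binding question concrete, a hedged sketch of a guts node using the property under discussion (the compatible string, unit address, and reg values here are illustrative only, not taken from a real board file):

```dts
guts: global-utilities@e0000 {
	/* illustrative compatible; use the one documented for your SoC */
	compatible = "fsl,qoriq-device-config-1.0";
	reg = <0xe0000 0xe00>;
	/* Per the patch: omit this property and the block is accessed
	 * as big-endian, regardless of the CPU's native endianness. */
	little-endian;
};
```

This is exactly the case Scott's reading of common-properties.txt covers: the binding itself states the default (big-endian), so an absent property means the same thing on PowerPC and ARM.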