Re: [GIT] Networking

2017-09-06 Thread Linus Torvalds
On Wed, Sep 6, 2017 at 10:40 PM, Luca Coelho  wrote:
>
> This patch is not very important (unless you really like blinking lights
> -- maybe I'll change my mind when the holidays approach :P), so it is
> fine if you just want to revert it for now.
>
> In any case, I'll send a patch fixing this problem soon.

No need to revert if we can get this fixed quickly enough.

I'll leave the fw-31 on my laptop, so that I can continue to use it for now.

Thanks,

   Linus


Re: [GIT] Networking

2017-09-06 Thread Luca Coelho
On Thu, 2017-09-07 at 05:04 +, Coelho, Luciano wrote:
> On Wed, 2017-09-06 at 21:57 -0700, Linus Torvalds wrote:
> > On Wed, Sep 6, 2017 at 9:11 PM, Coelho, Luciano
> >  wrote:
> > > 
> > > This seems to be a problem with backwards-compatibility with FW version
> > > 27.  We are now in version 31[1] and upgrading will probably fix that.
> > 
> > I can confirm that fw version 31 works.
> 
> Great, so I know for sure that this is a backwards-compatibility issue
> with the FW API.
> 
> 
> > > But obviously the driver should not fail miserably like this with
> > > version 27, because it claims to support it still.
> > 
> > Not just "claims to support it", but if it's what is shipped with a
> > fairly recent distro like an up-to-date version of F26, I would really
> > hope that the driver can still work with it.
> 
> I totally agree; we support a bunch of older versions for that exact
> reason.  We just don't really test all the supported versions very
> often.  We should probably change that.
> 
> I'll make sure it still works with version 27.

Okay, I found the offending patch:

commit 7089ae634c50544b29b31faf1a751e8765c8de3b
Author: Johannes Berg 
AuthorDate: Wed Jun 28 16:19:49 2017 +0200
Commit: Luca Coelho 
CommitDate: Wed Aug 9 09:15:32 2017 +0300

iwlwifi: mvm: use firmware LED command where applicable

On devices starting from 8000 series, the host can no longer toggle
the LED through the CSR_LED_REG register, but must do it via the
firmware instead. Add support for this. Note that this means that
the LED cannot be turned on while the firmware is off, so using an
arbitrary LED trigger may not work as expected.

Fixes: 503ab8c56ca0 ("iwlwifi: Add 8000 HW family support")
Signed-off-by: Johannes Berg 
Signed-off-by: Luca Coelho 

Reverting it solves the problem.  We introduced a new command to control
the LED lights and assumed it was available in older FW versions as
well, which turned out not to be the case.

This patch is not very important (unless you really like blinking lights
-- maybe I'll change my mind when the holidays approach :P), so it is
fine if you just want to revert it for now.

In any case, I'll send a patch fixing this problem soon.


--
Cheers,
Luca.


Re: [RFC net-next 0/5] TSN: Add qdisc-based config interfaces for traffic shapers

2017-09-06 Thread Henrik Austad
On Thu, Aug 31, 2017 at 06:26:20PM -0700, Vinicius Costa Gomes wrote:
> Hi,
> 
> This patchset is an RFC on a proposal of how the Traffic Control subsystem can
> be used to offload the configuration of traffic shapers into network devices
> that provide support for them in HW. Our goal here is to start upstreaming
> support for features related to the Time-Sensitive Networking (TSN) set of
> standards into the kernel.

Nice to see that others are working on this as well! :)

A short disclaimer: I'm pretty much anchored in the view "Linux is the
end-station in a TSN domain". Is this your approach as well, or are you
looking at this driver being used in bridges too? (That will affect the
comments on the time-aware shaper and frame preemption.)

Yet another disclaimer: I am not a Linux networking subsystem expert. Not
by a long shot! There is black magic happening in the internals of the
networking subsystem that I am not even aware of. So if something I say or
ask does not make sense _at_all_, that's probably why.

I do know a tiny bit about TSN, though, and I have been messing around
with it for a little while, hence my comments below.

> As part of this work, we've assessed previous public discussions related to
> TSN enabling: patches from Henrik Austad (Cisco), the presentation from Eric Mann
> at Linux Plumbers 2012, patches from Gangfeng Huang (National Instruments) and
> the current state of the OpenAVNU project (https://github.com/AVnu/OpenAvnu/).

/me eyes Cc ;p

> Overview
> 
> 
> Time-sensitive Networking (TSN) is a set of standards that aim to address
> resource availability for providing bandwidth reservation and bounded latency
> on Ethernet based LANs. The proposal described here aims to cover mainly what
> is needed to enable the following standards: 802.1Qat, 802.1Qav, 802.1Qbv and
> 802.1Qbu.
> 
> The initial target of this work is the Intel i210 NIC, but other controllers'
> datasheets were also taken into account, like the Renesas RZ/A1H RZ/A1M group
> and the Synopsys DesignWare Ethernet QoS controller.

NXP has a TSN aware chip on the i.MX7 sabre board as well.

> Proposal
> 
> 
> Feature-wise, what is covered here are configuration interfaces for HW
> implementations of the Credit-Based shaper (CBS, 802.1Qav), Time-Aware shaper
> (802.1Qbv) and Frame Preemption (802.1Qbu). CBS is a per-queue shaper, while
> Qbv and Qbu must be configured per port, with the configuration covering all
> queues. Given that these features are related to traffic shaping, and that the
> traffic control subsystem already provides a queueing discipline that offloads
> config into the device driver (i.e. mqprio), designing new qdiscs for the
> specific purpose of offloading the config for each shaper seemed like a good
> fit.

Just to be clear: you register sch_cbs as a subclass of mqprio, not as a
root class?

> For steering traffic into the correct queues, we use the socket option
> SO_PRIORITY and then a mechanism to map priority to traffic classes / Tx 
> queues.
> The qdisc mqprio is currently used in our tests.

Right, fair enough. I'd prefer the TSN qdisc to be the root qdisc on the
device, and rather have mqprio for high priority traffic and another for
'everything else', but this would work too. This is not that relevant at
this stage, I guess :)

> As for the shapers config interface:
> 
>  * CBS (802.1Qav)
> 
>This patchset is proposing a new qdisc called 'cbs'. Its 'tc' cmd line is:
>$ tc qdisc add dev IFACE parent ID cbs locredit N hicredit M sendslope S \
>  idleslope I

So this confuses me a bit: why specify sendSlope?

sendSlope = portTransmitRate - idleSlope

and portTransmitRate is the speed of the MAC (which you get from the
driver). Adding sendSlope here is just redundant, I think.

Also, does this mean that when you create the qdisc, you have locked the 
bandwidth for the scheduler? Meaning, if I later want to add another 
stream that requires more bandwidth, I have to close all active streams, 
reconfigure the qdisc and then restart?

>Note that the parameters for this qdisc are the ones defined by the
>802.1Q-2014 spec, so no hardware specific functionality is exposed here.

You do need to know if the link is brought up as 100 or 1000 though - which 
the driver already knows.

>  * Time-aware shaper (802.1Qbv):
> 
>The idea we are currently exploring is to add a "time-aware", priority based
>qdisc, that also exposes the Tx queues available and provides a mechanism for
>mapping priority <-> traffic class <-> Tx queues in a similar fashion as
>mqprio. We are calling this qdisc 'taprio', and its 'tc' cmd line would be:

As far as I know, this is not supported by i210, and if time-aware shaping 
is enabled in the network - you'll be queued on a bridge until the window 
opens as time-aware shaping is enforced on the tx-port and not on rx. Is 
this required in this driver?

>$ tc qdisc add dev ens4 parent root 

[LPC] 2nd RDMA Mini-Summit Schedule

2017-09-06 Thread Leon Romanovsky
Hi,

We're happy to announce the schedule of the 2nd RDMA mini-summit, which will be
held as part of the upcoming Linux Plumbers Conference 2017.

During the conference, we will have two sessions: a main track session and
a round table session.

The first session will be held on Thursday, September 14, 2017 from 9:30am – 12:30pm
with topics relevant to the wider audience.

The second session will focus more on face-to-face discussions in round
table format; it will be scheduled a little later, once we get a meeting
room.

---
FIRST SESSION:
---
 * Backporting issues with multi-subsystem devices (RDMA + netdev + more) by
   Don Dutile
   - More and more devices in RDMA depend on netdev as well as other
 subsystems -- target, NFS, SCSI, block, cgroups, SELinux. Can we make
 the backporting task easier for distro people?
 * kABI Update by Matan Barak
   - Going forward with a flexible and secure RDMA kABI
 * System Boot and RDMA by Jason Gunthorpe
   - Discuss boot-time issues with the RDMA subsystem
 * Paravirtual RDMA device by Marcel Apfelbaum and Yuval Shaia
   - QEMU's limited RDMA support leaves it behind other modern hypervisors.
 Marcel and Yuval will present the implementation of an emulated RDMA device,
 analyze its performance and usability, and finally talk about future plans
 for a possible virtio-rdma device.
  * TX Flow Steering for IPsec by Liran Liss (pending time slot)
   - The IPsec protocol provides both encryption and authentication for IP
 packets. Users of the raw Ethernet QP that send/receive IPsec packets
 could benefit from the IPsec crypto offload capabilities that are
 available in recent NICs.
  * OpenFabrics Alliance - Status and New Directions by Jason Gunthorpe
  * BOF - Open Discussion
   - Open discussion with the community, led by Leon Romanovsky, Jason Gunthorpe,
 Christoph Lameter and Dennis Dalessandro

---
ROUND TABLE SESSION:
---
Possible list, bring your own topics:
 * RDMA and ULPs
 * Verbs testing suite
 * RDMA netlink and RDMAtool future plans
 * RDMA and the Linux IP stack
 * Infrastructure issues with Travis CI
 * Infrastructure gaps with rdma-core
 * Packaging as part of CI for rdma-core.

See you next week at Linux Plumber Conference 2017.

Thanks




Re: ipset losing entries on its own

2017-09-06 Thread Akshat Kakkar
What I observed is that rehashing of the set is not happening.

When I add multiple IPs to the ipset manually on ipset v6.32,
rehashing does not happen and my hashsize stays at 1024.

But when I add them on ipset 4.5 (pretty old, I know!), rehashing does
happen and my hashsize grows from 1024 to 1536 to 2304 to 3456!


WARNING: at net/netfilter/core.c:218 __nf_hook_entries_try_shrink+0xf7/0x110

2017-09-06 Thread Mike Galbraith

[   21.219604] ip6_tables: (C) 2000-2006 Netfilter Core Team
[   21.433091] nf_conntrack version 0.5.0 (65536 buckets, 262144 max)
[   21.495849] ip_tables: (C) 2000-2006 Netfilter Core Team
[   22.404040] [ cut here ]
[   22.404267] WARNING: CPU: 2 PID: 1379 at net/netfilter/core.c:218 
__nf_hook_entries_try_shrink+0xf7/0x110
[   22.404484] Modules linked in: xt_tcpudp(E) iptable_filter(E) 
ip6table_mangle(E) nf_conntrack_netbios_ns(E) nf_conntrack_broadcast(E) 
nf_conntrack_ipv4(E) nf_defrag_ipv4(E) ip_tables(E) xt_conntrack(E) 
nf_conntrack(E) ip6table_filter(E) ip6_tables(E) x_tables(E) nls_iso8859_1(E) 
nls_cp437(E) intel_rapl(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) 
coretemp(E) snd_hda_codec_hdmi(E) kvm_intel(E) kvm(E) snd_hda_codec_realtek(E) 
snd_hda_codec_generic(E) snd_hda_intel(E) irqbypass(E) joydev(E) 
snd_hda_codec(E) snd_hwdep(E) snd_hda_core(E) snd_pcm(E) crct10dif_pclmul(E) 
snd_timer(E) crc32_pclmul(E) crc32c_intel(E) ghash_clmulni_intel(E) snd(E) 
pcbc(E) aesni_intel(E) iTCO_wdt(E) aes_x86_64(E) r8169(E) 
iTCO_vendor_support(E) crypto_simd(E) glue_helper(E) lpc_ich(E) mii(E) 
mei_me(E) mei(E) mfd_core(E) soundcore(E)
[   22.405405]  cryptd(E) i2c_i801(E) shpchp(E) intel_smartconnect(E) 
thermal(E) fan(E) battery(E) tpm_infineon(E) pcspkr(E) nfsd(E) auth_rpcgss(E) 
nfs_acl(E) lockd(E) grace(E) sunrpc(E) sr_mod(E) cdrom(E) hid_logitech_hidpp(E) 
hid_logitech_dj(E) uas(E) usb_storage(E) hid_generic(E) usbhid(E) nouveau(E) 
wmi(E) i2c_algo_bit(E) drm_kms_helper(E) syscopyarea(E) sysfillrect(E) 
sysimgblt(E) ehci_pci(E) fb_sys_fops(E) ahci(E) xhci_pci(E) ttm(E) libahci(E) 
ehci_hcd(E) xhci_hcd(E) libata(E) drm(E) usbcore(E) video(E) button(E) 
sd_mod(E) vfat(E) fat(E) virtio_rng(E) virtio_blk(E) virtio_mmio(E) 
virtio_pci(E) virtio_ring(E) virtio(E) ext4(E) crc16(E) mbcache(E) jbd2(E) 
fscrypto(E) loop(E) sg(E) dm_multipath(E) dm_mod(E) scsi_dh_rdac(E) 
scsi_dh_emc(E) scsi_dh_alua(E) scsi_mod(E) efivarfs(E) autofs4(E)
[   22.406419] CPU: 2 PID: 1379 Comm: iptables-batch Tainted: GE   
4.13.0.g1c9fe44-master #10
[   22.406662] Hardware name: MEDION MS-7848/MS-7848, BIOS M7848W08.20C 
09/23/2013
[   22.406911] task: 88041e822940 task.stack: c93f4000
[   22.407161] RIP: 0010:__nf_hook_entries_try_shrink+0xf7/0x110
[   22.407411] RSP: 0018:c93f7c18 EFLAGS: 00010246
[   22.407668] RAX: 0010 RBX: 8803cce14e40 RCX: 
[   22.407918] RDX: 0001 RSI:  RDI: 0001
[   22.408162] RBP: c93f7c38 R08: 0001 R09: 
[   22.408410] R10:  R11: 00040001 R12: 8803cce14e58
[   22.408663] R13: 81aac2c8 R14: a09290a0 R15: 81aab9c0
[   22.408928] FS:  7f6c31c1a700() GS:88041ec8() 
knlGS:
[   22.409186] CS:  0010 DS:  ES:  CR0: 80050033
[   22.409447] CR2: 7f708537af14 CR3: 0003fa4e5004 CR4: 001606e0
[   22.409710] Call Trace:
[   22.409978]  nf_unregister_net_hooks+0xf2/0x1a0
[   22.410246]  ? mutex_lock+0xe/0x30
[   22.410518]  ? mutex_lock+0xe/0x30
[   22.410786]  ? cleanup_match+0x40/0x50 [ip_tables]
[   22.411052]  ipv4_hooks_unregister+0x58/0x70 [nf_conntrack_ipv4]
[   22.411322]  nf_ct_netns_put+0x3c/0x80 [nf_conntrack]
[   22.411584]  cleanup_match+0x32/0x50 [ip_tables]
[   22.411845]  cleanup_entry+0x2e/0x90 [ip_tables]
[   22.412110]  translate_table+0x454/0x570 [ip_tables]
[   22.412371]  do_ipt_set_ctl+0xed/0x193 [ip_tables]
[   22.412641]  nf_setsockopt+0x3e/0x60
[   22.412909]  ip_setsockopt+0x6f/0x90
[   22.413178]  SyS_setsockopt+0x5e/0xb0
[   22.413450]  entry_SYSCALL_64_fastpath+0x1a/0xa5
[   22.413706] RIP: 0033:0x7f6c3137cbca
[   22.413969] RSP: 002b:7ffc32d9d558 EFLAGS: 0202 ORIG_RAX: 
0036
[   22.414224] RAX: ffda RBX: 7f6c3162f678 RCX: 7f6c3137cbca
[   22.414490] RDX: 0040 RSI:  RDI: 0007
[   22.414758] RBP: 7f6c3162f620 R08: 0ac8 R09: fefefefeff626d74
[   22.415016] R10: 01ed8920 R11: 0202 R12: 0af0
[   22.415293] R13: 0ac8 R14: 2709 R15: 0ad1
[   22.415564] Code: 39 d1 77 ba 4c 89 f7 e8 98 fe ff ff 4d 89 75 00 48 89 d8 
5b 41 5c 41 5d 41 5e 5d c3 45 31 f6 eb eb 31 c0 eb ee 0f ff 31 c0 eb e8 <0f> ff 
31 c0 0f 1f 44 00 00 eb dd 0f 1f 40 00 66 2e 0f 1f 84 00 
[   22.415881] ---[ end trace d15c01c3a7360856 ]---

Met 10 of those during this boot.  The first time I booted, the
warnings were followed by the explosion below, but that kernel was not
entirely virgin, so add a pinch of salt.

[   48.592298] Ebtables v2.0 registered
[   48.730755] BUG: unable to handle kernel NULL pointer dereference at 
003c
[   48.732723] IP: _raw_read_lock_bh+0x15/0x30
[   48.734526] PGD 0 P4D 0 
[   48.736482] Oops: 0002 [#1] SMP
[   48.738317] Dumping ftrace 

Re: [GIT] Networking

2017-09-06 Thread Coelho, Luciano
On Wed, 2017-09-06 at 21:57 -0700, Linus Torvalds wrote:
> On Wed, Sep 6, 2017 at 9:11 PM, Coelho, Luciano
>  wrote:
> > 
> > This seems to be a problem with backwards-compatibility with FW version
> > 27.  We are now in version 31[1] and upgrading will probably fix that.
> 
> I can confirm that fw version 31 works.

Great, so I know for sure that this is a backwards-compatibility issue
with the FW API.


> > But obviously the driver should not fail miserably like this with
> > version 27, because it claims to support it still.
> 
> Not just "claims to support it", but if it's what is shipped with a
> fairly recent distro like an up-to-date version of F26, I would really
> hope that the driver can still work with it.

I totally agree; we support a bunch of older versions for that exact
reason.  We just don't really test all the supported versions very
often.  We should probably change that.

I'll make sure it still works with version 27.

--
Cheers,
Luca.

Re: [GIT] Networking

2017-09-06 Thread Linus Torvalds
On Wed, Sep 6, 2017 at 9:11 PM, Coelho, Luciano
 wrote:
>
> This seems to be a problem with backwards-compatibility with FW version
> 27.  We are now in version 31[1] and upgrading will probably fix that.

I can confirm that fw version 31 works.

> But obviously the driver should not fail miserably like this with
> version 27, because it claims to support it still.

Not just "claims to support it", but if it's what is shipped with a
fairly recent distro like an up-to-date version of F26, I would really
hope that the driver can still work with it.

> I'm looking into this now and will provide a fix asap.

Thanks,

  Linus


Re: ipset losing entries on its own

2017-09-06 Thread Akshat Kakkar
I understand that without reproducible scenarios it's hard to debug,
but the point is that this issue is fully random and very low frequency.

For the setup, it is CentOS 7.3 upgraded to kernel 4.4.

Whenever a system comes up on the network, the user provides credentials
and, after successful authentication, the system's IP is added to an ipset
which has access to certain resources.


Re: [PATCH] tipc: remove unnecessary call to dev_net()

2017-09-06 Thread David Miller
From: Kleber Sacilotto de Souza 
Date: Wed,  6 Sep 2017 11:08:06 +0200

> The net namespace is already stored in the 'net' variable, so no need to
> call dev_net() again.
> 
> Signed-off-by: Kleber Sacilotto de Souza 

Applied, thanks.


[Patch net v2 2/2] net_sched: fix all the madness of tc filter chain

2017-09-06 Thread Cong Wang
This patch fixes the following madness of tc filter chain:

1) tcf_chain_destroy() is called by both tcf_block_put() and
   tcf_chain_put().  tcf_chain_put() is correctly refcnt'ed and paired
   with tcf_chain_get(), but tcf_block_put() is not; it should be paired
   with tcf_block_get(), which means we still need to decrease the refcnt.
   Think of it another way: if we call tcf_block_put() immediately after
   tcf_block_get(), shouldn't we effectively get a nop? This causes a memory
   leak as reported by Jakub.

2) tp proto should hold a refcnt to the chain too. This significantly
   simplifies the logic:

2a) Chain 0 is no longer special, it is created and refcnted by tp
like any other chains. All the ugliness in tcf_chain_put() can be
gone!

2b) No need to handle the flushing oddly: because the block still holds
chain 0, it cannot be released, which guarantees the block is the last
user.

2c) The race condition with RCU callbacks is easier to handle with just
a rcu_barrier()! Much easier to understand, nothing to hide! Thanks
to the previous patch. Please see also the comments in code.

2d) Make the code understandable by humans, much less error-prone.

Fixes: 744a4cf63e52 ("net: sched: fix use after free when tcf_chain_destroy is called multiple times")
Fixes: 5bc1701881e3 ("net: sched: introduce multichain support for filters")
Reported-by: Jakub Kicinski 
Cc: Jiri Pirko 
Signed-off-by: Cong Wang 
---
 net/sched/cls_api.c | 38 ++
 1 file changed, 22 insertions(+), 16 deletions(-)

diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 6c5ea84d2682..e9060dc36519 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -209,21 +209,20 @@ static void tcf_chain_flush(struct tcf_chain *chain)
RCU_INIT_POINTER(*chain->p_filter_chain, NULL);
while ((tp = rtnl_dereference(chain->filter_chain)) != NULL) {
RCU_INIT_POINTER(chain->filter_chain, tp->next);
+   tcf_chain_put(chain);
tcf_proto_destroy(tp);
}
 }
 
 static void tcf_chain_destroy(struct tcf_chain *chain)
 {
-   /* May be already removed from the list by the previous call. */
-   if (!list_empty(&chain->list))
-   list_del_init(&chain->list);
+   list_del(&chain->list);
+   kfree(chain);
+}
 
-   /* There might still be a reference held when we got here from
-* tcf_block_put. Wait for the user to drop reference before free.
-*/
-   if (!chain->refcnt)
-   kfree(chain);
+static void tcf_chain_hold(struct tcf_chain *chain)
+{
+   ++chain->refcnt;
 }
 
 struct tcf_chain *tcf_chain_get(struct tcf_block *block, u32 chain_index,
@@ -233,7 +232,7 @@ struct tcf_chain *tcf_chain_get(struct tcf_block *block, u32 chain_index,
 
list_for_each_entry(chain, &block->chain_list, list) {
if (chain->index == chain_index) {
-   chain->refcnt++;
+   tcf_chain_hold(chain);
return chain;
}
}
@@ -246,10 +245,7 @@ EXPORT_SYMBOL(tcf_chain_get);
 
 void tcf_chain_put(struct tcf_chain *chain)
 {
-   /* Destroy unused chain, with exception of chain 0, which is the
-* default one and has to be always present.
-*/
-   if (--chain->refcnt == 0 && !chain->filter_chain && chain->index != 0)
+   if (--chain->refcnt == 0)
tcf_chain_destroy(chain);
 }
 EXPORT_SYMBOL(tcf_chain_put);
@@ -294,10 +290,18 @@ void tcf_block_put(struct tcf_block *block)
if (!block)
return;
 
-   list_for_each_entry_safe(chain, tmp, &block->chain_list, list) {
+   /* Standalone actions are not allowed to jump to any chain, and
+* bound actions should be all removed after flushing. However,
+* filters are destroyed in RCU callbacks, we have to flush and wait
+* for them before releasing this refcnt, otherwise we race with RCU
+* callbacks!!!
+*/
+   list_for_each_entry(chain, &block->chain_list, list)
tcf_chain_flush(chain);
-   tcf_chain_destroy(chain);
-   }
+   rcu_barrier();
+
+   list_for_each_entry_safe(chain, tmp, &block->chain_list, list)
+   tcf_chain_put(chain);
kfree(block);
 }
 EXPORT_SYMBOL(tcf_block_put);
@@ -375,6 +379,7 @@ static void tcf_chain_tp_insert(struct tcf_chain *chain,
rcu_assign_pointer(*chain->p_filter_chain, tp);
RCU_INIT_POINTER(tp->next, tcf_chain_tp_prev(chain_info));
rcu_assign_pointer(*chain_info->pprev, tp);
+   tcf_chain_hold(chain);
 }
 
 static void tcf_chain_tp_remove(struct tcf_chain *chain,
@@ -386,6 +391,7 @@ static void tcf_chain_tp_remove(struct tcf_chain *chain,
if (chain->p_filter_chain && tp == chain->filter_chain)
RCU_INIT_POINTER(*chain->p_filter_chain, next);

[Patch net v2 1/2] net_sched: get rid of tcfa_rcu

2017-09-06 Thread Cong Wang
The gen estimator has been rewritten in commit 1c0d32fde5bd
("net_sched: gen_estimator: complete rewrite of rate estimators"),
so the caller no longer needs to wait for a grace period.
This patch gets rid of it.

This also completely closes a race condition between the action free
path and the filter chain add/remove path for the following patch,
because otherwise the nested RCU callback can't be caught by
rcu_barrier().

Cc: Jiri Pirko 
Cc: Eric Dumazet 
Signed-off-by: Cong Wang 
---
 include/net/act_api.h |  2 --
 net/sched/act_api.c   | 12 +++-
 2 files changed, 3 insertions(+), 11 deletions(-)

diff --git a/include/net/act_api.h b/include/net/act_api.h
index 26ffd8333f50..68218a5f8e72 100644
--- a/include/net/act_api.h
+++ b/include/net/act_api.h
@@ -38,7 +38,6 @@ struct tc_action {
struct gnet_stats_queue tcfa_qstats;
struct net_rate_estimator __rcu *tcfa_rate_est;
spinlock_t  tcfa_lock;
-   struct rcu_head tcfa_rcu;
struct gnet_stats_basic_cpu __percpu *cpu_bstats;
struct gnet_stats_queue __percpu *cpu_qstats;
struct tc_cookie*act_cookie;
@@ -55,7 +54,6 @@ struct tc_action {
 #define tcf_qstats common.tcfa_qstats
 #define tcf_rate_est   common.tcfa_rate_est
 #define tcf_lock   common.tcfa_lock
-#define tcf_rcucommon.tcfa_rcu
 
 static inline unsigned int tcf_hash(u32 index, unsigned int hmask)
 {
diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index f2e9ed34a963..fc17d286a6a2 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -53,10 +53,8 @@ static void tcf_action_goto_chain_exec(const struct tc_action *a,
res->goto_tp = rcu_dereference_bh(chain->filter_chain);
 }
 
-static void free_tcf(struct rcu_head *head)
+static void free_tcf(struct tc_action *p)
 {
-   struct tc_action *p = container_of(head, struct tc_action, tcfa_rcu);
-
free_percpu(p->cpu_bstats);
free_percpu(p->cpu_qstats);
 
@@ -76,11 +74,7 @@ static void tcf_hash_destroy(struct tcf_hashinfo *hinfo, struct tc_action *p)
hlist_del(&p->tcfa_head);
spin_unlock_bh(&hinfo->lock);
gen_kill_estimator(&p->tcfa_rate_est);
-   /*
-* gen_estimator est_timer() might access p->tcfa_lock
-* or bstats, wait a RCU grace period before freeing p
-*/
-   call_rcu(&p->tcfa_rcu, free_tcf);
+   free_tcf(p);
 }
 
 int __tcf_hash_release(struct tc_action *p, bool bind, bool strict)
@@ -271,7 +265,7 @@ void tcf_hash_cleanup(struct tc_action *a, struct nlattr *est)
 {
if (est)
gen_kill_estimator(&a->tcfa_rate_est);
-   call_rcu(&a->tcfa_rcu, free_tcf);
+   free_tcf(a);
 }
 EXPORT_SYMBOL(tcf_hash_cleanup);
 
-- 
2.13.0



Re: [PATCH net] netlink: access nlk groups safely in netlink bind and getname

2017-09-06 Thread David Miller
From: Xin Long 
Date: Wed,  6 Sep 2017 11:53:29 +0800

> Now there is no lock protecting nlk ngroups/groups' accessing in
> netlink bind and getname. It's safe from nlk groups' setting in
> netlink_release, but not from netlink_realloc_groups called by
> netlink_setsockopt.
> 
> netlink_lock_table is needed in both netlink bind and getname when
> accessing nlk groups.
> 
> Acked-by: Florian Westphal 
> Signed-off-by: Xin Long 

Applied.


Re: [PATCH net] netlink: fix an use-after-free issue for nlk groups

2017-09-06 Thread David Miller
From: Xin Long 
Date: Wed,  6 Sep 2017 11:47:12 +0800

> ChunYu found a netlink use-after-free issue by syzkaller:
> 
> [28448.842981] BUG: KASAN: use-after-free in __nla_put+0x37/0x40 at addr 
> 8807185e2378
> [28448.969918] Call Trace:
> [...]
> [28449.117207]  __nla_put+0x37/0x40
> [28449.132027]  nla_put+0xf5/0x130
> [28449.146261]  sk_diag_fill.isra.4.constprop.5+0x5a0/0x750 [netlink_diag]
> [28449.176608]  __netlink_diag_dump+0x25a/0x700 [netlink_diag]
> [28449.202215]  netlink_diag_dump+0x176/0x240 [netlink_diag]
> [28449.226834]  netlink_dump+0x488/0xbb0
> [28449.298014]  __netlink_dump_start+0x4e8/0x760
> [28449.317924]  netlink_diag_handler_dump+0x261/0x340 [netlink_diag]
> [28449.413414]  sock_diag_rcv_msg+0x207/0x390
> [28449.432409]  netlink_rcv_skb+0x149/0x380
> [28449.467647]  sock_diag_rcv+0x2d/0x40
> [28449.484362]  netlink_unicast+0x562/0x7b0
> [28449.564790]  netlink_sendmsg+0xaa8/0xe60
> [28449.661510]  sock_sendmsg+0xcf/0x110
> [28449.865631]  __sys_sendmsg+0xf3/0x240
> [28450.000964]  SyS_sendmsg+0x32/0x50
> [28450.016969]  do_syscall_64+0x25c/0x6c0
> [28450.154439]  entry_SYSCALL64_slow_path+0x25/0x25
> 
> It was caused by no protection between nlk groups' free in netlink_release
> and nlk groups' accessing in sk_diag_dump_groups. The similar issue also
> exists in netlink_seq_show().
> 
> This patch is to defer nlk groups' free in deferred_put_nlk_sk.
> 
> Reported-by: ChunYu Wang 
> Acked-by: Florian Westphal 
> Signed-off-by: Xin Long 

Applied.


Re: [PATCH] dt-binding: phy: don't confuse with Ethernet phy properties

2017-09-06 Thread David Miller
From: Baruch Siach 
Date: Sun,  3 Sep 2017 17:32:16 +0300

> The generic PHY 'phys' property sometimes appears in the same node with
> the Ethernet PHY 'phy' or 'phy-handle' properties. Add a warning in
> phy-bindings.txt to reduce confusion.
> 
> Signed-off-by: Baruch Siach 

Applied.


Re: [GIT] Networking

2017-09-06 Thread Coelho, Luciano
On Wed, 2017-09-06 at 16:27 -0700, Linus Torvalds wrote:
> This pull request completely breaks Intel wireless for me.
> 
> This is my trusty old XPS 13 (9350), using Intel Wireless 8260 (rev 3a).
> 
> That remains a very standard Intel machine with absolutely zero odd
> things going on.
> 
> The firmware is iwlwifi-8000C-28.ucode from
> iwl7260-firmware-25.30.13.0-75.fc26.noarch, and the kernel reports
> 
>   iwlwifi :3a:00.0: loaded firmware version 27.455470.0 op_mode iwlmvm
> 
> the thing starts acting badly with this:
> 
>   iwlwifi :3a:00.0: FW Error notification: type 0x cmd_id 0x04
>   iwlwifi :3a:00.0: FW Error notification: seq 0x service 0x0004
>   iwlwifi :3a:00.0: FW Error notification: timestamp 0x5D84
>   iwlwifi :3a:00.0: Microcode SW error detected.  Restarting 0x200.
>   iwlwifi :3a:00.0: Start IWL Error Log Dump:
>   iwlwifi :3a:00.0: Status: 0x0100, count: 6
>   iwlwifi :3a:00.0: Loaded firmware version: 27.455470.0
>   iwlwifi :3a:00.0: 0x0038 | BAD_COMMAND
>   iwlwifi :3a:00.0: 0x00A002F0 | trm_hw_status0
>   ...
>   iwlwifi :3a:00.0: 0x | isr status reg
>   ieee80211 phy0: Hardware restart was requested
>   iwlwifi :3a:00.0: FW error in SYNC CMD MAC_CONTEXT_CMD

This seems to be a problem with backwards-compatibility with FW version
27.  We are now in version 31[1] and upgrading will probably fix that.

But obviously the driver should not fail miserably like this with
version 27, because it claims to support it still.

I'm looking into this now and will provide a fix asap.

[1] 
https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/iwlwifi-8000C-31.ucode


--
Cheers,
Luca.

[PATCH] ipv4: Namespaceify tcp_max_orphans knob

2017-09-06 Thread Haishuang Yan
Different namespace applications might require a different maximal number
of orphaned TCP sockets, independent of the host.

Signed-off-by: Haishuang Yan 
---
 include/net/netns/ipv4.h   |  1 +
 include/net/tcp.h  |  5 +++--
 net/ipv4/sysctl_net_ipv4.c | 14 +++---
 net/ipv4/tcp.c |  3 ---
 net/ipv4/tcp_input.c   |  1 -
 net/ipv4/tcp_ipv4.c|  1 +
 6 files changed, 12 insertions(+), 13 deletions(-)

diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 20d061c..305e031 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -127,6 +127,7 @@ struct netns_ipv4 {
int sysctl_tcp_timestamps;
struct inet_timewait_death_row tcp_death_row;
int sysctl_max_syn_backlog;
+   int sysctl_tcp_max_orphans;
 
 #ifdef CONFIG_NET_L3_MASTER_DEV
int sysctl_udp_l3mdev_accept;
diff --git a/include/net/tcp.h b/include/net/tcp.h
index b510f28..ac2d998 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -320,10 +320,11 @@ static inline bool tcp_too_many_orphans(struct sock *sk, int shift)
 {
struct percpu_counter *ocp = sk->sk_prot->orphan_count;
int orphans = percpu_counter_read_positive(ocp);
+   int tcp_max_orphans = sock_net(sk)->ipv4.sysctl_tcp_max_orphans;
 
-   if (orphans << shift > sysctl_tcp_max_orphans) {
+   if (orphans << shift > tcp_max_orphans) {
orphans = percpu_counter_sum_positive(ocp);
-   if (orphans << shift > sysctl_tcp_max_orphans)
+   if (orphans << shift > tcp_max_orphans)
return true;
}
return false;
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 0d3c038..4f26c8d3 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -394,13 +394,6 @@ static int proc_tcp_available_ulp(struct ctl_table *ctl,
.proc_handler   = proc_dointvec
},
{
-   .procname   = "tcp_max_orphans",
-   .data   = &sysctl_tcp_max_orphans,
-   .maxlen = sizeof(int),
-   .mode   = 0644,
-   .proc_handler   = proc_dointvec
-   },
-   {
.procname   = "tcp_fastopen",
.data   = &sysctl_tcp_fastopen,
.maxlen = sizeof(int),
@@ -1085,6 +1078,13 @@ static int proc_tcp_available_ulp(struct ctl_table *ctl,
.mode   = 0644,
.proc_handler   = proc_dointvec
},
+   {
+   .procname   = "tcp_max_orphans",
+   .data   = &init_net.ipv4.sysctl_tcp_max_orphans,
+   .maxlen = sizeof(int),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec
+   },
 #ifdef CONFIG_IP_ROUTE_MULTIPATH
{
.procname   = "fib_multipath_use_neigh",
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 5091402..39187ac 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -3522,9 +3522,6 @@ void __init tcp_init(void)
}
 
 
-   cnt = tcp_hashinfo.ehash_mask + 1;
-   sysctl_tcp_max_orphans = cnt / 2;
-
tcp_init_mem();
/* Set per-socket limits to no more than 1/128 the pressure threshold */
limit = nr_free_buffer_pages() << (PAGE_SHIFT - 7);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index c5d7656..0230509 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -88,7 +88,6 @@
 
 int sysctl_tcp_stdurg __read_mostly;
 int sysctl_tcp_rfc1337 __read_mostly;
-int sysctl_tcp_max_orphans __read_mostly = NR_FILE;
 int sysctl_tcp_frto __read_mostly = 2;
 int sysctl_tcp_min_rtt_wlen __read_mostly = 300;
 int sysctl_tcp_moderate_rcvbuf __read_mostly = 1;
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index a63486a..4b17a91 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2468,6 +2468,7 @@ static int __net_init tcp_sk_init(struct net *net)
net->ipv4.tcp_death_row.hashinfo = &tcp_hashinfo;
 
net->ipv4.sysctl_max_syn_backlog = max(128, cnt / 256);
+   net->ipv4.sysctl_tcp_max_orphans = cnt / 2;
net->ipv4.sysctl_tcp_sack = 1;
net->ipv4.sysctl_tcp_window_scaling = 1;
net->ipv4.sysctl_tcp_timestamps = 1;
-- 
1.8.3.1





Re: [PATCH net] Revert "net: phy: Correctly process PHY_HALTED in phy_stop_machine()"

2017-09-06 Thread Florian Fainelli


On 09/06/2017 05:10 PM, David Daney wrote:
> On 09/06/2017 04:14 PM, Florian Fainelli wrote:
>> On 09/06/2017 03:51 PM, David Daney wrote:
> [...]
>>>
>>> Consider instead the case of a Marvell phy with no interrupts connected
>>> on a v4.9.43 kernel, single CPU:
>>>
>>>
>>>0)   | phy_disconnect() {
>>>0)   |   phy_stop_machine() {
>>>0)   | cancel_delayed_work_sync() {
>>>0) + 23.986 us   | } /*
>>> cancel_delayed_work_sync */
>>>0)   | phy_state_machine() {
>>>0)   |   phy_start_aneg_priv() {
>>
>> Thanks for providing the trace, I think I have an idea of what's going
>> on, see below.
>>
>>>0)   | marvell_config_aneg() {
>>>0) ! 240.538 us  | } /*
>>> marvell_config_aneg */
>>>0) ! 244.971 us  |   } /* phy_start_aneg_priv */
>>>0)   |   queue_delayed_work_on() {
>>>0) + 18.016 us   |   } /*
>>> queue_delayed_work_on */
>>>0) ! 268.184 us  | } /* phy_state_machine */
>>>0) ! 297.394 us  |   } /* phy_stop_machine */
>>>0)   |   phy_detach() {
>>>0)   | phy_suspend() {
>>>0)   |   phy_ethtool_get_wol() {
>>>0)   0.677 us|   } /* phy_ethtool_get_wol */
>>>0)   |   genphy_suspend() {
>>>0) + 71.250 us   |   } /* genphy_suspend */
>>>0) + 74.197 us   | } /* phy_suspend */
>>>0) + 80.302 us   |   } /* phy_detach */
>>>0) ! 380.072 us  | } /* phy_disconnect */
>>> .
>>> .
>>> .
>>>0)   |  process_one_work() {
>>>0)   |find_worker_executing_work() {
>>>0)   0.688 us|} /* find_worker_executing_work */
>>>0)   |set_work_pool_and_clear_pending() {
>>>0)   0.734 us|} /* set_work_pool_and_clear_pending */
>>>0)   |phy_state_machine() {
>>>0)   |  genphy_read_status() {
>>>0) ! 205.721 us  |  } /* genphy_read_status */
>>>0)   |  netif_carrier_off() {
>>>0)   |do_page_fault() {
>>>
>>>
>>> The do_page_fault() at the end indicates the NULL pointer dereference.
>>>
>>> That added call to phy_state_machine() turns the polling back on
>>> unconditionally for a phy that should be disconnected.  How is that
>>> correct?
>>
>> It is not fundamentally correct and I don't think there was any
>> objection to that to begin with. In fact there is a bug/inefficiency
>> here in that if we have entered the PHY state machine with PHY_HALTED we
>> should not re-schedule it period, only applicable to PHY_POLL cases
>> *and* properly calling phy_stop() followed by phy_disconnect().
>>
>> What I now think is happening in your case is the following:
>>
>> phy_stop() was not called, so nothing does set phydev->state to
>> PHY_HALTED in the first place so we have:
>>
>> phy_disconnect()
>> -> phy_stop_machine()
>> -> cancel_delayed_work_sync() OK
>> phydev->state is probably RUNNING so we have:
>> -> phydev->state = PHY_UP
>> phy_state_machine() is called synchronously
>> -> PHY_UP -> needs_aneg = true
>> -> phy_restart_aneg()
>> -> queue_delayed_work_sync()
>> -> phydev->adjust_link = NULL
>> -> phy_detach() -> boom
>>
>> Can you confirm whether the driver you are using does call phy_stop()
>> prior to phy_disconnect()? 
> 
> There is no call to phy_stop().

OK this all makes sense now.

> 
> I can add this to the ethernet drivers, but I wonder if it should be
> called by the core code when doing phy_disconnect(), if it was not
> already stopped.

Fixing the driver should be reasonably quick and easy and can be done
independently from fixing PHYLIB, but I agree that PHYLIB should be
safeguarded against such a case.

Of course, now that I looked again at the code, there is really a ton of
unnecessary workqueue scheduling going on, similarly to phy_stop()
making us go from PHY_HALTED to PHY_HALTED, phy_start_machine() does the
same thing with PHY_READY -> PHY_READY, I suppose back when this was
done the assumption was that there is not going to be a tremendous
amount of time being spent between a call to
phy_connect()/phy_start_machine() and phy_start() and respectively
phy_stop() followed by a phy_disconnect(), oh well.

Now that the revert is in 4.13 we can work on a solution that satisfies
everybody on this thread.

Thanks!
-- 
Florian




Re: [PATCH net 1/2] i40e: Fix comment about locking for __i40e_read_nvm_word()

2017-09-06 Thread Jeff Kirsher
On Wed, 2017-09-06 at 10:11 +0200, Stefano Brivio wrote:
> Caller needs to acquire the lock. Called functions will not.
> 
> Fixes: 09f79fd49d94 ("i40e: avoid NVM acquire deadlock during NVM
> update")
> Signed-off-by: Stefano Brivio 
> ---
>  drivers/net/ethernet/intel/i40e/i40e_nvm.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Yes, this fixes the function header comment, not sure if it requires
the "Fixes:" tag.  If that were the case, wonder why all the other code
comment changes do not have this. :-)  I do agree it reads better with
this change, so I do not have an issue queueing this up for Dave's net
tree.

signature.asc
Description: This is a digitally signed message part


Re: [PATCH net 2/2] i40e: Avoid some useless variables and initializers in nvm functions

2017-09-06 Thread Jeff Kirsher
On Wed, 2017-09-06 at 10:11 +0200, Stefano Brivio wrote:
> Fixes: 09f79fd49d94 ("i40e: avoid NVM acquire deadlock during NVM
> update")
> Signed-off-by: Stefano Brivio 
> ---
>  drivers/net/ethernet/intel/i40e/i40e_nvm.c | 20 +++-
>  1 file changed, 7 insertions(+), 13 deletions(-)

This is NOT a fix, it is a coding style preference whether or not you
like multiple returns or a single return in functions.  I do not mind
picking this up for net-next (4.15 kernel), but this does not qualify
for a fix for Dave's net tree.



Re: [PATCH v2 rfc 0/8] IGMP snooping for local traffic

2017-09-06 Thread Andrew Lunn
> At the very least we should probably move the skb->offload_fwd_mark
> setting down into the individual taggers since they should be in a
> better position to set it or not based on the switch device they are
> driving, this should address, on a per-switch basis whether 2) or 3)
> applies to a given switch.

Yes, that could be done.

> That being said, I have a feeling that the Marvell switches behave a
> tiny bit differently than others in that they do not flood broadcast by
> default in a given L2 domain.

It seems like the switch can be configured to flood broadcast. Just at
the moment, it is not.

> On b53/bcm_sf2 there is the ability to disable the reception of
> broadcast frames on the management/CPU port, and while there is the
> ability to configure which ports should be flooded in case of
> unicast/multicast lookup failures, I don't see anything for Broadcast,
> so I am assuming this will get forwarded by default. Will test with your
> patch set later on time permitting.

Please let me know the results of the tests.

If Marvell is being different, it also means that all other switches
today are duplicating broadcasts. We should fix that. In fact, i think
we need to fix this first, before these patches for IGMP snooping on
br0 goes in.

Andrew


Re: [PATCH v2 rfc 0/8] IGMP snooping for local traffic

2017-09-06 Thread Andrew Lunn
On Wed, Sep 06, 2017 at 11:44:47PM +, woojung@microchip.com wrote:
> > That being said, I have a feeling that the Marvell switches behave a
> > tiny bit differently than others in that they do not flood broadcast by
> > default in a given L2 domain.
> Florian,
> 
> Because some DSA switches from Marvell & Microchip can do IGMP snooping, 
> can we propose switch layer another flag what to do when HW support it?
 
Hi Woojung

I expect all the current DSA devices should be able to do IGMP
snooping, with some modifications.

Two things are required:

1) The .port_mdb_prepare, .port_mdb_add and .port_mdb_del ops, so that
mdb entries can be added. As you said, only Marvell and Microchip
support these, but i expect the other switches can do this, it just
needs implementing.

2) The switch needs to identify and forward IGMP packets to the host,
even when they would normally be blocked.

And for the implementation, i don't think it actually matters.  For
switches which don't implement the port_mdb operations, IGMP packets
will get forwarded to the software bridge. It will attempt to put in
an mdb, but the request will come back with EOPNOTSUPP. The switch
should continue to flood multicast out all ports. No harm done.

   Andrew




[PATCH RFC v1 2/3] net: Abstracting out common routines from veth for use by vethtap

2017-09-06 Thread sainath . grandhi
From: Sainath Grandhi 

Abstracting out common routines for link operations in veth implementation
for use by vethtap interfaces

Signed-off-by: Sainath Grandhi 
---
 drivers/net/veth.c  | 47 +--
 include/linux/if_veth.h |  9 +
 2 files changed, 42 insertions(+), 14 deletions(-)
 create mode 100644 include/linux/if_veth.h

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index f5438d0..a1b370d 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #define DRV_NAME   "veth"
@@ -29,12 +30,6 @@ struct pcpu_vstats {
struct u64_stats_sync   syncp;
 };
 
-struct veth_priv {
-   struct net_device __rcu *peer;
-   atomic64_t  dropped;
-   unsignedrequested_headroom;
-};
-
 /*
  * ethtool interface
  */
@@ -298,13 +293,12 @@ static const struct net_device_ops veth_netdev_ops = {
   NETIF_F_HW_VLAN_CTAG_TX | NETIF_F_HW_VLAN_CTAG_RX | \
   NETIF_F_HW_VLAN_STAG_TX | NETIF_F_HW_VLAN_STAG_RX )
 
-static void veth_setup(struct net_device *dev)
+void veth_common_setup(struct net_device *dev)
 {
ether_setup(dev);
 
dev->priv_flags &= ~IFF_TX_SKB_SHARING;
dev->priv_flags |= IFF_LIVE_ADDR_CHANGE;
-   dev->priv_flags |= IFF_NO_QUEUE;
dev->priv_flags |= IFF_PHONY_HEADROOM;
 
dev->netdev_ops = &veth_netdev_ops;
@@ -325,6 +319,12 @@ static void veth_setup(struct net_device *dev)
dev->mpls_features = NETIF_F_HW_CSUM | NETIF_F_GSO_SOFTWARE;
 }
 
+static void veth_setup(struct net_device *dev)
+{
+   veth_common_setup(dev);
+   dev->priv_flags |= IFF_NO_QUEUE;
+}
+
 /*
  * netlink interface
  */
@@ -465,7 +465,7 @@ static int veth_newlink(struct net *src_net, struct net_device *dev,
return err;
 }
 
-static void veth_dellink(struct net_device *dev, struct list_head *head)
+void veth_dellink(struct net_device *dev, struct list_head *head)
 {
struct veth_priv *priv;
struct net_device *peer;
@@ -503,21 +503,40 @@ static struct rtnl_link_ops veth_link_ops = {
.kind   = DRV_NAME,
.priv_size  = sizeof(struct veth_priv),
.setup  = veth_setup,
-   .validate   = veth_validate,
.newlink= veth_newlink,
.dellink= veth_dellink,
-   .policy = veth_policy,
-   .maxtype= VETH_INFO_MAX,
-   .get_link_net   = veth_get_link_net,
 };
 
+int veth_link_register(struct rtnl_link_ops *ops)
+{
+   /* common fields */
+   ops->validate = veth_validate;
+   ops->policy   = veth_policy;
+   ops->maxtype  = VETH_INFO_MAX;
+   ops->get_link_net = veth_get_link_net;
+
+   return rtnl_link_register(ops);
+}
+
+void veth_link_ops_init(struct rtnl_link_ops *ops)
+{
+   /*common fields*/
+   ops->validate   = veth_validate;
+   ops->policy = veth_policy;
+   ops->maxtype= VETH_INFO_MAX;
+   ops->get_link_net   = veth_get_link_net;
+}
 /*
  * init/fini
  */
 
 static __init int veth_init(void)
 {
-   return rtnl_link_register(&veth_link_ops);
+   int err;
+
+   err = veth_link_register(&veth_link_ops);
+
+   return err;
 }
 
 static __exit void veth_exit(void)
diff --git a/include/linux/if_veth.h b/include/linux/if_veth.h
new file mode 100644
index 000..b007891
--- /dev/null
+++ b/include/linux/if_veth.h
@@ -0,0 +1,9 @@
+struct veth_priv {
+   struct net_device __rcu *peer;
+   atomic64_t  dropped;
+   unsigned intrequested_headroom;
+};
+
+void veth_common_setup(struct net_device *dev);
+void veth_dellink(struct net_device *dev, struct list_head *head);
+void veth_link_ops_init(struct rtnl_link_ops *ops);
-- 
2.7.4



[PATCH RFC v1 3/3] vethtap: veth based tap driver

2017-09-06 Thread sainath . grandhi
From: Sainath Grandhi 

This patch adds a tap character device driver that is based on the
veth network interface, called vethtap. This patchset allows a vethtap device
to be created ONLY as a peer interface to a veth network interface. It can
be created in the following way,
ip link add veth1 type veth peer name veth2 type vethtap
With this, packets on veth2 can be accessed using the tap user-space interface.

Signed-off-by: Sainath Grandhi 
---
 drivers/net/Kconfig |   1 +
 drivers/net/Makefile|   2 +
 drivers/net/{veth.c => veth_main.c} |  33 +-
 drivers/net/vethtap.c   | 216 
 include/linux/if_veth.h |   4 +
 5 files changed, 255 insertions(+), 1 deletion(-)
 rename drivers/net/{veth.c => veth_main.c} (94%)
 create mode 100644 drivers/net/vethtap.c

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index aba0d65..265853e 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -323,6 +323,7 @@ config TUN_VNET_CROSS_LE
 
 config VETH
tristate "Virtual ethernet pair device"
+   select TAP
---help---
  This device is a local ethernet tunnel. Devices are created in pairs.
  When one end receives the packet it appears on its pair and vice
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index 8dff900..7c63e69 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -32,6 +32,8 @@ obj-$(CONFIG_NLMON) += nlmon.o
 obj-$(CONFIG_NET_VRF) += vrf.o
 obj-$(CONFIG_VSOCKMON) += vsockmon.o
 
+veth-objs := veth_main.o vethtap.o
+
 #
 # Networking Drivers
 #
diff --git a/drivers/net/veth.c b/drivers/net/veth_main.c
similarity index 94%
rename from drivers/net/veth.c
rename to drivers/net/veth_main.c
index a1b370d..fc91dd7 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth_main.c
@@ -359,6 +359,9 @@ static int veth_newlink(struct net *src_net, struct net_device *dev,
unsigned char name_assign_type;
struct ifinfomsg *ifmp;
struct net *net;
+   struct nlattr *linkinfo[IFLA_INFO_MAX + 1];
+   char peer_type[8];
+   struct rtnl_link_ops *link_ops;
 
/*
 * create and register peer first
@@ -393,17 +396,38 @@ static int veth_newlink(struct net *src_net, struct 
net_device *dev,
name_assign_type = NET_NAME_ENUM;
}
 
+   link_ops = &veth_link_ops;
+   if (tbp[IFLA_LINKINFO]) {
+   err = rtnl_nla_parse_ifla_info(linkinfo,
+  nla_data(tbp[IFLA_LINKINFO]),
+  nla_len(tbp[IFLA_LINKINFO]),
+  NULL);
+
+   if (err < 0)
+   return err;
+
+   if (linkinfo[IFLA_INFO_KIND]) {
+   nla_strlcpy(peer_type, linkinfo[IFLA_INFO_KIND],
+   sizeof(peer_type));
+   if (!strncmp(peer_type, "vethtap", sizeof(peer_type)))
+   link_ops = &vethtap_link_ops;
+   }
+   }
+
net = rtnl_link_get_net(src_net, tbp);
if (IS_ERR(net))
return PTR_ERR(net);
 
peer = rtnl_create_link(net, ifname, name_assign_type,
-   &veth_link_ops, tbp);
+   link_ops, tbp);
if (IS_ERR(peer)) {
put_net(net);
return PTR_ERR(peer);
}
 
+   if (!strncmp(peer_type, "vethtap", sizeof(peer_type)))
+   link_ops->newlink(net, peer, tbp, NULL, NULL);
+
if (!ifmp || !tbp[IFLA_ADDRESS])
eth_hw_addr_random(peer);
 
@@ -536,12 +560,19 @@ static __init int veth_init(void)
 
err = veth_link_register(&veth_link_ops);
 
+   if (err)
+   goto out1;
+
+   err = vethtap_init();
+
+out1:
return err;
 }
 
 static __exit void veth_exit(void)
 {
rtnl_link_unregister(&veth_link_ops);
+   vethtap_exit();
 }
 
 module_init(veth_init);
diff --git a/drivers/net/vethtap.c b/drivers/net/vethtap.c
new file mode 100644
index 000..922b3ea
--- /dev/null
+++ b/drivers/net/vethtap.c
@@ -0,0 +1,216 @@
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+struct vethtap_dev {
+   struct veth_priv  veth;
+   struct tap_devtap;
+};
+
+/* Variables for dealing with vethtaps device numbers.
+ */
+static dev_t vethtap_major;
+
+static const void *vethtap_net_namespace(struct device *d)
+{
+   struct net_device *dev = to_net_dev(d->parent);
+
+   return dev_net(dev);
+}
+
+static struct class vethtap_class = {
+   .name = "vethtap",
+   .owner = THIS_MODULE,
+   .ns_type = 

[PATCH RFC v1 1/3] net: Adding API to parse IFLA_LINKINFO attribute

2017-09-06 Thread sainath . grandhi
From: Sainath Grandhi 

Adding rtnl_nla_parse_ifla_info as an exported symbol in rtnetlink.c helps
other modules to parse IFLA_LINKINFO attribute

Signed-off-by: Sainath Grandhi 
---
 include/net/rtnetlink.h | 3 +++
 net/core/rtnetlink.c| 8 
 2 files changed, 11 insertions(+)

diff --git a/include/net/rtnetlink.h b/include/net/rtnetlink.h
index 21837ca..cb15ddb 100644
--- a/include/net/rtnetlink.h
+++ b/include/net/rtnetlink.h
@@ -170,6 +170,9 @@ int rtnl_configure_link(struct net_device *dev, const struct ifinfomsg *ifm);
 int rtnl_nla_parse_ifla(struct nlattr **tb, const struct nlattr *head, int len,
struct netlink_ext_ack *exterr);
 
+int rtnl_nla_parse_ifla_info(struct nlattr **tb, const struct nlattr *head,
+int len, struct netlink_ext_ack *exterr);
+
 #define MODULE_ALIAS_RTNL_LINK(kind) MODULE_ALIAS("rtnl-link-" kind)
 
 #endif
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index a78fd61..0784b7d 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -1688,6 +1688,14 @@ int rtnl_nla_parse_ifla(struct nlattr **tb, const struct nlattr *head, int len,
 }
 EXPORT_SYMBOL(rtnl_nla_parse_ifla);
 
+int rtnl_nla_parse_ifla_info(struct nlattr **tb, const struct nlattr *head,
+int len, struct netlink_ext_ack *exterr)
+{
+   return nla_parse(tb, IFLA_INFO_MAX, head, len, ifla_info_policy,
+exterr);
+}
+EXPORT_SYMBOL(rtnl_nla_parse_ifla_info);
+
 struct net *rtnl_link_get_net(struct net *src_net, struct nlattr *tb[])
 {
struct net *net;
-- 
2.7.4



[PATCH RFC v1 0/3] Support for tap user-space access with veth interfaces

2017-09-06 Thread sainath . grandhi
From: Sainath Grandhi 

This patchset adds a tap device driver for veth virtual network interface.
With this implementation, tap character interface can be added only to the
peer veth interface. Adding a tap interface to veth is for use cases that forward 
packets between host and VMs. This eliminates the need for an additional 
software bridge. This can be extended to create both the peer interfaces as
tap interfaces. These patches are a step in that direction.

Sainath Grandhi (3):
  net: Adding API to parse IFLA_LINKINFO attribute
  net: Abstracting out common routines from veth for use by vethtap
  vethtap: veth based tap driver

 drivers/net/Kconfig |   1 +
 drivers/net/Makefile|   2 +
 drivers/net/{veth.c => veth_main.c} |  80 ++---
 drivers/net/vethtap.c   | 216 
 include/linux/if_veth.h |  13 +++
 include/net/rtnetlink.h |   3 +
 net/core/rtnetlink.c|   8 ++
 7 files changed, 308 insertions(+), 15 deletions(-)
 rename drivers/net/{veth.c => veth_main.c} (89%)
 create mode 100644 drivers/net/vethtap.c
 create mode 100644 include/linux/if_veth.h

-- 
2.7.4



Re: [PATCH net] Revert "net: phy: Correctly process PHY_HALTED in phy_stop_machine()"

2017-09-06 Thread David Daney

On 09/06/2017 04:14 PM, Florian Fainelli wrote:

On 09/06/2017 03:51 PM, David Daney wrote:

[...]


Consider instead the case of a Marvell phy with no interrupts connected
on a v4.9.43 kernel, single CPU:


   0)   | phy_disconnect() {
   0)   |   phy_stop_machine() {
   0)   | cancel_delayed_work_sync() {
   0) + 23.986 us   | } /* cancel_delayed_work_sync */
   0)   | phy_state_machine() {
   0)   |   phy_start_aneg_priv() {


Thanks for providing the trace, I think I have an idea of what's going
on, see below.


   0)   | marvell_config_aneg() {
   0) ! 240.538 us  | } /* marvell_config_aneg */
   0) ! 244.971 us  |   } /* phy_start_aneg_priv */
   0)   |   queue_delayed_work_on() {
   0) + 18.016 us   |   } /* queue_delayed_work_on */
   0) ! 268.184 us  | } /* phy_state_machine */
   0) ! 297.394 us  |   } /* phy_stop_machine */
   0)   |   phy_detach() {
   0)   | phy_suspend() {
   0)   |   phy_ethtool_get_wol() {
   0)   0.677 us|   } /* phy_ethtool_get_wol */
   0)   |   genphy_suspend() {
   0) + 71.250 us   |   } /* genphy_suspend */
   0) + 74.197 us   | } /* phy_suspend */
   0) + 80.302 us   |   } /* phy_detach */
   0) ! 380.072 us  | } /* phy_disconnect */
.
.
.
   0)   |  process_one_work() {
   0)   |find_worker_executing_work() {
   0)   0.688 us|} /* find_worker_executing_work */
   0)   |set_work_pool_and_clear_pending() {
   0)   0.734 us|} /* set_work_pool_and_clear_pending */
   0)   |phy_state_machine() {
   0)   |  genphy_read_status() {
   0) ! 205.721 us  |  } /* genphy_read_status */
   0)   |  netif_carrier_off() {
   0)   |do_page_fault() {


The do_page_fault() at the end indicates the NULL pointer dereference.

That added call to phy_state_machine() turns the polling back on
unconditionally for a phy that should be disconnected.  How is that
correct?


It is not fundamentally correct and I don't think there was any
objection to that to begin with. In fact there is a bug/inefficiency
here in that if we have entered the PHY state machine with PHY_HALTED we
should not re-schedule it period, only applicable to PHY_POLL cases
*and* properly calling phy_stop() followed by phy_disconnect().

What I now think is happening in your case is the following:

phy_stop() was not called, so nothing does set phydev->state to
PHY_HALTED in the first place so we have:

phy_disconnect()
-> phy_stop_machine()
-> cancel_delayed_work_sync() OK
phydev->state is probably RUNNING so we have:
-> phydev->state = PHY_UP
phy_state_machine() is called synchronously
-> PHY_UP -> needs_aneg = true
-> phy_restart_aneg()
-> queue_delayed_work_sync()
-> phydev->adjust_link = NULL
-> phy_detach() -> boom

Can you confirm whether the driver you are using does call phy_stop()
prior to phy_disconnect()? 


There is no call to phy_stop().

I can add this to the ethernet drivers, but I wonder if it should be 
called by the core code when doing phy_disconnect(), if it was not 
already stopped.



If that is the case then this whole theory
falls apart, if not, then this needs fixing in both the driver and PHYLIB.

Thanks





RE: [PATCH v2 rfc 0/8] IGMP snooping for local traffic

2017-09-06 Thread Woojung.Huh
> That being said, I have a feeling that the Marvell switches behave a
> tiny bit differently than others in that they do not flood broadcast by
> default in a given L2 domain.
Florian,

Because some DSA switches from Marvell & Microchip can do IGMP snooping, 
can we propose switch layer another flag what to do when HW support it?

- Woojung


Re: [patch net] net: sched: fix memleak for chain zero

2017-09-06 Thread Cong Wang
On Wed, Sep 6, 2017 at 1:33 PM, Jiri Pirko  wrote:
> Wed, Sep 06, 2017 at 07:40:02PM CEST, xiyou.wangc...@gmail.com wrote:
>>On Wed, Sep 6, 2017 at 4:14 AM, Jiri Pirko  wrote:
>>> From: Jiri Pirko 
>>>
>>> There's a memleak happening for chain 0. The thing is, chain 0 needs to
>>> be always present, not created on demand. Therefore tcf_block_get upon
>>> creation of block calls the tcf_chain_create function directly. The
>>> chain is created with refcnt == 1, which is not correct in this case and
>>> causes the memleak. So move the refcnt increment into tcf_chain_get
>>> function even for the case when chain needs to be created.
>>>
>>
>>Your approach could work but you just make the code even
>>uglier than it is now:
>>
>>1. The current code is already ugly for special-casing chain 0:
>>
>>if (--chain->refcnt == 0 && !chain->filter_chain && chain->index != 0)
>>tcf_chain_destroy(chain);
>>
>>2. With your patch, chain 0 has a different _initial_ refcnt with others.
>
> No. Initial refcnt is the same. 1 for every action that holds the chain.
> So actually, it returns it back where it should be.

Not at all.

tcf_block_get() calls tcf_chain_create(, 0), after your patch
chain 0 has refcnt==0 initially.

Non-0 chain? They are created via tcf_chain_get(), aka,
refcnt==0 initially.


>
>
>>
>>3. Allowing an object (chain 0) exists with refcnt==0
>
> So? That is for every chain that does not have goto_chain action
> pointing at. Please read the code.

So you are pretending to be GC but you are apparently not.

You create all the troubles by setting yourself to believe chain 0
is special and refcnt==0 is okay. Both are wrong.

Actually the !list_empty() check is totally unnecessary too,
it is yet another place you get it wrong, you hide the race
condition in commit 744a4cf63e52 which makes it harder
to expose.

I understand you don't trust me. Look at DaveM's reaction
to your refcnt==0 madness.

Remember, refcnt can be very simple, you just want to
make it harder by abusing it or attempting to invent a GC.

I am going to update my patch (to remove all your madness)
since this is horribly wrong to me. Sorry.


Re: [GIT] Networking

2017-09-06 Thread Linus Torvalds
On Wed, Sep 6, 2017 at 4:27 PM, Linus Torvalds
 wrote:
>
> The firmware is iwlwifi-8000C-28.ucode from
> iwl7260-firmware-25.30.13.0-75.fc26.noarch, and the kernel reports
>
>   iwlwifi :3a:00.0: loaded firmware version 27.455470.0 op_mode iwlmvm

And when I said "iwlwifi-8000C-28.ucode" I obviously meant
"iwlwifi-8000C-27.ucode".

At least it was _hopefully_ obvious from that "27" in the actual
version number it reports.

Linus


Re: [GIT] Networking

2017-09-06 Thread David Miller
From: Linus Torvalds 
Date: Wed, 6 Sep 2017 16:27:15 -0700

> This pull request completely breaks Intel wireless for me.
> 
> This is my trusty old XPS 13 (9350), using Intel Wireless 8260 (rev 3a).
> 
> That remains a very standard Intel machine with absolutely zero odd
> things going on.
> 
> The firmware is iwlwifi-8000C-28.ucode from
> iwl7260-firmware-25.30.13.0-75.fc26.noarch, and the kernel reports
 ...

Johannes and other Intel folks please look into this.


Re: [GIT] Networking

2017-09-06 Thread Linus Torvalds
This pull request completely breaks Intel wireless for me.

This is my trusty old XPS 13 (9350), using Intel Wireless 8260 (rev 3a).

That remains a very standard Intel machine with absolutely zero odd
things going on.

The firmware is iwlwifi-8000C-28.ucode from
iwl7260-firmware-25.30.13.0-75.fc26.noarch, and the kernel reports

  iwlwifi :3a:00.0: loaded firmware version 27.455470.0 op_mode iwlmvm

the thing starts acting badly with this:

  iwlwifi :3a:00.0: FW Error notification: type 0x cmd_id 0x04
  iwlwifi :3a:00.0: FW Error notification: seq 0x service 0x0004
  iwlwifi :3a:00.0: FW Error notification: timestamp 0x5D84
  iwlwifi :3a:00.0: Microcode SW error detected.  Restarting 0x200.
  iwlwifi :3a:00.0: Start IWL Error Log Dump:
  iwlwifi :3a:00.0: Status: 0x0100, count: 6
  iwlwifi :3a:00.0: Loaded firmware version: 27.455470.0
  iwlwifi :3a:00.0: 0x0038 | BAD_COMMAND
  iwlwifi :3a:00.0: 0x00A002F0 | trm_hw_status0
  ...
  iwlwifi :3a:00.0: 0x | isr status reg
  ieee80211 phy0: Hardware restart was requested
  iwlwifi :3a:00.0: FW error in SYNC CMD MAC_CONTEXT_CMD
  CPU: 2 PID: 993 Comm: NetworkManager Not tainted 4.13.0-06466-g80cee03bf1d6 #4
  Hardware name: Dell Inc. XPS 13 9350/09JHRY, BIOS 1.4.17 05/10/2017
  Call Trace:
   dump_stack+0x4d/0x70
   iwl_trans_pcie_send_hcmd+0x4e7/0x530 [iwlwifi]
   ? wait_woken+0x80/0x80
   iwl_trans_send_cmd+0x5c/0xc0 [iwlwifi]
   iwl_mvm_send_cmd+0x32/0x90 [iwlmvm]
   iwl_mvm_send_cmd_pdu+0x58/0x80 [iwlmvm]
   iwl_mvm_mac_ctxt_send_cmd+0x2a/0x60 [iwlmvm]
   ? iwl_mvm_mac_ctxt_send_cmd+0x2a/0x60 [iwlmvm]
   iwl_mvm_mac_ctxt_cmd_sta+0x140/0x1e0 [iwlmvm]
   iwl_mvm_mac_ctx_send+0x2d/0x60 [iwlmvm]
   iwl_mvm_mac_ctxt_add+0x43/0xc0 [iwlmvm]
   iwl_mvm_mac_add_interface+0x139/0x2b0 [iwlmvm]
   ? iwl_led_brightness_set+0x1f/0x30 [iwlmvm]
   drv_add_interface+0x4a/0x120 [mac80211]
   ieee80211_do_open+0x33d/0x820 [mac80211]
   ieee80211_open+0x52/0x60 [mac80211]
   __dev_open+0xae/0x120
   __dev_change_flags+0x17b/0x1c0
   dev_change_flags+0x29/0x60
   do_setlink+0x2f7/0xe60
   ? __nla_put+0x20/0x30
   ? _raw_read_unlock_bh+0x20/0x30
   ? inet6_fill_ifla6_attrs+0x4be/0x4e0
   ? __kmalloc_node_track_caller+0x35/0x2b0
   ? nla_parse+0x35/0x100
   rtnl_newlink+0x5d2/0x8f0
   ? __netlink_sendskb+0x3b/0x60
   ? security_capset+0x40/0x80
   ? ns_capable_common+0x68/0x80
   ? ns_capable+0x13/0x20
   rtnetlink_rcv_msg+0x1f9/0x280
   ? rtnl_calcit.isra.26+0x110/0x110
   netlink_rcv_skb+0x8e/0x130
   rtnetlink_rcv+0x15/0x20
   netlink_unicast+0x18b/0x220
   netlink_sendmsg+0x2ad/0x3a0
   sock_sendmsg+0x38/0x50
   ___sys_sendmsg+0x269/0x2c0
   ? addrconf_sysctl_forward+0x114/0x280
   ? dev_forward_change+0x140/0x140
   ? sysctl_head_finish.part.22+0x32/0x40
   ? lockref_put_or_lock+0x5e/0x80
   ? dput.part.22+0x13e/0x1c0
   ? mntput+0x24/0x40
   __sys_sendmsg+0x54/0x90
   ? __sys_sendmsg+0x54/0x90
   SyS_sendmsg+0x12/0x20
   entry_SYSCALL_64_fastpath+0x13/0x94
  RIP: 0033:0x7ff1f9933134
  RSP: 002b:7ffe7419b460 EFLAGS: 0293 ORIG_RAX: 002e
  RAX: ffda RBX: 55604b6d82b9 RCX: 7ff1f9933134
  RDX:  RSI: 7ffe7419b4b0 RDI: 0007
  RBP: 7ffe7419b940 R08:  R09: 55604d16b400
  R10: 7ff1f7cf8b38 R11: 0293 R12: 0001
  R13: 0001 R14: 7ffe7419b670 R15: 55604b9515a0
  iwlwifi :3a:00.0: Failed to send MAC context (action:1): -5

and it doesn't get any better from there. The next error seems to be

  Timeout waiting for hardware access (CSR_GP_CNTRL 0x0808)
  [ cut here ]
  WARNING: CPU: 3 PID: 1075 at
drivers/net/wireless/intel/iwlwifi/pcie/trans.c:1874
iwl_trans_pcie_grab_nic_access+0xdf/0xf0 [iwlwifi]

and it will continue with those microcode failure errors and various
other warnings about how nothing is working.

And no, nothing works.  A lot of log output, no actual network access..

  Linus


Re: [PATCH net] Revert "net: phy: Correctly process PHY_HALTED in phy_stop_machine()"

2017-09-06 Thread Florian Fainelli
On 09/06/2017 03:51 PM, David Daney wrote:
> On 09/06/2017 01:49 PM, David Daney wrote:
>> On 09/06/2017 11:59 AM, Florian Fainelli wrote:
>>> On 09/06/2017 11:00 AM, David Daney wrote:
 On 08/31/2017 11:29 AM, Florian Fainelli wrote:
> On 08/31/2017 11:12 AM, Mason wrote:
>> On 31/08/2017 19:53, Florian Fainelli wrote:
>>> On 08/31/2017 10:49 AM, Mason wrote:
 On 31/08/2017 18:57, Florian Fainelli wrote:
> And the race is between phy_detach() setting phydev->attached_dev
> = NULL
> and phy_state_machine() running in PHY_HALTED state and calling
> netif_carrier_off().

 I must be missing something.
 (Since a thread cannot race against itself.)

 phy_disconnect calls phy_stop_machine which
 1) stops the work queue from running in a separate thread
 2) calls phy_state_machine *synchronously*
which runs the PHY_HALTED case with everything well-defined
 end of phy_stop_machine

 phy_disconnect only then calls phy_detach()
 which makes future calls of phy_state_machine perilous.

 This all happens in the same thread, so I'm not yet
 seeing where the race happens?
>>>
>>> The race is as described in David's earlier email, so let's recap:
>>>
>>> Thread 1Thread 2
>>> phy_disconnect()
>>> phy_stop_interrupts()
>>> phy_stop_machine()
>>> phy_state_machine()
>>>-> queue_delayed_work()
>>> phy_detach()
>>>  phy_state_machine()
>>>  -> netif_carrier_off()
>>>
>>> If phy_detach() finishes earlier than the workqueue had a chance
>>> to be
>>> scheduled and process PHY_HALTED again, then we trigger the NULL
>>> pointer
>>> de-reference.
>>>
>>> workqueues are not tasklets, the CPU scheduling them gets no
>>> guarantee
>>> they will run on the same CPU.
>>
>> Something does not add up.
>>
>> The synchronous call to phy_state_machine() does:
>>
>>  case PHY_HALTED:
>>  if (phydev->link) {
>>  phydev->link = 0;
>>  netif_carrier_off(phydev->attached_dev);
>>  phy_adjust_link(phydev);
>>  do_suspend = true;
>>  }
>>
>> then sets phydev->link = 0; therefore subsequent calls to
>> phy_state_machine() will be no-ops.
>
> Actually you are right, once phydev->link is set to 0 these would
> become
> no-ops. Still scratching my head as to what happens for David then...
>
>>
>> Also, queue_delayed_work() is only called in polling mode.
>> David stated that he's using interrupt mode.

 Did you see what I wrote?
>>>
>>> Still not following, see below.
>>>

 phy_disconnect() calls phy_stop_interrupts() which puts it into polling
 mode.  So the polling work gets queued unconditionally.
>>>
>>> What part of phy_stop_interrupts() is responsible for changing
>>> phydev->irq to PHY_POLL? free_irq() cannot touch phydev->irq otherwise
>>> subsequent request_irq() calls won't work anymore.
>>> phy_disable_interrupts() only calls back into the PHY driver to
>>> acknowledge and clear interrupts.
>>>
>>> If we were using a PHY with PHY_POLL, as Marc said, the first
>>> synchronous call to phy_state_machine() would have acted on PHY_HALTED
>>> and even if we incorrectly keep re-scheduling the state machine from
>>> PHY_HALTED to PHY_HALTED the second time around nothing can happen.
>>>
>>> What are we missing here?
>>>
>>
>> OK, I am now as confused as you guys are.  I will go back and get an
>> ftrace log out of the failure.
>>
> OK, let's forget about the PHY_HALTED discussion.
> 
> 
> Consider instead the case of a Marvell phy with no interrupts connected
> on a v4.9.43 kernel, single CPU:
> 
> 
>   0)   | phy_disconnect() {
>   0)   |   phy_stop_machine() {
>   0)   | cancel_delayed_work_sync() {
>   0) + 23.986 us   | } /* cancel_delayed_work_sync */
>   0)   | phy_state_machine() {
>   0)   |   phy_start_aneg_priv() {

Thanks for providing the trace, I think I have an idea of what's going
on, see below.

>   0)   | marvell_config_aneg() {
>   0) ! 240.538 us  | } /* marvell_config_aneg */
>   0) ! 244.971 us  |   } /* phy_start_aneg_priv */
>   0)   |   queue_delayed_work_on() {
>   0) + 18.016 us   |   } /* queue_delayed_work_on */
>   0) ! 268.184 us  | } /* phy_state_machine */
>   0) ! 297.394 us  |   } /* phy_stop_machine */
>   0)   |   phy_detach() {
>   0)   | 

[PATCH] net: bonding: Fix transmit load balancing in balance-alb mode if specified by sysfs

2017-09-06 Thread Kosuke Tatsukawa
Commit cbf5ecb30560 ("net: bonding: Fix transmit load balancing in
balance-alb mode") tried to fix transmit dynamic load balancing in
balance-alb mode, which wasn't working after commit 8b426dc54cf4
("bonding: remove hardcoded value").

It turned out that my previous patch only fixed the case when
balance-alb was specified as bonding module parameter, and not when
balance-alb mode was set using /sys/class/net/*/bonding/mode (the most
common usage).  In the latter case, tlb_dynamic_lb was set up according
to the default mode of the bonding interface, which happens to be
balance-rr.

This additional patch addresses this issue by setting up tlb_dynamic_lb
to 1 if "mode" is set to balance-alb through the sysfs interface.

I didn't add code to change tlb_dynamic_lb back to the default value for
other modes, because "mode" is usually set up only once during
initialization, and it's not worthwhile to change the static variable
bonding_defaults in bond_main.c to a global variable just for this
purpose.

Commit 8b426dc54cf4 also changes the value of tlb_dynamic_lb for
balance-tlb mode if it is set up using the sysfs interface.  I didn't
change that behavior, because the value of tlb_dynamic_lb can be changed
using the sysfs interface for balance-tlb, and I didn't like changing
the default value back and forth for balance-tlb.

As for balance-alb, /sys/class/net/*/bonding/tlb_dynamic_lb cannot be
written to.  However, I think balance-alb with tlb_dynamic_lb set to 0
is not an intended usage, so there is little use making it writable at
this moment.

Fixes: 8b426dc54cf4 ("bonding: remove hardcoded value")
Reported-by: Reinis Rozitis 
Signed-off-by: Kosuke Tatsukawa 
Cc: sta...@vger.kernel.org  # v4.12+
---
 drivers/net/bonding/bond_options.c |3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/drivers/net/bonding/bond_options.c 
b/drivers/net/bonding/bond_options.c
index a12d603..5931aa2 100644
--- a/drivers/net/bonding/bond_options.c
+++ b/drivers/net/bonding/bond_options.c
@@ -754,6 +754,9 @@ static int bond_option_mode_set(struct bonding *bond,
   bond->params.miimon);
}
 
+   if (newval->value == BOND_MODE_ALB)
+   bond->params.tlb_dynamic_lb = 1;
+
/* don't cache arp_validate between modes */
bond->params.arp_validate = BOND_ARP_VALIDATE_NONE;
bond->params.mode = newval->value;



Re: [PATCH] net: bonding: Fix transmit load balancing in balance-alb mode

2017-09-06 Thread Kosuke Tatsukawa
Reinis Rozitis reported that tlb_dynamic_lb was still 0 in balance-alb
mode even when using linux-4.12.10 kernels, which includes this patch.

It turned out that my previous patch only fixed the case when
balance-alb mode was specified as bonding module parameter, and not when
balance-alb mode was set using /sys/class/net/*/bonding/mode (the most
common usage).  In the latter case, tlb_dynamic_lb was set up according
to the default mode of the bonding interface, which happens to be
balance-rr.

I'll send another patch to address this case.

In the meantime, a workaround for this issue is to specify balance-tlb
mode as a module parameter for bonding.  This will change the value of
tlb_dynamic_lb to 1, and make balance-alb work like pre-4.12 kernels.

Best regards.


> balance-alb mode used to have transmit dynamic load balancing feature
> enabled by default.  However, transmit dynamic load balancing no longer
> works in balance-alb after commit 8b426dc54cf4 ("bonding: remove
> hardcoded value").
> 
> Both balance-tlb and balance-alb use the function bond_do_alb_xmit() to
> send packets.  This function uses the parameter tlb_dynamic_lb.
> tlb_dynamic_lb used to have the default value of 1 for balance-alb, but
> now the value is set to 0 except in balance-tlb.
> 
> Re-enable transmit dynamic load balancing by initializing tlb_dynamic_lb
> for balance-alb similar to balance-tlb.
> 
> Signed-off-by: Kosuke Tatsukawa 
> Cc: sta...@vger.kernel.org
> ---
>  drivers/net/bonding/bond_main.c |2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
> index 14ff622..181839d 100644
> --- a/drivers/net/bonding/bond_main.c
> +++ b/drivers/net/bonding/bond_main.c
> @@ -4596,7 +4596,7 @@ static int bond_check_params(struct bond_params *params)
>   }
>   ad_user_port_key = valptr->value;
>  
> - if (bond_mode == BOND_MODE_TLB) {
> + if ((bond_mode == BOND_MODE_TLB) || (bond_mode == BOND_MODE_ALB)) {
>   bond_opt_initstr(&newval, "default");
>   valptr = bond_opt_parse(bond_opt_get(BOND_OPT_TLB_DYNAMIC_LB),
>   &newval);
---
Kosuke TATSUKAWA  | 1st Platform Software Division
  | NEC Solution Innovators
  | ta...@ab.jp.nec.com



Re: [PATCH net] Revert "net: phy: Correctly process PHY_HALTED in phy_stop_machine()"

2017-09-06 Thread David Daney

On 09/06/2017 01:49 PM, David Daney wrote:

On 09/06/2017 11:59 AM, Florian Fainelli wrote:

On 09/06/2017 11:00 AM, David Daney wrote:

On 08/31/2017 11:29 AM, Florian Fainelli wrote:

On 08/31/2017 11:12 AM, Mason wrote:

On 31/08/2017 19:53, Florian Fainelli wrote:

On 08/31/2017 10:49 AM, Mason wrote:

On 31/08/2017 18:57, Florian Fainelli wrote:

And the race is between phy_detach() setting phydev->attached_dev
= NULL
and phy_state_machine() running in PHY_HALTED state and calling
netif_carrier_off().


I must be missing something.
(Since a thread cannot race against itself.)

phy_disconnect calls phy_stop_machine which
1) stops the work queue from running in a separate thread
2) calls phy_state_machine *synchronously*
   which runs the PHY_HALTED case with everything well-defined
end of phy_stop_machine

phy_disconnect only then calls phy_detach()
which makes future calls of phy_state_machine perilous.

This all happens in the same thread, so I'm not yet
seeing where the race happens?


The race is as described in David's earlier email, so let's recap:

Thread 1                        Thread 2
phy_disconnect()
phy_stop_interrupts()
phy_stop_machine()
 phy_state_machine()
  -> queue_delayed_work()
phy_detach()
                                 phy_state_machine()
                                  -> netif_carrier_off()

If phy_detach() finishes earlier than the workqueue had a chance 
to be

scheduled and process PHY_HALTED again, then we trigger the NULL
pointer
de-reference.

workqueues are not tasklets, the CPU scheduling them gets no 
guarantee

they will run on the same CPU.


Something does not add up.

The synchronous call to phy_state_machine() does:

 case PHY_HALTED:
 if (phydev->link) {
 phydev->link = 0;
 netif_carrier_off(phydev->attached_dev);
 phy_adjust_link(phydev);
 do_suspend = true;
 }

then sets phydev->link = 0; therefore subsequent calls to
phy_state_machine() will be no-ops.


Actually you are right, once phydev->link is set to 0 these would 
become

no-ops. Still scratching my head as to what happens for David then...



Also, queue_delayed_work() is only called in polling mode.
David stated that he's using interrupt mode.


Did you see what I wrote?


Still not following, see below.



phy_disconnect() calls phy_stop_interrupts() which puts it into polling
mode.  So the polling work gets queued unconditionally.


What part of phy_stop_interrupts() is responsible for changing
phydev->irq to PHY_POLL? free_irq() cannot touch phydev->irq otherwise
subsequent request_irq() calls won't work anymore.
phy_disable_interrupts() only calls back into the PHY driver to
acknowledge and clear interrupts.

If we were using a PHY with PHY_POLL, as Marc said, the first
synchronous call to phy_state_machine() would have acted on PHY_HALTED
and even if we incorrectly keep re-scheduling the state machine from
PHY_HALTED to PHY_HALTED the second time around nothing can happen.

What are we missing here?



OK, I am now as confused as you guys are.  I will go back and get an 
ftrace log out of the failure.



OK, let's forget about the PHY_HALTED discussion.


Consider instead the case of a Marvell phy with no interrupts connected 
on a v4.9.43 kernel, single CPU:



  0)   | phy_disconnect() {
  0)   |   phy_stop_machine() {
  0)   | cancel_delayed_work_sync() {
  0) + 23.986 us   | } /* cancel_delayed_work_sync */
  0)   | phy_state_machine() {
  0)   |   phy_start_aneg_priv() {
  0)   | marvell_config_aneg() {
  0) ! 240.538 us  | } /* marvell_config_aneg */
  0) ! 244.971 us  |   } /* phy_start_aneg_priv */
  0)   |   queue_delayed_work_on() {
  0) + 18.016 us   |   } /* queue_delayed_work_on */
  0) ! 268.184 us  | } /* phy_state_machine */
  0) ! 297.394 us  |   } /* phy_stop_machine */
  0)   |   phy_detach() {
  0)   | phy_suspend() {
  0)   |   phy_ethtool_get_wol() {
  0)   0.677 us|   } /* phy_ethtool_get_wol */
  0)   |   genphy_suspend() {
  0) + 71.250 us   |   } /* genphy_suspend */
  0) + 74.197 us   | } /* phy_suspend */
  0) + 80.302 us   |   } /* phy_detach */
  0) ! 380.072 us  | } /* phy_disconnect */
.
.
.
  0)   |  process_one_work() {
  0)   |find_worker_executing_work() {
  0)   0.688 us|} /* find_worker_executing_work */
  0)   |set_work_pool_and_clear_pending() {
  0)   0.734 us|} /* set_work_pool_and_clear_pending */
  0)   

Re: Fwd: DA850-evm MAC Address is random

2017-09-06 Thread Adam Ford
On Mon, Sep 4, 2017 at 11:42 PM, Sekhar Nori  wrote:
> Hi Adam,
>
> On Wednesday 30 August 2017 11:08 AM, Sekhar Nori wrote:
>>> I wonder if U-Boot isn't pushing something to Linux because it doesn't
>>> appear to be running some of the da850 specific code even when I run
>>> linux-next.  Can you tell me what verision of U-Boot you're using?
>>> Other than using davinci_all_defconfig, did you change the
>>> configuration at all?
>
>> I am using U-Boot 2017.01. Yes, the kernel was built using
>> davinci_all_defconfig and no other config change. Before booting kernel,
>> can you confirm that ethaddr is set in U-Boot environment? This is what
>> fdt_fixup_ethernet() reads to fixup the FDT before boot.
>>
>> Here is my complete boot log with environment variable dump.
>>
>> http://pastebin.ubuntu.com/25430265/
>
> Were you able to get rid of the random mac address problem?

Not yet.  I haven't been able to rebuild Arago using TI's instructions
on the Wiki.  I am not sure if it's a dependency issue or something
else.  When I run Linux 4.13 using Buildroot as the rootfs, it does
not appear to run da850_evm_m25p80_notify_add().  I am going to
investigate whether or not da850_evm_init() is getting called.  I was
wondering if you had some insight as to what calls that function?  It
looks like it's defined as part of MACHINE_START(DAVINCI_DA850_EVM,
"DaVinci DA850/OMAP-L138/AM18x EVM"), but I don't know how it gets
called.

thanks

adam
>
> Thanks,
> Sekhar


[PATCH net] hv_netvsc: fix deadlock on hotplug

2017-09-06 Thread Stephen Hemminger
When a virtual device is added dynamically (via host console), then
the vmbus sends an offer message for the primary channel. The processing
of this message for networking causes the network device to then
initialize the sub channels.

The problem is that setting up the sub channels needs to wait until
the subsequent subchannel offers have been processed. These offers
come in on the same ring buffer and work queue as where the primary
offer is being processed; leading to a deadlock.

This did not happen in older kernels, because the sub channel waiting
logic was broken (it wasn't really waiting).

The solution is to do the sub channel setup in its own work queue
context that is scheduled by the primary channel setup; and then
happens later.

Fixes: 732e49850c5e ("netvsc: fix race on sub channel creation")
Reported-by: Dexuan Cui 
Signed-off-by: Stephen Hemminger 
---
This is version for 4.13 (linux-net)

 drivers/net/hyperv/hyperv_net.h   |   3 +
 drivers/net/hyperv/netvsc.c   |   3 +
 drivers/net/hyperv/netvsc_drv.c   |  18 +-
 drivers/net/hyperv/rndis_filter.c | 118 +++---
 4 files changed, 94 insertions(+), 48 deletions(-)

diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
index 12cc64bfcff8..b82119be14e9 100644
--- a/drivers/net/hyperv/hyperv_net.h
+++ b/drivers/net/hyperv/hyperv_net.h
@@ -200,6 +200,8 @@ int netvsc_recv_callback(struct net_device *net,
 const struct ndis_pkt_8021q_info *vlan);
 void netvsc_channel_cb(void *context);
 int netvsc_poll(struct napi_struct *napi, int budget);
+
+void rndis_set_subchannel(struct work_struct *w);
 int rndis_filter_open(struct netvsc_device *nvdev);
 int rndis_filter_close(struct netvsc_device *nvdev);
 int rndis_filter_device_add(struct hv_device *dev,
@@ -766,6 +768,7 @@ struct netvsc_device {
u32 num_chn;
 
atomic_t open_chn;
+   struct work_struct subchan_work;
wait_queue_head_t subchan_open;
 
struct rndis_device *extension;
diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index d18c3326a1f7..76ef14cf55a1 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -79,6 +79,7 @@ static struct netvsc_device *alloc_net_device(void)
net_device->pkt_align = RNDIS_PKT_ALIGN_DEFAULT;
init_completion(&net_device->channel_init_wait);
init_waitqueue_head(&net_device->subchan_open);
+   INIT_WORK(&net_device->subchan_work, rndis_set_subchannel);
 
return net_device;
 }
@@ -553,6 +554,8 @@ void netvsc_device_remove(struct hv_device *device)
struct netvsc_device *net_device = net_device_ctx->nvdev;
int i;
 
+   cancel_work_sync(&net_device->subchan_work);
+
netvsc_disconnect_vsp(device);
 
RCU_INIT_POINTER(net_device_ctx->nvdev, NULL);
diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index d91cbc6c3ca4..9c8dd86a663e 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -740,24 +740,13 @@ static int netvsc_set_queues(struct net_device *net, 
struct hv_device *dev,
 u32 num_chn)
 {
struct netvsc_device_info device_info;
-   int ret;
 
memset(&device_info, 0, sizeof(device_info));
device_info.num_chn = num_chn;
device_info.ring_size = ring_size;
device_info.max_num_vrss_chns = num_chn;
 
-   ret = rndis_filter_device_add(dev, &device_info);
-   if (ret)
-   return ret;
-
-   ret = netif_set_real_num_tx_queues(net, num_chn);
-   if (ret)
-   return ret;
-
-   ret = netif_set_real_num_rx_queues(net, num_chn);
-
-   return ret;
+   return rndis_filter_device_add(dev, &device_info);
 }
 
 static int netvsc_set_channels(struct net_device *net,
@@ -1573,8 +1562,6 @@ static int netvsc_probe(struct hv_device *dev,
 
/* RCU not necessary here, device not registered */
nvdev = net_device_ctx->nvdev;
-   netif_set_real_num_tx_queues(net, nvdev->num_chn);
-   netif_set_real_num_rx_queues(net, nvdev->num_chn);
 
/* MTU range: 68 - 1500 or 65521 */
net->min_mtu = NETVSC_MTU_MIN;
@@ -1616,11 +1603,10 @@ static int netvsc_remove(struct hv_device *dev)
 * removed. Also blocks mtu and channel changes.
 */
rtnl_lock();
+   unregister_netdevice(net);
rndis_filter_device_remove(dev, ndev_ctx->nvdev);
rtnl_unlock();
 
-   unregister_netdev(net);
-
hv_set_drvdata(dev, NULL);
 
free_netdev(net);
diff --git a/drivers/net/hyperv/rndis_filter.c 
b/drivers/net/hyperv/rndis_filter.c
index d6308ffda53e..015a8e2b733c 100644
--- a/drivers/net/hyperv/rndis_filter.c
+++ b/drivers/net/hyperv/rndis_filter.c
@@ -1037,8 +1037,6 @@ static void netvsc_sc_open(struct vmbus_channel *new_sc)
 
/* Set the channel before opening.*/
nvchan->channel = new_sc;
-   

[PATCH v2 net-next 1/2] hv_netvsc: fix deadlock on hotplug

2017-09-06 Thread Stephen Hemminger
When a virtual device is added dynamically (via host console), then
the vmbus sends an offer message for the primary channel. The processing
of this message for networking causes the network device to then
initialize the sub channels.

The problem is that setting up the sub channels needs to wait until
the subsequent subchannel offers have been processed. These offers
come in on the same ring buffer and work queue as where the primary
offer is being processed; leading to a deadlock.

This did not happen in older kernels, because the sub channel waiting
logic was broken (it wasn't really waiting).

The solution is to do the sub channel setup in its own work queue
context that is scheduled by the primary channel setup; and then
happens later.

Fixes: 732e49850c5e ("netvsc: fix race on sub channel creation")
Reported-by: Dexuan Cui 
Signed-off-by: Stephen Hemminger 
---
v2 - fix module removal race with new work queue

 drivers/net/hyperv/hyperv_net.h   |   3 +
 drivers/net/hyperv/netvsc.c   |   3 +
 drivers/net/hyperv/netvsc_drv.c   |  11 +---
 drivers/net/hyperv/rndis_filter.c | 122 ++
 4 files changed, 94 insertions(+), 45 deletions(-)

diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
index ec546da86683..d98cdfb1536b 100644
--- a/drivers/net/hyperv/hyperv_net.h
+++ b/drivers/net/hyperv/hyperv_net.h
@@ -204,6 +204,8 @@ int netvsc_recv_callback(struct net_device *net,
 const struct ndis_pkt_8021q_info *vlan);
 void netvsc_channel_cb(void *context);
 int netvsc_poll(struct napi_struct *napi, int budget);
+
+void rndis_set_subchannel(struct work_struct *w);
 bool rndis_filter_opened(const struct netvsc_device *nvdev);
 int rndis_filter_open(struct netvsc_device *nvdev);
 int rndis_filter_close(struct netvsc_device *nvdev);
@@ -782,6 +784,7 @@ struct netvsc_device {
u32 num_chn;
 
atomic_t open_chn;
+   struct work_struct subchan_work;
wait_queue_head_t subchan_open;
 
struct rndis_device *extension;
diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index 0062b802676f..a5511b7326af 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -81,6 +81,7 @@ static struct netvsc_device *alloc_net_device(void)
 
init_completion(&net_device->channel_init_wait);
init_waitqueue_head(&net_device->subchan_open);
+   INIT_WORK(&net_device->subchan_work, rndis_set_subchannel);
 
return net_device;
 }
@@ -557,6 +558,8 @@ void netvsc_device_remove(struct hv_device *device)
= rtnl_dereference(net_device_ctx->nvdev);
int i;
 
+   cancel_work_sync(&net_device->subchan_work);
+
netvsc_disconnect_vsp(device);
 
RCU_INIT_POINTER(net_device_ctx->nvdev, NULL);
diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index 165ba4b3b423..c538a4f15f3b 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -853,10 +853,7 @@ static int netvsc_set_channels(struct net_device *net,
rndis_filter_device_remove(dev, nvdev);
 
nvdev = rndis_filter_device_add(dev, &device_info);
-   if (!IS_ERR(nvdev)) {
-   netif_set_real_num_tx_queues(net, nvdev->num_chn);
-   netif_set_real_num_rx_queues(net, nvdev->num_chn);
-   } else {
+   if (IS_ERR(nvdev)) {
ret = PTR_ERR(nvdev);
device_info.num_chn = orig;
nvdev = rndis_filter_device_add(dev, &device_info);
@@ -1954,9 +1951,6 @@ static int netvsc_probe(struct hv_device *dev,
NETIF_F_HW_VLAN_CTAG_TX | NETIF_F_HW_VLAN_CTAG_RX;
net->vlan_features = net->features;
 
-   netif_set_real_num_tx_queues(net, nvdev->num_chn);
-   netif_set_real_num_rx_queues(net, nvdev->num_chn);
-
netdev_lockdep_set_classes(net);
 
/* MTU range: 68 - 1500 or 65521 */
@@ -2012,9 +2006,10 @@ static int netvsc_remove(struct hv_device *dev)
if (vf_netdev)
netvsc_unregister_vf(vf_netdev);
 
+   unregister_netdevice(net);
+
rndis_filter_device_remove(dev,
   rtnl_dereference(ndev_ctx->nvdev));
-   unregister_netdevice(net);
rtnl_unlock();
 
hv_set_drvdata(dev, NULL);
diff --git a/drivers/net/hyperv/rndis_filter.c 
b/drivers/net/hyperv/rndis_filter.c
index 69c40b8fccc3..731bc7cc6f43 100644
--- a/drivers/net/hyperv/rndis_filter.c
+++ b/drivers/net/hyperv/rndis_filter.c
@@ -1039,8 +1039,6 @@ static void netvsc_sc_open(struct vmbus_channel *new_sc)
 
/* Set the channel before opening.*/
nvchan->channel = new_sc;
-   netif_napi_add(ndev, &nvchan->napi,
-  netvsc_poll, NAPI_POLL_WEIGHT);
 
ret = vmbus_open(new_sc, nvscdev->ring_size * PAGE_SIZE,
 nvscdev->ring_size * PAGE_SIZE, NULL, 0,
@@ -1048,12 +1046,88 @@ static void 

[PATCH v2 net-next 0/2] hv_netvsc: sub channel initialization fixes

2017-09-06 Thread Stephen Hemminger
One serious deadlock, and one minor optimization.

Stephen Hemminger (2):
  hv_netvsc: fix deadlock on hotplug
  hv_netvsc: avoid unnecessary wakeups on subchannel creation

 drivers/net/hyperv/hyperv_net.h   |   3 +
 drivers/net/hyperv/netvsc.c   |   3 +
 drivers/net/hyperv/netvsc_drv.c   |  11 +---
 drivers/net/hyperv/rndis_filter.c | 126 ++
 4 files changed, 96 insertions(+), 47 deletions(-)

-- 
2.11.0



[PATCH v2 net-next 2/2] hv_netvsc: avoid unnecessary wakeups on subchannel creation

2017-09-06 Thread Stephen Hemminger
Only need to wakeup the initiator after all sub-channels
are opened.

Signed-off-by: Stephen Hemminger 
---
 drivers/net/hyperv/rndis_filter.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/hyperv/rndis_filter.c 
b/drivers/net/hyperv/rndis_filter.c
index 731bc7cc6f43..065b204d8e17 100644
--- a/drivers/net/hyperv/rndis_filter.c
+++ b/drivers/net/hyperv/rndis_filter.c
@@ -1048,8 +1048,8 @@ static void netvsc_sc_open(struct vmbus_channel *new_sc)
else
netdev_notice(ndev, "sub channel open failed: %d\n", ret);
 
-   atomic_inc(&nvscdev->open_chn);
-   wake_up(&nvscdev->subchan_open);
+   if (atomic_inc_return(&nvscdev->open_chn) == nvscdev->num_chn)
+   wake_up(&nvscdev->subchan_open);
 }
 
 /* Open sub-channels after completing the handling of the device probe.
-- 
2.11.0



Re: [PATCH net] Revert "net: phy: Correctly process PHY_HALTED in phy_stop_machine()"

2017-09-06 Thread David Daney

On 09/06/2017 11:59 AM, Florian Fainelli wrote:

On 09/06/2017 11:00 AM, David Daney wrote:

On 08/31/2017 11:29 AM, Florian Fainelli wrote:

On 08/31/2017 11:12 AM, Mason wrote:

On 31/08/2017 19:53, Florian Fainelli wrote:

On 08/31/2017 10:49 AM, Mason wrote:

On 31/08/2017 18:57, Florian Fainelli wrote:

And the race is between phy_detach() setting phydev->attached_dev
= NULL
and phy_state_machine() running in PHY_HALTED state and calling
netif_carrier_off().


I must be missing something.
(Since a thread cannot race against itself.)

phy_disconnect calls phy_stop_machine which
1) stops the work queue from running in a separate thread
2) calls phy_state_machine *synchronously*
   which runs the PHY_HALTED case with everything well-defined
end of phy_stop_machine

phy_disconnect only then calls phy_detach()
which makes future calls of phy_state_machine perilous.

This all happens in the same thread, so I'm not yet
seeing where the race happens?


The race is as described in David's earlier email, so let's recap:

Thread 1                        Thread 2
phy_disconnect()
phy_stop_interrupts()
phy_stop_machine()
 phy_state_machine()
  -> queue_delayed_work()
phy_detach()
                                 phy_state_machine()
                                  -> netif_carrier_off()

If phy_detach() finishes earlier than the workqueue had a chance to be
scheduled and process PHY_HALTED again, then we trigger the NULL
pointer
de-reference.

workqueues are not tasklets, the CPU scheduling them gets no guarantee
they will run on the same CPU.


Something does not add up.

The synchronous call to phy_state_machine() does:

 case PHY_HALTED:
 if (phydev->link) {
 phydev->link = 0;
 netif_carrier_off(phydev->attached_dev);
 phy_adjust_link(phydev);
 do_suspend = true;
 }

then sets phydev->link = 0; therefore subsequent calls to
phy_state_machine() will be no-ops.


Actually you are right, once phydev->link is set to 0 these would become
no-ops. Still scratching my head as to what happens for David then...



Also, queue_delayed_work() is only called in polling mode.
David stated that he's using interrupt mode.


Did you see what I wrote?


Still not following, see below.



phy_disconnect() calls phy_stop_interrupts() which puts it into polling
mode.  So the polling work gets queued unconditionally.


What part of phy_stop_interrupts() is responsible for changing
phydev->irq to PHY_POLL? free_irq() cannot touch phydev->irq otherwise
subsequent request_irq() calls won't work anymore.
phy_disable_interrupts() only calls back into the PHY driver to
acknowledge and clear interrupts.

If we were using a PHY with PHY_POLL, as Marc said, the first
synchronous call to phy_state_machine() would have acted on PHY_HALTED
and even if we incorrectly keep re-scheduling the state machine from
PHY_HALTED to PHY_HALTED the second time around nothing can happen.

What are we missing here?



OK, I am now as confused as you guys are.  I will go back and get an 
ftrace log out of the failure.


David.



Re: [PATCH] netfilter: xt_hashlimit: avoid 64-bit division

2017-09-06 Thread Arnd Bergmann
On Wed, Sep 6, 2017 at 10:22 PM, Vishwanath Pai  wrote:
> On 09/06/2017 03:57 PM, Arnd Bergmann wrote:
>> 64-bit division is expensive on 32-bit architectures, and
>> requires a special function call to avoid a link error like:
>>
>> net/netfilter/xt_hashlimit.o: In function `hashlimit_mt_common':
>> xt_hashlimit.c:(.text+0x1328): undefined reference to `__aeabi_uldivmod'
>>
>> In the case of hashlimit_mt_common, we don't actually need a
>> 64-bit operation, we can simply rewrite the function slightly
>> to make that clear to the compiler.
>>
>> Fixes: bea74641e378 ("netfilter: xt_hashlimit: add rate match mode")
>> Signed-off-by: Arnd Bergmann 
>> ---
>>  net/netfilter/xt_hashlimit.c | 5 -
>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>
>> diff --git a/net/netfilter/xt_hashlimit.c b/net/netfilter/xt_hashlimit.c
>> index 10d48234f5f4..50b53d86eef5 100644
>> --- a/net/netfilter/xt_hashlimit.c
>> +++ b/net/netfilter/xt_hashlimit.c
>> @@ -531,7 +531,10 @@ static u64 user2rate_bytes(u64 user)
>>  {
>>   u64 r;
>>
>> - r = user ? 0xFFFFFFFFULL / user : 0xFFFFFFFFULL;
>> + if (user > 0xFFFFFFFFULL)
>> + return 0;
>> +
>> + r = user ? 0xFFFFFFFFULL / (u32)user : 0xFFFFFFFFULL;
>>   r = (r - 1) << 4;
>>   return r;
>>  }
>>
>
> I have submitted another patch to fix this:
> https://patchwork.ozlabs.org/patch/809881/
>
> We have seen this problem before, I was careful not to introduce this
> again in the new patch but clearly I overlooked this particular line :(
>
> In the other cases we fixed it by replacing division with div64_u64().

div64_u64() seems needlessly expensive here since the dividend
is known to be a 32-bit number. I guess the function is not called
frequently though, so it doesn't matter much.

  Arnd


Re: [patch net] net: sched: fix memleak for chain zero

2017-09-06 Thread Jiri Pirko
Wed, Sep 06, 2017 at 07:40:02PM CEST, xiyou.wangc...@gmail.com wrote:
>On Wed, Sep 6, 2017 at 4:14 AM, Jiri Pirko  wrote:
>> From: Jiri Pirko 
>>
>> There's a memleak happening for chain 0. The thing is, chain 0 needs to
>> be always present, not created on demand. Therefore tcf_block_get upon
>> creation of block calls the tcf_chain_create function directly. The
>> chain is created with refcnt == 1, which is not correct in this case and
>> causes the memleak. So move the refcnt increment into tcf_chain_get
>> function even for the case when chain needs to be created.
>>
>
>Your approach could work but you just make the code even
>uglier than it is now:
>
>1. The current code is already ugly for special-casing chain 0:
>
>if (--chain->refcnt == 0 && !chain->filter_chain && chain->index != 0)
>tcf_chain_destroy(chain);
>
>2. With your patch, chain 0 has a different _initial_ refcnt with others.

No. Initial refcnt is the same: +1 for every action that holds the chain.
So actually, it returns it back where it should be.


>
>3. Allowing an object (chain 0) to exist with refcnt==0

So? That is for every chain that does not have goto_chain action
pointing at. Please read the code.


>
>Compare it with my patch:
>
>1. No special-case for chain 0, the above ugly part is removed
>
>2. Every chain is equal and created with refcnt==1
>
>3. Any chain with refcnt==0 is destroyed


[PATCH 13/14] netfilter/ipvs: Use %pS printk format for direct addresses

2017-09-06 Thread Helge Deller
The debug and error printk functions in ipvs wrongly use the %pF printk
format specifier instead of %pS for printing symbols for the address
returned by __builtin_return_address(0). Fix it for the ia64, ppc64 and
parisc64 architectures.

Signed-off-by: Helge Deller 
Cc: Wensong Zhang 
Cc: netdev@vger.kernel.org
Cc: lvs-de...@vger.kernel.org
Cc: netfilter-de...@vger.kernel.org
---
 net/netfilter/ipvs/ip_vs_conn.c | 2 +-
 net/netfilter/ipvs/ip_vs_ctl.c  | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/netfilter/ipvs/ip_vs_conn.c b/net/netfilter/ipvs/ip_vs_conn.c
index 3d2ac71a..f73561c 100644
--- a/net/netfilter/ipvs/ip_vs_conn.c
+++ b/net/netfilter/ipvs/ip_vs_conn.c
@@ -185,7 +185,7 @@ static inline int ip_vs_conn_hash(struct ip_vs_conn *cp)
hlist_add_head_rcu(&cp->c_list, &ip_vs_conn_tab[hash]);
ret = 1;
} else {
-   pr_err("%s(): request for already hashed, called from %pF\n",
+   pr_err("%s(): request for already hashed, called from %pS\n",
   __func__, __builtin_return_address(0));
ret = 0;
}
diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index 1fa3c23..88fc58a 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -300,7 +300,7 @@ static int ip_vs_svc_hash(struct ip_vs_service *svc)
unsigned int hash;
 
if (svc->flags & IP_VS_SVC_F_HASHED) {
-   pr_err("%s(): request for already hashed, called from %pF\n",
+   pr_err("%s(): request for already hashed, called from %pS\n",
   __func__, __builtin_return_address(0));
return 0;
}
@@ -334,7 +334,7 @@ static int ip_vs_svc_hash(struct ip_vs_service *svc)
 static int ip_vs_svc_unhash(struct ip_vs_service *svc)
 {
if (!(svc->flags & IP_VS_SVC_F_HASHED)) {
-   pr_err("%s(): request for unhash flagged, called from %pF\n",
+   pr_err("%s(): request for unhash flagged, called from %pS\n",
   __func__, __builtin_return_address(0));
return 0;
}
-- 
2.1.0



Re: [PATCH] netfilter: xt_hashlimit: avoid 64-bit division

2017-09-06 Thread Vishwanath Pai
On 09/06/2017 03:57 PM, Arnd Bergmann wrote:
> 64-bit division is expensive on 32-bit architectures, and
> requires a special function call to avoid a link error like:
> 
> net/netfilter/xt_hashlimit.o: In function `hashlimit_mt_common':
> xt_hashlimit.c:(.text+0x1328): undefined reference to `__aeabi_uldivmod'
> 
> In the case of hashlimit_mt_common, we don't actually need a
> 64-bit operation, we can simply rewrite the function slightly
> to make that clear to the compiler.
> 
> Fixes: bea74641e378 ("netfilter: xt_hashlimit: add rate match mode")
> Signed-off-by: Arnd Bergmann 
> ---
>  net/netfilter/xt_hashlimit.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/net/netfilter/xt_hashlimit.c b/net/netfilter/xt_hashlimit.c
> index 10d48234f5f4..50b53d86eef5 100644
> --- a/net/netfilter/xt_hashlimit.c
> +++ b/net/netfilter/xt_hashlimit.c
> @@ -531,7 +531,10 @@ static u64 user2rate_bytes(u64 user)
>  {
>   u64 r;
>  
> - r = user ? 0xFFFFFFFFULL / user : 0xFFFFFFFFULL;
> + if (user > 0xFFFFFFFFULL)
> + return 0;
> +
> + r = user ? 0xFFFFFFFFULL / (u32)user : 0xFFFFFFFFULL;
>   r = (r - 1) << 4;
>   return r;
>  }
> 

I have submitted another patch to fix this:
https://patchwork.ozlabs.org/patch/809881/

We have seen this problem before; I was careful not to introduce it
again in the new patch, but clearly I overlooked this particular line :(

In the other cases we fixed it by replacing division with div64_u64().

-Vishwanath


[PATCH] netfilter: xt_hashlimit: avoid 64-bit division

2017-09-06 Thread Arnd Bergmann
64-bit division is expensive on 32-bit architectures, and
requires a special function call to avoid a link error like:

net/netfilter/xt_hashlimit.o: In function `hashlimit_mt_common':
xt_hashlimit.c:(.text+0x1328): undefined reference to `__aeabi_uldivmod'

In the case of hashlimit_mt_common, we don't actually need a
64-bit operation, we can simply rewrite the function slightly
to make that clear to the compiler.

Fixes: bea74641e378 ("netfilter: xt_hashlimit: add rate match mode")
Signed-off-by: Arnd Bergmann 
---
 net/netfilter/xt_hashlimit.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/net/netfilter/xt_hashlimit.c b/net/netfilter/xt_hashlimit.c
index 10d48234f5f4..50b53d86eef5 100644
--- a/net/netfilter/xt_hashlimit.c
+++ b/net/netfilter/xt_hashlimit.c
@@ -531,7 +531,10 @@ static u64 user2rate_bytes(u64 user)
 {
u64 r;
 
-   r = user ? 0xFFFFFFFFULL / user : 0xFFFFFFFFULL;
+   if (user > 0xFFFFFFFFULL)
+   return 0;
+
+   r = user ? 0xFFFFFFFFULL / (u32)user : 0xFFFFFFFFULL;
r = (r - 1) << 4;
return r;
 }
-- 
2.9.0



Re: [net-next:master 491/511] xt_hashlimit.c:undefined reference to `__aeabi_uldivmod'

2017-09-06 Thread Arnd Bergmann
On Wed, Sep 6, 2017 at 5:27 PM, kbuild test robot
 wrote:
> Hi Arnd,
>
> It's probably a bug fix that unveils the link errors.

Correct.

>net/netfilter/xt_hashlimit.o: In function `hashlimit_mt_common':
>>> xt_hashlimit.c:(.text+0x1f68): undefined reference to `__aeabi_uldivmod'

I actually wrote a patch for that yesterday, but it seems I forgot to submit it.
I've sent it now.

   Arnd


Re: [PATCH v2 rfc 0/8] IGMP snooping for local traffic

2017-09-06 Thread Florian Fainelli
On 09/06/2017 09:42 AM, Andrew Lunn wrote:
>>> On the switch asics we work with, the driver has information if the
>>> packet was
>>> forwarded in hardware. This is per packet reason code telling why the
>>> CPU is seeing the packet.
>>> The driver can use this information to reset skb->offload_fwd_mark to
>>> allow software forward.
> 
>> I am not positive this is universally available across different
>> switch vendors.
> 
> It is not universally available. We cannot rely on it being available
> with switches supported by DSA.
> 
> We have a few choices:
> 
> 1) We assume anything the switch forwards to the CPU has also been
>sent out whatever ports of the switch it needs to. Set
>offload_fwd_mark.
> 
> 2) We assume anything the switch forwards to the CPU has not gone
>anywhere else, and the bridge needs to send it out whatever ports
>it thinks. Don't set offload_fwd_mark.
> 
> 3) We define some rules about what packets the switch should handle,
>and then do some deep packet inspection to decide if
>offload_fwd_mark should be set or not.
> 
> I don't see 3) being possible. We are dealing with a fixed silicon
> data path, not something which is fully programmable.
> 
> So it is down to 1) or 2). I've been assuming 1), but maybe we need to
> discuss that as well.

At the very least we should probably move the skb->offload_fwd_mark
setting down into the individual taggers since they should be in a
better position to set it or not based on the switch device they are
driving; this should address, on a per-switch basis, whether 1) or 2)
applies to a given switch.

That being said, I have a feeling that the Marvell switches behave a
tiny bit differently than others in that they do not flood broadcast by
default in a given L2 domain.

On b53/bcm_sf2 there is the ability to disable the reception of
broadcast frames on the management/CPU port, and while there is the
ability to configure which ports should be flooded in case of
unicast/multicast lookup failures, I don't see anything for Broadcast,
so I am assuming this will get forwarded by default. Will test with your
patch set later on time permitting.
-- 
Florian


Re: [PATCH net] Revert "net: phy: Correctly process PHY_HALTED in phy_stop_machine()"

2017-09-06 Thread Florian Fainelli
On 09/06/2017 08:51 AM, Mason wrote:
> On 31/08/2017 21:18, Florian Fainelli wrote:
> 
>> On 08/31/2017 12:09 PM, Mason wrote:
>>
>>> On 31/08/2017 19:03, Florian Fainelli wrote:
>>>
 On 08/31/2017 05:29 AM, Marc Gonzalez wrote:

> On 31/08/2017 02:49, Florian Fainelli wrote:
>
>> The original motivation for this change originated from Marc Gonzalez
>> indicating that his network driver did not have its adjust_link
>> callback
>> executing with phydev->link = 0 while he was expecting it.
>
> I expect the core to call phy_adjust_link() for link changes.
> This used to work back in 3.4 and was broken somewhere along
> the way.

 If that was working correctly in 3.4 surely we can look at the diff and
 figure out what changed, even maybe find the offending commit, can you
 do that?
>>>
>>> Bisecting would a be a huge pain because my platform was
>>> not upstream until v4.4
>>
>> Then just diff the file and try to pinpoint which commit may have
>> changed that?
> 
> Running 'ip link set eth0 down' on the command-line.
> 
> In v3.4 => adjust_link() callback is called
> In v4.5 => adjust_link() callback is NOT called
> 
> $ git log --oneline --no-merges v3.4..v4.5 drivers/net/phy/phy.c | wc -l
> 59
> 
> I'm not sure what "just diff the file" entails.

git log -p --no-merges v3.4..v4.5 drivers/net/phy/{phy,phy_device.c} and
see what would seem remotely sensible to what you are observing.

> I can't move 3.4 up, nor move 4.5 down.

You can always copy the PHYLIB files at any given commit back into an
older tree, or vice versa because it is largely self contained with
little to no dependencies on other headers/files/facilities etc. This is
not convenient I agree, but it's a poor man's way of determining what
changed within PHYLIB that results in what you are seeing.

AFAICT you could use QEMU with the versatile board that has smsc911x as
an Ethernet adapter which is PHYLIB compliant which may be used to
pinpoint which commit start changing this behavior. It's long, it's painful.

> I'm not even sure the problem comes from drivers/net/phy/phy.c
> to be honest.

If that's the case then I am not sure what else we can do.
-- 
Florian


RE: [PATCH] vsock: only load vmci transport on VMware hypervisor by default

2017-09-06 Thread Dexuan Cui
> From: Jorgen S. Hansen [mailto:jhan...@vmware.com]
> Sent: Wednesday, September 6, 2017 7:11 AM
>> ...
> > I'm currently working on NFS over AF_VSOCK and sock_diag support (for
> > ss(8) and netstat-like tools).
> >
> > Multi-transport support is lower priority for me at the moment.  I'm
> > happy to review patches though.  If there is no progress on this by the
> > end of the year then I will have time to work on it.
> >
> 
> I’ll try to find time to write a more coherent proposal in the coming weeks,
> and we can discuss that.
> 
> Jorgen

Thank you! 

Thanks,
-- Dexuan


[wireless-testsing2:master 4/5] drivers/net//ethernet/marvell/mvpp2.c:7618:2: warning: passing argument 4 of 'mvpp2_port_copy_mac_addr' from incompatible pointer type

2017-09-06 Thread kbuild test robot
tree:   
https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-testing.git 
master
head:   d17be7f7503bf9492198a242779a68af93fd92de
commit: 96c03618031bae5e9068b16f9e437b79f98f6482 [4/5] Merge remote-tracking 
branch 'mac80211-next/master'
config: xtensa-allyesconfig (attached as .config)
compiler: xtensa-linux-gcc (GCC) 4.9.0
reproduce:
wget 
https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
git checkout 96c03618031bae5e9068b16f9e437b79f98f6482
# save the attached .config to linux build tree
make.cross ARCH=xtensa 

All warnings (new ones prefixed by >>):

   drivers/net//ethernet/marvell/mvpp2.c: In function 'mvpp2_port_probe':
>> drivers/net//ethernet/marvell/mvpp2.c:7618:2: warning: passing argument 4 of 
>> 'mvpp2_port_copy_mac_addr' from incompatible pointer type
 mvpp2_port_copy_mac_addr(dev, priv, port_node, &mac_from);
 ^
   drivers/net//ethernet/marvell/mvpp2.c:7468:13: note: expected 'char **' but 
argument is of type 'const char **'
static void mvpp2_port_copy_mac_addr(struct net_device *dev, struct mvpp2 
*priv,
^
   drivers/net//ethernet/marvell/mvpp2.c:7509:7: warning: unused variable 
'hw_mac_addr' [-Wunused-variable]
 char hw_mac_addr[ETH_ALEN] = {0};
  ^
   drivers/net//ethernet/marvell/mvpp2.c:7507:14: warning: unused variable 
'dt_mac_addr' [-Wunused-variable]
 const char *dt_mac_addr;
 ^

vim +/mvpp2_port_copy_mac_addr +7618 drivers/net//ethernet/marvell/mvpp2.c

3ba8c81e1 Antoine Tenart   2017-09-02  7495  
3f518509d Marcin Wojtas2014-07-10  7496  /* Ports initialization */
3f518509d Marcin Wojtas2014-07-10  7497  static int mvpp2_port_probe(struct 
platform_device *pdev,
3f518509d Marcin Wojtas2014-07-10  7498 struct 
device_node *port_node,
59b9a31ed Thomas Petazzoni 2017-03-07  7499 struct 
mvpp2 *priv)
3f518509d Marcin Wojtas2014-07-10  7500  {
3f518509d Marcin Wojtas2014-07-10  7501 struct device_node *phy_node;
542897d98 Antoine Tenart   2017-08-30  7502 struct phy *comphy;
3f518509d Marcin Wojtas2014-07-10  7503 struct mvpp2_port *port;
edc660fa0 Marcin Wojtas2015-08-06  7504 struct mvpp2_port_pcpu 
*port_pcpu;
3f518509d Marcin Wojtas2014-07-10  7505 struct net_device *dev;
3f518509d Marcin Wojtas2014-07-10  7506 struct resource *res;
3f518509d Marcin Wojtas2014-07-10  7507 const char *dt_mac_addr;
96c036180 Bob Copeland 2017-09-04  7508 const char *mac_from = "";
4c2286826 Antoine Tenart   2017-08-25  7509 char hw_mac_addr[ETH_ALEN] = 
{0};
09f839755 Thomas Petazzoni 2017-08-03  7510 unsigned int ntxqs, nrxqs;
213f428f5 Thomas Petazzoni 2017-08-03  7511 bool has_tx_irqs;
3f518509d Marcin Wojtas2014-07-10  7512 u32 id;
3f518509d Marcin Wojtas2014-07-10  7513 int features;
3f518509d Marcin Wojtas2014-07-10  7514 int phy_mode;
edc660fa0 Marcin Wojtas2015-08-06  7515 int err, i, cpu;
3f518509d Marcin Wojtas2014-07-10  7516  
213f428f5 Thomas Petazzoni 2017-08-03  7517 has_tx_irqs = 
mvpp2_port_has_tx_irqs(priv, port_node);
213f428f5 Thomas Petazzoni 2017-08-03  7518  
213f428f5 Thomas Petazzoni 2017-08-03  7519 if (!has_tx_irqs)
213f428f5 Thomas Petazzoni 2017-08-03  7520 queue_mode = 
MVPP2_QDIST_SINGLE_MODE;
213f428f5 Thomas Petazzoni 2017-08-03  7521  
09f839755 Thomas Petazzoni 2017-08-03  7522 ntxqs = MVPP2_MAX_TXQ;
213f428f5 Thomas Petazzoni 2017-08-03  7523 if (priv->hw_version == MVPP22 
&& queue_mode == MVPP2_QDIST_MULTI_MODE)
213f428f5 Thomas Petazzoni 2017-08-03  7524 nrxqs = 
MVPP2_DEFAULT_RXQ * num_possible_cpus();
213f428f5 Thomas Petazzoni 2017-08-03  7525 else
09f839755 Thomas Petazzoni 2017-08-03  7526 nrxqs = 
MVPP2_DEFAULT_RXQ;
09f839755 Thomas Petazzoni 2017-08-03  7527  
09f839755 Thomas Petazzoni 2017-08-03  7528 dev = 
alloc_etherdev_mqs(sizeof(*port), ntxqs, nrxqs);
3f518509d Marcin Wojtas2014-07-10  7529 if (!dev)
3f518509d Marcin Wojtas2014-07-10  7530 return -ENOMEM;
3f518509d Marcin Wojtas2014-07-10  7531  
3f518509d Marcin Wojtas2014-07-10  7532 phy_node = 
of_parse_phandle(port_node, "phy", 0);
3f518509d Marcin Wojtas2014-07-10  7533 phy_mode = 
of_get_phy_mode(port_node);
3f518509d Marcin Wojtas2014-07-10  7534 if (phy_mode < 0) {
3f518509d Marcin Wojtas2014-07-10  7535 dev_err(&pdev->dev, 
"incorrect phy mode\n");
3f518509d Marcin Wojtas2014-07-10  7536 err = phy_mode;
3f518509d Marcin Wojtas2014-07-10  7537 goto err_free_netdev;
3f518509d Marcin Wojtas2014-07-10  7538 }
3f518509d Marcin Wojtas2014-07-10  7539  
542897d98 Antoine Tenart   2017-08-30  7540 comphy = 
devm_of_phy_get(&pdev->dev, port_node, NULL);
542897d98 Antoine Tenart   

Re: [PATCH net] Revert "net: phy: Correctly process PHY_HALTED in phy_stop_machine()"

2017-09-06 Thread Florian Fainelli
On 09/06/2017 07:55 AM, Mason wrote:
> On 31/08/2017 21:18, Florian Fainelli wrote:
> 
>> On 08/31/2017 12:09 PM, Mason wrote:
>>
>>> 1) nb8800_link_reconfigure() calls phy_print_status()
>>> which prints the "Link down" and "Link up" messages
>>> to the console. With the patch reverted, nothing is
>>> printed when the link goes down, and the result is
>>> random when the link comes up. Sometimes, we get
>>> down + up, sometimes just up.
>>
>> Nothing printed when you bring down the network interface as a result of
>> not signaling the link down, there is a small nuance here.
> 
> Let me first focus on the "Link down" message.
> 
> Do you agree that such a message should be printed when the
> link goes down, not when the link comes up?

The question is not so much about printing the message as about a)
having the adjust_link callback be called and b) having this particular
callback correctly determine if a "change" has occurred, but let's focus
on the notification too. Printing the message is a consequence of these
two conditions and that's what matters.

There is unfortunately not a hard and fast answer; it's clearly a
philosophical problem here.

The link is not physically down, the cable is still plugged, so
generating a link down event is not necessarily correct. It would be
convenient for network manager programs, just like it is for network
drivers, to treat it as such because that allows them to act as if the
cable was unplugged, which may be a good way to perform a number of
actions including but not limited to: entering a low power state,
re-initializing parts of the Ethernet MAC that need it (DMA, etc.,
etc.). That does not appear to be an absolute requirement for most, if
not all drivers, since it changed after 3.4 and no one batted an eye
about it.

Upon bringing the interface back up again, same thing, if the cable was
not disconnected should we just generate a link UP event, and if we do
that, are we going to confuse any network manager application?
Generating a link transition DOWN -> UP is certainly helpful for any
network application in that they do not need to keep any state just like
it clearly indicates a change was detected.

> 
> Perhaps the issue is that the 2 following cases need to be
> handled differently:
> A) operator sets link down on the command-line

This is already handled differently since when you administratively
bring down an interface you call ndo_stop() which will be doing a
phy_stop() + phy_disconnect() which result in stopping the PHY state
machine and disconnecting from the PHY.

> B) asynchronous event makes link go down (peer is dead, cable is cut, etc)
> 
> In B) the PHY state machine keeps on running, and eventually
> calls adjust_link()

Correct.

> 
> In A) the driver calls phy_stop() and phy_disconnect() and
> therefore adjust_link() will not be called?

That is the current behavior (after the revert) and we can always change
it if deemed necessary, problem is, this broke for two people (one still
being discussed as of right now), so at this point I am very wary of
making any changes without more testing. I really need to get one of
these PHY interrupts wired to one of my boards or create a software
model of such a configuration before accepting new changes in that area.

Thank you
-- 
Florian



Re: [PATCH net] Revert "net: phy: Correctly process PHY_HALTED in phy_stop_machine()"

2017-09-06 Thread Mason
On 06/09/2017 20:00, David Daney wrote:
> On 08/31/2017 11:29 AM, Florian Fainelli wrote:
>> On 08/31/2017 11:12 AM, Mason wrote:
>>> On 31/08/2017 19:53, Florian Fainelli wrote:
 On 08/31/2017 10:49 AM, Mason wrote:
> On 31/08/2017 18:57, Florian Fainelli wrote:
>> And the race is between phy_detach() setting phydev->attached_dev = NULL
>> and phy_state_machine() running in PHY_HALTED state and calling
>> netif_carrier_off().
>
> I must be missing something.
> (Since a thread cannot race against itself.)
>
> phy_disconnect calls phy_stop_machine which
> 1) stops the work queue from running in a separate thread
> 2) calls phy_state_machine *synchronously*
>   which runs the PHY_HALTED case with everything well-defined
> end of phy_stop_machine
>
> phy_disconnect only then calls phy_detach()
> which makes future calls of phy_state_machine perilous.
>
> This all happens in the same thread, so I'm not yet
> seeing where the race happens?

 The race is as described in David's earlier email, so let's recap:

 Thread 1   Thread 2
 phy_disconnect()
 phy_stop_interrupts()
 phy_stop_machine()
 phy_state_machine()
   -> queue_delayed_work()
 phy_detach()
phy_state_machine()
-> netif_carrier_off()

 If phy_detach() finishes earlier than the workqueue had a chance to be
 scheduled and process PHY_HALTED again, then we trigger the NULL pointer
 de-reference.

 workqueues are not tasklets, the CPU scheduling them gets no guarantee
 they will run on the same CPU.
>>>
>>> Something does not add up.
>>>
>>> The synchronous call to phy_state_machine() does:
>>>
>>> case PHY_HALTED:
>>> if (phydev->link) {
>>> phydev->link = 0;
>>> netif_carrier_off(phydev->attached_dev);
>>> phy_adjust_link(phydev);
>>> do_suspend = true;
>>> }
>>>
>>> then sets phydev->link = 0; therefore subsequent calls to
>>> phy_state_machine() will be no-ops.
>>
>> Actually you are right, once phydev->link is set to 0 these would become
>> no-ops. Still scratching my head as to what happens for David then...
>>
>>>
>>> Also, queue_delayed_work() is only called in polling mode.
>>> David stated that he's using interrupt mode.
> 
> Did you see what I wrote?
> 
> phy_disconnect() calls phy_stop_interrupts() which puts it into polling 
> mode.  So the polling work gets queued unconditionally.

I did address that remark in
https://www.mail-archive.com/netdev@vger.kernel.org/msg186336.html

int phy_stop_interrupts(struct phy_device *phydev)
{
int err = phy_disable_interrupts(phydev);

if (err)
phy_error(phydev);

free_irq(phydev->irq, phydev);

/* If work indeed has been cancelled, disable_irq() will have
 * been left unbalanced from phy_interrupt() and enable_irq()
 * has to be called so that other devices on the line work.
 */
while (atomic_dec_return(&phydev->irq_disable) >= 0)
enable_irq(phydev->irq);

return err;
}

Which part of this function changes phydev->irq to PHY_POLL?

Perhaps phydev->drv->config_intr?

What PHY are you using?

Regards.


Re: [pull request][net-next 0/3] Mellanox, mlx5 GRE tunnel offloads

2017-09-06 Thread Tom Herbert
On Wed, Sep 6, 2017 at 10:43 AM, Alexander Duyck
 wrote:
> On Wed, Sep 6, 2017 at 9:17 AM, Tom Herbert  wrote:
>> On Tue, Sep 5, 2017 at 8:06 PM, Alexander Duyck
>>  wrote:
>>> On Tue, Sep 5, 2017 at 2:13 PM, Tom Herbert  wrote:
> The situation with encapsulation is even more complicated:
>
> We are basically only interested in the UDP/vxlan/Ethernet/IP/UDP
> constellation. If we do the fragmentation inside the vxlan tunnel and
> carry over the skb hash to all resulting UDP/vxlan packets source ports,
> we are fine and reordering on the receiver NIC won't happen in this
> case. If the fragmentation happens on the outer UDP header, this will
> result in reordering of the inner L2 flow. Unfortunately this depends on
> how the vxlan tunnel was set up, how other devices do that and (I
> believe so) on the kernel version.
>
 This really isn't that complicated. The assumption that an IP network
 always delivers packets in order is simply wrong. The inventors of
 VXLAN must have known full well that when you use IP, packets can and
 eventually will be delivered out of order. This isn't just because of
 fragmentation, there are many other reasons that packets can be
 delivered OOO. This also must have been known when IP/GRE and any
 other protocol that carries L2 over IP was invented. If OOO is an
 issue for these protocols then they need to be fixed-- this is not a
 concern with IP protocol nor the stack.

 Tom
>>>
>>> As far as a little background on the original patch I believe the
>>> issue that was fixed by the patch was a video streaming application
>>> that was sending/receiving a mix of fragmented and non-fragmented
>>> packets. Receiving them out of order due to the fragmentation was
>>> causing issues with stutters in the video and so we ended up disabling
>>> UDP by default in the NICs listed. We decided to go that way as UDP
>>> RSS was viewed as a performance optimization, while the out-of-order
>>> problems were viewed as a functionality issue.
>>>
>> Hi Alex,
>>
>> Thanks for the details! Were you able to find the root cause for this?
>> In particular, it would be interesting to know if it is the kernel or
>> device that introduced the jitter, or if it's the application that
>> doesn't handle OOO well...
>>
>> Tom
>
> It is hard to say since my memory of the events from 7 years ago is
> pretty vague at this point, but I'm pretty sure it was the
> application. Basically getting the frames out of order was causing
> them to have to drop video data if I recall correctly.
>
Oh, I didn't notice that patch was from 2010. Maybe the application
has been fixed by now! :-)

Perhaps, it's time to try to turn UDP hashing on again by default?
Even if NICs aren't doing this, there are network devices that are
looking at UDP for ECMP and that doesn't seem to be causing widespread
problems currently...

Tom


Re: [PATCH net] Revert "net: phy: Correctly process PHY_HALTED in phy_stop_machine()"

2017-09-06 Thread Florian Fainelli
On 09/06/2017 11:00 AM, David Daney wrote:
> On 08/31/2017 11:29 AM, Florian Fainelli wrote:
>> On 08/31/2017 11:12 AM, Mason wrote:
>>> On 31/08/2017 19:53, Florian Fainelli wrote:
 On 08/31/2017 10:49 AM, Mason wrote:
> On 31/08/2017 18:57, Florian Fainelli wrote:
>> And the race is between phy_detach() setting phydev->attached_dev
>> = NULL
>> and phy_state_machine() running in PHY_HALTED state and calling
>> netif_carrier_off().
>
> I must be missing something.
> (Since a thread cannot race against itself.)
>
> phy_disconnect calls phy_stop_machine which
> 1) stops the work queue from running in a separate thread
> 2) calls phy_state_machine *synchronously*
>   which runs the PHY_HALTED case with everything well-defined
> end of phy_stop_machine
>
> phy_disconnect only then calls phy_detach()
> which makes future calls of phy_state_machine perilous.
>
> This all happens in the same thread, so I'm not yet
> seeing where the race happens?

 The race is as described in David's earlier email, so let's recap:

 Thread 1Thread 2
 phy_disconnect()
 phy_stop_interrupts()
 phy_stop_machine()
 phy_state_machine()
   -> queue_delayed_work()
 phy_detach()
 phy_state_machine()
 -> netif_carrier_off()

 If phy_detach() finishes earlier than the workqueue had a chance to be
 scheduled and process PHY_HALTED again, then we trigger the NULL
 pointer
 de-reference.

 workqueues are not tasklets, the CPU scheduling them gets no guarantee
 they will run on the same CPU.
>>>
>>> Something does not add up.
>>>
>>> The synchronous call to phy_state_machine() does:
>>>
>>> case PHY_HALTED:
>>> if (phydev->link) {
>>> phydev->link = 0;
>>> netif_carrier_off(phydev->attached_dev);
>>> phy_adjust_link(phydev);
>>> do_suspend = true;
>>> }
>>>
>>> then sets phydev->link = 0; therefore subsequent calls to
>>> phy_state_machine() will be no-ops.
>>
>> Actually you are right, once phydev->link is set to 0 these would become
>> no-ops. Still scratching my head as to what happens for David then...
>>
>>>
>>> Also, queue_delayed_work() is only called in polling mode.
>>> David stated that he's using interrupt mode.
> 
> Did you see what I wrote?

Still not following, see below.

> 
> phy_disconnect() calls phy_stop_interrupts() which puts it into polling
> mode.  So the polling work gets queued unconditionally.

What part of phy_stop_interrupts() is responsible for changing
phydev->irq to PHY_POLL? free_irq() cannot touch phydev->irq otherwise
subsequent request_irq() calls won't work anymore.
phy_disable_interrupts() only calls back into the PHY driver to
acknowledge and clear interrupts.

If we were using a PHY with PHY_POLL, as Marc said, the first
synchronous call to phy_state_machine() would have acted on PHY_HALTED
and even if we incorrectly keep re-scheduling the state machine from
PHY_HALTED to PHY_HALTED the second time around nothing can happen.

What are we missing here?

> 
> 
> 
>>
>> Right that's confusing too now. David can you check if you tree has:
>>
>> 49d52e8108a21749dc2114b924c907db43358984 ("net: phy: handle state
>> correctly in phy_stop_machine")
>>
> 
> Yes, I am using the 4.9 stable branch, and that commit was also present.

Thanks for checking that.
-- 
Florian


Re: [PATCH net-next] xdp: implement xdp_redirect_map for generic XDP

2017-09-06 Thread Daniel Borkmann

On 09/06/2017 08:42 PM, John Fastabend wrote:

On 09/06/2017 11:18 AM, Jesper Dangaard Brouer wrote:

On Wed, 06 Sep 2017 18:24:07 +0200
Daniel Borkmann  wrote:

On 09/06/2017 05:26 PM, Jesper Dangaard Brouer wrote:

Using bpf_redirect_map is allowed for generic XDP programs, but the
appropriate map lookup was never performed in xdp_do_generic_redirect().

Instead the map-index is directly used as the ifindex.  For the


Good point, but ...

[...]

   net/core/filter.c |   29 +
   1 file changed, 29 insertions(+)

diff --git a/net/core/filter.c b/net/core/filter.c
index 5912c738a7b2..6a4745bf2c9f 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2562,6 +2562,32 @@ int xdp_do_redirect(struct net_device *dev, struct 
xdp_buff *xdp,
   }
   EXPORT_SYMBOL_GPL(xdp_do_redirect);

+static int xdp_do_generic_redirect_map(struct net_device *dev,
+  struct sk_buff *skb,
+  struct bpf_prog *xdp_prog)
+{
+   struct redirect_info *ri = this_cpu_ptr(&redirect_info);
+   struct bpf_map *map = ri->map;
+   u32 index = ri->ifindex;
+   struct net_device *fwd;
+   int err;
+
+   ri->ifindex = 0;
+   ri->map = NULL;
+
+   fwd = __dev_map_lookup_elem(map, index);
+   if (!fwd) {
+   err = -EINVAL;
+   goto err;
+   }
+   skb->dev = fwd;
+   _trace_xdp_redirect_map(dev, xdp_prog, fwd, map, index);
+   return 0;
+err:
+   _trace_xdp_redirect_map_err(dev, xdp_prog, fwd, map, index, err);
+   return err;
+}
+
   int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb,
struct bpf_prog *xdp_prog)
   {
@@ -2571,6 +2597,9 @@ int xdp_do_generic_redirect(struct net_device *dev, 
struct sk_buff *skb,
unsigned int len;
int err = 0;

+   if (ri->map)
+   return xdp_do_generic_redirect_map(dev, skb, xdp_prog);


This is not quite correct. Really, the only thing you want
to do here is more or less ...

int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb,
struct bpf_prog *xdp_prog)
{
struct redirect_info *ri = this_cpu_ptr(&redirect_info);
struct bpf_map *map = ri->map;
u32 index = ri->ifindex;
struct net_device *fwd;
unsigned int len;
int err = 0;

ri->ifindex = 0;
ri->map = NULL;

if (map)
fwd = __dev_map_lookup_elem(map, index);
else
fwd = dev_get_by_index_rcu(dev_net(dev), index);
if (unlikely(!fwd)) {
err = -EINVAL;
goto err;
}
[...]

... such that you have a common path to also do the IFF_UP
and MTU checks that are done here, but otherwise omitted in
your patch.


Ah, yes. My patch misses the IFF_UP and MTU checks. (I was too inspired by
xdp_do_redirect_map).


Otherwise it looks good, but note that it also doesn't really
resolve the issue you mention wrt stale map pointers by the
way.  [...]


I actually discovered more cases where we can crash the kernel :-(

E.g. drivers not supporting XDP_REDIRECT are still allowed to load an
XDP bpf_prog that calls bpf_redirect_map() and sets the ->map pointer,
but they will never call xdp_do_redirect() (which is responsible for
clearing/consuming the ->map pointer).

Another case: You can also call bpf_redirect_map() and then NOT return
XDP_REDIRECT (it is obviously strange, but the bpf-helper API allows it).


I think we can cover both these cases with previous suggestion to check
prog pointers. Working up a patch now.


Yep, they would both be covered.


Re: [PATCH net-next] xdp: implement xdp_redirect_map for generic XDP

2017-09-06 Thread John Fastabend
On 09/06/2017 11:18 AM, Jesper Dangaard Brouer wrote:
> On Wed, 06 Sep 2017 18:24:07 +0200
> Daniel Borkmann  wrote:
> 
>> On 09/06/2017 05:26 PM, Jesper Dangaard Brouer wrote:
>>> Using bpf_redirect_map is allowed for generic XDP programs, but the
>>> appropriate map lookup was never performed in xdp_do_generic_redirect().
>>>
>>> Instead the map-index is directly used as the ifindex.  For the  
>>
>> Good point, but ...
>>
>> [...]
>>>   net/core/filter.c |   29 +
>>>   1 file changed, 29 insertions(+)
>>>
>>> diff --git a/net/core/filter.c b/net/core/filter.c
>>> index 5912c738a7b2..6a4745bf2c9f 100644
>>> --- a/net/core/filter.c
>>> +++ b/net/core/filter.c
>>> @@ -2562,6 +2562,32 @@ int xdp_do_redirect(struct net_device *dev, struct 
>>> xdp_buff *xdp,
>>>   }
>>>   EXPORT_SYMBOL_GPL(xdp_do_redirect);
>>>
>>> +static int xdp_do_generic_redirect_map(struct net_device *dev,
>>> +  struct sk_buff *skb,
>>> +  struct bpf_prog *xdp_prog)
>>> +{
>>> +   struct redirect_info *ri = this_cpu_ptr(&redirect_info);
>>> +   struct bpf_map *map = ri->map;
>>> +   u32 index = ri->ifindex;
>>> +   struct net_device *fwd;
>>> +   int err;
>>> +
>>> +   ri->ifindex = 0;
>>> +   ri->map = NULL;
>>> +
>>> +   fwd = __dev_map_lookup_elem(map, index);
>>> +   if (!fwd) {
>>> +   err = -EINVAL;
>>> +   goto err;
>>> +   }
>>> +   skb->dev = fwd;
>>> +   _trace_xdp_redirect_map(dev, xdp_prog, fwd, map, index);
>>> +   return 0;
>>> +err:
>>> +   _trace_xdp_redirect_map_err(dev, xdp_prog, fwd, map, index, err);
>>> +   return err;
>>> +}
>>> +
>>>   int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb,
>>> struct bpf_prog *xdp_prog)
>>>   {
>>> @@ -2571,6 +2597,9 @@ int xdp_do_generic_redirect(struct net_device *dev, 
>>> struct sk_buff *skb,
>>> unsigned int len;
>>> int err = 0;
>>>
>>> +   if (ri->map)
>>> +   return xdp_do_generic_redirect_map(dev, skb, xdp_prog);  
>>
>> This is not quite correct. Really, the only thing you want
>> to do here is more or less ...
>>
>> int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb,
>>  struct bpf_prog *xdp_prog)
>> {
>>  struct redirect_info *ri = this_cpu_ptr(&redirect_info);
>>  struct bpf_map *map = ri->map;
>>  u32 index = ri->ifindex;
>>  struct net_device *fwd;
>>  unsigned int len;
>>  int err = 0;
>>
>>  ri->ifindex = 0;
>>  ri->map = NULL;
>>
>>  if (map)
>>  fwd = __dev_map_lookup_elem(map, index);
>>  else
>>  fwd = dev_get_by_index_rcu(dev_net(dev), index);
>>  if (unlikely(!fwd)) {
>>  err = -EINVAL;
>>  goto err;
>>  }
>> [...]
>>
>> ... such that you have a common path to also do the IFF_UP
>> and MTU checks that are done here, but otherwise omitted in
>> your patch.
> 
> Ah, yes. My patch misses the IFF_UP and MTU checks. (I was too inspired by
> xdp_do_redirect_map).
> 
> 
>> Otherwise it looks good, but note that it also doesn't really
>> resolve the issue you mention wrt stale map pointers by the
>> way.  [...]
> 
> I actually discovered more cases where we can crash the kernel :-(
> 
> E.g. a driver not supporting XDP_REDIRECT is still allowed to load an
> XDP bpf_prog that calls bpf_redirect_map() and sets the ->map pointer,
> but it will never call xdp_do_redirect() (which is responsible for
> clearing/consuming ->map pointer).
> 
> Another case: You can also call bpf_redirect_map() and then NOT return
> XDP_REDIRECT (it is obviously strange, but the bpf-helper API allows it).
> 

I think we can cover both these cases with the previous suggestion to check
prog pointers. Working up a patch now.



Re: [PATCH v2 rfc 0/8] IGMP snooping for local traffic

2017-09-06 Thread Vivien Didelot
Hi Andrew,

Andrew Lunn  writes:

>> I don't understand why SWITCHDEV_OBJ_ID_PORT_MDB cannot be used. Isn't
>> setting the obj->orig_dev to the bridge device itself enough to
>> distinguish br0 from a switch port?
>
> I did consider this. But conceptually, it seems wrong.
> SWITCHDEV_OBJ_ID_PORT_MDB has always been about egress. I don't like
> adding a special case for ingress. Adding a new switchdev object
> avoids this special case.

SWITCHDEV_OBJ_ID_PORT_MDB means manipulating an MDB entry on a port.
When one wants to add an MDB entry to br0, SWITCHDEV_OBJ_ID_PORT_MDB is
used. If the device is br->dev, the lower devices (bridged ports) get
recursively notified and switchdev users can program themselves
accordingly. In the case of DSA, a slave port will program its
associated CPU port (port->cpu_dp) if orig_dev is a bridge.

This seems totally correct to me. I don't see any reason for adding and
maintaining a new switchdev object. What do switchdev guys think?

>> The main problem is that we will soon want support for multiple CPU
>> ports.
>
> We keep saying that, but it never actually happens. We should solve
> multiple CPU problems when we actually add support for multiple CPUs.
> Probably at that point, we need to push SWITCHDEV_OBJ_ID_PORT_HOST_MDB
> right down to a port, so that port can set up what needs setting up so
> its frames go to its CPU port.

This is not correct. I am currently making this easier, i.e. the
dsa_master patchset I sent before net-next closed.

I do understand your point however and even if this multi-CPU feature
takes time to arrive, we can always find a proper design which makes it
easy. Assuming that each port has its dedicated CPU port is the correct
concept to follow.

>> So adding the MDB entry to all CPU ports as you did in a patch 5/8 in
>> dsa_switch_host_mdb_*() is not correct because you can have CPU port
>> sw0p0 being a member of br0 and CPU port sw0p10 being a member of br1.
>
> Be careful, you are making assumptions about mapping cpu ports to
> bridges. That has not been defined yet.

I am not making any assumptions, but you did because you assume that all
CPU ports will be part of the same bridge.

We need to handle things at the DSA core level the following way:

If one programs the bridge device itself, a switchdev object is
notified, and each DSA slave devices member of the bridge will call the
correct dsa_port_* function (agnostic to the port type) with its cpu_dp.

This covers all cases cleanly. You also don't need any new DSA notifier.
You only need your 6/8 patch. To achieve that, there are two ways:

1) either we keep SWITCHDEV_OBJ_ID_PORT_MDB for the bridge device itself
and the slave devices are notified accordingly with a correct orig_dev:

static int dsa_slave_port_obj_add(struct net_device *dev,
                                  const struct switchdev_obj *obj,
                                  struct switchdev_trans *trans)
{
        struct dsa_slave_priv *p = netdev_priv(dev);
        struct dsa_port *port = p->dp;

        /* Program the CPU port if the target is the bridge itself */
        if (obj->orig_dev == port->br)
                port = port->cpu_dp;

        switch (obj->id) {
        case SWITCHDEV_OBJ_ID_PORT_MDB:
                return dsa_port_mdb_add(port, obj, trans);
        ...
        }
}

2) or you introduce a switchdev object which, by definition, targets the
CPU ports of our bridge members:

static int dsa_slave_port_obj_add(struct net_device *dev,
                                  const struct switchdev_obj *obj,
                                  struct switchdev_trans *trans)
{
        struct dsa_slave_priv *p = netdev_priv(dev);
        struct dsa_port *port = p->dp;

        switch (obj->id) {
        case SWITCHDEV_OBJ_ID_PORT_HOST_MDB:
                port = port->cpu_dp;
                /* fall through... */
        case SWITCHDEV_OBJ_ID_PORT_MDB:
                return dsa_port_mdb_add(port, obj, trans);
        ...
        }
}

You basically just need that and your 6/8 patch for the DSA part.


Thanks,

Vivien


Re: [PATCH net-next] xdp: implement xdp_redirect_map for generic XDP

2017-09-06 Thread Jesper Dangaard Brouer
On Wed, 06 Sep 2017 18:24:07 +0200
Daniel Borkmann  wrote:

> On 09/06/2017 05:26 PM, Jesper Dangaard Brouer wrote:
> > Using bpf_redirect_map is allowed for generic XDP programs, but the
> > appropriate map lookup was never performed in xdp_do_generic_redirect().
> >
> > Instead the map-index is directly used as the ifindex.  For the  
> 
> Good point, but ...
> 
> [...]
> >   net/core/filter.c |   29 +
> >   1 file changed, 29 insertions(+)
> >
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index 5912c738a7b2..6a4745bf2c9f 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -2562,6 +2562,32 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
> >   }
> >   EXPORT_SYMBOL_GPL(xdp_do_redirect);
> >
> > +static int xdp_do_generic_redirect_map(struct net_device *dev,
> > +  struct sk_buff *skb,
> > +  struct bpf_prog *xdp_prog)
> > +{
> > +   struct redirect_info *ri = this_cpu_ptr(&redirect_info);
> > +   struct bpf_map *map = ri->map;
> > +   u32 index = ri->ifindex;
> > +   struct net_device *fwd;
> > +   int err;
> > +
> > +   ri->ifindex = 0;
> > +   ri->map = NULL;
> > +
> > +   fwd = __dev_map_lookup_elem(map, index);
> > +   if (!fwd) {
> > +   err = -EINVAL;
> > +   goto err;
> > +   }
> > +   skb->dev = fwd;
> > +   _trace_xdp_redirect_map(dev, xdp_prog, fwd, map, index);
> > +   return 0;
> > +err:
> > +   _trace_xdp_redirect_map_err(dev, xdp_prog, fwd, map, index, err);
> > +   return err;
> > +}
> > +
> >   int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb,
> > struct bpf_prog *xdp_prog)
> >   {
> > @@ -2571,6 +2597,9 @@ int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb,
> > unsigned int len;
> > int err = 0;
> >
> > +   if (ri->map)
> > +   return xdp_do_generic_redirect_map(dev, skb, xdp_prog);  
> 
> This is not quite correct. Really, the only thing you want
> to do here is more or less ...
> 
> int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb,
>   struct bpf_prog *xdp_prog)
> {
>   struct redirect_info *ri = this_cpu_ptr(&redirect_info);
>   struct bpf_map *map = ri->map;
>   u32 index = ri->ifindex;
>   struct net_device *fwd;
>   unsigned int len;
>   int err = 0;
> 
>   ri->ifindex = 0;
>   ri->map = NULL;
> 
>   if (map)
>   fwd = __dev_map_lookup_elem(map, index);
>   else
>   fwd = dev_get_by_index_rcu(dev_net(dev), index);
>   if (unlikely(!fwd)) {
>   err = -EINVAL;
>   goto err;
>   }
> [...]
> 
> ... such that you have a common path to also do the IFF_UP
> and MTU checks that are done here, but otherwise omitted in
> your patch.

Ah, yes. My patch misses the IFF_UP and MTU checks. (I was too inspired by
xdp_do_redirect_map).


> Otherwise it looks good, but note that it also doesn't really
> resolve the issue you mention wrt stale map pointers by the
> way.  [...]

I actually discovered more cases where we can crash the kernel :-(

E.g. a driver not supporting XDP_REDIRECT is still allowed to load an
XDP bpf_prog that calls bpf_redirect_map() and sets the ->map pointer,
but it will never call xdp_do_redirect() (which is responsible for
clearing/consuming ->map pointer).

Another case: You can also call bpf_redirect_map() and then NOT return
XDP_REDIRECT (it is obviously strange, but the bpf-helper API allows it).

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [PATCH net-next RFC 2/2] tun: enable napi_gro_frags() for TUN/TAP driver

2017-09-06 Thread Eric Dumazet
On Tue, 2017-09-05 at 15:35 -0700, Petar Penkov wrote:
> Add a TUN/TAP receive mode that exercises the napi_gro_frags()
> interface. This mode is available only in TAP mode, as the interface
> expects packets with Ethernet headers.
> 


Hi Petar, thanks a lot for this work.

I must confess I have to retract one feedback I gave while reviewing
your patches.


> + local_bh_disable();
> + data = napi_alloc_frag(fragsz);
> + local_bh_enable();
> + if (!data) {
> + err = -ENOMEM;
> + goto free;
> + }
> +
> + page = virt_to_page(data);
> + offset = offset_in_page(data);

These two lines above indeed trigger too many problems in the kernel.
(Like the one you tried to cover here
https://patchwork.kernel.org/patch/9927927/ )

For your next submission, please use the code you originally had:

page = virt_to_head_page(data);
offset = data - page_address(page);
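
The difference Eric is pointing at can be sketched in plain userspace C. This is a hedged model, not kernel code: SIM_PAGE_SIZE and the sim_* names are illustrative stand-ins for PAGE_SIZE, offset_in_page(), and `data - page_address(virt_to_head_page(data))`. For a fragment carved out past the first 4 KiB of a multi-page (compound) page-frag block, offset_in_page() masks away the page-crossing part of the offset, while subtracting the head page address keeps the full offset:

```c
#include <assert.h>
#include <stdint.h>

#define SIM_PAGE_SIZE 4096UL	/* assumed page size for this sketch */

/* Model of offset_in_page(): masks the low bits of the address, so the
 * result is always an offset within a single PAGE_SIZE page. */
static unsigned long sim_offset_in_page(uintptr_t addr)
{
	return addr & (SIM_PAGE_SIZE - 1);
}

/* Model of data - page_address(virt_to_head_page(data)): the offset of
 * the fragment relative to the head page of the compound allocation,
 * which may legitimately exceed SIM_PAGE_SIZE. */
static unsigned long sim_offset_from_head(uintptr_t head, uintptr_t data)
{
	return data - head;
}
```

With a fragment starting 5000 bytes into the block, the head-relative offset is 5000, while the masked offset is 904 — an offset into the second page that, paired with virt_to_page(), no longer describes the original fragment.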


Thanks !




Re: [PATCH net] Revert "net: phy: Correctly process PHY_HALTED in phy_stop_machine()"

2017-09-06 Thread David Daney

On 08/31/2017 11:29 AM, Florian Fainelli wrote:

On 08/31/2017 11:12 AM, Mason wrote:

On 31/08/2017 19:53, Florian Fainelli wrote:

On 08/31/2017 10:49 AM, Mason wrote:

On 31/08/2017 18:57, Florian Fainelli wrote:

And the race is between phy_detach() setting phydev->attached_dev = NULL
and phy_state_machine() running in PHY_HALTED state and calling
netif_carrier_off().


I must be missing something.
(Since a thread cannot race against itself.)

phy_disconnect calls phy_stop_machine which
1) stops the work queue from running in a separate thread
2) calls phy_state_machine *synchronously*
  which runs the PHY_HALTED case with everything well-defined
end of phy_stop_machine

phy_disconnect only then calls phy_detach()
which makes future calls of phy_state_machine perilous.

This all happens in the same thread, so I'm not yet
seeing where the race happens?


The race is as described in David's earlier email, so let's recap:

Thread 1                        Thread 2
phy_disconnect()
phy_stop_interrupts()
phy_stop_machine()
phy_state_machine()
  -> queue_delayed_work()
phy_detach()
                                phy_state_machine()
                                -> netif_carrier_off()

If phy_detach() finishes earlier than the workqueue had a chance to be
scheduled and process PHY_HALTED again, then we trigger the NULL pointer
de-reference.

workqueues are not tasklets, the CPU scheduling them gets no guarantee
they will run on the same CPU.


Something does not add up.

The synchronous call to phy_state_machine() does:

case PHY_HALTED:
        if (phydev->link) {
                phydev->link = 0;
                netif_carrier_off(phydev->attached_dev);
                phy_adjust_link(phydev);
                do_suspend = true;
        }

then sets phydev->link = 0; therefore subsequent calls to
phy_state_machine() will be no-ops.


Actually you are right, once phydev->link is set to 0 these would become
no-ops. Still scratching my head as to what happens for David then...



Also, queue_delayed_work() is only called in polling mode.
David stated that he's using interrupt mode.


Did you see what I wrote?

phy_disconnect() calls phy_stop_interrupts() which puts it into polling 
mode.  So the polling work gets queued unconditionally.






Right that's confusing too now. David can you check if you tree has:

49d52e8108a21749dc2114b924c907db43358984 ("net: phy: handle state
correctly in phy_stop_machine")



Yes, I am using the 4.9 stable branch, and that commit was also present.

David.


Re: [PATCH net] Revert "net: phy: Correctly process PHY_HALTED in phy_stop_machine()"

2017-09-06 Thread David Daney

On 09/06/2017 07:33 AM, Mason wrote:

On 31/08/2017 20:29, Florian Fainelli wrote:

On 08/31/2017 11:12 AM, Mason wrote:

On 31/08/2017 19:53, Florian Fainelli wrote:

On 08/31/2017 10:49 AM, Mason wrote:

On 31/08/2017 18:57, Florian Fainelli wrote:
And the race is between phy_detach() setting phydev->attached_dev = NULL
and phy_state_machine() running in PHY_HALTED state and calling
netif_carrier_off().


I must be missing something.
(Since a thread cannot race against itself.)

phy_disconnect calls phy_stop_machine which
1) stops the work queue from running in a separate thread
2) calls phy_state_machine *synchronously*
  which runs the PHY_HALTED case with everything well-defined
end of phy_stop_machine

phy_disconnect only then calls phy_detach()
which makes future calls of phy_state_machine perilous.

This all happens in the same thread, so I'm not yet
seeing where the race happens?


The race is as described in David's earlier email, so let's recap:

Thread 1                        Thread 2
phy_disconnect()
phy_stop_interrupts()
phy_stop_machine()
phy_state_machine()
  -> queue_delayed_work()
phy_detach()
                                phy_state_machine()
                                -> netif_carrier_off()

If phy_detach() finishes earlier than the workqueue had a chance to be
scheduled and process PHY_HALTED again, then we trigger the NULL pointer
de-reference.

workqueues are not tasklets, the CPU scheduling them gets no guarantee
they will run on the same CPU.


Something does not add up.

The synchronous call to phy_state_machine() does:

case PHY_HALTED:
        if (phydev->link) {
                phydev->link = 0;
                netif_carrier_off(phydev->attached_dev);
                phy_adjust_link(phydev);
                do_suspend = true;
        }

then sets phydev->link = 0; therefore subsequent calls to
phy_state_machine() will be no-ops.


Actually you are right, once phydev->link is set to 0 these would become
no-ops. Still scratching my head as to what happens for David then...



Also, queue_delayed_work() is only called in polling mode.
David stated that he's using interrupt mode.


Right that's confusing too now. David can you check if you tree has:

49d52e8108a21749dc2114b924c907db43358984 ("net: phy: handle state
correctly in phy_stop_machine")


Hello David,

A week ago, you wrote about my patch:
"This is broken.  Please revert."

I assume you tested the revert locally, and that reverting did make
the crash disappear. Is that correct?



Yes, I always test things before making this type of assertion.



The reason I ask is because the analysis you provided contains some
flaws, as noted above. But, if reverting my patch did fix your issue,
then perhaps understanding *why* is unimportant.


I didn't want to take the time to generate calling sequence traces to 
verify each step of my analysis, but I believe the overall concept is 
essentially correct.


Once the polling work is canceled and we set a bunch of essential 
pointers to NULL, you cannot go blindly restarting the polling.




I'm a bit baffled that it took less than 90 minutes for your request
to be approved, and the patch reverted in all branches, before I even
had a chance to comment.



o The last chance for patches to v4.13 was fast approaching.

o There were multiple reports of failures caused by the patch.

o The patch was clearly stand-alone.

The kernel maintainers are a model of efficiency, there was no reason to 
delay.




Re: [pull request][net-next 0/3] Mellanox, mlx5 GRE tunnel offloads

2017-09-06 Thread Alexander Duyck
On Wed, Sep 6, 2017 at 9:17 AM, Tom Herbert  wrote:
> On Tue, Sep 5, 2017 at 8:06 PM, Alexander Duyck
>  wrote:
>> On Tue, Sep 5, 2017 at 2:13 PM, Tom Herbert  wrote:
 The situation with encapsulation is even more complicated:

 We are basically only interested in the UDP/vxlan/Ethernet/IP/UDP
 constellation. If we do the fragmentation inside the vxlan tunnel and
 carry over the skb hash to all resulting UDP/vxlan packets source ports,
 we are fine and reordering on the receiver NIC won't happen in this
 case. If the fragmentation happens on the outer UDP header, this will
 result in reordering of the inner L2 flow. Unfortunately this depends on
 how the vxlan tunnel was set up, how other devices do that and (I
 believe so) on the kernel version.

>>> This really isn't that complicated. The assumption that an IP network
>>> always delivers packets in order is simply wrong. The inventors of
>>> VXLAN must have know full well that when you use IP, packets can and
>>> eventually will be delivered out of order. This isn't just because of
>>> fragmentation, there are many other reasons that packets can be
>>> delivered OOO. This also must have been known when IP/GRE and any
>>> other protocol that carries L2 over IP was invented. If OOO is an
>>> issue for these protocols then they need to be fixed-- this is not a
>>> concern with IP protocol nor the stack.
>>>
>>> Tom
>>
>> As far as a little background on the original patch I believe the
>> issue that was fixed by the patch was a video streaming application
>> that was sending/receiving a mix of fragmented and non-fragmented
>> packets. Receiving them out of order due to the fragmentation was
>> causing issues with stutters in the video and so we ended up disabling
>> UDP by default in the NICs listed. We decided to go that way as UDP
>> RSS was viewed as a performance optimization, while the out-of-order
>> problems were viewed as a functionality issue.
>>
> Hi Alex,
>
> Thanks for the details! Were you able to find the root cause for this?
> In particular, it would be interesting to know if it is the kernel or
> device that introduced the jitter, or if it's the application that
> doesn't handle OOO well...
>
> Tom

It is hard to say since my memory of the events from 7 years ago is
pretty vague at this point, but I'm pretty sure it was the
application. Basically getting the frames out of order was causing
them to have to drop video data if I recall correctly.

- Alex


Re: [patch net] net: sched: fix memleak for chain zero

2017-09-06 Thread Cong Wang
On Wed, Sep 6, 2017 at 4:14 AM, Jiri Pirko  wrote:
> From: Jiri Pirko 
>
> There's a memleak happening for chain 0. The thing is, chain 0 needs to
> be always present, not created on demand. Therefore tcf_block_get upon
> creation of block calls the tcf_chain_create function directly. The
> chain is created with refcnt == 1, which is not correct in this case and
> causes the memleak. So move the refcnt increment into tcf_chain_get
> function even for the case when chain needs to be created.
>

Your approach could work but you just make the code even
uglier than it is now:

1. The current code is already ugly for special-casing chain 0:

if (--chain->refcnt == 0 && !chain->filter_chain && chain->index != 0)
        tcf_chain_destroy(chain);

2. With your patch, chain 0 has a different _initial_ refcnt with others.

3. Allowing an object (chain 0) to exist with refcnt==0

Compare it with my patch:

1. No special-case for chain 0, the above ugly part is removed

2. Every chain is equal and created with refcnt==1

3. Any chain with refcnt==0 is destroyed
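
The invariants Cong proposes can be modeled in a few lines of userspace C. This is a sketch, not the kernel code — struct chain below only mimics the refcount life cycle of tcf_chain_create/tcf_chain_get/tcf_chain_put: the creator (the block) starts the chain at refcnt 1, every additional user pairs get/put, and destruction happens exactly when the count hits zero, with no special case for chain 0.

```c
#include <assert.h>
#include <stdbool.h>

struct chain {
	int refcnt;
	bool destroyed;
};

/* every chain is created with refcnt == 1, held by its creator (the block) */
static void chain_create(struct chain *c)
{
	c->refcnt = 1;
	c->destroyed = false;
}

static void chain_get(struct chain *c)
{
	c->refcnt++;
}

/* any chain reaching refcnt == 0 is destroyed -- no chain-0 exception */
static void chain_put(struct chain *c)
{
	if (--c->refcnt == 0)
		c->destroyed = true;
}

/* block creates the chain, an action grabs and drops a reference,
 * then the block drops its own; only the last put destroys it */
static bool block_then_action_scenario(void)
{
	struct chain c;
	bool alive_after_action;

	chain_create(&c);	/* block's reference */
	chain_get(&c);		/* action's reference */
	chain_put(&c);		/* action released: chain must survive */
	alive_after_action = !c.destroyed;
	chain_put(&c);		/* block released: chain goes away */
	return alive_after_action && c.destroyed;
}
```

The point of the model is symmetry: a put immediately after the creating get is a nop-plus-destroy, with no index check anywhere.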


Re: [patch net] net: sched: fix memleak for chain zero

2017-09-06 Thread Jakub Kicinski
On Wed,  6 Sep 2017 13:14:19 +0200, Jiri Pirko wrote:
> From: Jiri Pirko 
> 
> There's a memleak happening for chain 0. The thing is, chain 0 needs to
> be always present, not created on demand. Therefore tcf_block_get upon
> creation of block calls the tcf_chain_create function directly. The
> chain is created with refcnt == 1, which is not correct in this case and
> causes the memleak. So move the refcnt increment into tcf_chain_get
> function even for the case when chain needs to be created.
> 
> Reported-by: Jakub Kicinski 
> Fixes: 5bc1701881e3 ("net: sched: introduce multichain support for filters")
> Signed-off-by: Jiri Pirko 

Tested-by: Jakub Kicinski 


RE: [PATCH net 3/4] lan78xx: Fix for eeprom read/write when device autosuspend

2017-09-06 Thread Nisar.Sayed
Thanks, will make separate patch.

> Hi Nisar
> 
> > +   else if ((ee->magic == LAN78XX_EEPROM_MAGIC) &&
> > +            (ee->offset >= 0 && ee->offset < MAX_EEPROM_SIZE) &&
> > +            (ee->len > 0 && (ee->offset + ee->len) <= MAX_EEPROM_SIZE))
> > +           ret = lan78xx_write_raw_eeprom(dev, ee->offset, ee->len, data);
> 
> This change does not appear to have anything to do with auto suspend.
> Please make it a separate patch.
> 
>Andrew


RE: [PATCH net 2/4] lan78xx: Add fixed_phy device support for LAN7801 device

2017-09-06 Thread Nisar.Sayed
Thanks Andrew, will try to change as suggested.

> On Wed, Sep 06, 2017 at 10:51:44AM +, nisar.sa...@microchip.com
> wrote:
> > From: Nisar Sayed 
> >
> > Add fixed_phy device support for LAN7801 device
> >
> > When the LAN7801 device is connected to a PHY device which does not have
> > MDIO/MDC access, a fixed_phy device will be added.
> 
> Please try to find a way to do this without all the #ifdefs. They can be
> acceptable in header files, but should be avoided in .c files.
> 
>Andrew


RE: [PATCH net 1/4] lan78xx: Fix for crash associated with System suspend

2017-09-06 Thread Nisar.Sayed
Thanks Andrew for the inputs.
 
> On Wed, Sep 06, 2017 at 10:51:31AM +, nisar.sa...@microchip.com
> wrote:
> > From: Nisar Sayed 
> >
> > Fix for crash associated with System suspend
> >
> > Since ndo_stop removes phydev, phydev becomes NULL.
> > Whenever system suspend is initiated, or after "ifconfig down", if
> > set_wol or get_wol is triggered, the NULL phydev leads to a system crash.
> > Hence using phy_start/phy_stop in ndo_start/ndo_stop fixes the issues
> > instead of adding/removing the phy device.
> 
> Looking at this patch, there apears to be lots of different things going on.
> Please can you split it up into multiple patches.

Sure will split it up.

> 
> > Signed-off-by: Nisar Sayed 
> > ---
> >  drivers/net/usb/lan78xx.c | 44
> > 
> >  1 file changed, 28 insertions(+), 16 deletions(-)
> >
> > diff --git a/drivers/net/usb/lan78xx.c b/drivers/net/usb/lan78xx.c
> > index b99a7fb..955ab3b 100644
> > --- a/drivers/net/usb/lan78xx.c
> > +++ b/drivers/net/usb/lan78xx.c
> > @@ -2024,6 +2024,8 @@ static int lan78xx_phy_init(struct lan78xx_net *dev)
> >  lan8835_fixup);
> > if (ret < 0) {
> > netdev_err(dev->net, "fail to register fixup\n");
> > +   phy_unregister_fixup_for_uid(PHY_KSZ9031RNX, 0xfff0);
> 
> goto error; would be better. phy_unregister_fixup_for_uid() does not care if
> you try to unregister something which has not been registered.
> 
> Also, this should be a separate patch.

Ok, will make it as separate patch

> 
> > return ret;
> > }
> >         /* add more external PHY fixup here if needed */
> > @@ -2031,8 +2033,7 @@ static int lan78xx_phy_init(struct lan78xx_net *dev)
> > phydev->is_internal = false;
> > } else {
> > netdev_err(dev->net, "unknown ID found\n");
> > -   ret = -EIO;
> > -   goto error;
> > +   return -EIO;
> > }
> >
> >         /* if phyirq is not set, use polling mode in phylib */
> > @@ -2051,7 +2052,10 @@ static int lan78xx_phy_init(struct lan78xx_net *dev)
> > if (ret) {
> > netdev_err(dev->net, "can't attach PHY to %s\n",
> >dev->mdiobus->id);
> > -   return -EIO;
> > +   ret = -EIO;
> > +   if (dev->chipid == ID_REV_CHIP_ID_7801_)
> > +   goto error;
> > +   return ret;
> 
> Why not add the if (dev->chipid == ID_REV_CHIP_ID_7801_) after the
> error: label?
> 
> 


Yes, will correct it

> 
> > }
> >
> >         /* MAC doesn't support 1000T Half */
> > @@ -2067,8 +2071,6 @@ static int lan78xx_phy_init(struct lan78xx_net *dev)
> >
> > dev->fc_autoneg = phydev->autoneg;
> >
> > -   phy_start(phydev);
> > -
> > netif_dbg(dev, ifup, dev->net, "phy initialised successfully");
> >
> > return 0;
> > @@ -2497,9 +2499,9 @@ static int lan78xx_open(struct net_device *net)
> > if (ret < 0)
> > goto done;
> >
> > -   ret = lan78xx_phy_init(dev);
> > -   if (ret < 0)
> > -   goto done;
> > +   if (dev->domain_data.phyirq > 0)
> > +   phy_start_interrupts(dev->net->phydev);
> 
> This is unusual. I don't see any other MAC driver starting interrupts.
> This needs explaining.
> 
>  Andrew

Since "lan78xx_open" calls "lan78xx_reset" (device reset), it is required to
re-enable the interrupt afterwards.
Initially, when "phydev->state = PHY_READY", "phy_start" will not enable
interrupts.
However, after "lan78xx_stop", when "phy_stop" sets "phydev->state = PHY_HALTED",
a subsequent call to "phy_start" will enable interrupts.

Hence "phy_start_interrupts" is used after "lan78xx_reset".

- Nisar


Re: [Patch net] net_sched: fix a memory leak of filter chain

2017-09-06 Thread Cong Wang
On Wed, Sep 6, 2017 at 12:38 AM, Jiri Pirko  wrote:
> Wed, Sep 06, 2017 at 07:03:10AM CEST, xiyou.wangc...@gmail.com wrote:
>>tcf_chain_destroy() is called by tcf_block_put() and tcf_chain_put().
>>tcf_chain_put() is refcounted and paired with tcf_chain_get(),
>>but tcf_block_put() is not, it should be paired with tcf_block_get()
>>and we still need to decrease the refcnt. However, tcf_block_put()
>>is special, it stores the chains too, we have to detach them if
>>it is not the last user.
>
> You don't describe the original issue, or I am missing that from your
> description.

The original issue is the mismatch of tcf_block_put() and tcf_block_get()
w.r.t. refcnt. Think of it this way: if you call tcf_block_put() immediately
after tcf_block_get(), would you get effectively a nop?


>
>
>>
>>What's more, index 0 is not special at all, it should be treated
>>like other chains. This also makes the code more readable.
>
> [...]
>
>
>>@@ -246,10 +246,7 @@ EXPORT_SYMBOL(tcf_chain_get);
>>
>> void tcf_chain_put(struct tcf_chain *chain)
>> {
>>-  /* Destroy unused chain, with exception of chain 0, which is the
>>-   * default one and has to be always present.
>>-   */
>>-  if (--chain->refcnt == 0 && !chain->filter_chain && chain->index != 0)
>>+  if (--chain->refcnt == 0)
>
> The refcounting is only done for actions holding reference to the chain.
> You still need to check is the filter chain is not empty.
> See tc_ctl_tfilter.

With my patch refcnt is done for block too, if you notice the
tcf_chain_put() in tcf_block_put().


>
> Also, chain 0 is created by default on a block creation. It has to be
> present always for a reason. Please see tcf_block_get. The pointer to
> chain 0 is assigned to the qdisc filter list pointer.

Sure, this is why block holds a refcnt to chain (not just chain 0) with
my patch, aka why the initial refcnt is 1 rather than 0.


Re: [PATCH] DSA support for Micrel KSZ8895

2017-09-06 Thread Andrew Lunn
> The patches are under review internally and will need to be updated
> and approved by Woojung before formal submission.  Problem is
> although KSZ8795 and KSZ8895 drivers are new code and will be
> submitted as RFC, they depend on the change of KSZ9477 driver
> currently in the kernel, which require more rigorous review.

Please be aware that they will also go though review when you post
them. This can be anything from great, nice job, to throw them away
and start again. Since you are submitting RFCs we understand it is
early code, issues still to be solved, and we can make suggestions how
to solve those issues.

Post early, post often...

 Andrew


Re: [PATCH net-next] xdp: implement xdp_redirect_map for generic XDP

2017-09-06 Thread Daniel Borkmann

On 09/06/2017 06:24 PM, Daniel Borkmann wrote:
[...]

Otherwise it looks good, but note that it also doesn't really
resolve the issue you mention wrt stale map pointers by the
way. This would need a different way to clear out the pointers
from redirect_info, I'm thinking when we have devmap dismantle
time after RCU grace period we should check whether there are
still stale pointers from this map around and clear them under
disabled preemption, but need to brainstorm a bit more on that
first.


Scratch that approach, doesn't work. So thinking bit more on
this, what we could do here is the following: verifier knows
we called bpf_xdp_redirect_map() helper, so it could do a small
insn rewrite in the sense that it fills R4 with a pointer to
the bpf_prog. We have that at verification time anyway and R4
is allowed to be populated since we scratch it per convention.
Then, the helper would store the prog pointer in struct redirect_info.
Later in xdp_do_*_redirect() we check whether the redirect_info's
prog pointer is the same as passed xdp_prog pointer, and if
that's the case then all good, since the prog holds a ref on
the map anyway, if they are not equal in the unlikely case, it
means stale pointer, so we bail out right there. That would
work imo, will see to code it up and check it out.
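
Daniel's idea — record which program set ri->map and bail out in xdp_do_*_redirect() when a different program shows up — can be sketched in userspace C. Everything below is a hypothetical model (the _sim names and the map_owner field are illustrative; the real per-CPU struct redirect_info lives in net/core/filter.c):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* model of the per-CPU redirect state */
struct redirect_info_sim {
	void *map;		/* set by the bpf_redirect_map() helper */
	void *map_owner;	/* prog that set ->map (Daniel's R4 rewrite) */
};

static struct redirect_info_sim ri;

/* the helper records both the map and the owning program */
static void bpf_redirect_map_sim(void *map, void *prog)
{
	ri.map = map;
	ri.map_owner = prog;
}

/* returns true when the redirect may proceed, false when we must bail */
static bool xdp_do_redirect_sim(void *prog)
{
	void *map = ri.map;
	void *owner = ri.map_owner;

	/* consume the state unconditionally, as the real code clears ->map */
	ri.map = NULL;
	ri.map_owner = NULL;

	if (!map)
		return false;	/* nothing pending */
	if (owner != prog)
		return false;	/* stale pointer from another prog: bail */
	return true;		/* this prog holds a ref on the map: safe */
}
```

This shows only the shape of the check; the in-tree fix for this problem class later compared a stored owner against the running xdp_prog in the same spirit.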


Re: [PATCH v2 rfc 0/8] IGMP snooping for local traffic

2017-09-06 Thread Andrew Lunn
On Wed, Sep 06, 2017 at 11:25:26AM -0400, Vivien Didelot wrote:
> Hi Andrew, Nikolay,
> 
> Andrew Lunn  writes:
> 
> > Then starts the work passing down to the hardware that the host has
> > joined/left a group. The existing switchdev mdb object cannot be used,
> > since the semantics are different. The existing
> > SWITCHDEV_OBJ_ID_PORT_MDB is used to indicate a specific multicast
> > group should be forwarded out that port of the switch. However here we
> > require the exact opposite. We want multicast frames for the group
> > received on the port to the forwarded to the host. Hence add a new
> > object SWITCHDEV_OBJ_ID_HOST_MDB, a multicast database entry to
> > forward to the host. This new object is then propagated through the
> > DSA layers. No DSA driver changes should be needed, this should just
> > work...
> 
> I'm not sure if you already explained that, if so, sorry in advance.
> 
> I don't understand why SWITCHDEV_OBJ_ID_PORT_MDB cannot be used. Isn't
> setting the obj->orig_dev to the bridge device itself enough to
> distinguish br0 from a switch port?

Hi Vivien

I did consider this. But conceptually, it seems wrong.
SWITCHDEV_OBJ_ID_PORT_MDB has always been about egress. I don't like
adding a special case for ingress. Adding a new switchdev object
avoids this special case.

> static int dsa_slave_port_obj_add(struct net_device *dev,
>                                   const struct switchdev_obj *obj,
>                                   struct switchdev_trans *trans)
> {
>         struct dsa_slave_priv *p = netdev_priv(dev);
>         struct dsa_port *port = p->dp;
>
>         /* Is the target port the bridge device itself? */
>         if (obj->orig_dev == port->br)
>                 port = port->cpu_dp;
>
>         return dsa_port_mdb_add(port, obj, trans);
> }
> 
> The main problem is that we will soon want support for multiple CPU
> ports.

We keep saying that, but it never actually happens. We should solve
multiple CPU problems when we actually add support for multiple CPUs.
Probably at that point, we need to push SWITCHDEV_OBJ_ID_PORT_HOST_MDB
right down to a port, so that port can set up what needs setting up so
its frames go to its CPU port.

> So adding the MDB entry to all CPU ports as you did in a patch 5/8 in
> dsa_switch_host_mdb_*() is not correct because you can have CPU port
> sw0p0 being a member of br0 and CPU port sw0p10 being a member of br1.

Be careful, you are making assumptions about mapping cpu ports to
bridges. That has not been defined yet.

 Andrew


[iproute PATCH] utils: Review strlcpy() and strlcat()

2017-09-06 Thread Phil Sutter
As David Laight correctly pointed out, the first version of strlcpy()
modified dst buffer behind the string copied into it. Fix this by
writing NUL to the byte immediately following src string instead of to
the last byte in dst. Doing so also makes it possible to reduce overhead by
using memcpy().

Improve strlcat() by avoiding the call to strlcpy() if dst string is
already full, not just as sanity check.

Signed-off-by: Phil Sutter 
---
 lib/utils.c | 12 
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/lib/utils.c b/lib/utils.c
index 330ab073c2068..bbd3cbc46a0e5 100644
--- a/lib/utils.c
+++ b/lib/utils.c
@@ -1233,18 +1233,22 @@ int get_real_family(int rtm_type, int rtm_family)
 
 size_t strlcpy(char *dst, const char *src, size_t size)
 {
+   size_t srclen = strlen(src);
+
if (size) {
-   strncpy(dst, src, size - 1);
-   dst[size - 1] = '\0';
+   size_t minlen = min(srclen, size - 1);
+
+   memcpy(dst, src, minlen);
+   dst[minlen] = '\0';
}
-   return strlen(src);
+   return srclen;
 }
 
 size_t strlcat(char *dst, const char *src, size_t size)
 {
size_t dlen = strlen(dst);
 
-   if (dlen > size)
+   if (dlen >= size)
return dlen + strlen(src);
 
return dlen + strlcpy(dst + dlen, src, size - dlen);
-- 
2.13.1
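For readers who want to poke at the semantics outside the iproute2 tree, here is a userspace sketch of the two routines as patched above. They are renamed xstrlcpy()/xstrlcat() to avoid colliding with libc versions, and the min() macro stands in for the helper used in lib/utils.c:

```c
#include <assert.h>
#include <string.h>

#define min(a, b) ((a) < (b) ? (a) : (b))

/* Copy src into dst, truncating to size-1 bytes and always
 * NUL-terminating when size > 0.  Returns strlen(src). */
size_t xstrlcpy(char *dst, const char *src, size_t size)
{
	size_t srclen = strlen(src);

	if (size) {
		size_t minlen = min(srclen, size - 1);

		memcpy(dst, src, minlen);
		dst[minlen] = '\0';
	}
	return srclen;
}

/* Append src to dst, skipping the copy entirely when dst is
 * already full (the dlen >= size case fixed by the patch). */
size_t xstrlcat(char *dst, const char *src, size_t size)
{
	size_t dlen = strlen(dst);

	if (dlen >= size)
		return dlen + strlen(src);

	return dlen + xstrlcpy(dst + dlen, src, size - dlen);
}
```

Note the BSD strlcpy() contract both fixes preserve: the return value is the length the full result would have had, so a return value >= size signals truncation.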



RE: [PATCH] DSA support for Micrel KSZ8895

2017-09-06 Thread Tristram.Ha
> -Original Message-
> From: Maxim Uvarov [mailto:muva...@gmail.com]
> Sent: Wednesday, September 06, 2017 2:15 AM
> To: Tristram Ha - C24268
> Cc: Pavel Machek; Woojung Huh - C21699; Nathan Conrad; Vivien Didelot;
> Florian Fainelli; netdev; linux-ker...@vger.kernel.org; Andrew Lunn
> Subject: Re: [PATCH] DSA support for Micrel KSZ8895
> 
> 2017-08-31 0:32 GMT+03:00  :
> >> On Mon 2017-08-28 16:09:27, Andrew Lunn wrote:
> >> > > I may be confused here, but AFAICT:
> >> > >
> >> > > 1) Yes, it has standard layout when accessed over MDIO.
> >> >
> >> >
> >> > Section 4.8 of the datasheet says:
> >> >
> >> > All the registers defined in this section can be also accessed
> >> > via the SPI interface.
> >> >
> >> > Meaning all PHY registers can be access via the SPI interface. So
> >> > you should be able to make a standard Linux MDIO bus driver which
> >> > performs SPI reads.
> >>
> >> As far as I can tell (and their driver confirms) -- yes, all those
> >> registers can be accessed over the SPI, they are just shuffled
> >> around... hence MDIO emulation code. I copied it from their code (see
> >> the copyrights) so no, I don't believe there's nicer solution.
> >>
> >> Best regards,
> >>
> >>
> >> Pavel
> >
> > Can you hold off your development work on the KSZ8895 driver?  I am afraid
> > your effort may be in vain.  We at Microchip are planning to release DSA
> > drivers for all KSZ switches, starting with KSZ8795, then KSZ8895, and
> > KSZ8863.
> >
> > The driver files all follow the structures of the current KSZ9477 DSA 
> > driver,
> and the file tag_ksz.c will be updated to handle the tail tag of different 
> chips,
> which requires including the ksz_priv.h header.  That is required
> nevertheless to support using the offload_fwd_mark indication.
> >
> > The KSZ8795 driver will be submitted after Labor Day (9/4) if testing 
> > reveals
> no problem.  The KSZ8895 driver will be submitted right after that.  You
> should have no problem using the driver right away.
> >
> 
> Hello Tristram, is there any update for that driver?
> 
> Maxim.
> 

The patches are under internal review and will need to be updated and
approved by Woojung before formal submission.  The problem is that although
the KSZ8795 and KSZ8895 drivers are new code and will be submitted as RFC,
they depend on changes to the KSZ9477 driver currently in the kernel, which
require more rigorous review.



Re: [PATCH net v2] rds: Fix incorrect statistics counting

2017-09-06 Thread Santosh Shilimkar

On 9/6/2017 9:35 AM, Håkon Bugge wrote:

In rds_send_xmit() there is logic to batch the sends. However, if
another thread has acquired the lock and has incremented the send_gen,
it is considered a race and we yield. The code incrementing the
s_send_lock_queue_raced statistics counter did not count this event
correctly.

This commit counts the race condition correctly.

Changes from v1:
- Removed check for *someone_on_xmit()*
- Fixed incorrect indentation

Signed-off-by: Håkon Bugge 
Reviewed-by: Knut Omang 
---


Thanks for the update.
Acked-by: Santosh Shilimkar 


Re: [PATCH v2 rfc 0/8] IGMP snooping for local traffic

2017-09-06 Thread Andrew Lunn
> >On the switch ASICs we work with, the driver has information on whether
> >the packet was forwarded in hardware. This is a per-packet reason code
> >telling why the CPU is seeing the packet. The driver can use this
> >information to reset skb->offload_fwd_mark to allow software forwarding.

> I am not positive this is universally available across different
> switch vendors.

It is not universally available. We cannot rely on it being available
with switches supported by DSA.

We have a few choices:

1) We assume anything the switch forwards to the CPU has also been
   sent out whatever ports of the switch it needs to. Set
   offload_fwd_mark.

2) We assume anything the switch forwards to the CPU has not gone
   anywhere else, and the bridge needs to send it out whatever ports
   it thinks. Don't set offload_fwd_mark.

3) We define some rules about what packets the switch should handle,
   and then do some deep packet inspection to decide if
   offload_fwd_mark should be set or not.

I don't see 3) being possible. We are dealing with a fixed silicon
data path, not something which is fully programmable.

So it is down to 1) or 2). I've been assuming 1), but maybe we need to
discuss that as well.

Andrew
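To make options 1) and 2) concrete, here is a tiny userspace model of the choice. struct skb is a stand-in for the kernel's sk_buff; only the offload_fwd_mark field and its meaning to the software bridge come from the discussion above, and the function names are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal stand-in for the kernel's sk_buff. */
struct skb {
	bool offload_fwd_mark;
};

/* Option 1): assume the switch already flooded the frame out all
 * required ports before trapping it to the CPU. */
static void rx_option1(struct skb *skb)
{
	skb->offload_fwd_mark = true;
}

/* Option 2): assume the frame went nowhere else; leave flooding
 * to the software bridge. */
static void rx_option2(struct skb *skb)
{
	skb->offload_fwd_mark = false;
}

/* The software bridge floods a frame only when the mark is unset. */
static bool bridge_should_flood(const struct skb *skb)
{
	return !skb->offload_fwd_mark;
}
```

This is exactly the broadcast problem described earlier in the thread: under option 1), a chip that does not actually flood broadcast in hardware leaves the frame undelivered, because the bridge trusts the mark and stays silent.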


[PATCH net v2] rds: Fix incorrect statistics counting

2017-09-06 Thread Håkon Bugge
In rds_send_xmit() there is logic to batch the sends. However, if
another thread has acquired the lock and has incremented the send_gen,
it is considered a race and we yield. The code incrementing the
s_send_lock_queue_raced statistics counter did not count this event
correctly.

This commit counts the race condition correctly.

Changes from v1:
- Removed check for *someone_on_xmit()*
- Fixed incorrect indentation

Signed-off-by: Håkon Bugge 
Reviewed-by: Knut Omang 
---
 net/rds/send.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/net/rds/send.c b/net/rds/send.c
index 058a407..b52cdc8 100644
--- a/net/rds/send.c
+++ b/net/rds/send.c
@@ -428,14 +428,18 @@ int rds_send_xmit(struct rds_conn_path *cp)
 	 * some work and we will skip our goto
 	 */
 	if (ret == 0) {
+		bool raced;
+
 		smp_mb();
+		raced = send_gen != READ_ONCE(cp->cp_send_gen);
+
 		if ((test_bit(0, &conn->c_map_queued) ||
-		     !list_empty(&cp->cp_send_queue)) &&
-		    send_gen == READ_ONCE(cp->cp_send_gen)) {
-			rds_stats_inc(s_send_lock_queue_raced);
+		     !list_empty(&cp->cp_send_queue)) && !raced) {
 			if (batch_count < send_batch_count)
 				goto restart;
 			queue_delayed_work(rds_wq, &cp->cp_send_w, 1);
+		} else if (raced) {
+			rds_stats_inc(s_send_lock_queue_raced);
 		}
 	}
 out:
-- 
2.9.3
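The logic the patch is after can be modelled in isolation: snapshot a generation counter, and count a race only when the counter moved underneath us. A userspace sketch (the names mirror the patch, but this is an illustration, not the kernel code):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative stand-ins for cp->cp_send_gen and the rds statistics
 * counter; the real ones live in struct rds_conn_path and per-cpu
 * stats. */
static atomic_uint cp_send_gen;
static unsigned long s_send_lock_queue_raced;

/* Decide, as the patch does, whether another sender bumped the
 * generation since we sampled it -- and bump the counter exactly
 * (and only) in that case. */
static bool send_raced(unsigned int send_gen_snapshot)
{
	bool raced = send_gen_snapshot != atomic_load(&cp_send_gen);

	if (raced)
		s_send_lock_queue_raced++;
	return raced;
}
```

The point of the fix is visible here: the old code incremented the counter on the *not raced* branch, so s_send_lock_queue_raced recorded some share of the non-raced batching cases instead of the races.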



Re: [PATCH net-next 1/1] hv_netvsc: fix deadlock on hotplug

2017-09-06 Thread Stephen Hemminger
On Wed, 6 Sep 2017 16:23:45 +
Haiyang Zhang  wrote:

> > -Original Message-
> > From: Stephen Hemminger [mailto:step...@networkplumber.org]
> > Sent: Wednesday, September 6, 2017 11:19 AM
> > To: KY Srinivasan ; Haiyang Zhang
> > ; Stephen Hemminger 
> > Cc: de...@linuxdriverproject.org; netdev@vger.kernel.org
> > Subject: [PATCH net-next 1/1] hv_netvsc: fix deadlock on hotplug
> > 
> > When a virtual device is added dynamically (via host console), then
> > the vmbus sends an offer message for the primary channel. The processing
> > of this message for networking causes the network device to then
> > initialize the sub channels.
> > 
> > The problem is that setting up the sub channels needs to wait until
> > the subsequent subchannel offers have been processed. These offers
> > come in on the same ring buffer and work queue as where the primary
> > offer is being processed; leading to a deadlock.
> > 
> > This did not happen in older kernels, because the sub channel waiting
> > logic was broken (it wasn't really waiting).
> > 
> > The solution is to do the sub channel setup in its own work queue
> > context that is scheduled by the primary channel setup; and then
> > happens later.
> > 
> > Fixes: 732e49850c5e ("netvsc: fix race on sub channel creation")
> > Reported-by: Dexuan Cui 
> > Signed-off-by: Stephen Hemminger 
> > ---
> > Should also go to stable, but this version does not apply cleanly
> > to 4.13. Have another patch for that.
> > 
> >  drivers/net/hyperv/hyperv_net.h   |   1 +
> >  drivers/net/hyperv/netvsc_drv.c   |   8 +--
> >  drivers/net/hyperv/rndis_filter.c | 106 ++-
> > ---
> >  3 files changed, 74 insertions(+), 41 deletions(-)  
> 
> The patch looks good overall. I just have a question:
> 
> With this patch, after module load and probe are done, there may still be
> subchannels being processed. If we rmmod immediately, the subchannel offers
> may hit half-way removed device structures... Do we also need to add
> cancel_work_sync(&nvdev->subchan_work) to the top of netvsc_remove()?
> 
> unregister_netdevice() includes device close, but it's only called later
> in the netvsc_remove() when rndis is already removed.
> 
> Thanks,
> - Haiyang

Good catch.
If the driver called unregister_netdevice first before doing 
rndis_filter_device_remove
that would solve the problem. That wouldn't cause additional problems and it 
makes
sense to close the network layer first.
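The deadlock and its fix can be modelled with a toy single-threaded work queue: the primary-offer handler must not wait inline for work that the same queue can only run after it returns, so it schedules a second-stage item instead. All names here are hypothetical, userspace stand-ins for the vmbus offer work queue:

```c
#include <assert.h>
#include <stdbool.h>

/* A toy single-threaded work queue: items run strictly in order, so an
 * item that blocked waiting for a later item would never return. */
#define MAX_WORK 8

typedef void (*work_fn)(void);

static work_fn queue[MAX_WORK];
static int head, tail;
static bool subchannels_ready;

static void schedule_work(work_fn fn)
{
	queue[tail++ % MAX_WORK] = fn;
}

/* Second stage: runs only after the primary handler has returned,
 * when the subchannel offers it depends on can be processed. */
static void subchan_setup(void)
{
	subchannels_ready = true;
}

/* Primary offer handler: instead of waiting inline (which would
 * deadlock the single queue), defer subchannel setup to its own
 * work item. */
static void primary_offer(void)
{
	schedule_work(subchan_setup);
}

static void run_queue(void)
{
	while (head != tail)
		queue[head++ % MAX_WORK]();
}
```

Haiyang's rmmod concern maps onto this model too: anything torn down while subchan_setup is still queued must first cancel that pending item (the cancel_work_sync() in the real driver).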


Re: [PATCH v2 rfc 8/8] net: dsa: Fix SWITCHDEV_ATTR_ID_PORT_PARENT_ID

2017-09-06 Thread Andrew Lunn
> >Yes, you are correct. I will change this.
> 

> A switch cluster should be tied to the same dsa_switch_tree though,
> can we use dst->tree as a unique identifier?

Yes, that is what Vivien was suggesting.

 Andrew


Re: [PATCH net-next] xdp: implement xdp_redirect_map for generic XDP

2017-09-06 Thread Daniel Borkmann

On 09/06/2017 05:26 PM, Jesper Dangaard Brouer wrote:

Using bpf_redirect_map is allowed for generic XDP programs, but the
appropriate map lookup was never performed in xdp_do_generic_redirect().

Instead the map-index is directly used as the ifindex.  For the


Good point, but ...

[...]

  net/core/filter.c |   29 +
  1 file changed, 29 insertions(+)

diff --git a/net/core/filter.c b/net/core/filter.c
index 5912c738a7b2..6a4745bf2c9f 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2562,6 +2562,32 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
  }
  EXPORT_SYMBOL_GPL(xdp_do_redirect);

+static int xdp_do_generic_redirect_map(struct net_device *dev,
+  struct sk_buff *skb,
+  struct bpf_prog *xdp_prog)
+{
+   struct redirect_info *ri = this_cpu_ptr(&redirect_info);
+   struct bpf_map *map = ri->map;
+   u32 index = ri->ifindex;
+   struct net_device *fwd;
+   int err;
+
+   ri->ifindex = 0;
+   ri->map = NULL;
+
+   fwd = __dev_map_lookup_elem(map, index);
+   if (!fwd) {
+   err = -EINVAL;
+   goto err;
+   }
+   skb->dev = fwd;
+   _trace_xdp_redirect_map(dev, xdp_prog, fwd, map, index);
+   return 0;
+err:
+   _trace_xdp_redirect_map_err(dev, xdp_prog, fwd, map, index, err);
+   return err;
+}
+
  int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb,
struct bpf_prog *xdp_prog)
  {
@@ -2571,6 +2597,9 @@ int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb,
unsigned int len;
int err = 0;

+   if (ri->map)
+   return xdp_do_generic_redirect_map(dev, skb, xdp_prog);


This is not quite correct. Really, the only thing you want
to do here is more or less ...

int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb,
struct bpf_prog *xdp_prog)
{
struct redirect_info *ri = this_cpu_ptr(&redirect_info);
struct bpf_map *map = ri->map;
u32 index = ri->ifindex;
struct net_device *fwd;
unsigned int len;
int err = 0;

ri->ifindex = 0;
ri->map = NULL;

if (map)
fwd = __dev_map_lookup_elem(map, index);
else
fwd = dev_get_by_index_rcu(dev_net(dev), index);
if (unlikely(!fwd)) {
err = -EINVAL;
goto err;
}
[...]

... such that you have a common path to also do the IFF_UP
and MTU checks that are done here, but otherwise omitted in
your patch.

Otherwise it looks good, but note that it also doesn't really
resolve the issue you mention wrt stale map pointers by the
way. This would need a different way to clear out the pointers
from redirect_info, I'm thinking when we have devmap dismantle
time after RCU grace period we should check whether there are
still stale pointers from this map around and clear them under
disabled preemption, but need to brainstorm a bit more on that
first.


fwd = dev_get_by_index_rcu(dev_net(dev), index);
ri->ifindex = 0;
if (unlikely(!fwd)) {





RE: [PATCH net-next 1/1] hv_netvsc: fix deadlock on hotplug

2017-09-06 Thread Haiyang Zhang


> -Original Message-
> From: Stephen Hemminger [mailto:step...@networkplumber.org]
> Sent: Wednesday, September 6, 2017 11:19 AM
> To: KY Srinivasan ; Haiyang Zhang
> ; Stephen Hemminger 
> Cc: de...@linuxdriverproject.org; netdev@vger.kernel.org
> Subject: [PATCH net-next 1/1] hv_netvsc: fix deadlock on hotplug
> 
> When a virtual device is added dynamically (via host console), then
> the vmbus sends an offer message for the primary channel. The processing
> of this message for networking causes the network device to then
> initialize the sub channels.
> 
> The problem is that setting up the sub channels needs to wait until
> the subsequent subchannel offers have been processed. These offers
> come in on the same ring buffer and work queue as where the primary
> offer is being processed; leading to a deadlock.
> 
> This did not happen in older kernels, because the sub channel waiting
> logic was broken (it wasn't really waiting).
> 
> The solution is to do the sub channel setup in its own work queue
> context that is scheduled by the primary channel setup; and then
> happens later.
> 
> Fixes: 732e49850c5e ("netvsc: fix race on sub channel creation")
> Reported-by: Dexuan Cui 
> Signed-off-by: Stephen Hemminger 
> ---
> Should also go to stable, but this version does not apply cleanly
> to 4.13. Have another patch for that.
> 
>  drivers/net/hyperv/hyperv_net.h   |   1 +
>  drivers/net/hyperv/netvsc_drv.c   |   8 +--
>  drivers/net/hyperv/rndis_filter.c | 106 ++-
> ---
>  3 files changed, 74 insertions(+), 41 deletions(-)

The patch looks good overall. I just have a question:

With this patch, after module load and probe are done, there may still be
subchannels being processed. If we rmmod immediately, the subchannel offers
may hit half-way removed device structures... Do we also need to add
cancel_work_sync(&nvdev->subchan_work) to the top of netvsc_remove()?

unregister_netdevice() includes device close, but it's only called later
in the netvsc_remove() when rndis is already removed.

Thanks,
- Haiyang



Re: [PATCH net] rds: Fix incorrect statistics counting

2017-09-06 Thread Santosh Shilimkar

On 9/6/2017 9:12 AM, Håkon Bugge wrote:



[...]



Hi Santosh,


Yes, I agree with accuracy of s_send_lock_queue_raced. But the main point is 
that the existing code counts some partial share of when it is _not_ raced.

So, in the critical path, my patch adds one test_bit(), which hits the local 
CPU cache, if not raced. If raced, some other thread is in control, so I would 
not think the added cycles would make any big difference.


Cycles added for no good reason is the point.


I can send a v2 where the race tightening is removed if you like.


Yes please.

Regards,
Santosh


Re: [pull request][net-next 0/3] Mellanox, mlx5 GRE tunnel offloads

2017-09-06 Thread Tom Herbert
On Tue, Sep 5, 2017 at 8:06 PM, Alexander Duyck
 wrote:
> On Tue, Sep 5, 2017 at 2:13 PM, Tom Herbert  wrote:
>>> The situation with encapsulation is even more complicated:
>>>
>>> We are basically only interested in the UDP/vxlan/Ethernet/IP/UDP
>>> constellation. If we do the fragmentation inside the vxlan tunnel and
>>> carry over the skb hash to all resulting UDP/vxlan packets source ports,
>>> we are fine and reordering on the receiver NIC won't happen in this
>>> case. If the fragmentation happens on the outer UDP header, this will
>>> result in reordering of the inner L2 flow. Unfortunately this depends on
>>> how the vxlan tunnel was set up, how other devices do that and (I
>>> believe so) on the kernel version.
>>>
>> This really isn't that complicated. The assumption that an IP network
>> always delivers packets in order is simply wrong. The inventors of
>> VXLAN must have known full well that when you use IP, packets can and
>> eventually will be delivered out of order. This isn't just because of
>> fragmentation, there are many other reasons that packets can be
>> delivered OOO. This also must have been known when IP/GRE and any
>> other protocol that carries L2 over IP was invented. If OOO is an
>> issue for these protocols then they need to be fixed-- this is not a
>> concern with IP protocol nor the stack.
>>
>> Tom
>
> As far as a little background on the original patch I believe the
> issue that was fixed by the patch was a video streaming application
> that was sending/receiving a mix of fragmented and non-fragmented
> packets. Receiving them out of order due to the fragmentation was
> causing issues with stutters in the video and so we ended up disabling
> UDP by default in the NICs listed. We decided to go that way as UDP
> RSS was viewed as a performance optimization, while the out-of-order
> problems were viewed as a functionality issue.
>
Hi Alex,

Thanks for the details! Were you able to find the root cause for this?
In particular, it would be interesting to know if it is the kernel or
device that introduced the jitter, or if it's the application that
doesn't handle OOO well...

Tom
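For background on why fragmentation interacts badly with UDP RSS at all: only the first fragment of a datagram carries the UDP header, so a NIC that hashes the 4-tuple for the first fragment and falls back to the 2-tuple for the rest can steer the fragments to different receive queues, which is where the reordering comes from. A toy model (the XOR "hash", addresses, and ports are made up; real NICs use e.g. Toeplitz):

```c
#include <assert.h>
#include <stdint.h>

#define NUM_QUEUES 8

/* Toy RSS: XOR the header fields together and pick a queue.  A later
 * fragment has no L4 header, so the NIC can only hash the 2-tuple,
 * modelled here as sport = dport = 0. */
static unsigned int rx_queue(uint32_t saddr, uint32_t daddr,
			     uint16_t sport, uint16_t dport)
{
	return (saddr ^ daddr ^ sport ^ dport) % NUM_QUEUES;
}
```

This is also why carrying the inner flow hash into the VXLAN UDP source port, as discussed above, keeps one inner flow on one receive queue as long as the outer datagrams are not themselves fragmented.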


Re: [PATCH net] rds: Fix incorrect statistics counting

2017-09-06 Thread Håkon Bugge

> On 6 Sep 2017, at 17:58, Santosh Shilimkar  
> wrote:
> 
> On 9/6/2017 8:29 AM, Håkon Bugge wrote:
>> In rds_send_xmit() there is logic to batch the sends. However, if
>> another thread has acquired the lock, it is considered a race and we
>> yield. The code incrementing the s_send_lock_queue_raced statistics
>> counter did not count this event correctly.
>> This commit removes a small race in determining the race and
>> increments the statistics counter correctly.
>> Signed-off-by: Håkon Bugge 
>> Reviewed-by: Knut Omang 
>> ---
>>  net/rds/send.c | 16 +---
>>  1 file changed, 13 insertions(+), 3 deletions(-)
> Those counters are not really meant to be that accurate, so I
> am not very keen to add additional cycles and code in the send
> paths. Have you seen any real issue, or is this just an
> observation? The s_send_lock_queue_raced counter is never used
> to check for small increments, hence the question.

Hi Santosh,


Yes, I agree with accuracy of s_send_lock_queue_raced. But the main point is 
that the existing code counts some partial share of when it is _not_ raced.

So, in the critical path, my patch adds one test_bit(), which hits the local 
CPU cache, if not raced. If raced, some other thread is in control, so I would 
not think the added cycles would make any big difference.

I can send a v2 where the race tightening is removed if you like.


Thxs, Håkon



Re: [PATCH v2 rfc 8/8] net: dsa: Fix SWITCHDEV_ATTR_ID_PORT_PARENT_ID

2017-09-06 Thread Florian Fainelli
On September 6, 2017 8:08:25 AM PDT, Andrew Lunn  wrote:
>> > Use the MAC address of the master interface as the parent ID. This is
>> > the same for all switches in a cluster, and should be unique if there
>> > are multiple clusters.
>> 
>> That is not correct. Support for multiple CPU ports is coming and in
>> this case, you can have two CPU host interfaces wired to two switch
>> ports of the same tree. So two different master MAC addresses.
>
>Yes, you are correct. I will change this.

A switch cluster should be tied to the same dsa_switch_tree though; can we
use dst->tree as a unique identifier?

-- 
Florian


Re: [PATCH v2 rfc 0/8] IGMP snooping for local traffic

2017-09-06 Thread Florian Fainelli
On September 6, 2017 8:54:33 AM PDT, Roopa Prabhu  
wrote:
>On Tue, Sep 5, 2017 at 5:47 PM, Andrew Lunn  wrote:
>>> The third and last issue will be explained in a followup email.
>>
>> Hi DSA hackers
>>
>> So there is the third issue. It affects just DSA, but it possibly
>> affects all DSA drivers.
>>
>> This patchset broke broadcast with the Marvell drivers. It could
>> break broadcast on other drivers as well.
>>
>> What I found is that the Marvell chips don't flood broadcast frames
>> between bridged ports. What appears to happen is there is an fdb miss,
>> so the frame gets forwarded to the CPU port for the host to deal with.
>> The software bridge then floods it out all ports of the bridge.
>>
>> But the set-offload_fwd_mark patch changes this. The software bridge
>> now assumes the hardware has already flooded broadcast out all ports
>> of the switch as needed. So it does not do any flooding itself. As a
>> result, on Marvell devices, broadcast packets don't get flooded at
>> all.
>>
>> The issue can be fixed. I just need to add an mdb entry for the
>> broadcast address to each port of the bridge in the switch, and the
>> CPU port.  But I don't know at what level to do this.
>>
>> Should this be done at the DSA level, or at the driver level?  Do any
>> chips do broadcast flooding in hardware already, and hence currently
>> see broadcast duplication? If I add a broadcast mdb at the DSA level,
>> and the chip is already hard-wired to flood broadcast, is it going
>> to double flood?
>>
>
>On the switch asics we work with, the driver has information if the
>packet was
>forwarded in hardware. This is per packet reason code telling why the
>CPU is seeing the packet.
>The driver can use this information to reset skb->offload_fwd_mark to
>allow software forward.

I am not positive this is universally available across different switch
vendors. In the Broadcom tag (net/dsa/tag_brcm.c) the reason code definitely
tells you that, but it also largely depends on whether you have configured
SW-managed forwarding or not; that translates into having either the HW do
all the address learning itself, or having SW do it, which is less desirable
since you end up with a possibly huge FDB of 4k entries to manage in SW.


-- 
Florian


Re: [PATCH net-next] xdp: implement xdp_redirect_map for generic XDP

2017-09-06 Thread Jesper Dangaard Brouer
On Wed, 6 Sep 2017 11:44:18 -0400
Andy Gospodarek  wrote:

> On Wed, Sep 6, 2017 at 11:26 AM, Jesper Dangaard Brouer
>  wrote:
> > Using bpf_redirect_map is allowed for generic XDP programs, but the
> > appropriate map lookup was never performed in xdp_do_generic_redirect().
> >
> > Instead the map-index is directly used as the ifindex.  For the
> > xdp_redirect_map sample in SKB-mode '-S', this resulted in trying
> > sending on ifindex 0 which isn't valid, resulting in getting SKB
> > packets dropped.  Thus, the reported performance numbers are wrong in
> > commit 24251c264798 ("samples/bpf: add option for native and skb mode
> > for redirect apps") for the 'xdp_redirect_map -S' case.
> >
> > It might seem innocent this was lacking, but it can actually crash the
> > kernel.  The potential crash is caused by not consuming redirect_info->map.
> > The bpf_redirect_map helper will set this_cpu_ptr(&redirect_info)->map
> > pointer, which will survive even after unloading the xdp bpf_prog and
> > deallocating the devmap data-structure.  This leaves a dead map
> > pointer around.  The kernel will crash when loading the xdp_redirect
> > sample (in native XDP mode) as it doesn't reset map (via bpf_redirect)
> > and returns XDP_REDIRECT, which will cause it to dereference the map
> > pointer.  
> 
> Nice catch!
> 
> Since 'net-next' is closed and this is a bugfix it seems like this is
> a good candidate for 'net' right?

Yes, I know 'net-next' is closed, but 'net' doesn't contain the
XDP_REDIRECT code yet... thus I had to base it on net-next ;-)


> >
> > Fixes: 6103aa96ec07 ("net: implement XDP_REDIRECT for xdp generic")
> > Fixes: 24251c264798 ("samples/bpf: add option for native and skb mode for 
> > redirect apps")
> > Signed-off-by: Jesper Dangaard Brouer   
> 
> Acked-by: Andy Gospodarek 

Thanks
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [PATCH net] rds: Fix incorrect statistics counting

2017-09-06 Thread Santosh Shilimkar

On 9/6/2017 8:29 AM, Håkon Bugge wrote:

In rds_send_xmit() there is logic to batch the sends. However, if
another thread has acquired the lock, it is considered a race and we
yield. The code incrementing the s_send_lock_queue_raced statistics
counter did not count this event correctly.

This commit removes a small race in determining the race and
increments the statistics counter correctly.

Signed-off-by: Håkon Bugge 
Reviewed-by: Knut Omang 
---
  net/rds/send.c | 16 +---
  1 file changed, 13 insertions(+), 3 deletions(-)


Those counters are not really meant to be that accurate, so I
am not very keen to add additional cycles and code in the send
paths. Have you seen any real issue, or is this just an
observation? The s_send_lock_queue_raced counter is never used
to check for small increments, hence the question.

Regards,
Santosh


Re: [PATCH v2 rfc 0/8] IGMP snooping for local traffic

2017-09-06 Thread Roopa Prabhu
On Tue, Sep 5, 2017 at 5:47 PM, Andrew Lunn  wrote:
>> The third and last issue will be explained in a followup email.
>
> Hi DSA hackers
>
> So there is the third issue. It affects just DSA, but it possibly
> affects all DSA drivers.
>
> This patchset broke broadcast with the Marvell drivers. It could
> break broadcast on other drivers as well.
>
> What I found is that the Marvell chips don't flood broadcast frames
> between bridged ports. What appears to happen is there is an fdb miss,
> so the frame gets forwarded to the CPU port for the host to deal with.
> The software bridge then floods it out all ports of the bridge.
>
> But the set-offload_fwd_mark patch changes this. The software bridge
> now assumes the hardware has already flooded broadcast out all ports
> of the switch as needed. So it does not do any flooding itself. As a
> result, on Marvell devices, broadcast packets don't get flooded at
> all.
>
> The issue can be fixed. I just need to add an mdb entry for the
> broadcast address to each port of the bridge in the switch, and the
> CPU port.  But I don't know at what level to do this.
>
> Should this be done at the DSA level, or at the driver level?  Do any
> chips do broadcast flooding in hardware already, and hence currently
> see broadcast duplication? If I add a broadcast mdb at the DSA level,
> and the chip is already hard-wired to flood broadcast, is it going
> to double flood?
>

On the switch ASICs we work with, the driver has information on whether the
packet was forwarded in hardware. This is a per-packet reason code telling
why the CPU is seeing the packet. The driver can use this information to
reset skb->offload_fwd_mark to allow software forwarding.


Re: [PATCH net] Revert "net: phy: Correctly process PHY_HALTED in phy_stop_machine()"

2017-09-06 Thread Mason

On 31/08/2017 21:18, Florian Fainelli wrote:


On 08/31/2017 12:09 PM, Mason wrote:


On 31/08/2017 19:03, Florian Fainelli wrote:


On 08/31/2017 05:29 AM, Marc Gonzalez wrote:


On 31/08/2017 02:49, Florian Fainelli wrote:


The original motivation for this change originated from Marc Gonzalez
indicating that his network driver did not have its adjust_link callback
executing with phydev->link = 0 while he was expecting it.


I expect the core to call phy_adjust_link() for link changes.
This used to work back in 3.4 and was broken somewhere along
the way.


If that was working correctly in 3.4 surely we can look at the diff and
figure out what changed, even maybe find the offending commit, can you
do that?


Bisecting would be a huge pain because my platform was
not upstream until v4.4.


Then just diff the file and try to pinpoint which commit may have
changed that?


Running 'ip link set eth0 down' on the command-line.

In v3.4 => adjust_link() callback is called
In v4.5 => adjust_link() callback is NOT called

$ git log --oneline --no-merges v3.4..v4.5 drivers/net/phy/phy.c | wc -l
59

I'm not sure what "just diff the file" entails.
I can't move 3.4 up, nor move 4.5 down.
I'm not even sure the problem comes from drivers/net/phy/phy.c
to be honest.

Regards.


RE: [PATCH v2 rfc 0/8] IGMP snooping for local traffic

2017-09-06 Thread Woojung.Huh
Andrew,

> What i found is that the Marvell chips don't flood broadcast frames
> between bridged ports. What appears to happen is there is a fdb miss,
> so it gets forwarded to the CPU port for the host to deal with. The
> software bridge when floods it out all ports of the bridge.
Is this with IGMP snooping enabled in the Marvell chip?




Re: hung task in mac80211

2017-09-06 Thread Sebastian Gottschall

I confirm this patch fixes the issue for me as well.

Am 06.09.2017 um 17:04 schrieb Matteo Croce:

On Wed, Sep 6, 2017 at 2:58 PM, Johannes Berg  wrote:

On Wed, 2017-09-06 at 13:57 +0200, Matteo Croce wrote:


I have a hung task on the vanilla 4.13 kernel which I don't have on 4.12.
The problem is present both on my AP and on my notebook,
so it seems to affect both AP and STA mode.
The generated messages are:

INFO: task kworker/u16:6:120 blocked for more than 120 seconds.
   Not tainted 4.13.0 #57
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
kworker/u16:6   D0   120  2 0x
Workqueue: phy0 ieee80211_ba_session_work [mac80211]
Call Trace:
  ? __schedule+0x174/0x5b0
  ? schedule+0x31/0x80
  ? schedule_preempt_disabled+0x9/0x10
  ? __mutex_lock.isra.2+0x163/0x480
  ? select_task_rq_fair+0xb9f/0xc60
  ? __ieee80211_start_rx_ba_session+0x135/0x4d0 [mac80211]
  ? __ieee80211_start_rx_ba_session+0x135/0x4d0 [mac80211]

Yeah - obviously as Stefano found, both take &sta->ampdu_mlme.mtx.

Can you try this?

diff --git a/net/mac80211/agg-rx.c b/net/mac80211/agg-rx.c
index 2b36eff5d97e..d8d32776031e 100644
--- a/net/mac80211/agg-rx.c
+++ b/net/mac80211/agg-rx.c
@@ -245,10 +245,10 @@ static void ieee80211_send_addba_resp(struct ieee80211_sub_if_data *sdata, u8 *d
 ieee80211_tx_skb(sdata, skb);
  }

-void __ieee80211_start_rx_ba_session(struct sta_info *sta,
-u8 dialog_token, u16 timeout,
-u16 start_seq_num, u16 ba_policy, u16 tid,
-u16 buf_size, bool tx, bool auto_seq)
+void ___ieee80211_start_rx_ba_session(struct sta_info *sta,
+ u8 dialog_token, u16 timeout,
+ u16 start_seq_num, u16 ba_policy, u16 tid,
+ u16 buf_size, bool tx, bool auto_seq)
  {
 struct ieee80211_local *local = sta->sdata->local;
 struct tid_ampdu_rx *tid_agg_rx;
@@ -267,7 +267,7 @@ void __ieee80211_start_rx_ba_session(struct sta_info *sta,
 ht_dbg(sta->sdata,
"STA %pM requests BA session on unsupported tid %d\n",
sta->sta.addr, tid);
-   goto end_no_lock;
+   goto end;
 }

 if (!sta->sta.ht_cap.ht_supported) {
@@ -275,14 +275,14 @@ void __ieee80211_start_rx_ba_session(struct sta_info *sta,
"STA %pM erroneously requests BA session on tid %d w/o 
QoS\n",
sta->sta.addr, tid);
 /* send a response anyway, it's an error case if we get here */
-   goto end_no_lock;
+   goto end;
 }

 if (test_sta_flag(sta, WLAN_STA_BLOCK_BA)) {
 ht_dbg(sta->sdata,
"Suspend in progress - Denying ADDBA request (%pM tid 
%d)\n",
sta->sta.addr, tid);
-   goto end_no_lock;
+   goto end;
 }

 /* sanity check for incoming parameters:
@@ -296,7 +296,7 @@ void __ieee80211_start_rx_ba_session(struct sta_info *sta,
 ht_dbg_ratelimited(sta->sdata,
"AddBA Req with bad params from %pM on tid %u. 
policy %d, buffer size %d\n",
sta->sta.addr, tid, ba_policy, buf_size);
-   goto end_no_lock;
+   goto end;
 }
 /* determine default buffer size */
 if (buf_size == 0)
@@ -311,7 +311,6 @@ void __ieee80211_start_rx_ba_session(struct sta_info *sta,
buf_size, sta->sta.addr);

 /* examine state machine */
-   mutex_lock(&sta->ampdu_mlme.mtx);

 if (test_bit(tid, sta->ampdu_mlme.agg_session_valid)) {
 if (sta->ampdu_mlme.tid_rx_token[tid] == dialog_token) {
@@ -415,15 +414,25 @@ void __ieee80211_start_rx_ba_session(struct sta_info *sta,
 __clear_bit(tid, sta->ampdu_mlme.unexpected_agg);
 sta->ampdu_mlme.tid_rx_token[tid] = dialog_token;
 }
-   mutex_unlock(&sta->ampdu_mlme.mtx);

-end_no_lock:
 if (tx)
 ieee80211_send_addba_resp(sta->sdata, sta->sta.addr, tid,
   dialog_token, status, 1, buf_size,
   timeout);
  }

+void __ieee80211_start_rx_ba_session(struct sta_info *sta,
+u8 dialog_token, u16 timeout,
+u16 start_seq_num, u16 ba_policy, u16 tid,
+u16 buf_size, bool tx, bool auto_seq)
+{
+   mutex_lock(&sta->ampdu_mlme.mtx);
+   ___ieee80211_start_rx_ba_session(sta, dialog_token, timeout,
+start_seq_num, ba_policy, tid,
+buf_size, tx, auto_seq);
+   mutex_unlock(&sta->ampdu_mlme.mtx);
+}

Re: VLAN/bridge "compression" in wifi (was: Re: [PATCH 3/8] qtnfmac: implement AP_VLAN iftype support)

2017-09-06 Thread Sergey Matyukevich

Hi Johannes and all,

> > In a way this feature seems mis-designed - you never have 802.1Q tags
> > over the air, but you're inserting them on RX and stripping them on
> > TX, probably in order to make bridging to ethernet easier and not
> > have to have 802.1Q acceleration on the ethernet port, or - well - in
> > order to have an ability to do this with an ethernet card that only
> > has a single CPU port.
> 
> Ok this isn't really right either - it's only for saving the 802.1Q
> acceleration on the Ethernet port, really - and saving the extra
> bridges.
> 
> To clarify, I think what you - conceptually - want is the following
> topology:
> 
> +--- eth0.1  ---  br.1  ---  wlan0.1
> |
> eth0 ---+--- eth0.2  ---  br.2  ---  wlan0.2
> |
> +--- eth0.3  ---  br.3  ---  wlan0.3
> 
> where eth0.N is just "ip link add link eth0 name eth0.N type vlan id N"
> and br.N is obviously a bridge for each, and the wlan0.N are AP_VLAN
> type interfaces that isolate the clients against each other as far as
> wifi is concerned.
> 
> Is this correct? As far as I understand, that's the baseline topology
> that you're trying to achieve, expressed in terms of Linux networking.

That's right. In fact, hostapd is able to create this kind of network
bridge infrastructure automatically when it is built
with CONFIG_FULL_DYNAMIC_VLAN option enabled.

> Now, you seem to want to compress this to
> 
>   +---  wlan0.1
>   |
> eth0  ---  br  ---+---  wlan0.2
>   |
>   +---  wlan0.3
> 
> and have the 802.1Q tag insertion/removal that's normally configured to
> happen in eth0.N already be handled in wlan0.N.
> 
> Also correct?

Exactly. And yes, the only purpose of this 'non-conventional' mode was
to have 802.1Q acceleration on the ethernet port.

> 
> 
> We clearly don't have APIs for this, and I don't think it makes sense
> in the Linux space - the bridge and wlan0.N suddenly have tagged
> traffic rather than untagged, and the VLAN tagging is completely hidden
> from the management view.
> 
> johannes
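The baseline topology Johannes sketches above can be provisioned with plain iproute2, which is roughly what hostapd's CONFIG_FULL_DYNAMIC_VLAN support automates. This is only a sketch; the interface names (eth0, wlan0.N, br.N) are illustrative, and the wlan0.N AP_VLAN interfaces are assumed to already exist:

```shell
# One tagged uplink + one bridge per VLAN ID, mirroring the
# eth0.N --- br.N --- wlan0.N diagram above.
for N in 1 2 3; do
    ip link add link eth0 name "eth0.$N" type vlan id "$N"  # tagged uplink
    ip link add "br.$N" type bridge                         # per-VLAN bridge
    ip link set "eth0.$N" master "br.$N"
    ip link set "wlan0.$N" master "br.$N"                   # AP_VLAN iface
    ip link set "eth0.$N" up
    ip link set "br.$N" up
done
```

The "compressed" variant discussed below collapses the per-VLAN bridges into one, moving the tag insertion/removal into the wlan0.N interfaces themselves.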


Re: [PATCH net-next] xdp: implement xdp_redirect_map for generic XDP

2017-09-06 Thread Andy Gospodarek
On Wed, Sep 6, 2017 at 11:26 AM, Jesper Dangaard Brouer
 wrote:
> Using bpf_redirect_map is allowed for generic XDP programs, but the
> appropriate map lookup was never performed in xdp_do_generic_redirect().
>
> Instead the map-index is directly used as the ifindex.  For the
> xdp_redirect_map sample in SKB-mode '-S', this resulted in trying to
> send on ifindex 0, which isn't valid, so the SKB packets were
> dropped.  Thus, the reported performance numbers are wrong in
> commit 24251c264798 ("samples/bpf: add option for native and skb mode
> for redirect apps") for the 'xdp_redirect_map -S' case.
>
> It might seem innocent this was lacking, but it can actually crash the
> kernel.  The potential crash is caused by not consuming redirect_info->map.
> The bpf_redirect_map helper will set this_cpu_ptr(&redirect_info)->map
> pointer, which will survive even after unloading the xdp bpf_prog and
> deallocating the devmap data-structure.  This leaves a dead map
> pointer around.  The kernel will crash when loading the xdp_redirect
> sample (in native XDP mode) as it doesn't reset map (via bpf_redirect)
> and returns XDP_REDIRECT, which will cause it to dereference the map
> pointer.

Nice catch!

Since 'net-next' is closed and this is a bugfix, it seems like this is
a good candidate for 'net', right?

>
> Fixes: 6103aa96ec07 ("net: implement XDP_REDIRECT for xdp generic")
> Fixes: 24251c264798 ("samples/bpf: add option for native and skb mode for redirect apps")
> Signed-off-by: Jesper Dangaard Brouer 

Acked-by: Andy Gospodarek 


> ---
>  net/core/filter.c |   29 +
>  1 file changed, 29 insertions(+)
>
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 5912c738a7b2..6a4745bf2c9f 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -2562,6 +2562,32 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
>  }
>  EXPORT_SYMBOL_GPL(xdp_do_redirect);
>
> +static int xdp_do_generic_redirect_map(struct net_device *dev,
> +  struct sk_buff *skb,
> +  struct bpf_prog *xdp_prog)
> +{
> +   struct redirect_info *ri = this_cpu_ptr(&redirect_info);
> +   struct bpf_map *map = ri->map;
> +   u32 index = ri->ifindex;
> +   struct net_device *fwd;
> +   int err;
> +
> +   ri->ifindex = 0;
> +   ri->map = NULL;
> +
> +   fwd = __dev_map_lookup_elem(map, index);
> +   if (!fwd) {
> +   err = -EINVAL;
> +   goto err;
> +   }
> +   skb->dev = fwd;
> +   _trace_xdp_redirect_map(dev, xdp_prog, fwd, map, index);
> +   return 0;
> +err:
> +   _trace_xdp_redirect_map_err(dev, xdp_prog, fwd, map, index, err);
> +   return err;
> +}
> +
>  int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb,
> struct bpf_prog *xdp_prog)
>  {
> @@ -2571,6 +2597,9 @@ int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb,
> unsigned int len;
> int err = 0;
>
> +   if (ri->map)
> +   return xdp_do_generic_redirect_map(dev, skb, xdp_prog);
> +
> fwd = dev_get_by_index_rcu(dev_net(dev), index);
> ri->ifindex = 0;
> if (unlikely(!fwd)) {
>


Re: [PATCH v2 rfc 4/8] net: dsa: slave: Handle switchdev host mdb add/del

2017-09-06 Thread Vivien Didelot
Hi Andrew,

Andrew Lunn  writes:

> Add code to handle switchdev host mdb add/del. As with normal mdb
> add/del, send a notification to the switch layer.
>
> Signed-off-by: Andrew Lunn 
> ---
>  net/dsa/dsa_priv.h |  7 +++
>  net/dsa/port.c | 26 ++
>  net/dsa/slave.c|  6 ++
>  3 files changed, 39 insertions(+)
>
> diff --git a/net/dsa/dsa_priv.h b/net/dsa/dsa_priv.h
> index 9c3eeb72462d..0ffe49f78d14 100644
> --- a/net/dsa/dsa_priv.h
> +++ b/net/dsa/dsa_priv.h
> @@ -24,6 +24,8 @@ enum {
>   DSA_NOTIFIER_FDB_DEL,
>   DSA_NOTIFIER_MDB_ADD,
>   DSA_NOTIFIER_MDB_DEL,
> + DSA_NOTIFIER_HOST_MDB_ADD,
> + DSA_NOTIFIER_HOST_MDB_DEL,
>   DSA_NOTIFIER_VLAN_ADD,
>   DSA_NOTIFIER_VLAN_DEL,
>  };
> @@ -131,6 +133,11 @@ int dsa_port_mdb_add(struct dsa_port *dp,
>struct switchdev_trans *trans);
>  int dsa_port_mdb_del(struct dsa_port *dp,
>const struct switchdev_obj_port_mdb *mdb);
> +int dsa_host_mdb_add(struct dsa_port *dp,
> +  const struct switchdev_obj_port_mdb *mdb,
> +  struct switchdev_trans *trans);
> +int dsa_host_mdb_del(struct dsa_port *dp,
> +  const struct switchdev_obj_port_mdb *mdb);
>  int dsa_port_vlan_add(struct dsa_port *dp,
> const struct switchdev_obj_port_vlan *vlan,
> struct switchdev_trans *trans);
> diff --git a/net/dsa/port.c b/net/dsa/port.c
> index 659676ba3f8b..5b18b9fe2219 100644
> --- a/net/dsa/port.c
> +++ b/net/dsa/port.c
> @@ -199,6 +199,32 @@ int dsa_port_mdb_del(struct dsa_port *dp,
>   return dsa_port_notify(dp, DSA_NOTIFIER_MDB_DEL, &info);
>  }
>  
> +int dsa_host_mdb_add(struct dsa_port *dp,
> +  const struct switchdev_obj_port_mdb *mdb,
> +  struct switchdev_trans *trans)
> +{
> + struct dsa_notifier_mdb_info info = {
> + .sw_index = dp->ds->index,
> + .port = dp->index,
> + .trans = trans,
> + .mdb = mdb,
> + };
> +
> + return dsa_port_notify(dp, DSA_NOTIFIER_HOST_MDB_ADD, &info);
> +}
> +
> +int dsa_host_mdb_del(struct dsa_port *dp,
> +  const struct switchdev_obj_port_mdb *mdb)
> +{
> + struct dsa_notifier_mdb_info info = {
> + .sw_index = dp->ds->index,
> + .port = dp->index,
> + .mdb = mdb,
> + };
> +
> + return dsa_port_notify(dp, DSA_NOTIFIER_HOST_MDB_DEL, &info);
> +}
> +
>  int dsa_port_vlan_add(struct dsa_port *dp,
> const struct switchdev_obj_port_vlan *vlan,
> struct switchdev_trans *trans)
> diff --git a/net/dsa/slave.c b/net/dsa/slave.c
> index 78e78a6e6833..2e07be149415 100644
> --- a/net/dsa/slave.c
> +++ b/net/dsa/slave.c
> @@ -330,6 +330,9 @@ static int dsa_slave_port_obj_add(struct net_device *dev,
>   case SWITCHDEV_OBJ_ID_PORT_MDB:
>   err = dsa_port_mdb_add(dp, SWITCHDEV_OBJ_PORT_MDB(obj), trans);
>   break;
> + case SWITCHDEV_OBJ_ID_HOST_MDB:
> + err = dsa_host_mdb_add(dp, SWITCHDEV_OBJ_PORT_MDB(obj), trans);
> + break;

If SWITCHDEV_OBJ_ID_HOST_MDB is really necessary, 

case SWITCHDEV_OBJ_ID_HOST_MDB:
err = dsa_port_mdb_add(dp->cpu_dp, SWITCHDEV_OBJ_PORT_MDB(obj), trans);
break;

should be enough. DSA_NOTIFIER_HOST_MDB_* are not necessary.


Thanks,

Vivien


Fw: [Bug 196839] New: use_time of IPsec policy is updated even when receiving error packets.

2017-09-06 Thread Stephen Hemminger


Begin forwarded message:

Date: Wed, 06 Sep 2017 10:18:33 +
From: bugzilla-dae...@bugzilla.kernel.org
To: step...@networkplumber.org
Subject: [Bug 196839] New: use_time of IPsec policy is updated even when receiving error packets.


https://bugzilla.kernel.org/show_bug.cgi?id=196839

Bug ID: 196839
   Summary: use_time of IPsec policy is updated even when
receiving error packets.
   Product: Networking
   Version: 2.5
Kernel Version: 4.8.0
  Hardware: Intel
OS: Linux
  Tree: Mainline
Status: NEW
  Severity: normal
  Priority: P1
 Component: Other
  Assignee: step...@networkplumber.org
  Reporter: cche...@gmail.com
Regression: No

Normally the use_time of a policy in the SPD is updated when the policy is
matched by incoming or outgoing IP packets. For a protect policy, it is
updated when ESP/AH packets are sent or received.

The use_time of the SPD IN policy is used by IKE implementations like
strongSwan to check whether there is inbound traffic, and thus to determine
whether it is necessary to send a DPD (dead peer detection, RFC 3706) request
to check the liveness of the IPsec peer.

In case an unprotected packet is received but matches the IPsec SPD IN protect
policy, the packet will be discarded and the error counter XfrmInTmplMismatch
in /proc/net/xfrm_stat is incremented.

But in such an error/malicious case, the use_time of the SPD IN policy is also
updated. This causes strongSwan to mistakenly conclude that the policy is in
use and not to trigger a DPD request even when it should.

In short, this is a security hole in the kernel and could lead to a DoS attack
on an IPsec gateway running on Linux.

-- 
You are receiving this mail because:
You are the assignee for the bug.


[PATCH net] rds: Fix incorrect statistics counting

2017-09-06 Thread Håkon Bugge
In rds_send_xmit() there is logic to batch the sends. However, if
another thread has acquired the lock, it is considered a race and we
yield. The code incrementing the s_send_lock_queue_raced statistics
counter did not count this event correctly.

This commit removes a small race in determining the race and
increments the statistics counter correctly.

Signed-off-by: Håkon Bugge 
Reviewed-by: Knut Omang 
---
 net/rds/send.c | 16 +---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/net/rds/send.c b/net/rds/send.c
index 058a407..ecfe0b5 100644
--- a/net/rds/send.c
+++ b/net/rds/send.c
@@ -101,6 +101,11 @@ void rds_send_path_reset(struct rds_conn_path *cp)
 }
 EXPORT_SYMBOL_GPL(rds_send_path_reset);
 
+static bool someone_in_xmit(struct rds_conn_path *cp)
+{
+   return test_bit(RDS_IN_XMIT, &cp->cp_flags);
+}
+
 static int acquire_in_xmit(struct rds_conn_path *cp)
 {
return test_and_set_bit(RDS_IN_XMIT, &cp->cp_flags) == 0;
@@ -428,14 +433,19 @@ int rds_send_xmit(struct rds_conn_path *cp)
 * some work and we will skip our goto
 */
if (ret == 0) {
+   bool raced;
+
smp_mb();
+   raced = someone_in_xmit(cp) ||
+   send_gen != READ_ONCE(cp->cp_send_gen);
+
if ((test_bit(0, &conn->c_map_queued) ||
-!list_empty(&cp->cp_send_queue)) &&
-   send_gen == READ_ONCE(cp->cp_send_gen)) {
-   rds_stats_inc(s_send_lock_queue_raced);
+   !list_empty(&cp->cp_send_queue)) && !raced) {
if (batch_count < send_batch_count)
goto restart;
queue_delayed_work(rds_wq, &cp->cp_send_w, 1);
+   } else if (raced) {
+   rds_stats_inc(s_send_lock_queue_raced);
}
}
 out:
-- 
2.9.3



Re: [PATCH v2 rfc 0/8] IGMP snooping for local traffic

2017-09-06 Thread Vivien Didelot
Hi Andrew, Nikolay,

Andrew Lunn  writes:

> Then starts the work passing down to the hardware that the host has
> joined/left a group. The existing switchdev mdb object cannot be used,
> since the semantics are different. The existing
> SWITCHDEV_OBJ_ID_PORT_MDB is used to indicate a specific multicast
> group should be forwarded out that port of the switch. However here we
> require the exact opposite. We want multicast frames for the group
> received on the port to the forwarded to the host. Hence add a new
> object SWITCHDEV_OBJ_ID_HOST_MDB, a multicast database entry to
> forward to the host. This new object is then propagated through the
> DSA layers. No DSA driver changes should be needed, this should just
> work...

I'm not sure if you already explained that, if so, sorry in advance.

I don't understand why SWITCHDEV_OBJ_ID_PORT_MDB cannot be used. Isn't
setting the obj->orig_dev to the bridge device itself enough to
distinguish br0 from a switch port?

This way, the only change necessary in net/dsa/slave.c is something
like (abbreviated):

static int dsa_slave_port_obj_add(struct net_device *dev,
const struct switchdev_obj *obj,
struct switchdev_trans *trans)
{
struct dsa_slave_priv *p = netdev_priv(dev);
struct dsa_port *port = p->dp;

/* Is the target port the bridge device itself? */
if (obj->orig_dev == port->br)
port = port->cpu_dp;

return dsa_port_mdb_add(port, obj, trans);
}

The main problem is that we will soon want support for multiple CPU
ports. This means that each DSA slave will have its dedicated CPU port,
instead of having only one for the whole switch tree.

So adding the MDB entry to all CPU ports as you did in a patch 5/8 in
dsa_switch_host_mdb_*() is not correct because you can have CPU port
sw0p0 being a member of br0 and CPU port sw0p10 being a member of br1.

So is it correct to simply notify SWITCHDEV_OBJ_ID_PORT_MDB with
orig_dev = br->dev so that its members get recursively notified?

Even if SWITCHDEV_OBJ_ID_HOST_MDB is necessary, we need to handle it the
way I described, otherwise we don't have a correct mapping between a
slave port and its CPU port that we need to configure.


Thanks,

Vivien

