Suggestions for new Ethernet driver?

2016-10-24 Thread David VomLehn
I'm working with Aquantia to add a new 2.5/5 Gbps Ethernet driver to the
kernel. It looks like it's going to be one of the biggest drivers in
drivers/net/ethernet. The team that developed the driver is new to
kernel development processes, but is working on making it
checkpatch-clean and on addressing sparse issues. Right now we're
splitting the code into small chunks for review. The sequence of patches
first targets basic functionality, then adds one feature at a time. Still,
it's going to be a lot of work to review.

Aquantia is committed to doing the work to add this to the mainline
kernel but it's clearly going to be a substantial amount of work not
only for them, but for reviewers. So, my question: what can we do to
make this process easy for the networking community in addition to the
basics that are already under way? I welcome any and all suggestions.

Thanks!
--
David VL



[PATCH net-next] ibmveth: calculate correct gso_size and set gso_type

2016-10-24 Thread Jon Maxwell
We recently encountered a bug where a few customers using ibmveth on the
same LPAR hit an issue where a TCP session hung when large receive was
enabled. Closer analysis revealed that the session was stuck because
one side was advertising a zero window repeatedly.

We narrowed this down to the fact that the ibmveth driver did not set
gso_size, which is translated by TCP into the MSS later up the stack.
Because the MSS is used to calculate the TCP window size and was
abnormally large, a zero window was being calculated even though the
socket's receive buffer was completely empty.

We were able to reproduce this and worked with IBM to fix it. Thanks to
Tom and Marcelo for all their help and review on this.

The patch fixes both our internal reproduction tests and our customers' tests.

Signed-off-by: Jon Maxwell 
---
 drivers/net/ethernet/ibm/ibmveth.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/drivers/net/ethernet/ibm/ibmveth.c 
b/drivers/net/ethernet/ibm/ibmveth.c
index 29c05d0..3028c33 100644
--- a/drivers/net/ethernet/ibm/ibmveth.c
+++ b/drivers/net/ethernet/ibm/ibmveth.c
@@ -1182,6 +1182,8 @@ static int ibmveth_poll(struct napi_struct *napi, int budget)
 	int frames_processed = 0;
 	unsigned long lpar_rc;
 	struct iphdr *iph;
+	bool large_packet = 0;
+	u16 hdr_len = ETH_HLEN + sizeof(struct tcphdr);
 
 restart_poll:
 	while (frames_processed < budget) {
@@ -1236,10 +1238,27 @@ static int ibmveth_poll(struct napi_struct *napi, int budget)
 					iph->check = 0;
 					iph->check = ip_fast_csum((unsigned char *)iph, iph->ihl);
 					adapter->rx_large_packets++;
+					large_packet = 1;
 				}
 			}
 		}
 
+		if (skb->len > netdev->mtu) {
+			iph = (struct iphdr *)skb->data;
+			if (be16_to_cpu(skb->protocol) == ETH_P_IP &&
+			    iph->protocol == IPPROTO_TCP) {
+				hdr_len += sizeof(struct iphdr);
+				skb_shinfo(skb)->gso_type = SKB_GSO_TCPV4;
+				skb_shinfo(skb)->gso_size = netdev->mtu - hdr_len;
+			} else if (be16_to_cpu(skb->protocol) == ETH_P_IPV6 &&
+				   iph->protocol == IPPROTO_TCP) {
+				hdr_len += sizeof(struct ipv6hdr);
+				skb_shinfo(skb)->gso_type = SKB_GSO_TCPV6;
+				skb_shinfo(skb)->gso_size = netdev->mtu - hdr_len;
+			}
+			if (!large_packet)
+				adapter->rx_large_packets++;
+		}
+
 		napi_gro_receive(napi, skb);	/* send it up */
 
 		netdev->stats.rx_packets++;
-- 
1.8.3.1



pull request (net-next): ipsec-next 2016-10-25

2016-10-24 Thread Steffen Klassert
Just a leftover from the last development cycle.

1) Remove some unused code, from Florian Westphal.

Please pull or let me know if there are problems.

Thanks!

The following changes since commit 31fbe81fe3426dfb7f8056a7f5106c6b1841a9aa:

  Merge branch 'qcom-emac-acpi' (2016-09-29 01:50:20 -0400)

are available in the git repository at:


  git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next.git master

for you to fetch changes up to 2258d927a691ddd2ab585adb17ea9f96e89d0638:

  xfrm: remove unused helper (2016-09-30 08:20:56 +0200)


Florian Westphal (1):
  xfrm: remove unused helper

 net/xfrm/xfrm_state.c | 8 --------
 1 file changed, 8 deletions(-)


[PATCH] xfrm: remove unused helper

2016-10-24 Thread Steffen Klassert
From: Florian Westphal 

Not used anymore since 2009 (9e0d57fd6dad37,
'xfrm: SAD entries do not expire correctly after suspend-resume').

Signed-off-by: Florian Westphal 
Signed-off-by: Steffen Klassert 
---
 net/xfrm/xfrm_state.c | 8 --------
 1 file changed, 8 deletions(-)

diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index 419bf5d..45cb7c6 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -388,14 +388,6 @@ static void xfrm_state_gc_task(struct work_struct *work)
xfrm_state_gc_destroy(x);
 }
 
-static inline unsigned long make_jiffies(long secs)
-{
-   if (secs >= (MAX_SCHEDULE_TIMEOUT-1)/HZ)
-   return MAX_SCHEDULE_TIMEOUT-1;
-   else
-   return secs*HZ;
-}
-
 static enum hrtimer_restart xfrm_timer_handler(struct hrtimer *me)
 {
	struct tasklet_hrtimer *thr = container_of(me, struct tasklet_hrtimer, timer);
-- 
1.9.1



Re: [PATCH net-next] ethernet: fix min/max MTU typos

2016-10-24 Thread Jarod Wilson
On Mon, Oct 24, 2016 at 02:42:26PM +0200, Stefan Richter wrote:
> Fixes: d894be57ca92 ('ethernet: use net core MTU range checking in more drivers')
> CC: Jarod Wilson 
> CC: Thomas Falcon 
> Signed-off-by: Stefan Richter 

Wuf. Thank you, Stefan. Way too many bleeding eyeball hours staring at all
those changes.

Acked-by: Jarod Wilson 

-- 
Jarod Wilson
ja...@redhat.com



Re: [PATCH net-next 2/2 v2] firewire: net: set initial MTU = 1500 unconditionally, fix IPv6 on some CardBus cards

2016-10-24 Thread Jarod Wilson
On Mon, Oct 24, 2016 at 02:26:13PM +0200, Stefan Richter wrote:
> firewire-net, like the older eth1394 driver, reduced the initial MTU to
> less than 1500 octets if the local link layer controller's asynchronous
> packet reception limit was lower.
> 
> This is bogus, since this reception limit does not have anything to do
> with the transmission limit.  Neither did this reduction affect the TX
> path positively, nor could it prevent link fragmentation at the RX path.
> 
> Many FireWire CardBus cards have a max_rec of 9, causing an initial MTU
> of 1024 - 16 = 1008.  RFC 2734 and RFC 3146 allow a minimum max_rec = 8,
> which would result in an initial MTU of 512 - 16 = 496.  On such cards,
> IPv6 could only be employed if the MTU was manually increased to 1280 or
> more, i.e. IPv6 would not work without intervention from userland.
> 
> We now always initialize the MTU to 1500, which is the default according
> to RFC 2734 and RFC 3146.
> 
> On a VIA VT6316 based CardBus card which was affected by this, changing
> the MTU from 1008 to 1500 also increases TX bandwidth by 6%.
> RX remains unaffected.
> 
> CC: netdev@vger.kernel.org
> CC: linux1394-de...@lists.sourceforge.net
> CC: Jarod Wilson 
> Signed-off-by: Stefan Richter 
> ---
> v2: use ETH_DATA_LEN, add comment

Acked-by: Jarod Wilson 

-- 
Jarod Wilson
ja...@redhat.com



[RFC PATCH net-next] net: ethtool: add support for forward error correction modes

2016-10-24 Thread Vidya Sagar Ravipati
From: Vidya Sagar Ravipati 

Forward Error Correction (FEC) modes, i.e. Base-R
and Reed-Solomon modes, are introduced in the 25G/40G/100G standards
to provide a good BER at high speeds.
Various networking devices which support 25G/40G/100G provide the ability
to manage supported FEC modes, and the lack of FEC encoding control and
reporting today is a source of interoperability issues for many vendors.
FEC capability, as well as a specific FEC mode, i.e. Base-R
or RS mode, can be requested or advertised through bits D44:47 of the base link
codeword.

This patch set intends to provide an option under ethtool to manage and report
FEC encoding settings for networking devices, as per the IEEE 802.3 bj, bm and by
specs.

The set-fec/show-fec option(s) are designed to provide control over and
reporting of the FEC encoding on the link.

SET FEC option:
root@tor: ethtool --set-fec swp1 encoding [off | RS | BaseR | auto] autoneg [off | on]

Encoding: Types of encoding
Off    :  Turn off any encoding
RS     :  Enforce RS-FEC encoding on supported speeds
BaseR  :  Enforce Base-R encoding on supported speeds
Auto   :  Default FEC settings for drivers; represents
          asking the hardware to essentially go into a best-effort mode.

Here are a few examples of what we would expect if encoding=auto:
- if autoneg is on, we expect FEC to be negotiated as on or off,
  as long as the protocol supports it
- if the hardware is capable of detecting the FEC encoding on its
  receiver, it will reconfigure its encoder to match
- in the absence of the above, the configuration would be set to IEEE
  defaults.

From our understanding, this is essentially what most hardware/driver
combinations are doing today in the absence of a way for users to
control the behavior.

SHOW FEC option:
root@tor: ethtool --show-fec  swp1
FEC parameters for swp1:
Autonegotiate:  off
FEC encodings:  RS

ETHTOOL DEVNAME output modification:

ethtool devname output:
root@tor:~# ethtool swp1
Settings for swp1:
root@hpe-7712-03:~# ethtool swp18
Settings for swp18:
Supported ports: [ FIBRE ]
Supported link modes:   40000baseCR4/Full
                        40000baseSR4/Full
                        40000baseLR4/Full
                        100000baseSR4/Full
                        100000baseCR4/Full
                        100000baseLR4_ER4/Full
Supported pause frame use: No
Supports auto-negotiation: Yes
Supported FEC modes: [RS | BaseR | None | Not reported]
Advertised link modes:  Not reported
Advertised pause frame use: No
Advertised auto-negotiation: No
Advertised FEC modes: [RS | BaseR | None | Not reported]
 One or more FEC modes
Speed: 100000Mb/s
Duplex: Full
Port: FIBRE
PHYAD: 106
Transceiver: internal
Auto-negotiation: off
Link detected: yes

This patch includes the following changes:
a) A new ETHTOOL_GFECPARAM/SFECPARAM API, handled by
  the new get_fecparam/set_fecparam callbacks, provides support
  for configuration of forward error correction modes.
b) Link mode bits for FEC modes, i.e. None (no FEC mode), RS and BaseR/FC,
  are defined so that users can configure these FEC modes for the supported
  and advertising fields as part of link autonegotiation.

Signed-off-by: Vidya Sagar Ravipati 
---
 include/linux/ethtool.h  |  4 
 include/uapi/linux/ethtool.h | 53 ++--
 net/core/ethtool.c   | 34 
 3 files changed, 89 insertions(+), 2 deletions(-)

diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
index 9ded8c6..79a0bab 100644
--- a/include/linux/ethtool.h
+++ b/include/linux/ethtool.h
@@ -372,5 +372,9 @@ struct ethtool_ops {
  struct ethtool_link_ksettings *);
int (*set_link_ksettings)(struct net_device *,
  const struct ethtool_link_ksettings *);
+   int (*get_fecparam)(struct net_device *,
+ struct ethtool_fecparam *);
+   int (*set_fecparam)(struct net_device *,
+ struct ethtool_fecparam *);
 };
 #endif /* _LINUX_ETHTOOL_H */
diff --git a/include/uapi/linux/ethtool.h b/include/uapi/linux/ethtool.h
index 099a420..28ea382 100644
--- a/include/uapi/linux/ethtool.h
+++ b/include/uapi/linux/ethtool.h
@@ -1224,6 +1224,51 @@ struct ethtool_per_queue_op {
	char	data[];
 };
 
+/**
+ * struct ethtool_fecparam - Ethernet forward error correction (FEC) parameters
+ * @cmd: Command number = %ETHTOOL_GFECPARAM or %ETHTOOL_SFECPARAM
+ * @autoneg: Flag to enable autonegotiation of FEC modes (RS, BaseR)
+ *  (D44:47 of the base link code word)
+ * @fec: Bitmask of supported FEC modes
+ * @rsvd: Reserved for future extensions, e.g. a FEC bypass feature.
+ *
+ * Drivers should reject a non-zero setting of @autoneg when
+ * autonegotiation is disabled (or not supported) for the

RE: [PATCH] net: fec: hard phy reset on open

2016-10-24 Thread Andy Duan
From: Manfred Schlaegl  Sent: Monday, October 24, 2016 10:43 PM
> To: Andy Duan 
> Cc: netdev@vger.kernel.org; linux-ker...@vger.kernel.org
> Subject: Re: [PATCH] net: fec: hard phy reset on open
> 
> On 2016-10-24 16:03, Andy Duan wrote:
> > From: manfred.schla...@gmx.at   Sent: Monday, October 24, 2016 5:26 PM
> >> To: Andy Duan 
> >> Cc: netdev@vger.kernel.org; linux-ker...@vger.kernel.org
> >> Subject: [PATCH] net: fec: hard phy reset on open
> >>
> >> We have seen some problems with auto negotiation on i.MX6 using
> >> LAN8720, after interface down/up.
> >>
> >> In our configuration, the ptp clock is used externally as reference
> >> clock for the phy. Some phys (e.g. LAN8720) needs a stable clock
> >> while and after hard reset.
> >> Before this patch, the driver disabled the clock on close but did no
> >> hard reset on open, after enabling the clocks again.
> >>
> >> A solution that prevents disabling the clocks on close was
> >> considered, but discarded because of bad power saving behavior.
> >>
> >> This patch saves the reset dt properties on probe and does a reset on
> >> every open after the clocks were enabled, to make sure the clock is
> >> stable during and after hard reset.
> >>
> >> Tested on i.MX6 and i.MX28, both using LAN8720.
> >>
> >> Signed-off-by: Manfred Schlaegl 
> >> ---
> 
> > This patch does a hard reset to let the phy stabilize.
> 
> > Firstly, you should do the reset before enabling the clock.
> I have to disagree here.
> The phy demands (datasheet + tests) a stable clock at the time of a
> (hard-)reset and after it. Therefore the clock has to be enabled before
> the hard reset.
> (This is exactly the reason for the patch.)
> 
> Generally: The point of a reset is to defer the start of a digital circuit
> until the environment (power, clocks, ...) has stabilized.
> 
> Furthermore: Before this patch the hard reset was done in fec_probe. And
> here also after the clocks were enabled.
> 
> What was your argument for doing it the other way in this special case?
> 
I checked some different vendors' PHYs; they assert hard reset after the
clock is stable. But I'm still not sure that all PHYs behave this way.

> > Secondly, we suggest doing the phy reset in the phy driver, not the MAC driver.
> I was not sure, if you meant a soft-, or hard-reset here.
> 
> In case you are talking about soft reset:
> Yes, the phy drivers perform a soft reset. Sadly, a soft reset is not
> sufficient in this case - the phy recovers from a lost clock only on a
> hard reset. (datasheet + tests)
> 
> In case you're talking about hard reset:
> Intuitively I would also think that the (hard-)reset should be handled by the
> phy driver, but this is not the reality in the given implementations.
> 
> Documentation/devicetree/bindings/net/fsl-fec.txt says
> 
> - phy-reset-gpios : Should specify the gpio for phy reset
> 
> It explicitly talks about phy-reset here. And the (hard-)reset was handled
> by the fec driver also before this patch (once, on probe).
> 
I suggest doing the phy hard reset in the phy driver, like
drivers/net/phy/spi_ks8995.c does,
or via Uwe Kleine-König's patch "phy: add support for a reset-gpio specification"
(I don't know why that patch is reverted now.)

Regards,
Andy
> >
> > Regards,
> > Andy
> 
> Thanks for your feedback!
> 
> Best regards,
> Manfred
> 
> 
> 
> >
> >>  drivers/net/ethernet/freescale/fec.h      |  4 ++++
> >>  drivers/net/ethernet/freescale/fec_main.c | 77 ++++++++++++++----------
> >>  2 files changed, 47 insertions(+), 34 deletions(-)
> >>
> >> diff --git a/drivers/net/ethernet/freescale/fec.h
> >> b/drivers/net/ethernet/freescale/fec.h
> >> index c865135..379e619 100644
> >> --- a/drivers/net/ethernet/freescale/fec.h
> >> +++ b/drivers/net/ethernet/freescale/fec.h
> >> @@ -498,6 +498,10 @@ struct fec_enet_private {
> >>struct clk *clk_enet_out;
> >>struct clk *clk_ptp;
> >>
> >> +  int phy_reset;
> >> +  bool phy_reset_active_high;
> >> +  int phy_reset_msec;
> >> +
> >>bool ptp_clk_on;
> >>struct mutex ptp_clk_mutex;
> >>unsigned int num_tx_queues;
> >> diff --git a/drivers/net/ethernet/freescale/fec_main.c
> >> b/drivers/net/ethernet/freescale/fec_main.c
> >> index 48a033e..8cc1ec5 100644
> >> --- a/drivers/net/ethernet/freescale/fec_main.c
> >> +++ b/drivers/net/ethernet/freescale/fec_main.c
> >> @@ -2802,6 +2802,22 @@ static int fec_enet_alloc_buffers(struct
> >> net_device *ndev)
> >>return 0;
> >>  }
> >>
> >> +static void fec_reset_phy(struct fec_enet_private *fep)
> >> +{
> >> +	if (!gpio_is_valid(fep->phy_reset))
> >> +		return;
> >> +
> >> +	gpio_set_value_cansleep(fep->phy_reset, !!fep->phy_reset_active_high);
> >> +
> >> +	if (fep->phy_reset_msec > 20)
> >> +		msleep(fep->phy_reset_msec);
> >> +	else
> >> +		usleep_range(fep->phy_reset_msec * 1000,
> >> +			     fep->phy_reset_msec * 1000 + 1000);
> >> +

[PATCH v2] net: skip generating uevents for network namespaces that are exiting

2016-10-24 Thread Andrei Vagin
No one can see these events, because a network namespace cannot be
destroyed while it still has sockets.

Unlike other devices, uevents for network devices are generated
only inside their network namespaces. They are filtered out in
kobj_bcast_filter().

My experiments show that network namespaces are destroyed more than 30%
faster with this optimization.

Here is a perf output for destroying network namespaces without this
patch.

-   94.76% 0.02%  kworker/u48:1  [kernel.kallsyms] [k] cleanup_net
   - 94.74% cleanup_net
  - 94.64% ops_exit_list.isra.4
 - 41.61% default_device_exit_batch
- 41.47% unregister_netdevice_many
   - rollback_registered_many
  - 40.36% netdev_unregister_kobject
 - 14.55% device_del
+ 13.71% kobject_uevent
 - 13.04% netdev_queue_update_kobjects
+ 12.96% kobject_put
 - 12.72% net_rx_queue_update_kobjects
  kobject_put
- kobject_release
   + 12.69% kobject_uevent
  + 0.80% call_netdevice_notifiers_info
 + 19.57% nfsd_exit_net
 + 11.15% tcp_net_metrics_exit
 + 8.25% rpcsec_gss_exit_net

It's critical to optimize the exit path for network namespaces,
because they are destroyed under net_mutex and many namespaces can be
destroyed in one iteration.

v2: use dev_set_uevent_suppress()

Cc: Cong Wang 
Cc: "David S. Miller" 
Cc: Eric W. Biederman 
Signed-off-by: Andrei Vagin 
---
 net/core/net-sysfs.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 6e4f347..d4fe286 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -950,10 +950,13 @@ net_rx_queue_update_kobjects(struct net_device *dev, int old_num, int new_num)
 	}
 
 	while (--i >= new_num) {
+		struct kobject *kobj = &dev->_rx[i].kobj;
+
+		if (!list_empty(&dev_net(dev)->exit_list))
+			kobj->uevent_suppress = 1;
 		if (dev->sysfs_rx_queue_group)
-			sysfs_remove_group(&dev->_rx[i].kobj,
-					   dev->sysfs_rx_queue_group);
-		kobject_put(&dev->_rx[i].kobj);
+			sysfs_remove_group(kobj, dev->sysfs_rx_queue_group);
+		kobject_put(kobj);
 	}
 
 	return error;
@@ -1340,6 +1343,8 @@ netdev_queue_update_kobjects(struct net_device *dev, int old_num, int new_num)
 	while (--i >= new_num) {
 		struct netdev_queue *queue = dev->_tx + i;
 
+		if (!list_empty(&dev_net(dev)->exit_list))
+			queue->kobj.uevent_suppress = 1;
 #ifdef CONFIG_BQL
 		sysfs_remove_group(&queue->kobj, &dql_group);
 #endif
@@ -1525,6 +1530,9 @@ void netdev_unregister_kobject(struct net_device *ndev)
 {
 	struct device *dev = &(ndev->dev);
 
+	if (!list_empty(&dev_net(ndev)->exit_list))
+		dev_set_uevent_suppress(dev, 1);
+
 	kobject_get(&dev->kobj);
 
 	remove_queue_kobjects(ndev);
-- 
2.7.4



[PATCH net-next] net: add an ioctl to get a socket network namespace

2016-10-24 Thread Andrei Vagin
From: Andrey Vagin 

Each socket operates in the network namespace where it was created,
so if we want to dump and restore a socket, we have to know its network
namespace.

We have socket_diag to get information about sockets, but it doesn't
report sockets which are not bound or connected.

This patch introduces a new socket ioctl, called SIOCGSKNS, which is
used to get a file descriptor referring to a socket's network namespace.

A task must have CAP_NET_ADMIN in the target network namespace to
use this ioctl.

Cc: "David S. Miller" 
Cc: Eric W. Biederman 
Signed-off-by: Andrei Vagin 
---
 fs/nsfs.c                    |  2 +-
 include/linux/proc_fs.h      |  4 ++++
 include/uapi/linux/sockios.h |  1 +
 net/socket.c                 | 13 +++++++++++++
 4 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/fs/nsfs.c b/fs/nsfs.c
index 8718af8..8c9fb29 100644
--- a/fs/nsfs.c
+++ b/fs/nsfs.c
@@ -118,7 +118,7 @@ void *ns_get_path(struct path *path, struct task_struct *task,
 	return ret;
 }
 
-static int open_related_ns(struct ns_common *ns,
+int open_related_ns(struct ns_common *ns,
 		   struct ns_common *(*get_ns)(struct ns_common *ns))
 {
struct path path = {};
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index b97bf2e..368c7ad 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -82,4 +82,8 @@ static inline struct proc_dir_entry *proc_net_mkdir(
return proc_mkdir_data(name, 0, parent, net);
 }
 
+struct ns_common;
+int open_related_ns(struct ns_common *ns,
+  struct ns_common *(*get_ns)(struct ns_common *ns));
+
 #endif /* _LINUX_PROC_FS_H */
diff --git a/include/uapi/linux/sockios.h b/include/uapi/linux/sockios.h
index 8e7890b..83cc54c 100644
--- a/include/uapi/linux/sockios.h
+++ b/include/uapi/linux/sockios.h
@@ -84,6 +84,7 @@
 #define SIOCWANDEV 0x894A  /* get/set netdev parameters*/
 
 #define SIOCOUTQNSD0x894B  /* output queue size (not sent only) */
+#define SIOCGSKNS  0x894C  /* get socket network namespace */
 
 /* ARP cache control calls. */
/*  0x8950 - 0x8952  * obsolete calls, don't re-use */
diff --git a/net/socket.c b/net/socket.c
index 5a9bf5e..970a7ea 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -877,6 +877,11 @@ static long sock_do_ioctl(struct net *net, struct socket *sock,
  * what to do with it - that's up to the protocol still.
  */
 
+static struct ns_common *get_net_ns(struct ns_common *ns)
+{
+	return &get_net(container_of(ns, struct net, ns))->ns;
+}
+
+
 static long sock_ioctl(struct file *file, unsigned cmd, unsigned long arg)
 {
struct socket *sock;
@@ -945,6 +950,13 @@ static long sock_ioctl(struct file *file, unsigned cmd, unsigned long arg)
 		err = dlci_ioctl_hook(cmd, argp);
 		mutex_unlock(&dlci_ioctl_mutex);
 		break;
+	case SIOCGSKNS:
+		err = -EPERM;
+		if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
+			break;
+
+		err = open_related_ns(&net->ns, get_net_ns);
+		break;
default:
err = sock_do_ioctl(net, sock, cmd, arg);
break;
@@ -3093,6 +3105,7 @@ static int compat_sock_ioctl_trans(struct file *file, struct socket *sock,
case SIOCSIFVLAN:
case SIOCADDDLCI:
case SIOCDELDLCI:
+   case SIOCGSKNS:
return sock_ioctl(file, cmd, arg);
 
case SIOCGIFFLAGS:
-- 
2.7.4



Re: question about function igmp_stop_timer() in net/ipv4/igmp.c

2016-10-24 Thread Dongpo Li
Hi Andrew,

On 2016/10/24 23:32, Andrew Lunn wrote:
> On Mon, Oct 24, 2016 at 07:50:12PM +0800, Dongpo Li wrote:
>> Hello
>>
>> We encountered a multicast problem when two set-top boxes (STBs) join the
>> same multicast group and then leave.
>> The two boxes can join the same multicast group,
>> but only one box sends the IGMP leave group message when it leaves;
>> the other box does not send the IGMP leave message.
>> Our boxes use IGMP version 2.
>>
>> I added some debug info and found the whole procedure is like this:
>> (1) Box A joins the multicast group 225.1.101.145 and send the IGMP v2 
>> membership report(join group).
>> (2) Box B joins the same multicast group 225.1.101.145 and also send the 
>> IGMP v2 membership report(join group).
>> (3) Box A receives the IGMP membership report from Box B and kernel calls 
>> igmp_heard_report().
>> This function will call igmp_stop_timer(im).
>> In function igmp_stop_timer(im), it tries to delete IGMP timer and does 
>> the following:
>> im->tm_running = 0;
>> im->reporter = 0;
>> (4) Box A leaves the multicast group 225.1.101.145 and kernel calls
>> ip_mc_leave_group -> ip_mc_dec_group -> igmp_group_dropped.
>> But in function igmp_group_dropped(), the im->reporter is 0, so the 
>> kernel does not send the IGMP leave message.
> 
> RFC 2236 says:
> 
> 2.  Introduction
> 
>The Internet Group Management Protocol (IGMP) is used by IP hosts to
>report their multicast group memberships to any immediately-
>neighboring multicast routers.
> 
> Are Box A or B multicast routers?
Thank you for your comments.
Both Box A and B are IP hosts, not multicast routers.
And the RFC says: IGMP is used by "IP hosts" to report their multicast group 
membership.

> 
> Andrew
> 

Regards,
Dongpo




Re: [PATCH net-next RFC WIP] Patch for XDP support for virtio_net

2016-10-24 Thread Alexei Starovoitov
On Sun, Oct 23, 2016 at 06:51:53PM -0700, Shrijeet Mukherjee wrote:
> 
> The main goal of this patch was to start that discussion. My v2 patch
> rejects the ndo op if neither of rx_mergeable or big_buffers are set.
> Does that sound like a good tradeoff ? Don't know enough about who
> turns these features off and why.
> 
> I can say that virtualbox always has the device features enabled .. so
> seems like a good tradeoff ?

If virtio can be taught to work with xdp that would be awesome.
I've looked at it from an xdp prog debugging point of view, but the amount of
complexity related to mergeable/big/etc was too much, so I went with e1k+xdp.
Are you sure that if mergeable/big are disabled, then the buf is contiguous?
Also, my understanding is that the buf is not writeable?
I don't see how to do TX either... Maybe it's all solvable somehow.

There was a discussion about converting the raw dma buffer in mlx/intel
directly into vhost to avoid the skb. This would allow the host to send
packets into VMs quickly. Then, if we can also have fast virtio in the guest,
even more interesting use cases will be solved.



[PATCH v2 net 1/1] net sched filters: fix notification of filter delete with proper handle

2016-10-24 Thread Jamal Hadi Salim
From: Jamal Hadi Salim 

Daniel says:

While trying out [1][2], I noticed that tc monitor doesn't show the
correct handle on delete:

$ tc monitor
qdisc clsact ffff: dev eno1 parent ffff:fff1
filter dev eno1 ingress protocol all pref 49152 bpf handle 0x2a [...]
deleted filter dev eno1 ingress protocol all pref 49152 bpf handle 0xf3be0c80

some context to explain the above:
The user identity of any tc filter is represented by a 32-bit
identifier encoded in tcm->tcm_handle. Example 0x2a in the bpf filter
above. A user wishing to delete, get or even modify a specific filter
uses this handle to reference it.
Every classifier is free to provide its own semantics for the 32 bit handle.
Example: classifiers like u32 use schemes like 800:1:801 to describe
the semantics of their filters represented as hash table, bucket and
node ids etc.
Classifiers also have internal per-filter representation which is different
from this externally visible identity. Most classifiers set this
internal representation to be a pointer address (which allows fast retrieval
of said filters in their implementations). This internal representation
is referenced with the "fh" variable in the kernel control code.

When a user successfully deletes a specific filter, by specifying the correct
tcm->tcm_handle, an event is generated to user space which indicates
which specific filter was deleted.

Before this patch, the "fh" value was sent to user space as the identity.
As an example, what is shown in the sample bpf filter delete event above
is 0xf3be0c80. This is in fact a 32-bit truncation of 0xffff8807f3be0c80,
which happens to be the 64-bit memory address of the internal filter
representation (the address of the corresponding filter's struct cls_bpf_prog).

After this patch, the appropriate user-identifiable handle, as encoded
in the originating request's tcm->tcm_handle, is generated in the event.
One of the cardinal rules of netlink is that one must be able to take an
event (such as a delete in this case) and reflect it back to the
kernel to successfully delete the filter. This patch achieves that.

Note, this issue has existed since the original TC action
infrastructure code patch back in 2004 as found in:
https://git.kernel.org/cgit/linux/kernel/git/history/history.git/commit/

[1] http://patchwork.ozlabs.org/patch/682828/
[2] http://patchwork.ozlabs.org/patch/682829/

Fixes: 4e54c4816bfe ("[NET]: Add tc extensions infrastructure.")
Reported-by: Daniel Borkmann 
Acked-by: Cong Wang 
Signed-off-by: Jamal Hadi Salim 
---
 net/sched/cls_api.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 2ee29a3..2b2a797 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -345,7 +345,8 @@ static int tc_ctl_tfilter(struct sk_buff *skb, struct nlmsghdr *n)
 		if (err == 0) {
 			struct tcf_proto *next = rtnl_dereference(tp->next);
 
-			tfilter_notify(net, skb, n, tp, fh,
+			tfilter_notify(net, skb, n, tp,
+				       t->tcm_handle,
 				       RTM_DELTFILTER, false);
 			if (tcf_destroy(tp, false))
 				RCU_INIT_POINTER(*back, next);
-- 
1.9.1



[PATCH] net: bgmac: fix spelling mistake: "connecton" -> "connection"

2016-10-24 Thread Colin King
From: Colin Ian King 

Trivial fix to a spelling mistake in a dev_err message.

Signed-off-by: Colin Ian King 
---
 drivers/net/ethernet/broadcom/bgmac.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/broadcom/bgmac.c 
b/drivers/net/ethernet/broadcom/bgmac.c
index 856379c..31ca204 100644
--- a/drivers/net/ethernet/broadcom/bgmac.c
+++ b/drivers/net/ethernet/broadcom/bgmac.c
@@ -1449,7 +1449,7 @@ static int bgmac_phy_connect(struct bgmac *bgmac)
phy_dev = phy_connect(bgmac->net_dev, bus_id, _adjust_link,
  PHY_INTERFACE_MODE_MII);
if (IS_ERR(phy_dev)) {
-   dev_err(bgmac->dev, "PHY connecton failed\n");
+   dev_err(bgmac->dev, "PHY connection failed\n");
return PTR_ERR(phy_dev);
}
 
-- 
2.9.3



[PATCH v3 net] flow_dissector: fix vlan tag handling

2016-10-24 Thread Arnd Bergmann
gcc warns about an uninitialized pointer dereference in the vlan
priority handling:

net/core/flow_dissector.c: In function '__skb_flow_dissect':
net/core/flow_dissector.c:281:61: error: 'vlan' may be used uninitialized in this function [-Werror=maybe-uninitialized]

As pointed out by Jiri Pirko, the variable is never actually used
without being initialized first, as the only way it ends up uninitialized
is with skb_vlan_tag_present(skb)==true, and in that case it does not
get accessed.

However, the warning hints at some related issues that I'm addressing
here:

- the second check for the vlan tag is different from the first one
  that tests the skb for being NULL first, causing both the warning
  and a possible NULL pointer dereference that was not entirely fixed.
- The same patch that introduced the NULL pointer check dropped an
  earlier optimization that skipped the repeated check of the
  protocol type
- The local '_vlan' variable is referenced through the 'vlan' pointer
  but the variable has gone out of scope by the time that it is
  accessed, causing undefined behavior

Caching the result of the 'skb && skb_vlan_tag_present(skb)' check
in a local variable allows the compiler to further optimize the
later check. With those changes, the warning also disappears.

Fixes: 3805a938a6c2 ("flow_dissector: Check skb for VLAN only if skb specified.")
Fixes: d5709f7ab776 ("flow_dissector: For stripped vlan, get vlan info from skb->vlan_tci")
Signed-off-by: Arnd Bergmann 
---
v3: set 'proto' variable correctly again
    mark it for net, rather than net-next, as both patches that
    introduced the bugs are in mainline or in net/master

v2: fix multiple issues found in the initial review beyond the
uninitialized access that turned out to be ok

 net/core/flow_dissector.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index 44e6ba9d3a6b..ab193e5def07 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -246,13 +246,13 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
case htons(ETH_P_8021AD):
case htons(ETH_P_8021Q): {
const struct vlan_hdr *vlan;
+   struct vlan_hdr _vlan;
+   bool vlan_tag_present = skb && skb_vlan_tag_present(skb);
 
-   if (skb && skb_vlan_tag_present(skb))
+   if (vlan_tag_present)
proto = skb->protocol;
 
-   if (eth_type_vlan(proto)) {
-   struct vlan_hdr _vlan;
-
+   if (!vlan_tag_present || eth_type_vlan(skb->protocol)) {
vlan = __skb_header_pointer(skb, nhoff, sizeof(_vlan),
data, hlen, &_vlan);
if (!vlan)
@@ -270,7 +270,7 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
 
key_vlan = skb_flow_dissector_target(flow_dissector,
FLOW_DISSECTOR_KEY_VLAN,
 target_container);
 
-   if (skb_vlan_tag_present(skb)) {
+   if (vlan_tag_present) {
key_vlan->vlan_id = skb_vlan_tag_get_id(skb);
key_vlan->vlan_priority =
(skb_vlan_tag_get_prio(skb) >> VLAN_PRIO_SHIFT);
-- 
2.9.0
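The out-of-scope '_vlan' issue fixed by the patch above is easy to reproduce outside the kernel. A minimal userspace sketch follows; the struct and helper names are stand-ins for the kernel's, not the real API:

```c
#include <assert.h>
#include <string.h>

/* Stand-ins for the kernel types/helpers; names are illustrative only. */
struct vlan_hdr { unsigned short h_vlan_TCI, h_vlan_encapsulated_proto; };

/* Like __skb_header_pointer(): copies the header into the caller-supplied
 * buffer and returns a pointer to that buffer. */
static const struct vlan_hdr *header_pointer(const void *data,
                                             struct vlan_hdr *buffer)
{
    memcpy(buffer, data, sizeof(*buffer));
    return buffer;
}

/* BROKEN: the backing storage _vlan dies at the inner closing brace,
 * so the returned pointer dangles -- the bug described above. */
static const struct vlan_hdr *parse_broken(const void *data)
{
    const struct vlan_hdr *vlan;

    {
        struct vlan_hdr _vlan;
        vlan = header_pointer(data, &_vlan);
    }                /* _vlan goes out of scope here */
    return vlan;     /* dereferencing this is undefined behavior */
}

/* FIXED: declare the backing storage in the scope where the pointer
 * is actually used, as the patch does. */
static const struct vlan_hdr *parse_fixed(const void *data,
                                          struct vlan_hdr *_vlan)
{
    return header_pointer(data, _vlan);
}
```

The broken variant often "works" in practice because the stack slot is not reused immediately, which is exactly why the compiler warning was the only visible symptom.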



Re: [PATCH] netfilter: ip_vs_sync: fix bogus maybe-uninitialized warning

2016-10-24 Thread Arnd Bergmann
On Monday, October 24, 2016 10:47:54 PM CEST Julian Anastasov wrote:
> > diff --git a/net/netfilter/ipvs/ip_vs_sync.c 
> > b/net/netfilter/ipvs/ip_vs_sync.c
> > index 1b07578bedf3..9350530c16c1 100644
> > --- a/net/netfilter/ipvs/ip_vs_sync.c
> > +++ b/net/netfilter/ipvs/ip_vs_sync.c
> > @@ -283,6 +283,7 @@ struct ip_vs_sync_buff {
> >   */
> >  static void ntoh_seq(struct ip_vs_seq *no, struct ip_vs_seq *ho)
> >  {
> > + memset(ho, 0, sizeof(*ho));
> >   ho->init_seq   = get_unaligned_be32(&no->init_seq);
> >   ho->delta  = get_unaligned_be32(&no->delta);
> >   ho->previous_delta = get_unaligned_be32(&no->previous_delta);
> 
> So, now there is a double write here?

Correct. I would hope that a sane version of gcc would just not
perform the first write. What happens instead is that the version
that produces the warning here moves the initialization to the
top of the calling function.

> What about such constructs?:
> 
> *ho = (struct ip_vs_seq) {
> .init_seq   = get_unaligned_be32(&no->init_seq),
> ...
> };
> 
> Any difference in the compiled code or warnings?

Yes, it's one of many things I tried. What happens here is that
the warning remains as long as all fields are initialized together,
e.g. these two produces the same warning:

a)
ho->init_seq   = get_unaligned_be32(&no->init_seq);
ho->delta  = get_unaligned_be32(&no->delta);
ho->previous_delta = get_unaligned_be32(&no->previous_delta);

b)
   *ho = (struct ip_vs_seq) {
   .init_seq   = get_unaligned_be32(&no->init_seq),
   .delta  = get_unaligned_be32(&no->delta),
   .previous_delta = get_unaligned_be32(&no->previous_delta),
   };

but this one does not:

c)
   *ho = (struct ip_vs_seq) {
   .delta  = get_unaligned_be32(&no->delta),
   .previous_delta = get_unaligned_be32(&no->previous_delta),
   };
   ho->init_seq   = get_unaligned_be32(&no->init_seq);

I have absolutely no idea what is going on inside of gcc here.

Arnd
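For readers following along, the conversion pattern under discussion can be modeled in userspace. This is a hedged sketch with a portable stand-in for get_unaligned_be32(), not the kernel's unaligned-access helpers:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct seq { uint32_t init_seq, delta, previous_delta; };

/* Portable stand-in for the kernel's get_unaligned_be32(). */
static uint32_t get_unaligned_be32(const void *p)
{
    const uint8_t *b = p;
    return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
           ((uint32_t)b[2] << 8)  |  (uint32_t)b[3];
}

/* The pattern from the patch: zero the destination first, then assign
 * each field.  The memset is redundant when the compiler reasons
 * correctly, but it gives the optimizer an unambiguous "fully
 * initialized" starting point, which is what silences the warning. */
static void ntoh_seq(const struct seq *no, struct seq *ho)
{
    memset(ho, 0, sizeof(*ho));
    ho->init_seq       = get_unaligned_be32(&no->init_seq);
    ho->delta          = get_unaligned_be32(&no->delta);
    ho->previous_delta = get_unaligned_be32(&no->previous_delta);
}
```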


Re: [PATCH] can: fix warning in bcm_connect/proc_register

2016-10-24 Thread Cong Wang
On Mon, Oct 24, 2016 at 1:10 PM, Cong Wang  wrote:
> On Mon, Oct 24, 2016 at 12:11 PM, Oliver Hartkopp
>  wrote:
>> if (proc_dir) {
>> /* unique socket address as filename */
>> sprintf(bo->procname, "%lu", sock_i_ino(sk));
>> bo->bcm_proc_read = proc_create_data(bo->procname, 0644,
>>  proc_dir,
>>  &bcm_proc_fops, sk);
>> +   if (!bo->bcm_proc_read) {
>> +   ret = -ENOMEM;
>> +   goto fail;
>> +   }
>
> Well, I meant we need to call proc_create_data() once per socket,
> so we need a check before proc_create_data() too.

Hmm, bo->bound should guarantee it, so never mind, your patch
looks fine.


Re: [PATCH] can: fix warning in bcm_connect/proc_register

2016-10-24 Thread Cong Wang
On Mon, Oct 24, 2016 at 12:11 PM, Oliver Hartkopp
 wrote:
> if (proc_dir) {
> /* unique socket address as filename */
> sprintf(bo->procname, "%lu", sock_i_ino(sk));
> bo->bcm_proc_read = proc_create_data(bo->procname, 0644,
>  proc_dir,
>  &bcm_proc_fops, sk);
> +   if (!bo->bcm_proc_read) {
> +   ret = -ENOMEM;
> +   goto fail;
> +   }

Well, I meant we need to call proc_create_data() once per socket,
so we need a check before proc_create_data() too.

Thanks.


Re: [PATCH] netfilter: ip_vs_sync: fix bogus maybe-uninitialized warning

2016-10-24 Thread Julian Anastasov

Hello,

On Mon, 24 Oct 2016, Arnd Bergmann wrote:

> Building the ip_vs_sync code with CONFIG_OPTIMIZE_INLINING on x86
> confuses the compiler to the point where it produces a rather
> dubious warning message:
> 
> net/netfilter/ipvs/ip_vs_sync.c:1073:33: error: ‘opt.init_seq’ may be used 
> uninitialized in this function [-Werror=maybe-uninitialized]
>   struct ip_vs_sync_conn_options opt;
>  ^~~
> net/netfilter/ipvs/ip_vs_sync.c:1073:33: error: ‘opt.delta’ may be used 
> uninitialized in this function [-Werror=maybe-uninitialized]
> net/netfilter/ipvs/ip_vs_sync.c:1073:33: error: ‘opt.previous_delta’ may be 
> used uninitialized in this function [-Werror=maybe-uninitialized]
> net/netfilter/ipvs/ip_vs_sync.c:1073:33: error: ‘*((void *)+12).init_seq’ 
> may be used uninitialized in this function [-Werror=maybe-uninitialized]
> net/netfilter/ipvs/ip_vs_sync.c:1073:33: error: ‘*((void *)+12).delta’ 
> may be used uninitialized in this function [-Werror=maybe-uninitialized]
> net/netfilter/ipvs/ip_vs_sync.c:1073:33: error: ‘*((void 
> *)+12).previous_delta’ may be used uninitialized in this function 
> [-Werror=maybe-uninitialized]
> 
> The problem appears to be a combination of a number of factors, including
> the __builtin_bswap32 compiler builtin being slightly odd, having a large
> amount of code inlined into a single function, and the way that some
> functions only get partially inlined here.
> 
> I've spent way too much time trying to work out a way to improve the
> code, but the best I've come up with is to add an explicit memset
> right before the ip_vs_seq structure is first initialized here. When
> the compiler works correctly, this has absolutely no effect, but in the
> case that produces the warning, the warning disappears.
> 
> In the process of analysing this warning, I also noticed that
> we use memcpy to copy the larger ip_vs_sync_conn_options structure
> over two members of the ip_vs_conn structure. This works because
> the layout is identical, but seems error-prone, so I'm changing
> this in the process to directly copy the two members. This change
> seemed to have no effect on the object code or the warning, but
> it deals with the same data, so I kept the two changes together.
> 
> Signed-off-by: Arnd Bergmann 

OK,

Acked-by: Julian Anastasov 

I guess, Simon will take the patch for ipvs-next.

> ---
>  net/netfilter/ipvs/ip_vs_sync.c | 7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/net/netfilter/ipvs/ip_vs_sync.c b/net/netfilter/ipvs/ip_vs_sync.c
> index 1b07578bedf3..9350530c16c1 100644
> --- a/net/netfilter/ipvs/ip_vs_sync.c
> +++ b/net/netfilter/ipvs/ip_vs_sync.c
> @@ -283,6 +283,7 @@ struct ip_vs_sync_buff {
>   */
>  static void ntoh_seq(struct ip_vs_seq *no, struct ip_vs_seq *ho)
>  {
> + memset(ho, 0, sizeof(*ho));
>   ho->init_seq   = get_unaligned_be32(&no->init_seq);
>   ho->delta  = get_unaligned_be32(&no->delta);
>   ho->previous_delta = get_unaligned_be32(&no->previous_delta);

So, now there is a double write here?

What about such constructs?:

*ho = (struct ip_vs_seq) {
.init_seq   = get_unaligned_be32(>init_seq),
...
};

Any difference in the compiled code or warnings?

> @@ -917,8 +918,10 @@ static void ip_vs_proc_conn(struct netns_ipvs *ipvs, 
> struct ip_vs_conn_param *pa
>   kfree(param->pe_data);
>   }
>  
> - if (opt)
> - memcpy(&cp->in_seq, opt, sizeof(*opt));
> + if (opt) {
> + cp->in_seq = opt->in_seq;
> + cp->out_seq = opt->out_seq;

This fix is fine.

> + }
>   atomic_set(&cp->in_pkts, sysctl_sync_threshold(ipvs));
>   cp->state = state;
>   cp->old_state = cp->state;
> -- 
> 2.9.0

Regards

--
Julian Anastasov 

Re: net/sctp: slab-out-of-bounds in sctp_sf_ootb

2016-10-24 Thread Marcelo Ricardo Leitner
Hi Andrey,

On Mon, Oct 24, 2016 at 05:30:04PM +0200, Andrey Konovalov wrote:
> The problem is that sctp_walk_errors walks the chunk before its length
> is checked for overflow.

Exactly. The check is done too late, for the 2nd and subsequent chunks
only.
Please try the following patch, thanks. Note: not even compile tested.

---8<---

diff --git a/net/sctp/sm_statefuns.c b/net/sctp/sm_statefuns.c
index 026e3bca4a94..8ec20a64a3f8 100644
--- a/net/sctp/sm_statefuns.c
+++ b/net/sctp/sm_statefuns.c
@@ -3422,6 +3422,12 @@ sctp_disposition_t sctp_sf_ootb(struct net *net,
return sctp_sf_violation_chunklen(net, ep, asoc, type, arg,
  commands);
 
+   /* Report violation if chunk len overflows */
+   ch_end = ((__u8 *)ch) + SCTP_PAD4(ntohs(ch->length));
+   if (ch_end > skb_tail_pointer(skb))
+   return sctp_sf_violation_chunklen(net, ep, asoc, type, arg,
+ commands);
+
/* Now that we know we at least have a chunk header,
 * do things that are type appropriate.
 */
@@ -3453,12 +3459,6 @@ sctp_disposition_t sctp_sf_ootb(struct net *net,
}
}
 
-   /* Report violation if chunk len overflows */
-   ch_end = ((__u8 *)ch) + SCTP_PAD4(ntohs(ch->length));
-   if (ch_end > skb_tail_pointer(skb))
-   return sctp_sf_violation_chunklen(net, ep, asoc, type, arg,
- commands);
-
ch = (sctp_chunkhdr_t *) ch_end;
} while (ch_end < skb_tail_pointer(skb));
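The reordering above -- validate the declared chunk length against the buffer end before touching the chunk -- can be sketched in plain C. Types and constants here are illustrative stand-ins, not the SCTP code itself:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAD4(len) (((len) + 3u) & ~3u)

/* Walk TLV-style chunks (4-byte header: type, flags, 16-bit big-endian
 * length), validating each declared length against the buffer end
 * BEFORE inspecting the chunk -- the ordering the patch restores.
 * Returns the chunk count, or -1 on a malformed packet. */
static int walk_chunks(const uint8_t *buf, size_t len)
{
    const uint8_t *p = buf, *tail = buf + len;
    int count = 0;

    while (p + 4 <= tail) {
        uint16_t chlen = (uint16_t)((p[2] << 8) | p[3]);

        if (chlen < 4)
            return -1;              /* shorter than its own header */
        if ((size_t)(tail - p) < PAD4(chlen))
            return -1;              /* declared length overflows buffer */

        count++;                    /* only now is the chunk safe to read */
        p += PAD4(chlen);
    }
    return count;
}
```

With the check done first, a chunk claiming a length past the tail is rejected before anything walks into out-of-bounds memory, which is the slab-out-of-bounds Andrey hit.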
 


Re: [net-next PATCH RFC 19/26] arch/sparc: Add option to skip DMA sync as a part of map and unmap

2016-10-24 Thread Alexander Duyck
On Mon, Oct 24, 2016 at 11:27 AM, David Miller  wrote:
> From: Alexander Duyck 
> Date: Mon, 24 Oct 2016 08:06:07 -0400
>
>> This change allows us to pass DMA_ATTR_SKIP_CPU_SYNC which allows us to
>> avoid invoking cache line invalidation if the driver will just handle it
>> via a sync_for_cpu or sync_for_device call.
>>
>> Cc: "David S. Miller" 
>> Cc: sparcli...@vger.kernel.org
>> Signed-off-by: Alexander Duyck 
>
> This is fine for avoiding the flush for performance reasons, but the
> chip isn't going to write anything back unless the device wrote into
> the area.

That is mostly what I am doing here.  The original implementation was
mostly for performance.  I am trying to take the attribute that was
already in place for ARM and apply it to all the other architectures.
So what will be happening now is that we call the map function with
this attribute set and then use the sync functions to map it to the
device and then pull the mapping later.

The idea is that if Jesper does his page pool stuff it would be
calling the map/unmap functions and then the drivers would be doing
the sync_for_cpu/sync_for_device.  I want to make sure the map is
cheap and we will have to call sync_for_cpu from the drivers anyway
since there is no guarantee if we will have a new page or be reusing
an existing one.

- Alex
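The contract being described -- skip the implicit cache sync at map time and rely on explicit sync calls from the driver -- can be modeled minimally in userspace. The flag value below is assumed for illustration:

```c
#include <assert.h>

/* Assumed bit value for illustration; in the kernel the attribute is a
 * bit in the unsigned long 'attrs' word from linux/dma-mapping.h. */
#define DMA_ATTR_SKIP_CPU_SYNC (1UL << 5)

static int sync_ops;    /* counts simulated cache-maintenance operations */

/* Model of the contract: mapping performs an implicit sync unless the
 * caller sets DMA_ATTR_SKIP_CPU_SYNC, in which case it promises to
 * issue the explicit sync itself before handing the buffer over. */
static void map_page(unsigned long attrs)
{
    if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
        sync_ops++;     /* implicit sync at map time */
}

static void sync_single_for_device(void)
{
    sync_ops++;         /* explicit, driver-issued sync */
}
```

The win for a page-recycling driver is that the map stays cheap and the sync happens only when a buffer actually transitions between CPU and device ownership.
<imports>
</imports>
<test>
int main(void)
{
    map_page(0);                        /* legacy path: implicit sync */
    assert(sync_ops == 1);

    map_page(DMA_ATTR_SKIP_CPU_SYNC);   /* cheap map, no sync yet */
    assert(sync_ops == 1);

    sync_single_for_device();           /* driver syncs only when needed */
    assert(sync_ops == 2);
    return 0;
}
```
</test>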


[PATCH] net: ipv6: Do not consider link state for nexthop validation

2016-10-24 Thread David Ahern
Similar to IPv4, do not consider link state when validating next hops.

Currently, if the link is down default routes can fail to insert:
 $ ip -6 ro add vrf blue default via 2100:2::64 dev eth2
 RTNETLINK answers: No route to host

With this patch the command succeeds.

Fixes: 8c14586fc320 ("net: ipv6: Use passed in table for nexthop lookups")
Signed-off-by: David Ahern 
---
 include/net/ip6_route.h | 1 +
 net/ipv6/route.c        | 6 ++++--
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
index e0cd318d5103..f83e78d071a3 100644
--- a/include/net/ip6_route.h
+++ b/include/net/ip6_route.h
@@ -32,6 +32,7 @@ struct route_info {
 #define RT6_LOOKUP_F_SRCPREF_TMP   0x0008
 #define RT6_LOOKUP_F_SRCPREF_PUBLIC0x0010
 #define RT6_LOOKUP_F_SRCPREF_COA   0x0020
+#define RT6_LOOKUP_F_IGNORE_LINKSTATE  0x0040
 
 /* We do not (yet ?) support IPv6 jumbograms (RFC 2675)
  * Unlike IPv4, hdr->seg_len doesn't include the IPv6 header
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 3ac19eb81a86..947ed1ded026 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -658,7 +658,8 @@ static struct rt6_info *find_match(struct rt6_info *rt, int 
oif, int strict,
struct net_device *dev = rt->dst.dev;
 
if (dev && !netif_carrier_ok(dev) &&
-   idev->cnf.ignore_routes_with_linkdown)
+   idev->cnf.ignore_routes_with_linkdown &&
+   !(strict & RT6_LOOKUP_F_IGNORE_LINKSTATE))
goto out;
 
if (rt6_check_expired(rt))
@@ -1052,6 +1053,7 @@ struct rt6_info *ip6_pol_route(struct net *net, struct 
fib6_table *table,
int strict = 0;
 
strict |= flags & RT6_LOOKUP_F_IFACE;
+   strict |= flags & RT6_LOOKUP_F_IGNORE_LINKSTATE;
if (net->ipv6.devconf_all->forwarding == 0)
strict |= RT6_LOOKUP_F_REACHABLE;
 
@@ -1791,7 +1793,7 @@ static struct rt6_info *ip6_nh_lookup_table(struct net 
*net,
};
struct fib6_table *table;
struct rt6_info *rt;
-   int flags = RT6_LOOKUP_F_IFACE;
+   int flags = RT6_LOOKUP_F_IFACE | RT6_LOOKUP_F_IGNORE_LINKSTATE;
 
table = fib6_get_table(net, cfg->fc_table);
if (!table)
-- 
2.1.4



Re: [net-next PATCH RFC 02/26] swiotlb: Add support for DMA_ATTR_SKIP_CPU_SYNC

2016-10-24 Thread Alexander Duyck
On Mon, Oct 24, 2016 at 11:09 AM, Konrad Rzeszutek Wilk
 wrote:
> On Mon, Oct 24, 2016 at 08:04:37AM -0400, Alexander Duyck wrote:
>> As a first step to making DMA_ATTR_SKIP_CPU_SYNC apply to architectures
>> beyond just ARM I need to make it so that the swiotlb will respect the
>> flag.  In order to do that I also need to update the swiotlb-xen since it
>> heavily makes use of the functionality.
>>
>> Cc: Konrad Rzeszutek Wilk 
>> Signed-off-by: Alexander Duyck 
>> ---
>>  drivers/xen/swiotlb-xen.c |   40 ++
>>  include/linux/swiotlb.h   |6 --
>>  lib/swiotlb.c |   48 
>> +++--
>>  3 files changed, 56 insertions(+), 38 deletions(-)
>>
>> diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
>> index 87e6035..cf047d8 100644
>> --- a/drivers/xen/swiotlb-xen.c
>> +++ b/drivers/xen/swiotlb-xen.c
>> @@ -405,7 +405,8 @@ dma_addr_t xen_swiotlb_map_page(struct device *dev, 
>> struct page *page,
>>*/
>>   trace_swiotlb_bounced(dev, dev_addr, size, swiotlb_force);
>>
>> - map = swiotlb_tbl_map_single(dev, start_dma_addr, phys, size, dir);
>> + map = swiotlb_tbl_map_single(dev, start_dma_addr, phys, size, dir,
>> +  attrs);
>>   if (map == SWIOTLB_MAP_ERROR)
>>   return DMA_ERROR_CODE;
>>
>> @@ -416,11 +417,13 @@ dma_addr_t xen_swiotlb_map_page(struct device *dev, 
>> struct page *page,
>>   /*
>>* Ensure that the address returned is DMA'ble
>>*/
>> - if (!dma_capable(dev, dev_addr, size)) {
>> - swiotlb_tbl_unmap_single(dev, map, size, dir);
>> - dev_addr = 0;
>> - }
>> - return dev_addr;
>> + if (dma_capable(dev, dev_addr, size))
>> + return dev_addr;
>> +
>> + swiotlb_tbl_unmap_single(dev, map, size, dir,
>> +  attrs | DMA_ATTR_SKIP_CPU_SYNC);
>> +
>> + return DMA_ERROR_CODE;
>
> Why? This change (re-ordering the code - and returning DMA_ERROR_CODE instead
> of 0) does not have anything to do with the title.
>
> If you really feel strongly about it - then please send it as a seperate 
> patch.

Okay I can do that.  This was mostly just to clean up the formatting
because I was over 80 characters when I added the attribute.  Changing
the return value to DMA_ERROR_CODE from 0 was based on the fact that
earlier in the function that is the value you return if there is a
mapping error.

>>  }
>>  EXPORT_SYMBOL_GPL(xen_swiotlb_map_page);
>>
>> @@ -444,7 +447,7 @@ static void xen_unmap_single(struct device *hwdev, 
>> dma_addr_t dev_addr,
>>
>>   /* NOTE: We use dev_addr here, not paddr! */
>>   if (is_xen_swiotlb_buffer(dev_addr)) {
>> - swiotlb_tbl_unmap_single(hwdev, paddr, size, dir);
>> + swiotlb_tbl_unmap_single(hwdev, paddr, size, dir, attrs);
>>   return;
>>   }
>>
>> @@ -557,16 +560,9 @@ void xen_swiotlb_unmap_page(struct device *hwdev, 
>> dma_addr_t dev_addr,
>>
>> start_dma_addr,
>>sg_phys(sg),
>>sg->length,
>> -  dir);
>> - if (map == SWIOTLB_MAP_ERROR) {
>> - dev_warn(hwdev, "swiotlb buffer is full\n");
>> - /* Don't panic here, we expect map_sg users
>> -to do proper error handling. */
>> - xen_swiotlb_unmap_sg_attrs(hwdev, sgl, i, dir,
>> -attrs);
>> - sg_dma_len(sgl) = 0;
>> - return 0;
>> - }
>> +  dir, attrs);
>> + if (map == SWIOTLB_MAP_ERROR)
>> + goto map_error;
>>   xen_dma_map_page(hwdev, pfn_to_page(map >> PAGE_SHIFT),
>>   dev_addr,
>>   map & ~PAGE_MASK,
>> @@ -589,6 +585,16 @@ void xen_swiotlb_unmap_page(struct device *hwdev, 
>> dma_addr_t dev_addr,
>>   sg_dma_len(sg) = sg->length;
>>   }
>>   return nelems;
>> +map_error:
>> + dev_warn(hwdev, "swiotlb buffer is full\n");
>> + /*
>> +  * Don't panic here, we expect map_sg users
>> +  * to do proper error handling.
>> +  */
>> + xen_swiotlb_unmap_sg_attrs(hwdev, sgl, i, dir,
>> +attrs | DMA_ATTR_SKIP_CPU_SYNC);
>> + sg_dma_len(sgl) = 0;
>> + return 0;
>>  }
>
> This too. Why can't that be part of the existing code that was there?


[PATCH] can: fix warning in bcm_connect/proc_register

2016-10-24 Thread Oliver Hartkopp
Andrey Konovalov reported an issue with proc_register in bcm.c.
As suggested by Cong Wang this patch adds a lock_sock() protection and
a check for unsuccessful proc_create_data() in bcm_connect().

Reference: http://marc.info/?l=linux-netdev=147732648731237

Reported-by: Andrey Konovalov 
Suggested-by: Cong Wang 
Signed-off-by: Oliver Hartkopp 
---
 net/can/bcm.c | 32 +++++++++++++++++++++---------
 1 file changed, 23 insertions(+), 9 deletions(-)

diff --git a/net/can/bcm.c b/net/can/bcm.c
index 8e999ff..8af9d25 100644
--- a/net/can/bcm.c
+++ b/net/can/bcm.c
@@ -1549,24 +1549,31 @@ static int bcm_connect(struct socket *sock, struct 
sockaddr *uaddr, int len,
struct sockaddr_can *addr = (struct sockaddr_can *)uaddr;
struct sock *sk = sock->sk;
struct bcm_sock *bo = bcm_sk(sk);
+   int ret = 0;
 
if (len < sizeof(*addr))
return -EINVAL;
 
-   if (bo->bound)
-   return -EISCONN;
+   lock_sock(sk);
+
+   if (bo->bound) {
+   ret = -EISCONN;
+   goto fail;
+   }
 
/* bind a device to this socket */
if (addr->can_ifindex) {
struct net_device *dev;
 
dev = dev_get_by_index(&init_net, addr->can_ifindex);
-   if (!dev)
-   return -ENODEV;
-
+   if (!dev) {
+   ret = -ENODEV;
+   goto fail;
+   }
if (dev->type != ARPHRD_CAN) {
dev_put(dev);
-   return -ENODEV;
+   ret = -ENODEV;
+   goto fail;
}
 
bo->ifindex = dev->ifindex;
@@ -1577,17 +1584,24 @@ static int bcm_connect(struct socket *sock, struct 
sockaddr *uaddr, int len,
bo->ifindex = 0;
}
 
-   bo->bound = 1;
-
if (proc_dir) {
/* unique socket address as filename */
sprintf(bo->procname, "%lu", sock_i_ino(sk));
bo->bcm_proc_read = proc_create_data(bo->procname, 0644,
 proc_dir,
 &bcm_proc_fops, sk);
+   if (!bo->bcm_proc_read) {
+   ret = -ENOMEM;
+   goto fail;
+   }
}
 
-   return 0;
+   bo->bound = 1;
+
+fail:
+   release_sock(sk);
+
+   return ret;
 }
 
 static int bcm_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
-- 
2.9.3



Re: [net-next PATCH RFC 05/26] arch/avr32: Add option to skip sync on DMA map

2016-10-24 Thread Hans-Christian Noren Egtvedt
Around Mon 24 Oct 2016 08:04:53 -0400 or thereabout, Alexander Duyck wrote:
> The use of DMA_ATTR_SKIP_CPU_SYNC was not consistent across all of the DMA
> APIs in the arch/arm folder.  This change is meant to correct that so that
> we get consistent behavior.

Looks good (-:

> Cc: Haavard Skinnemoen 
> Cc: Hans-Christian Egtvedt 
> Signed-off-by: Alexander Duyck 

Acked-by: Hans-Christian Noren Egtvedt 

> ---
>  arch/avr32/mm/dma-coherent.c |    7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/avr32/mm/dma-coherent.c b/arch/avr32/mm/dma-coherent.c
> index 58610d0..54534e5 100644
> --- a/arch/avr32/mm/dma-coherent.c
> +++ b/arch/avr32/mm/dma-coherent.c
> @@ -146,7 +146,8 @@ static dma_addr_t avr32_dma_map_page(struct device *dev, 
> struct page *page,
>  {
>   void *cpu_addr = page_address(page) + offset;
>  
> - dma_cache_sync(dev, cpu_addr, size, direction);
> + if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
> + dma_cache_sync(dev, cpu_addr, size, direction);
>   return virt_to_bus(cpu_addr);
>  }
>  
> @@ -162,6 +163,10 @@ static int avr32_dma_map_sg(struct device *dev, struct 
> scatterlist *sglist,
>  
>   sg->dma_address = page_to_bus(sg_page(sg)) + sg->offset;
>   virt = sg_virt(sg);
> +
> + if (attrs & DMA_ATTR_SKIP_CPU_SYNC)
> + continue;
> +
>   dma_cache_sync(dev, virt, sg->length, direction);
>   }
>  
-- 
mvh
Hans-Christian Noren Egtvedt


Re: [net-next PATCH RFC 19/26] arch/sparc: Add option to skip DMA sync as a part of map and unmap

2016-10-24 Thread David Miller
From: Alexander Duyck 
Date: Mon, 24 Oct 2016 08:06:07 -0400

> This change allows us to pass DMA_ATTR_SKIP_CPU_SYNC which allows us to
> avoid invoking cache line invalidation if the driver will just handle it
> via a sync_for_cpu or sync_for_device call.
> 
> Cc: "David S. Miller" 
> Cc: sparcli...@vger.kernel.org
> Signed-off-by: Alexander Duyck 

This is fine for avoiding the flush for performance reasons, but the
chip isn't going to write anything back unless the device wrote into
the area.


Re: net/can: warning in bcm_connect/proc_register

2016-10-24 Thread Oliver Hartkopp

Hello Andrey, hello Cong,

thanks for catching this issue.

I added lock_sock() and a check for a failing proc_create_data() below.

Can you please check if it solved the issue?
I tested the patched version with the stress tool as advised by Andrey 
and did not see any problems in dmesg anymore.

If ok I can provide a proper patch.

Many thanks,
Oliver


diff --git a/net/can/bcm.c b/net/can/bcm.c
index 8e999ff..8af9d25 100644
--- a/net/can/bcm.c
+++ b/net/can/bcm.c
@@ -1549,24 +1549,31 @@ static int bcm_connect(struct socket *sock, 
struct sockaddr *uaddr, int len,

struct sockaddr_can *addr = (struct sockaddr_can *)uaddr;
struct sock *sk = sock->sk;
struct bcm_sock *bo = bcm_sk(sk);
+   int ret = 0;

if (len < sizeof(*addr))
return -EINVAL;

-   if (bo->bound)
-   return -EISCONN;
+   lock_sock(sk);
+
+   if (bo->bound) {
+   ret = -EISCONN;
+   goto fail;
+   }

/* bind a device to this socket */
if (addr->can_ifindex) {
struct net_device *dev;

dev = dev_get_by_index(&init_net, addr->can_ifindex);
-   if (!dev)
-   return -ENODEV;
-
+   if (!dev) {
+   ret = -ENODEV;
+   goto fail;
+   }
if (dev->type != ARPHRD_CAN) {
dev_put(dev);
-   return -ENODEV;
+   ret = -ENODEV;
+   goto fail;
}

bo->ifindex = dev->ifindex;
@@ -1577,17 +1584,24 @@ static int bcm_connect(struct socket *sock, 
struct sockaddr *uaddr, int len,

bo->ifindex = 0;
}

-   bo->bound = 1;
-
if (proc_dir) {
/* unique socket address as filename */
sprintf(bo->procname, "%lu", sock_i_ino(sk));
bo->bcm_proc_read = proc_create_data(bo->procname, 0644,
 proc_dir,
 &bcm_proc_fops, sk);
+   if (!bo->bcm_proc_read) {
+   ret = -ENOMEM;
+   goto fail;
+   }
}

-   return 0;
+   bo->bound = 1;
+
+fail:
+   release_sock(sk);
+
+   return ret;
 }

 static int bcm_recvmsg(struct socket *sock, struct msghdr *msg, size_t 
size,



On 10/24/2016 07:31 PM, Andrey Konovalov wrote:

Hi Cong,

I'm able to reproduce it by running
https://gist.github.com/xairy/33f2eb6bf807b004e643bae36c3d02d7 in a
tight parallel loop with stress
(https://godoc.org/golang.org/x/tools/cmd/stress):
$ gcc -lpthread tmp.c
$ ./stress ./a.out

The C program was generated from the following syzkaller prog:
mmap(&(0x7f00/0x991000)=nil, (0x991000), 0x3, 0x32,
0x, 0x0)
socket(0x1d, 0x80002, 0x2)
r0 = socket(0x1d, 0x80002, 0x2)
connect$nfc_llcp(r0, &(0x7f00c000)={0x27, 0x1, 0x0, 0x5,
0x1, 0x1,
"341b3a01b257849ca1d7d1ff9f999d8127b185f88d1d775d59c88a3aa6a8ddacdf2bdc324ea6578a21b85114610186c3817c34b05eaffd2c3f54f57fa81ba0",
0x1ff}, 0x60)
connect$nfc_llcp(r0, &(0x7f991000-0x60)={0x27, 0x1, 0x1,
0x5, 0xfffd, 0x0,
"341b3a01b257849ca1d7d1ff9f999d8127b185f88d1d775dbec88a3aa6a8ddacdf2bdc324ea6578a21b85114610186c3817c34b05eaffd2c3f54f57fa81ba0",
0x1ff}, 0x60)

Unfortunately I wasn't able to create a simpler reproducer.

Thanks!

On Mon, Oct 24, 2016 at 6:58 PM, Cong Wang  wrote:

On Mon, Oct 24, 2016 at 9:21 AM, Andrey Konovalov  wrote:

Hi,

I've got the following error report while running the syzkaller fuzzer:

WARNING: CPU: 0 PID: 32451 at fs/proc/generic.c:345 proc_register+0x25e/0x300
proc_dir_entry 'can-bcm/249757' already registered
Kernel panic - not syncing: panic_on_warn set ...


Looks like we have two problems here:

1) A check for bo->bcm_proc_read != NULL seems missing
2) We need to lock the sock in bcm_connect().

I will work on a patch. Meanwhile, it would help a lot if you could provide
a reproducer.

Thanks!
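The race being fixed is the classic unlocked test-and-set. A userspace model of the locking Cong suggests, with a pthread mutex standing in for lock_sock() and names that are illustrative only:

```c
#include <assert.h>
#include <pthread.h>

/* Model of bcm_connect(): an unlocked "if (!bound) { register(); }"
 * lets two racing threads both register the same proc name; holding a
 * lock across the whole test-and-set (lock_sock() in the patch) makes
 * the second caller observe bound == 1 and bail out. */
static pthread_mutex_t sk_lock = PTHREAD_MUTEX_INITIALIZER;
static int bound;
static int registrations;   /* stands in for proc_create_data() calls */

static void *do_connect(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&sk_lock);
    if (!bound) {
        registrations++;    /* the part that must run at most once */
        bound = 1;
    }
    pthread_mutex_unlock(&sk_lock);
    return NULL;
}
```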


[net-next PATCH RFC 19/26] arch/sparc: Add option to skip DMA sync as a part of map and unmap

2016-10-24 Thread Alexander Duyck
This change allows us to pass DMA_ATTR_SKIP_CPU_SYNC which allows us to
avoid invoking cache line invalidation if the driver will just handle it
via a sync_for_cpu or sync_for_device call.

Cc: "David S. Miller" 
Cc: sparcli...@vger.kernel.org
Signed-off-by: Alexander Duyck 
---
 arch/sparc/kernel/iommu.c  |4 ++--
 arch/sparc/kernel/ioport.c |4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/sparc/kernel/iommu.c b/arch/sparc/kernel/iommu.c
index 5c615ab..8fda4e4 100644
--- a/arch/sparc/kernel/iommu.c
+++ b/arch/sparc/kernel/iommu.c
@@ -415,7 +415,7 @@ static void dma_4u_unmap_page(struct device *dev, 
dma_addr_t bus_addr,
ctx = (iopte_val(*base) & IOPTE_CONTEXT) >> 47UL;
 
/* Step 1: Kick data out of streaming buffers if necessary. */
-   if (strbuf->strbuf_enabled)
+   if (strbuf->strbuf_enabled && !(attrs & DMA_ATTR_SKIP_CPU_SYNC))
strbuf_flush(strbuf, iommu, bus_addr, ctx,
 npages, direction);
 
@@ -640,7 +640,7 @@ static void dma_4u_unmap_sg(struct device *dev, struct 
scatterlist *sglist,
base = iommu->page_table + entry;
 
dma_handle &= IO_PAGE_MASK;
-   if (strbuf->strbuf_enabled)
+   if (strbuf->strbuf_enabled && !(attrs & DMA_ATTR_SKIP_CPU_SYNC))
strbuf_flush(strbuf, iommu, dma_handle, ctx,
 npages, direction);
 
diff --git a/arch/sparc/kernel/ioport.c b/arch/sparc/kernel/ioport.c
index 2344103..6ffaec4 100644
--- a/arch/sparc/kernel/ioport.c
+++ b/arch/sparc/kernel/ioport.c
@@ -527,7 +527,7 @@ static dma_addr_t pci32_map_page(struct device *dev, struct 
page *page,
 static void pci32_unmap_page(struct device *dev, dma_addr_t ba, size_t size,
 enum dma_data_direction dir, unsigned long attrs)
 {
-   if (dir != PCI_DMA_TODEVICE)
+   if (dir != PCI_DMA_TODEVICE && !(attrs & DMA_ATTR_SKIP_CPU_SYNC))
dma_make_coherent(ba, PAGE_ALIGN(size));
 }
 
@@ -572,7 +572,7 @@ static void pci32_unmap_sg(struct device *dev, struct 
scatterlist *sgl,
struct scatterlist *sg;
int n;
 
-   if (dir != PCI_DMA_TODEVICE) {
+   if (dir != PCI_DMA_TODEVICE && !(attrs & DMA_ATTR_SKIP_CPU_SYNC)) {
for_each_sg(sgl, sg, nents, n) {
dma_make_coherent(sg_phys(sg), PAGE_ALIGN(sg->length));
}



[net-next PATCH RFC 23/26] mm: Add support for releasing multiple instances of a page

2016-10-24 Thread Alexander Duyck
This patch adds a function that allows us to batch free a page that has
multiple references outstanding.  Specifically this function can be used to
drop a page being used in the page frag alloc cache.  With this drivers can
make use of functionality similar to the page frag alloc cache without
having to do any workarounds for the fact that there is no function that
frees multiple references.

Cc: linux...@kvack.org
Signed-off-by: Alexander Duyck 
---
 include/linux/gfp.h |2 ++
 mm/page_alloc.c     |   14 ++++++++++++++
 2 files changed, 16 insertions(+)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index f8041f9de..4175dca 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -506,6 +506,8 @@ extern struct page *alloc_pages_vma(gfp_t gfp_mask, int 
order,
 extern void free_hot_cold_page_list(struct list_head *list, bool cold);
 
 struct page_frag_cache;
+extern void __page_frag_drain(struct page *page, unsigned int order,
+ unsigned int count);
 extern void *__alloc_page_frag(struct page_frag_cache *nc,
   unsigned int fragsz, gfp_t gfp_mask);
 extern void __free_page_frag(void *addr);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ca423cc..253046a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3883,6 +3883,20 @@ static struct page *__page_frag_refill(struct 
page_frag_cache *nc,
return page;
 }
 
+void __page_frag_drain(struct page *page, unsigned int order,
+  unsigned int count)
+{
+   VM_BUG_ON_PAGE(page_ref_count(page) == 0, page);
+
+   if (page_ref_sub_and_test(page, count)) {
+   if (order == 0)
+   free_hot_cold_page(page, false);
+   else
+   __free_pages_ok(page, order);
+   }
+}
+EXPORT_SYMBOL(__page_frag_drain);
+
 void *__alloc_page_frag(struct page_frag_cache *nc,
unsigned int fragsz, gfp_t gfp_mask)
 {
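A userspace analogue of the batch-release logic above, with a plain integer standing in for the page refcount (illustrative only, not the mm API):

```c
#include <assert.h>

/* Analogue of __page_frag_drain(): the page's refcount was bumped once
 * per fragment handed out; drain releases 'count' references in one go
 * and frees only when the count reaches zero. */
struct frag_page {
    int refcount;
    int freed;
};

static int ref_sub_and_test(struct frag_page *page, int count)
{
    page->refcount -= count;
    return page->refcount == 0;
}

static void frag_drain(struct frag_page *page, int count)
{
    assert(page->refcount > 0);     /* mirrors the VM_BUG_ON_PAGE() */

    if (ref_sub_and_test(page, count))
        page->freed = 1;            /* free_hot_cold_page()/__free_pages_ok() */
}
```

The point of the batch form is one atomic subtract instead of 'count' separate put operations, which is what makes page-recycling drivers cheap.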



Re: [net-next PATCH RFC 01/26] swiotlb: Drop unused function swiotlb_map_sg

2016-10-24 Thread Konrad Rzeszutek Wilk
On Mon, Oct 24, 2016 at 08:04:31AM -0400, Alexander Duyck wrote:
> There are no users for swiotlb_map_sg so we might as well just drop it.
> 
> Cc: Konrad Rzeszutek Wilk 

Acked-by: Konrad Rzeszutek Wilk 

Though I swear I saw a familiar patch by Christoph Hellwig at some point..
but maybe that patchset had been dropped.

> Signed-off-by: Alexander Duyck 
> ---
>  include/linux/swiotlb.h |    4 ----
>  lib/swiotlb.c   |    8 --------
>  2 files changed, 12 deletions(-)
> 
> diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
> index 5f81f8a..e237b6f 100644
> --- a/include/linux/swiotlb.h
> +++ b/include/linux/swiotlb.h
> @@ -72,10 +72,6 @@ extern void swiotlb_unmap_page(struct device *hwdev, 
> dma_addr_t dev_addr,
>  size_t size, enum dma_data_direction dir,
>  unsigned long attrs);
>  
> -extern int
> -swiotlb_map_sg(struct device *hwdev, struct scatterlist *sg, int nents,
> -enum dma_data_direction dir);
> -
>  extern void
>  swiotlb_unmap_sg(struct device *hwdev, struct scatterlist *sg, int nents,
>enum dma_data_direction dir);
> diff --git a/lib/swiotlb.c b/lib/swiotlb.c
> index 22e13a0..47aad37 100644
> --- a/lib/swiotlb.c
> +++ b/lib/swiotlb.c
> @@ -910,14 +910,6 @@ void swiotlb_unmap_page(struct device *hwdev, dma_addr_t 
> dev_addr,
>  }
>  EXPORT_SYMBOL(swiotlb_map_sg_attrs);
>  
> -int
> -swiotlb_map_sg(struct device *hwdev, struct scatterlist *sgl, int nelems,
> -enum dma_data_direction dir)
> -{
> - return swiotlb_map_sg_attrs(hwdev, sgl, nelems, dir, 0);
> -}
> -EXPORT_SYMBOL(swiotlb_map_sg);
> -
>  /*
>   * Unmap a set of streaming mode DMA translations.  Again, cpu read rules
>   * concerning calls here are the same as for swiotlb_unmap_page() above.
> 


[net-next PATCH RFC 26/26] igb: Revert "igb: Revert support for build_skb in igb"

2016-10-24 Thread Alexander Duyck
This reverts commit f9d40f6a9921 ("igb: Revert support for build_skb in
igb") and adds a few changes to update it to work with the latest version
of igb. We are now able to revert the removal of this due to the fact
that with the recent changes to the page count and the use of
DMA_ATTR_SKIP_CPU_SYNC we can make the pages writable so we should not be
invalidating the additional data added when we call build_skb.

The biggest risk with this change is that we are now not able to support
full jumbo frames when using build_skb.  Instead we can only support up to
2K minus the skb overhead and padding offset.
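The size limit can be estimated with back-of-envelope arithmetic. All constants below are assumed for illustration and vary by kernel config and architecture (sizeof(struct skb_shared_info) in particular):

```c
/* Assumed values: IGB_RX_BUFSZ = 2048, skb_shared_info ~ 320 bytes,
 * NET_SKB_PAD = 32, NET_IP_ALIGN = 2, IGB_TS_HDR_LEN = 16. */
static int igb_max_build_skb_size(int rx_bufsz, int shinfo_size,
                                  int skb_pad, int ts_hdr_len)
{
    /* SKB_WITH_OVERHEAD(x) is roughly x - sizeof(struct skb_shared_info);
     * IGB_SKB_PAD + IGB_TS_HDR_LEN is then subtracted on top. */
    return (rx_bufsz - shinfo_size) - (skb_pad + ts_hdr_len);
}
```

With these assumptions the build_skb path tops out well under 2K, which is why full jumbo frames fall back to the non-build_skb path.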

Signed-off-by: Alexander Duyck 
---
 drivers/net/ethernet/intel/igb/igb.h  |   29 ++
 drivers/net/ethernet/intel/igb/igb_main.c |  130 ++---
 2 files changed, 142 insertions(+), 17 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb.h 
b/drivers/net/ethernet/intel/igb/igb.h
index acbc3ab..c3420f3 100644
--- a/drivers/net/ethernet/intel/igb/igb.h
+++ b/drivers/net/ethernet/intel/igb/igb.h
@@ -145,6 +145,10 @@ struct vf_data_storage {
 #define IGB_RX_HDR_LEN IGB_RXBUFFER_256
 #define IGB_RX_BUFSZ   IGB_RXBUFFER_2048
 
+#define IGB_SKB_PAD    (NET_SKB_PAD + NET_IP_ALIGN)
+#define IGB_MAX_BUILD_SKB_SIZE \
+   (SKB_WITH_OVERHEAD(IGB_RX_BUFSZ) - (IGB_SKB_PAD + IGB_TS_HDR_LEN))
+
 /* How many Rx Buffers do we bundle into one write to the hardware ? */
 #define IGB_RX_BUFFER_WRITE    16 /* Must be power of 2 */
 
@@ -301,12 +305,29 @@ struct igb_q_vector {
 };
 
 enum e1000_ring_flags_t {
-   IGB_RING_FLAG_RX_SCTP_CSUM,
-   IGB_RING_FLAG_RX_LB_VLAN_BSWAP,
-   IGB_RING_FLAG_TX_CTX_IDX,
-   IGB_RING_FLAG_TX_DETECT_HANG
+   IGB_RING_FLAG_RX_SCTP_CSUM = 0,
+#if (NET_IP_ALIGN != 0)
+   IGB_RING_FLAG_RX_BUILD_SKB_ENABLED = 1,
+#endif
+   IGB_RING_FLAG_RX_LB_VLAN_BSWAP = 2,
+   IGB_RING_FLAG_TX_CTX_IDX = 3,
+   IGB_RING_FLAG_TX_DETECT_HANG = 4,
+#if (NET_IP_ALIGN == 0)
+#if (L1_CACHE_SHIFT < 5)
+   IGB_RING_FLAG_RX_BUILD_SKB_ENABLED = 5,
+#else
+   IGB_RING_FLAG_RX_BUILD_SKB_ENABLED = L1_CACHE_SHIFT,
+#endif
+#endif
 };
 
+#define ring_uses_build_skb(ring) \
+   test_bit(IGB_RING_FLAG_RX_BUILD_SKB_ENABLED, &(ring)->flags)
+#define set_ring_build_skb_enabled(ring) \
+   set_bit(IGB_RING_FLAG_RX_BUILD_SKB_ENABLED, &(ring)->flags)
+#define clear_ring_build_skb_enabled(ring) \
+   clear_bit(IGB_RING_FLAG_RX_BUILD_SKB_ENABLED, &(ring)->flags)
+
 #define IGB_TXD_DCMD (E1000_ADVTXD_DCMD_EOP | E1000_ADVTXD_DCMD_RS)
 
 #define IGB_RX_DESC(R, i)  \
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c 
b/drivers/net/ethernet/intel/igb/igb_main.c
index 83fdef6..7674a50 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -3761,6 +3761,16 @@ void igb_configure_rx_ring(struct igb_adapter *adapter,
wr32(E1000_RXDCTL(reg_idx), rxdctl);
 }
 
+static void igb_set_rx_buffer_len(struct igb_adapter *adapter,
+ struct igb_ring *rx_ring)
+{
+   /* set build_skb flag */
+   if (adapter->max_frame_size <= IGB_MAX_BUILD_SKB_SIZE)
+   set_ring_build_skb_enabled(rx_ring);
+   else
+   clear_ring_build_skb_enabled(rx_ring);
+}
+
 /**
  *  igb_configure_rx - Configure receive Unit after Reset
  *  @adapter: board private structure
@@ -3778,8 +3788,12 @@ static void igb_configure_rx(struct igb_adapter *adapter)
/* Setup the HW Rx Head and Tail Descriptor Pointers and
 * the Base and Length of the Rx Descriptor Ring
 */
-   for (i = 0; i < adapter->num_rx_queues; i++)
-   igb_configure_rx_ring(adapter, adapter->rx_ring[i]);
+   for (i = 0; i < adapter->num_rx_queues; i++) {
+   struct igb_ring *rx_ring = adapter->rx_ring[i];
+
+   igb_set_rx_buffer_len(adapter, rx_ring);
+   igb_configure_rx_ring(adapter, rx_ring);
+   }
 }
 
 /**
@@ -4238,7 +4252,7 @@ static void igb_set_rx_mode(struct net_device *netdev)
struct igb_adapter *adapter = netdev_priv(netdev);
	struct e1000_hw *hw = &adapter->hw;
unsigned int vfn = adapter->vfs_allocated_count;
-   u32 rctl = 0, vmolr = 0;
+   u32 rctl = 0, vmolr = 0, rlpml = MAX_JUMBO_FRAME_SIZE;
int count;
 
/* Check for Promiscuous and All Multicast modes */
@@ -4310,12 +4324,18 @@ static void igb_set_rx_mode(struct net_device *netdev)
vmolr |= rd32(E1000_VMOLR(vfn)) &
 ~(E1000_VMOLR_ROPE | E1000_VMOLR_MPME | E1000_VMOLR_ROMPE);
 
-   /* enable Rx jumbo frames, no need for restriction */
+   /* enable Rx jumbo frames, restrict as needed to support build_skb */
vmolr &= ~E1000_VMOLR_RLPML_MASK;
-   vmolr |= MAX_JUMBO_FRAME_SIZE | E1000_VMOLR_LPE;
+   vmolr |= E1000_VMOLR_LPE;
+   vmolr |= (adapter->max_frame_size <= IGB_MAX_BUILD_SKB_SIZE) ?
+

[net-next PATCH RFC 22/26] dma: Add calls for dma_map_page_attrs and dma_unmap_page_attrs

2016-10-24 Thread Alexander Duyck
Add support for mapping and unmapping a page with attributes.  The primary
use for this is currently to allow us to pass the
DMA_ATTR_SKIP_CPU_SYNC attribute when mapping and unmapping a page.  On
some architectures such as ARM the synchronization has significant overhead
and if we are already taking care of the sync_for_cpu and sync_for_device
from the driver there isn't much need to handle this in the map/unmap calls
as well.

Signed-off-by: Alexander Duyck 
---
 include/linux/dma-mapping.h |   20 +---
 1 file changed, 13 insertions(+), 7 deletions(-)

diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 08528af..10c5a17 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -243,29 +243,33 @@ static inline void dma_unmap_sg_attrs(struct device *dev, 
struct scatterlist *sg
ops->unmap_sg(dev, sg, nents, dir, attrs);
 }
 
-static inline dma_addr_t dma_map_page(struct device *dev, struct page *page,
- size_t offset, size_t size,
- enum dma_data_direction dir)
+static inline dma_addr_t dma_map_page_attrs(struct device *dev,
+   struct page *page,
+   size_t offset, size_t size,
+   enum dma_data_direction dir,
+   unsigned long attrs)
 {
struct dma_map_ops *ops = get_dma_ops(dev);
dma_addr_t addr;
 
kmemcheck_mark_initialized(page_address(page) + offset, size);
BUG_ON(!valid_dma_direction(dir));
-   addr = ops->map_page(dev, page, offset, size, dir, 0);
+   addr = ops->map_page(dev, page, offset, size, dir, attrs);
debug_dma_map_page(dev, page, offset, size, dir, addr, false);
 
return addr;
 }
 
-static inline void dma_unmap_page(struct device *dev, dma_addr_t addr,
- size_t size, enum dma_data_direction dir)
+static inline void dma_unmap_page_attrs(struct device *dev,
+   dma_addr_t addr, size_t size,
+   enum dma_data_direction dir,
+   unsigned long attrs)
 {
struct dma_map_ops *ops = get_dma_ops(dev);
 
BUG_ON(!valid_dma_direction(dir));
if (ops->unmap_page)
-   ops->unmap_page(dev, addr, size, dir, 0);
+   ops->unmap_page(dev, addr, size, dir, attrs);
debug_dma_unmap_page(dev, addr, size, dir, false);
 }
 
@@ -385,6 +389,8 @@ static inline void dma_sync_single_range_for_device(struct 
device *dev,
 #define dma_unmap_single(d, a, s, r) dma_unmap_single_attrs(d, a, s, r, 0)
 #define dma_map_sg(d, s, n, r) dma_map_sg_attrs(d, s, n, r, 0)
 #define dma_unmap_sg(d, s, n, r) dma_unmap_sg_attrs(d, s, n, r, 0)
+#define dma_map_page(d, p, o, s, r) dma_map_page_attrs(d, p, o, s, r, 0)
+#define dma_unmap_page(d, a, s, r) dma_unmap_page_attrs(d, a, s, r, 0)
 
 extern int dma_common_mmap(struct device *dev, struct vm_area_struct *vma,
   void *cpu_addr, dma_addr_t dma_addr, size_t size);



[net-next PATCH RFC 20/26] arch/tile: Add option to skip DMA sync as a part of map and unmap

2016-10-24 Thread Alexander Duyck
This change allows us to pass DMA_ATTR_SKIP_CPU_SYNC, so that we can
avoid invoking cache line invalidation if the driver will just handle it
via a sync_for_cpu or sync_for_device call.

Cc: Chris Metcalf 
Signed-off-by: Alexander Duyck 
---
 arch/tile/kernel/pci-dma.c |   12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/arch/tile/kernel/pci-dma.c b/arch/tile/kernel/pci-dma.c
index 09bb774..24e0f8c 100644
--- a/arch/tile/kernel/pci-dma.c
+++ b/arch/tile/kernel/pci-dma.c
@@ -213,10 +213,12 @@ static int tile_dma_map_sg(struct device *dev, struct 
scatterlist *sglist,
 
for_each_sg(sglist, sg, nents, i) {
sg->dma_address = sg_phys(sg);
-   __dma_prep_pa_range(sg->dma_address, sg->length, direction);
 #ifdef CONFIG_NEED_SG_DMA_LENGTH
sg->dma_length = sg->length;
 #endif
+   if (attrs & DMA_ATTR_SKIP_CPU_SYNC)
+   continue;
+   __dma_prep_pa_range(sg->dma_address, sg->length, direction);
}
 
return nents;
@@ -232,6 +234,8 @@ static void tile_dma_unmap_sg(struct device *dev, struct 
scatterlist *sglist,
BUG_ON(!valid_dma_direction(direction));
for_each_sg(sglist, sg, nents, i) {
sg->dma_address = sg_phys(sg);
+   if (attrs & DMA_ATTR_SKIP_CPU_SYNC)
+   continue;
__dma_complete_pa_range(sg->dma_address, sg->length,
direction);
}
@@ -245,7 +249,8 @@ static dma_addr_t tile_dma_map_page(struct device *dev, 
struct page *page,
BUG_ON(!valid_dma_direction(direction));
 
BUG_ON(offset + size > PAGE_SIZE);
-   __dma_prep_page(page, offset, size, direction);
+   if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
+   __dma_prep_page(page, offset, size, direction);
 
return page_to_pa(page) + offset;
 }
@@ -256,6 +261,9 @@ static void tile_dma_unmap_page(struct device *dev, 
dma_addr_t dma_address,
 {
BUG_ON(!valid_dma_direction(direction));
 
+   if (attrs & DMA_ATTR_SKIP_CPU_SYNC)
+   return;
+
__dma_complete_page(pfn_to_page(PFN_DOWN(dma_address)),
dma_address & (PAGE_SIZE - 1), size, direction);
 }



[net-next PATCH RFC 24/26] igb: Update driver to make use of DMA_ATTR_SKIP_CPU_SYNC

2016-10-24 Thread Alexander Duyck
The ARM architecture provides a mechanism for deferring cache line
invalidation in the case of map/unmap.  This patch makes use of this
mechanism to avoid unnecessary synchronization.

A secondary effect of this change is that the portion of the page that has
been synchronized for use by the CPU should be writable and could be passed
up the stack (at least on ARM).

The last bit that occurred to me is that on architectures where the
sync_for_cpu call invalidates cache lines, we were prefetching and then
immediately invalidating the first 128 bytes of the packet.  To avoid
that I have moved the sync up to before we perform the prefetch and
allocate the skbuff, so that we can actually make use of it.

Signed-off-by: Alexander Duyck 
---
 drivers/net/ethernet/intel/igb/igb_main.c |   53 ++---
 1 file changed, 33 insertions(+), 20 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c 
b/drivers/net/ethernet/intel/igb/igb_main.c
index 4feca69..c8c458c 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -3947,10 +3947,21 @@ static void igb_clean_rx_ring(struct igb_ring *rx_ring)
if (!buffer_info->page)
continue;
 
-   dma_unmap_page(rx_ring->dev,
-  buffer_info->dma,
-  PAGE_SIZE,
-  DMA_FROM_DEVICE);
+   /* Invalidate cache lines that may have been written to by
+* device so that we avoid corrupting memory.
+*/
+   dma_sync_single_range_for_cpu(rx_ring->dev,
+ buffer_info->dma,
+ buffer_info->page_offset,
+ IGB_RX_BUFSZ,
+ DMA_FROM_DEVICE);
+
+   /* free resources associated with mapping */
+   dma_unmap_page_attrs(rx_ring->dev,
+buffer_info->dma,
+PAGE_SIZE,
+DMA_FROM_DEVICE,
+DMA_ATTR_SKIP_CPU_SYNC);
__free_page(buffer_info->page);
 
buffer_info->page = NULL;
@@ -6808,12 +6819,6 @@ static void igb_reuse_rx_page(struct igb_ring *rx_ring,
 
/* transfer page from old buffer to new buffer */
*new_buff = *old_buff;
-
-   /* sync the buffer for use by the device */
-   dma_sync_single_range_for_device(rx_ring->dev, old_buff->dma,
-old_buff->page_offset,
-IGB_RX_BUFSZ,
-DMA_FROM_DEVICE);
 }
 
 static inline bool igb_page_is_reserved(struct page *page)
@@ -6934,6 +6939,13 @@ static struct sk_buff *igb_fetch_rx_buffer(struct 
igb_ring *rx_ring,
page = rx_buffer->page;
prefetchw(page);
 
+   /* we are reusing so sync this buffer for CPU use */
+   dma_sync_single_range_for_cpu(rx_ring->dev,
+ rx_buffer->dma,
+ rx_buffer->page_offset,
+ size,
+ DMA_FROM_DEVICE);
+
if (likely(!skb)) {
void *page_addr = page_address(page) +
  rx_buffer->page_offset;
@@ -6958,21 +6970,15 @@ static struct sk_buff *igb_fetch_rx_buffer(struct 
igb_ring *rx_ring,
prefetchw(skb->data);
}
 
-   /* we are reusing so sync this buffer for CPU use */
-   dma_sync_single_range_for_cpu(rx_ring->dev,
- rx_buffer->dma,
- rx_buffer->page_offset,
- size,
- DMA_FROM_DEVICE);
-
/* pull page into skb */
if (igb_add_rx_frag(rx_ring, rx_buffer, size, rx_desc, skb)) {
/* hand second half of page back to the ring */
igb_reuse_rx_page(rx_ring, rx_buffer);
} else {
/* we are not reusing the buffer so unmap it */
-   dma_unmap_page(rx_ring->dev, rx_buffer->dma,
-  PAGE_SIZE, DMA_FROM_DEVICE);
+   dma_unmap_page_attrs(rx_ring->dev, rx_buffer->dma,
+PAGE_SIZE, DMA_FROM_DEVICE,
+DMA_ATTR_SKIP_CPU_SYNC);
}
 
/* clear contents of rx_buffer */
@@ -7230,7 +7236,8 @@ static bool igb_alloc_mapped_page(struct igb_ring 
*rx_ring,
}
 
/* map page for use */
-   dma = dma_map_page(rx_ring->dev, page, 0, PAGE_SIZE, DMA_FROM_DEVICE);
+   dma = dma_map_page_attrs(rx_ring->dev, page, 0, PAGE_SIZE,
+DMA_FROM_DEVICE, 

[net-next PATCH RFC 15/26] arch/openrisc: Add option to skip DMA sync as a part of mapping

2016-10-24 Thread Alexander Duyck
This change allows us to pass DMA_ATTR_SKIP_CPU_SYNC, so that we can
avoid invoking cache line invalidation if the driver will just handle it
via a sync_for_cpu or sync_for_device call.

Cc: Jonas Bonn 
Signed-off-by: Alexander Duyck 
---
 arch/openrisc/kernel/dma.c |3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/openrisc/kernel/dma.c b/arch/openrisc/kernel/dma.c
index 140c991..906998b 100644
--- a/arch/openrisc/kernel/dma.c
+++ b/arch/openrisc/kernel/dma.c
@@ -141,6 +141,9 @@
unsigned long cl;
dma_addr_t addr = page_to_phys(page) + offset;
 
+   if (attrs & DMA_ATTR_SKIP_CPU_SYNC)
+   return addr;
+
switch (dir) {
case DMA_TO_DEVICE:
/* Flush the dcache for the requested range */



[net-next PATCH RFC 09/26] arch/hexagon: Add option to skip DMA sync as a part of mapping

2016-10-24 Thread Alexander Duyck
This change allows us to pass DMA_ATTR_SKIP_CPU_SYNC, so that we can
avoid invoking cache line invalidation if the driver will just handle it
later via a sync_for_cpu or sync_for_device call.

Cc: Richard Kuo 
Cc: linux-hexa...@vger.kernel.org
Signed-off-by: Alexander Duyck 
---
 arch/hexagon/kernel/dma.c |6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/hexagon/kernel/dma.c b/arch/hexagon/kernel/dma.c
index b901778..dbc4f10 100644
--- a/arch/hexagon/kernel/dma.c
+++ b/arch/hexagon/kernel/dma.c
@@ -119,6 +119,9 @@ static int hexagon_map_sg(struct device *hwdev, struct 
scatterlist *sg,
 
s->dma_length = s->length;
 
+   if (attrs & DMA_ATTR_SKIP_CPU_SYNC)
+   continue;
+
flush_dcache_range(dma_addr_to_virt(s->dma_address),
   dma_addr_to_virt(s->dma_address + 
s->length));
}
@@ -180,7 +183,8 @@ static dma_addr_t hexagon_map_page(struct device *dev, 
struct page *page,
if (!check_addr("map_single", dev, bus, size))
return bad_dma_address;
 
-   dma_sync(dma_addr_to_virt(bus), size, dir);
+   if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
+   dma_sync(dma_addr_to_virt(bus), size, dir);
 
return bus;
 }



[net-next PATCH RFC 08/26] arch/frv: Add option to skip sync on DMA map

2016-10-24 Thread Alexander Duyck
The use of DMA_ATTR_SKIP_CPU_SYNC was not consistent across all of the DMA
APIs in the arch/frv folder.  This change is meant to correct that so that
we get consistent behavior.

Signed-off-by: Alexander Duyck 
---
 arch/frv/mb93090-mb00/pci-dma-nommu.c |   16 +++-
 arch/frv/mb93090-mb00/pci-dma.c   |7 ++-
 2 files changed, 17 insertions(+), 6 deletions(-)

diff --git a/arch/frv/mb93090-mb00/pci-dma-nommu.c 
b/arch/frv/mb93090-mb00/pci-dma-nommu.c
index 90f2e4c..ff606d1 100644
--- a/arch/frv/mb93090-mb00/pci-dma-nommu.c
+++ b/arch/frv/mb93090-mb00/pci-dma-nommu.c
@@ -109,16 +109,19 @@ static int frv_dma_map_sg(struct device *dev, struct 
scatterlist *sglist,
int nents, enum dma_data_direction direction,
unsigned long attrs)
 {
-   int i;
struct scatterlist *sg;
+   int i;
+
+   WARN_ON(direction == DMA_NONE);
+
+   if (attrs & DMA_ATTR_SKIP_CPU_SYNC)
+   return nents;
 
for_each_sg(sglist, sg, nents, i) {
frv_cache_wback_inv(sg_dma_address(sg),
sg_dma_address(sg) + sg_dma_len(sg));
}
 
-   BUG_ON(direction == DMA_NONE);
-
return nents;
 }
 
@@ -126,8 +129,11 @@ static dma_addr_t frv_dma_map_page(struct device *dev, 
struct page *page,
unsigned long offset, size_t size,
enum dma_data_direction direction, unsigned long attrs)
 {
-   BUG_ON(direction == DMA_NONE);
-   flush_dcache_page(page);
+   WARN_ON(direction == DMA_NONE);
+
+   if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
+   flush_dcache_page(page);
+
return (dma_addr_t) page_to_phys(page) + offset;
 }
 
diff --git a/arch/frv/mb93090-mb00/pci-dma.c b/arch/frv/mb93090-mb00/pci-dma.c
index f585745..ee5dadf 100644
--- a/arch/frv/mb93090-mb00/pci-dma.c
+++ b/arch/frv/mb93090-mb00/pci-dma.c
@@ -52,6 +52,9 @@ static int frv_dma_map_sg(struct device *dev, struct 
scatterlist *sglist,
for_each_sg(sglist, sg, nents, i) {
vaddr = kmap_atomic_primary(sg_page(sg));
 
+   if (attrs & DMA_ATTR_SKIP_CPU_SYNC)
+   continue;
+
frv_dcache_writeback((unsigned long) vaddr,
 (unsigned long) vaddr + PAGE_SIZE);
 
@@ -70,7 +73,9 @@ static dma_addr_t frv_dma_map_page(struct device *dev, struct 
page *page,
unsigned long offset, size_t size,
enum dma_data_direction direction, unsigned long attrs)
 {
-   flush_dcache_page(page);
+   if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
+   flush_dcache_page(page);
+
return (dma_addr_t) page_to_phys(page) + offset;
 }
 



[net-next PATCH RFC 21/26] arch/xtensa: Add option to skip DMA sync as a part of mapping

2016-10-24 Thread Alexander Duyck
This change allows us to pass DMA_ATTR_SKIP_CPU_SYNC, so that we can
avoid invoking cache line invalidation if the driver will just handle it
via a sync_for_cpu or sync_for_device call.

Cc: Max Filippov 
Signed-off-by: Alexander Duyck 
---
 arch/xtensa/kernel/pci-dma.c |7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/xtensa/kernel/pci-dma.c b/arch/xtensa/kernel/pci-dma.c
index 1e68806..6a16dec 100644
--- a/arch/xtensa/kernel/pci-dma.c
+++ b/arch/xtensa/kernel/pci-dma.c
@@ -189,7 +189,9 @@ static dma_addr_t xtensa_map_page(struct device *dev, 
struct page *page,
 {
dma_addr_t dma_handle = page_to_phys(page) + offset;
 
-   xtensa_sync_single_for_device(dev, dma_handle, size, dir);
+   if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
+   xtensa_sync_single_for_device(dev, dma_handle, size, dir);
+
return dma_handle;
 }
 
@@ -197,7 +199,8 @@ static void xtensa_unmap_page(struct device *dev, 
dma_addr_t dma_handle,
  size_t size, enum dma_data_direction dir,
  unsigned long attrs)
 {
-   xtensa_sync_single_for_cpu(dev, dma_handle, size, dir);
+   if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
+   xtensa_sync_single_for_cpu(dev, dma_handle, size, dir);
 }
 
 static int xtensa_map_sg(struct device *dev, struct scatterlist *sg,



[net-next PATCH RFC 25/26] igb: Update code to better handle incrementing page count

2016-10-24 Thread Alexander Duyck
This patch updates the driver code so that we do bulk updates of the page
reference count instead of just incrementing it by one reference at a time.
The advantage of doing this is that we cut down on atomic operations, which
in turn should give us a slight improvement in cycles per packet.  In
addition, if we eventually move this over to using build_skb the gains will
be more noticeable.

Signed-off-by: Alexander Duyck 
---
 drivers/net/ethernet/intel/igb/igb.h  |7 ++-
 drivers/net/ethernet/intel/igb/igb_main.c |   24 +---
 2 files changed, 23 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb.h 
b/drivers/net/ethernet/intel/igb/igb.h
index d11093d..acbc3ab 100644
--- a/drivers/net/ethernet/intel/igb/igb.h
+++ b/drivers/net/ethernet/intel/igb/igb.h
@@ -210,7 +210,12 @@ struct igb_tx_buffer {
 struct igb_rx_buffer {
dma_addr_t dma;
struct page *page;
-   unsigned int page_offset;
+#if (BITS_PER_LONG > 32) || (PAGE_SIZE >= 65536)
+   __u32 page_offset;
+#else
+   __u16 page_offset;
+#endif
+   __u16 pagecnt_bias;
 };
 
 struct igb_tx_queue_stats {
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c 
b/drivers/net/ethernet/intel/igb/igb_main.c
index c8c458c..83fdef6 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -3962,7 +3962,8 @@ static void igb_clean_rx_ring(struct igb_ring *rx_ring)
 PAGE_SIZE,
 DMA_FROM_DEVICE,
 DMA_ATTR_SKIP_CPU_SYNC);
-   __free_page(buffer_info->page);
+   __page_frag_drain(buffer_info->page, 0,
+ buffer_info->pagecnt_bias);
 
buffer_info->page = NULL;
}
@@ -6830,13 +6831,15 @@ static bool igb_can_reuse_rx_page(struct igb_rx_buffer 
*rx_buffer,
  struct page *page,
  unsigned int truesize)
 {
+   unsigned int pagecnt_bias = rx_buffer->pagecnt_bias--;
+
/* avoid re-using remote pages */
if (unlikely(igb_page_is_reserved(page)))
return false;
 
 #if (PAGE_SIZE < 8192)
/* if we are only owner of page we can reuse it */
-   if (unlikely(page_count(page) != 1))
+   if (unlikely(page_ref_count(page) != pagecnt_bias))
return false;
 
/* flip page offset to other buffer */
@@ -6849,10 +6852,14 @@ static bool igb_can_reuse_rx_page(struct igb_rx_buffer 
*rx_buffer,
return false;
 #endif
 
-   /* Even if we own the page, we are not allowed to use atomic_set()
-* This would break get_page_unless_zero() users.
+   /* If we have drained the page fragment pool we need to update
+* the pagecnt_bias and page count so that we fully restock the
+* number of references the driver holds.
 */
-   page_ref_inc(page);
+   if (unlikely(!rx_buffer->pagecnt_bias)) {
+   page_ref_add(page, USHRT_MAX);
+   rx_buffer->pagecnt_bias = USHRT_MAX;
+   }
 
return true;
 }
@@ -6904,7 +6911,6 @@ static bool igb_add_rx_frag(struct igb_ring *rx_ring,
return true;
 
/* this page cannot be reused so discard it */
-   __free_page(page);
return false;
}
 
@@ -6975,10 +6981,13 @@ static struct sk_buff *igb_fetch_rx_buffer(struct 
igb_ring *rx_ring,
/* hand second half of page back to the ring */
igb_reuse_rx_page(rx_ring, rx_buffer);
} else {
-   /* we are not reusing the buffer so unmap it */
+   /* We are not reusing the buffer so unmap it and free
+* any references we are holding to it
+*/
dma_unmap_page_attrs(rx_ring->dev, rx_buffer->dma,
 PAGE_SIZE, DMA_FROM_DEVICE,
 DMA_ATTR_SKIP_CPU_SYNC);
+   __page_frag_drain(page, 0, rx_buffer->pagecnt_bias);
}
 
/* clear contents of rx_buffer */
@@ -7252,6 +7261,7 @@ static bool igb_alloc_mapped_page(struct igb_ring 
*rx_ring,
bi->dma = dma;
bi->page = page;
bi->page_offset = 0;
+   bi->pagecnt_bias = 1;
 
return true;
 }



Re: [net-next PATCH RFC 02/26] swiotlb: Add support for DMA_ATTR_SKIP_CPU_SYNC

2016-10-24 Thread Konrad Rzeszutek Wilk
On Mon, Oct 24, 2016 at 08:04:37AM -0400, Alexander Duyck wrote:
> As a first step to making DMA_ATTR_SKIP_CPU_SYNC apply to architectures
> beyond just ARM I need to make it so that the swiotlb will respect the
> flag.  In order to do that I also need to update the swiotlb-xen since it
> heavily makes use of the functionality.
> 
> Cc: Konrad Rzeszutek Wilk 
> Signed-off-by: Alexander Duyck 
> ---
>  drivers/xen/swiotlb-xen.c |   40 ++
>  include/linux/swiotlb.h   |6 --
>  lib/swiotlb.c |   48 
> +++--
>  3 files changed, 56 insertions(+), 38 deletions(-)
> 
> diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
> index 87e6035..cf047d8 100644
> --- a/drivers/xen/swiotlb-xen.c
> +++ b/drivers/xen/swiotlb-xen.c
> @@ -405,7 +405,8 @@ dma_addr_t xen_swiotlb_map_page(struct device *dev, 
> struct page *page,
>*/
>   trace_swiotlb_bounced(dev, dev_addr, size, swiotlb_force);
>  
> - map = swiotlb_tbl_map_single(dev, start_dma_addr, phys, size, dir);
> + map = swiotlb_tbl_map_single(dev, start_dma_addr, phys, size, dir,
> +  attrs);
>   if (map == SWIOTLB_MAP_ERROR)
>   return DMA_ERROR_CODE;
>  
> @@ -416,11 +417,13 @@ dma_addr_t xen_swiotlb_map_page(struct device *dev, 
> struct page *page,
>   /*
>* Ensure that the address returned is DMA'ble
>*/
> - if (!dma_capable(dev, dev_addr, size)) {
> - swiotlb_tbl_unmap_single(dev, map, size, dir);
> - dev_addr = 0;
> - }
> - return dev_addr;
> + if (dma_capable(dev, dev_addr, size))
> + return dev_addr;
> +
> + swiotlb_tbl_unmap_single(dev, map, size, dir,
> +  attrs | DMA_ATTR_SKIP_CPU_SYNC);
> +
> + return DMA_ERROR_CODE;

Why? This change (re-ordering the code - and returning DMA_ERROR_CODE instead
of 0) does not have anything to do with the title.

If you really feel strongly about it - then please send it as a separate patch.
>  }
>  EXPORT_SYMBOL_GPL(xen_swiotlb_map_page);
>  
> @@ -444,7 +447,7 @@ static void xen_unmap_single(struct device *hwdev, 
> dma_addr_t dev_addr,
>  
>   /* NOTE: We use dev_addr here, not paddr! */
>   if (is_xen_swiotlb_buffer(dev_addr)) {
> - swiotlb_tbl_unmap_single(hwdev, paddr, size, dir);
> + swiotlb_tbl_unmap_single(hwdev, paddr, size, dir, attrs);
>   return;
>   }
>  
> @@ -557,16 +560,9 @@ void xen_swiotlb_unmap_page(struct device *hwdev, 
> dma_addr_t dev_addr,
>start_dma_addr,
>sg_phys(sg),
>sg->length,
> -  dir);
> - if (map == SWIOTLB_MAP_ERROR) {
> - dev_warn(hwdev, "swiotlb buffer is full\n");
> - /* Don't panic here, we expect map_sg users
> -to do proper error handling. */
> - xen_swiotlb_unmap_sg_attrs(hwdev, sgl, i, dir,
> -attrs);
> - sg_dma_len(sgl) = 0;
> - return 0;
> - }
> +  dir, attrs);
> + if (map == SWIOTLB_MAP_ERROR)
> + goto map_error;
>   xen_dma_map_page(hwdev, pfn_to_page(map >> PAGE_SHIFT),
>   dev_addr,
>   map & ~PAGE_MASK,
> @@ -589,6 +585,16 @@ void xen_swiotlb_unmap_page(struct device *hwdev, 
> dma_addr_t dev_addr,
>   sg_dma_len(sg) = sg->length;
>   }
>   return nelems;
> +map_error:
> + dev_warn(hwdev, "swiotlb buffer is full\n");
> + /*
> +  * Don't panic here, we expect map_sg users
> +  * to do proper error handling.
> +  */
> + xen_swiotlb_unmap_sg_attrs(hwdev, sgl, i, dir,
> +attrs | DMA_ATTR_SKIP_CPU_SYNC);
> + sg_dma_len(sgl) = 0;
> + return 0;
>  }

This too. Why can't that be part of the existing code that was there?


[net-next PATCH RFC 00/26] Add support for DMA writable pages being writable by the network stack

2016-10-24 Thread Alexander Duyck
The first 21 patches in the set add support for the DMA attribute
DMA_ATTR_SKIP_CPU_SYNC on multiple platforms/architectures.  This is needed
so that we can flag the dma_map/unmap_page calls not to invalidate cache
lines that do not currently belong to the device.  Instead we have to take
care of this in the driver via a call to sync_single_range_for_cpu prior
to freeing the Rx page.

Patch 22 adds support for dma_map_page_attrs and dma_unmap_page_attrs so
that we can unmap and map a page using the DMA_ATTR_SKIP_CPU_SYNC
attribute.

Patch 23 adds support for freeing a page that has multiple references being
held by a single caller.  This way we can free page fragments that were
allocated by a given driver.

The last 3 patches use these updates in the igb driver to allow for us to
reimplement the use of build_skb which hands a writable page off to the
stack.

My hope is to get the series accepted into the net-next tree as I have a
number of other Intel drivers I could then begin updating once these
patches are accepted.

Any feedback is welcome.  Specifically, if there is something I overlooked
design-wise or an architecture I missed, please let me know and I will add
it to this patch set.  If needed I can look into breaking this into a
smaller set of patches, but this set is all that should be needed to start
looking at putting together a per-device DMA page pool, which I know is
something Jesper has been working on.

---

Alexander Duyck (26):
  swiotlb: Drop unused function swiotlb_map_sg
  swiotlb: Add support for DMA_ATTR_SKIP_CPU_SYNC
  arch/arc: Add option to skip sync on DMA mapping
  arch/arm: Add option to skip sync on DMA map and unmap
  arch/avr32: Add option to skip sync on DMA map
  arch/blackfin: Add option to skip sync on DMA map
  arch/c6x: Add option to skip sync on DMA map and unmap
  arch/frv: Add option to skip sync on DMA map
  arch/hexagon: Add option to skip DMA sync as a part of mapping
  arch/m68k: Add option to skip DMA sync as a part of mapping
  arch/metag: Add option to skip DMA sync as a part of map and unmap
  arch/microblaze: Add option to skip DMA sync as a part of map and unmap
  arch/mips: Add option to skip DMA sync as a part of map and unmap
  arch/nios2: Add option to skip DMA sync as a part of map and unmap
  arch/openrisc: Add option to skip DMA sync as a part of mapping
  arch/parisc: Add option to skip DMA sync as a part of map and unmap
  arch/powerpc: Add option to skip DMA sync as a part of mapping
  arch/sh: Add option to skip DMA sync as a part of mapping
  arch/sparc: Add option to skip DMA sync as a part of map and unmap
  arch/tile: Add option to skip DMA sync as a part of map and unmap
  arch/xtensa: Add option to skip DMA sync as a part of mapping
  dma: Add calls for dma_map_page_attrs and dma_unmap_page_attrs
  mm: Add support for releasing multiple instances of a page
  igb: Update driver to make use of DMA_ATTR_SKIP_CPU_SYNC
  igb: Update code to better handle incrementing page count
  igb: Revert "igb: Revert support for build_skb in igb"


 arch/arc/mm/dma.c |3 
 arch/arm/common/dmabounce.c   |   16 +-
 arch/avr32/mm/dma-coherent.c  |7 +
 arch/blackfin/kernel/dma-mapping.c|7 +
 arch/c6x/kernel/dma.c |   16 ++
 arch/frv/mb93090-mb00/pci-dma-nommu.c |   16 ++
 arch/frv/mb93090-mb00/pci-dma.c   |7 +
 arch/hexagon/kernel/dma.c |6 +
 arch/m68k/kernel/dma.c|8 +
 arch/metag/kernel/dma.c   |   16 ++
 arch/microblaze/kernel/dma.c  |   10 +
 arch/mips/loongson64/common/dma-swiotlb.c |2 
 arch/mips/mm/dma-default.c|8 +
 arch/nios2/mm/dma-mapping.c   |   14 ++
 arch/openrisc/kernel/dma.c|3 
 arch/parisc/kernel/pci-dma.c  |   20 ++-
 arch/powerpc/kernel/dma.c |9 +
 arch/sh/kernel/dma-nommu.c|7 +
 arch/sparc/kernel/iommu.c |4 -
 arch/sparc/kernel/ioport.c|4 -
 arch/tile/kernel/pci-dma.c|   12 +-
 arch/xtensa/kernel/pci-dma.c  |7 +
 drivers/net/ethernet/intel/igb/igb.h  |   36 -
 drivers/net/ethernet/intel/igb/igb_main.c |  207 +++--
 drivers/xen/swiotlb-xen.c |   40 +++---
 include/linux/dma-mapping.h   |   20 ++-
 include/linux/gfp.h   |2 
 include/linux/swiotlb.h   |   10 +
 lib/swiotlb.c |   56 
 mm/page_alloc.c   |   14 ++
 30 files changed, 435 insertions(+), 152 deletions(-)



[net-next PATCH RFC 02/26] swiotlb: Add support for DMA_ATTR_SKIP_CPU_SYNC

2016-10-24 Thread Alexander Duyck
As a first step to making DMA_ATTR_SKIP_CPU_SYNC apply to architectures
beyond just ARM I need to make it so that the swiotlb will respect the
flag.  In order to do that I also need to update the swiotlb-xen since it
heavily makes use of the functionality.

Cc: Konrad Rzeszutek Wilk 
Signed-off-by: Alexander Duyck 
---
 drivers/xen/swiotlb-xen.c |   40 ++
 include/linux/swiotlb.h   |6 --
 lib/swiotlb.c |   48 +++--
 3 files changed, 56 insertions(+), 38 deletions(-)

diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
index 87e6035..cf047d8 100644
--- a/drivers/xen/swiotlb-xen.c
+++ b/drivers/xen/swiotlb-xen.c
@@ -405,7 +405,8 @@ dma_addr_t xen_swiotlb_map_page(struct device *dev, struct 
page *page,
 */
trace_swiotlb_bounced(dev, dev_addr, size, swiotlb_force);
 
-   map = swiotlb_tbl_map_single(dev, start_dma_addr, phys, size, dir);
+   map = swiotlb_tbl_map_single(dev, start_dma_addr, phys, size, dir,
+attrs);
if (map == SWIOTLB_MAP_ERROR)
return DMA_ERROR_CODE;
 
@@ -416,11 +417,13 @@ dma_addr_t xen_swiotlb_map_page(struct device *dev, 
struct page *page,
/*
 * Ensure that the address returned is DMA'ble
 */
-   if (!dma_capable(dev, dev_addr, size)) {
-   swiotlb_tbl_unmap_single(dev, map, size, dir);
-   dev_addr = 0;
-   }
-   return dev_addr;
+   if (dma_capable(dev, dev_addr, size))
+   return dev_addr;
+
+   swiotlb_tbl_unmap_single(dev, map, size, dir,
+attrs | DMA_ATTR_SKIP_CPU_SYNC);
+
+   return DMA_ERROR_CODE;
 }
 EXPORT_SYMBOL_GPL(xen_swiotlb_map_page);
 
@@ -444,7 +447,7 @@ static void xen_unmap_single(struct device *hwdev, 
dma_addr_t dev_addr,
 
/* NOTE: We use dev_addr here, not paddr! */
if (is_xen_swiotlb_buffer(dev_addr)) {
-   swiotlb_tbl_unmap_single(hwdev, paddr, size, dir);
+   swiotlb_tbl_unmap_single(hwdev, paddr, size, dir, attrs);
return;
}
 
@@ -557,16 +560,9 @@ void xen_swiotlb_unmap_page(struct device *hwdev, 
dma_addr_t dev_addr,
 start_dma_addr,
 sg_phys(sg),
 sg->length,
-dir);
-   if (map == SWIOTLB_MAP_ERROR) {
-   dev_warn(hwdev, "swiotlb buffer is full\n");
-   /* Don't panic here, we expect map_sg users
-  to do proper error handling. */
-   xen_swiotlb_unmap_sg_attrs(hwdev, sgl, i, dir,
-  attrs);
-   sg_dma_len(sgl) = 0;
-   return 0;
-   }
+dir, attrs);
+   if (map == SWIOTLB_MAP_ERROR)
+   goto map_error;
xen_dma_map_page(hwdev, pfn_to_page(map >> PAGE_SHIFT),
dev_addr,
map & ~PAGE_MASK,
@@ -589,6 +585,16 @@ void xen_swiotlb_unmap_page(struct device *hwdev, 
dma_addr_t dev_addr,
sg_dma_len(sg) = sg->length;
}
return nelems;
+map_error:
+   dev_warn(hwdev, "swiotlb buffer is full\n");
+   /*
+* Don't panic here, we expect map_sg users
+* to do proper error handling.
+*/
+   xen_swiotlb_unmap_sg_attrs(hwdev, sgl, i, dir,
+  attrs | DMA_ATTR_SKIP_CPU_SYNC);
+   sg_dma_len(sgl) = 0;
+   return 0;
 }
 EXPORT_SYMBOL_GPL(xen_swiotlb_map_sg_attrs);
 
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index e237b6f..4517be9 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -44,11 +44,13 @@ enum dma_sync_target {
 extern phys_addr_t swiotlb_tbl_map_single(struct device *hwdev,
  dma_addr_t tbl_dma_addr,
  phys_addr_t phys, size_t size,
- enum dma_data_direction dir);
+ enum dma_data_direction dir,
+ unsigned long attrs);
 
 extern void swiotlb_tbl_unmap_single(struct device *hwdev,
 phys_addr_t tlb_addr,
-size_t size, enum dma_data_direction dir);
+size_t size, enum 

[net-next PATCH RFC 13/26] arch/mips: Add option to skip DMA sync as a part of map and unmap

2016-10-24 Thread Alexander Duyck
This change allows us to pass DMA_ATTR_SKIP_CPU_SYNC which allows us to
avoid invoking cache line invalidation if the driver will just handle it
via a sync_for_cpu or sync_for_device call.

Cc: Ralf Baechle 
Cc: Keguang Zhang 
Cc: linux-m...@linux-mips.org
Signed-off-by: Alexander Duyck 
---
 arch/mips/loongson64/common/dma-swiotlb.c |    2 +-
 arch/mips/mm/dma-default.c                |    8 +++++---
 2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/mips/loongson64/common/dma-swiotlb.c 
b/arch/mips/loongson64/common/dma-swiotlb.c
index 1a80b6f..aab4fd6 100644
--- a/arch/mips/loongson64/common/dma-swiotlb.c
+++ b/arch/mips/loongson64/common/dma-swiotlb.c
@@ -61,7 +61,7 @@ static int loongson_dma_map_sg(struct device *dev, struct 
scatterlist *sg,
int nents, enum dma_data_direction dir,
unsigned long attrs)
 {
-   int r = swiotlb_map_sg_attrs(dev, sg, nents, dir, 0);
+   int r = swiotlb_map_sg_attrs(dev, sg, nents, dir, attrs);
mb();
 
return r;
diff --git a/arch/mips/mm/dma-default.c b/arch/mips/mm/dma-default.c
index b2eadd6..dd998d7 100644
--- a/arch/mips/mm/dma-default.c
+++ b/arch/mips/mm/dma-default.c
@@ -293,7 +293,7 @@ static inline void __dma_sync(struct page *page,
 static void mips_dma_unmap_page(struct device *dev, dma_addr_t dma_addr,
size_t size, enum dma_data_direction direction, unsigned long attrs)
 {
-   if (cpu_needs_post_dma_flush(dev))
+   if (cpu_needs_post_dma_flush(dev) && !(attrs & DMA_ATTR_SKIP_CPU_SYNC))
__dma_sync(dma_addr_to_page(dev, dma_addr),
   dma_addr & ~PAGE_MASK, size, direction);
plat_post_dma_flush(dev);
@@ -307,7 +307,8 @@ static int mips_dma_map_sg(struct device *dev, struct 
scatterlist *sglist,
struct scatterlist *sg;
 
for_each_sg(sglist, sg, nents, i) {
-   if (!plat_device_is_coherent(dev))
+   if (!plat_device_is_coherent(dev) &&
+   !(attrs & DMA_ATTR_SKIP_CPU_SYNC))
__dma_sync(sg_page(sg), sg->offset, sg->length,
   direction);
 #ifdef CONFIG_NEED_SG_DMA_LENGTH
@@ -324,7 +325,7 @@ static dma_addr_t mips_dma_map_page(struct device *dev, 
struct page *page,
unsigned long offset, size_t size, enum dma_data_direction direction,
unsigned long attrs)
 {
-   if (!plat_device_is_coherent(dev))
+   if (!plat_device_is_coherent(dev) && !(attrs & DMA_ATTR_SKIP_CPU_SYNC))
__dma_sync(page, offset, size, direction);
 
return plat_map_dma_mem_page(dev, page) + offset;
@@ -339,6 +340,7 @@ static void mips_dma_unmap_sg(struct device *dev, struct 
scatterlist *sglist,
 
for_each_sg(sglist, sg, nhwentries, i) {
if (!plat_device_is_coherent(dev) &&
+   !(attrs & DMA_ATTR_SKIP_CPU_SYNC) &&
direction != DMA_TO_DEVICE)
__dma_sync(sg_page(sg), sg->offset, sg->length,
   direction);



[net-next PATCH RFC 01/26] swiotlb: Drop unused function swiotlb_map_sg

2016-10-24 Thread Alexander Duyck
There are no users for swiotlb_map_sg so we might as well just drop it.

Cc: Konrad Rzeszutek Wilk 
Signed-off-by: Alexander Duyck 
---
 include/linux/swiotlb.h |    4 ----
 lib/swiotlb.c           |    8 --------
 2 files changed, 12 deletions(-)

diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 5f81f8a..e237b6f 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -72,10 +72,6 @@ extern void swiotlb_unmap_page(struct device *hwdev, 
dma_addr_t dev_addr,
   size_t size, enum dma_data_direction dir,
   unsigned long attrs);
 
-extern int
-swiotlb_map_sg(struct device *hwdev, struct scatterlist *sg, int nents,
-  enum dma_data_direction dir);
-
 extern void
 swiotlb_unmap_sg(struct device *hwdev, struct scatterlist *sg, int nents,
 enum dma_data_direction dir);
diff --git a/lib/swiotlb.c b/lib/swiotlb.c
index 22e13a0..47aad37 100644
--- a/lib/swiotlb.c
+++ b/lib/swiotlb.c
@@ -910,14 +910,6 @@ void swiotlb_unmap_page(struct device *hwdev, dma_addr_t 
dev_addr,
 }
 EXPORT_SYMBOL(swiotlb_map_sg_attrs);
 
-int
-swiotlb_map_sg(struct device *hwdev, struct scatterlist *sgl, int nelems,
-  enum dma_data_direction dir)
-{
-   return swiotlb_map_sg_attrs(hwdev, sgl, nelems, dir, 0);
-}
-EXPORT_SYMBOL(swiotlb_map_sg);
-
 /*
  * Unmap a set of streaming mode DMA translations.  Again, cpu read rules
  * concerning calls here are the same as for swiotlb_unmap_page() above.



[net-next PATCH RFC 10/26] arch/m68k: Add option to skip DMA sync as a part of mapping

2016-10-24 Thread Alexander Duyck
This change allows us to pass DMA_ATTR_SKIP_CPU_SYNC which allows us to
avoid invoking cache line invalidation if the driver will just handle it
later via a sync_for_cpu or sync_for_device call.

Cc: Geert Uytterhoeven 
Cc: linux-m...@lists.linux-m68k.org
Signed-off-by: Alexander Duyck 
---
 arch/m68k/kernel/dma.c |    8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/m68k/kernel/dma.c b/arch/m68k/kernel/dma.c
index 8cf97cb..0707006 100644
--- a/arch/m68k/kernel/dma.c
+++ b/arch/m68k/kernel/dma.c
@@ -134,7 +134,9 @@ static dma_addr_t m68k_dma_map_page(struct device *dev, 
struct page *page,
 {
dma_addr_t handle = page_to_phys(page) + offset;
 
-   dma_sync_single_for_device(dev, handle, size, dir);
+   if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
+   dma_sync_single_for_device(dev, handle, size, dir);
+
return handle;
 }
 
@@ -146,6 +148,10 @@ static int m68k_dma_map_sg(struct device *dev, struct 
scatterlist *sglist,
 
for_each_sg(sglist, sg, nents, i) {
sg->dma_address = sg_phys(sg);
+
+   if (attrs & DMA_ATTR_SKIP_CPU_SYNC)
+   continue;
+
dma_sync_single_for_device(dev, sg->dma_address, sg->length,
   dir);
}



[net-next PATCH RFC 07/26] arch/c6x: Add option to skip sync on DMA map and unmap

2016-10-24 Thread Alexander Duyck
This change allows us to pass DMA_ATTR_SKIP_CPU_SYNC which allows us to
avoid invoking cache line invalidation if the driver will just handle it
later via a sync_for_cpu or sync_for_device call.

Cc: Mark Salter 
Cc: Aurelien Jacquiot 
Cc: linux-c6x-...@linux-c6x.org
Signed-off-by: Alexander Duyck 
---
 arch/c6x/kernel/dma.c |   16 +++++++++++-----
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/arch/c6x/kernel/dma.c b/arch/c6x/kernel/dma.c
index db4a6a3..d28df74 100644
--- a/arch/c6x/kernel/dma.c
+++ b/arch/c6x/kernel/dma.c
@@ -42,14 +42,17 @@ static dma_addr_t c6x_dma_map_page(struct device *dev, 
struct page *page,
 {
dma_addr_t handle = virt_to_phys(page_address(page) + offset);
 
-   c6x_dma_sync(handle, size, dir);
+   if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
+   c6x_dma_sync(handle, size, dir);
+
return handle;
 }
 
 static void c6x_dma_unmap_page(struct device *dev, dma_addr_t handle,
size_t size, enum dma_data_direction dir, unsigned long attrs)
 {
-   c6x_dma_sync(handle, size, dir);
+   if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
+   c6x_dma_sync(handle, size, dir);
 }
 
 static int c6x_dma_map_sg(struct device *dev, struct scatterlist *sglist,
@@ -60,7 +63,8 @@ static int c6x_dma_map_sg(struct device *dev, struct 
scatterlist *sglist,
 
for_each_sg(sglist, sg, nents, i) {
sg->dma_address = sg_phys(sg);
-   c6x_dma_sync(sg->dma_address, sg->length, dir);
+   if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
+   c6x_dma_sync(sg->dma_address, sg->length, dir);
}
 
return nents;
@@ -72,8 +76,10 @@ static void c6x_dma_unmap_sg(struct device *dev, struct 
scatterlist *sglist,
struct scatterlist *sg;
int i;
 
-   for_each_sg(sglist, sg, nents, i)
-   c6x_dma_sync(sg_dma_address(sg), sg->length, dir);
+   for_each_sg(sglist, sg, nents, i) {
+   if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
+   c6x_dma_sync(sg_dma_address(sg), sg->length, dir);
+   }
 
 }
 



[net-next PATCH RFC 06/26] arch/blackfin: Add option to skip sync on DMA map

2016-10-24 Thread Alexander Duyck
The use of DMA_ATTR_SKIP_CPU_SYNC was not consistent across all of the DMA
APIs in the arch/blackfin folder.  This change is meant to correct that so
that we get consistent behavior.

Cc: Steven Miao 
Signed-off-by: Alexander Duyck 
---
 arch/blackfin/kernel/dma-mapping.c |    7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/blackfin/kernel/dma-mapping.c 
b/arch/blackfin/kernel/dma-mapping.c
index 53fbbb6..ed9a6a8 100644
--- a/arch/blackfin/kernel/dma-mapping.c
+++ b/arch/blackfin/kernel/dma-mapping.c
@@ -133,6 +133,10 @@ static void bfin_dma_sync_sg_for_device(struct device *dev,
 
for_each_sg(sg_list, sg, nelems, i) {
sg->dma_address = (dma_addr_t) sg_virt(sg);
+
+   if (attrs & DMA_ATTR_SKIP_CPU_SYNC)
+   continue;
+
__dma_sync(sg_dma_address(sg), sg_dma_len(sg), direction);
}
 }
@@ -143,7 +147,8 @@ static dma_addr_t bfin_dma_map_page(struct device *dev, 
struct page *page,
 {
dma_addr_t handle = (dma_addr_t)(page_address(page) + offset);
 
-   _dma_sync(handle, size, dir);
+   if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
+   _dma_sync(handle, size, dir);
return handle;
 }
 



[net-next PATCH RFC 05/26] arch/avr32: Add option to skip sync on DMA map

2016-10-24 Thread Alexander Duyck
The use of DMA_ATTR_SKIP_CPU_SYNC was not consistent across all of the DMA
APIs in the arch/avr32 folder.  This change is meant to correct that so
that we get consistent behavior.

Cc: Haavard Skinnemoen 
Cc: Hans-Christian Egtvedt 
Signed-off-by: Alexander Duyck 
---
 arch/avr32/mm/dma-coherent.c |    7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/avr32/mm/dma-coherent.c b/arch/avr32/mm/dma-coherent.c
index 58610d0..54534e5 100644
--- a/arch/avr32/mm/dma-coherent.c
+++ b/arch/avr32/mm/dma-coherent.c
@@ -146,7 +146,8 @@ static dma_addr_t avr32_dma_map_page(struct device *dev, 
struct page *page,
 {
void *cpu_addr = page_address(page) + offset;
 
-   dma_cache_sync(dev, cpu_addr, size, direction);
+   if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
+   dma_cache_sync(dev, cpu_addr, size, direction);
return virt_to_bus(cpu_addr);
 }
 
@@ -162,6 +163,10 @@ static int avr32_dma_map_sg(struct device *dev, struct 
scatterlist *sglist,
 
sg->dma_address = page_to_bus(sg_page(sg)) + sg->offset;
virt = sg_virt(sg);
+
+   if (attrs & DMA_ATTR_SKIP_CPU_SYNC)
+   continue;
+
dma_cache_sync(dev, virt, sg->length, direction);
}
 



[net-next PATCH RFC 03/26] arch/arc: Add option to skip sync on DMA mapping

2016-10-24 Thread Alexander Duyck
This change allows us to pass DMA_ATTR_SKIP_CPU_SYNC which allows us to
avoid invoking cache line invalidation if the driver will just handle it
later via a sync_for_cpu or sync_for_device call.

Cc: Vineet Gupta 
Cc: linux-snps-...@lists.infradead.org
Signed-off-by: Alexander Duyck 
---
 arch/arc/mm/dma.c |3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/arc/mm/dma.c b/arch/arc/mm/dma.c
index 20afc65..d0c4b28 100644
--- a/arch/arc/mm/dma.c
+++ b/arch/arc/mm/dma.c
@@ -133,7 +133,8 @@ static dma_addr_t arc_dma_map_page(struct device *dev, 
struct page *page,
unsigned long attrs)
 {
phys_addr_t paddr = page_to_phys(page) + offset;
-   _dma_cache_sync(paddr, size, dir);
+   if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
+   _dma_cache_sync(paddr, size, dir);
return plat_phys_to_dma(dev, paddr);
 }
 



[net-next PATCH RFC 17/26] arch/powerpc: Add option to skip DMA sync as a part of mapping

2016-10-24 Thread Alexander Duyck
This change allows us to pass DMA_ATTR_SKIP_CPU_SYNC which allows us to
avoid invoking cache line invalidation if the driver will just handle it
via a sync_for_cpu or sync_for_device call.

Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: linuxppc-...@lists.ozlabs.org
Signed-off-by: Alexander Duyck 
---
 arch/powerpc/kernel/dma.c |    9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/dma.c b/arch/powerpc/kernel/dma.c
index e64a601..6877e3f 100644
--- a/arch/powerpc/kernel/dma.c
+++ b/arch/powerpc/kernel/dma.c
@@ -203,6 +203,10 @@ static int dma_direct_map_sg(struct device *dev, struct 
scatterlist *sgl,
for_each_sg(sgl, sg, nents, i) {
sg->dma_address = sg_phys(sg) + get_dma_offset(dev);
sg->dma_length = sg->length;
+
+   if (attrs & DMA_ATTR_SKIP_CPU_SYNC)
+   continue;
+
__dma_sync_page(sg_page(sg), sg->offset, sg->length, direction);
}
 
@@ -235,7 +239,10 @@ static inline dma_addr_t dma_direct_map_page(struct device 
*dev,
 unsigned long attrs)
 {
BUG_ON(dir == DMA_NONE);
-   __dma_sync_page(page, offset, size, dir);
+
+   if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
+   __dma_sync_page(page, offset, size, dir);
+
return page_to_phys(page) + offset + get_dma_offset(dev);
 }
 



[net-next PATCH RFC 11/26] arch/metag: Add option to skip DMA sync as a part of map and unmap

2016-10-24 Thread Alexander Duyck
This change allows us to pass DMA_ATTR_SKIP_CPU_SYNC which allows us to
avoid invoking cache line invalidation if the driver will just handle it
via a sync_for_cpu or sync_for_device call.

Cc: James Hogan 
Cc: linux-me...@vger.kernel.org
Signed-off-by: Alexander Duyck 
---
 arch/metag/kernel/dma.c |   16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/arch/metag/kernel/dma.c b/arch/metag/kernel/dma.c
index 0db31e2..91968d9 100644
--- a/arch/metag/kernel/dma.c
+++ b/arch/metag/kernel/dma.c
@@ -484,8 +484,9 @@ static dma_addr_t metag_dma_map_page(struct device *dev, 
struct page *page,
unsigned long offset, size_t size,
enum dma_data_direction direction, unsigned long attrs)
 {
-   dma_sync_for_device((void *)(page_to_phys(page) + offset), size,
-   direction);
+   if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
+   dma_sync_for_device((void *)(page_to_phys(page) + offset),
+   size, direction);
return page_to_phys(page) + offset;
 }
 
@@ -493,7 +494,8 @@ static void metag_dma_unmap_page(struct device *dev, 
dma_addr_t dma_address,
size_t size, enum dma_data_direction direction,
unsigned long attrs)
 {
-   dma_sync_for_cpu(phys_to_virt(dma_address), size, direction);
+   if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
+   dma_sync_for_cpu(phys_to_virt(dma_address), size, direction);
 }
 
 static int metag_dma_map_sg(struct device *dev, struct scatterlist *sglist,
@@ -507,6 +509,10 @@ static int metag_dma_map_sg(struct device *dev, struct 
scatterlist *sglist,
BUG_ON(!sg_page(sg));
 
sg->dma_address = sg_phys(sg);
+
+   if (attrs & DMA_ATTR_SKIP_CPU_SYNC)
+   continue;
+
dma_sync_for_device(sg_virt(sg), sg->length, direction);
}
 
@@ -525,6 +531,10 @@ static void metag_dma_unmap_sg(struct device *dev, struct 
scatterlist *sglist,
BUG_ON(!sg_page(sg));
 
sg->dma_address = sg_phys(sg);
+
+   if (attrs & DMA_ATTR_SKIP_CPU_SYNC)
+   continue;
+
dma_sync_for_cpu(sg_virt(sg), sg->length, direction);
}
 }



[net-next PATCH RFC 12/26] arch/microblaze: Add option to skip DMA sync as a part of map and unmap

2016-10-24 Thread Alexander Duyck
This change allows us to pass DMA_ATTR_SKIP_CPU_SYNC which allows us to
avoid invoking cache line invalidation if the driver will just handle it
via a sync_for_cpu or sync_for_device call.

Cc: Michal Simek 
Signed-off-by: Alexander Duyck 
---
 arch/microblaze/kernel/dma.c |   10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/arch/microblaze/kernel/dma.c b/arch/microblaze/kernel/dma.c
index ec04dc1..818daf2 100644
--- a/arch/microblaze/kernel/dma.c
+++ b/arch/microblaze/kernel/dma.c
@@ -61,6 +61,10 @@ static int dma_direct_map_sg(struct device *dev, struct 
scatterlist *sgl,
/* FIXME this part of code is untested */
for_each_sg(sgl, sg, nents, i) {
sg->dma_address = sg_phys(sg);
+
+   if (attrs & DMA_ATTR_SKIP_CPU_SYNC)
+   continue;
+
__dma_sync(page_to_phys(sg_page(sg)) + sg->offset,
sg->length, direction);
}
@@ -80,7 +84,8 @@ static inline dma_addr_t dma_direct_map_page(struct device 
*dev,
 enum dma_data_direction direction,
 unsigned long attrs)
 {
-   __dma_sync(page_to_phys(page) + offset, size, direction);
+   if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
+   __dma_sync(page_to_phys(page) + offset, size, direction);
return page_to_phys(page) + offset;
 }
 
@@ -95,7 +100,8 @@ static inline void dma_direct_unmap_page(struct device *dev,
  * phys_to_virt is here because in __dma_sync_page is __virt_to_phys and
  * dma_address is physical address
  */
-   __dma_sync(dma_address, size, direction);
+   if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
+   __dma_sync(dma_address, size, direction);
 }
 
 static inline void



[net-next PATCH RFC 16/26] arch/parisc: Add option to skip DMA sync as a part of map and unmap

2016-10-24 Thread Alexander Duyck
This change allows us to pass DMA_ATTR_SKIP_CPU_SYNC which allows us to
avoid invoking cache line invalidation if the driver will just handle it
via a sync_for_cpu or sync_for_device call.

Cc: "James E.J. Bottomley" 
Cc: Helge Deller 
Cc: linux-par...@vger.kernel.org
Signed-off-by: Alexander Duyck 
---
 arch/parisc/kernel/pci-dma.c |   20 +++++++++++++++-----
 1 file changed, 15 insertions(+), 5 deletions(-)

diff --git a/arch/parisc/kernel/pci-dma.c b/arch/parisc/kernel/pci-dma.c
index 02d9ed0..be55ede 100644
--- a/arch/parisc/kernel/pci-dma.c
+++ b/arch/parisc/kernel/pci-dma.c
@@ -459,7 +459,9 @@ static dma_addr_t pa11_dma_map_page(struct device *dev, 
struct page *page,
void *addr = page_address(page) + offset;
BUG_ON(direction == DMA_NONE);
 
-   flush_kernel_dcache_range((unsigned long) addr, size);
+   if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
+   flush_kernel_dcache_range((unsigned long) addr, size);
+
return virt_to_phys(addr);
 }
 
@@ -469,8 +471,11 @@ static void pa11_dma_unmap_page(struct device *dev, 
dma_addr_t dma_handle,
 {
BUG_ON(direction == DMA_NONE);
 
+   if (attrs & DMA_ATTR_SKIP_CPU_SYNC)
+   return;
+
if (direction == DMA_TO_DEVICE)
-   return;
+   return;
 
/*
 * For PCI_DMA_FROMDEVICE this flush is not necessary for the
@@ -479,7 +484,6 @@ static void pa11_dma_unmap_page(struct device *dev, 
dma_addr_t dma_handle,
 */
 
flush_kernel_dcache_range((unsigned long) phys_to_virt(dma_handle), 
size);
-   return;
 }
 
 static int pa11_dma_map_sg(struct device *dev, struct scatterlist *sglist,
@@ -496,6 +500,10 @@ static int pa11_dma_map_sg(struct device *dev, struct 
scatterlist *sglist,
 
sg_dma_address(sg) = (dma_addr_t) virt_to_phys(vaddr);
sg_dma_len(sg) = sg->length;
+
+   if (attrs & DMA_ATTR_SKIP_CPU_SYNC)
+   continue;
+
flush_kernel_dcache_range(vaddr, sg->length);
}
return nents;
@@ -510,14 +518,16 @@ static void pa11_dma_unmap_sg(struct device *dev, struct 
scatterlist *sglist,
 
BUG_ON(direction == DMA_NONE);
 
+   if (attrs & DMA_ATTR_SKIP_CPU_SYNC)
+   return;
+
if (direction == DMA_TO_DEVICE)
-   return;
+   return;
 
/* once we do combining we'll need to use 
phys_to_virt(sg_dma_address(sglist)) */
 
for_each_sg(sglist, sg, nents, i)
flush_kernel_vmap_range(sg_virt(sg), sg->length);
-   return;
 }
 
 static void pa11_dma_sync_single_for_cpu(struct device *dev,



[net-next PATCH RFC 14/26] arch/nios2: Add option to skip DMA sync as a part of map and unmap

2016-10-24 Thread Alexander Duyck
This change allows us to pass DMA_ATTR_SKIP_CPU_SYNC which allows us to
avoid invoking cache line invalidation if the driver will just handle it
via a sync_for_cpu or sync_for_device call.

Cc: Ley Foon Tan 
Signed-off-by: Alexander Duyck 
---
 arch/nios2/mm/dma-mapping.c |   14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/arch/nios2/mm/dma-mapping.c b/arch/nios2/mm/dma-mapping.c
index d800fad..b83e723 100644
--- a/arch/nios2/mm/dma-mapping.c
+++ b/arch/nios2/mm/dma-mapping.c
@@ -102,7 +102,9 @@ static int nios2_dma_map_sg(struct device *dev, struct 
scatterlist *sg,
 
addr = sg_virt(sg);
if (addr) {
-   __dma_sync_for_device(addr, sg->length, direction);
+   if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
+   __dma_sync_for_device(addr, sg->length,
+ direction);
sg->dma_address = sg_phys(sg);
}
}
@@ -117,7 +119,9 @@ static dma_addr_t nios2_dma_map_page(struct device *dev, 
struct page *page,
 {
void *addr = page_address(page) + offset;
 
-   __dma_sync_for_device(addr, size, direction);
+   if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
+   __dma_sync_for_device(addr, size, direction);
+
return page_to_phys(page) + offset;
 }
 
@@ -125,7 +129,8 @@ static void nios2_dma_unmap_page(struct device *dev, 
dma_addr_t dma_address,
size_t size, enum dma_data_direction direction,
unsigned long attrs)
 {
-   __dma_sync_for_cpu(phys_to_virt(dma_address), size, direction);
+   if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
+   __dma_sync_for_cpu(phys_to_virt(dma_address), size, direction);
 }
 
 static void nios2_dma_unmap_sg(struct device *dev, struct scatterlist *sg,
@@ -138,6 +143,9 @@ static void nios2_dma_unmap_sg(struct device *dev, struct 
scatterlist *sg,
if (direction == DMA_TO_DEVICE)
return;
 
+   if (attrs & DMA_ATTR_SKIP_CPU_SYNC)
+   return;
+
for_each_sg(sg, sg, nhwentries, i) {
addr = sg_virt(sg);
if (addr)



[net-next PATCH RFC 04/26] arch/arm: Add option to skip sync on DMA map and unmap

2016-10-24 Thread Alexander Duyck
The use of DMA_ATTR_SKIP_CPU_SYNC was not consistent across all of the DMA
APIs in the arch/arm folder.  This change is meant to correct that so that
we get consistent behavior.

Cc: Russell King 
Signed-off-by: Alexander Duyck 
---
 arch/arm/common/dmabounce.c |   16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/arch/arm/common/dmabounce.c b/arch/arm/common/dmabounce.c
index 3012816..75055df 100644
--- a/arch/arm/common/dmabounce.c
+++ b/arch/arm/common/dmabounce.c
@@ -243,7 +243,8 @@ static int needs_bounce(struct device *dev, dma_addr_t 
dma_addr, size_t size)
 }
 
 static inline dma_addr_t map_single(struct device *dev, void *ptr, size_t size,
-   enum dma_data_direction dir)
+   enum dma_data_direction dir,
+   unsigned long attrs)
 {
struct dmabounce_device_info *device_info = dev->archdata.dmabounce;
struct safe_buffer *buf;
@@ -262,7 +263,8 @@ static inline dma_addr_t map_single(struct device *dev, 
void *ptr, size_t size,
__func__, buf->ptr, virt_to_dma(dev, buf->ptr),
buf->safe, buf->safe_dma_addr);
 
-   if (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL) {
+   if ((dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL) &&
+   !(attrs & DMA_ATTR_SKIP_CPU_SYNC)) {
dev_dbg(dev, "%s: copy unsafe %p to safe %p, size %d\n",
__func__, ptr, buf->safe, size);
memcpy(buf->safe, ptr, size);
@@ -272,7 +274,8 @@ static inline dma_addr_t map_single(struct device *dev, 
void *ptr, size_t size,
 }
 
 static inline void unmap_single(struct device *dev, struct safe_buffer *buf,
-   size_t size, enum dma_data_direction dir)
+   size_t size, enum dma_data_direction dir,
+   unsigned long attrs)
 {
BUG_ON(buf->size != size);
BUG_ON(buf->direction != dir);
@@ -283,7 +286,8 @@ static inline void unmap_single(struct device *dev, struct 
safe_buffer *buf,
 
DO_STATS(dev->archdata.dmabounce->bounce_count++);
 
-   if (dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL) {
+   if ((dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL) &&
+   !(attrs & DMA_ATTR_SKIP_CPU_SYNC)) {
void *ptr = buf->ptr;
 
dev_dbg(dev, "%s: copy back safe %p to unsafe %p size %d\n",
@@ -334,7 +338,7 @@ static dma_addr_t dmabounce_map_page(struct device *dev, 
struct page *page,
return DMA_ERROR_CODE;
}
 
-   return map_single(dev, page_address(page) + offset, size, dir);
+   return map_single(dev, page_address(page) + offset, size, dir, attrs);
 }
 
 /*
@@ -357,7 +361,7 @@ static void dmabounce_unmap_page(struct device *dev, 
dma_addr_t dma_addr, size_t
return;
}
 
-   unmap_single(dev, buf, size, dir);
+   unmap_single(dev, buf, size, dir, attrs);
 }
 
 static int __dmabounce_sync_for_cpu(struct device *dev, dma_addr_t addr,



[net-next PATCH RFC 18/26] arch/sh: Add option to skip DMA sync as a part of mapping

2016-10-24 Thread Alexander Duyck
This change allows us to pass DMA_ATTR_SKIP_CPU_SYNC which allows us to
avoid invoking cache line invalidation if the driver will just handle it
via a sync_for_cpu or sync_for_device call.

Cc: Yoshinori Sato 
Cc: Rich Felker 
Cc: linux...@vger.kernel.org
Signed-off-by: Alexander Duyck 
---
 arch/sh/kernel/dma-nommu.c |    7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/sh/kernel/dma-nommu.c b/arch/sh/kernel/dma-nommu.c
index eadb669..47fee3b 100644
--- a/arch/sh/kernel/dma-nommu.c
+++ b/arch/sh/kernel/dma-nommu.c
@@ -18,7 +18,9 @@ static dma_addr_t nommu_map_page(struct device *dev, struct 
page *page,
dma_addr_t addr = page_to_phys(page) + offset;
 
WARN_ON(size == 0);
-   dma_cache_sync(dev, page_address(page) + offset, size, dir);
+
+   if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
+   dma_cache_sync(dev, page_address(page) + offset, size, dir);
 
return addr;
 }
@@ -35,7 +37,8 @@ static int nommu_map_sg(struct device *dev, struct 
scatterlist *sg,
for_each_sg(sg, s, nents, i) {
BUG_ON(!sg_page(s));
 
-   dma_cache_sync(dev, sg_virt(s), s->length, dir);
+   if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
+   dma_cache_sync(dev, sg_virt(s), s->length, dir);
 
s->dma_address = sg_phys(s);
s->dma_length = s->length;



[PATCH 2/3] mwifiex: Introduce mwifiex_probe_of() to parse common properties

2016-10-24 Thread Rajat Jain
Introduce a function mwifiex_probe_of() to parse common properties.
Since the interface drivers decide whether the device tree node is a
valid one (based on the compatible property), let them pass a flag
indicating whether the node is valid.

The function mwifiex_probe_of() itself is currently only a placeholder;
the next patch adds content to it.

Signed-off-by: Rajat Jain 
---
 drivers/net/wireless/marvell/mwifiex/main.c| 15 ++-
 drivers/net/wireless/marvell/mwifiex/main.h|  2 +-
 drivers/net/wireless/marvell/mwifiex/pcie.c|  4 +++-
 drivers/net/wireless/marvell/mwifiex/sdio.c|  4 +++-
 drivers/net/wireless/marvell/mwifiex/sta_cmd.c |  5 +
 drivers/net/wireless/marvell/mwifiex/usb.c |  2 +-
 6 files changed, 23 insertions(+), 9 deletions(-)

diff --git a/drivers/net/wireless/marvell/mwifiex/main.c 
b/drivers/net/wireless/marvell/mwifiex/main.c
index dcceab2..b2f3d96 100644
--- a/drivers/net/wireless/marvell/mwifiex/main.c
+++ b/drivers/net/wireless/marvell/mwifiex/main.c
@@ -1552,6 +1552,16 @@ void mwifiex_do_flr(struct mwifiex_adapter *adapter, 
bool prepare)
 }
 EXPORT_SYMBOL_GPL(mwifiex_do_flr);
 
+static void mwifiex_probe_of(struct mwifiex_adapter *adapter)
+{
+   struct device *dev = adapter->dev;
+
+   if (!dev->of_node)
+   return;
+
+   adapter->dt_node = dev->of_node;
+}
+
 /*
  * This function adds the card.
  *
@@ -1568,7 +1578,7 @@ EXPORT_SYMBOL_GPL(mwifiex_do_flr);
 int
 mwifiex_add_card(void *card, struct semaphore *sem,
 struct mwifiex_if_ops *if_ops, u8 iface_type,
-struct device *dev)
+struct device *dev, bool of_node_valid)
 {
struct mwifiex_adapter *adapter;
 
@@ -1581,6 +1591,9 @@ mwifiex_add_card(void *card, struct semaphore *sem,
}
 
adapter->dev = dev;
+   if (of_node_valid)
+   mwifiex_probe_of(adapter);
+
adapter->iface_type = iface_type;
adapter->card_sem = sem;
 
diff --git a/drivers/net/wireless/marvell/mwifiex/main.h 
b/drivers/net/wireless/marvell/mwifiex/main.h
index 91218a1..83e0776 100644
--- a/drivers/net/wireless/marvell/mwifiex/main.h
+++ b/drivers/net/wireless/marvell/mwifiex/main.h
@@ -1412,7 +1412,7 @@ static inline u8 mwifiex_is_tdls_link_setup(u8 status)
 int mwifiex_init_shutdown_fw(struct mwifiex_private *priv,
 u32 func_init_shutdown);
 int mwifiex_add_card(void *, struct semaphore *, struct mwifiex_if_ops *, u8,
-struct device *);
+struct device *, bool);
 int mwifiex_remove_card(struct mwifiex_adapter *, struct semaphore *);
 
 void mwifiex_get_version(struct mwifiex_adapter *adapter, char *version,
diff --git a/drivers/net/wireless/marvell/mwifiex/pcie.c 
b/drivers/net/wireless/marvell/mwifiex/pcie.c
index 49b5835..ea423d5 100644
--- a/drivers/net/wireless/marvell/mwifiex/pcie.c
+++ b/drivers/net/wireless/marvell/mwifiex/pcie.c
@@ -194,6 +194,7 @@ static int mwifiex_pcie_probe(struct pci_dev *pdev,
const struct pci_device_id *ent)
 {
struct pcie_service_card *card;
+   bool valid_of_node = false;
int ret;
 
pr_debug("info: vendor=0x%4.04X device=0x%4.04X rev=%d\n",
@@ -221,10 +222,11 @@ static int mwifiex_pcie_probe(struct pci_dev *pdev,
ret = mwifiex_pcie_probe_of(&pdev->dev);
if (ret)
goto err_free;
+   valid_of_node = true;
}
 
ret = mwifiex_add_card(card, &add_remove_card_sem, &pcie_ops,
-  MWIFIEX_PCIE, &pdev->dev);
+  MWIFIEX_PCIE, &pdev->dev, valid_of_node);
if (ret) {
pr_err("%s failed\n", __func__);
goto err_free;
diff --git a/drivers/net/wireless/marvell/mwifiex/sdio.c 
b/drivers/net/wireless/marvell/mwifiex/sdio.c
index c95f41f..558743a 100644
--- a/drivers/net/wireless/marvell/mwifiex/sdio.c
+++ b/drivers/net/wireless/marvell/mwifiex/sdio.c
@@ -156,6 +156,7 @@ mwifiex_sdio_probe(struct sdio_func *func, const struct 
sdio_device_id *id)
 {
int ret;
struct sdio_mmc_card *card = NULL;
+   bool valid_of_node = false;
 
pr_debug("info: vendor=0x%4.04X device=0x%4.04X class=%d function=%d\n",
 func->vendor, func->device, func->class, func->num);
@@ -203,10 +204,11 @@ mwifiex_sdio_probe(struct sdio_func *func, const struct 
sdio_device_id *id)
dev_err(&func->dev, "SDIO dt node parse failed\n");
goto err_disable;
}
+   valid_of_node = true;
}
 
ret = mwifiex_add_card(card, &add_remove_card_sem, &sdio_ops,
-  MWIFIEX_SDIO, &func->dev);
+  MWIFIEX_SDIO, &func->dev, valid_of_node);
if (ret) {
dev_err(&func->dev, "add card failed\n");

[PATCH 1/3] mwifiex: Allow mwifiex early access to device structure

2016-10-24 Thread Rajat Jain
Today all the interface drivers (usb/pcie/sdio) assign
adapter->dev in the register_dev() callback, although they
have this information well beforehand.

This patch makes the device structure available for mwifiex
right at the beginning, so that it can be used for early
initialization if needed.

This is needed for subsequent patches in this patchset that
intend to unify and consolidate some of the code that would
otherwise have to be duplicated among the interface drivers
(sdio, pcie, usb).

Signed-off-by: Rajat Jain 
---
 drivers/net/wireless/marvell/mwifiex/main.c | 4 +++-
 drivers/net/wireless/marvell/mwifiex/main.h | 3 ++-
 drivers/net/wireless/marvell/mwifiex/pcie.c | 4 +---
 drivers/net/wireless/marvell/mwifiex/sdio.c | 5 +
 drivers/net/wireless/marvell/mwifiex/usb.c  | 3 +--
 5 files changed, 8 insertions(+), 11 deletions(-)

diff --git a/drivers/net/wireless/marvell/mwifiex/main.c 
b/drivers/net/wireless/marvell/mwifiex/main.c
index 2478ccd..dcceab2 100644
--- a/drivers/net/wireless/marvell/mwifiex/main.c
+++ b/drivers/net/wireless/marvell/mwifiex/main.c
@@ -1567,7 +1567,8 @@ EXPORT_SYMBOL_GPL(mwifiex_do_flr);
  */
 int
 mwifiex_add_card(void *card, struct semaphore *sem,
-struct mwifiex_if_ops *if_ops, u8 iface_type)
+struct mwifiex_if_ops *if_ops, u8 iface_type,
+struct device *dev)
 {
struct mwifiex_adapter *adapter;
 
@@ -1579,6 +1580,7 @@ mwifiex_add_card(void *card, struct semaphore *sem,
goto err_init_sw;
}
 
+   adapter->dev = dev;
adapter->iface_type = iface_type;
adapter->card_sem = sem;
 
diff --git a/drivers/net/wireless/marvell/mwifiex/main.h 
b/drivers/net/wireless/marvell/mwifiex/main.h
index 26df28f..91218a1 100644
--- a/drivers/net/wireless/marvell/mwifiex/main.h
+++ b/drivers/net/wireless/marvell/mwifiex/main.h
@@ -1411,7 +1411,8 @@ static inline u8 mwifiex_is_tdls_link_setup(u8 status)
 
 int mwifiex_init_shutdown_fw(struct mwifiex_private *priv,
 u32 func_init_shutdown);
-int mwifiex_add_card(void *, struct semaphore *, struct mwifiex_if_ops *, u8);
+int mwifiex_add_card(void *, struct semaphore *, struct mwifiex_if_ops *, u8,
+struct device *);
 int mwifiex_remove_card(struct mwifiex_adapter *, struct semaphore *);
 
 void mwifiex_get_version(struct mwifiex_adapter *adapter, char *version,
diff --git a/drivers/net/wireless/marvell/mwifiex/pcie.c 
b/drivers/net/wireless/marvell/mwifiex/pcie.c
index f7c84d3..49b5835 100644
--- a/drivers/net/wireless/marvell/mwifiex/pcie.c
+++ b/drivers/net/wireless/marvell/mwifiex/pcie.c
@@ -224,7 +224,7 @@ static int mwifiex_pcie_probe(struct pci_dev *pdev,
}
 
ret = mwifiex_add_card(card, &add_remove_card_sem, &pcie_ops,
-  MWIFIEX_PCIE);
+  MWIFIEX_PCIE, &pdev->dev);
if (ret) {
pr_err("%s failed\n", __func__);
goto err_free;
@@ -2990,11 +2990,9 @@ static void mwifiex_pcie_get_fw_name(struct 
mwifiex_adapter *adapter)
 static int mwifiex_register_dev(struct mwifiex_adapter *adapter)
 {
struct pcie_service_card *card = adapter->card;
-   struct pci_dev *pdev = card->dev;
 
/* save adapter pointer in card */
card->adapter = adapter;
-	adapter->dev = &pdev->dev;
 
if (mwifiex_pcie_request_irq(adapter))
return -1;
diff --git a/drivers/net/wireless/marvell/mwifiex/sdio.c b/drivers/net/wireless/marvell/mwifiex/sdio.c
index 807af13..c95f41f 100644
--- a/drivers/net/wireless/marvell/mwifiex/sdio.c
+++ b/drivers/net/wireless/marvell/mwifiex/sdio.c
@@ -206,7 +206,7 @@ mwifiex_sdio_probe(struct sdio_func *func, const struct sdio_device_id *id)
}
 
	ret = mwifiex_add_card(card, &add_remove_card_sem, &sdio_ops,
-			       MWIFIEX_SDIO);
+			       MWIFIEX_SDIO, &func->dev);
if (ret) {
		dev_err(&func->dev, "add card failed\n");
goto err_disable;
@@ -2106,9 +2106,6 @@ static int mwifiex_register_dev(struct mwifiex_adapter *adapter)
return ret;
}
 
-
-	adapter->dev = &func->dev;
-
strcpy(adapter->fw_name, card->firmware);
if (card->fw_dump_enh) {
adapter->mem_type_mapping_tbl = generic_mem_type_map;
diff --git a/drivers/net/wireless/marvell/mwifiex/usb.c b/drivers/net/wireless/marvell/mwifiex/usb.c
index 73eb084..f847fff 100644
--- a/drivers/net/wireless/marvell/mwifiex/usb.c
+++ b/drivers/net/wireless/marvell/mwifiex/usb.c
@@ -476,7 +476,7 @@ static int mwifiex_usb_probe(struct usb_interface *intf,
usb_set_intfdata(intf, card);
 
	ret = mwifiex_add_card(card, &add_remove_card_sem, &usb_ops,
-			       MWIFIEX_USB);
+			       MWIFIEX_USB, &card->udev->dev);
if (ret) {
pr_err("%s: mwifiex_add_card failed: %d\n", __func__, ret);
   

[PATCH 3/3] mwifiex: Enable WoWLAN for both sdio and pcie

2016-10-24 Thread Rajat Jain
Commit ce4f6f0c353b ("mwifiex: add platform specific wakeup interrupt
support") added WoWLAN feature only for sdio. This patch moves that
code to the common module so that all the interface drivers can use
it for free. It currently enables this for pcie and sdio.

Signed-off-by: Rajat Jain 
---
 drivers/net/wireless/marvell/mwifiex/main.c | 41 
 drivers/net/wireless/marvell/mwifiex/main.h | 25 ++
 drivers/net/wireless/marvell/mwifiex/pcie.c |  2 +
 drivers/net/wireless/marvell/mwifiex/sdio.c | 72 ++---
 drivers/net/wireless/marvell/mwifiex/sdio.h |  8 
 5 files changed, 73 insertions(+), 75 deletions(-)

diff --git a/drivers/net/wireless/marvell/mwifiex/main.c b/drivers/net/wireless/marvell/mwifiex/main.c
index b2f3d96..20c9b77 100644
--- a/drivers/net/wireless/marvell/mwifiex/main.c
+++ b/drivers/net/wireless/marvell/mwifiex/main.c
@@ -1552,14 +1552,55 @@ void mwifiex_do_flr(struct mwifiex_adapter *adapter, bool prepare)
 }
 EXPORT_SYMBOL_GPL(mwifiex_do_flr);
 
+static irqreturn_t mwifiex_irq_wakeup_handler(int irq, void *priv)
+{
+   struct mwifiex_adapter *adapter = priv;
+
+   if (adapter->irq_wakeup >= 0) {
+   dev_dbg(adapter->dev, "%s: wake by wifi", __func__);
+   adapter->wake_by_wifi = true;
+   disable_irq_nosync(irq);
+   }
+
+   /* Notify PM core we are wakeup source */
+   pm_wakeup_event(adapter->dev, 0);
+
+   return IRQ_HANDLED;
+}
+
 static void mwifiex_probe_of(struct mwifiex_adapter *adapter)
 {
+   int ret;
struct device *dev = adapter->dev;
 
if (!dev->of_node)
return;
 
adapter->dt_node = dev->of_node;
+   adapter->irq_wakeup = irq_of_parse_and_map(adapter->dt_node, 0);
+   if (!adapter->irq_wakeup) {
+   dev_info(dev, "fail to parse irq_wakeup from device tree\n");
+   return;
+   }
+
+   ret = devm_request_irq(dev, adapter->irq_wakeup,
+  mwifiex_irq_wakeup_handler, IRQF_TRIGGER_LOW,
+  "wifi_wake", adapter);
+   if (ret) {
+   dev_err(dev, "Failed to request irq_wakeup %d (%d)\n",
+   adapter->irq_wakeup, ret);
+   goto err_exit;
+   }
+
+   disable_irq(adapter->irq_wakeup);
+   if (device_init_wakeup(dev, true)) {
+   dev_err(dev, "fail to init wakeup for mwifiex\n");
+   goto err_exit;
+   }
+   return;
+
+err_exit:
+   adapter->irq_wakeup = 0;
 }
 
 /*
diff --git a/drivers/net/wireless/marvell/mwifiex/main.h b/drivers/net/wireless/marvell/mwifiex/main.h
index 83e0776..12def94 100644
--- a/drivers/net/wireless/marvell/mwifiex/main.h
+++ b/drivers/net/wireless/marvell/mwifiex/main.h
@@ -1010,6 +1010,10 @@ struct mwifiex_adapter {
bool usb_mc_setup;
struct cfg80211_wowlan_nd_info *nd_info;
struct ieee80211_regdomain *regd;
+
+   /* Wake-on-WLAN (WoWLAN) */
+   int irq_wakeup;
+   bool wake_by_wifi;
 };
 
 void mwifiex_process_tx_queue(struct mwifiex_adapter *adapter);
@@ -1409,6 +1413,27 @@ static inline u8 mwifiex_is_tdls_link_setup(u8 status)
return false;
 }
 
+/* Disable platform specific wakeup interrupt */
+static inline void mwifiex_disable_wake(struct mwifiex_adapter *adapter)
+{
+   if (adapter->irq_wakeup >= 0) {
+   disable_irq_wake(adapter->irq_wakeup);
+   if (!adapter->wake_by_wifi)
+   disable_irq(adapter->irq_wakeup);
+   }
+}
+
+/* Enable platform specific wakeup interrupt */
+static inline void mwifiex_enable_wake(struct mwifiex_adapter *adapter)
+{
+   /* Enable platform specific wakeup interrupt */
+   if (adapter->irq_wakeup >= 0) {
+   adapter->wake_by_wifi = false;
+   enable_irq(adapter->irq_wakeup);
+   enable_irq_wake(adapter->irq_wakeup);
+   }
+}
+
 int mwifiex_init_shutdown_fw(struct mwifiex_private *priv,
 u32 func_init_shutdown);
 int mwifiex_add_card(void *, struct semaphore *, struct mwifiex_if_ops *, u8,
diff --git a/drivers/net/wireless/marvell/mwifiex/pcie.c b/drivers/net/wireless/marvell/mwifiex/pcie.c
index ea423d5..af93661 100644
--- a/drivers/net/wireless/marvell/mwifiex/pcie.c
+++ b/drivers/net/wireless/marvell/mwifiex/pcie.c
@@ -133,6 +133,7 @@ static int mwifiex_pcie_suspend(struct device *dev)
 
adapter = card->adapter;
 
+   mwifiex_enable_wake(adapter);
hs_actived = mwifiex_enable_hs(adapter);
 
/* Indicate device suspended */
@@ -179,6 +180,7 @@ static int mwifiex_pcie_resume(struct device *dev)
 
mwifiex_cancel_hs(mwifiex_get_priv(adapter, MWIFIEX_BSS_ROLE_STA),
  MWIFIEX_ASYNC_CMD);
+   mwifiex_disable_wake(adapter);
 
return 0;
 }
diff --git a/drivers/net/wireless/marvell/mwifiex/sdio.c 

[PATCH 0/3] mwifiex: Make WoWLAN a common feature

2016-10-24 Thread Rajat Jain
I have a Marvell card on the PCIe bus that needs to support
WoWLAN (wake-on-wireless-LAN) feature. This is a feature offered
by the "core" mwifiex card and is not specific to an interface
(pcie/sdio/usb).

Currently the code to parse the WoWLAN pin, and activate it,
resides only in sdio.c [mostly commit ce4f6f0c353b ("mwifiex: add
platform specific wakeup interrupt support")]. I started by
copying all that code & data structures into pcie.c/pcie.h but then
realized that we should probably have it in common code, since the
feature is not interface specific.

Further, I noticed that the interface drivers had no interest in
device trees, since there are no properties specific to interfaces.
Currently the only properties needed are the common ones needed
by the core mwifiex.

This patch set thus introduces mwifiex_probe_of() to parse the
common properties, and then moves the WoWLAN-specific code to the
common module so that all the interfaces can use it. Essentially
this is a single logical patch that has been split up into
multiple patches only for the reason of simplicity and code
reviews.

This is currently rebased on the top of Linus' tree with the
following 2 patches applied:
https://patchwork.kernel.org/patch/9362275/
https://patchwork.kernel.org/patch/9390225/

Rajat Jain (3):
  mwifiex: Allow mwifiex early access to device structure
  mwifiex: Introduce mwifiex_probe_of() to parse common properties
  mwifiex: Enable WoWLAN for both sdio and pcie

 drivers/net/wireless/marvell/mwifiex/main.c| 58 ++-
 drivers/net/wireless/marvell/mwifiex/main.h| 28 -
 drivers/net/wireless/marvell/mwifiex/pcie.c|  8 ++-
 drivers/net/wireless/marvell/mwifiex/sdio.c| 79 +++---
 drivers/net/wireless/marvell/mwifiex/sdio.h|  8 ---
 drivers/net/wireless/marvell/mwifiex/sta_cmd.c |  5 +-
 drivers/net/wireless/marvell/mwifiex/usb.c |  3 +-
 7 files changed, 99 insertions(+), 90 deletions(-)

-- 
2.8.0.rc3.226.g39d4020



Re: [PATCH] netns: revert "netns: avoid disabling irq for netns id"

2016-10-24 Thread Cong Wang
On Sat, Oct 22, 2016 at 12:29 PM, Paul Moore  wrote:
> On Fri, Oct 21, 2016 at 11:38 PM, Cong Wang  wrote:
>> On Fri, Oct 21, 2016 at 6:49 PM, Paul Moore  wrote:
>>> Eventually we should be able to reintroduce this code once we have
>>> rewritten the audit multicast code to queue messages much the same
>>> way we do for unicast messages.  A tracking issue for this can be
>>> found below:
>>
>> NAK.
>>
>> 1) This will be forgotten by Paul.
>
> The way things are going right now, this argument is going to devolve
> into a "yes he will"/"no I won't" so I'll just repeat that I've
> created a tracking issue for this so I won't forget (and included a
> link, repeated below, in the commit description) and I think I have a
> reasonable history of following through on things.
>
> * https://github.com/linux-audit/audit-kernel/issues/23

I never doubt you will remember to do the audit part, what you will
forget is the revert to your revert. We will see.

Also, you make git log history much uglier.

>
>> 2) There is already a fix which is considered as a rework by Paul.
>
> Already discussed this in the other thread, I'm not going to go into
> detail here, just a quick summary: the fix provided by Cong Wang
> doubles the message queue's memory consumption and changes some
> fundamentals in how multicast messages are handled.  The memory
> issues, while still an objectionable blocker, are easily resolved, but
> moving the netlink multicast send is something I want to make sure is
> tested/baked for a bit (it's 4.10 merge window material as far as I'm
> concerned).

Sounds like you don't have the capacity to get it reviewed and tested
within 5 weeks (assuming -rc7 will be the final RC), as a maintainer.


>
> At this point I think it is worth mentioning that we are in this
> position due to a lack of testing; if Cong Wang had tested his
> original patch with SELinux we might not be dealing with this
> regression now.  A more measured approach seems very reasonable.
>

My SELinux is silently disabled because CONFIG_DEFAULT_SECURITY_SELINUX=y
was missing in my kernel config. The change is a cross-subsystem one,
I definitely can't guarantee I can cover all subsystems. This is exactly
why we need -rc1...-rc7: the moment you close the door at -rc2 is
the moment you lose the opportunity to get it tested more widely.
I am sure you will revert the revert of the revert again for the next merge
window if you continue to work in this style.


>> 3) -rc2 is Paul's personal deadline, not ours.
>
> The current 4.9-rc kernels are broken and cause errors when SELinux is
> enabled, while I understand SELinux is not a priority (or a secondary,
> or tertiary, or N-ary concern) for many on the netdev list, it is
> still an important part of the kernel and this regression needs to be
> treated seriously and corrected soon.

You get it wrong, it is never because SELinux is not important, every
part of Linux kernel is important. You need to realize we as a whole
community don't work in this way, -rc2 is NOT late for a bug fix of any
part of Linux kernel. If you can't review and test a 30-line patch in 5 weeks,
it is very likely your problem.


>
> SELinux/audit has run into interaction issues with the network stack
> before, and we've worked together to sort things out; I'm hopeful
> cooler heads will prevail and we can do the same here.

I am trying my best to help (by providing 3 possible patches); you refused
them because of _your_ -rc2 deadline. Let people judge who is the one who
doesn't work together.

I am tired of explaining why we have -rc7 to you.


[PATCH v2] net: ipv6: Fix processing of RAs in presence of VRF

2016-10-24 Thread David Ahern
rt6_add_route_info and rt6_add_dflt_router were updated to pull the FIB
table from the device index, but the corresponding rt6_get_route_info
and rt6_get_dflt_router functions were not, leading to a failure to
process RAs:

ICMPv6: RA: ndisc_router_discovery failed to add default route

Fix the 'get' functions by using the table id associated with the
device when applicable.

Also, now that default routes can be added to tables other than the
default table, rt6_purge_dflt_routers needs to be updated as well to
look at all tables. To handle that efficiently, add a flag to the table
denoting whether it has a default route via RA.

Fixes: ca254490c8dfd ("net: Add VRF support to IPv6 stack")
Signed-off-by: David Ahern 
---
v2
- added Fixes to commit message

 include/net/ip6_fib.h |  2 ++
 net/ipv6/route.c  | 68 ---
 2 files changed, 50 insertions(+), 20 deletions(-)

diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index fb961a576abe..a74e2aa40ef4 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -230,6 +230,8 @@ struct fib6_table {
rwlock_ttb6_lock;
struct fib6_nodetb6_root;
struct inet_peer_base   tb6_peers;
+   unsigned intflags;
+#define RT6_TABLE_HAS_DFLT_ROUTER  BIT(0)
 };
 
 #define RT6_TABLE_UNSPEC   RT_TABLE_UNSPEC
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index bdbc38e8bf29..3ac19eb81a86 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -102,11 +102,13 @@ static int rt6_score_route(struct rt6_info *rt, int oif, int strict);
 #ifdef CONFIG_IPV6_ROUTE_INFO
 static struct rt6_info *rt6_add_route_info(struct net *net,
   const struct in6_addr *prefix, int prefixlen,
-  const struct in6_addr *gwaddr, int ifindex,
+  const struct in6_addr *gwaddr,
+  struct net_device *dev,
   unsigned int pref);
 static struct rt6_info *rt6_get_route_info(struct net *net,
   const struct in6_addr *prefix, int prefixlen,
-  const struct in6_addr *gwaddr, int ifindex);
+  const struct in6_addr *gwaddr,
+  struct net_device *dev);
 #endif
 
 struct uncached_list {
@@ -803,7 +805,7 @@ int rt6_route_rcv(struct net_device *dev, u8 *opt, int len,
rt = rt6_get_dflt_router(gwaddr, dev);
else
rt = rt6_get_route_info(net, prefix, rinfo->prefix_len,
-   gwaddr, dev->ifindex);
+   gwaddr, dev);
 
if (rt && !lifetime) {
ip6_del_rt(rt);
@@ -811,8 +813,8 @@ int rt6_route_rcv(struct net_device *dev, u8 *opt, int len,
}
 
if (!rt && lifetime)
-   rt = rt6_add_route_info(net, prefix, rinfo->prefix_len, gwaddr, dev->ifindex,
-   pref);
+   rt = rt6_add_route_info(net, prefix, rinfo->prefix_len, gwaddr,
+   dev, pref);
else if (rt)
rt->rt6i_flags = RTF_ROUTEINFO |
 (rt->rt6i_flags & ~RTF_PREF_MASK) | RTF_PREF(pref);
@@ -2325,13 +2327,16 @@ static void ip6_rt_copy_init(struct rt6_info *rt, struct rt6_info *ort)
 #ifdef CONFIG_IPV6_ROUTE_INFO
 static struct rt6_info *rt6_get_route_info(struct net *net,
   const struct in6_addr *prefix, int prefixlen,
-  const struct in6_addr *gwaddr, int ifindex)
+  const struct in6_addr *gwaddr,
+  struct net_device *dev)
 {
+   u32 tb_id = l3mdev_fib_table(dev) ? : RT6_TABLE_INFO;
+   int ifindex = dev->ifindex;
struct fib6_node *fn;
struct rt6_info *rt = NULL;
struct fib6_table *table;
 
-   table = fib6_get_table(net, RT6_TABLE_INFO);
+   table = fib6_get_table(net, tb_id);
if (!table)
return NULL;
 
@@ -2357,12 +2362,13 @@ static struct rt6_info *rt6_get_route_info(struct net *net,
 
 static struct rt6_info *rt6_add_route_info(struct net *net,
   const struct in6_addr *prefix, int prefixlen,
-  const struct in6_addr *gwaddr, int ifindex,
+  const struct in6_addr *gwaddr,
+  struct net_device *dev,
   unsigned int pref)
 {
struct fib6_config cfg = {
.fc_metric  = IP6_RT_PRIO_USER,
-   .fc_ifindex  

[PATCH] net: ipv6: Fix processing of RAs in presence of VRF

2016-10-24 Thread David Ahern
rt6_add_route_info and rt6_add_dflt_router were updated to pull the FIB
table from the device index, but the corresponding rt6_get_route_info
and rt6_get_dflt_router functions were not, leading to a failure to
process RAs:

ICMPv6: RA: ndisc_router_discovery failed to add default route

Fix the 'get' functions by using the table id associated with the
device when applicable.

Also, now that default routes can be added to tables other than the
default table, rt6_purge_dflt_routers needs to be updated as well to
look at all tables. To handle that efficiently, add a flag to the table
denoting whether it has a default route via RA.

Signed-off-by: David Ahern 
---
 include/net/ip6_fib.h |  2 ++
 net/ipv6/route.c  | 68 ---
 2 files changed, 50 insertions(+), 20 deletions(-)

diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index fb961a576abe..a74e2aa40ef4 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -230,6 +230,8 @@ struct fib6_table {
rwlock_ttb6_lock;
struct fib6_nodetb6_root;
struct inet_peer_base   tb6_peers;
+   unsigned intflags;
+#define RT6_TABLE_HAS_DFLT_ROUTER  BIT(0)
 };
 
 #define RT6_TABLE_UNSPEC   RT_TABLE_UNSPEC
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index bdbc38e8bf29..3ac19eb81a86 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -102,11 +102,13 @@ static int rt6_score_route(struct rt6_info *rt, int oif, int strict);
 #ifdef CONFIG_IPV6_ROUTE_INFO
 static struct rt6_info *rt6_add_route_info(struct net *net,
   const struct in6_addr *prefix, int prefixlen,
-  const struct in6_addr *gwaddr, int ifindex,
+  const struct in6_addr *gwaddr,
+  struct net_device *dev,
   unsigned int pref);
 static struct rt6_info *rt6_get_route_info(struct net *net,
   const struct in6_addr *prefix, int prefixlen,
-  const struct in6_addr *gwaddr, int ifindex);
+  const struct in6_addr *gwaddr,
+  struct net_device *dev);
 #endif
 
 struct uncached_list {
@@ -803,7 +805,7 @@ int rt6_route_rcv(struct net_device *dev, u8 *opt, int len,
rt = rt6_get_dflt_router(gwaddr, dev);
else
rt = rt6_get_route_info(net, prefix, rinfo->prefix_len,
-   gwaddr, dev->ifindex);
+   gwaddr, dev);
 
if (rt && !lifetime) {
ip6_del_rt(rt);
@@ -811,8 +813,8 @@ int rt6_route_rcv(struct net_device *dev, u8 *opt, int len,
}
 
if (!rt && lifetime)
-   rt = rt6_add_route_info(net, prefix, rinfo->prefix_len, gwaddr, dev->ifindex,
-   pref);
+   rt = rt6_add_route_info(net, prefix, rinfo->prefix_len, gwaddr,
+   dev, pref);
else if (rt)
rt->rt6i_flags = RTF_ROUTEINFO |
 (rt->rt6i_flags & ~RTF_PREF_MASK) | RTF_PREF(pref);
@@ -2325,13 +2327,16 @@ static void ip6_rt_copy_init(struct rt6_info *rt, struct rt6_info *ort)
 #ifdef CONFIG_IPV6_ROUTE_INFO
 static struct rt6_info *rt6_get_route_info(struct net *net,
   const struct in6_addr *prefix, int prefixlen,
-  const struct in6_addr *gwaddr, int ifindex)
+  const struct in6_addr *gwaddr,
+  struct net_device *dev)
 {
+   u32 tb_id = l3mdev_fib_table(dev) ? : RT6_TABLE_INFO;
+   int ifindex = dev->ifindex;
struct fib6_node *fn;
struct rt6_info *rt = NULL;
struct fib6_table *table;
 
-   table = fib6_get_table(net, RT6_TABLE_INFO);
+   table = fib6_get_table(net, tb_id);
if (!table)
return NULL;
 
@@ -2357,12 +2362,13 @@ static struct rt6_info *rt6_get_route_info(struct net *net,
 
 static struct rt6_info *rt6_add_route_info(struct net *net,
   const struct in6_addr *prefix, int prefixlen,
-  const struct in6_addr *gwaddr, int ifindex,
+  const struct in6_addr *gwaddr,
+  struct net_device *dev,
   unsigned int pref)
 {
struct fib6_config cfg = {
.fc_metric  = IP6_RT_PRIO_USER,
-   .fc_ifindex = ifindex,
+   .fc_ifindex = dev->ifindex,
.fc_dst_len = 

Re: net/can: warning in bcm_connect/proc_register

2016-10-24 Thread Andrey Konovalov
Hi Cong,

I'm able to reproduce it by running
https://gist.github.com/xairy/33f2eb6bf807b004e643bae36c3d02d7 in a
tight parallel loop with stress
(https://godoc.org/golang.org/x/tools/cmd/stress):
$ gcc -lpthread tmp.c
$ ./stress ./a.out

The C program was generated from the following syzkaller prog:
mmap(&(0x7f00/0x991000)=nil, (0x991000), 0x3, 0x32,
0x, 0x0)
socket(0x1d, 0x80002, 0x2)
r0 = socket(0x1d, 0x80002, 0x2)
connect$nfc_llcp(r0, &(0x7f00c000)={0x27, 0x1, 0x0, 0x5,
0x1, 0x1,
"341b3a01b257849ca1d7d1ff9f999d8127b185f88d1d775d59c88a3aa6a8ddacdf2bdc324ea6578a21b85114610186c3817c34b05eaffd2c3f54f57fa81ba0",
0x1ff}, 0x60)
connect$nfc_llcp(r0, &(0x7f991000-0x60)={0x27, 0x1, 0x1,
0x5, 0xfffd, 0x0,
"341b3a01b257849ca1d7d1ff9f999d8127b185f88d1d775dbec88a3aa6a8ddacdf2bdc324ea6578a21b85114610186c3817c34b05eaffd2c3f54f57fa81ba0",
0x1ff}, 0x60)

Unfortunately I wasn't able to create a simpler reproducer.

Thanks!

On Mon, Oct 24, 2016 at 6:58 PM, Cong Wang  wrote:
> On Mon, Oct 24, 2016 at 9:21 AM, Andrey Konovalov  wrote:
>> Hi,
>>
>> I've got the following error report while running the syzkaller fuzzer:
>>
>> WARNING: CPU: 0 PID: 32451 at fs/proc/generic.c:345 proc_register+0x25e/0x300
>> proc_dir_entry 'can-bcm/249757' already registered
>> Kernel panic - not syncing: panic_on_warn set ...
>
> Looks like we have two problems here:
>
> 1) A check for bo->bcm_proc_read != NULL seems missing
> 2) We need to lock the sock in bcm_connect().
>
> I will work on a patch. Meanwhile, it would help a lot if you could provide
> a reproducer.
>
> Thanks!


Re: [PATCH] net: skip genenerating uevents for network namespaces that are exiting

2016-10-24 Thread Cong Wang
On Sat, Oct 22, 2016 at 12:37 AM, Andrey Vagin  wrote:
> Hi Cong,
>
> On Thu, Oct 20, 2016 at 10:25 PM, Andrey Vagin  wrote:
>> On Thu, Oct 20, 2016 at 8:10 PM, Cong Wang  wrote:
>>> On Thu, Oct 20, 2016 at 7:46 PM, Andrei Vagin  wrote:
 No one can see these events, because a network namespace can not be
 destroyed, if it has sockets.

>>>
>>> Are you sure? kobject_uevent_env() seems sending uevents to all
>>> network namespaces.
>>
>> kobj_bcast_filter() checks that a kobject namespace is equal to a
>> socket namespace.
>
> Today I've checked that it really works as I read from the source code.
> I use this tool to read events:
> https://gist.github.com/avagin/430ba431fc2972002df40ebe6a048b36
>
> And I see that events from non-network devices are delivered to all sockets,
> but events from network devices are delivered only to sockets from
> a network namespace where a device is operated.

I missed it, it makes sense now. Please consider adding a comment in
the code or expanding your changelog for reference.

Thanks!


Re: [PATCH net 1/1] net sched filters: fix notification of filter delete with proper handle

2016-10-24 Thread Cong Wang
On Sun, Oct 23, 2016 at 8:35 AM, Jamal Hadi Salim  wrote:
> From: Jamal Hadi Salim 
>
> Signed-off-by: Jamal Hadi Salim 

We definitely need a serious changelog, even just a short one. ;)

Other than this,

Acked-by: Cong Wang 

We can address the "if (RTM_DELTFILTER != event)" in a separated patch
if needed.

Thanks.


Re: net/can: warning in bcm_connect/proc_register

2016-10-24 Thread Cong Wang
On Mon, Oct 24, 2016 at 9:21 AM, Andrey Konovalov  wrote:
> Hi,
>
> I've got the following error report while running the syzkaller fuzzer:
>
> WARNING: CPU: 0 PID: 32451 at fs/proc/generic.c:345 proc_register+0x25e/0x300
> proc_dir_entry 'can-bcm/249757' already registered
> Kernel panic - not syncing: panic_on_warn set ...

Looks like we have two problems here:

1) A check for bo->bcm_proc_read != NULL seems missing
2) We need to lock the sock in bcm_connect().

I will work on a patch. Meanwhile, it would help a lot if you could provide
a reproducer.

Thanks!


Re: Redundancy support through HSR and PRP

2016-10-24 Thread Murali Karicheri
On 10/20/2016 01:08 PM, Murali Karicheri wrote:
> David,
> 
> On 10/10/2016 02:34 PM, Murali Karicheri wrote:
>> All,
>>
>> Wondering if there is a plan to add PRP driver support, like HSR, in Linux?
>> AFAIK, PRP adds a trailer to the Ethernet frame and is used for redundancy
>> management like HSR. So wondering why this is not supported.
>>
>> Thanks
>>
> I need to work on a PRP driver for Linux. So if there is already someone
> working on this, I would like to join and contribute. Either way, please
> respond so that I can work to add this support.
> 
> I am also working to add support for offloading HSR functions to hardware
> and will need to modify the hsr driver to support the same. So any
> suggestion as to how this can be done will be appreciated.
> 
> Here is what I believe should happen to support this at a higher level
> 
> An hsr-capable NIC (with firmware support) may be able to:
>  - Duplicate packets at the egress, so only one copy needs to be forwarded
>    to the NIC
>  - Discard the duplicate at the ingress, so only one copy is forwarded to
>    the ethernet driver
>  - Manage supervision of the network; keep the node list and their status
> 
> It could be a subset of the above, so I am hoping this can be published by
> the Ethernet driver as a set of features. The hsr driver can then look at
> these features and decide to offload, disabling the same functionality in
> the hsr driver. Also, the node list/status has to be polled from the
> underlying hardware.
> 
> PRP is similar to HSR in many respects. Redundancy management uses a suffix
> tag on the MAC frame instead of the prefix used by HSR, so PRP frames are
> more transparently handled by switches or routers. Probably I need to:
>   - rename net/hsr to net/hsr-prp
>   - restructure the current set of files to add PRP support
> 
> Thanks
> 
+ Arvid

Didn't copy HSR owner in my original email. Copying now.


-- 
Murali Karicheri
Linux Kernel, Keystone


net/can: warning in bcm_connect/proc_register

2016-10-24 Thread Andrey Konovalov
Hi,

I've got the following error report while running the syzkaller fuzzer:

WARNING: CPU: 0 PID: 32451 at fs/proc/generic.c:345 proc_register+0x25e/0x300
proc_dir_entry 'can-bcm/249757' already registered
Kernel panic - not syncing: panic_on_warn set ...

CPU: 0 PID: 32451 Comm: syz-executor Not tainted 4.9.0-rc1+ #293
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
 880037d8fae0 81b474f4 0003 dc00
 840cbf00 880037d8fb04 880037d8fba8 8140c06a
 41b58ab3 8479ab7d 8140beae 0032
Call Trace:
 [< inline >] __dump_stack lib/dump_stack.c:15
 [] dump_stack+0xb3/0x10f lib/dump_stack.c:51
 [] panic+0x1bc/0x39d kernel/panic.c:179
 [] __warn+0x1cc/0x1f0 kernel/panic.c:542
 [] warn_slowpath_fmt+0xac/0xd0 kernel/panic.c:565
 [] proc_register+0x25e/0x300 fs/proc/generic.c:344
 [] proc_create_data+0x101/0x1a0 fs/proc/generic.c:507
 [] bcm_connect+0x16e/0x380 net/can/bcm.c:1585
 [] SYSC_connect+0x244/0x2f0 net/socket.c:1533
 [] SyS_connect+0x24/0x30 net/socket.c:1514
 [] entry_SYSCALL_64_fastpath+0x1f/0xc2
Dumping ftrace buffer:
   (ftrace buffer empty)
Kernel Offset: disabled

On commit 07d9a380680d1c0eb51ef87ff2eab5c994949e69 (Oct 23).


Re: [PATCH net-next] flow_dissector: fix vlan tag handling

2016-10-24 Thread Arnd Bergmann
On Monday, October 24, 2016 10:17:36 AM CEST Jiri Pirko wrote:
> Sat, Oct 22, 2016 at 10:30:08PM CEST, a...@arndb.de wrote:
> >diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
> >index 44e6ba9d3a6b..17be1b66cc41 100644
> >--- a/net/core/flow_dissector.c
> >+++ b/net/core/flow_dissector.c
> >@@ -246,13 +246,10 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
> > case htons(ETH_P_8021AD):
> > case htons(ETH_P_8021Q): {
> > const struct vlan_hdr *vlan;
> >+struct vlan_hdr _vlan;
> >+bool vlan_tag_present = (skb && skb_vlan_tag_present(skb));
> 
> Drop the unnecessary "()"

ok

> > 
> >-if (skb && skb_vlan_tag_present(skb))
> >-proto = skb->protocol;
> 
> This does not look correct. I believe that you need to set proto for
> further processing.
> 

Ah, of course. I only looked at the usage in this 'case' statement,
but the variable is also used after the 'goto again' and at
the end of the function.

Arnd


[PATCH] kalmia: avoid potential uninitialized variable use

2016-10-24 Thread Arnd Bergmann
The kalmia_send_init_packet() returns zero or a negative return
code, but gcc has no way of knowing that there cannot be a
positive return code, so it determines that copying the ethernet
address at the end of kalmia_bind() will access uninitialized
data:

drivers/net/usb/kalmia.c: In function ‘kalmia_bind’:
arch/x86/include/asm/string_32.h:78:22: error: ‘*((void *)&ethernet_addr+4)’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
    *((short *)to + 2) = *((short *)from + 2);
                       ^
drivers/net/usb/kalmia.c:138:5: note: ‘*((void *)&ethernet_addr+4)’ was declared here

This warning is harmless, but for consistency, we should make
the check for the return code match what the driver does everywhere
else and just propagate it, which then gets rid of the warning.

Signed-off-by: Arnd Bergmann 
---
 drivers/net/usb/kalmia.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/usb/kalmia.c b/drivers/net/usb/kalmia.c
index 5662babf0583..3e37724d30ae 100644
--- a/drivers/net/usb/kalmia.c
+++ b/drivers/net/usb/kalmia.c
@@ -151,7 +151,7 @@ kalmia_bind(struct usbnet *dev, struct usb_interface *intf)
 
status = kalmia_init_and_get_ethernet_addr(dev, ethernet_addr);
 
-   if (status < 0) {
+   if (status) {
usb_set_intfdata(intf, NULL);
usb_driver_release_interface(driver_of(intf), intf);
return status;
-- 
2.9.0



[PATCH] cw1200: fix bogus maybe-uninitialized warning

2016-10-24 Thread Arnd Bergmann
On x86, the cw1200 driver produces a rather silly warning about the
possible use of the 'ret' variable without an initialization
presumably after being confused by the architecture specific definition
of WARN_ON:

drivers/net/wireless/st/cw1200/wsm.c: In function ‘wsm_handle_rx’:
drivers/net/wireless/st/cw1200/wsm.c:1457:9: error: ‘ret’ may be used uninitialized in this function [-Werror=maybe-uninitialized]

As the driver just checks the same variable twice here, we can simplify
it by removing the second condition, which makes it more readable and
avoids the warning.

Signed-off-by: Arnd Bergmann 
---
 drivers/net/wireless/st/cw1200/wsm.c | 15 +++
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/drivers/net/wireless/st/cw1200/wsm.c b/drivers/net/wireless/st/cw1200/wsm.c
index 680d60eabc75..094e6637ade2 100644
--- a/drivers/net/wireless/st/cw1200/wsm.c
+++ b/drivers/net/wireless/st/cw1200/wsm.c
@@ -385,14 +385,13 @@ static int wsm_multi_tx_confirm(struct cw1200_common *priv,
if (WARN_ON(count <= 0))
return -EINVAL;
 
-   if (count > 1) {
-   /* We already released one buffer, now for the rest */
-   ret = wsm_release_tx_buffer(priv, count - 1);
-   if (ret < 0)
-   return ret;
-   else if (ret > 0)
-   cw1200_bh_wakeup(priv);
-   }
+   /* We already released one buffer, now for the rest */
+   ret = wsm_release_tx_buffer(priv, count - 1);
+   if (ret < 0)
+   return ret;
+
+   if (ret > 0)
+   cw1200_bh_wakeup(priv);
 
cw1200_debug_txed_multi(priv, count);
for (i = 0; i < count; ++i) {
-- 
2.9.0



net/ipv4: warning in inet_sock_destruct

2016-10-24 Thread Andrey Konovalov
Hi,

I've got the following error report while running the syzkaller fuzzer:

[ cut here ]
WARNING: CPU: 1 PID: 0 at net/ipv4/af_inet.c:153[] inet_sock_destruct+0x64d/0x810 net/ipv4/af_inet.c:153
Modules linked in:
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.9.0-rc2+ #301
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
 88006cd07d88 81b47264  
 84465d80  88006cd07dd0 8237
 88006cd19100[   60.531224]  0099 84465d80
0099
Call Trace:
  [   60.531224]  [] dump_stack+0xb3/0x10f
 [] __warn+0x1a7/0x1f0 kernel/panic.c:550
 [] warn_slowpath_null+0x2c/0x40 kernel/panic.c:585
 [] inet_sock_destruct+0x64d/0x810 net/ipv4/af_inet.c:153
 [] __sk_destruct+0x51/0x480 net/core/sock.c:1422
 [< inline >] __rcu_reclaim kernel/rcu/rcu.h:118
 [< inline >] rcu_do_batch kernel/rcu/tree.c:2776
 [< inline >] invoke_rcu_callbacks kernel/rcu/tree.c:3040
 [< inline >] __rcu_process_callbacks kernel/rcu/tree.c:3007
 [] rcu_process_callbacks+0xa40/0x1190 kernel/rcu/tree.c:3024
 [] __do_softirq+0x23f/0x8e5 kernel/softirq.c:284
 [< inline >] invoke_softirq kernel/softirq.c:364
 [] irq_exit+0x1a7/0x1e0 kernel/softirq.c:405
 [< inline >] exiting_irq ./arch/x86/include/asm/apic.h:659
 [] smp_apic_timer_interrupt+0x7b/0xa0
arch/x86/kernel/apic/apic.c:960
 [] apic_timer_interrupt+0x8c/0xa0
  [   60.531224]  [] ? native_safe_halt+0x6/0x10
 [< inline >] arch_safe_halt ./arch/x86/include/asm/paravirt.h:103
 [] default_idle+0x22/0x2d0 arch/x86/kernel/process.c:308
 [] arch_cpu_idle+0xa/0x10 arch/x86/kernel/process.c:299
 [] default_idle_call+0x36/0x60 kernel/sched/idle.c:96
 [< inline >] cpuidle_idle_call kernel/sched/idle.c:154
 [< inline >] cpu_idle_loop kernel/sched/idle.c:247
 [] cpu_startup_entry+0x244/0x300 kernel/sched/idle.c:302
 [] start_secondary+0x250/0x2d0 arch/x86/kernel/smpboot.c:263
---[ end trace 3cd7480984cd70d8 ]---

===
[ INFO: suspicious RCU usage. ]
4.9.0-rc2+ #301 Tainted: GW
---
net/core/sock.c:1425 suspicious rcu_dereference_check() usage!

other info that might help us debug this:


rcu_scheduler_active = 1, debug_locks = 0
1 lock held by swapper/1/0:
 #0: [   60.560631]  (
rcu_callback[   60.560930] ){..}
, at: [   60.561271] [] rcu_process_callbacks+0x9eb/0x1190

stack backtrace:
CPU: 1 PID: 0 Comm: swapper/1 Tainted: GW   4.9.0-rc2+ #301
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
 88006cd07e20 81b47264 88006c18 
 0001 843fe660 88006cd07e50 81204a4f
 880066438440 880066438000 8800664381b0 
Call Trace:
  [   60.563304]  [] dump_stack+0xb3/0x10f
 [] lockdep_rcu_suspicious+0x13f/0x190
kernel/locking/lockdep.c:4445
 [] __sk_destruct+0x3c0/0x480 net/core/sock.c:1424
 [< inline >] __rcu_reclaim kernel/rcu/rcu.h:118
 [< inline >] rcu_do_batch kernel/rcu/tree.c:2776
 [< inline >] invoke_rcu_callbacks kernel/rcu/tree.c:3040
 [< inline >] __rcu_process_callbacks kernel/rcu/tree.c:3007
 [] rcu_process_callbacks+0xa40/0x1190 kernel/rcu/tree.c:3024
 [] __do_softirq+0x23f/0x8e5 kernel/softirq.c:284
 [< inline >] invoke_softirq kernel/softirq.c:364
 [] irq_exit+0x1a7/0x1e0 kernel/softirq.c:405
 [< inline >] exiting_irq ./arch/x86/include/asm/apic.h:659
 [] smp_apic_timer_interrupt+0x7b/0xa0
arch/x86/kernel/apic/apic.c:960
 [] apic_timer_interrupt+0x8c/0xa0
  [   60.563304]  [] ? native_safe_halt+0x6/0x10
 [< inline >] arch_safe_halt ./arch/x86/include/asm/paravirt.h:103
 [] default_idle+0x22/0x2d0 arch/x86/kernel/process.c:308
 [] arch_cpu_idle+0xa/0x10 arch/x86/kernel/process.c:299
 [] default_idle_call+0x36/0x60 kernel/sched/idle.c:96
 [< inline >] cpuidle_idle_call kernel/sched/idle.c:154
 [< inline >] cpu_idle_loop kernel/sched/idle.c:247
 [] cpu_startup_entry+0x244/0x300 kernel/sched/idle.c:302
 [] start_secondary+0x250/0x2d0 arch/x86/kernel/smpboot.c:263



On commit 07d9a380680d1c0eb51ef87ff2eab5c994949e69 (Oct 23).


[PATCH] wireless: fix bogus maybe-uninitialized warning

2016-10-24 Thread Arnd Bergmann
The hostap_80211_rx() function is supposed to set up the mac addresses
for four possible cases, based on two bits of input data. For
some reason, gcc decides that it's possible that none of these
four cases apply and the addresses remain uninitialized:

drivers/net/wireless/intersil/hostap/hostap_80211_rx.c: In function 
‘hostap_80211_rx’:
arch/x86/include/asm/string_32.h:77:14: warning: ‘src’ may be used 
uninitialized in this function [-Wmaybe-uninitialized]
drivers/net/wireless/intel/ipw2x00/libipw_rx.c: In function ‘libipw_rx’:
arch/x86/include/asm/string_32.h:77:14: error: ‘dst’ may be used uninitialized 
in this function [-Werror=maybe-uninitialized]
arch/x86/include/asm/string_32.h:78:22: error: ‘*((void *)+4)’ may be used 
uninitialized in this function [-Werror=maybe-uninitialized]

This warning is clearly nonsense, but changing the last case into
'default' makes it obvious to the compiler too, which avoids the
warning and probably leads to better object code too.

The same code is duplicated several times in the kernel, so this
patch uses the same workaround for all copies. The exact configuration
was hit only very rarely in randconfig builds and I only saw it
in three drivers, but I assume that all of them are potentially
affected, and it's better to keep the code consistent.

Signed-off-by: Arnd Bergmann 
---
 drivers/net/wireless/ath/ath6kl/wmi.c  | 8 
 drivers/net/wireless/intel/ipw2x00/libipw_rx.c | 2 +-
 drivers/net/wireless/intersil/hostap/hostap_80211_rx.c | 2 +-
 net/wireless/lib80211_crypt_tkip.c | 2 +-
 4 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/net/wireless/ath/ath6kl/wmi.c 
b/drivers/net/wireless/ath/ath6kl/wmi.c
index 3fd1cc98fd2f..84a6d12c3f8a 100644
--- a/drivers/net/wireless/ath/ath6kl/wmi.c
+++ b/drivers/net/wireless/ath/ath6kl/wmi.c
@@ -421,10 +421,6 @@ int ath6kl_wmi_dot11_hdr_remove(struct wmi *wmi, struct 
sk_buff *skb)
 
switch ((le16_to_cpu(wh.frame_control)) &
(IEEE80211_FCTL_FROMDS | IEEE80211_FCTL_TODS)) {
-   case 0:
-   memcpy(eth_hdr.h_dest, wh.addr1, ETH_ALEN);
-   memcpy(eth_hdr.h_source, wh.addr2, ETH_ALEN);
-   break;
case IEEE80211_FCTL_TODS:
memcpy(eth_hdr.h_dest, wh.addr3, ETH_ALEN);
memcpy(eth_hdr.h_source, wh.addr2, ETH_ALEN);
@@ -435,6 +431,10 @@ int ath6kl_wmi_dot11_hdr_remove(struct wmi *wmi, struct 
sk_buff *skb)
break;
case IEEE80211_FCTL_FROMDS | IEEE80211_FCTL_TODS:
break;
+   default:
+   memcpy(eth_hdr.h_dest, wh.addr1, ETH_ALEN);
+   memcpy(eth_hdr.h_source, wh.addr2, ETH_ALEN);
+   break;
}
 
skb_pull(skb, sizeof(struct ath6kl_llc_snap_hdr));
diff --git a/drivers/net/wireless/intel/ipw2x00/libipw_rx.c 
b/drivers/net/wireless/intel/ipw2x00/libipw_rx.c
index cef7f7d79cd9..1c1ec7bb9302 100644
--- a/drivers/net/wireless/intel/ipw2x00/libipw_rx.c
+++ b/drivers/net/wireless/intel/ipw2x00/libipw_rx.c
@@ -507,7 +507,7 @@ int libipw_rx(struct libipw_device *ieee, struct sk_buff 
*skb,
memcpy(dst, hdr->addr3, ETH_ALEN);
memcpy(src, hdr->addr4, ETH_ALEN);
break;
-   case 0:
+   default:
memcpy(dst, hdr->addr1, ETH_ALEN);
memcpy(src, hdr->addr2, ETH_ALEN);
break;
diff --git a/drivers/net/wireless/intersil/hostap/hostap_80211_rx.c 
b/drivers/net/wireless/intersil/hostap/hostap_80211_rx.c
index 599f30f22841..34dbddbf3f9b 100644
--- a/drivers/net/wireless/intersil/hostap/hostap_80211_rx.c
+++ b/drivers/net/wireless/intersil/hostap/hostap_80211_rx.c
@@ -855,7 +855,7 @@ void hostap_80211_rx(struct net_device *dev, struct sk_buff 
*skb,
memcpy(dst, hdr->addr3, ETH_ALEN);
memcpy(src, hdr->addr4, ETH_ALEN);
break;
-   case 0:
+   default:
memcpy(dst, hdr->addr1, ETH_ALEN);
memcpy(src, hdr->addr2, ETH_ALEN);
break;
diff --git a/net/wireless/lib80211_crypt_tkip.c 
b/net/wireless/lib80211_crypt_tkip.c
index 71447cf86306..ba0a1f398ce5 100644
--- a/net/wireless/lib80211_crypt_tkip.c
+++ b/net/wireless/lib80211_crypt_tkip.c
@@ -556,7 +556,7 @@ static void michael_mic_hdr(struct sk_buff *skb, u8 * hdr)
memcpy(hdr, hdr11->addr3, ETH_ALEN);/* DA */
memcpy(hdr + ETH_ALEN, hdr11->addr4, ETH_ALEN); /* SA */
break;
-   case 0:
+   default:
memcpy(hdr, hdr11->addr1, ETH_ALEN);/* DA */
memcpy(hdr + ETH_ALEN, hdr11->addr2, ETH_ALEN); /* SA */
break;
-- 
2.9.0



Re: [PATCH] LSO feature added to Cadence GEM driver

2016-10-24 Thread Eric Dumazet
On Mon, 2016-10-24 at 14:18 +0100, Rafal Ozieblo wrote:
> New Cadence GEM hardware support Large Segment Offload (LSO):
> TCP segmentation offload (TSO) as well as UDP fragmentation
> offload (UFO). Support for those features was added to the driver.
> 
> Signed-off-by: Rafal Ozieblo 

...
>  
> +static int macb_lso_check_compatibility(struct sk_buff *skb, unsigned int 
> hdrlen)
> +{
> + unsigned int nr_frags, f;
> +
> + if (skb_shinfo(skb)->gso_size == 0)
> + /* not LSO */
> + return -EPERM;
> +
> + /* there is only one buffer */
> + if (!skb_is_nonlinear(skb))
> + return 0;
> +
> + /* For LSO:
> +  * When software supplies two or more payload buffers all payload 
> buffers
> +  * apart from the last must be a multiple of 8 bytes in size.
> +  */
> + if (!IS_ALIGNED(skb_headlen(skb) - hdrlen, MACB_TX_LEN_ALIGN))
> + return -EPERM;
> +
> + nr_frags = skb_shinfo(skb)->nr_frags;
> + /* No need to check last fragment */
> + nr_frags--;
> + for (f = 0; f < nr_frags; f++) {
> + const skb_frag_t *frag = &skb_shinfo(skb)->frags[f];
> +
> + if (!IS_ALIGNED(skb_frag_size(frag), MACB_TX_LEN_ALIGN))
> + return -EPERM;
> + }
> + return 0;
> +}
> +

Very strange hardware requirements ;(

You should implement an .ndo_features_check method
to perform the checks from the core networking stack, and not from your
ndo_start_xmit().

This has the huge advantage of not falling back to skb_linearize(skb)
which is very likely to fail with ~64 KB skbs anyway.

(Your ndo_features_check() would request software GSO instead ...)





[PATCH] staging: rtl8192x: fix bogus maybe-uninitialized warning

2016-10-24 Thread Arnd Bergmann
The rtllib_rx_extract_addr() is supposed to set up the mac addresses
for four possible cases, based on two bits of input data. For
some reason, gcc decides that it's possible that none of these
four cases apply and the addresses remain uninitialized:

drivers/staging/rtl8192e/rtllib_rx.c: In function ‘rtllib_rx_InfraAdhoc’:
include/linux/etherdevice.h:316:61: error: ‘*((void *)+4)’ may be used 
uninitialized in this function [-Werror=maybe-uninitialized]
drivers/staging/rtl8192e/rtllib_rx.c:1318:5: note: ‘*((void *)+4)’ was 
declared here
In file included from /git/arm-soc/drivers/staging/rtl8192e/rtllib_rx.c:40:0:
include/linux/etherdevice.h:316:36: error: ‘dst’ may be used uninitialized in 
this function [-Werror=maybe-uninitialized]
drivers/staging/rtl8192e/rtllib_rx.c:1318:5: note: ‘dst’ was declared here

This warning is clearly nonsense, but changing the last case into
'default' makes it obvious to the compiler too, which avoids the
warning and probably leads to better object code too.

As the same warning appears in other files that have the exact
same code, I'm fixing it in both rtl8192e and rtl8192u, even
though I did not observe it for the latter.

Signed-off-by: Arnd Bergmann 
---
 drivers/staging/rtl8192e/rtllib_rx.c  | 2 +-
 drivers/staging/rtl8192u/ieee80211/ieee80211_crypt_tkip.c | 2 +-
 drivers/staging/rtl8192u/ieee80211/ieee80211_rx.c | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/staging/rtl8192e/rtllib_rx.c 
b/drivers/staging/rtl8192e/rtllib_rx.c
index c743182b933e..d6777ecda64d 100644
--- a/drivers/staging/rtl8192e/rtllib_rx.c
+++ b/drivers/staging/rtl8192e/rtllib_rx.c
@@ -986,7 +986,7 @@ static void rtllib_rx_extract_addr(struct rtllib_device 
*ieee,
ether_addr_copy(src, hdr->addr4);
ether_addr_copy(bssid, ieee->current_network.bssid);
break;
-   case 0:
+   default:
ether_addr_copy(dst, hdr->addr1);
ether_addr_copy(src, hdr->addr2);
ether_addr_copy(bssid, hdr->addr3);
diff --git a/drivers/staging/rtl8192u/ieee80211/ieee80211_crypt_tkip.c 
b/drivers/staging/rtl8192u/ieee80211/ieee80211_crypt_tkip.c
index 6fa96d57d316..e68850897adf 100644
--- a/drivers/staging/rtl8192u/ieee80211/ieee80211_crypt_tkip.c
+++ b/drivers/staging/rtl8192u/ieee80211/ieee80211_crypt_tkip.c
@@ -553,7 +553,7 @@ static void michael_mic_hdr(struct sk_buff *skb, u8 *hdr)
memcpy(hdr, hdr11->addr3, ETH_ALEN); /* DA */
memcpy(hdr + ETH_ALEN, hdr11->addr4, ETH_ALEN); /* SA */
break;
-   case 0:
+   default:
memcpy(hdr, hdr11->addr1, ETH_ALEN); /* DA */
memcpy(hdr + ETH_ALEN, hdr11->addr2, ETH_ALEN); /* SA */
break;
diff --git a/drivers/staging/rtl8192u/ieee80211/ieee80211_rx.c 
b/drivers/staging/rtl8192u/ieee80211/ieee80211_rx.c
index 89cbc077a48d..2e4d2d7bc2e7 100644
--- a/drivers/staging/rtl8192u/ieee80211/ieee80211_rx.c
+++ b/drivers/staging/rtl8192u/ieee80211/ieee80211_rx.c
@@ -1079,7 +1079,7 @@ int ieee80211_rx(struct ieee80211_device *ieee, struct 
sk_buff *skb,
memcpy(src, hdr->addr4, ETH_ALEN);
memcpy(bssid, ieee->current_network.bssid, ETH_ALEN);
break;
-   case 0:
+   default:
memcpy(dst, hdr->addr1, ETH_ALEN);
memcpy(src, hdr->addr2, ETH_ALEN);
memcpy(bssid, hdr->addr3, ETH_ALEN);
-- 
2.9.0



net/sctp: slab-out-of-bounds in sctp_sf_ootb

2016-10-24 Thread Andrey Konovalov
Hi,

I've got the following error report while running the syzkaller fuzzer:

==
BUG: KASAN: slab-out-of-bounds in sctp_sf_ootb+0x634/0x6c0 at addr
88006bc1f210
Read of size 2 by task syz-executor/13493
CPU: 3 PID: 13493 Comm: syz-executor Not tainted 4.9.0-rc2+ #300
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
 88003e256e40 81b47264 88003e80ccc0 88006bc1eed8
 88006bc1f210 88006bc1eed0 88003e256e68 8150b34c
 88003e256ef8 88003e80ccc0 8800ebc1f210 88003e256ee8
Call Trace:
 [< inline >] __dump_stack lib/dump_stack.c:15
 [] dump_stack+0xb3/0x10f lib/dump_stack.c:51
 [] kasan_object_err+0x1c/0x70 mm/kasan/report.c:156
 [< inline >] print_address_description mm/kasan/report.c:194
 [] kasan_report_error+0x1f7/0x4d0 mm/kasan/report.c:283
 [< inline >] kasan_report mm/kasan/report.c:303
 [] __asan_report_load_n_noabort+0x3a/0x40
mm/kasan/report.c:334
 [] sctp_sf_ootb+0x634/0x6c0 net/sctp/sm_statefuns.c:3448
 [] sctp_do_sm+0x104/0x4ed0 net/sctp/sm_sideeffect.c:1108
 [] sctp_endpoint_bh_rcv+0x32d/0x9c0
net/sctp/endpointola.c:452
 [] sctp_inq_push+0x134/0x1a0 net/sctp/inqueue.c:95
 [] sctp_rcv+0x1fa8/0x2d00 net/sctp/input.c:268
 [] ip_local_deliver_finish+0x332/0xad0
net/ipv4/ip_input.c:216
 [< inline >] NF_HOOK_THRESH include/linux/netfilter.h:232
 [< inline >] NF_HOOK include/linux/netfilter.h:255
 [] ip_local_deliver+0x1c2/0x4b0 net/ipv4/ip_input.c:257
 [< inline >] dst_input include/net/dst.h:507
 [] ip_rcv_finish+0x750/0x1c40 net/ipv4/ip_input.c:396
 [< inline >] NF_HOOK_THRESH include/linux/netfilter.h:232
 [< inline >] NF_HOOK include/linux/netfilter.h:255
 [] ip_rcv+0x96f/0x12f0 net/ipv4/ip_input.c:487
 [] __netif_receive_skb_core+0x1897/0x2a50 net/core/dev.c:4212
 [] __netif_receive_skb+0x2a/0x170 net/core/dev.c:4250
 [] netif_receive_skb_internal+0x1b3/0x390 net/core/dev.c:4278
 [] netif_receive_skb+0x48/0x250 net/core/dev.c:4302
 [] tun_get_user+0xbd5/0x28a0 drivers/net/tun.c:1308
 [] tun_chr_write_iter+0xda/0x190 drivers/net/tun.c:1332
 [< inline >] new_sync_write fs/read_write.c:499
 [] __vfs_write+0x334/0x570 fs/read_write.c:512
 [] vfs_write+0x17b/0x500 fs/read_write.c:560
 [< inline >] SYSC_write fs/read_write.c:607
 [] SyS_write+0xd4/0x1a0 fs/read_write.c:599
 [] entry_SYSCALL_64_fastpath+0x1f/0xc2
Object at 88006bc1eed8, in cache kmalloc-512 size: 512
Allocated:
PID = 9755
 [  182.804017] [] save_stack_trace+0x16/0x20
arch/x86/kernel/stacktrace.c:57
 [  182.804017] [] save_stack+0x46/0xd0 mm/kasan/kasan.c:495
 [  182.804017] [< inline >] set_track mm/kasan/kasan.c:507
 [  182.804017] [] kasan_kmalloc+0xab/0xe0
mm/kasan/kasan.c:598
 [  182.804017] [] kasan_slab_alloc+0x12/0x20
mm/kasan/kasan.c:537
 [  182.804017] [< inline >] slab_post_alloc_hook mm/slab.h:417
 [  182.804017] [< inline >] slab_alloc_node mm/slub.c:2708
 [  182.804017] []
__kmalloc_node_track_caller+0xcb/0x390 mm/slub.c:4270
 [  182.804017] []
__kmalloc_reserve.isra.35+0x41/0xe0 net/core/skbuff.c:138
 [  182.804017] [] __alloc_skb+0xf0/0x600
net/core/skbuff.c:231
 [  182.804017] [< inline >] alloc_skb include/linux/skbuff.h:921
 [  182.804017] [] sock_wmalloc+0xa3/0xf0 net/core/sock.c:1757
 [  182.804017] []
__ip_append_data.isra.46+0x1e38/0x28c0 net/ipv4/ip_output.c:1010
 [  182.804017] [] ip_append_data.part.47+0xf1/0x170
net/ipv4/ip_output.c:1201
 [  182.804017] [< inline >] ip_append_data net/ipv4/ip_output.c:1605
 [  182.804017] [] ip_send_unicast_reply+0x831/0xe20
net/ipv4/ip_output.c:1605
 [  182.804017] [] tcp_v4_send_reset+0xb0a/0x1700
net/ipv4/tcp_ipv4.c:696
 [  182.804017] [] tcp_v4_rcv+0x198b/0x2e50
net/ipv4/tcp_ipv4.c:1719
 [  182.804017] []
ip_local_deliver_finish+0x332/0xad0 net/ipv4/ip_input.c:216
 [  182.804017] [< inline >] NF_HOOK_THRESH
include/linux/netfilter.h:232
 [  182.804017] [< inline >] NF_HOOK include/linux/netfilter.h:255
 [  182.804017] [] ip_local_deliver+0x1c2/0x4b0
net/ipv4/ip_input.c:257
 [  182.804017] [< inline >] dst_input include/net/dst.h:507
 [  182.804017] [] ip_rcv_finish+0x750/0x1c40
net/ipv4/ip_input.c:396
 [  182.804017] [< inline >] NF_HOOK_THRESH
include/linux/netfilter.h:232
 [  182.804017] [< inline >] NF_HOOK include/linux/netfilter.h:255
 [  182.804017] [] ip_rcv+0x96f/0x12f0 net/ipv4/ip_input.c:487
 [  182.804017] []
__netif_receive_skb_core+0x1897/0x2a50 net/core/dev.c:4212
 [  182.804017] [] __netif_receive_skb+0x2a/0x170
net/core/dev.c:4250
 [  182.804017] []
netif_receive_skb_internal+0x1b3/0x390 net/core/dev.c:4278
 [  182.804017] [] netif_receive_skb+0x48/0x250
net/core/dev.c:4302
 [  182.804017] [] tun_get_user+0xbd5/0x28a0
drivers/net/tun.c:1308
 [  182.804017] [] tun_chr_write_iter+0xda/0x190
drivers/net/tun.c:1332
 [  182.804017] [< inline >] 

RE: [PATCH] net: fec: hard phy reset on open

2016-10-24 Thread Andy Duan
From: manfred.schla...@gmx.at   Sent: Monday, October 
24, 2016 5:26 PM
> To: Andy Duan 
> Cc: netdev@vger.kernel.org; linux-ker...@vger.kernel.org
> Subject: [PATCH] net: fec: hard phy reset on open
> 
> We have seen some problems with auto negotiation on i.MX6 using LAN8720,
> after interface down/up.
> 
> In our configuration, the ptp clock is used externally as reference clock for
> the phy. Some phys (e.g. LAN8720) need a stable clock during and after a
> hard reset.
> Before this patch, the driver disabled the clock on close but did no hard 
> reset
> on open, after enabling the clocks again.
> 
> A solution that prevents disabling the clocks on close was considered, but
> discarded because of bad power saving behavior.
> 
> This patch saves the reset dt properties on probe and does a reset on every
> open after the clocks were enabled, to make sure the clock is stable during
> and after a hard reset.
> 
> Tested on i.MX6 and i.MX28, both using LAN8720.
> 
> Signed-off-by: Manfred Schlaegl 
> ---
This patch does a hard reset to let the phy stabilize.
Firstly, you should do the reset before enabling the clock.
Secondly, we suggest doing the phy reset in the phy driver, not in the MAC driver.

Regards,
Andy

>  drivers/net/ethernet/freescale/fec.h  |  4 ++
>  drivers/net/ethernet/freescale/fec_main.c | 77 +---
> ---
>  2 files changed, 47 insertions(+), 34 deletions(-)
> 
> diff --git a/drivers/net/ethernet/freescale/fec.h
> b/drivers/net/ethernet/freescale/fec.h
> index c865135..379e619 100644
> --- a/drivers/net/ethernet/freescale/fec.h
> +++ b/drivers/net/ethernet/freescale/fec.h
> @@ -498,6 +498,10 @@ struct fec_enet_private {
>   struct clk *clk_enet_out;
>   struct clk *clk_ptp;
> 
> + int phy_reset;
> + bool phy_reset_active_high;
> + int phy_reset_msec;
> +
>   bool ptp_clk_on;
>   struct mutex ptp_clk_mutex;
>   unsigned int num_tx_queues;
> diff --git a/drivers/net/ethernet/freescale/fec_main.c
> b/drivers/net/ethernet/freescale/fec_main.c
> index 48a033e..8cc1ec5 100644
> --- a/drivers/net/ethernet/freescale/fec_main.c
> +++ b/drivers/net/ethernet/freescale/fec_main.c
> @@ -2802,6 +2802,22 @@ static int fec_enet_alloc_buffers(struct
> net_device *ndev)
>   return 0;
>  }
> 
> +static void fec_reset_phy(struct fec_enet_private *fep) {
> + if (!gpio_is_valid(fep->phy_reset))
> + return;
> +
> + gpio_set_value_cansleep(fep->phy_reset, !!fep->phy_reset_active_high);
> +
> + if (fep->phy_reset_msec > 20)
> + msleep(fep->phy_reset_msec);
> + else
> + usleep_range(fep->phy_reset_msec * 1000,
> +  fep->phy_reset_msec * 1000 + 1000);
> +
> + gpio_set_value_cansleep(fep->phy_reset, !fep->phy_reset_active_high);
> +}
> +
>  static int
>  fec_enet_open(struct net_device *ndev)
>  {
> @@ -2817,6 +2833,8 @@ fec_enet_open(struct net_device *ndev)
>   if (ret)
>   goto clk_enable;
> 
> + fec_reset_phy(fep);
> +
>   /* I should reset the ring buffers here, but I don't yet know
>* a simple way to do that.
>*/
> @@ -3183,52 +3201,39 @@ static int fec_enet_init(struct net_device *ndev)
>   return 0;
>  }
> 
> -#ifdef CONFIG_OF
> -static void fec_reset_phy(struct platform_device *pdev)
> +static int
> +fec_get_reset_phy(struct platform_device *pdev, int *msec, int
> *phy_reset,
> +   bool *active_high)
>  {
> - int err, phy_reset;
> - bool active_high = false;
> - int msec = 1;
> + int err;
>   struct device_node *np = pdev->dev.of_node;
> 
> - if (!np)
> - return;
> + if (!np || !of_device_is_available(np))
> + return 0;
> 
> - of_property_read_u32(np, "phy-reset-duration", &msec);
> + of_property_read_u32(np, "phy-reset-duration", msec);
>   /* A sane reset duration should not be longer than 1s */
> - if (msec > 1000)
> - msec = 1;
> + if (*msec > 1000)
> + *msec = 1;
> 
> - phy_reset = of_get_named_gpio(np, "phy-reset-gpios", 0);
> - if (!gpio_is_valid(phy_reset))
> - return;
> + *phy_reset = of_get_named_gpio(np, "phy-reset-gpios", 0);
> + if (!gpio_is_valid(*phy_reset))
> + return 0;
> 
> - active_high = of_property_read_bool(np, "phy-reset-active-high");
> + *active_high = of_property_read_bool(np, "phy-reset-active-high");
> 
> - err = devm_gpio_request_one(&pdev->dev, phy_reset,
> - active_high ? GPIOF_OUT_INIT_HIGH :
> GPIOF_OUT_INIT_LOW,
> - "phy-reset");
> + err = devm_gpio_request_one(&pdev->dev, *phy_reset,
> + *active_high ?
> + GPIOF_OUT_INIT_HIGH :
> + GPIOF_OUT_INIT_LOW,
> + "phy-reset");
>   if (err) {
> dev_err(&pdev->dev, "failed to get 

[PATCH] netfilter: ip_vs_sync: fix bogus maybe-uninitialized warning

2016-10-24 Thread Arnd Bergmann
Building the ip_vs_sync code with CONFIG_OPTIMIZE_INLINING on x86
confuses the compiler to the point where it produces a rather
dubious warning message:

net/netfilter/ipvs/ip_vs_sync.c:1073:33: error: ‘opt.init_seq’ may be used 
uninitialized in this function [-Werror=maybe-uninitialized]
  struct ip_vs_sync_conn_options opt;
 ^~~
net/netfilter/ipvs/ip_vs_sync.c:1073:33: error: ‘opt.delta’ may be used 
uninitialized in this function [-Werror=maybe-uninitialized]
net/netfilter/ipvs/ip_vs_sync.c:1073:33: error: ‘opt.previous_delta’ may be 
used uninitialized in this function [-Werror=maybe-uninitialized]
net/netfilter/ipvs/ip_vs_sync.c:1073:33: error: ‘*((void *)+12).init_seq’ 
may be used uninitialized in this function [-Werror=maybe-uninitialized]
net/netfilter/ipvs/ip_vs_sync.c:1073:33: error: ‘*((void *)+12).delta’ may 
be used uninitialized in this function [-Werror=maybe-uninitialized]
net/netfilter/ipvs/ip_vs_sync.c:1073:33: error: ‘*((void 
*)+12).previous_delta’ may be used uninitialized in this function 
[-Werror=maybe-uninitialized]

The problem appears to be a combination of a number of factors, including
the __builtin_bswap32 compiler builtin being slightly odd, having a large
amount of code inlined into a single function, and the way that some
functions only get partially inlined here.

I've spent way too much time trying to work out a way to improve the
code, but the best I've come up with is to add an explicit memset
right before the ip_vs_seq structure is first initialized here. When
the compiler works correctly, this has absolutely no effect, but in the
case that produces the warning, the warning disappears.

In the process of analysing this warning, I also noticed that
we use memcpy to copy the larger ip_vs_sync_conn_options structure
over two members of the ip_vs_conn structure. This works because
the layout is identical, but seems error-prone, so I'm changing
this in the process to directly copy the two members. This change
seemed to have no effect on the object code or the warning, but
it deals with the same data, so I kept the two changes together.

Signed-off-by: Arnd Bergmann 
---
 net/netfilter/ipvs/ip_vs_sync.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/net/netfilter/ipvs/ip_vs_sync.c b/net/netfilter/ipvs/ip_vs_sync.c
index 1b07578bedf3..9350530c16c1 100644
--- a/net/netfilter/ipvs/ip_vs_sync.c
+++ b/net/netfilter/ipvs/ip_vs_sync.c
@@ -283,6 +283,7 @@ struct ip_vs_sync_buff {
  */
 static void ntoh_seq(struct ip_vs_seq *no, struct ip_vs_seq *ho)
 {
+   memset(ho, 0, sizeof(*ho));
ho->init_seq   = get_unaligned_be32(&no->init_seq);
ho->delta  = get_unaligned_be32(&no->delta);
ho->previous_delta = get_unaligned_be32(&no->previous_delta);
@@ -917,8 +918,10 @@ static void ip_vs_proc_conn(struct netns_ipvs *ipvs, 
struct ip_vs_conn_param *pa
kfree(param->pe_data);
}
 
-   if (opt)
-   memcpy(&cp->in_seq, opt, sizeof(*opt));
+   if (opt) {
+   cp->in_seq = opt->in_seq;
+   cp->out_seq = opt->out_seq;
+   }
atomic_set(>in_pkts, sysctl_sync_threshold(ipvs));
cp->state = state;
cp->old_state = cp->state;
-- 
2.9.0



Re: question about function igmp_stop_timer() in net/ipv4/igmp.c

2016-10-24 Thread Andrew Lunn
On Mon, Oct 24, 2016 at 07:50:12PM +0800, Dongpo Li wrote:
> Hello
> 
> We encountered a multicast problem when two set-top boxes (STBs) join the
> same multicast group and leave.
> Both boxes can join the same multicast group, but only one box sends the
> IGMP leave group message when leaving; the other box does not send the
> IGMP leave message.
> Our boxes use the IGMP version 2.
> 
> I added some debug info and found the whole procedure is like this:
> (1) Box A joins the multicast group 225.1.101.145 and send the IGMP v2 
> membership report(join group).
> (2) Box B joins the same multicast group 225.1.101.145 and also send the IGMP 
> v2 membership report(join group).
> (3) Box A receives the IGMP membership report from Box B and kernel calls 
> igmp_heard_report().
> This function will call igmp_stop_timer(im).
> In function igmp_stop_timer(im), it tries to delete the IGMP timer and does 
> the following:
> im->tm_running = 0;
> im->reporter = 0;
> (4) Box A leaves the multicast group 225.1.101.145 and kernel calls
> ip_mc_leave_group -> ip_mc_dec_group -> igmp_group_dropped.
> But in function igmp_group_dropped(), the im->reporter is 0, so the 
> kernel does not send the IGMP leave message.

RFC 2236 says:

2.  Introduction

   The Internet Group Management Protocol (IGMP) is used by IP hosts to
   report their multicast group memberships to any immediately-
   neighboring multicast routers.

Are Box A or B multicast routers?

Andrew


[PATCH] ip6_tunnel: Clear IP6CB(skb)->frag_max_size in ip4ip6_tnl_xmit()

2016-10-24 Thread Eli Cooper
skb->cb may contain data from previous layers, as shown in 5146d1f1511
("tunnel: Clear IPCB(skb)->opt before dst_link_failure called").
However, for ipip6 tunnels, clearing IPCB(skb)->opt alone is not enough,
because skb->cb is later misinterpreted as IP6CB(skb)->frag_max_size.

In the observed scenario, the garbage data made the max fragment size
so small that packets sent through the tunnel are mistakenly fragmented.

This patch clears IP6CB(skb)->frag_max_size for ipip6 tunnels.

Signed-off-by: Eli Cooper 
---
 net/ipv6/ip6_tunnel.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index 202d16a..4110562 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -1205,6 +1205,7 @@ ip4ip6_tnl_xmit(struct sk_buff *skb, struct net_device 
*dev)
int err;
 
memset(&(IPCB(skb)->opt), 0, sizeof(IPCB(skb)->opt));
+   IP6CB(skb)->frag_max_size = 0;
 
tproto = ACCESS_ONCE(t->parms.proto);
if (tproto != IPPROTO_IPIP && tproto != 0)
-- 
2.10.1



[PATCH v2 RESEND] xen-netback: prefer xenbus_scanf() over xenbus_gather()

2016-10-24 Thread Jan Beulich
For single items being collected this should be preferred as being more
typesafe (as the compiler can check that the format string and the
to-be-written-to variable match) and more efficient (requiring one less
parameter to be passed).

Signed-off-by: Jan Beulich 
---
v2: Avoid commit message to continue from subject.
---
 drivers/net/xen-netback/xenbus.c |   14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

--- 4.9-rc2/drivers/net/xen-netback/xenbus.c
+++ 4.9-rc2-xen-netback-prefer-xenbus_scanf/drivers/net/xen-netback/xenbus.c
@@ -889,16 +889,16 @@ static int connect_ctrl_ring(struct back
unsigned int evtchn;
int err;
 
-   err = xenbus_gather(XBT_NIL, dev->otherend,
-   "ctrl-ring-ref", "%u", &val, NULL);
-   if (err)
+   err = xenbus_scanf(XBT_NIL, dev->otherend,
+  "ctrl-ring-ref", "%u", &val);
+   if (err <= 0)
goto done; /* The frontend does not have a control ring */
 
ring_ref = val;
 
-   err = xenbus_gather(XBT_NIL, dev->otherend,
-   "event-channel-ctrl", "%u", &val, NULL);
-   if (err) {
+   err = xenbus_scanf(XBT_NIL, dev->otherend,
+  "event-channel-ctrl", "%u", &val);
+   if (err <= 0) {
xenbus_dev_fatal(dev, err,
 "reading %s/event-channel-ctrl",
 dev->otherend);
@@ -919,7 +919,7 @@ done:
return 0;
 
 fail:
-   return err;
+   return err ?: -ENODATA;
 }
 
 static void connect(struct backend_info *be)





Re: [PATCH v2 net] macsec: Fix header length if SCI is added if explicitly disabled

2016-10-24 Thread Sabrina Dubroca
2016-10-24, 15:44:26 +0200, Tobias Brunner wrote:
> Even if sending SCIs is explicitly disabled, the code that creates the
> Security Tag might still decide to add it (e.g. if multiple RX SCs are
> defined on the MACsec interface).
> But because the header length so far only depended on the configuration
> option, the SCI overwrote the original frame's contents (EtherType and
> e.g. the beginning of the IP header) and if encrypted did not visibly
> end up in the packet, while the SC flag in the TCI field of the Security
> Tag was still set, resulting in invalid MACsec frames.
> 
> Fixes: c09440f7dcb3 ("macsec: introduce IEEE 802.1AE driver")
> Signed-off-by: Tobias Brunner 

Acked-by: Sabrina Dubroca 

-- 
Sabrina


Re: [PATCH] net: fec: hard phy reset on open

2016-10-24 Thread Manfred Schlaegl
On 2016-10-24 16:03, Andy Duan wrote:
> From: manfred.schla...@gmx.at   Sent: Monday, 
> October 24, 2016 5:26 PM
>> To: Andy Duan 
>> Cc: netdev@vger.kernel.org; linux-ker...@vger.kernel.org
>> Subject: [PATCH] net: fec: hard phy reset on open
>>
>> We have seen some problems with auto negotiation on i.MX6 using LAN8720,
>> after interface down/up.
>>
>> In our configuration, the ptp clock is used externally as reference clock for
>> the phy. Some phys (e.g. LAN8720) need a stable clock during and after a
>> hard reset.
>> Before this patch, the driver disabled the clock on close but did no hard 
>> reset
>> on open, after enabling the clocks again.
>>
>> A solution that prevents disabling the clocks on close was considered, but
>> discarded because of bad power saving behavior.
>>
>> This patch saves the reset dt properties on probe and does a reset on every
>> open after the clocks were enabled, to make sure the clock is stable during
>> and after a hard reset.
>>
>> Tested on i.MX6 and i.MX28, both using LAN8720.
>>
>> Signed-off-by: Manfred Schlaegl 
>> ---

> This patch does a hard reset to let the phy stabilize.

> Firstly, you should do the reset before enabling the clock.
I have to disagree here.
The phy demands (datasheet + tests) a stable clock at the time of the hard
reset and afterwards. Therefore the clock has to be enabled before the hard
reset. (This is exactly the reason for the patch.)

Generally: the sense of a reset is to defer the start of a digital circuit
until the environment (power, clocks, ...) has stabilized.

Furthermore: before this patch the hard reset was done in fec_probe, and
there also after the clocks were enabled.

What was your argument for doing it the other way in this special case?

> Secondly, we suggest doing the phy reset in the phy driver, not in the MAC driver.
I was not sure if you meant a soft or a hard reset here.

In case you are talking about soft reset:
Yes, the phy drivers perform a soft reset. Sadly, a soft reset is not
sufficient in this case: the phy recovers from a lost clock only on a hard
reset (datasheet + tests).

In case you're talking about hard reset:
Intuitively I would also think that the hard reset should be handled by the
phy driver, but this is not the reality in existing implementations.

Documentation/devicetree/bindings/net/fsl-fec.txt says

- phy-reset-gpios : Should specify the gpio for phy reset

The phy reset is explicitly mentioned here, and the hard reset was already
handled by the fec driver before this patch (once, on probe).

> 
> Regards,
> Andy

Thanks for your feedback!

Best regards,
Manfred



> 
>>  drivers/net/ethernet/freescale/fec.h  |  4 ++
>>  drivers/net/ethernet/freescale/fec_main.c | 77 +---
>> ---
>>  2 files changed, 47 insertions(+), 34 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/freescale/fec.h
>> b/drivers/net/ethernet/freescale/fec.h
>> index c865135..379e619 100644
>> --- a/drivers/net/ethernet/freescale/fec.h
>> +++ b/drivers/net/ethernet/freescale/fec.h
>> @@ -498,6 +498,10 @@ struct fec_enet_private {
>>  struct clk *clk_enet_out;
>>  struct clk *clk_ptp;
>>
>> +int phy_reset;
>> +bool phy_reset_active_high;
>> +int phy_reset_msec;
>> +
>>  bool ptp_clk_on;
>>  struct mutex ptp_clk_mutex;
>>  unsigned int num_tx_queues;
>> diff --git a/drivers/net/ethernet/freescale/fec_main.c
>> b/drivers/net/ethernet/freescale/fec_main.c
>> index 48a033e..8cc1ec5 100644
>> --- a/drivers/net/ethernet/freescale/fec_main.c
>> +++ b/drivers/net/ethernet/freescale/fec_main.c
>> @@ -2802,6 +2802,22 @@ static int fec_enet_alloc_buffers(struct
>> net_device *ndev)
>>  return 0;
>>  }
>>
>> +static void fec_reset_phy(struct fec_enet_private *fep)
>> +{
>> +if (!gpio_is_valid(fep->phy_reset))
>> +return;
>> +
>> +gpio_set_value_cansleep(fep->phy_reset, !!fep->phy_reset_active_high);
>> +
>> +if (fep->phy_reset_msec > 20)
>> +msleep(fep->phy_reset_msec);
>> +else
>> +usleep_range(fep->phy_reset_msec * 1000,
>> + fep->phy_reset_msec * 1000 + 1000);
>> +
>> +gpio_set_value_cansleep(fep->phy_reset, !fep->phy_reset_active_high);
>> +}
>> +
>>  static int
>>  fec_enet_open(struct net_device *ndev)
>>  {
>> @@ -2817,6 +2833,8 @@ fec_enet_open(struct net_device *ndev)
>>  if (ret)
>>  goto clk_enable;
>>
>> +fec_reset_phy(fep);
>> +
>>  /* I should reset the ring buffers here, but I don't yet know
>>   * a simple way to do that.
>>   */
>> @@ -3183,52 +3201,39 @@ static int fec_enet_init(struct net_device *ndev)
>>  return 0;
>>  }
>>
>> -#ifdef CONFIG_OF
>> -static void fec_reset_phy(struct platform_device *pdev)
>> +static int
>> +fec_get_reset_phy(struct platform_device *pdev, int *msec, int
>> *phy_reset,
>> +  bool *active_high)
>>  {
>> -int err, phy_reset;
>> -bool active_high = false;
>> -   

Re: UDP does not autobind on recv

2016-10-24 Thread Jiri Slaby
On 10/24/2016, 03:03 PM, Eric Dumazet wrote:
> On Mon, 2016-10-24 at 14:54 +0200, Jiri Slaby wrote:
>> Hello,
>>
>> as per man 7 udp:
>>   In order to receive packets, the socket can be bound to
>>   a local  address first  by using bind(2).  Otherwise,
>>   the socket layer will automatically assign a free local
>>   port out of the range defined by /proc/sys/net/ipv4
>>   /ip_local_port_range and bind the socket to INADDR_ANY.
>>
>> I did not know that bind is unneeded, so I tried that. But it does not
>> work with this piece of code:
>> int main()
>> {
>> char buf[128];
>> int fd = socket(AF_INET, SOCK_DGRAM, 0);
>> recv(fd, buf, sizeof(buf), 0);
>> }
> 
> autobind makes little sense at recv() time really.
> 
> How could an application expect to receive a frame on 'some socket'
> without even knowing its port?

For example
struct sockaddr_storage sa;
socklen_t slen = sizeof(sa);
recv(fd, buf, sizeof(buf), MSG_DONTWAIT);
getsockname(fd, (struct sockaddr *)&sa, &slen);
recv(fd, buf, sizeof(buf), 0);
works.

> How useful would that be, exactly?

No need for finding a free port and checking, for example.

> How does TCP behave?

TCP is a completely different story. bind is documented to be required
there. (And listen and accept.)

> I would say, fix the documentation if it is not correct.

I don't have a problem with either. I have only found that the
implementation differs from the documentation :). Is there some
authoritative documentation (like POSIX) that we should be in conformance with?

thanks,
-- 
js
suse labs


Fwd: net/ipx: null-ptr-deref in ipxrtr_route_packet

2016-10-24 Thread Andrey Konovalov
+a...@redhat.com

Hi,

I've got the following error report while running the syzkaller fuzzer:

kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault:  [#1] SMP KASAN
Modules linked in:
CPU: 0 PID: 3953 Comm: syz-executor Not tainted 4.9.0-rc1+ #228
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
task: 88006aa2ac00 task.stack: 880068a9
RIP: 0010:[]  []
ipxrtr_route_packet+0x4e4/0xbe0 net/ipx/ipx_route.c
:213
RSP: 0018:880068a97b08  EFLAGS: 00010246
RAX: 88006b648500 RBX: 880068a97e40 RCX: dc00
RDX: 0003 RSI:  RDI: 88006b648960
RBP: 880068a97bc8 R08: dc00 R09: 11000d4ddf97
R10: dc00 R11:  R12: 88006b410300
R13:  R14: 88006444b68e R15: 88006a6efc80
FS:  7f28cf665700() GS:88006cc0() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 00451f80 CR3: 68a9a000 CR4: 06f0
Stack:
 88006a6efd58 880068a97dc0 880068a97e44 11000d152f68
 001a  88006b648500 41b58ab3
 847fb90b 834ed410 82b7cfea 8800ff97
Call Trace:
 [] ipx_sendmsg+0x30e/0x550 net/ipx/af_ipx.c:1749
 [< inline >] sock_sendmsg_nosec net/socket.c:606
 [] sock_sendmsg+0xcc/0x110 net/socket.c:616
 [] SYSC_sendto+0x211/0x340 net/socket.c:1641
 [] SyS_sendto+0x40/0x50 net/socket.c:1609
 [] entry_SYSCALL_64_fastpath+0x1f/0xc2
arch/x86/entry/entry_64.S:209
Code: 41 80 7c 0d 00 00 0f 85 82 06 00 00 48 8b 85 70 ff ff ff 49 b8
00 00 00 00 00 fc ff df 4c 8b a8 60 04 00 00 4c 89 ee 48 c1 ee 03 <46>
0f b6 0c 06 45 84 c9 74 0a 41 80 f9 03 0f 8e e5 05 00 00 49
RIP  [] ipxrtr_route_packet+0x4e4/0xbe0
net/ipx/ipx_route.c:213
 RSP 
---[ end trace f5bc9a28de6b2776 ]---
==

For some reason ipxs->intrfc ends up being NULL.

The reproducer is attached, you need to run a few instances simultaneously.

In case it's relevant, this is what I have in /etc/network/interfaces:

auto eth1
iface eth1 inet static
address 192.168.1.5
netmask 255.255.255.0
post-up arp -s 192.168.1.6 aa:aa:aa:aa:aa:aa
iface eth1 ipx static
frame EtherII
netnum 0x42424242

On commit 1a1891d762d6e64daf07b5be4817e3fbb29e3c59 (Oct 18).
// autogenerated by syzkaller (http://github.com/google/syzkaller)

#ifndef __NR_sendto
#define __NR_sendto 44
#endif
#ifndef __NR_syz_fuse_mount
#define __NR_syz_fuse_mount 104
#endif
#ifndef __NR_syz_open_dev
#define __NR_syz_open_dev 102
#endif
#ifndef __NR_syz_test
#define __NR_syz_test 101
#endif
#ifndef __NR_mmap
#define __NR_mmap 9
#endif
#ifndef __NR_socket
#define __NR_socket 41
#endif
#ifndef __NR_bind
#define __NR_bind 49
#endif
#ifndef __NR_syz_fuseblk_mount
#define __NR_syz_fuseblk_mount 105
#endif
#ifndef __NR_syz_open_pts
#define __NR_syz_open_pts 103
#endif

#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 

__thread int skip_segv;
__thread jmp_buf segv_env;

static void segv_handler(int sig, siginfo_t* info, void* uctx)
{
  if (__atomic_load_n(&skip_segv, __ATOMIC_RELAXED))
_longjmp(segv_env, 1);
  exit(sig);
}

static void install_segv_handler()
{
  struct sigaction sa;
  memset(&sa, 0, sizeof(sa));
  sa.sa_sigaction = segv_handler;
  sa.sa_flags = SA_NODEFER | SA_SIGINFO;
  sigaction(SIGSEGV, &sa, NULL);
  sigaction(SIGBUS, &sa, NULL);
}

#define NONFAILING(...)\
  {\
    __atomic_fetch_add(&skip_segv, 1, __ATOMIC_SEQ_CST);   \
if (_setjmp(segv_env) == 0) {  \
  __VA_ARGS__; \
}  \
    __atomic_fetch_sub(&skip_segv, 1, __ATOMIC_SEQ_CST);   \
  }

static uintptr_t syz_open_dev(uintptr_t a0, uintptr_t a1, uintptr_t a2)
{
  if (a0 == 0xc || a0 == 0xb) {
char buf[128];
sprintf(buf, "/dev/%s/%d:%d", a0 == 0xc ? "char" : "block",
(uint8_t)a1, (uint8_t)a2);
return open(buf, O_RDWR, 0);
  } else {
char buf[1024];
char* hash;
strncpy(buf, (char*)a0, sizeof(buf));
buf[sizeof(buf) - 1] = 0;
while ((hash = strchr(buf, '#'))) {
  *hash = '0' + (char)(a1 % 10);
  a1 /= 10;
}
return open(buf, a2, 0);
  }
}

static uintptr_t syz_open_pts(uintptr_t a0, uintptr_t a1)
{
  int ptyno = 0;
  if (ioctl(a0, TIOCGPTN, &ptyno))
return -1;
  char buf[128];
  sprintf(buf, "/dev/pts/%d", ptyno);
  return open(buf, a1, 0);
}

static uintptr_t syz_fuse_mount(uintptr_t a0, uintptr_t a1,
uintptr_t a2, uintptr_t a3,

Re: [PATCH net] sctp: fix the panic caused by route update

2016-10-24 Thread Neil Horman
On Mon, Oct 24, 2016 at 01:01:09AM +0800, Xin Long wrote:
> Commit 7303a1475008 ("sctp: identify chunks that need to be fragmented
> at IP level") made the chunk be fragmented at IP level in the next round
> if it's size exceed PMTU.
> 
> But there still is another case, PMTU can be updated if transport's dst
> expires and transport's pmtu_pending is set in sctp_packet_transmit. If
> the new PMTU is less than the chunk, the same issue with that commit can
> be triggered.
> 
> So we should drop this packet and let it retransmit in another round
> where it would be fragmented at IP level.
> 
> This patch is to fix it by checking the chunk size after PMTU may be
> updated and dropping this packet if it's size exceed PMTU.
> 
> Fixes: 90017accff61 ("sctp: Add GSO support")
> Signed-off-by: Xin Long 
> ---
>  net/sctp/output.c | 8 +++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/net/sctp/output.c b/net/sctp/output.c
> index 2a5c189..6cb0df8 100644
> --- a/net/sctp/output.c
> +++ b/net/sctp/output.c
> @@ -418,6 +418,7 @@ int sctp_packet_transmit(struct sctp_packet *packet, 
> gfp_t gfp)
>   __u8 has_data = 0;
>   int gso = 0;
>   int pktcount = 0;
> + int auth_len = 0;
>   struct dst_entry *dst;
>   unsigned char *auth = NULL; /* pointer to auth in skb data */
>  
> @@ -510,7 +511,12 @@ int sctp_packet_transmit(struct sctp_packet *packet, 
> gfp_t gfp)
>   list_for_each_entry(chunk, &packet->chunk_list, list) {
>   int padded = SCTP_PAD4(chunk->skb->len);
>  
> - if (pkt_size + padded > tp->pathmtu)
> + if (chunk == packet->auth)
> + auth_len = padded;
> + else if (auth_len + padded + packet->overhead >
> +  tp->pathmtu)
> + goto nomem;
> + else if (pkt_size + padded > tp->pathmtu)
>   break;
>   pkt_size += padded;
>   }
> -- 
> 2.1.0
> 
Acked-by: Neil Horman 



[PATCH] LSO feature added to Cadence GEM driver

2016-10-24 Thread Rafal Ozieblo
New Cadence GEM hardware support Large Segment Offload (LSO):
TCP segmentation offload (TSO) as well as UDP fragmentation
offload (UFO). Support for those features was added to the driver.

Signed-off-by: Rafal Ozieblo 
---
 drivers/net/ethernet/cadence/macb.c | 141 +---
 drivers/net/ethernet/cadence/macb.h |  14 
 2 files changed, 143 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/cadence/macb.c 
b/drivers/net/ethernet/cadence/macb.c
index b32444a..f659d57 100644
--- a/drivers/net/ethernet/cadence/macb.c
+++ b/drivers/net/ethernet/cadence/macb.c
@@ -32,7 +32,9 @@
 #include 
 #include 
 #include 
-
+#include 
+#include 
+#include 
 #include "macb.h"
 
 #define MACB_RX_BUFFER_SIZE 128
@@ -53,8 +55,10 @@
| MACB_BIT(TXERR))
 #define MACB_TX_INT_FLAGS  (MACB_TX_ERR_FLAGS | MACB_BIT(TCOMP))
 
-#define MACB_MAX_TX_LEN	((unsigned int)((1 << MACB_TX_FRMLEN_SIZE) - 1))
-#define GEM_MAX_TX_LEN	((unsigned int)((1 << GEM_TX_FRMLEN_SIZE) - 1))
+/* Max length of transmit frame must be a multiple of 8 bytes */
+#define MACB_TX_LEN_ALIGN  8
+#define MACB_MAX_TX_LEN	((unsigned int)((1 << MACB_TX_FRMLEN_SIZE) - 1) & ~((unsigned int)(MACB_TX_LEN_ALIGN - 1)))
+#define GEM_MAX_TX_LEN	((unsigned int)((1 << GEM_TX_FRMLEN_SIZE) - 1) & ~((unsigned int)(MACB_TX_LEN_ALIGN - 1)))
 
 #define GEM_MTU_MIN_SIZE   68
 
@@ -1212,7 +1216,8 @@ static void macb_poll_controller(struct net_device *dev)
 
 static unsigned int macb_tx_map(struct macb *bp,
struct macb_queue *queue,
-   struct sk_buff *skb)
+   struct sk_buff *skb,
+   unsigned int hdrlen)
 {
dma_addr_t mapping;
unsigned int len, entry, i, tx_head = queue->tx_head;
@@ -1220,14 +1225,27 @@ static unsigned int macb_tx_map(struct macb *bp,
struct macb_dma_desc *desc;
unsigned int offset, size, count = 0;
unsigned int f, nr_frags = skb_shinfo(skb)->nr_frags;
-   unsigned int eof = 1;
-   u32 ctrl;
+   unsigned int eof = 1, mss_mfs = 0;
+   u32 ctrl, lso_ctrl = 0, seq_ctrl = 0;
+
+   /* LSO */
+   if (skb_shinfo(skb)->gso_size != 0) {
+   if (IPPROTO_UDP == (((struct iphdr 
*)skb_network_header(skb))->protocol))
+   /* UDP - UFO */
+   lso_ctrl = MACB_LSO_UFO_ENABLE;
+   else
+   /* TCP - TSO */
+   lso_ctrl = MACB_LSO_TSO_ENABLE;
+   }
 
/* First, map non-paged data */
len = skb_headlen(skb);
+
+   /* first buffer length */
+   size = hdrlen;
+
offset = 0;
while (len) {
-   size = min(len, bp->max_tx_length);
entry = macb_tx_ring_wrap(tx_head);
tx_skb = >tx_skb[entry];
 
@@ -1247,6 +1265,8 @@ static unsigned int macb_tx_map(struct macb *bp,
offset += size;
count++;
tx_head++;
+
+   size = min(len, bp->max_tx_length);
}
 
/* Then, map paged data from fragments */
@@ -1300,6 +1320,20 @@ static unsigned int macb_tx_map(struct macb *bp,
desc = >tx_ring[entry];
desc->ctrl = ctrl;
 
+   if (lso_ctrl) {
+   if (lso_ctrl == MACB_LSO_UFO_ENABLE)
+   /* include header and FCS in value given to h/w */
+   mss_mfs = skb_shinfo(skb)->gso_size +
+   skb_transport_offset(skb) + 4;
+   else /* TSO */ {
+   mss_mfs = skb_shinfo(skb)->gso_size;
+   /* TCP Sequence Number Source Select
+* can be set only for TSO
+*/
+   seq_ctrl = 0;
+   }
+   }
+
do {
i--;
entry = macb_tx_ring_wrap(i);
@@ -1314,6 +1348,16 @@ static unsigned int macb_tx_map(struct macb *bp,
if (unlikely(entry == (TX_RING_SIZE - 1)))
ctrl |= MACB_BIT(TX_WRAP);
 
+   /* First descriptor is header descriptor */
+   if (i == queue->tx_head) {
+   ctrl |= MACB_BF(TX_LSO, lso_ctrl);
+   ctrl |= MACB_BF(TX_TCP_SEQ_SRC, seq_ctrl);
+   } else
+   /* Only set MSS/MFS on payload descriptors
+* (second or later descriptor)
+*/
+   ctrl |= MACB_BF(MSS_MFS, mss_mfs);
+
/* Set TX buffer descriptor */
macb_set_addr(desc, tx_skb->mapping);
/* desc->addr must be visible to hardware before clearing
@@ -1339,6 +1383,37 @@ static unsigned int macb_tx_map(struct macb *bp,
return 0;
 }
 
+static int 

[PATCH v2 net] macsec: Fix header length if SCI is added if explicitly disabled

2016-10-24 Thread Tobias Brunner
Even if sending SCIs is explicitly disabled, the code that creates the
Security Tag might still decide to add it (e.g. if multiple RX SCs are
defined on the MACsec interface).
But because the header length so far depended only on the configuration
option, the SCI overwrote the original frame's contents (the EtherType and
e.g. the beginning of the IP header) and, if encryption was enabled, did not
visibly end up in the packet, while the SC flag in the TCI field of the
Security Tag was still set, resulting in invalid MACsec frames.

Fixes: c09440f7dcb3 ("macsec: introduce IEEE 802.1AE driver")
Signed-off-by: Tobias Brunner 
---
 drivers/net/macsec.c | 26 ++
 1 file changed, 18 insertions(+), 8 deletions(-)

diff --git a/drivers/net/macsec.c b/drivers/net/macsec.c
index 3ea47f28e143..d2e61e002926 100644
--- a/drivers/net/macsec.c
+++ b/drivers/net/macsec.c
@@ -397,6 +397,14 @@ static struct macsec_cb *macsec_skb_cb(struct sk_buff *skb)
 #define DEFAULT_ENCRYPT false
 #define DEFAULT_ENCODING_SA 0
 
+static bool send_sci(const struct macsec_secy *secy)
+{
+   const struct macsec_tx_sc *tx_sc = &secy->tx_sc;
+
+   return tx_sc->send_sci ||
+   (secy->n_rx_sc > 1 && !tx_sc->end_station && !tx_sc->scb);
+}
+
 static sci_t make_sci(u8 *addr, __be16 port)
 {
sci_t sci;
@@ -437,15 +445,15 @@ static unsigned int macsec_extra_len(bool sci_present)
 
 /* Fill SecTAG according to IEEE 802.1AE-2006 10.5.3 */
 static void macsec_fill_sectag(struct macsec_eth_header *h,
-  const struct macsec_secy *secy, u32 pn)
+  const struct macsec_secy *secy, u32 pn,
+  bool sci_present)
 {
const struct macsec_tx_sc *tx_sc = &secy->tx_sc;
 
-   memset(&h->tci_an, 0, macsec_sectag_len(tx_sc->send_sci));
+   memset(&h->tci_an, 0, macsec_sectag_len(sci_present));
h->eth.h_proto = htons(ETH_P_MACSEC);
 
-   if (tx_sc->send_sci ||
-   (secy->n_rx_sc > 1 && !tx_sc->end_station && !tx_sc->scb)) {
+   if (sci_present) {
h->tci_an |= MACSEC_TCI_SC;
memcpy(&h->secure_channel_id, &secy->sci,
   sizeof(h->secure_channel_id));
@@ -650,6 +658,7 @@ static struct sk_buff *macsec_encrypt(struct sk_buff *skb,
struct macsec_tx_sc *tx_sc;
struct macsec_tx_sa *tx_sa;
struct macsec_dev *macsec = macsec_priv(dev);
+   bool sci_present;
u32 pn;
 
secy = &macsec->secy;
@@ -687,7 +696,8 @@ static struct sk_buff *macsec_encrypt(struct sk_buff *skb,
 
unprotected_len = skb->len;
eth = eth_hdr(skb);
-   hh = (struct macsec_eth_header *)skb_push(skb, 
macsec_extra_len(tx_sc->send_sci));
+   sci_present = send_sci(secy);
+   hh = (struct macsec_eth_header *)skb_push(skb, 
macsec_extra_len(sci_present));
memmove(hh, eth, 2 * ETH_ALEN);
 
pn = tx_sa_update_pn(tx_sa, secy);
@@ -696,7 +706,7 @@ static struct sk_buff *macsec_encrypt(struct sk_buff *skb,
kfree_skb(skb);
return ERR_PTR(-ENOLINK);
}
-   macsec_fill_sectag(hh, secy, pn);
+   macsec_fill_sectag(hh, secy, pn, sci_present);
macsec_set_shortlen(hh, unprotected_len - 2 * ETH_ALEN);
 
skb_put(skb, secy->icv_len);
@@ -726,10 +736,10 @@ static struct sk_buff *macsec_encrypt(struct sk_buff *skb,
skb_to_sgvec(skb, sg, 0, skb->len);
 
if (tx_sc->encrypt) {
-   int len = skb->len - macsec_hdr_len(tx_sc->send_sci) -
+   int len = skb->len - macsec_hdr_len(sci_present) -
  secy->icv_len;
aead_request_set_crypt(req, sg, sg, len, iv);
-   aead_request_set_ad(req, macsec_hdr_len(tx_sc->send_sci));
+   aead_request_set_ad(req, macsec_hdr_len(sci_present));
} else {
aead_request_set_crypt(req, sg, sg, 0, iv);
aead_request_set_ad(req, skb->len - secy->icv_len);
-- 
1.9.1


Re: net/dccp: warning in dccp_set_state

2016-10-24 Thread Andrey Konovalov
Hi Eric,

I can confirm that with your patch the warning goes away.

Tested-by: Andrey Konovalov 

On Mon, Oct 24, 2016 at 2:52 PM, Eric Dumazet  wrote:
> On Mon, 2016-10-24 at 05:47 -0700, Eric Dumazet wrote:
>> On Mon, 2016-10-24 at 14:23 +0200, Andrey Konovalov wrote:
>> > Hi,
>> >
>> > I've got the following error report while running the syzkaller fuzzer:
>> >
>> > WARNING: CPU: 1 PID: 21072 at net/dccp/proto.c:83 
>> > dccp_set_state+0x229/0x290
>> > Kernel panic - not syncing: panic_on_warn set ...
>> >
>> > CPU: 1 PID: 21072 Comm: syz-executor Not tainted 4.9.0-rc1+ #293
>> > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 
>> > 01/01/2011
>> >  88003d4c7738 81b474f4 0003 dc00
>> >  844f8b00 88003d4c7804 88003d4c7800 8140c06a
>> >  41b58ab3 8479ab7d 8140beae 8140cd00
>> > Call Trace:
>> >  [< inline >] __dump_stack lib/dump_stack.c:15
>> >  [] dump_stack+0xb3/0x10f lib/dump_stack.c:51
>> >  [] panic+0x1bc/0x39d kernel/panic.c:179
>> >  [] __warn+0x1cc/0x1f0 kernel/panic.c:542
>> >  [] warn_slowpath_null+0x2c/0x40 kernel/panic.c:585
>> >  [] dccp_set_state+0x229/0x290 net/dccp/proto.c:83
>> >  [] dccp_close+0x612/0xc10 net/dccp/proto.c:1016
>> >  [] inet_release+0xef/0x1c0 net/ipv4/af_inet.c:415
>> >  [] sock_release+0x8e/0x1d0 net/socket.c:570
>> >  [] sock_close+0x16/0x20 net/socket.c:1017
>> >  [] __fput+0x29d/0x720 fs/file_table.c:208
>> >  [] fput+0x15/0x20 fs/file_table.c:244
>> >  [] task_work_run+0xf8/0x170 kernel/task_work.c:116
>> >  [< inline >] exit_task_work include/linux/task_work.h:21
>> >  [] do_exit+0x883/0x2ac0 kernel/exit.c:828
>> >  [] do_group_exit+0x10e/0x340 kernel/exit.c:931
>> >  [] get_signal+0x634/0x15a0 kernel/signal.c:2307
>> >  [] do_signal+0x8d/0x1a30 arch/x86/kernel/signal.c:807
>> >  [] exit_to_usermode_loop+0xe5/0x130
>> > arch/x86/entry/common.c:156
>> >  [< inline >] prepare_exit_to_usermode arch/x86/entry/common.c:190
>> >  [] syscall_return_slowpath+0x1a8/0x1e0
>> > arch/x86/entry/common.c:259
>> >  [] entry_SYSCALL_64_fastpath+0xc0/0xc2
>> > Dumping ftrace buffer:
>> >(ftrace buffer empty)
>> > Kernel Offset: disabled
>> >
>> > On commit 1a1891d762d6e64daf07b5be4817e3fbb29e3c59 (Oct 18).
>>
>> Not sure why we keep DCCP around. David, could we kill it?
>>
>> TCP seems to have an additional check, missing in DCCP.
>>
>> diff --git a/net/dccp/proto.c b/net/dccp/proto.c
>> index 41e65804ddf5..9fe25bf63296 100644
>> --- a/net/dccp/proto.c
>> +++ b/net/dccp/proto.c
>> @@ -1009,6 +1009,10 @@ void dccp_close(struct sock *sk, long timeout)
>>   __kfree_skb(skb);
>>   }
>>
>> + /* If socket has been already reset kill it. */
>> + if (sk->sk_state == DCCP_CLOSED)
>> + goto adjudge_to_death;
>> +
>>   if (data_was_unread) {
>>   /* Unread data was tossed, send an appropriate Reset Code */
>>   DCCP_WARN("ABORT with %u bytes unread\n", data_was_unread);
>>
>
> The equivalent tcp fix was :
> https://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=565b7b2d2e632b5792879c0c9cccdd9eecd31195
>
>


Re: [PATCH net] macsec: Fix header length if SCI is added if explicitily disabled

2016-10-24 Thread Sabrina Dubroca
2016-10-24, 15:32:40 +0200, Tobias Brunner wrote:
> > [snip]
> >> @@ -440,12 +448,12 @@ static void macsec_fill_sectag(struct 
> >> macsec_eth_header *h,
> >>   const struct macsec_secy *secy, u32 pn)
> >>  {
> >>const struct macsec_tx_sc *tx_sc = &secy->tx_sc;
> >> +  bool sci_present = send_sci(secy);
> > 
> > You're already computing this in macsec_encrypt() just before calling
> > macsec_fill_sectag(), so you could pass it as argument instead of
> > recomputing it.
> 
> Right, I'll send a v2.  Would you like me to inline the send_sci()
> function, as it will only be called once afterwards?

I think keeping the send_sci() function is okay, but if you prefer to
inline it, I don't mind.

-- 
Sabrina


Re: [PATCH net] macsec: Fix header length if SCI is added if explicitily disabled

2016-10-24 Thread Tobias Brunner
> [snip]
>> @@ -440,12 +448,12 @@ static void macsec_fill_sectag(struct 
>> macsec_eth_header *h,
>> const struct macsec_secy *secy, u32 pn)
>>  {
>>  const struct macsec_tx_sc *tx_sc = &secy->tx_sc;
>> +bool sci_present = send_sci(secy);
> 
> You're already computing this in macsec_encrypt() just before calling
> macsec_fill_sectag(), so you could pass it as argument instead of
> recomputing it.

Right, I'll send a v2.  Would you like me to inline the send_sci()
function, as it will only be called once afterwards?

Regards,
Tobias



Re: UDP does not autobind on recv

2016-10-24 Thread Eric Dumazet
On Mon, 2016-10-24 at 14:54 +0200, Jiri Slaby wrote:
> Hello,
> 
> as per man 7 udp:
>   In order to receive packets, the socket can be bound to
>   a local  address first  by using bind(2).  Otherwise,
>   the socket layer will automatically assign a free local
>   port out of the range defined by /proc/sys/net/ipv4
>   /ip_local_port_range and bind the socket to INADDR_ANY.
> 
> I did not know that bind is unneeded, so I tried that. But it does not
> work with this piece of code:
> int main()
> {
> char buf[128];
> int fd = socket(AF_INET, SOCK_DGRAM, 0);
> recv(fd, buf, sizeof(buf), 0);
> }

autobind makes little sense at recv() time really.

How could an application expect to receive a frame on 'some socket'
without even knowing its port?

How useful would that be, exactly?

How does TCP behave?

I would say, fix the documentation if it is not correct.





Re: [PATCH 3/5] genetlink: statically initialize families

2016-10-24 Thread Johannes Berg
On Mon, 2016-10-24 at 14:40 +0200, Johannes Berg wrote:
> From: Johannes Berg 
> 
> Instead of providing macros/inline functions to initialize
> the families, make all users initialize them statically and
> get rid of the macros.
> 
> This reduces the kernel code size by about 1.6k on x86-64
> (with allyesconfig).

Actually, with the new system where it's not const, I could even split
this up and submit per subsystem, i.e. the fourth patch doesn't depend
on it. I thought it would, since I wanted to make it const, but since I
failed it doesn't actually have that dependency.

johannes


UDP does not autobind on recv

2016-10-24 Thread Jiri Slaby
Hello,

as per man 7 udp:
  In order to receive packets, the socket can be bound to
  a local  address first  by using bind(2).  Otherwise,
  the socket layer will automatically assign a free local
  port out of the range defined by /proc/sys/net/ipv4
  /ip_local_port_range and bind the socket to INADDR_ANY.

I did not know that bind is unneeded, so I tried that. But it does not
work with this piece of code:
int main()
{
char buf[128];
int fd = socket(AF_INET, SOCK_DGRAM, 0);
recv(fd, buf, sizeof(buf), 0);
}

The recv above never returns (even if I bomb all ports from the range).
ss -ulpan is silent too. As a workaround, I can stick a dummy write/send
before recv:
write(fd, "", 0);

And it starts working. ss suddenly displays a port which the program
listens on.

I think the UDP recv path should do inet_autobind as I have done in the
attached patch. But my knowledge is very limited in that area, so I have
no idea whether that is correct at all.

thanks,
-- 
js
suse labs
From 57c320998feb2e1e705a4ab6d3bbcb74c6ae65f0 Mon Sep 17 00:00:00 2001
From: Jiri Slaby 
Date: Sat, 22 Oct 2016 12:10:53 +0200
Subject: [PATCH] net: autobind UDP on recv

Signed-off-by: Jiri Slaby 
---
 include/net/inet_common.h | 1 +
 net/ipv4/af_inet.c| 3 ++-
 net/ipv4/udp.c| 5 +
 net/ipv6/udp.c| 5 +
 4 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/include/net/inet_common.h b/include/net/inet_common.h
index 5d683428fced..ba224ed3dd36 100644
--- a/include/net/inet_common.h
+++ b/include/net/inet_common.h
@@ -27,6 +27,7 @@ ssize_t inet_sendpage(struct socket *sock, struct page *page, int offset,
 int inet_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
 		 int flags);
 int inet_shutdown(struct socket *sock, int how);
+int inet_autobind(struct sock *sk);
 int inet_listen(struct socket *sock, int backlog);
 void inet_sock_destruct(struct sock *sk);
 int inet_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len);
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 9648c97e541f..d23acb11cdb0 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -171,7 +171,7 @@ EXPORT_SYMBOL(inet_sock_destruct);
  *	Automatically bind an unbound socket.
  */
 
-static int inet_autobind(struct sock *sk)
+int inet_autobind(struct sock *sk)
 {
 	struct inet_sock *inet;
 	/* We may need to bind the socket. */
@@ -187,6 +187,7 @@ static int inet_autobind(struct sock *sk)
 	release_sock(sk);
 	return 0;
 }
+EXPORT_SYMBOL_GPL(inet_autobind);
 
 /*
  *	Move a socket into listening state.
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 82fb78265f4b..ceb07c83af17 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1360,6 +1360,11 @@ int udp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int noblock,
 	if (flags & MSG_ERRQUEUE)
 		return ip_recv_error(sk, msg, len, addr_len);
 
+	/* We may need to bind the socket. */
+	if (!inet_sk(sk)->inet_num && !sk->sk_prot->no_autobind &&
+	inet_autobind(sk))
+		return -EAGAIN;
+
 try_again:
 	peeking = off = sk_peek_offset(sk, flags);
 	skb = __skb_recv_datagram(sk, flags | (noblock ? MSG_DONTWAIT : 0),
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 71963b23d5a5..1c3dafc3d91e 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -341,6 +341,11 @@ int udpv6_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 	if (np->rxpmtu && np->rxopt.bits.rxpmtu)
 		return ipv6_recv_rxpmtu(sk, msg, len, addr_len);
 
+	/* We may need to bind the socket. */
+	if (!inet_sk(sk)->inet_num && !sk->sk_prot->no_autobind &&
+	inet_autobind(sk))
+		return -EAGAIN;
+
 try_again:
 	peeking = off = sk_peek_offset(sk, flags);
 	skb = __skb_recv_datagram(sk, flags | (noblock ? MSG_DONTWAIT : 0),
-- 
2.10.1



Re: [PATCH net-next 0/9] alx: add multi queue support

2016-10-24 Thread Tobias Regnery
I tested this patchset with my AR8161 ethernet card in different situations:

  - After two weeks of daily use I observed no regression with this patchset.
  - I manually tested the new error paths in the __alx-open function and in
the other newly added device bringup functions.

  - iperf udp and tcp throughput are exactly the same with and without this
patchset, regardless of the number of parallel streams.
  - netperf TCP_RR and UDP_RR tests shows a slight performance increase of
about 1-2% with this patchset.

I don't own any of the other cards supported by the driver, so if someone is
willing to test these patches on one of them, it would be highly
appreciated.

Benefits are the split between misc interrupts and the tx / rx interrupts
with the new msi-x support and better multi core cpu utilization.

Sorry for not providing this information in the patchset; I will add it
in the next revision.

--
Tobias

On 21.10.16, Chris Snook wrote:
> Can you please elaborate on the testing and benefits?
> 
> - Chris
> 
> On Fri, Oct 21, 2016 at 3:50 AM Tobias Regnery 
> wrote:
> 
> > This patchset lays the groundwork for multi queue support in the alx driver
> > and enables multi queue support for the tx path by default. The hardware
> > supports up to 4 tx queues.
> >
> > The rx path is a little bit harder because apparently (based on the limited
> > information from the downstream driver) the hardware supports up to 8 rss
> > queues but only has one hardware descriptor ring on the rx side. So the rx
> > path will be part of another patchset.
> >
> > This work is based on the downstream driver at github.com/qca/alx
> >
> > I had a hard time splitting these changes up into reasonable parts because
> > this is my first bigger kernel patchset, so please be patient if this is
> > not
> > the right approach.
> >
> > Tobias Regnery (9):
> >   alx: refactor descriptor allocation
> >   alx: extend data structures for multi queue support
> >   alx: add ability to allocate and free alx_napi structures
> >   alx: switch to per queue data structures
> >   alx: prepare interrupt functions for multiple queues
> >   alx: prepare resource allocation for multi queue support
> >   alx: prepare tx path for multi queue support
> >   alx: enable msi-x interrupts by default
> >   alx: enable multiple tx queues
> >
> >  drivers/net/ethernet/atheros/alx/alx.h  |  36 ++-
> >  drivers/net/ethernet/atheros/alx/main.c | 554
> > ++--
> >  2 files changed, 420 insertions(+), 170 deletions(-)
> >
> > --
> > 2.7.4
> >
> >


Re: net/dccp: warning in dccp_set_state

2016-10-24 Thread Eric Dumazet
On Mon, 2016-10-24 at 05:47 -0700, Eric Dumazet wrote:
> On Mon, 2016-10-24 at 14:23 +0200, Andrey Konovalov wrote:
> > Hi,
> > 
> > I've got the following error report while running the syzkaller fuzzer:
> > 
> > WARNING: CPU: 1 PID: 21072 at net/dccp/proto.c:83 dccp_set_state+0x229/0x290
> > Kernel panic - not syncing: panic_on_warn set ...
> > 
> > CPU: 1 PID: 21072 Comm: syz-executor Not tainted 4.9.0-rc1+ #293
> > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
> >  88003d4c7738 81b474f4 0003 dc00
> >  844f8b00 88003d4c7804 88003d4c7800 8140c06a
> >  41b58ab3 8479ab7d 8140beae 8140cd00
> > Call Trace:
> >  [< inline >] __dump_stack lib/dump_stack.c:15
> >  [] dump_stack+0xb3/0x10f lib/dump_stack.c:51
> >  [] panic+0x1bc/0x39d kernel/panic.c:179
> >  [] __warn+0x1cc/0x1f0 kernel/panic.c:542
> >  [] warn_slowpath_null+0x2c/0x40 kernel/panic.c:585
> >  [] dccp_set_state+0x229/0x290 net/dccp/proto.c:83
> >  [] dccp_close+0x612/0xc10 net/dccp/proto.c:1016
> >  [] inet_release+0xef/0x1c0 net/ipv4/af_inet.c:415
> >  [] sock_release+0x8e/0x1d0 net/socket.c:570
> >  [] sock_close+0x16/0x20 net/socket.c:1017
> >  [] __fput+0x29d/0x720 fs/file_table.c:208
> >  [] fput+0x15/0x20 fs/file_table.c:244
> >  [] task_work_run+0xf8/0x170 kernel/task_work.c:116
> >  [< inline >] exit_task_work include/linux/task_work.h:21
> >  [] do_exit+0x883/0x2ac0 kernel/exit.c:828
> >  [] do_group_exit+0x10e/0x340 kernel/exit.c:931
> >  [] get_signal+0x634/0x15a0 kernel/signal.c:2307
> >  [] do_signal+0x8d/0x1a30 arch/x86/kernel/signal.c:807
> >  [] exit_to_usermode_loop+0xe5/0x130
> > arch/x86/entry/common.c:156
> >  [< inline >] prepare_exit_to_usermode arch/x86/entry/common.c:190
> >  [] syscall_return_slowpath+0x1a8/0x1e0
> > arch/x86/entry/common.c:259
> >  [] entry_SYSCALL_64_fastpath+0xc0/0xc2
> > Dumping ftrace buffer:
> >(ftrace buffer empty)
> > Kernel Offset: disabled
> > 
> > On commit 1a1891d762d6e64daf07b5be4817e3fbb29e3c59 (Oct 18).
> 
> Not sure why we keep DCCP around. David, could we kill it?
> 
> TCP seems to have an additional check, missing in DCCP.
> 
> diff --git a/net/dccp/proto.c b/net/dccp/proto.c
> index 41e65804ddf5..9fe25bf63296 100644
> --- a/net/dccp/proto.c
> +++ b/net/dccp/proto.c
> @@ -1009,6 +1009,10 @@ void dccp_close(struct sock *sk, long timeout)
>   __kfree_skb(skb);
>   }
>  
> + /* If socket has been already reset kill it. */
> + if (sk->sk_state == DCCP_CLOSED)
> + goto adjudge_to_death;
> +
>   if (data_was_unread) {
>   /* Unread data was tossed, send an appropriate Reset Code */
>   DCCP_WARN("ABORT with %u bytes unread\n", data_was_unread);
> 

The equivalent TCP fix was:
https://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=565b7b2d2e632b5792879c0c9cccdd9eecd31195




[PATCH v4] skbedit: allow the user to specify bitmask for mark

2016-10-24 Thread Antonio Quartulli
The user may want to use only some bits of the skb mark in
his skbedit rules because the remaining part might be used by
something else.

Introduce the "mask" parameter to the skbedit action in order
to implement this functionality.

When a mask is specified, only the bits selected by it are
actually changed by the action, while the rest are left
untouched.

Signed-off-by: Antonio Quartulli 
Signed-off-by: Jamal Hadi Salim 

---

This patch has been sleeping for a while although it was basically ready
to be merged. I hope it can still be merged.

Checkpatch is now complaining about this:

CHECK: Comparison to NULL could be written "tb[TCA_SKBEDIT_MASK]"
#112: FILE: net/sched/act_skbedit.c:114:
+   if (tb[TCA_SKBEDIT_MASK] != NULL) {

However, the surrounding code does not use this style. Please let me know if
I should rearrange this line.


Thanks!



Changes from v3:
- rebase on top of net-next
- fix syntax error in if statement

Changes from v2:
- remove useless comments
- use nla_put_u32() and fix typo

Changes from v1:
- use '&=' in tcf_skbedit() to clear the mark
- extend tcf_skbedit_dump() in order to send the mask as well

 include/net/tc_act/tc_skbedit.h        |  1 +
 include/uapi/linux/tc_act/tc_skbedit.h |  2 ++
 net/sched/act_skbedit.c                | 21 ++++++++++++++++++---
 3 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/include/net/tc_act/tc_skbedit.h b/include/net/tc_act/tc_skbedit.h
index 5767e9dbcf92..19cd3d345804 100644
--- a/include/net/tc_act/tc_skbedit.h
+++ b/include/net/tc_act/tc_skbedit.h
@@ -27,6 +27,7 @@ struct tcf_skbedit {
u32 flags;
u32 priority;
u32 mark;
+   u32 mask;
u16 queue_mapping;
u16 ptype;
 };
diff --git a/include/uapi/linux/tc_act/tc_skbedit.h b/include/uapi/linux/tc_act/tc_skbedit.h
index a4d00c608d8f..2884425738ce 100644
--- a/include/uapi/linux/tc_act/tc_skbedit.h
+++ b/include/uapi/linux/tc_act/tc_skbedit.h
@@ -28,6 +28,7 @@
 #define SKBEDIT_F_QUEUE_MAPPING	0x2
 #define SKBEDIT_F_MARK			0x4
 #define SKBEDIT_F_PTYPE		0x8
+#define SKBEDIT_F_MASK 0x10
 
 struct tc_skbedit {
tc_gen;
@@ -42,6 +43,7 @@ enum {
TCA_SKBEDIT_MARK,
TCA_SKBEDIT_PAD,
TCA_SKBEDIT_PTYPE,
+   TCA_SKBEDIT_MASK,
__TCA_SKBEDIT_MAX
 };
 #define TCA_SKBEDIT_MAX (__TCA_SKBEDIT_MAX - 1)
diff --git a/net/sched/act_skbedit.c b/net/sched/act_skbedit.c
index a133dcb82132..024f3a3afeff 100644
--- a/net/sched/act_skbedit.c
+++ b/net/sched/act_skbedit.c
@@ -46,8 +46,10 @@ static int tcf_skbedit(struct sk_buff *skb, const struct tc_action *a,
if (d->flags & SKBEDIT_F_QUEUE_MAPPING &&
skb->dev->real_num_tx_queues > d->queue_mapping)
skb_set_queue_mapping(skb, d->queue_mapping);
-   if (d->flags & SKBEDIT_F_MARK)
-   skb->mark = d->mark;
+   if (d->flags & SKBEDIT_F_MARK) {
+   skb->mark &= ~d->mask;
+   skb->mark |= d->mark & d->mask;
+   }
if (d->flags & SKBEDIT_F_PTYPE)
skb->pkt_type = d->ptype;
 
@@ -61,6 +63,7 @@ static const struct nla_policy skbedit_policy[TCA_SKBEDIT_MAX + 1] = {
[TCA_SKBEDIT_QUEUE_MAPPING] = { .len = sizeof(u16) },
[TCA_SKBEDIT_MARK]  = { .len = sizeof(u32) },
[TCA_SKBEDIT_PTYPE] = { .len = sizeof(u16) },
+   [TCA_SKBEDIT_MASK]  = { .len = sizeof(u32) },
 };
 
 static int tcf_skbedit_init(struct net *net, struct nlattr *nla,
@@ -71,7 +74,7 @@ static int tcf_skbedit_init(struct net *net, struct nlattr *nla,
struct nlattr *tb[TCA_SKBEDIT_MAX + 1];
struct tc_skbedit *parm;
struct tcf_skbedit *d;
-   u32 flags = 0, *priority = NULL, *mark = NULL;
+   u32 flags = 0, *priority = NULL, *mark = NULL, *mask = NULL;
u16 *queue_mapping = NULL, *ptype = NULL;
bool exists = false;
int ret = 0, err;
@@ -108,6 +111,11 @@ static int tcf_skbedit_init(struct net *net, struct nlattr *nla,
mark = nla_data(tb[TCA_SKBEDIT_MARK]);
}
 
+   if (tb[TCA_SKBEDIT_MASK] != NULL) {
+   flags |= SKBEDIT_F_MASK;
+   mask = nla_data(tb[TCA_SKBEDIT_MASK]);
+   }
+
parm = nla_data(tb[TCA_SKBEDIT_PARMS]);
 
exists = tcf_hash_check(tn, parm->index, a, bind);
@@ -145,6 +153,10 @@ static int tcf_skbedit_init(struct net *net, struct nlattr *nla,
d->mark = *mark;
if (flags & SKBEDIT_F_PTYPE)
d->ptype = *ptype;
+   /* default behaviour is to use all the bits */
+   d->mask = 0xffffffff;
+   if (flags & SKBEDIT_F_MASK)
+   d->mask = *mask;
 
d->tcf_action = parm->action;
 
@@ -182,6 +194,9 @@ static int tcf_skbedit_dump(struct sk_buff *skb, struct 

Re: net/dccp: warning in dccp_set_state

2016-10-24 Thread Eric Dumazet
On Mon, 2016-10-24 at 14:23 +0200, Andrey Konovalov wrote:
> Hi,
> 
> I've got the following error report while running the syzkaller fuzzer:
> 
> WARNING: CPU: 1 PID: 21072 at net/dccp/proto.c:83 dccp_set_state+0x229/0x290
> Kernel panic - not syncing: panic_on_warn set ...
> 
> CPU: 1 PID: 21072 Comm: syz-executor Not tainted 4.9.0-rc1+ #293
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
>  88003d4c7738 81b474f4 0003 dc00
>  844f8b00 88003d4c7804 88003d4c7800 8140c06a
>  41b58ab3 8479ab7d 8140beae 8140cd00
> Call Trace:
>  [< inline >] __dump_stack lib/dump_stack.c:15
>  [] dump_stack+0xb3/0x10f lib/dump_stack.c:51
>  [] panic+0x1bc/0x39d kernel/panic.c:179
>  [] __warn+0x1cc/0x1f0 kernel/panic.c:542
>  [] warn_slowpath_null+0x2c/0x40 kernel/panic.c:585
>  [] dccp_set_state+0x229/0x290 net/dccp/proto.c:83
>  [] dccp_close+0x612/0xc10 net/dccp/proto.c:1016
>  [] inet_release+0xef/0x1c0 net/ipv4/af_inet.c:415
>  [] sock_release+0x8e/0x1d0 net/socket.c:570
>  [] sock_close+0x16/0x20 net/socket.c:1017
>  [] __fput+0x29d/0x720 fs/file_table.c:208
>  [] fput+0x15/0x20 fs/file_table.c:244
>  [] task_work_run+0xf8/0x170 kernel/task_work.c:116
>  [< inline >] exit_task_work include/linux/task_work.h:21
>  [] do_exit+0x883/0x2ac0 kernel/exit.c:828
>  [] do_group_exit+0x10e/0x340 kernel/exit.c:931
>  [] get_signal+0x634/0x15a0 kernel/signal.c:2307
>  [] do_signal+0x8d/0x1a30 arch/x86/kernel/signal.c:807
>  [] exit_to_usermode_loop+0xe5/0x130
> arch/x86/entry/common.c:156
>  [< inline >] prepare_exit_to_usermode arch/x86/entry/common.c:190
>  [] syscall_return_slowpath+0x1a8/0x1e0
> arch/x86/entry/common.c:259
>  [] entry_SYSCALL_64_fastpath+0xc0/0xc2
> Dumping ftrace buffer:
>(ftrace buffer empty)
> Kernel Offset: disabled
> 
> On commit 1a1891d762d6e64daf07b5be4817e3fbb29e3c59 (Oct 18).

Not sure why we keep DCCP around. David, could we kill it?

TCP seems to have an additional check, missing in DCCP.

diff --git a/net/dccp/proto.c b/net/dccp/proto.c
index 41e65804ddf5..9fe25bf63296 100644
--- a/net/dccp/proto.c
+++ b/net/dccp/proto.c
@@ -1009,6 +1009,10 @@ void dccp_close(struct sock *sk, long timeout)
__kfree_skb(skb);
}
 
+   /* If socket has been already reset kill it. */
+   if (sk->sk_state == DCCP_CLOSED)
+   goto adjudge_to_death;
+
if (data_was_unread) {
/* Unread data was tossed, send an appropriate Reset Code */
DCCP_WARN("ABORT with %u bytes unread\n", data_was_unread);




[PATCH net-next] ethernet: fix min/max MTU typos

2016-10-24 Thread Stefan Richter
Fixes: d894be57ca92 ("ethernet: use net core MTU range checking in more drivers")
CC: Jarod Wilson 
CC: Thomas Falcon 
Signed-off-by: Stefan Richter 
---
 drivers/net/ethernet/broadcom/sb1250-mac.c | 2 +-
 drivers/net/ethernet/ibm/ibmveth.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/sb1250-mac.c b/drivers/net/ethernet/broadcom/sb1250-mac.c
index cb312e4c89f4..435a2e4739d1 100644
--- a/drivers/net/ethernet/broadcom/sb1250-mac.c
+++ b/drivers/net/ethernet/broadcom/sb1250-mac.c
@@ -2219,7 +2219,7 @@ static int sbmac_init(struct platform_device *pldev, long long base)
 
	dev->netdev_ops = &sbmac_netdev_ops;
dev->watchdog_timeo = TX_TIMEOUT;
-   dev->max_mtu = 0;
+   dev->min_mtu = 0;
dev->max_mtu = ENET_PACKET_SIZE;
 
	netif_napi_add(dev, &sc->napi, sbmac_poll, 16);
diff --git a/drivers/net/ethernet/ibm/ibmveth.c b/drivers/net/ethernet/ibm/ibmveth.c
index 29c05d0d79a9..4a81c892fc31 100644
--- a/drivers/net/ethernet/ibm/ibmveth.c
+++ b/drivers/net/ethernet/ibm/ibmveth.c
@@ -1549,7 +1549,7 @@ static int ibmveth_probe(struct vio_dev *dev, const struct vio_device_id *id)
}
 
netdev->min_mtu = IBMVETH_MIN_MTU;
-   netdev->min_mtu = ETH_MAX_MTU;
+   netdev->max_mtu = ETH_MAX_MTU;
 
memcpy(netdev->dev_addr, mac_addr_p, ETH_ALEN);
 
-- 
Stefan Richter
-==- =-=- ==---
http://arcgraph.de/sr/

