[PATCH] vhost: introduce O(1) vq metadata cache

2016-12-13 Thread Jason Wang
When device IOTLB is enabled, all address translations were stored in
interval tree. O(lgN) searching time could be slow for virtqueue
metadata (avail, used and descriptors) since they were accessed much
often than other addresses. So this patch introduces an O(1) array
which points to the interval tree nodes that store the translations of
vq metadata. Those array were update during vq IOTLB prefetching and
were reset during each invalidation and tlb update. Each time we want
to access vq metadata, this small array were queried before interval
tree. This would be sufficient for static mappings but not dynamic
mappings, we could do optimizations on top.

Test were done with l2fwd in guest (2M hugepage):

   noiommu  | before| after
tx 1.32Mpps | 1.06Mpps(82%) | 1.30Mpps(98%)
rx 2.33Mpps | 1.46Mpps(63%) | 2.29Mpps(98%)

We can almost reach the same performance as noiommu mode.

Signed-off-by: Jason Wang 
---
 drivers/vhost/vhost.c | 136 --
 drivers/vhost/vhost.h |   8 +++
 2 files changed, 118 insertions(+), 26 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index c6f2d89..89e40b6 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -282,6 +282,22 @@ void vhost_poll_queue(struct vhost_poll *poll)
 }
 EXPORT_SYMBOL_GPL(vhost_poll_queue);
 
+static void __vhost_vq_meta_reset(struct vhost_virtqueue *vq)
+{
+   int j;
+
+   for (j = 0; j < VHOST_NUM_ADDRS; j++)
+   vq->meta_iotlb[j] = NULL;
+}
+
+static void vhost_vq_meta_reset(struct vhost_dev *d)
+{
+   int i;
+
+   for (i = 0; i < d->nvqs; ++i)
+   __vhost_vq_meta_reset(d->vqs[i]);
+}
+
 static void vhost_vq_reset(struct vhost_dev *dev,
   struct vhost_virtqueue *vq)
 {
@@ -311,6 +327,7 @@ static void vhost_vq_reset(struct vhost_dev *dev,
vq->busyloop_timeout = 0;
vq->umem = NULL;
vq->iotlb = NULL;
+   __vhost_vq_meta_reset(vq);
 }
 
 static int vhost_worker(void *data)
@@ -690,6 +707,18 @@ static int vq_memory_access_ok(void __user *log_base, 
struct vhost_umem *umem,
return 1;
 }
 
+static inline void __user *vhost_vq_meta_fetch(struct vhost_virtqueue *vq,
+  u64 addr, unsigned int size,
+  int type)
+{
+   const struct vhost_umem_node *node = vq->meta_iotlb[type];
+
+   if (!node)
+   return NULL;
+
+   return (void *)(node->userspace_addr + (u64)addr - node->start);
+}
+
 /* Can we switch to this memory table? */
 /* Caller should have device mutex but not vq mutex */
 static int memory_access_ok(struct vhost_dev *d, struct vhost_umem *umem,
@@ -732,8 +761,14 @@ static int vhost_copy_to_user(struct vhost_virtqueue *vq, 
void *to,
 * could be access through iotlb. So -EAGAIN should
 * not happen in this case.
 */
-   /* TODO: more fast path */
struct iov_iter t;
+   void __user *uaddr = vhost_vq_meta_fetch(vq,
+(u64)(uintptr_t)to, size,
+VHOST_ADDR_DESC);
+
+   if (uaddr)
+   return __copy_to_user(uaddr, from, size);
+
ret = translate_desc(vq, (u64)(uintptr_t)to, size, 
vq->iotlb_iov,
 ARRAY_SIZE(vq->iotlb_iov),
 VHOST_ACCESS_WO);
@@ -761,8 +796,14 @@ static int vhost_copy_from_user(struct vhost_virtqueue 
*vq, void *to,
 * could be access through iotlb. So -EAGAIN should
 * not happen in this case.
 */
-   /* TODO: more fast path */
+   void __user *uaddr = vhost_vq_meta_fetch(vq,
+(u64)(uintptr_t)from, size,
+VHOST_ADDR_DESC);
struct iov_iter f;
+
+   if (uaddr)
+   return __copy_from_user(to, uaddr, size);
+
ret = translate_desc(vq, (u64)(uintptr_t)from, size, 
vq->iotlb_iov,
 ARRAY_SIZE(vq->iotlb_iov),
 VHOST_ACCESS_RO);
@@ -782,17 +823,12 @@ static int vhost_copy_from_user(struct vhost_virtqueue 
*vq, void *to,
return ret;
 }
 
-static void __user *__vhost_get_user(struct vhost_virtqueue *vq,
-void *addr, unsigned size)
+static void __user *__vhost_get_user_slow(struct vhost_virtqueue *vq,
+ void *addr, unsigned int size,
+ int type)
 {
int ret;
 
-   /* This function should be called after iotlb
-* prefetch, which means we're sure that vq
-* could be access through iotlb. So -EAGAIN should
-* not happen in this case.
- 

[Query] Delayed vxlan socket creation?

2016-12-13 Thread Du, Fan
Hi

I'm interested to one Docker issue[1] which looks like related to kernel vxlan 
socket creation
as described in the thread. From my limited knowledge here, socket creation is 
synchronous ,
and after the *socket* syscall, the sock handle will be valid and ready to 
linkup.

Somehow I'm not sure the detailed scenario here, and which/how possible commit 
fix?
Thanks!

Quoted analysis:
--
(Found in kernel 3.13)
The issue happens because in older kernels when a vxlan interface is created, 
the socket creation is queued up in a worker thread which actually creates 
the socket. But this needs to happen before we bring up the link on the vxlan 
interface. 
If for some chance, the worker thread hasn't completed the creation of the 
socket 
before we did link up then when we do link up the kernel checks if the socket 
was 
created and if not it will return ENOTCONN. This was a bug in the kernel which 
got fixed
in later kernels. That is why retrying with a timer fixes the issue.

[1]: https://github.com/docker/libnetwork/issues/1247



Re: [PATCH iproute2 -net-next] lwt: BPF support for LWT

2016-12-13 Thread Thomas Graf
On 13 December 2016 at 00:41, Stephen Hemminger
 wrote:
> I went ahead and fixed these.

Thanks for fixing it up Stephen.


[PATCH v3] net: macb: Added PCI wrapper for Platform Driver.

2016-12-13 Thread Bartosz Folta
There are hardware PCI implementations of Cadence GEM network
controller. This patch will allow to use such hardware with reuse of
existing Platform Driver.

Signed-off-by: Bartosz Folta 
---
Changed in v3:
Fixed dependencies in Kconfig.
---
Changed in v2:
Respin to net-next. Changed patch formatting.
---
 drivers/net/ethernet/cadence/Kconfig|   9 ++
 drivers/net/ethernet/cadence/Makefile   |   1 +
 drivers/net/ethernet/cadence/macb.c |  31 +--
 drivers/net/ethernet/cadence/macb_pci.c | 153 
 include/linux/platform_data/macb.h  |   6 ++
 5 files changed, 195 insertions(+), 5 deletions(-)
 create mode 100644 drivers/net/ethernet/cadence/macb_pci.c

diff --git a/drivers/net/ethernet/cadence/Kconfig 
b/drivers/net/ethernet/cadence/Kconfig
index f0bcb15..608bea1 100644
--- a/drivers/net/ethernet/cadence/Kconfig
+++ b/drivers/net/ethernet/cadence/Kconfig
@@ -31,4 +31,13 @@ config MACB
  To compile this driver as a module, choose M here: the module
  will be called macb.
 
+config MACB_PCI
+   tristate "Cadence PCI MACB/GEM support"
+   depends on MACB && PCI && COMMON_CLK
+   ---help---
+ This is PCI wrapper for MACB driver.
+
+ To compile this driver as a module, choose M here: the module
+ will be called macb_pci.
+
 endif # NET_CADENCE
diff --git a/drivers/net/ethernet/cadence/Makefile 
b/drivers/net/ethernet/cadence/Makefile
index 91f79b1..4ba7559 100644
--- a/drivers/net/ethernet/cadence/Makefile
+++ b/drivers/net/ethernet/cadence/Makefile
@@ -3,3 +3,4 @@
 #
 
 obj-$(CONFIG_MACB) += macb.o
+obj-$(CONFIG_MACB_PCI) += macb_pci.o
diff --git a/drivers/net/ethernet/cadence/macb.c 
b/drivers/net/ethernet/cadence/macb.c
index 538544a..c0fb80a 100644
--- a/drivers/net/ethernet/cadence/macb.c
+++ b/drivers/net/ethernet/cadence/macb.c
@@ -404,6 +404,8 @@ static int macb_mii_probe(struct net_device *dev)
phy_irq = gpio_to_irq(pdata->phy_irq_pin);
phydev->irq = (phy_irq < 0) ? PHY_POLL : phy_irq;
}
+   } else {
+   phydev->irq = PHY_POLL;
}
 
/* attach the mac to the phy */
@@ -482,6 +484,9 @@ static int macb_mii_init(struct macb *bp)
goto err_out_unregister_bus;
}
} else {
+   for (i = 0; i < PHY_MAX_ADDR; i++)
+   bp->mii_bus->irq[i] = PHY_POLL;
+
if (pdata)
bp->mii_bus->phy_mask = pdata->phy_mask;
 
@@ -2523,16 +2528,24 @@ static int macb_clk_init(struct platform_device *pdev, 
struct clk **pclk,
 struct clk **hclk, struct clk **tx_clk,
 struct clk **rx_clk)
 {
+   struct macb_platform_data *pdata;
int err;
 
-   *pclk = devm_clk_get(>dev, "pclk");
+   pdata = dev_get_platdata(>dev);
+   if (pdata) {
+   *pclk = pdata->pclk;
+   *hclk = pdata->hclk;
+   } else {
+   *pclk = devm_clk_get(>dev, "pclk");
+   *hclk = devm_clk_get(>dev, "hclk");
+   }
+
if (IS_ERR(*pclk)) {
err = PTR_ERR(*pclk);
dev_err(>dev, "failed to get macb_clk (%u)\n", err);
return err;
}
 
-   *hclk = devm_clk_get(>dev, "hclk");
if (IS_ERR(*hclk)) {
err = PTR_ERR(*hclk);
dev_err(>dev, "failed to get hclk (%u)\n", err);
@@ -3107,15 +3120,23 @@ static int at91ether_init(struct platform_device *pdev)
 MODULE_DEVICE_TABLE(of, macb_dt_ids);
 #endif /* CONFIG_OF */
 
+static const struct macb_config default_gem_config = {
+   .caps = MACB_CAPS_GIGABIT_MODE_AVAILABLE | MACB_CAPS_JUMBO,
+   .dma_burst_length = 16,
+   .clk_init = macb_clk_init,
+   .init = macb_init,
+   .jumbo_max_len = 10240,
+};
+
 static int macb_probe(struct platform_device *pdev)
 {
+   const struct macb_config *macb_config = _gem_config;
int (*clk_init)(struct platform_device *, struct clk **,
struct clk **, struct clk **,  struct clk **)
- = macb_clk_init;
-   int (*init)(struct platform_device *) = macb_init;
+ = macb_config->clk_init;
+   int (*init)(struct platform_device *) = macb_config->init;
struct device_node *np = pdev->dev.of_node;
struct device_node *phy_node;
-   const struct macb_config *macb_config = NULL;
struct clk *pclk, *hclk = NULL, *tx_clk = NULL, *rx_clk = NULL;
unsigned int queue_mask, num_queues;
struct macb_platform_data *pdata;
diff --git a/drivers/net/ethernet/cadence/macb_pci.c 
b/drivers/net/ethernet/cadence/macb_pci.c
new file mode 100644
index 000..92be2cd
--- /dev/null
+++ b/drivers/net/ethernet/cadence/macb_pci.c
@@ -0,0 +1,153 @@
+/**
+ * macb_pci.c - Cadence GEM PCI wrapper.
+ *
+ 

Re: [RFC PATCH v3] audit: use proper refcount locking on audit_sock

2016-12-13 Thread Cong Wang
On Tue, Dec 13, 2016 at 8:00 PM, Richard Guy Briggs  wrote:
> On 2016-12-13 16:19, Cong Wang wrote:
>> On Tue, Dec 13, 2016 at 7:03 AM, Richard Guy Briggs  wrote:
>> > @@ -1283,8 +1299,10 @@ static void __net_exit audit_net_exit(struct net 
>> > *net)
>> >  {
>> > struct audit_net *aunet = net_generic(net, audit_net_id);
>> > struct sock *sock = aunet->nlsk;
>> > +   mutex_lock(_cmd_mutex);
>> > if (sock == audit_sock)
>> > auditd_reset();
>> > +   mutex_unlock(_cmd_mutex);
>>
>> This still doesn't look correct to me, b/c here we release the audit_sock
>> refcnt twice:
>>
>> 1) inside audit_reset()
>
> The audit_reset() refcount decrement corresponds to a setting of
> audit_sock only if audit_sock is still non-NULL.
>

Hmm, thinking about it again, looks like the sock == audit_sock
and audit_sock != NULL checks can guarantee we are safe. So,

Reviewed-by: Cong Wang 


Re: [PATCH v3 net-next 1/3] openvswitch: Add a missing break statement.

2016-12-13 Thread Pravin Shelar
On Tue, Nov 29, 2016 at 3:30 PM, Jarno Rajahalme  wrote:
> Add a break statement to prevent fall-through from
> OVS_KEY_ATTR_ETHERNET to OVS_KEY_ATTR_TUNNEL.  Without the break
> actions setting ethernet addresses fail to validate with log messages
> complaining about invalid tunnel attributes.
>
> Fixes: 0a6410fbde ("openvswitch: netlink: support L3 packets")
> Signed-off-by: Jarno Rajahalme 
> Acked-by: Pravin B Shelar 
> Acked-by: Jiri Benc 

Hi Jarno,
Since this is straight forward patch. can you send it separately so
that we can get it merged soon?

Thanks,
Pravin.


Re: "virtio-net: enable multiqueue by default" in linux-next breaks networking on GCE

2016-12-13 Thread Wei Xu

On 2016年12月14日 03:44, Theodore Ts'o wrote:

Jason's patch fixed the issue, so I think we have the proper fix, but
to answer your questions:

On Wed, Dec 14, 2016 at 01:46:44AM +0800, Wei Xu wrote:


Q1:
Which distribution are you using for the GCE instance?


The test appliance is based on Debian Jessie.


Q2:
Are you running xfs test as an embedded VM case, which means XFS test
appliance is also a VM inside the GCE instance? Or the kernel is built
for the instance itself?


No, GCE currently doesn't support running nested VM's (e.g., running
VM's inside GCE).  So the kernel is built for the instance itself.
The way the test appliance works is that it initially boots using the
Debian Jessie default kernel and then we kexec into the kernel under
test.


Q3:
Can this bug be reproduced for kvm-xfstests case? I'm trying to set up
a local test bed if it makes sense.


You definitely can't do it out of the box -- you need to build the
image using "gen-image --networking", and then run "kvm-xfstests -N
shell" as root.  But the bug doesn't reproduce on kvm-xfstests, using
a 4.9 host kernel and linux-next guest kernel.



OK, thanks a lot.

BTW, although this is a guest issue, is there anyway to view the GCE
host kernel or qemu(if it is) version?



Cheers,

- Ted



Re: netlink: GPF in sock_sndtimeo

2016-12-13 Thread Richard Guy Briggs
On 2016-12-13 16:17, Cong Wang wrote:
> On Tue, Dec 13, 2016 at 2:52 AM, Richard Guy Briggs  wrote:
> > It is actually the audit_pid and audit_nlk_portid that I care about
> > more.  The audit daemon could vanish or close the socket while the
> > kernel sock to which it was attached is still quite valid.  Accessing
> > the set of three atomically is the urge.  I wonder if it makes more
> > sense to test for the presence of auditd using audit_sock rather than
> > audit_pid, but still keep audit_pid for our reporting and replacement
> > strategy.  Another idea would be to put the three in one struct.
> 
> Note, the process has audit_pid should hold a refcnt to the netns too,
> so the netns can't be gone until that process is gone.

I noted that.  I did wonder if there might be a problem if all the
processes were moved to another netns with the struct sock stuck in the
now process-void netns.

This is alluded-to in 6f285b19d09f ("audit: Send replies in the proper
network namespace.").

> > Can someone explain how they think the original test was able to trigger
> > this GPF?  Network namespace shutdown while something pretended to set
> > up a new auditd?  That's impressive for a fuzzer if that's the case...
> > Is there an strace?  I guess it is all in test().
> 
> I am surprised you still don't get the race condition even when you
> are now working on v2...
> 
> The race happens in this scenarios :
> 
> 1) Create a new netns
> 
> 2) In the new netns, communicate with kauditd to set audit_sock
> 
> 3) Generate some audit messages, so kauditd will keep sending them
> via audit_sock
> 
> 4) exit the netns
> 
> 5) the previous audit_sock is now going away, but kaudit_sock could still
> access it in this small window.

Ah ok that fits...

- RGB

--
Richard Guy Briggs 
Kernel Security Engineering, Base Operating Systems, Red Hat
Remote, Ottawa, Canada
Voice: +1.647.777.2635, Internal: (81) 32635


Re: stmmac driver...

2016-12-13 Thread Jie Deng
Hi Peppe,

On 2016/12/12 22:17, Giuseppe CAVALLARO wrote:
> Hi David
>
> On 12/7/2016 7:06 PM, David Miller wrote:
>>
>> Giuseppe and Alexandre,
>>
>> There are a lot of patches and discussions happening around the stammc
>> driver lately and both of you are listed as the maintainers.
>>
>> I really need prompt and conclusive reviews of these patch submissions
>> from you, and participation in all discussions about the driver.
>
> yes we are trying to do the best.
>
>> Otherwise I have only three things I can do: 1) let the patches rot in
>> patchwork for days 2) trust that the patches are sane and fit your
>> desires and goals and just apply them or 3) reject them since they
>> aren't being reviewed properly.
>
> at this stage, I think the best is: (3).
I think the patches David mentioned also included XLGMAC. He sent this email
before I explained QoS and XLGMAC were different IPs. Do you mind we do XLGMAC
development under drivers/net/ethernet/synopsys/ ? I think we don't have
conflict since we will keep QoS development in stmmac.
>
>>
>> Thanks in advance.
>>
> you are welcome
>
>
> Peppe



[PATCH v2 1/4] siphash: add cryptographically secure hashtable function

2016-12-13 Thread Jason A. Donenfeld
SipHash is a 64-bit keyed hash function that is actually a
cryptographically secure PRF, like HMAC. Except SipHash is super fast,
and is meant to be used as a hashtable keyed lookup function.

SipHash isn't just some new trendy hash function. It's been around for a
while, and there really isn't anything that comes remotely close to
being useful in the way SipHash is. With that said, why do we need this?

There are a variety of attacks known as "hashtable poisoning" in which an
attacker forms some data such that the hash of that data will be the
same, and then preceeds to fill up all entries of a hashbucket. This is
a realistic and well-known denial-of-service vector.

Linux developers already seem to be aware that this is an issue, and
various places that use hash tables in, say, a network context, use a
non-cryptographically secure function (usually jhash) and then try to
twiddle with the key on a time basis (or in many cases just do nothing
and hope that nobody notices). While this is an admirable attempt at
solving the problem, it doesn't actually fix it. SipHash fixes it.

(It fixes it in such a sound way that you could even build a stream
cipher out of SipHash that would resist the modern cryptanalysis.)

There are a modicum of places in the kernel that are vulnerable to
hashtable poisoning attacks, either via userspace vectors or network
vectors, and there's not a reliable mechanism inside the kernel at the
moment to fix it. The first step toward fixing these issues is actually
getting a secure primitive into the kernel for developers to use. Then
we can, bit by bit, port things over to it as deemed appropriate.

Dozens of languages are already using this internally for their hash
tables. Some of the BSDs already use this in their kernels. SipHash is
a widely known high-speed solution to a widely known problem, and it's
time we catch-up.

Signed-off-by: Jason A. Donenfeld 
Cc: Jean-Philippe Aumasson 
Cc: Daniel J. Bernstein 
Cc: Linus Torvalds 
Cc: Eric Biggers 
---
Changes from v1->v2:

   - None in this patch, but see elsewhere in series.

 include/linux/siphash.h | 20 +
 lib/Kconfig.debug   |  6 ++--
 lib/Makefile|  5 ++--
 lib/siphash.c   | 76 +
 lib/test_siphash.c  | 74 +++
 5 files changed, 176 insertions(+), 5 deletions(-)
 create mode 100644 include/linux/siphash.h
 create mode 100644 lib/siphash.c
 create mode 100644 lib/test_siphash.c

diff --git a/include/linux/siphash.h b/include/linux/siphash.h
new file mode 100644
index ..6623b3090645
--- /dev/null
+++ b/include/linux/siphash.h
@@ -0,0 +1,20 @@
+/* Copyright (C) 2016 Jason A. Donenfeld 
+ *
+ * This file is provided under a dual BSD/GPLv2 license.
+ *
+ * SipHash: a fast short-input PRF
+ * https://131002.net/siphash/
+ */
+
+#ifndef _LINUX_SIPHASH_H
+#define _LINUX_SIPHASH_H
+
+#include 
+
+enum siphash_lengths {
+   SIPHASH24_KEY_LEN = 16
+};
+
+u64 siphash24(const u8 *data, size_t len, const u8 key[SIPHASH24_KEY_LEN]);
+
+#endif /* _LINUX_SIPHASH_H */
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index e6327d102184..32bbf689fc46 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1843,9 +1843,9 @@ config TEST_HASH
tristate "Perform selftest on hash functions"
default n
help
- Enable this option to test the kernel's integer ()
- and string () hash functions on boot
- (or module load).
+ Enable this option to test the kernel's integer (),
+ string (), and siphash ()
+ hash functions on boot (or module load).
 
  This is intended to help people writing architecture-specific
  optimized versions.  If unsure, say N.
diff --git a/lib/Makefile b/lib/Makefile
index 50144a3aeebd..71d398b04a74 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -22,7 +22,8 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
 sha1.o chacha20.o md5.o irq_regs.o argv_split.o \
 flex_proportions.o ratelimit.o show_mem.o \
 is_single_threaded.o plist.o decompress.o kobject_uevent.o \
-earlycpio.o seq_buf.o nmi_backtrace.o nodemask.o win_minmax.o
+earlycpio.o seq_buf.o siphash.o \
+nmi_backtrace.o nodemask.o win_minmax.o
 
 lib-$(CONFIG_MMU) += ioremap.o
 lib-$(CONFIG_SMP) += cpumask.o
@@ -44,7 +45,7 @@ obj-$(CONFIG_TEST_HEXDUMP) += test_hexdump.o
 obj-y += kstrtox.o
 obj-$(CONFIG_TEST_BPF) += test_bpf.o
 obj-$(CONFIG_TEST_FIRMWARE) += test_firmware.o
-obj-$(CONFIG_TEST_HASH) += test_hash.o
+obj-$(CONFIG_TEST_HASH) += test_hash.o test_siphash.o
 obj-$(CONFIG_TEST_KASAN) += test_kasan.o
 obj-$(CONFIG_TEST_KSTRTOX) += test-kstrtox.o
 obj-$(CONFIG_TEST_LKM) += test_module.o
diff --git a/lib/siphash.c b/lib/siphash.c
new file mode 100644
index 

[PATCH v2 3/4] secure_seq: use siphash24 instead of md5_transform

2016-12-13 Thread Jason A. Donenfeld
This gives a clear speed and security improvement. Siphash is both
faster and is more solid crypto than the aging MD5.

Rather than manually filling MD5 buffers, we simply create
a layout by a simple anonymous struct, for which gcc generates
rather efficient code.

Signed-off-by: Jason A. Donenfeld 
Cc: Andi Kleen 
---
Changes from v1->v2:

  - Rebased on the latest 4.10, and now uses top 32-bits of siphash
for the optional ts value.

 net/core/secure_seq.c | 160 +-
 1 file changed, 79 insertions(+), 81 deletions(-)

diff --git a/net/core/secure_seq.c b/net/core/secure_seq.c
index 88a8e429fc3e..abadc79cd5d3 100644
--- a/net/core/secure_seq.c
+++ b/net/core/secure_seq.c
@@ -1,3 +1,5 @@
+/* Copyright (C) 2016 Jason A. Donenfeld . All Rights 
Reserved. */
+
 #include 
 #include 
 #include 
@@ -8,14 +10,14 @@
 #include 
 #include 
 #include 
-
+#include 
 #include 
 
 #if IS_ENABLED(CONFIG_IPV6) || IS_ENABLED(CONFIG_INET)
+#include 
 #include 
-#define NET_SECRET_SIZE (MD5_MESSAGE_BYTES / 4)
 
-static u32 net_secret[NET_SECRET_SIZE] cacheline_aligned;
+static u8 net_secret[SIPHASH24_KEY_LEN];
 
 static __always_inline void net_secret_init(void)
 {
@@ -44,44 +46,39 @@ static u32 seq_scale(u32 seq)
 u32 secure_tcpv6_sequence_number(const __be32 *saddr, const __be32 *daddr,
 __be16 sport, __be16 dport, u32 *tsoff)
 {
-   u32 secret[MD5_MESSAGE_BYTES / 4];
-   u32 hash[MD5_DIGEST_WORDS];
-   u32 i;
-
+   const struct {
+   struct in6_addr saddr;
+   struct in6_addr daddr;
+   __be16 sport;
+   __be16 dport;
+   } __packed combined = {
+   .saddr = *(struct in6_addr *)saddr,
+   .daddr = *(struct in6_addr *)daddr,
+   .sport = sport,
+   .dport = dport
+   };
+   u64 hash;
net_secret_init();
-   memcpy(hash, saddr, 16);
-   for (i = 0; i < 4; i++)
-   secret[i] = net_secret[i] + (__force u32)daddr[i];
-   secret[4] = net_secret[4] +
-   (((__force u16)sport << 16) + (__force u16)dport);
-   for (i = 5; i < MD5_MESSAGE_BYTES / 4; i++)
-   secret[i] = net_secret[i];
-
-   md5_transform(hash, secret);
-
-   *tsoff = sysctl_tcp_timestamps == 1 ? hash[1] : 0;
-   return seq_scale(hash[0]);
+   hash = siphash24((const u8 *), sizeof(combined), net_secret);
+   *tsoff = sysctl_tcp_timestamps == 1 ? (hash >> 32) : 0;
+   return seq_scale(hash);
 }
 EXPORT_SYMBOL(secure_tcpv6_sequence_number);
 
 u32 secure_ipv6_port_ephemeral(const __be32 *saddr, const __be32 *daddr,
   __be16 dport)
 {
-   u32 secret[MD5_MESSAGE_BYTES / 4];
-   u32 hash[MD5_DIGEST_WORDS];
-   u32 i;
-
+   const struct {
+   struct in6_addr saddr;
+   struct in6_addr daddr;
+   __be16 dport;
+   } __packed combined = {
+   .saddr = *(struct in6_addr *)saddr,
+   .daddr = *(struct in6_addr *)daddr,
+   .dport = dport
+   };
net_secret_init();
-   memcpy(hash, saddr, 16);
-   for (i = 0; i < 4; i++)
-   secret[i] = net_secret[i] + (__force u32) daddr[i];
-   secret[4] = net_secret[4] + (__force u32)dport;
-   for (i = 5; i < MD5_MESSAGE_BYTES / 4; i++)
-   secret[i] = net_secret[i];
-
-   md5_transform(hash, secret);
-
-   return hash[0];
+   return siphash24((const u8 *), sizeof(combined), net_secret);
 }
 EXPORT_SYMBOL(secure_ipv6_port_ephemeral);
 #endif
@@ -91,33 +88,37 @@ EXPORT_SYMBOL(secure_ipv6_port_ephemeral);
 u32 secure_tcp_sequence_number(__be32 saddr, __be32 daddr,
   __be16 sport, __be16 dport, u32 *tsoff)
 {
-   u32 hash[MD5_DIGEST_WORDS];
-
+   const struct {
+   __be32 saddr;
+   __be32 daddr;
+   __be16 sport;
+   __be16 dport;
+   } __packed combined = {
+   .saddr = saddr,
+   .daddr = daddr,
+   .sport = sport,
+   .dport = dport
+   };
+   u64 hash;
net_secret_init();
-   hash[0] = (__force u32)saddr;
-   hash[1] = (__force u32)daddr;
-   hash[2] = ((__force u16)sport << 16) + (__force u16)dport;
-   hash[3] = net_secret[15];
-
-   md5_transform(hash, net_secret);
-
-   *tsoff = sysctl_tcp_timestamps == 1 ? hash[1] : 0;
-   return seq_scale(hash[0]);
+   hash = siphash24((const u8 *), sizeof(combined), net_secret);
+   *tsoff = sysctl_tcp_timestamps == 1 ? (hash >> 32) : 0;
+   return seq_scale(hash);
 }
 
 u32 secure_ipv4_port_ephemeral(__be32 saddr, __be32 daddr, __be16 dport)
 {
-   u32 hash[MD5_DIGEST_WORDS];
-
+   const struct {
+   __be32 saddr;
+   __be32 daddr;
+   __be16 

Re: [RFC PATCH v3] audit: use proper refcount locking on audit_sock

2016-12-13 Thread Richard Guy Briggs
On 2016-12-13 16:19, Cong Wang wrote:
> On Tue, Dec 13, 2016 at 7:03 AM, Richard Guy Briggs  wrote:
> > @@ -1283,8 +1299,10 @@ static void __net_exit audit_net_exit(struct net 
> > *net)
> >  {
> > struct audit_net *aunet = net_generic(net, audit_net_id);
> > struct sock *sock = aunet->nlsk;
> > +   mutex_lock(_cmd_mutex);
> > if (sock == audit_sock)
> > auditd_reset();
> > +   mutex_unlock(_cmd_mutex);
> 
> This still doesn't look correct to me, b/c here we release the audit_sock
> refcnt twice:
> 
> 1) inside audit_reset()

The audit_reset() refcount decrement corresponds to a setting of
audit_sock only if audit_sock is still non-NULL.

> 2) netlink_kernel_release()

This refcount decrement corresponds to netlink_kernel_create().

- RGB

--
Richard Guy Briggs 
Kernel Security Engineering, Base Operating Systems, Red Hat
Remote, Ottawa, Canada
Voice: +1.647.777.2635, Internal: (81) 32635


[PATCH v2 4/4] random: use siphash24 instead of md5 for get_random_int/long

2016-12-13 Thread Jason A. Donenfeld
This duplicates the current algorithm for get_random_int/long, but uses
siphash24 instead. This comes with several benefits. It's certainly
faster and more cryptographically secure than MD5. This patch also
hashes the pid, entropy, and timestamp as fixed width fields, in order
to increase diffusion.

The previous md5 algorithm used a per-cpu md5 state, which caused
successive calls to the function to chain upon each other. While it's
not entirely clear that this kind of chaining is absolutely necessary
when using a secure PRF like siphash24, it can't hurt, and the timing of
the call chain does add a degree of natural entropy. So, in keeping with
this design, instead of the massive per-cpu 64-byte md5 state, there is
instead a per-cpu previously returned value for chaining.

Signed-off-by: Jason A. Donenfeld 
Cc: Jean-Philippe Aumasson 
Cc: Ted Tso 
---
Changes from v1->v2:

  - Uses u64 instead of uint64_t
  - Uses get_cpu_ptr instead of get_cpu_var

 drivers/char/random.c | 50 +++---
 1 file changed, 31 insertions(+), 19 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index d6876d506220..61c4b45427dc 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -262,6 +262,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include 
@@ -2042,7 +2043,7 @@ struct ctl_table random_table[] = {
 };
 #endif /* CONFIG_SYSCTL */
 
-static u32 random_int_secret[MD5_MESSAGE_BYTES / 4] cacheline_aligned;
+static u8 random_int_secret[SIPHASH24_KEY_LEN];
 
 int random_int_secret_init(void)
 {
@@ -2050,8 +2051,7 @@ int random_int_secret_init(void)
return 0;
 }
 
-static DEFINE_PER_CPU(__u32 [MD5_DIGEST_WORDS], get_random_int_hash)
-   __aligned(sizeof(unsigned long));
+static DEFINE_PER_CPU(u64, get_random_int_chaining);
 
 /*
  * Get a random word for internal kernel use only. Similar to urandom but
@@ -2061,19 +2061,25 @@ static DEFINE_PER_CPU(__u32 [MD5_DIGEST_WORDS], 
get_random_int_hash)
  */
 unsigned int get_random_int(void)
 {
-   __u32 *hash;
unsigned int ret;
+   struct {
+   u64 chaining;
+   unsigned long ts;
+   unsigned long entropy;
+   pid_t pid;
+   } __packed combined;
+   u64 *chaining;
 
if (arch_get_random_int())
return ret;
 
-   hash = get_cpu_var(get_random_int_hash);
-
-   hash[0] += current->pid + jiffies + random_get_entropy();
-   md5_transform(hash, random_int_secret);
-   ret = hash[0];
-   put_cpu_var(get_random_int_hash);
-
+   chaining = get_cpu_ptr(_random_int_chaining);
+   combined.chaining = *chaining;
+   combined.ts = jiffies;
+   combined.entropy = random_get_entropy();
+   combined.pid = current->pid;
+   ret = *chaining = siphash24((u8 *), sizeof(combined), 
random_int_secret);
+   put_cpu_ptr(chaining);
return ret;
 }
 EXPORT_SYMBOL(get_random_int);
@@ -2083,19 +2089,25 @@ EXPORT_SYMBOL(get_random_int);
  */
 unsigned long get_random_long(void)
 {
-   __u32 *hash;
unsigned long ret;
+   struct {
+   u64 chaining;
+   unsigned long ts;
+   unsigned long entropy;
+   pid_t pid;
+   } __packed combined;
+   u64 *chaining;
 
if (arch_get_random_long())
return ret;
 
-   hash = get_cpu_var(get_random_int_hash);
-
-   hash[0] += current->pid + jiffies + random_get_entropy();
-   md5_transform(hash, random_int_secret);
-   ret = *(unsigned long *)hash;
-   put_cpu_var(get_random_int_hash);
-
+   chaining = get_cpu_ptr(_random_int_chaining);
+   combined.chaining = *chaining;
+   combined.ts = jiffies;
+   combined.entropy = random_get_entropy();
+   combined.pid = current->pid;
+   ret = *chaining = siphash24((u8 *), sizeof(combined), 
random_int_secret);
+   put_cpu_ptr(chaining);
return ret;
 }
 EXPORT_SYMBOL(get_random_long);
-- 
2.11.0



[PATCH v2 2/4] siphash: add convenience functions for jhash converts

2016-12-13 Thread Jason A. Donenfeld
Many jhash users currently rely on the Nwords functions. In order to
make transitions to siphash fit something people already know about, we
provide analog functions here. This also winds up being nice for the
networking stack, where hashing 32-bit fields is common.

Signed-off-by: Jason A. Donenfeld 
---
Changes from v1->v2:

  - None in this patch, but see elsewhere in series.

 include/linux/siphash.h | 33 +
 1 file changed, 33 insertions(+)

diff --git a/include/linux/siphash.h b/include/linux/siphash.h
index 6623b3090645..1391054c4c29 100644
--- a/include/linux/siphash.h
+++ b/include/linux/siphash.h
@@ -17,4 +17,37 @@ enum siphash_lengths {
 
 u64 siphash24(const u8 *data, size_t len, const u8 key[SIPHASH24_KEY_LEN]);
 
+static inline u64 siphash24_1word(const u32 a, const u8 key[SIPHASH24_KEY_LEN])
+{
+   return siphash24((u8 *), sizeof(a), key);
+}
+
+static inline u64 siphash24_2words(const u32 a, const u32 b, const u8 
key[SIPHASH24_KEY_LEN])
+{
+   const struct {
+   u32 a;
+   u32 b;
+   } __packed combined = {
+   .a = a,
+   .b = b
+   };
+
+   return siphash24((const u8 *), sizeof(combined), key);
+}
+
+static inline u64 siphash24_3words(const u32 a, const u32 b, const u32 c, 
const u8 key[SIPHASH24_KEY_LEN])
+{
+   const struct {
+   u32 a;
+   u32 b;
+   u32 c;
+   } __packed combined = {
+   .a = a,
+   .b = b,
+   .c = c
+   };
+
+   return siphash24((const u8 *), sizeof(combined), key);
+}
+
 #endif /* _LINUX_SIPHASH_H */
-- 
2.11.0



[PATCH 4/3] random: use siphash24 instead of md5 for get_random_int/long

2016-12-13 Thread Jason A. Donenfeld
This duplicates the current algorithm for get_random_int/long, but uses
siphash24 instead. This comes with several benefits. It's certainly
faster and more cryptographically secure than MD5. This patch also
hashes the pid, entropy, and timestamp as fixed width fields, in order
to increase diffusion.

The previous md5 algorithm used a per-cpu md5 state, which caused
successive calls to the function to chain upon each other. While it's
not entirely clear that this kind of chaining is absolutely necessary
when using a secure PRF like siphash24, it can't hurt, and the timing of
the call chain does add a degree of natural entropy. So, in keeping with
this design, instead of the massive per-cpu 64-byte md5 state, there is
instead a per-cpu previously returned value for chaining.

Signed-off-by: Jason A. Donenfeld 
Cc: Jean-Philippe Aumasson 
---
 drivers/char/random.c | 50 +++---
 1 file changed, 31 insertions(+), 19 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index d6876d506220..25f96f074da5 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -262,6 +262,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include 
@@ -2042,7 +2043,7 @@ struct ctl_table random_table[] = {
 };
 #endif /* CONFIG_SYSCTL */
 
-static u32 random_int_secret[MD5_MESSAGE_BYTES / 4] cacheline_aligned;
+static u8 random_int_secret[SIPHASH24_KEY_LEN];
 
 int random_int_secret_init(void)
 {
@@ -2050,8 +2051,7 @@ int random_int_secret_init(void)
return 0;
 }
 
-static DEFINE_PER_CPU(__u32 [MD5_DIGEST_WORDS], get_random_int_hash)
-   __aligned(sizeof(unsigned long));
+static DEFINE_PER_CPU(u64, get_random_int_chaining);
 
 /*
  * Get a random word for internal kernel use only. Similar to urandom but
@@ -2061,19 +2061,25 @@ static DEFINE_PER_CPU(__u32 [MD5_DIGEST_WORDS], 
get_random_int_hash)
  */
 unsigned int get_random_int(void)
 {
-   __u32 *hash;
+   uint64_t *chaining;
unsigned int ret;
+   struct {
+   uint64_t chaining;
+   unsigned long ts;
+   unsigned long entropy;
+   pid_t pid;
+   } __packed combined;
 
if (arch_get_random_int())
return ret;
 
-   hash = get_cpu_var(get_random_int_hash);
-
-   hash[0] += current->pid + jiffies + random_get_entropy();
-   md5_transform(hash, random_int_secret);
-   ret = hash[0];
-   put_cpu_var(get_random_int_hash);
-
+   chaining = _cpu_var(get_random_int_chaining);
+   combined.chaining = *chaining;
+   combined.ts = jiffies;
+   combined.entropy = random_get_entropy();
+   combined.pid = current->pid;
+   ret = *chaining = siphash24((u8 *), sizeof(combined), 
random_int_secret);
+   put_cpu_var(chaining);
return ret;
 }
 EXPORT_SYMBOL(get_random_int);
@@ -2083,19 +2089,25 @@ EXPORT_SYMBOL(get_random_int);
  */
 unsigned long get_random_long(void)
 {
-   __u32 *hash;
+   uint64_t *chaining;
unsigned long ret;
+   struct {
+   uint64_t chaining;
+   unsigned long ts;
+   unsigned long entropy;
+   pid_t pid;
+   } __packed combined;
 
if (arch_get_random_long())
return ret;
 
-   hash = get_cpu_var(get_random_int_hash);
-
-   hash[0] += current->pid + jiffies + random_get_entropy();
-   md5_transform(hash, random_int_secret);
-   ret = *(unsigned long *)hash;
-   put_cpu_var(get_random_int_hash);
-
+   chaining = _cpu_var(get_random_int_chaining);
+   combined.chaining = *chaining;
+   combined.ts = jiffies;
+   combined.entropy = random_get_entropy();
+   combined.pid = current->pid;
+   ret = *chaining = siphash24((u8 *), sizeof(combined), 
random_int_secret);
+   put_cpu_var(chaining);
return ret;
 }
 EXPORT_SYMBOL(get_random_long);
-- 
2.11.0



Re: [PATCH net-next] net: remove abuse of VLAN DEI/CFI bit

2016-12-13 Thread Alexei Starovoitov
On Tue, Dec 13, 2016 at 6:03 PM, Michał Mirosław
 wrote:
> On Tue, Dec 13, 2016 at 05:21:18PM -0800, Stephen Hemminger wrote:
>> On Sat,  3 Dec 2016 10:22:28 +0100 (CET)
>> Michał Mirosław  wrote:
>> > This All-in-one patch removes abuse of VLAN CFI bit, so it can be passed
>> > intact through linux networking stack.
>> >
>> > Signed-off-by: Michał Mirosław 
>> > ---
>> >
>> > Dear NetDevs
>> >
>> > I guess this needs to be split to the prep..convert[]..finish sequence,
>> > but if you like it as is, then it's ready.
>> >
>> > The biggest question is if the modified interface and vlan_present
>> > is the way to go. This can be changed to use vlan_proto != 0 instead
>> > of an extra flag bit.
>> >
>> > As I can't test most of the driver changes, please look at them carefully.
>> > OVS and bridge eyes are especially welcome.
>> >
>> > Best Regards,
>> > Michał Mirosław
>> Is the motivation to support 802.1ad Drop Eligability Indicator (DEI)?
>>
>> If so then you need to be more verbose in the commit log, and lots more
>> work is needed. You need to rename fields and validate every place a
>> driver is using DEI bit to make sure it really does the right thing
>> on that hardware. It is not just a mechanical change.
>
> My main motivation is to be able to see the bit intact in tcpdump and be
> able to pass it untouched through at least a veth pair. It would be great
> if all devices didn't do something stupid with the bit, but it's not
> something I am able to make happen.

imo "be able to pass untouched through veth" is not good enough
justification for such invasive patches.
I'm still not sure that all of these changes don't affect user space.


Re: [PATCH net-next] net: remove abuse of VLAN DEI/CFI bit

2016-12-13 Thread Michał Mirosław
On Tue, Dec 13, 2016 at 05:21:18PM -0800, Stephen Hemminger wrote:
> On Sat,  3 Dec 2016 10:22:28 +0100 (CET)
> Michał Mirosław  wrote:
> > This All-in-one patch removes abuse of VLAN CFI bit, so it can be passed
> > intact through linux networking stack.
> > 
> > Signed-off-by: Michał Mirosław 
> > ---
> > 
> > Dear NetDevs
> > 
> > I guess this needs to be split to the prep..convert[]..finish sequence,
> > but if you like it as is, then it's ready.
> > 
> > The biggest question is if the modified interface and vlan_present
> > is the way to go. This can be changed to use vlan_proto != 0 instead
> > of an extra flag bit.
> > 
> > As I can't test most of the driver changes, please look at them carefully.
> > OVS and bridge eyes are especially welcome.
> > 
> > Best Regards,
> > Michał Mirosław
> Is the motivation to support 802.1ad Drop Eligability Indicator (DEI)?
> 
> If so then you need to be more verbose in the commit log, and lots more
> work is needed. You need to rename fields and validate every place a
> driver is using DEI bit to make sure it really does the right thing
> on that hardware. It is not just a mechanical change.

My main motivation is to be able to see the bit intact in tcpdump and be
able to pass it untouched through at least a veth pair. It would be great
if all devices didn't do something stupid with the bit, but it's not
something I am able to make happen.

Best Regards,
Michał Mirosław


Re: [PATCH net-next 00/27] Remove VLAN CFI bit abuse

2016-12-13 Thread Michał Mirosław
On Tue, Dec 13, 2016 at 05:16:26PM -0800, Stephen Hemminger wrote:
> On Tue, 13 Dec 2016 01:12:32 +0100 (CET)
> Michał Mirosław  wrote:
> > This series removes an abuse of VLAN CFI bit in Linux networking stack.
> > Currently Linux always clears the bit on outgoing traffic and presents
> > it cleared to userspace (even via AF_PACKET/tcpdump when hw-accelerated).
> > 
> > This uses a new vlan_present bit in struct skbuff, and removes an assumption
> > that vlan_proto != 0 when VLAN tag is present.
> > 
> > As I can't test most of the driver changes, please look at them carefully.
> > 
> > The series is supposed to be bisect-friendly and that requires temporary
> > insertion of #define VLAN_TAG_PRESENT in BPF code to be able to split
> > JIT changes per architecture.
> 
> I wonder if CFI can every validly be non-zero in the modern world, on Hyper-V.
> There are no token ring devices and that seems to be the only use case where 
> CFI would
> be non-zero. Unless someone is planning to reuse it a a protocol bit which 
> seems
> like a really bad idea.
> 
> Maybe the right thing is to keep hard coded as zero and not start adding
> more untestable code conditions.
> 
> My recommendation would be get rid of VLAN_TAG_PRESENT, but don't preserve
> CFI bit.

According to Wikipedia page [1] on 802.1Q, CFI bit got already changed
to DEI (Drop eligible indicator) in 2011 revision of the IEEE standard.

I can't verify this, though.

Best Regards,
Michał Mirosław

[1] https://en.wikipedia.org/wiki/IEEE_802.1Q#Frame_format


Re: [PATCH net-next] net: remove abuse of VLAN DEI/CFI bit

2016-12-13 Thread Stephen Hemminger
On Sat,  3 Dec 2016 10:22:28 +0100 (CET)
Michał Mirosław  wrote:

> This All-in-one patch removes abuse of VLAN CFI bit, so it can be passed
> intact through linux networking stack.
> 
> Signed-off-by: Michał Mirosław 
> ---
> 
> Dear NetDevs
> 
> I guess this needs to be split to the prep..convert[]..finish sequence,
> but if you like it as is, then it's ready.
> 
> The biggest question is if the modified interface and vlan_present
> is the way to go. This can be changed to use vlan_proto != 0 instead
> of an extra flag bit.
> 
> As I can't test most of the driver changes, please look at them carefully.
> OVS and bridge eyes are especially welcome.
> 
> Best Regards,
> Michał Mirosław

Is the motivation to support 802.1ad Drop Eligability Indicator (DEI)?

If so then you need to be more verbose in the commit log, and lots more
work is needed. You need to rename fields and validate every place a
driver is using DEI bit to make sure it really does the right thing
on that hardware. It is not just a mechanical change.


Re: [PATCH net-next 00/27] Remove VLAN CFI bit abuse

2016-12-13 Thread Stephen Hemminger
On Tue, 13 Dec 2016 01:12:32 +0100 (CET)
Michał Mirosław  wrote:

> Dear NetDevs
> 
> This series removes an abuse of VLAN CFI bit in Linux networking stack.
> Currently Linux always clears the bit on outgoing traffic and presents
> it cleared to userspace (even via AF_PACKET/tcpdump when hw-accelerated).
> 
> This uses a new vlan_present bit in struct skbuff, and removes an assumption
> that vlan_proto != 0 when VLAN tag is present.
> 
> As I can't test most of the driver changes, please look at them carefully.
> 
> The series is supposed to be bisect-friendly and that requires temporary
> insertion of #define VLAN_TAG_PRESENT in BPF code to be able to split
> JIT changes per architecture.
> 
> Best Regards,
> Michał Mirosław

I wonder if CFI can every validly be non-zero in the modern world, on Hyper-V.
There are no token ring devices and that seems to be the only use case where 
CFI would
be non-zero. Unless someone is planning to reuse it a a protocol bit which seems
like a really bad idea.

Maybe the right thing is to keep hard coded as zero and not start adding
more untestable code conditions.

My recommendation would be get rid of VLAN_TAG_PRESENT, but don't preserve
CFI bit.


[PATCH 1/3] siphash: add cryptographically secure hashtable function

2016-12-13 Thread Jason A. Donenfeld
SipHash is a 64-bit keyed hash function that is actually a
cryptographically secure PRF, like HMAC. Except SipHash is super fast,
and is meant to be used as a hashtable keyed lookup function.

SipHash isn't just some new trendy hash function. It's been around for a
while, and there really isn't anything that comes remotely close to
being useful in the way SipHash is. With that said, why do we need this?

There are a variety of attacks known as "hashtable poisoning" in which an
attacker forms some data such that the hash of that data will be the
same, and then preceeds to fill up all entries of a hashbucket. This is
a realistic and well-known denial-of-service vector.

Linux developers already seem to be aware that this is an issue, and
various places that use hash tables in, say, a network context, use a
non-cryptographically secure function (usually jhash) and then try to
twiddle with the key on a time basis (or in many cases just do nothing
and hope that nobody notices). While this is an admirable attempt at
solving the problem, it doesn't actually fix it. SipHash fixes it.

(It fixes it in such a sound way that you could even build a stream
cipher out of SipHash that would resist the modern cryptanalysis.)

There are a modicum of places in the kernel that are vulnerable to
hashtable poisoning attacks, either via userspace vectors or network
vectors, and there's not a reliable mechanism inside the kernel at the
moment to fix it. The first step toward fixing these issues is actually
getting a secure primitive into the kernel for developers to use. Then
we can, bit by bit, port things over to it as deemed appropriate.

Dozens of languages are already using this internally for their hash
tables. Some of the BSDs already use this in their kernels. SipHash is
a widely known high-speed solution to a widely known problem, and it's
time we catch-up.

Signed-off-by: Jason A. Donenfeld 
Cc: Jean-Philippe Aumasson 
Cc: Daniel J. Bernstein 
---
 include/linux/siphash.h | 20 +
 lib/Kconfig.debug   |  6 ++--
 lib/Makefile|  5 ++--
 lib/siphash.c   | 76 +
 lib/test_siphash.c  | 74 +++
 5 files changed, 176 insertions(+), 5 deletions(-)
 create mode 100644 include/linux/siphash.h
 create mode 100644 lib/siphash.c
 create mode 100644 lib/test_siphash.c

diff --git a/include/linux/siphash.h b/include/linux/siphash.h
new file mode 100644
index ..6623b3090645
--- /dev/null
+++ b/include/linux/siphash.h
@@ -0,0 +1,20 @@
+/* Copyright (C) 2016 Jason A. Donenfeld 
+ *
+ * This file is provided under a dual BSD/GPLv2 license.
+ *
+ * SipHash: a fast short-input PRF
+ * https://131002.net/siphash/
+ */
+
+#ifndef _LINUX_SIPHASH_H
+#define _LINUX_SIPHASH_H
+
+#include 
+
+enum siphash_lengths {
+   SIPHASH24_KEY_LEN = 16
+};
+
+u64 siphash24(const u8 *data, size_t len, const u8 key[SIPHASH24_KEY_LEN]);
+
+#endif /* _LINUX_SIPHASH_H */
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index a6c8db1d62f6..2a1797704b41 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1823,9 +1823,9 @@ config TEST_HASH
tristate "Perform selftest on hash functions"
default n
help
- Enable this option to test the kernel's integer ()
- and string () hash functions on boot
- (or module load).
+ Enable this option to test the kernel's integer (),
+ string (), and siphash ()
+ hash functions on boot (or module load).
 
  This is intended to help people writing architecture-specific
  optimized versions.  If unsure, say N.
diff --git a/lib/Makefile b/lib/Makefile
index 50144a3aeebd..71d398b04a74 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -22,7 +22,8 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
 sha1.o chacha20.o md5.o irq_regs.o argv_split.o \
 flex_proportions.o ratelimit.o show_mem.o \
 is_single_threaded.o plist.o decompress.o kobject_uevent.o \
-earlycpio.o seq_buf.o nmi_backtrace.o nodemask.o win_minmax.o
+earlycpio.o seq_buf.o siphash.o \
+nmi_backtrace.o nodemask.o win_minmax.o
 
 lib-$(CONFIG_MMU) += ioremap.o
 lib-$(CONFIG_SMP) += cpumask.o
@@ -44,7 +45,7 @@ obj-$(CONFIG_TEST_HEXDUMP) += test_hexdump.o
 obj-y += kstrtox.o
 obj-$(CONFIG_TEST_BPF) += test_bpf.o
 obj-$(CONFIG_TEST_FIRMWARE) += test_firmware.o
-obj-$(CONFIG_TEST_HASH) += test_hash.o
+obj-$(CONFIG_TEST_HASH) += test_hash.o test_siphash.o
 obj-$(CONFIG_TEST_KASAN) += test_kasan.o
 obj-$(CONFIG_TEST_KSTRTOX) += test-kstrtox.o
 obj-$(CONFIG_TEST_LKM) += test_module.o
diff --git a/lib/siphash.c b/lib/siphash.c
new file mode 100644
index ..7b55ad3a7fe9
--- /dev/null
+++ b/lib/siphash.c
@@ -0,0 +1,76 @@
+/* Copyright (C) 2015-2016 Jason A. Donenfeld 
+ * Copyright (C) 2012-2014 

Re: [PATCH net-next 13/27] bridge: use __vlan_hwaccel helpers

2016-12-13 Thread Toshiaki Makita
On 2016/12/14 0:11, Michał Mirosław wrote:
> On Tue, Dec 13, 2016 at 03:59:46PM +0300, Sergei Shtylyov wrote:
>> Hello!
>>
>> On 12/13/2016 3:12 AM, Michał Mirosław wrote:
>>
>>> This removes assumption than vlan_tci != 0 when tag is present.
>>>
>>> Signed-off-by: Michał Mirosław 
>>> ---
>>>  net/bridge/br_netfilter_hooks.c | 14 --
>>>  net/bridge/br_private.h |  2 +-
>>>  net/bridge/br_vlan.c|  6 +++---
>>>  3 files changed, 12 insertions(+), 10 deletions(-)
>>>
>>> diff --git a/net/bridge/br_netfilter_hooks.c 
>>> b/net/bridge/br_netfilter_hooks.c
>>> index b12501a..2cc0747 100644
>>> --- a/net/bridge/br_netfilter_hooks.c
>>> +++ b/net/bridge/br_netfilter_hooks.c
>> [...]
>>> @@ -749,8 +747,12 @@ static int br_nf_dev_queue_xmit(struct net *net, 
>>> struct sock *sk, struct sk_buff
>>>
>>> data = this_cpu_ptr(_frag_data_storage);
>>>
>>> -   data->vlan_tci = skb->vlan_tci;
>>> -   data->vlan_proto = skb->vlan_proto;
>>> +   if (skb_vlan_tag_present(skb)) {
>>> +   data->vlan_tci = skb->vlan_tci;
>>> +   data->vlan_proto = skb->vlan_proto;
>>> +   } else
>>> +   data->vlan_proto = 0;
>>
>>CodingStyle: should use {} in all branches of *if* if at least one branch
>> has {}.
>>
>> [...]
>>> diff --git a/net/bridge/br_vlan.c b/net/bridge/br_vlan.c
>>> index b6de4f4..ef94664 100644
>>> --- a/net/bridge/br_vlan.c
>>> +++ b/net/bridge/br_vlan.c
>>
>>> @@ -444,8 +444,8 @@ static bool __allowed_ingress(const struct net_bridge 
>>> *br,
>>> __vlan_hwaccel_put_tag(skb, br->vlan_proto, pvid);
>>> else
>>> /* Priority-tagged Frame.
>>> -* At this point, We know that skb->vlan_tci had
>>> -* VLAN_TAG_PRESENT bit and its VID field was 0x000.
>>> +* At this point, We know that skb->vlan_tci VID
>>
>>s/We/we/.
>>
>>> +* field was 0x000.
>>
>>Simply 0, maybe?

I originally wrote it like this because we are playing with bit field here.
I meant that all of 12 bits are 0 thus we can safely perform bitwise-OR
to update the VID field.

Thanks,
Toshiaki Makita




Re: netlink: GPF in sock_sndtimeo

2016-12-13 Thread Cong Wang
On Tue, Dec 13, 2016 at 2:52 AM, Richard Guy Briggs  wrote:
> It is actually the audit_pid and audit_nlk_portid that I care about
> more.  The audit daemon could vanish or close the socket while the
> kernel sock to which it was attached is still quite valid.  Accessing
> the set of three atomically is the urge.  I wonder if it makes more
> sense to test for the presence of auditd using audit_sock rather than
> audit_pid, but still keep audit_pid for our reporting and replacement
> strategy.  Another idea would be to put the three in one struct.

Note, the process has audit_pid should hold a refcnt to the netns too,
so the netns can't be gone until that process is gone.

>
> Can someone explain how they think the original test was able to trigger
> this GPF?  Network namespace shutdown while something pretended to set
> up a new auditd?  That's impressive for a fuzzer if that's the case...
> Is there an strace?  I guess it is all in test().
>

I am surprised you still don't get the race condition even when you
are now working on v2...

The race happens in this scenarios :

1) Create a new netns

2) In the new netns, communicate with kauditd to set audit_sock

3) Generate some audit messages, so kauditd will keep sending them
via audit_sock

4) exit the netns

5) the previous audit_sock is now going away, but kaudit_sock could still
access it in this small window.


[PATCH 2/3] siphash: add convenience functions for jhash converts

2016-12-13 Thread Jason A. Donenfeld
Many jhash users currently rely on the Nwords functions. In order to
make transitions to siphash fit something people already know about, we
provide analog functions here. This also winds up being nice for the
networking stack, where hashing 32-bit fields is common.

Signed-off-by: Jason A. Donenfeld 
---
 include/linux/siphash.h | 33 +
 1 file changed, 33 insertions(+)

diff --git a/include/linux/siphash.h b/include/linux/siphash.h
index 6623b3090645..1391054c4c29 100644
--- a/include/linux/siphash.h
+++ b/include/linux/siphash.h
@@ -17,4 +17,37 @@ enum siphash_lengths {
 
 u64 siphash24(const u8 *data, size_t len, const u8 key[SIPHASH24_KEY_LEN]);
 
+static inline u64 siphash24_1word(const u32 a, const u8 key[SIPHASH24_KEY_LEN])
+{
+   return siphash24((u8 *), sizeof(a), key);
+}
+
+static inline u64 siphash24_2words(const u32 a, const u32 b, const u8 
key[SIPHASH24_KEY_LEN])
+{
+   const struct {
+   u32 a;
+   u32 b;
+   } __packed combined = {
+   .a = a,
+   .b = b
+   };
+
+   return siphash24((const u8 *), sizeof(combined), key);
+}
+
+static inline u64 siphash24_3words(const u32 a, const u32 b, const u32 c, 
const u8 key[SIPHASH24_KEY_LEN])
+{
+   const struct {
+   u32 a;
+   u32 b;
+   u32 c;
+   } __packed combined = {
+   .a = a,
+   .b = b,
+   .c = c
+   };
+
+   return siphash24((const u8 *), sizeof(combined), key);
+}
+
 #endif /* _LINUX_SIPHASH_H */
-- 
2.11.0



Re: [RFC PATCH v3] audit: use proper refcount locking on audit_sock

2016-12-13 Thread Cong Wang
On Tue, Dec 13, 2016 at 7:03 AM, Richard Guy Briggs  wrote:
> @@ -1283,8 +1299,10 @@ static void __net_exit audit_net_exit(struct net *net)
>  {
> struct audit_net *aunet = net_generic(net, audit_net_id);
> struct sock *sock = aunet->nlsk;
> +   mutex_lock(_cmd_mutex);
> if (sock == audit_sock)
> auditd_reset();
> +   mutex_unlock(_cmd_mutex);

This still doesn't look correct to me, b/c here we release the audit_sock
refcnt twice:

1) inside audit_reset()
2) netlink_kernel_release()


[PATCH 3/3] secure_seq: use fast siphash instead of slow md5

2016-12-13 Thread Jason A. Donenfeld
This gives a clear speed and security improvement. Rather than manually
filling MD5 buffers, we simply create a layout by a simple anonymous
struct, for which gcc generates rather efficient code.

Signed-off-by: Jason A. Donenfeld 
---
 net/core/secure_seq.c | 153 --
 1 file changed, 73 insertions(+), 80 deletions(-)

diff --git a/net/core/secure_seq.c b/net/core/secure_seq.c
index fd3ce461fbe6..dcee974f2fb1 100644
--- a/net/core/secure_seq.c
+++ b/net/core/secure_seq.c
@@ -1,3 +1,5 @@
+/* Copyright (C) 2016 Jason A. Donenfeld . All Rights 
Reserved. */
+
 #include 
 #include 
 #include 
@@ -8,13 +10,12 @@
 #include 
 #include 
 #include 
-
+#include 
+#include 
 #include 
 
 #if IS_ENABLED(CONFIG_IPV6) || IS_ENABLED(CONFIG_INET)
-#define NET_SECRET_SIZE (MD5_MESSAGE_BYTES / 4)
-
-static u32 net_secret[NET_SECRET_SIZE] cacheline_aligned;
+static u8 net_secret[SIPHASH24_KEY_LEN];
 
 static __always_inline void net_secret_init(void)
 {
@@ -43,43 +44,36 @@ static u32 seq_scale(u32 seq)
 __u32 secure_tcpv6_sequence_number(const __be32 *saddr, const __be32 *daddr,
   __be16 sport, __be16 dport)
 {
-   u32 secret[MD5_MESSAGE_BYTES / 4];
-   u32 hash[MD5_DIGEST_WORDS];
-   u32 i;
-
+   const struct {
+   struct in6_addr saddr;
+   struct in6_addr daddr;
+   __be16 sport;
+   __be16 dport;
+   } __packed combined = {
+   .saddr = *(struct in6_addr *)saddr,
+   .daddr = *(struct in6_addr *)daddr,
+   .sport = sport,
+   .dport = dport
+   };
net_secret_init();
-   memcpy(hash, saddr, 16);
-   for (i = 0; i < 4; i++)
-   secret[i] = net_secret[i] + (__force u32)daddr[i];
-   secret[4] = net_secret[4] +
-   (((__force u16)sport << 16) + (__force u16)dport);
-   for (i = 5; i < MD5_MESSAGE_BYTES / 4; i++)
-   secret[i] = net_secret[i];
-
-   md5_transform(hash, secret);
-
-   return seq_scale(hash[0]);
+   return seq_scale(siphash24((const u8 *), sizeof(combined), 
net_secret));
 }
 EXPORT_SYMBOL(secure_tcpv6_sequence_number);
 
 u32 secure_ipv6_port_ephemeral(const __be32 *saddr, const __be32 *daddr,
   __be16 dport)
 {
-   u32 secret[MD5_MESSAGE_BYTES / 4];
-   u32 hash[MD5_DIGEST_WORDS];
-   u32 i;
-
+   const struct {
+   struct in6_addr saddr;
+   struct in6_addr daddr;
+   __be16 dport;
+   } __packed combined = {
+   .saddr = *(struct in6_addr *)saddr,
+   .daddr = *(struct in6_addr *)daddr,
+   .dport = dport
+   };
net_secret_init();
-   memcpy(hash, saddr, 16);
-   for (i = 0; i < 4; i++)
-   secret[i] = net_secret[i] + (__force u32) daddr[i];
-   secret[4] = net_secret[4] + (__force u32)dport;
-   for (i = 5; i < MD5_MESSAGE_BYTES / 4; i++)
-   secret[i] = net_secret[i];
-
-   md5_transform(hash, secret);
-
-   return hash[0];
+   return siphash24((const u8 *), sizeof(combined), net_secret);
 }
 EXPORT_SYMBOL(secure_ipv6_port_ephemeral);
 #endif
@@ -89,32 +83,34 @@ EXPORT_SYMBOL(secure_ipv6_port_ephemeral);
 __u32 secure_tcp_sequence_number(__be32 saddr, __be32 daddr,
 __be16 sport, __be16 dport)
 {
-   u32 hash[MD5_DIGEST_WORDS];
-
+   const struct {
+   __be32 saddr;
+   __be32 daddr;
+   __be16 sport;
+   __be16 dport;
+   } __packed combined = {
+   .saddr = saddr,
+   .daddr = daddr,
+   .sport = sport,
+   .dport = dport
+   };
net_secret_init();
-   hash[0] = (__force u32)saddr;
-   hash[1] = (__force u32)daddr;
-   hash[2] = ((__force u16)sport << 16) + (__force u16)dport;
-   hash[3] = net_secret[15];
-
-   md5_transform(hash, net_secret);
-
-   return seq_scale(hash[0]);
+   return seq_scale(siphash24((const u8 *), sizeof(combined), 
net_secret));
 }
 
 u32 secure_ipv4_port_ephemeral(__be32 saddr, __be32 daddr, __be16 dport)
 {
-   u32 hash[MD5_DIGEST_WORDS];
-
+   const struct {
+   __be32 saddr;
+   __be32 daddr;
+   __be16 dport;
+   } __packed combined = {
+   .saddr = saddr,
+   .daddr = daddr,
+   .dport = dport
+   };
net_secret_init();
-   hash[0] = (__force u32)saddr;
-   hash[1] = (__force u32)daddr;
-   hash[2] = (__force u32)dport ^ net_secret[14];
-   hash[3] = net_secret[15];
-
-   md5_transform(hash, net_secret);
-
-   return hash[0];
+   return seq_scale(siphash24((const u8 *), sizeof(combined), 
net_secret));
 }
 EXPORT_SYMBOL_GPL(secure_ipv4_port_ephemeral);
 #endif
@@ -123,21 +119,22 @@ 

[PATCH net] ibmveth: calculate gso_segs for large packets

2016-12-13 Thread Thomas Falcon
Include calculations to compute the number of segments
that comprise an aggregated large packet.

Signed-off-by: Thomas Falcon 
---
 drivers/net/ethernet/ibm/ibmveth.c | 12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmveth.c 
b/drivers/net/ethernet/ibm/ibmveth.c
index fbece63..a831f94 100644
--- a/drivers/net/ethernet/ibm/ibmveth.c
+++ b/drivers/net/ethernet/ibm/ibmveth.c
@@ -1181,7 +1181,9 @@ static netdev_tx_t ibmveth_start_xmit(struct sk_buff *skb,
 
 static void ibmveth_rx_mss_helper(struct sk_buff *skb, u16 mss, int lrg_pkt)
 {
+   struct tcphdr *tcph;
int offset = 0;
+   int hdr_len;
 
/* only TCP packets will be aggregated */
if (skb->protocol == htons(ETH_P_IP)) {
@@ -1208,14 +1210,20 @@ static void ibmveth_rx_mss_helper(struct sk_buff *skb, 
u16 mss, int lrg_pkt)
/* if mss is not set through Large Packet bit/mss in rx buffer,
 * expect that the mss will be written to the tcp header checksum.
 */
+   tcph = (struct tcphdr *)(skb->data + offset);
if (lrg_pkt) {
skb_shinfo(skb)->gso_size = mss;
} else if (offset) {
-   struct tcphdr *tcph = (struct tcphdr *)(skb->data + offset);
-
skb_shinfo(skb)->gso_size = ntohs(tcph->check);
tcph->check = 0;
}
+
+   if (skb_shinfo(skb)->gso_size) {
+   hdr_len = offset + tcph->doff * 4;
+   skb_shinfo(skb)->gso_segs =
+   DIV_ROUND_UP(skb->len - hdr_len,
+skb_shinfo(skb)->gso_size);
+   }
 }
 
 static int ibmveth_poll(struct napi_struct *napi, int budget)
-- 
1.8.3.1



Re: [PATCH] [v2] net: qcom/emac: don't try to claim clocks on ACPI systems

2016-12-13 Thread Florian Fainelli
On 12/13/2016 03:49 PM, Timur Tabi wrote:
> On ACPI systems, clocks are not available to drivers directly.  They are
> handled exclusively by ACPI and/or firmware, so there is no clock driver.
> Calls to clk_get() always fail, so we should not even attempt to claim
> any clocks on ACPI systems.

Reviewed-by: Florian Fainelli 
-- 
Florian


[PATCH] [v2] net: qcom/emac: don't try to claim clocks on ACPI systems

2016-12-13 Thread Timur Tabi
On ACPI systems, clocks are not available to drivers directly.  They are
handled exclusively by ACPI and/or firmware, so there is no clock driver.
Calls to clk_get() always fail, so we should not even attempt to claim
any clocks on ACPI systems.

Signed-off-by: Timur Tabi 
---

Notes:
v2: move check into functions

 drivers/net/ethernet/qualcomm/emac/emac.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/drivers/net/ethernet/qualcomm/emac/emac.c 
b/drivers/net/ethernet/qualcomm/emac/emac.c
index ae32f85..422289c 100644
--- a/drivers/net/ethernet/qualcomm/emac/emac.c
+++ b/drivers/net/ethernet/qualcomm/emac/emac.c
@@ -460,6 +460,12 @@ static int emac_clks_phase1_init(struct platform_device 
*pdev,
 {
int ret;
 
+   /* On ACPI platforms, clocks are controlled by firmware and/or
+* ACPI, not by drivers.
+*/
+   if (has_acpi_companion(>dev))
+   return 0;
+
ret = emac_clks_get(pdev, adpt);
if (ret)
return ret;
@@ -485,6 +491,9 @@ static int emac_clks_phase2_init(struct platform_device 
*pdev,
 {
int ret;
 
+   if (has_acpi_companion(>dev))
+   return 0;
+
ret = clk_set_rate(adpt->clk[EMAC_CLK_TX], 12500);
if (ret)
return ret;
-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm
Technologies, Inc.  Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.



Re: Soft lockup in inet_put_port on 4.6

2016-12-13 Thread Tom Herbert
On Tue, Dec 13, 2016 at 3:03 PM, Craig Gallek  wrote:
> On Tue, Dec 13, 2016 at 3:51 PM, Tom Herbert  wrote:
>> I think there may be some suspicious code in inet_csk_get_port. At
>> tb_found there is:
>>
>> if (((tb->fastreuse > 0 && reuse) ||
>>  (tb->fastreuseport > 0 &&
>>   !rcu_access_pointer(sk->sk_reuseport_cb) &&
>>   sk->sk_reuseport && uid_eq(tb->fastuid, uid))) &&
>> smallest_size == -1)
>> goto success;
>> if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb, true)) {
>> if ((reuse ||
>>  (tb->fastreuseport > 0 &&
>>   sk->sk_reuseport &&
>>   !rcu_access_pointer(sk->sk_reuseport_cb) &&
>>   uid_eq(tb->fastuid, uid))) &&
>> smallest_size != -1 && --attempts >= 0) {
>> spin_unlock_bh(>lock);
>> goto again;
>> }
>> goto fail_unlock;
>> }
>>
>> AFAICT there is redundancy in these two conditionals.  The same clause
>> is being checked in both: (tb->fastreuseport > 0 &&
>> !rcu_access_pointer(sk->sk_reuseport_cb) && sk->sk_reuseport &&
>> uid_eq(tb->fastuid, uid))) && smallest_size == -1. If this is true the
>> first conditional should be hit, goto done,  and the second will never
>> evaluate that part to true-- unless the sk is changed (do we need
>> READ_ONCE for sk->sk_reuseport_cb?).
> That's an interesting point... It looks like this function also
> changed in 4.6 from using a single local_bh_disable() at the beginning
> with several spin_lock(>lock) to exclusively
> spin_lock_bh(>lock) at each locking point.  Perhaps the full bh
> disable variant was preventing the timers in your stack trace from
> running interleaved with this function before?

Could be, although dropping the lock shouldn't be able to affect the
search state. TBH, I'm a little lost in reading function, the
SO_REUSEPORT handling is pretty complicated. For instance,
rcu_access_pointer(sk->sk_reuseport_cb) is checked three times in that
function and also in every call to inet_csk_bind_conflict. I wonder if
we can simply this under the assumption that SO_REUSEPORT is only
allowed if the port number (snum) is explicitly specified.

Tom


[PATCH iproute2] Fix compile warning in get_addr_1

2016-12-13 Thread David Ahern
A recent cleanup causes a compile warning on Debian jessie:

CC   utils.o
utils.c: In function ‘get_addr_1’:
utils.c:486:21: warning: passing argument 1 of ‘ll_addr_a2n’ from incompatible 
pointer type
   len = ll_addr_a2n(>data, sizeof(addr->data), name);
 ^
In file included from utils.c:34:0:
../include/rt_names.h:27:5: note: expected ‘char *’ but argument is of type 
‘__u32 (*)[8]’
 int ll_addr_a2n(char *lladdr, int len, const char *arg);
 ^

Revert the removal of the typecast

Fixes: e1933b928125 ("utils: cleanup style")
Signed-off-by: David Ahern 
---
 lib/utils.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/lib/utils.c b/lib/utils.c
index 316b048abcfc..83c9d097c608 100644
--- a/lib/utils.c
+++ b/lib/utils.c
@@ -483,7 +483,8 @@ int get_addr_1(inet_prefix *addr, const char *name, int 
family)
if (family == AF_PACKET) {
int len;
 
-   len = ll_addr_a2n(>data, sizeof(addr->data), name);
+   len = ll_addr_a2n((char *) >data, sizeof(addr->data),
+ name);
if (len < 0)
return -1;
 
-- 
2.1.4



Re: Soft lockup in inet_put_port on 4.6

2016-12-13 Thread Craig Gallek
On Tue, Dec 13, 2016 at 3:51 PM, Tom Herbert  wrote:
> I think there may be some suspicious code in inet_csk_get_port. At
> tb_found there is:
>
> if (((tb->fastreuse > 0 && reuse) ||
>  (tb->fastreuseport > 0 &&
>   !rcu_access_pointer(sk->sk_reuseport_cb) &&
>   sk->sk_reuseport && uid_eq(tb->fastuid, uid))) &&
> smallest_size == -1)
> goto success;
> if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb, true)) {
> if ((reuse ||
>  (tb->fastreuseport > 0 &&
>   sk->sk_reuseport &&
>   !rcu_access_pointer(sk->sk_reuseport_cb) &&
>   uid_eq(tb->fastuid, uid))) &&
> smallest_size != -1 && --attempts >= 0) {
> spin_unlock_bh(>lock);
> goto again;
> }
> goto fail_unlock;
> }
>
> AFAICT there is redundancy in these two conditionals.  The same clause
> is being checked in both: (tb->fastreuseport > 0 &&
> !rcu_access_pointer(sk->sk_reuseport_cb) && sk->sk_reuseport &&
> uid_eq(tb->fastuid, uid))) && smallest_size == -1. If this is true the
> first conditional should be hit, goto done,  and the second will never
> evaluate that part to true-- unless the sk is changed (do we need
> READ_ONCE for sk->sk_reuseport_cb?).
That's an interesting point... It looks like this function also
changed in 4.6 from using a single local_bh_disable() at the beginning
with several spin_lock(>lock) to exclusively
spin_lock_bh(>lock) at each locking point.  Perhaps the full bh
disable variant was preventing the timers in your stack trace from
running interleaved with this function before?


Re: [PATCH] net: qcom/emac: don't try to claim clocks on ACPI systems

2016-12-13 Thread Timur Tabi

On 12/13/2016 04:02 PM, Florian Fainelli wrote:

No strong feelings either, it just seems easier and safer to move the
check down in the function and make it return success rather than
potentially affecting the error path within the caller of
emac_clks_phase{1,2}_init here.


I suppose that makes sense.  I'll post a V2.

--
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm
Technologies, Inc.  Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.


Re: [PATCH] net: qcom/emac: don't try to claim clocks on ACPI systems

2016-12-13 Thread Florian Fainelli
On 12/13/2016 01:54 PM, Timur Tabi wrote:
> On 12/13/2016 03:46 PM, Florian Fainelli wrote:
>> Is there a reason why the check is not moved down inwo
>> emac_clks_phase{1,2}_init functions? Do you anticipate other
>> ACPI-related changes in the future that would warrant having this check
>> moved at a higher level?
> 
> No, this is the last ACPI-related change that I expect.  I could move
> the check into those functions, but I don't see how that's any different
> than what I'm doing now.  My way avoids calling a function altogether,
> your way calls into a function only to have it return immediately.
> 
> But I don't have any strong feelings either way.  I will change it if
> you want me to.

No strong feelings either, it just seems easier and safer to move the
check down in the function and make it return success rather than
potentially affecting the error path within the caller of
emac_clks_phase{1,2}_init here.
-- 
Florian


Re: [PATCH] net: qcom/emac: don't try to claim clocks on ACPI systems

2016-12-13 Thread Timur Tabi

On 12/13/2016 03:46 PM, Florian Fainelli wrote:

Is there a reason why the check is not moved down inwo
emac_clks_phase{1,2}_init functions? Do you anticipate other
ACPI-related changes in the future that would warrant having this check
moved at a higher level?


No, this is the last ACPI-related change that I expect.  I could move 
the check into those functions, but I don't see how that's any different 
than what I'm doing now.  My way avoids calling a function altogether, 
your way calls into a function only to have it return immediately.


But I don't have any strong feelings either way.  I will change it if 
you want me to.


--
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm
Technologies, Inc.  Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.


Re: [PATCH] net: qcom/emac: don't try to claim clocks on ACPI systems

2016-12-13 Thread Florian Fainelli
On 12/13/2016 11:55 AM, Timur Tabi wrote:
> On ACPI systems, clocks are not available to drivers directly.  They are
> handled exclusively by ACPI and/or firmware, so there is no clock driver.
> Calls to clk_get() always fail, so we should not even attempt to claim
> any clocks on ACPI systems.
> 
> Signed-off-by: Timur Tabi 
> ---
>  drivers/net/ethernet/qualcomm/emac/emac.c | 22 --
>  1 file changed, 12 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/net/ethernet/qualcomm/emac/emac.c 
> b/drivers/net/ethernet/qualcomm/emac/emac.c
> index ae32f85..b1c1cdc 100644
> --- a/drivers/net/ethernet/qualcomm/emac/emac.c
> +++ b/drivers/net/ethernet/qualcomm/emac/emac.c
> @@ -627,11 +627,12 @@ static int emac_probe(struct platform_device *pdev)
>   if (ret)
>   goto err_undo_netdev;
>  
> - /* initialize clocks */
> - ret = emac_clks_phase1_init(pdev, adpt);
> - if (ret) {
> - dev_err(>dev, "could not initialize clocks\n");
> - goto err_undo_netdev;
> + if (!has_acpi_companion(>dev)) {

Is there a reason why the check is not moved down inwo
emac_clks_phase{1,2}_init functions? Do you anticipate other
ACPI-related changes in the future that would warrant having this check
moved at a higher level?
-- 
Florian


Re: sctp: suspicious rcu_dereference_check() usage in sctp_epaddr_lookup_transport

2016-12-13 Thread Marcelo Ricardo Leitner
On Tue, Dec 13, 2016 at 07:07:01PM +0100, Dmitry Vyukov wrote:
> Hello,
> 
> I am getting the following reports while running syzkaller fuzzer:
> 
> [ INFO: suspicious RCU usage. ]
> 4.9.0+ #85 Not tainted
> ---
> ./include/linux/rhashtable.h:572 suspicious rcu_dereference_check() usage!
> 
> other info that might help us debug this:
> 
> rcu_scheduler_active = 1, debug_locks = 0
> 1 lock held by syz-executor1/18023:
>  #0:  (sk_lock-AF_INET){+.+.+.}, at: [< inline >] lock_sock
> include/net/sock.h:1454
>  #0:  (sk_lock-AF_INET){+.+.+.}, at: []
> sctp_getsockopt+0x45f/0x6800 net/sctp/socket.c:6432
> 
> stack backtrace:
> CPU: 2 PID: 18023 Comm: syz-executor1 Not tainted 4.9.0+ #85
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
> Call Trace:
> [< inline >] __dump_stack lib/dump_stack.c:15
> [] dump_stack+0x2ee/0x3ef lib/dump_stack.c:51
> [] lockdep_rcu_suspicious+0x139/0x180
> kernel/locking/lockdep.c:4448
> [< inline >] __rhashtable_lookup ./include/linux/rhashtable.h:572
> [< inline >] rhltable_lookup ./include/linux/rhashtable.h:660
> [] sctp_epaddr_lookup_transport+0x641/0x930
> net/sctp/input.c:946

I think this was introduced in the rhlist converstion. We had removed
some rcu_read_lock() calls on sctp stack because rhashtable was already
calling it, but then we didn't add them back when moving to rhlist.

This code:
+/* return a transport without holding it, as it's only used under sock lock */
 struct sctp_transport *sctp_epaddr_lookup_transport(
const struct sctp_endpoint *ep,
const union sctp_addr *paddr)
 {
struct net *net = sock_net(ep->base.sk);
+   struct rhlist_head *tmp, *list;
+   struct sctp_transport *t;
struct sctp_hash_cmp_arg arg = {
-   .ep= ep,
.paddr = paddr,
.net   = net,
+   .lport = htons(ep->base.bind_addr.port),
};
 
-   return rhashtable_lookup_fast(_transport_hashtable, ,
- sctp_hash_params);
+   list = rhltable_lookup(_transport_hashtable, ,
+  sctp_hash_params);

Had an implicit rcu_read_lock() on rhashtable_lookup_fast, but it
doesn't on rhltable_lookup and rhltable_lookup uses _rcu calls which
assumes we have rcu read protection.

  Marcelo



RE: [Intel-wired-lan] [PATCH net v2] igb: re-assign hw address pointer on reset after PCI error

2016-12-13 Thread Brown, Aaron F
> From: Intel-wired-lan [mailto:intel-wired-lan-boun...@lists.osuosl.org] On
> Behalf Of Guilherme G. Piccoli
> Sent: Thursday, November 10, 2016 10:47 AM
> To: intel-wired-...@lists.osuosl.org
> Cc: netdev@vger.kernel.org; gpicc...@linux.vnet.ibm.com
> Subject: [Intel-wired-lan] [PATCH net v2] igb: re-assign hw address pointer on
> reset after PCI error
> 
> Whenever the igb driver detects the result of a read operation returns
> a value composed only by F's (like 0x), it will detach the
> net_device, clear the hw_addr pointer and warn to the user that adapter's
> link is lost - those steps happen on igb_rd32().
> 
> In case a PCI error happens on Power architecture, there's a recovery
> mechanism called EEH, that will reset the PCI slot and call driver's
> handlers to reset the adapter and network functionality as well.
> 
> We observed that once hw_addr is NULL after the error is detected on
> igb_rd32(), it's never assigned back, so in the process of resetting
> the network functionality we got a NULL pointer dereference in both
> igb_configure_tx_ring() and igb_configure_rx_ring(). In order to avoid
> such bug, this patch re-assigns the hw_addr value in the slot_reset
> handler.
> 
> Reported-by: Anthony H. Thai 
> Reported-by: Harsha Thyagaraja 
> Signed-off-by: Guilherme G. Piccoli 
> ---
>  drivers/net/ethernet/intel/igb/igb_main.c | 5 +
>  1 file changed, 5 insertions(+)

Tested-by: Aaron Brown 


Re: Soft lockup in inet_put_port on 4.6

2016-12-13 Thread Tom Herbert
I think there may be some suspicious code in inet_csk_get_port. At
tb_found there is:

if (((tb->fastreuse > 0 && reuse) ||
 (tb->fastreuseport > 0 &&
  !rcu_access_pointer(sk->sk_reuseport_cb) &&
  sk->sk_reuseport && uid_eq(tb->fastuid, uid))) &&
smallest_size == -1)
goto success;
if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb, true)) {
if ((reuse ||
 (tb->fastreuseport > 0 &&
  sk->sk_reuseport &&
  !rcu_access_pointer(sk->sk_reuseport_cb) &&
  uid_eq(tb->fastuid, uid))) &&
smallest_size != -1 && --attempts >= 0) {
spin_unlock_bh(>lock);
goto again;
}
goto fail_unlock;
}

AFAICT there is redundancy in these two conditionals.  The same clause
is being checked in both: (tb->fastreuseport > 0 &&
!rcu_access_pointer(sk->sk_reuseport_cb) && sk->sk_reuseport &&
uid_eq(tb->fastuid, uid))) && smallest_size == -1. If this is true the
first conditional should be hit, goto done,  and the second will never
evaluate that part to true-- unless the sk is changed (do we need
READ_ONCE for sk->sk_reuseport_cb?).

Another potential issue is the that the goto again goes back to doing
the port scan, but if snum had been set originally that doesn't seem
like what we want.

Thanks,
Tom




On Mon, Dec 12, 2016 at 2:24 PM, Josef Bacik  wrote:
>
> On Mon, Dec 12, 2016 at 1:44 PM, Hannes Frederic Sowa
>  wrote:
>>
>> On 12.12.2016 19:05, Josef Bacik wrote:
>>>
>>>  On Fri, Dec 9, 2016 at 11:14 PM, Eric Dumazet 
>>>  wrote:

  On Fri, 2016-12-09 at 19:47 -0800, Eric Dumazet wrote:

>
>   Hmm... Is your ephemeral port range includes the port your load
>   balancing app is using ?


  I suspect that you might have processes doing bind( port = 0) that are
  trapped into the bind_conflict() scan ?

  With 100,000 + timewaits there, this possibly hurts.

  Can you try the following loop breaker ?
>>>
>>>
>>>  It doesn't appear that the app is doing bind(port = 0) during normal
>>>  operation.  I tested this patch and it made no difference.  I'm going to
>>>  test simply restarting the app without changing to the SO_REUSEPORT
>>>  option.  Thanks,
>>
>>
>> Would it be possible to trace the time the function uses with trace? If
>> we don't see the number growing considerably over time we probably can
>> rule out that we loop somewhere in there (I would instrument
>> inet_csk_bind_conflict, __inet_hash_connect and inet_csk_get_port).
>>
>> __inet_hash_connect -> __inet_check_established also takes a lock
>> (inet_ehash_lockp) which can be locked from inet_diag code path during
>> socket diag info dumping.
>>
>> Unfortunately we couldn't reproduce it so far. :/
>
>
> So I had a bcc script running to time how long we spent in
> inet_csk_bind_conflict, __inet_hash_connect and inet_csk_get_port, but of
> course I'm an idiot and didn't actually separate out the stats so I could
> tell _which_ one was taking forever.  But anyway here's a normal
> distribution on the box
>
> Some shit   : count distribution
> 0 -> 1  : 0|   |
> 2 -> 3  : 0|   |
> 4 -> 7  : 0|   |
> 8 -> 15 : 0|   |
>16 -> 31 : 0|   |
>32 -> 63 : 0|   |
>64 -> 127: 0|   |
>   128 -> 255: 0|   |
>   256 -> 511: 0|   |
>   512 -> 1023   : 0|   |
>  1024 -> 2047   : 74   |   |
>  2048 -> 4095   : 10537
> ||
>  4096 -> 8191   : 8497 |   |
>  8192 -> 16383  : 3745 |** |
> 16384 -> 32767  : 300  |*  |
> 32768 -> 65535  : 250  |   |
> 65536 -> 131071 : 180  |   |
>131072 -> 262143 : 71   |   |
>262144 -> 524287 : 18   |   |
>  

Re: [RFC PATCH v3] audit: use proper refcount locking on audit_sock

2016-12-13 Thread Paul Moore
On Tue, Dec 13, 2016 at 10:03 AM, Richard Guy Briggs  wrote:
> Resetting audit_sock appears to be racy.
>
> audit_sock was being copied and dereferenced without using a refcount on
> the source sock.
>
> Bump the refcount on the underlying sock when we store a refrence in
> audit_sock and release it when we reset audit_sock.  audit_sock
> modification needs the audit_cmd_mutex.
>
> See: https://lkml.org/lkml/2016/11/26/232
>
> Thanks to Eric Dumazet  and Cong Wang
>  on ideas how to fix it.
>
> Signed-off-by: Richard Guy Briggs 
> ---
> There has been a lot of change in the audit code that is about to go
> upstream to address audit queue issues.  This patch is based on the
> source tree: git://git.infradead.org/users/pcmoore/audit#next
> ---
>  kernel/audit.c |   28 +++-
>  1 files changed, 23 insertions(+), 5 deletions(-)

This looks more reasonable.  I still wonder about synchronization
between threads changing the audit_* connection variables and the
kauditd_thread, but I guess we can treat that as another issue; this
patch fixes a bug and is worth merging now.

I'm building a test kernel right now, assuming nothing blows up I'll
push this patch with the rest of the audit patches tomorrow; if
something bad happens, this is going to miss the first audit pull
request.

> diff --git a/kernel/audit.c b/kernel/audit.c
> index f20eee0..3bb4126 100644
> --- a/kernel/audit.c
> +++ b/kernel/audit.c
> @@ -446,14 +446,19 @@ static void kauditd_retry_skb(struct sk_buff *skb)
>   * Description:
>   * Break the auditd/kauditd connection and move all the records in the retry
>   * queue into the hold queue in case auditd reconnects.
> + * The audit_cmd_mutex must be held when calling this function.
>   */

Don't resend, but in the future please start comments like this on the
previous line.


[PATCH iproute2 1/1] tc: pass correct conversion specifier to print 'unsigned int' action index.

2016-12-13 Thread Roman Mashak
Signed-off-by: Roman Mashak 
Signed-off-by: Jamal Hadi Salim 
---
 tc/m_bpf.c  | 2 +-
 tc/m_connmark.c | 2 +-
 tc/m_csum.c | 3 ++-
 tc/m_gact.c | 3 ++-
 tc/m_ife.c  | 2 +-
 tc/m_ipt.c  | 2 +-
 tc/m_mirred.c   | 3 ++-
 tc/m_pedit.c| 2 +-
 tc/m_simple.c   | 2 +-
 tc/m_skbedit.c  | 2 +-
 tc/m_skbmod.c   | 2 +-
 tc/m_vlan.c | 2 +-
 tc/m_xt.c   | 2 +-
 tc/m_xt_old.c   | 2 +-
 14 files changed, 17 insertions(+), 14 deletions(-)

diff --git a/tc/m_bpf.c b/tc/m_bpf.c
index 9bf2a85..6400724 100644
--- a/tc/m_bpf.c
+++ b/tc/m_bpf.c
@@ -161,7 +161,7 @@ static int bpf_print_opt(struct action_util *au, FILE *f, 
struct rtattr *arg)
}
 
fprintf(f, "default-action %s\n", action_n2a(parm->action));
-   fprintf(f, "\tindex %d ref %d bind %d", parm->index, parm->refcnt,
+   fprintf(f, "\tindex %u ref %d bind %d", parm->index, parm->refcnt,
parm->bindcnt);
 
if (show_stats) {
diff --git a/tc/m_connmark.c b/tc/m_connmark.c
index 20f98e4..295f90d 100644
--- a/tc/m_connmark.c
+++ b/tc/m_connmark.c
@@ -123,7 +123,7 @@ static int print_connmark(struct action_util *au, FILE *f, 
struct rtattr *arg)
ci = RTA_DATA(tb[TCA_CONNMARK_PARMS]);
 
fprintf(f, " connmark zone %d\n", ci->zone);
-   fprintf(f, "\t index %d ref %d bind %d", ci->index,
+   fprintf(f, "\t index %u ref %d bind %d", ci->index,
ci->refcnt, ci->bindcnt);
 
if (show_stats) {
diff --git a/tc/m_csum.c b/tc/m_csum.c
index a6e4c1e..d5b1af6 100644
--- a/tc/m_csum.c
+++ b/tc/m_csum.c
@@ -199,7 +199,8 @@ print_csum(struct action_util *au, FILE *f, struct rtattr 
*arg)
uflag_1, uflag_2, uflag_3,
uflag_4, uflag_5, uflag_6,
action_n2a(sel->action));
-   fprintf(f, "\tindex %d ref %d bind %d", sel->index, sel->refcnt, 
sel->bindcnt);
+   fprintf(f, "\tindex %u ref %d bind %d", sel->index, sel->refcnt,
+   sel->bindcnt);
 
if (show_stats) {
if (tb[TCA_CSUM_TM]) {
diff --git a/tc/m_gact.c b/tc/m_gact.c
index dc04b9f..755a3be 100644
--- a/tc/m_gact.c
+++ b/tc/m_gact.c
@@ -224,7 +224,8 @@ print_gact(struct action_util *au, FILE * f, struct rtattr 
*arg)
fprintf(f, "\n\t random type %s %s val %d",
prob_n2a(pp->ptype), action_n2a(pp->paction), pp->pval);
 #endif
-   fprintf(f, "\n\t index %d ref %d bind %d", p->index, p->refcnt, 
p->bindcnt);
+   fprintf(f, "\n\t index %u ref %d bind %d", p->index, p->refcnt,
+   p->bindcnt);
if (show_stats) {
if (tb[TCA_GACT_TM]) {
struct tcf_t *tm = RTA_DATA(tb[TCA_GACT_TM]);
diff --git a/tc/m_ife.c b/tc/m_ife.c
index e6f6153..f6131b1 100644
--- a/tc/m_ife.c
+++ b/tc/m_ife.c
@@ -312,7 +312,7 @@ static int print_ife(struct action_util *au, FILE *f, 
struct rtattr *arg)
sizeof(b2)));
}
 
-   fprintf(f, "\n\t index %d ref %d bind %d", p->index, p->refcnt,
+   fprintf(f, "\n\t index %u ref %d bind %d", p->index, p->refcnt,
p->bindcnt);
if (show_stats) {
if (tb[TCA_IFE_TM]) {
diff --git a/tc/m_ipt.c b/tc/m_ipt.c
index d6f62bd..1b935ec 100644
--- a/tc/m_ipt.c
+++ b/tc/m_ipt.c
@@ -489,7 +489,7 @@ print_ipt(struct action_util *au, FILE * f, struct rtattr 
*arg)
__u32 index;
 
index = rta_getattr_u32(tb[TCA_IPT_INDEX]);
-   fprintf(f, "\n\tindex %d", index);
+   fprintf(f, "\n\tindex %u", index);
}
 
if (tb[TCA_IPT_CNT]) {
diff --git a/tc/m_mirred.c b/tc/m_mirred.c
index 11f4c9b..35ae21f 100644
--- a/tc/m_mirred.c
+++ b/tc/m_mirred.c
@@ -260,7 +260,8 @@ print_mirred(struct action_util *au, FILE * f, struct 
rtattr *arg)
mirred_n2a(p->eaction), dev, action_n2a(p->action));
 
fprintf(f, "\n ");
-   fprintf(f, "\tindex %d ref %d bind %d", p->index, p->refcnt, 
p->bindcnt);
+   fprintf(f, "\tindex %u ref %d bind %d", p->index, p->refcnt,
+   p->bindcnt);
 
if (show_stats) {
if (tb[TCA_MIRRED_TM]) {
diff --git a/tc/m_pedit.c b/tc/m_pedit.c
index 891c2ec..8e9bf07 100644
--- a/tc/m_pedit.c
+++ b/tc/m_pedit.c
@@ -527,7 +527,7 @@ int print_pedit(struct action_util *au, FILE *f, struct 
rtattr *arg)
 
fprintf(f, " pedit action %s keys %d\n ",
action_n2a(sel->action), sel->nkeys);
-   fprintf(f, "\t index %d ref %d bind %d", sel->index, sel->refcnt,
+   fprintf(f, "\t index %u ref %d bind %d", sel->index, sel->refcnt,
sel->bindcnt);
 
if (show_stats) {
diff --git a/tc/m_simple.c b/tc/m_simple.c
index 732eaf1..3a8bd91 100644
--- a/tc/m_simple.c
+++ b/tc/m_simple.c
@@ -187,7 +187,7 @@ static int print_simple(struct action_util *au, FILE *f, 
struct rtattr *arg)
simpdata = 

Re: bpf debug info

2016-12-13 Thread Daniel Borkmann

On 12/13/2016 08:38 PM, Alexei Starovoitov wrote:

On Tue, Nov 29, 2016 at 9:01 AM, Alexei Starovoitov
 wrote:

If I try to run samples/bpf/test_cls_bpf.sh the verifier will complain:
R0=imm0,min_value=0,max_value=0 R1=pkt(id=0,off=0,r=42) R2=pkt_end
112: (0f) r4 += r3
113: (0f) r1 += r4
114: (b7) r0 = 2
115: (69) r2 = *(u16 *)(r1 +2)
invalid access to packet, off=2 size=2, R1(id=3,off=0,r=0)

Now multiply 115 * 8 and convert to hex. This is address 0x398 in llvm-objdump:
; struct udphdr *udp = data + tp_off;
  388:   r1 += r4
  390:   r0 = 2
; if (udp->dest == htons(DEFAULT_PKTGEN_UDP_PORT) ||
  398:   r2 = *(u16 *)(r1 + 2)
  3a0:   if r2 == 2304 goto 16

Now it's clear which line of C code is causing the verifier to reject.

[...]

Could llvm-objdump switch line numbering for bpf same way as verifier
output, so mapping step is not really needed?


you mean that llvm-objdump to print 113,114,115 ?
I guess it's doable. Will give it a try.


Hi Daniel,

your feature request turned out to be pretty straightforward
to implement. Please pull the latest llvm and rebuild llvm-objdump.
It will be printing instruction numbers instead of absolute addresses.
No "multiply 115 * 8 and convert to hex" steps necessary anymore.


That's great to hear, thanks for following up on this. Sounds about
right to upgrade.

Thanks,
Daniel


[PATCH] net: qcom/emac: don't try to claim clocks on ACPI systems

2016-12-13 Thread Timur Tabi
On ACPI systems, clocks are not available to drivers directly.  They are
handled exclusively by ACPI and/or firmware, so there is no clock driver.
Calls to clk_get() always fail, so we should not even attempt to claim
any clocks on ACPI systems.

Signed-off-by: Timur Tabi 
---
 drivers/net/ethernet/qualcomm/emac/emac.c | 22 --
 1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/qualcomm/emac/emac.c 
b/drivers/net/ethernet/qualcomm/emac/emac.c
index ae32f85..b1c1cdc 100644
--- a/drivers/net/ethernet/qualcomm/emac/emac.c
+++ b/drivers/net/ethernet/qualcomm/emac/emac.c
@@ -627,11 +627,12 @@ static int emac_probe(struct platform_device *pdev)
if (ret)
goto err_undo_netdev;
 
-   /* initialize clocks */
-   ret = emac_clks_phase1_init(pdev, adpt);
-   if (ret) {
-   dev_err(>dev, "could not initialize clocks\n");
-   goto err_undo_netdev;
+   if (!has_acpi_companion(>dev)) {
+   ret = emac_clks_phase1_init(pdev, adpt);
+   if (ret) {
+   dev_err(>dev, "could not initialize clocks\n");
+   goto err_undo_netdev;
+   }
}
 
netdev->watchdog_timeo = EMAC_WATCHDOG_TIME;
@@ -655,11 +656,12 @@ static int emac_probe(struct platform_device *pdev)
if (ret)
goto err_undo_mdiobus;
 
-   /* enable clocks */
-   ret = emac_clks_phase2_init(pdev, adpt);
-   if (ret) {
-   dev_err(>dev, "could not initialize clocks\n");
-   goto err_undo_mdiobus;
+   if (!has_acpi_companion(>dev)) {
+   ret = emac_clks_phase2_init(pdev, adpt);
+   if (ret) {
+   dev_err(>dev, "could not initialize clocks\n");
+   goto err_undo_mdiobus;
+   }
}
 
emac_mac_reset(adpt);
-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm
Technologies, Inc.  Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.



Re: Designing a safe RX-zero-copy Memory Model for Networking

2016-12-13 Thread John Fastabend
On 16-12-13 11:53 AM, David Miller wrote:
> From: John Fastabend 
> Date: Tue, 13 Dec 2016 09:43:59 -0800
> 
>> What does "zero-copy send packet-pages to the application/socket that
>> requested this" mean? At the moment on x86 page-flipping appears to be
>> more expensive than memcpy (I can post some data shortly) and shared
>> memory was proposed and rejected for security reasons when we were
>> working on bifurcated driver.
> 
> The whole idea is that we map all the active RX ring pages into
> userspace from the start.
> 
> And just how Jesper's page pool work will avoid DMA map/unmap,
> it will also avoid changing the userspace mapping of the pages
> as well.
> 
> Thus avoiding the TLB/VM overhead altogether.
> 

I get this but it requires applications to be isolated. The pages from
a queue can not be shared between multiple applications in different
trust domains. And the application has to be cooperative meaning it
can't "look" at data that has not been marked by the stack as OK. In
these schemes we tend to end up with something like virtio/vhost or
af_packet.

Any ACLs/filtering/switching/headers need to be done in hardware or
the application trust boundaries are broken.

If the above can not be met then a copy is needed. What I am trying
to tease out is the above comment along with other statements like
this "can be done with out HW filter features".

.John


Re: [PATCH net-next] netlink: revert broken, broken "2-clause nla_ok()"

2016-12-13 Thread David Miller
From: Alexey Dobriyan 
Date: Tue, 13 Dec 2016 22:30:15 +0300

> Commit 4f7df337fe79bba1e4c2d525525d63b5ba186bbd
> "netlink: 2-clause nla_ok()" is BROKEN.
> 
> First clause tests if "->nla_len" could even be accessed at all,
> it can not possibly be omitted.
> 
> Signed-off-by: Alexey Dobriyan 

Applied, thanks.


Re: Designing a safe RX-zero-copy Memory Model for Networking

2016-12-13 Thread David Miller
From: John Fastabend 
Date: Tue, 13 Dec 2016 09:43:59 -0800

> What does "zero-copy send packet-pages to the application/socket that
> requested this" mean? At the moment on x86 page-flipping appears to be
> more expensive than memcpy (I can post some data shortly) and shared
> memory was proposed and rejected for security reasons when we were
> working on bifurcated driver.

The whole idea is that we map all the active RX ring pages into
userspace from the start.

And just how Jesper's page pool work will avoid DMA map/unmap,
it will also avoid changing the userspace mapping of the pages
as well.

Thus avoiding the TLB/VM overhead altogether.


Re: bpf debug info

2016-12-13 Thread Alexei Starovoitov
On Tue, Nov 29, 2016 at 9:01 AM, Alexei Starovoitov
 wrote:
>> >If I try to run samples/bpf/test_cls_bpf.sh the verifier will complain:
>> >R0=imm0,min_value=0,max_value=0 R1=pkt(id=0,off=0,r=42) R2=pkt_end
>> >112: (0f) r4 += r3
>> >113: (0f) r1 += r4
>> >114: (b7) r0 = 2
>> >115: (69) r2 = *(u16 *)(r1 +2)
>> >invalid access to packet, off=2 size=2, R1(id=3,off=0,r=0)
>> >
>> >Now multiply 115 * 8 and convert to hex. This is address 0x398 in 
>> >llvm-objdump:
>> >; struct udphdr *udp = data + tp_off;
>> >  388:   r1 += r4
>> >  390:   r0 = 2
>> >; if (udp->dest == htons(DEFAULT_PKTGEN_UDP_PORT) ||
>> >  398:   r2 = *(u16 *)(r1 + 2)
>> >  3a0:   if r2 == 2304 goto 16
>> >
>> >Now it's clear which line of C code is causing the verifier to reject.
>> [...]
>>
>> Could llvm-objdump switch line numbering for bpf same way as verifier
>> output, so mapping step is not really needed?
>
> you mean that llvm-objdump to print 113,114,115 ?
> I guess it's doable. Will give it a try.

Hi Daniel,

your feature request turned out to be pretty straightforward
to implement. Please pull the latest llvm and rebuild llvm-objdump.
It will be printing instruction numbers instead of absolute addresses.
No "multiply 115 * 8 and convert to hex" steps necessary anymore.

Thanks


Re: "virtio-net: enable multiqueue by default" in linux-next breaks networking on GCE

2016-12-13 Thread Theodore Ts'o
Jason's patch fixed the issue, so I think we have the proper fix, but
to answer your questions:

On Wed, Dec 14, 2016 at 01:46:44AM +0800, Wei Xu wrote:
> 
> Q1:
> Which distribution are you using for the GCE instance?

The test appliance is based on Debian Jessie.

> Q2:
> Are you running xfs test as an embedded VM case, which means XFS test
> appliance is also a VM inside the GCE instance? Or the kernel is built
> for the instance itself?

No, GCE currently doesn't support running nested VM's (e.g., running
VM's inside GCE).  So the kernel is built for the instance itself.
The way the test appliance works is that it initially boots using the
Debian Jessie default kernel and then we kexec into the kernel under
test.

> Q3:
> Can this bug be reproduced for kvm-xfstests case? I'm trying to set up
> a local test bed if it makes sense.

You definitely can't do it out of the box -- you need to build the
image using "gen-image --networking", and then run "kvm-xfstests -N
shell" as root.  But the bug doesn't reproduce on kvm-xfstests, using
a 4.9 host kernel and linux-next guest kernel.


Cheers,

- Ted


Re: Designing a safe RX-zero-copy Memory Model for Networking

2016-12-13 Thread Hannes Frederic Sowa
On 13.12.2016 17:10, Jesper Dangaard Brouer wrote:
>> What is bad about RDMA is that it is a separate kernel subsystem.
>> What I would like to see is a deeper integration with the network
>> stack so that memory regions can be registred with a network socket
>> and work requests then can be submitted and processed that directly
>> read and write in these regions. The network stack should provide the
>> services that the hardware of the NIC does not suppport as usual.
> 
> Interesting.  So you even imagine sockets registering memory regions
> with the NIC.  If we had a proper NIC HW filter API across the drivers,
> to register the steering rule (like ibv_create_flow), this would be
> doable, but we don't (DPDK actually have an interesting proposal[1])

On a side note, this is what windows does with RIO ("registered I/O").
Maybe you want to look at the API to get some ideas: allocating and
pinning down memory in user space and registering that with sockets to
get zero-copy IO.



Re: [iproute2 v3 net-next 0/8] Add support for vrf helper

2016-12-13 Thread Stephen Hemminger
On Sun, 11 Dec 2016 16:53:07 -0800
David Ahern  wrote:

> This series adds support to iproute2 to run a command against a specific
> VRF. The user semantics are similar to 'ip netns'.
> 
> The 'ip vrf' subcommand supports 3 usages:
> 
> 1. Run a command against a given vrf:
>ip vrf exec NAME CMD
> 
>Uses the recently committed cgroup/sock BPF option. vrf directory
>is added to cgroup2 mount. Individual vrfs are created under it. BPF
>filter is attached to vrf/NAME cgroup2 to set sk_bound_dev_if to the
>device index of the VRF. From there the current process (ip's pid) is
>addded to the cgroups.proc file and the given command is exected. In
>doing so all AF_INET/AF_INET6 (ipv4/ipv6) sockets are automatically
>bound to the VRF domain.
> 
>The association is inherited parent to child allowing the command to
>be a shell from which other commands are run relative to the VRF.
> 
> 2. Show the VRF a process is bound to:
>ip vrf id [PID]
>This command essentially looks at /proc/pid/cgroup for a "::/vrf/"
>entry. If pid arg is not given current process id is used.
> 
> 3. Show process ids bound to a VRF
>ip vrf pids NAME
>This command dumps the file MNT/vrf/NAME/cgroup.procs since that file
>shows the process ids in the particular vrf cgroup.
> 
> v3
> - bpf_prog_{at,de}tach changes as requested by Daniel
> - BPF macros added to bpf_util.h versus adding a new file as requested by 
> Daniel
> 
> v2
> - updated suject of patch 3 to avoid spam filters on vger
> 
> David Ahern (8):
>   lib bpf: Add support for BPF_PROG_ATTACH and BPF_PROG_DETACH
>   bpf: export bpf_prog_load
>   bpf: Add BPF_ macros
>   move cmd_exec to lib utils
>   Add filesystem APIs to lib
>   change name_is_vrf to return index
>   libnetlink: Add variant of rtnl_talk that does not display RTNETLINK
> answers error
>   Introduce ip vrf command
> 
>  include/bpf_util.h   | 186 +
>  include/libnetlink.h |   3 +
>  include/utils.h  |   4 +
>  ip/Makefile  |   3 +-
>  ip/ip.c  |   4 +-
>  ip/ip_common.h   |   4 +-
>  ip/iplink_vrf.c  |  29 --
>  ip/ipnetns.c |  34 --
>  ip/ipvrf.c   | 289 
> +++
>  lib/Makefile |   2 +-
>  lib/bpf.c|  61 +++
>  lib/exec.c   |  41 
>  lib/fs.c | 143 +
>  lib/libnetlink.c |  20 +++-
>  man/man8/ip-vrf.8|  88 
>  15 files changed, 841 insertions(+), 70 deletions(-)
>  create mode 100644 ip/ipvrf.c
>  create mode 100644 lib/exec.c
>  create mode 100644 lib/fs.c
>  create mode 100644 man/man8/ip-vrf.8
> 

Thanks, applied. Then I went and cleanup the long lines and whitespace issues


Re: [PATCH iproute2 1/2] tc/cls_flower: Add dest UDP port to tunnel params

2016-12-13 Thread Stephen Hemminger
On Tue, 13 Dec 2016 10:07:46 +0200
Hadar Hen Zion  wrote:

> Enhance IP tunnel parameters by adding destination UDP port.
> 
> Signed-off-by: Hadar Hen Zion 
> Reviewed-by: Roi Dayan 

Both applied, thanks.


Re: [PATCH iproute2 V2 1/2] tc: flower: Fix typo and style in flower man page

2016-12-13 Thread Stephen Hemminger
On Tue, 13 Dec 2016 14:39:01 +0200
Roi Dayan  wrote:

> Replace vlan_eth_type with vlan_ethtype.
> 
> Fixes: 745d91726006 ("tc: flower: Introduce vlan support")
> Signed-off-by: Roi Dayan 
> Reviewed-by: Hadar Hen Zion 

Both applied, thanks.


sctp: suspicious rcu_dereference_check() usage in sctp_epaddr_lookup_transport

2016-12-13 Thread Dmitry Vyukov
Hello,

I am getting the following reports while running syzkaller fuzzer:

[ INFO: suspicious RCU usage. ]
4.9.0+ #85 Not tainted
---
./include/linux/rhashtable.h:572 suspicious rcu_dereference_check() usage!

other info that might help us debug this:

rcu_scheduler_active = 1, debug_locks = 0
1 lock held by syz-executor1/18023:
 #0:  (sk_lock-AF_INET){+.+.+.}, at: [< inline >] lock_sock
include/net/sock.h:1454
 #0:  (sk_lock-AF_INET){+.+.+.}, at: []
sctp_getsockopt+0x45f/0x6800 net/sctp/socket.c:6432

stack backtrace:
CPU: 2 PID: 18023 Comm: syz-executor1 Not tainted 4.9.0+ #85
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Call Trace:
[< inline >] __dump_stack lib/dump_stack.c:15
[] dump_stack+0x2ee/0x3ef lib/dump_stack.c:51
[] lockdep_rcu_suspicious+0x139/0x180
kernel/locking/lockdep.c:4448
[< inline >] __rhashtable_lookup ./include/linux/rhashtable.h:572
[< inline >] rhltable_lookup ./include/linux/rhashtable.h:660
[] sctp_epaddr_lookup_transport+0x641/0x930
net/sctp/input.c:946
[] sctp_endpoint_lookup_assoc+0x83/0x120
net/sctp/endpointola.c:335
[] sctp_addr_id2transport+0xaf/0x1e0 net/sctp/socket.c:241
[] sctp_getsockopt_peer_addr_info+0x216/0x630
net/sctp/socket.c:4625
[] sctp_getsockopt+0x2860/0x6800 net/sctp/socket.c:6500
[] sock_common_getsockopt+0x9a/0xe0 net/core/sock.c:2685
[< inline >] SYSC_getsockopt net/socket.c:1819
[] SyS_getsockopt+0x245/0x380 net/socket.c:1801
[] entry_SYSCALL_64_fastpath+0x23/0xc6
arch/x86/entry/entry_64.S:203

On commit e7aa8c2eb11ba69b1b69099c3c7bd6be3087b0ba (Dec 12).


Re: Designing a safe RX-zero-copy Memory Model for Networking

2016-12-13 Thread John Fastabend
On 16-12-13 08:10 AM, Jesper Dangaard Brouer wrote:
> 
> On Mon, 12 Dec 2016 12:06:59 -0600 (CST) Christoph Lameter  
> wrote:
>> On Mon, 12 Dec 2016, Jesper Dangaard Brouer wrote:
>>
>>> Hmmm. If you can rely on hardware setup to give you steering and
>>> dedicated access to the RX rings.  In those cases, I guess, the "push"
>>> model could be a more direct API approach.  
>>
>> If the hardware does not support steering then one should be able to
>> provide those services in software.
> 
> This is the early demux problem.  With the push-mode of registering
> memory, you need hardware steering support, for zero-copy support, as
> the software step happens after DMA engine have written into the memory.
> 
> My model pre-VMA map all the pages in the RX ring (if zero-copy gets
> enabled, by a single user).  The software step can filter and zero-copy
> send packet-pages to the application/socket that requested this. The

What does "zero-copy send packet-pages to the application/socket that
requested this" mean? At the moment on x86 page-flipping appears to be
more expensive than memcpy (I can post some data shortly) and shared
memory was proposed and rejected for security reasons when we were
working on bifurcated driver.

> disadvantage is all zero-copy application need to share this VMA
> mapping.  This is solved by configuring HW filters into a RX-queue, and
> then only attach your zero-copy application to that queue.
> 
> 
>>> I was shooting for a model that worked without hardware support.
>>> And then transparently benefit from HW support by configuring a HW
>>> filter into a specific RX queue and attaching/using to that queue.  
>>
>> The discussion here is a bit amusing since these issues have been
>> resolved a long time ago with the design of the RDMA subsystem. Zero
>> copy is already in wide use. Memory registration is used to pin down
>> memory areas. Work requests can be filed with the RDMA subsystem that
>> then send and receive packets from the registered memory regions.
>> This is not strictly remote memory access but this is a basic mode of
>> operations supported  by the RDMA subsystem. The mlx5 driver quoted
>> here supports all of that.
> 
> I hear what you are saying.  I will look into a push-model, as it might
> be a better solution.
>  I will read up on RDMA + verbs and learn more about their API model.  I
> even plan to write a small sample program to get a feeling for the API,
> and maybe we can use that as a baseline for the performance target we
> can obtain on the same HW. (Thanks to Björn for already giving me some
> pointer here)
> 
> 
>> What is bad about RDMA is that it is a separate kernel subsystem.
>> What I would like to see is a deeper integration with the network
>> stack so that memory regions can be registred with a network socket
>> and work requests then can be submitted and processed that directly
>> read and write in these regions. The network stack should provide the
>> services that the hardware of the NIC does not suppport as usual.
> 
> Interesting.  So you even imagine sockets registering memory regions
> with the NIC.  If we had a proper NIC HW filter API across the drivers,
> to register the steering rule (like ibv_create_flow), this would be
> doable, but we don't (DPDK actually have an interesting proposal[1])
> 

Note rte_flow is in the same family of APIs as the proposed Flow API
that was rejected as well.  The features in Flow API that are not
included in the rte_flow proposal have logical extensions to support
them. In kernel we have 'tc' and multiple vendors support cls_flower
and cls_tc which offer a subset of the functionality in the DPDK
implementation.

Are you suggesting 'tc' is not a proper NIC HW filter API?

>  
>> The RX/TX ring in user space should be an additional mode of
>> operation of the socket layer. Once that is in place the "Remote
>> memory acces" can be trivially implemented on top of that and the
>> ugly RDMA sidecar subsystem can go away.
>  
> I cannot follow that 100%, but I guess you are saying we also need a
> more efficient mode of handing over pages/packet to userspace (than
> going through the normal socket API calls).
> 
> 
> Appreciate your input, it challenged my thinking.
> 



Re: "virtio-net: enable multiqueue by default" in linux-next breaks networking on GCE

2016-12-13 Thread Wei Xu


On 2016年12月13日 07:33, Theodore Ts'o wrote:

Hi,

I was doing a last minute regression test of the ext4 tree before
sending a pull request to Linus, which I do using gce-xfstests[1], and
I found that using networking was broken on GCE on linux-next.  I was
using next-20161209, and after bisecting things, I narrowed down the
commit which causing things to break to commit 449000102901:
"virtio-net: enable multiqueue by default".  Reverting this commit on
top of next-20161209 fixed the problem.

[1] http://thunk.org/gce-xfstests

You can reproduce the problem for building the kernel for Google
Compute Engine --- I use a config such as this [2], and then try to
boot a kernel on a VM.  The way I do this involves booting a test
appliance and then kexec'ing into the kernel to be tested[3], using a
2cpu configuration.  (GCE machine type: n1-standard-2)

[2] 
https://git.kernel.org/cgit/fs/ext2/xfstests-bld.git/tree/kernel-configs/ext4-x86_64-config-4.9
[3] 
https://github.com/tytso/xfstests-bld/blob/master/Documentation/gce-xfstests.md

You can then take a look at serial console using a command such as
"gcloud compute instances get-serial-port-output ", and
you will get something like this (see attached).  The important bit is
that the dhclient command is completely failing to be able to get a
response from the network, from which I deduce that apparently that
either networking send or receive or both seem to be badly affected by
the commit in question.

Please let me know if there's anything I can do to help you debug this
further.


Hi Ted,
Just had a quick try on GCE, sorry for my stupid questions.

Q1:
Which distribution are you using for the GCE instance?

Q2:
Are you running xfs test as an embedded VM case, which means XFS test
appliance is also a VM inside the GCE instance? Or the kernel is built
for the instance itself?

Q3:
Can this bug be reproduced for kvm-xfstests case? I'm trying to set up
a local test bed if it makes sense.



Cheers,

- Ted

Dec 11 23:53:20 xfstests-201612120451 kernel: [0.00] Linux version 
4.9.0-rc8-ext4-06387-g03e5cbd (tytso@tytso-ssd) (gcc version 4.9.2 (Debian 
4.9.2-10) ) #9 SMP Mon Dec 12 04:50:16 UTC 2016
Dec 11 23:53:20 xfstests-201612120451 kernel: [0.00] Command line: 
root=/dev/sda1 ro console=ttyS0,38400n8 elevator=noop console=ttyS0  
fstestcfg=4k fstestset=-g,quick fstestexc= fstestopt=aex fstesttyp=ext4 
fstestapi=1.3
Dec 11 23:53:20 xfstests-201612120451 kernel: [0.00] x86/fpu: 
Supporting XSAVE feature 0x001: 'x87 floating point registers'
Dec 11 23:53:20 xfstests-201612120451 kernel: [0.00] x86/fpu: 
Supporting XSAVE feature 0x002: 'SSE registers'
Dec 11 23:53:20 xfstests-201612120451 kernel: [0.00] x86/fpu: 
Supporting XSAVE feature 0x004: 'AVX registers'
Dec 11 23:53:20 xfstests-201612120451 kernel: [0.00] x86/fpu: 
xstate_offset[2]:  576, xstate_sizes[2]:  256
Dec 11 23:53:20 xfstests-201612120451 kernel: [0.00] x86/fpu: Enabled 
xstate features 0x7, context size is 832 bytes, using 'standard' format.
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Started Load Kernel Modules.
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Starting Apply Kernel 
Variables...
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Mounting Configuration File 
System...
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Mounting FUSE Control File 
System...
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Mounted FUSE Control File 
System.
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Mounted Configuration File 
System.
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Started Apply Kernel 
Variables.
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Started Create Static Device 
Nodes in /dev.
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Starting udev Kernel Device 
Manager...
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Started udev Kernel Device 
Manager.
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Started udev Coldplug all 
Devices.
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Starting udev Wait for 
Complete Device Initialization...
Dec 11 23:53:20 xfstests-201612120451 systemd-fsck[1659]: xfstests-root: clean, 
56268/655360 files, 357439/2620928 blocks
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Started File System Check on 
Root Device.
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Starting Remount Root and 
Kernel File Systems...
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Started Remount Root and 
Kernel File Systems.
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Started Various fixups to 
make systemd work better on Debian.
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Starting Load/Save Random 
Seed...
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Starting Local File Systems 
(Pre).
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Reached target Local File 
Systems (Pre).
Dec 11 23:53:20 xfstests-201612120451 

Re: [PATCH iproute2 -net-next] lwt: BPF support for LWT

2016-12-13 Thread Stephen Hemminger
On Mon, 12 Dec 2016 01:14:35 +0100
Daniel Borkmann  wrote:

> +
> +static int lwt_parse_bpf(struct rtattr *rta, size_t len, int *argcp, char 
> ***argvp,
> +  int attr, const enum bpf_prog_type bpf_type)

Please break long lines like this.


> +
> + /* argv is currently the first unparsed argument,
> +  * but the lwt_parse_encap() caller will move to the next,
> +  * so step back */
> + *argcp = argc + 1;

iproute2 uses kernel comment style. 

I went ahead and fixed these.


RE: [PATCH net-next 07/27] gianfar: remove use of VLAN_TAG_PRESENT

2016-12-13 Thread Claudiu Manoil
>-Original Message-
>From: Michał Mirosław [mailto:mirq-li...@rere.qmqm.pl]
>Sent: Tuesday, December 13, 2016 2:13 AM
>To: netdev@vger.kernel.org
>Cc: Claudiu Manoil 
>Subject: [PATCH net-next 07/27] gianfar: remove use of VLAN_TAG_PRESENT
>
>Signed-off-by: Michał Mirosław 
>---
> drivers/net/ethernet/freescale/gianfar_ethtool.c | 8 +++-
> 1 file changed, 3 insertions(+), 5 deletions(-)
>
>diff --git a/drivers/net/ethernet/freescale/gianfar_ethtool.c
>b/drivers/net/ethernet/freescale/gianfar_ethtool.c
>index 56588f2..95fa647 100644
>--- a/drivers/net/ethernet/freescale/gianfar_ethtool.c
>+++ b/drivers/net/ethernet/freescale/gianfar_ethtool.c
>@@ -1155,11 +1155,9 @@ static int gfar_convert_to_filer(struct
>ethtool_rx_flow_spec *rule,
>   prio = vlan_tci_prio(rule);
>   prio_mask = vlan_tci_priom(rule);
>
>-  if (cfi == VLAN_TAG_PRESENT && cfi_mask ==
>VLAN_TAG_PRESENT) {
>-  vlan |= RQFPR_CFI;
>-  vlan_mask |= RQFPR_CFI;
>-  } else if (cfi != VLAN_TAG_PRESENT &&
>- cfi_mask == VLAN_TAG_PRESENT) {
>+  if (cfi_mask) {
>+  if (cfi)
>+  vlan |= RQFPR_CFI;
>   vlan_mask |= RQFPR_CFI;
>   }
>   }

Reviewed-by: Claudiu Manoil 


Re: [PATCH] kcm: fix spelling mistake in Kconfig, "connectons"

2016-12-13 Thread Colin Ian King
On 13/12/16 17:30, Colin King wrote:
> From: Colin Ian King 
> 
> Trivial fix to spelling mistake "connectons" to "connections" in
> Kconfig text.
> 
> Signed-off-by: Colin Ian King 
> ---
>  net/kcm/Kconfig | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/kcm/Kconfig b/net/kcm/Kconfig
> index 87fca36..23b01e1 100644
> --- a/net/kcm/Kconfig
> +++ b/net/kcm/Kconfig
> @@ -7,5 +7,5 @@ config AF_KCM
>   ---help---
> KCM (Kernel Connection Multiplexor) sockets provide a method
> for multiplexing messages of a message based application
> -   protocol over kernel connectons (e.g. TCP connections).
> +   protocol over kernel connections (e.g. TCP connections).
>  
> 
Oops, ignore that, I was working on the wrong tree.




[PATCH] kcm: fix spelling mistake in Kconfig, "connectons"

2016-12-13 Thread Colin King
From: Colin Ian King 

Trivial fix to spelling mistake "connectons" to "connections" in
Kconfig text.

Signed-off-by: Colin Ian King 
---
 net/kcm/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/kcm/Kconfig b/net/kcm/Kconfig
index 87fca36..23b01e1 100644
--- a/net/kcm/Kconfig
+++ b/net/kcm/Kconfig
@@ -7,5 +7,5 @@ config AF_KCM
---help---
  KCM (Kernel Connection Multiplexor) sockets provide a method
  for multiplexing messages of a message based application
- protocol over kernel connectons (e.g. TCP connections).
+ protocol over kernel connections (e.g. TCP connections).
 
-- 
2.10.2



[PATCH net-next] netlink: revert broken, broken "2-clause nla_ok()"

2016-12-13 Thread Alexey Dobriyan
Commit 4f7df337fe79bba1e4c2d525525d63b5ba186bbd
"netlink: 2-clause nla_ok()" is BROKEN.

First clause tests if "->nla_len" could even be accessed at all,
it can not possibly be omitted.

Signed-off-by: Alexey Dobriyan 
---

 include/net/netlink.h |3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- a/include/net/netlink.h
+++ b/include/net/netlink.h
@@ -698,7 +698,8 @@ static inline int nla_len(const struct nlattr *nla)
  */
 static inline int nla_ok(const struct nlattr *nla, int remaining)
 {
-   return nla->nla_len >= sizeof(*nla) &&
+   return remaining >= (int) sizeof(*nla) &&
+  nla->nla_len >= sizeof(*nla) &&
   nla->nla_len <= remaining;
 }
 


Re: [PATCH] net: mvpp2: fix dma unmapping of TX buffers for fragments

2016-12-13 Thread Marcin Wojtas
Hi Thomas,

Reviewed-by: Marcin Wojtas 

Best regards,
Marcin

2016-12-13 17:53 GMT+01:00 Thomas Petazzoni
:
> Since commit 71ce391dfb784 ("net: mvpp2: enable proper per-CPU TX
> buffers unmapping"), we are not correctly DMA unmapping TX buffers for
> fragments.
>
> Indeed, the mvpp2_txq_inc_put() function only stores in the
> txq_cpu->tx_buffs[] array the physical address of the buffer to be
> DMA-unmapped when skb != NULL. In addition, when DMA-unmapping, we use
> skb_headlen(skb) to get the size to be unmapped. Both of this works fine
> for TX descriptors that are associated directly to a SKB, but not the
> ones that are used for fragments, with a NULL pointer as skb:
>
>  - We have a NULL physical address when calling DMA unmap
>  - skb_headlen(skb) crashes because skb is NULL
>
> This causes random crashes when fragments are used.
>
> To solve this problem, this commit:
>
>  - Stores the physical address of the buffer to be unmapped
>unconditionally, regardless of whether it is tied to a SKB or not.
>
>  - Adds a txq_cpu->tx_data_size[] array to store the size of the DMA
>buffer to be unmapped upon TX completion.
>
> Fixes: 71ce391dfb784 ("net: mvpp2: enable proper per-CPU TX buffers 
> unmapping")
> Reported-by: Raphael G 
> Cc: Raphael G 
> Cc: sta...@vger.kernel.org
> Signed-off-by: Thomas Petazzoni 
> ---
>  drivers/net/ethernet/marvell/mvpp2.c | 20 
>  1 file changed, 16 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/net/ethernet/marvell/mvpp2.c 
> b/drivers/net/ethernet/marvell/mvpp2.c
> index 1026c45..d168b13 100644
> --- a/drivers/net/ethernet/marvell/mvpp2.c
> +++ b/drivers/net/ethernet/marvell/mvpp2.c
> @@ -791,6 +791,8 @@ struct mvpp2_txq_pcpu {
> /* Array of transmitted buffers' physical addresses */
> dma_addr_t *tx_buffs;
>
> +   size_t *tx_data_size;
> +
> /* Index of last TX DMA descriptor that was inserted */
> int txq_put_index;
>
> @@ -980,9 +982,10 @@ static void mvpp2_txq_inc_put(struct mvpp2_txq_pcpu 
> *txq_pcpu,
>   struct mvpp2_tx_desc *tx_desc)
>  {
> txq_pcpu->tx_skb[txq_pcpu->txq_put_index] = skb;
> -   if (skb)
> -   txq_pcpu->tx_buffs[txq_pcpu->txq_put_index] =
> -
> tx_desc->buf_phys_addr;
> +   txq_pcpu->tx_data_size[txq_pcpu->txq_put_index] =
> +   tx_desc->data_size;
> +   txq_pcpu->tx_buffs[txq_pcpu->txq_put_index] =
> +   tx_desc->buf_phys_addr;
> txq_pcpu->txq_put_index++;
> if (txq_pcpu->txq_put_index == txq_pcpu->size)
> txq_pcpu->txq_put_index = 0;
> @@ -4404,11 +4407,13 @@ static void mvpp2_txq_bufs_free(struct mvpp2_port 
> *port,
> dma_addr_t buf_phys_addr =
> 
> txq_pcpu->tx_buffs[txq_pcpu->txq_get_index];
> struct sk_buff *skb = 
> txq_pcpu->tx_skb[txq_pcpu->txq_get_index];
> +   size_t data_size =
> +   txq_pcpu->tx_data_size[txq_pcpu->txq_get_index];
>
> mvpp2_txq_inc_get(txq_pcpu);
>
> dma_unmap_single(port->dev->dev.parent, buf_phys_addr,
> -skb_headlen(skb), DMA_TO_DEVICE);
> +data_size, DMA_TO_DEVICE);
> if (!skb)
> continue;
> dev_kfree_skb_any(skb);
> @@ -4662,6 +4667,11 @@ static int mvpp2_txq_init(struct mvpp2_port *port,
> if (!txq_pcpu->tx_buffs)
> goto error;
>
> +   txq_pcpu->tx_data_size = kmalloc(txq_pcpu->size *
> +sizeof(size_t), GFP_KERNEL);
> +   if (!txq_pcpu->tx_data_size)
> +   goto error;
> +
> txq_pcpu->count = 0;
> txq_pcpu->reserved_num = 0;
> txq_pcpu->txq_put_index = 0;
> @@ -4675,6 +4685,7 @@ static int mvpp2_txq_init(struct mvpp2_port *port,
> txq_pcpu = per_cpu_ptr(txq->pcpu, cpu);
> kfree(txq_pcpu->tx_skb);
> kfree(txq_pcpu->tx_buffs);
> +   kfree(txq_pcpu->tx_data_size);
> }
>
> dma_free_coherent(port->dev->dev.parent,
> @@ -4695,6 +4706,7 @@ static void mvpp2_txq_deinit(struct mvpp2_port *port,
> txq_pcpu = per_cpu_ptr(txq->pcpu, cpu);
> kfree(txq_pcpu->tx_skb);
> kfree(txq_pcpu->tx_buffs);
> +   kfree(txq_pcpu->tx_data_size);
> }
>
> if (txq->descs)
> --
> 2.7.4
>


[PATCH] net: mvpp2: fix dma unmapping of TX buffers for fragments

2016-12-13 Thread Thomas Petazzoni
Since commit 71ce391dfb784 ("net: mvpp2: enable proper per-CPU TX
buffers unmapping"), we are not correctly DMA unmapping TX buffers for
fragments.

Indeed, the mvpp2_txq_inc_put() function only stores in the
txq_cpu->tx_buffs[] array the physical address of the buffer to be
DMA-unmapped when skb != NULL. In addition, when DMA-unmapping, we use
skb_headlen(skb) to get the size to be unmapped. Both of this works fine
for TX descriptors that are associated directly to a SKB, but not the
ones that are used for fragments, with a NULL pointer as skb:

 - We have a NULL physical address when calling DMA unmap
 - skb_headlen(skb) crashes because skb is NULL

This causes random crashes when fragments are used.

To solve this problem, this commit:

 - Stores the physical address of the buffer to be unmapped
   unconditionally, regardless of whether it is tied to a SKB or not.

 - Adds a txq_cpu->tx_data_size[] array to store the size of the DMA
   buffer to be unmapped upon TX completion.

Fixes: 71ce391dfb784 ("net: mvpp2: enable proper per-CPU TX buffers unmapping")
Reported-by: Raphael G 
Cc: Raphael G 
Cc: sta...@vger.kernel.org
Signed-off-by: Thomas Petazzoni 
---
 drivers/net/ethernet/marvell/mvpp2.c | 20 
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvpp2.c 
b/drivers/net/ethernet/marvell/mvpp2.c
index 1026c45..d168b13 100644
--- a/drivers/net/ethernet/marvell/mvpp2.c
+++ b/drivers/net/ethernet/marvell/mvpp2.c
@@ -791,6 +791,8 @@ struct mvpp2_txq_pcpu {
/* Array of transmitted buffers' physical addresses */
dma_addr_t *tx_buffs;
 
+   size_t *tx_data_size;
+
/* Index of last TX DMA descriptor that was inserted */
int txq_put_index;
 
@@ -980,9 +982,10 @@ static void mvpp2_txq_inc_put(struct mvpp2_txq_pcpu 
*txq_pcpu,
  struct mvpp2_tx_desc *tx_desc)
 {
txq_pcpu->tx_skb[txq_pcpu->txq_put_index] = skb;
-   if (skb)
-   txq_pcpu->tx_buffs[txq_pcpu->txq_put_index] =
-tx_desc->buf_phys_addr;
+   txq_pcpu->tx_data_size[txq_pcpu->txq_put_index] =
+   tx_desc->data_size;
+   txq_pcpu->tx_buffs[txq_pcpu->txq_put_index] =
+   tx_desc->buf_phys_addr;
txq_pcpu->txq_put_index++;
if (txq_pcpu->txq_put_index == txq_pcpu->size)
txq_pcpu->txq_put_index = 0;
@@ -4404,11 +4407,13 @@ static void mvpp2_txq_bufs_free(struct mvpp2_port *port,
dma_addr_t buf_phys_addr =
txq_pcpu->tx_buffs[txq_pcpu->txq_get_index];
struct sk_buff *skb = txq_pcpu->tx_skb[txq_pcpu->txq_get_index];
+   size_t data_size =
+   txq_pcpu->tx_data_size[txq_pcpu->txq_get_index];
 
mvpp2_txq_inc_get(txq_pcpu);
 
dma_unmap_single(port->dev->dev.parent, buf_phys_addr,
-skb_headlen(skb), DMA_TO_DEVICE);
+data_size, DMA_TO_DEVICE);
if (!skb)
continue;
dev_kfree_skb_any(skb);
@@ -4662,6 +4667,11 @@ static int mvpp2_txq_init(struct mvpp2_port *port,
if (!txq_pcpu->tx_buffs)
goto error;
 
+   txq_pcpu->tx_data_size = kmalloc(txq_pcpu->size *
+sizeof(size_t), GFP_KERNEL);
+   if (!txq_pcpu->tx_data_size)
+   goto error;
+
txq_pcpu->count = 0;
txq_pcpu->reserved_num = 0;
txq_pcpu->txq_put_index = 0;
@@ -4675,6 +4685,7 @@ static int mvpp2_txq_init(struct mvpp2_port *port,
txq_pcpu = per_cpu_ptr(txq->pcpu, cpu);
kfree(txq_pcpu->tx_skb);
kfree(txq_pcpu->tx_buffs);
+   kfree(txq_pcpu->tx_data_size);
}
 
dma_free_coherent(port->dev->dev.parent,
@@ -4695,6 +4706,7 @@ static void mvpp2_txq_deinit(struct mvpp2_port *port,
txq_pcpu = per_cpu_ptr(txq->pcpu, cpu);
kfree(txq_pcpu->tx_skb);
kfree(txq_pcpu->tx_buffs);
+   kfree(txq_pcpu->tx_data_size);
}
 
if (txq->descs)
-- 
2.7.4



Re: [PATCH] ARM: add cmpxchg64 helper for ARMv7-M

2016-12-13 Thread Russell King - ARM Linux
On Sat, Dec 10, 2016 at 01:32:34PM +0100, Pablo Neira Ayuso wrote:
> Hi Arnd,
> 
> On Sat, Dec 10, 2016 at 11:36:34AM +0100, Arnd Bergmann wrote:
> > A change to the netfilter code in net-next introduced the first caller of
> > cmpxchg64 that can get built on ARMv7-M, leading to an error from the
> > assembler that points out the lack of 64-bit atomics on this architecture:
> > 
> > /tmp/ccMe7djj.s: Assembler messages:
> > /tmp/ccMe7djj.s:367: Error: selected processor does not support `ldrexd 
> > r0,r1,[lr]' in Thumb mode
> > /tmp/ccMe7djj.s:371: Error: selected processor does not support `strexd 
> > ip,r2,r3,[lr]' in Thumb mode
> > /tmp/ccMe7djj.s:389: Error: selected processor does not support `ldrexd 
> > r8,r9,[r7]' in Thumb mode
> > /tmp/ccMe7djj.s:393: Error: selected processor does not support `strexd 
> > lr,r0,r1,[r7]' in Thumb mode
> > scripts/Makefile.build:299: recipe for target 'net/netfilter/nft_counter.o' 
> > failed
> > 
> > This makes ARMv7-M use the same emulation from asm-generic/cmpxchg-local.h
> > that we use on architectures earlier than ARMv6K, to fix the build. The
> > 32-bit atomics are available on ARMv7-M and we keep using them there.
> > This ARM specific change is probably something we should do regardless
> > of the netfilter code.
> > 
> > However, looking at the new nft_counter_reset() function in nft_counter.c,
> > this looks incorrect to me not just on ARMv7-M but also on other
> > architectures, with at least the following possible race:
> 
> Right, Eric Dumazet already spotted this problem. I'm preparing a
> patch that doesn't require cmpxchg64(). Will keep you on Cc. Thanks.

Please keep me on the Cc as well so I know what's happening, thanks.

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.


Re: [PATCH V2 18/22] bnxt_re: Support for DCB

2016-12-13 Thread Jason Gunthorpe
On Tue, Dec 13, 2016 at 11:55:55AM +0530, Selvin Xavier wrote:

> v1 eth_type is not defined. All vendor drivers have their own definition.

Send a cleanup patch?

Jason


Re: double free issue, mvpp2 driver, armada375 modules

2016-12-13 Thread Thomas Petazzoni
Hello,

On Mon, 2 May 2016 11:46:46 +0200, Raphael G wrote:

> enclosed the kernel panic we obtain after boot with a slightly patched 
> upstream kernel (4.5.2).
> 
> (as well as the patchset applied to the upstream kernel, so that you can 
> know which code we are talking about. Too bad we cannot use the upstream 
> kernel but we had no choice about this + we're no experts so we rely on 
> provided patches, adapted to our bootloader and hardware for this)
> 
> Reproduce:
> boot on kernel on an armada375 module, connect to it, launch a top in 
> commandline
> 
> As seen with Marcin Wojtas, reverting commit 
> e864b4c7b184bde36fa6a02bb3190983d2f796f9 fixes this issue.
> 
> Reporting upstream so that you can decide what should be done next

Thanks for your report. I have finally submitted a fix for this issue.
You can find it at:

  
http://lists.infradead.org/pipermail/linux-arm-kernel/2016-December/473989.html

If you have the time to test it and report back, it would be very
useful.

Thanks a lot!

Thomas
-- 
Thomas Petazzoni, CTO, Free Electrons
Embedded Linux and Kernel engineering
http://free-electrons.com


Re: Designing a safe RX-zero-copy Memory Model for Networking

2016-12-13 Thread Christoph Lameter
On Tue, 13 Dec 2016, Jesper Dangaard Brouer wrote:

> This is the early demux problem.  With the push-mode of registering
> memory, you need hardware steering support, for zero-copy support, as
> the software step happens after DMA engine have written into the memory.

Right. But we could fall back to software. Transfer to a kernel buffer and
then move stuff over. Not much of an improvment but it will make things
work.

> > The discussion here is a bit amusing since these issues have been
> > resolved a long time ago with the design of the RDMA subsystem. Zero
> > copy is already in wide use. Memory registration is used to pin down
> > memory areas. Work requests can be filed with the RDMA subsystem that
> > then send and receive packets from the registered memory regions.
> > This is not strictly remote memory access but this is a basic mode of
> > operations supported  by the RDMA subsystem. The mlx5 driver quoted
> > here supports all of that.
>
> I hear what you are saying.  I will look into a push-model, as it might
> be a better solution.
>  I will read up on RDMA + verbs and learn more about their API model.  I
> even plan to write a small sample program to get a feeling for the API,
> and maybe we can use that as a baseline for the performance target we
> can obtain on the same HW. (Thanks to Björn for already giving me some
> pointer here)

Great.

> > What is bad about RDMA is that it is a separate kernel subsystem.
> > What I would like to see is a deeper integration with the network
> > stack so that memory regions can be registred with a network socket
> > and work requests then can be submitted and processed that directly
> > read and write in these regions. The network stack should provide the
> > services that the hardware of the NIC does not suppport as usual.
>
> Interesting.  So you even imagine sockets registering memory regions
> with the NIC.  If we had a proper NIC HW filter API across the drivers,
> to register the steering rule (like ibv_create_flow), this would be
> doable, but we don't (DPDK actually have an interesting proposal[1])

Well doing this would mean adding some features and that also would at
best allow general support for zero copy direct to user space with a
fallback to software if the hardware is missing some feature.

> > The RX/TX ring in user space should be an additional mode of
> > operation of the socket layer. Once that is in place the "Remote
> > memory acces" can be trivially implemented on top of that and the
> > ugly RDMA sidecar subsystem can go away.
>
> I cannot follow that 100%, but I guess you are saying we also need a
> more efficient mode of handing over pages/packet to userspace (than
> going through the normal socket API calls).

A work request contains the user space address of the data to be sent
and/or received. The address must be in a registered memory region. This
is different from copying the packet into kernel data structures.

I think this can easily be generalized. We need support for registering
memory regions, submissions of work request and the processing of
completion requets. QP (queue-pair) processing is probably the basis for
the whole scheme that is used in multiple context these days.


Re: [PATCHv2 1/5] sh_eth: add generic wake-on-lan support via magic packet

2016-12-13 Thread Niklas Söderlund
Hi Geert,

Thanks for feedback and testing!

On 2016-12-13 14:03:52 +0100, Geert Uytterhoeven wrote:
> CC linux-pm

I think you forgot to CC linux-pm :-)

> 
> On Mon, Dec 12, 2016 at 5:09 PM, Niklas Söderlund
>  wrote:
> > Add generic functionality to support Wake-on-Lan using MagicPacket which
> > are supported by at least a few versions of sh_eth. Only add
> > functionality for WoL, no specific sh_eth version are marked to support
> > WoL yet.
> >
> > WoL is enabled in the suspend callback by setting MagicPacket detection
> > and disabling all interrupts expect MagicPacket. In the resume path the
> > driver needs to reset the hardware to rearm the WoL logic, this prevents
> > the driver from simply restoring the registers and to take advantage of
> > that sh_eth was not suspended to reduce resume time. To reset the
> > hardware the driver close and reopens the device just like it would do
> > in a normal suspend/resume scenario without WoL enabled, but it both
> > close and open the device in the resume callback since the device needs
> > to be open for WoL to work.
> >
> > One quirk needed for WoL is that the module clock needs to be prevented
> > from being switched off by Runtime PM. To keep the clock alive the
> > suspend callback need to call clk_enable() directly to increase the
> > usage count of the clock. Then when Runtime PM decreases the clock usage
> > count it won't reach 0 and be switched off.
> >
> > Signed-off-by: Niklas Söderlund 
> 
> Thanks for the update!
> 
> I've verified WoL is working on r8a7791/koelsch and r8a7740/armadillo.
> 
> However, if I look at /sys/kernel/debug/wakeup_sources, "active_count" and
> "event_count" for the Ethernet device do not increase when using WoL, while
> they do for the GPIO keyboard when using the keyboard for wake up.
> So something seems to be missing from a bookkeeping point of view.

Cool, now I know why some net drivers call pm_wakeup_event() if the 
wakeup source was WoL :-) I added this to sh_eth and now the bookkeeping 
seems to work as you describe, "active_count" and "event_count" are 
incremented while waking up from WoL. I will include this in the next 
version, thanks for reporting I had no idea to check for this.

> 
> Interestingly, "wakeup_count" does not increase for both?
> Has this been broken recently?

I had a quick look at this and the 'wakeup_count' is increased in 
wakeup_source_report_event() which is in the call path from 
pm_wakeup_event().

pm_wakeup_event()
  __pm_wakeup_event()
wakeup_source_report_event()

static void wakeup_source_report_event(struct wakeup_source *ws) 
{
ws->event_count++;
/* This is racy, but the counter is approximate anyway. */
if (events_check_enabled)
ws->wakeup_count++;

if (!ws->active)
wakeup_source_activate(ws);
}

So maybe 'wakeup_count' is not incremented due to the race with 
events_check_enabled? But then again I'm new to PM so I might miss 
something obvious. I'm also not sure if I can do anything in this series 
to improve the behavior of 'wakeup_count' for sh_eth?

> 
> > ---
> >  drivers/net/ethernet/renesas/sh_eth.c | 112 
> > +++---
> >  drivers/net/ethernet/renesas/sh_eth.h |   3 +
> >  2 files changed, 107 insertions(+), 8 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/renesas/sh_eth.c 
> > b/drivers/net/ethernet/renesas/sh_eth.c
> > index 05b0dc5..87640b9 100644
> > --- a/drivers/net/ethernet/renesas/sh_eth.c
> > +++ b/drivers/net/ethernet/renesas/sh_eth.c
> > @@ -2199,6 +2199,33 @@ static int sh_eth_set_ringparam(struct net_device 
> > *ndev,
> > return 0;
> >  }
> >
> > +static void sh_eth_get_wol(struct net_device *ndev, struct ethtool_wolinfo 
> > *wol)
> > +{
> > +   struct sh_eth_private *mdp = netdev_priv(ndev);
> > +
> > +   wol->supported = 0;
> > +   wol->wolopts = 0;
> > +
> > +   if (mdp->cd->magic && mdp->clk) {
> > +   wol->supported = WAKE_MAGIC;
> > +   wol->wolopts = mdp->wol_enabled ? WAKE_MAGIC : 0;
> > +   }
> > +}
> > +
> > +static int sh_eth_set_wol(struct net_device *ndev, struct ethtool_wolinfo 
> > *wol)
> > +{
> > +   struct sh_eth_private *mdp = netdev_priv(ndev);
> > +
> > +   if (!mdp->cd->magic || !mdp->clk || wol->wolopts & ~WAKE_MAGIC)
> > +   return -EOPNOTSUPP;
> > +
> > +   mdp->wol_enabled = !!(wol->wolopts & WAKE_MAGIC);
> > +
> > +   device_set_wakeup_enable(>pdev->dev, mdp->wol_enabled);
> > +
> > +   return 0;
> > +}
> > +
> >  static const struct ethtool_ops sh_eth_ethtool_ops = {
> > .get_regs_len   = sh_eth_get_regs_len,
> > .get_regs   = sh_eth_get_regs,
> > @@ -2213,6 +2240,8 @@ static const struct ethtool_ops sh_eth_ethtool_ops = {
> > .set_ringparam  = sh_eth_set_ringparam,
> > .get_link_ksettings = sh_eth_get_link_ksettings,
> >

Re: Designing a safe RX-zero-copy Memory Model for Networking

2016-12-13 Thread Jesper Dangaard Brouer

On Mon, 12 Dec 2016 12:06:59 -0600 (CST) Christoph Lameter  
wrote:
> On Mon, 12 Dec 2016, Jesper Dangaard Brouer wrote:
> 
> > Hmmm. If you can rely on hardware setup to give you steering and
> > dedicated access to the RX rings.  In those cases, I guess, the "push"
> > model could be a more direct API approach.  
> 
> If the hardware does not support steering then one should be able to
> provide those services in software.

This is the early demux problem.  With the push-mode of registering
memory, you need hardware steering support, for zero-copy support, as
the software step happens after DMA engine have written into the memory.

My model pre-VMA map all the pages in the RX ring (if zero-copy gets
enabled, by a single user).  The software step can filter and zero-copy
send packet-pages to the application/socket that requested this. The
disadvantage is all zero-copy application need to share this VMA
mapping.  This is solved by configuring HW filters into a RX-queue, and
then only attach your zero-copy application to that queue.


> > I was shooting for a model that worked without hardware support.
> > And then transparently benefit from HW support by configuring a HW
> > filter into a specific RX queue and attaching/using to that queue.  
> 
> The discussion here is a bit amusing since these issues have been
> resolved a long time ago with the design of the RDMA subsystem. Zero
> copy is already in wide use. Memory registration is used to pin down
> memory areas. Work requests can be filed with the RDMA subsystem that
> then send and receive packets from the registered memory regions.
> This is not strictly remote memory access but this is a basic mode of
> operations supported  by the RDMA subsystem. The mlx5 driver quoted
> here supports all of that.

I hear what you are saying.  I will look into a push-model, as it might
be a better solution.
 I will read up on RDMA + verbs and learn more about their API model.  I
even plan to write a small sample program to get a feeling for the API,
and maybe we can use that as a baseline for the performance target we
can obtain on the same HW. (Thanks to Björn for already giving me some
pointer here)


> What is bad about RDMA is that it is a separate kernel subsystem.
> What I would like to see is a deeper integration with the network
> stack so that memory regions can be registred with a network socket
> and work requests then can be submitted and processed that directly
> read and write in these regions. The network stack should provide the
> services that the hardware of the NIC does not suppport as usual.

Interesting.  So you even imagine sockets registering memory regions
with the NIC.  If we had a proper NIC HW filter API across the drivers,
to register the steering rule (like ibv_create_flow), this would be
doable, but we don't (DPDK actually have an interesting proposal[1])

 
> The RX/TX ring in user space should be an additional mode of
> operation of the socket layer. Once that is in place the "Remote
> memory acces" can be trivially implemented on top of that and the
> ugly RDMA sidecar subsystem can go away.
 
I cannot follow that 100%, but I guess you are saying we also need a
more efficient mode of handing over pages/packet to userspace (than
going through the normal socket API calls).


Appreciate your input, it challenged my thinking.
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

[1] https://rawgit.com/6WIND/rte_flow/master/rte_flow.html


Re: [PATCH net v3] ibmveth: set correct gso_size and gso_type

2016-12-13 Thread Thomas Falcon
On 12/10/2016 02:56 PM, David Miller wrote:
> From: Thomas Falcon 
> Date: Sat, 10 Dec 2016 12:39:48 -0600
>
>> v3: include a check for non-zero mss when calculating gso_segs
>>
>> v2: calculate gso_segs after Eric Dumazet's comments on the earlier patch
>> and make sure everyone is included on CC
> I already applied v1 which made it all the way even to Linus's
> tree.  So you'll have to send me relative fixups if there are
> things to fix or change since v1.
>
> You must always generate patches against the current 'net' tree.
>
Sorry about that.  Thank you for applying it on short notice.  I agree that 
using the TCP checksum field is not ideal, but it was a compromise with the 
VIOS team.  Hopefully, there will be a better way in the future.

Thanks again,

Tom 



Re: [PATCH net] virtio-net: correctly enable multiqueue

2016-12-13 Thread David Miller
From: Jason Wang 
Date: Tue, 13 Dec 2016 14:23:05 +0800

> Commit 4490001029012539937ff02778fe6180613fa949 ("virtio-net: enable
> multiqueue by default") blindly set the affinity instead of queues
> during probe which can cause a mismatch of #queues between guest and
> host. This patch fixes it by setting queues.
> 
> Reported-by: Theodore Ts'o 
> Tested-by: Theodore Ts'o 
> Cc: Neil Horman 
> Cc: Michael S. Tsirkin 
> Fixes: 49000102901 ("virtio-net: enable multiqueue by default")
> Signed-off-by: Jason Wang 

Applied, thanks Jason.


[PATCH/replace net-next 17/27] OVS: remove use of VLAN_TAG_PRESENT

2016-12-13 Thread Michał Mirosław
This is a minimal change to allow removing of VLAN_TAG_PRESENT.
It leaves OVS unable to use CFI bit, as fixing this would need
a deeper surgery involving userspace interface.

Signed-off-by: Michał Mirosław 
---
 net/openvswitch/actions.c  | 13 +
 net/openvswitch/flow.c |  2 +-
 net/openvswitch/flow.h |  2 +-
 net/openvswitch/flow_netlink.c | 22 +++---
 4 files changed, 22 insertions(+), 17 deletions(-)

diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
index 514f7bc..fb41833 100644
--- a/net/openvswitch/actions.c
+++ b/net/openvswitch/actions.c
@@ -278,7 +278,7 @@ static int push_vlan(struct sk_buff *skb, struct 
sw_flow_key *key,
key->eth.vlan.tpid = vlan->vlan_tpid;
}
return skb_vlan_push(skb, vlan->vlan_tpid,
-ntohs(vlan->vlan_tci) & ~VLAN_TAG_PRESENT);
+ntohs(vlan->vlan_tci) & ~VLAN_CFI_MASK);
 }
 
 /* 'src' is already properly masked. */
@@ -704,8 +704,10 @@ static int ovs_vport_output(struct net *net, struct sock 
*sk, struct sk_buff *sk
__skb_dst_copy(skb, data->dst);
*OVS_CB(skb) = data->cb;
skb->inner_protocol = data->inner_protocol;
-   skb->vlan_tci = data->vlan_tci;
-   skb->vlan_proto = data->vlan_proto;
+   if (data->vlan_tci & VLAN_CFI_MASK)
+   __vlan_hwaccel_put_tag(skb, data->vlan_proto, data->vlan_tci & 
~VLAN_CFI_MASK);
+   else
+   __vlan_hwaccel_clear_tag(skb);
 
/* Reconstruct the MAC header.  */
skb_push(skb, data->l2_len);
@@ -749,7 +751,10 @@ static void prepare_frag(struct vport *vport, struct 
sk_buff *skb,
data->cb = *OVS_CB(skb);
data->inner_protocol = skb->inner_protocol;
data->network_offset = orig_network_offset;
-   data->vlan_tci = skb->vlan_tci;
+   if (skb_vlan_tag_present(skb))
+   data->vlan_tci = skb_vlan_tag_get(skb) | VLAN_CFI_MASK;
+   else
+   data->vlan_tci = 0;
data->vlan_proto = skb->vlan_proto;
data->mac_proto = mac_proto;
data->l2_len = hlen;
diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
index 08aa926..2bd4eac 100644
--- a/net/openvswitch/flow.c
+++ b/net/openvswitch/flow.c
@@ -347,7 +347,7 @@ static int parse_vlan(struct sk_buff *skb, struct 
sw_flow_key *key)
int res;
 
if (skb_vlan_tag_present(skb)) {
-   key->eth.vlan.tci = htons(skb->vlan_tci);
+   key->eth.vlan.tci = htons(skb->vlan_tci) | htons(VLAN_CFI_MASK);
key->eth.vlan.tpid = skb->vlan_proto;
} else {
/* Parse outer vlan tag in the non-accelerated case. */
diff --git a/net/openvswitch/flow.h b/net/openvswitch/flow.h
index f61cae7..c8724ca 100644
--- a/net/openvswitch/flow.h
+++ b/net/openvswitch/flow.h
@@ -58,7 +58,7 @@ struct ovs_tunnel_info {
 
 struct vlan_head {
__be16 tpid; /* Vlan type. Generally 802.1q or 802.1ad.*/
-   __be16 tci;  /* 0 if no VLAN, VLAN_TAG_PRESENT set otherwise. */
+   __be16 tci;  /* 0 if no VLAN, VLAN_CFI_MASK set otherwise. */
 };
 
 #define OVS_SW_FLOW_KEY_METADATA_SIZE  \
diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
index d19044f..b72fcbd 100644
--- a/net/openvswitch/flow_netlink.c
+++ b/net/openvswitch/flow_netlink.c
@@ -853,9 +853,9 @@ static int validate_vlan_from_nlattrs(const struct 
sw_flow_match *match,
if (a[OVS_KEY_ATTR_VLAN])
tci = nla_get_be16(a[OVS_KEY_ATTR_VLAN]);
 
-   if (!(tci & htons(VLAN_TAG_PRESENT))) {
+   if (!(tci & htons(VLAN_CFI_MASK))) {
if (tci) {
-   OVS_NLERR(log, "%s TCI does not have VLAN_TAG_PRESENT 
bit set.",
+   OVS_NLERR(log, "%s TCI does not have VLAN_CFI_MASK bit 
set.",
  (inner) ? "C-VLAN" : "VLAN");
return -EINVAL;
} else if (nla_len(a[OVS_KEY_ATTR_ENCAP])) {
@@ -876,9 +876,9 @@ static int validate_vlan_mask_from_nlattrs(const struct 
sw_flow_match *match,
__be16 tci = 0;
__be16 tpid = 0;
bool encap_valid = !!(match->key->eth.vlan.tci &
- htons(VLAN_TAG_PRESENT));
+ htons(VLAN_CFI_MASK));
bool i_encap_valid = !!(match->key->eth.cvlan.tci &
-   htons(VLAN_TAG_PRESENT));
+   htons(VLAN_CFI_MASK));
 
if (!(key_attrs & (1 << OVS_KEY_ATTR_ENCAP))) {
/* Not a VLAN. */
@@ -902,8 +902,8 @@ static int validate_vlan_mask_from_nlattrs(const struct 
sw_flow_match *match,
  (inner) ? "C-VLAN" : "VLAN", ntohs(tpid));
return -EINVAL;
}
-   if (!(tci & htons(VLAN_TAG_PRESENT))) {
-   OVS_NLERR(log, "%s TCI mask does not have exact match for 
VLAN_TAG_PRESENT bit.",

Re: [PATCH v2 2/2] net: rfkill: Add rfkill-any LED trigger

2016-12-13 Thread Johannes Berg
On Thu, 2016-12-08 at 08:30 +0100, Michał Kępień wrote:
> Add a new "global" (i.e. not per-rfkill device) LED trigger, rfkill-
> any,
> which may be useful on laptops with a single "radio LED" and multiple
> radio transmitters.  The trigger is meant to turn a LED on whenever
> there is at least one radio transmitter active and turn it off
> otherwise.
> 

Also applied, but I moved the discussion of the mutex into the recorded
commit log.

johannes


Re: [PATCH v2 1/2] net: rfkill: Cleanup error handling in rfkill_init()

2016-12-13 Thread Johannes Berg
On Thu, 2016-12-08 at 08:30 +0100, Michał Kępień wrote:
> Use a separate label per error condition in rfkill_init() to make it
> a bit cleaner and easier to extend.

applied.

johannes


Re: [PATCH net-next 1/3] net:dsa:mv88e6xxx: use hashtable to store multicast entries

2016-12-13 Thread Andrew Lunn
> Another interesting thing is that fdb dump operation is slow
> too. The more entries you have the slower it gets. We have
> implemented caching with timing there as well (I have not submitted
> such patch yet), which makes fdb dump much more pleasant in user
> space, the operation happens instantly, where regular fdb dump op
> might take few seconds.

I know some people won't like the added MDIO transactions, because
they are doing PTP and want low jitter when talking to the switch and
PHYs. As things are now, once the system is configured, the MDIO bus
is idle. Hence there is very low jitter for PTP. For your cache to be
useful, i assume you are refreshing it at regular intervals. So you
are adding a constant load, which i guess 99.9% of the timer is never
used because there is no user looking at the table, where as the PTP
load is used to keep the clocks synchronised.

   Andrew


Re: [PATCH net-next 17/27] OVS: remove assumptions about VLAN_TAG_PRESENT bit

2016-12-13 Thread Michał Mirosław
On Tue, Dec 13, 2016 at 11:40:10AM +0100, Jiri Benc wrote:
> On Tue, 13 Dec 2016 01:12:38 +0100 (CET), Michał Mirosław wrote:
> > @@ -850,20 +848,11 @@ static int validate_vlan_from_nlattrs(const struct 
> > sw_flow_match *match,
> > return -EINVAL;
> > }
> >  
> > -   if (a[OVS_KEY_ATTR_VLAN])
> > -   tci = nla_get_be16(a[OVS_KEY_ATTR_VLAN]);
> > -
> > -   if (!(tci & htons(VLAN_TAG_PRESENT))) {
> > -   if (tci) {
> > -   OVS_NLERR(log, "%s TCI does not have VLAN_TAG_PRESENT 
> > bit set.",
> > - (inner) ? "C-VLAN" : "VLAN");
> > -   return -EINVAL;
> > -   } else if (nla_len(a[OVS_KEY_ATTR_ENCAP])) {
> > -   /* Corner case for truncated VLAN header. */
> > -   OVS_NLERR(log, "Truncated %s header has non-zero encap 
> > attribute.",
> > - (inner) ? "C-VLAN" : "VLAN");
> > -   return -EINVAL;
> > -   }
> > +   if (!a[OVS_KEY_ATTR_VLAN] && nla_len(a[OVS_KEY_ATTR_ENCAP])) {
> > +   /* Corner case for truncated VLAN header. */
> > +   OVS_NLERR(log, "Truncated %s header has non-zero encap 
> > attribute.",
> > +   (inner) ? "C-VLAN" : "VLAN");
> > +   return -EINVAL;
> 
> This looks wrong. The zero value of nla_get_be16(a[OVS_KEY_ATTR_VLAN])
> together with empty a[OVS_KEY_ATTR_ENCAP] means truncated VLAN header
> and this needs to be preserved.
> 
> In other words, you need to check also !nla_get_be16(a[OVS_KEY_ATTR_VLAN])
> here.
> 
> 
> Overall, this whole patch looks very suspicious from the very
> beginning. The xors used are strong indication that something is not
> right.
> 
> And indeed, you're breaking uAPI compatibility. Previously, the
> VLAG_TAG_PRESENT bit set in OVS_KEY_ATTR_VLAN caused the flow to match
> on packets with VLAN tag present. After this patch, it causes the flow
> to match only on those packets that have a certain value of
> VLAN_CFI_MASK in their VLAN tag (I haven't bother deciphering what
> value is checked after all these xors, it's as well possible that it
> will also match on packets _without_ any VLAN header).
> 
> You have to preserve the old meaning of VLAN_TAG_PRESENT in
> OVS_KEY_ATTR_VLAN. When doing flow lookups, VLAN_TAG_PRESENT must match
> on and only on packets that have VLAN tag present, irrespective of their
> VLAN_CFI_MASK.
> 
> If you want to introduce support for lookups on VLAN_CFI_MASK, you'll
> have to do that by introducing a new netlink attribute.

Hmm. In that case I'll just mask the CFI bit on entry to OVS.
Otherwise it's for me like to stab in a dark at a large beast
I know nothing about.

Best Regards,
Michał Mirosław


Re: [PATCH net-next 13/27] bridge: use __vlan_hwaccel helpers

2016-12-13 Thread Michał Mirosław
On Tue, Dec 13, 2016 at 03:59:46PM +0300, Sergei Shtylyov wrote:
> Hello!
> 
> On 12/13/2016 3:12 AM, Michał Mirosław wrote:
> 
> > This removes assumption than vlan_tci != 0 when tag is present.
> > 
> > Signed-off-by: Michał Mirosław 
> > ---
> >  net/bridge/br_netfilter_hooks.c | 14 --
> >  net/bridge/br_private.h |  2 +-
> >  net/bridge/br_vlan.c|  6 +++---
> >  3 files changed, 12 insertions(+), 10 deletions(-)
> > 
> > diff --git a/net/bridge/br_netfilter_hooks.c 
> > b/net/bridge/br_netfilter_hooks.c
> > index b12501a..2cc0747 100644
> > --- a/net/bridge/br_netfilter_hooks.c
> > +++ b/net/bridge/br_netfilter_hooks.c
> [...]
> > @@ -749,8 +747,12 @@ static int br_nf_dev_queue_xmit(struct net *net, 
> > struct sock *sk, struct sk_buff
> > 
> > data = this_cpu_ptr(_frag_data_storage);
> > 
> > -   data->vlan_tci = skb->vlan_tci;
> > -   data->vlan_proto = skb->vlan_proto;
> > +   if (skb_vlan_tag_present(skb)) {
> > +   data->vlan_tci = skb->vlan_tci;
> > +   data->vlan_proto = skb->vlan_proto;
> > +   } else
> > +   data->vlan_proto = 0;
> 
>CodingStyle: should use {} in all branches of *if* if at least one branch
> has {}.
> 
> [...]
> > diff --git a/net/bridge/br_vlan.c b/net/bridge/br_vlan.c
> > index b6de4f4..ef94664 100644
> > --- a/net/bridge/br_vlan.c
> > +++ b/net/bridge/br_vlan.c
> 
> > @@ -444,8 +444,8 @@ static bool __allowed_ingress(const struct net_bridge 
> > *br,
> > __vlan_hwaccel_put_tag(skb, br->vlan_proto, pvid);
> > else
> > /* Priority-tagged Frame.
> > -* At this point, We know that skb->vlan_tci had
> > -* VLAN_TAG_PRESENT bit and its VID field was 0x000.
> > +* At this point, We know that skb->vlan_tci VID
> 
>s/We/we/.
> 
> > +* field was 0x000.
> 
>Simply 0, maybe?

Thanks, fixed.
-- Michał Mirosław


[RFC PATCH v3] audit: use proper refcount locking on audit_sock

2016-12-13 Thread Richard Guy Briggs
Resetting audit_sock appears to be racy.

audit_sock was being copied and dereferenced without using a refcount on
the source sock.

Bump the refcount on the underlying sock when we store a refrence in
audit_sock and release it when we reset audit_sock.  audit_sock
modification needs the audit_cmd_mutex.

See: https://lkml.org/lkml/2016/11/26/232

Thanks to Eric Dumazet  and Cong Wang
 on ideas how to fix it.

Signed-off-by: Richard Guy Briggs 
---
There has been a lot of change in the audit code that is about to go
upstream to address audit queue issues.  This patch is based on the
source tree: git://git.infradead.org/users/pcmoore/audit#next
---
 kernel/audit.c |   28 +++-
 1 files changed, 23 insertions(+), 5 deletions(-)

diff --git a/kernel/audit.c b/kernel/audit.c
index f20eee0..3bb4126 100644
--- a/kernel/audit.c
+++ b/kernel/audit.c
@@ -446,14 +446,19 @@ static void kauditd_retry_skb(struct sk_buff *skb)
  * Description:
  * Break the auditd/kauditd connection and move all the records in the retry
  * queue into the hold queue in case auditd reconnects.
+ * The audit_cmd_mutex must be held when calling this function.
  */
 static void auditd_reset(void)
 {
struct sk_buff *skb;
 
/* break the connection */
+   if (audit_sock) {
+   sock_put(audit_sock);
+   audit_sock = NULL;
+   }
audit_pid = 0;
-   audit_sock = NULL;
+   audit_nlk_portid = 0;
 
/* flush all of the retry queue to the hold queue */
while ((skb = skb_dequeue(_retry_queue)))
@@ -579,7 +584,9 @@ static int kauditd_thread(void *dummy)
 
auditd = 0;
if (AUDITD_BAD(rc, reschedule)) {
+   mutex_lock(_cmd_mutex);
auditd_reset();
+   mutex_unlock(_cmd_mutex);
reschedule = 0;
}
} else
@@ -594,7 +601,9 @@ static int kauditd_thread(void *dummy)
auditd = 0;
if (AUDITD_BAD(rc, reschedule)) {
kauditd_hold_skb(skb);
+   mutex_lock(_cmd_mutex);
auditd_reset();
+   mutex_unlock(_cmd_mutex);
reschedule = 0;
} else
/* temporary problem (we hope), queue
@@ -623,7 +632,9 @@ quick_loop:
if (rc) {
auditd = 0;
if (AUDITD_BAD(rc, reschedule)) {
+   mutex_lock(_cmd_mutex);
auditd_reset();
+   mutex_unlock(_cmd_mutex);
reschedule = 0;
}
 
@@ -1010,11 +1021,16 @@ static int audit_receive_msg(struct sk_buff *skb, 
struct nlmsghdr *nlh)
}
if (audit_enabled != AUDIT_OFF)
audit_log_config_change("audit_pid", new_pid, 
audit_pid, 1);
-   audit_pid = new_pid;
-   audit_nlk_portid = NETLINK_CB(skb).portid;
-   audit_sock = skb->sk;
-   if (!new_pid)
+   if (new_pid) {
+   if (audit_sock)
+   sock_put(audit_sock);
+   audit_pid = new_pid;
+   audit_nlk_portid = NETLINK_CB(skb).portid;
+   sock_hold(skb->sk);
+   audit_sock = skb->sk;
+   } else {
auditd_reset();
+   }
wake_up_interruptible(_wait);
}
if (s.mask & AUDIT_STATUS_RATE_LIMIT) {
@@ -1283,8 +1299,10 @@ static void __net_exit audit_net_exit(struct net *net)
 {
struct audit_net *aunet = net_generic(net, audit_net_id);
struct sock *sock = aunet->nlsk;
+   mutex_lock(_cmd_mutex);
if (sock == audit_sock)
auditd_reset();
+   mutex_unlock(_cmd_mutex);
 
RCU_INIT_POINTER(aunet->nlsk, NULL);
synchronize_net();
-- 
1.7.1



Re: [PATCH v2] audit: use proper refcount locking on audit_sock

2016-12-13 Thread Richard Guy Briggs
On 2016-12-13 00:10, Richard Guy Briggs wrote:
> On 2016-12-12 15:18, Paul Moore wrote:
> > On Mon, Dec 12, 2016 at 5:03 AM, Richard Guy Briggs  wrote:
> > > Resetting audit_sock appears to be racy.
> > >
> > > audit_sock was being copied and dereferenced without using a refcount on
> > > the source sock.
> > >
> > > Bump the refcount on the underlying sock when we store a refrence in
> > > audit_sock and release it when we reset audit_sock.  audit_sock
> > > modification needs the audit_cmd_mutex.
> > >
> > > See: https://lkml.org/lkml/2016/11/26/232
> > >
> > > Thanks to Eric Dumazet  and Cong Wang
> > >  on ideas how to fix it.
> > >
> > > Signed-off-by: Richard Guy Briggs 
> > > ---
> > > There has been a lot of change in the audit code that is about to go
> > > upstream to address audit queue issues.  This patch is based on the
> > > source tree: git://git.infradead.org/users/pcmoore/audit#next
> > > ---
> > >  kernel/audit.c |   34 --
> > >  1 files changed, 28 insertions(+), 6 deletions(-)
> > > diff --git a/kernel/audit.c b/kernel/audit.c
> > > index f20eee0..439f7f3 100644
> > > --- a/kernel/audit.c
> > > +++ b/kernel/audit.c
> > > @@ -452,7 +452,9 @@ static void auditd_reset(void)
> > > struct sk_buff *skb;
> > >
> > > /* break the connection */
> > > +   sock_put(audit_sock);
> > > audit_pid = 0;
> > > +   audit_nlk_portid = 0;
> > > audit_sock = NULL;
> > >
> > > /* flush all of the retry queue to the hold queue */
> > > @@ -478,6 +480,12 @@ static int kauditd_send_unicast_skb(struct sk_buff 
> > > *skb)
> > > if (rc >= 0) {
> > > consume_skb(skb);
> > > rc = 0;
> > > +   } else {
> > > +   if (rc & (-ENOMEM|-EPERM|-ECONNREFUSED)) {
> > 
> > I dislike the way you wrote this because instead of simply looking at
> > this to see if it correct I need to sort out all the bits and find out
> > if there are other error codes that could run afoul of this check ...
> > make it simple, e.g. (rc == -ENOMEM || rc == -EPERM || ...).
> > Actually, since EPERM is 1, -EPERM (-1 in two's compliment is
> > 0x) is going to cause this to be true for pretty much any
> > value of rc, yes?
> 
> Yes, you are correct.  We need there a logical or on the results of each
> comparison to the return code rather than bit-wise or-ing the result
> codes together first to save a step.
> 
> > > +   mutex_lock(_cmd_mutex);
> > > +   auditd_reset();
> > > +   mutex_unlock(_cmd_mutex);
> > > +   }
> > 
> > The code in audit#next handles netlink_unicast() errors in
> > kauditd_thread() and you are adding error handling code here in
> > kauditd_send_unicast_skb() ... that's messy.  I don't care too much
> > where the auditd_reset() call is made, but let's only do it in one
> > function; FWIW, I originally put the error handling code in
> > kauditd_thread() because there was other error handling code that
> > needed to done in that scope so it resulted in cleaner code.
> 
> Hmmm, I seem to remember it not returning the return code and I thought
> I had changed it to do so, but I see now that it was already there.
> Agreed, I needlessly duplicated that error handling.
> 
> > Related, I see you are now considering ENOMEM to be a fatal condition,
> > that differs from the AUDITD_BAD macro in kauditd_thread(); this
> > difference needs to be reconciled.
> 
> Also correct about -EPERM now that I check back to the intent of commit
> 32a1dbaece7e ("audit: try harder to send to auditd upon netlink
> failure")
> 
> > Finally, you should update the comment header block for auditd_reset()
> > that it needs to be called with the audit_cmd_mutex held.
> 
> Yup.
> 
> > > @@ -1004,17 +1018,22 @@ static int audit_receive_msg(struct sk_buff *skb, 
> > > struct nlmsghdr *nlh)
> > > return -EACCES;
> > > }
> > > if (audit_pid && new_pid &&
> > > -   audit_replace(requesting_pid) != 
> > > -ECONNREFUSED) {
> > > +   (audit_replace(requesting_pid) & 
> > > (-ECONNREFUSED|-EPERM|-ENOMEM))) {
> > 
> > Do we simply want to treat any error here as fatal, and not just
> > ECONN/EPERM/ENOMEM?  If not, let's come up with a single macro to
> > handle the fatal netlink_unicast() return codes so we have some chance
> > to keep things consistent in the future.
> 
> I'll work through this before I post another patch...

Ok, I've gone back to look at the reasoning in commit 133e1e5acd4a
("audit: stop an old auditd being starved out by a new auditd") which
suggests only ECONNREFUSED can cause an audit_pid replace, so I've
returned it to its original state.

I'll post another tested patch, but I'm still not that happy that it
does not proactively reset 

Re: [PATCH v2] audit: use proper refcount locking on audit_sock

2016-12-13 Thread Richard Guy Briggs
On 2016-12-12 15:58, Cong Wang wrote:
> On Mon, Dec 12, 2016 at 2:03 AM, Richard Guy Briggs  wrote:
> > Resetting audit_sock appears to be racy.
> >
> > audit_sock was being copied and dereferenced without using a refcount on
> > the source sock.
> >
> > Bump the refcount on the underlying sock when we store a refrence in
> > audit_sock and release it when we reset audit_sock.  audit_sock
> > modification needs the audit_cmd_mutex.
> >
> > See: https://lkml.org/lkml/2016/11/26/232
> >
> > Thanks to Eric Dumazet  and Cong Wang
> >  on ideas how to fix it.
> >
> > Signed-off-by: Richard Guy Briggs 
> > ---
> > There has been a lot of change in the audit code that is about to go
> > upstream to address audit queue issues.  This patch is based on the
> > source tree: git://git.infradead.org/users/pcmoore/audit#next
> > ---
> >  kernel/audit.c |   34 --
> >  1 files changed, 28 insertions(+), 6 deletions(-)
> >
> > diff --git a/kernel/audit.c b/kernel/audit.c
> > index f20eee0..439f7f3 100644
> > --- a/kernel/audit.c
> > +++ b/kernel/audit.c
> > @@ -452,7 +452,9 @@ static void auditd_reset(void)
> > struct sk_buff *skb;
> >
> > /* break the connection */
> > +   sock_put(audit_sock);
> 
> Why audit_sock can't be NULL here?

Fixed.

> > audit_pid = 0;
> > +   audit_nlk_portid = 0;
> > audit_sock = NULL;
> >
> > /* flush all of the retry queue to the hold queue */
> > @@ -478,6 +480,12 @@ static int kauditd_send_unicast_skb(struct sk_buff 
> > *skb)
> > if (rc >= 0) {
> > consume_skb(skb);
> > rc = 0;
> > +   } else {
> > +   if (rc & (-ENOMEM|-EPERM|-ECONNREFUSED)) {
> 
> Are these errno's bits??

No, I've fixed this silly error.

> > +   mutex_lock(_cmd_mutex);
> > +   auditd_reset();
> > +   mutex_unlock(_cmd_mutex);
> > +   }
> > }
> >
> > return rc;
> > @@ -579,7 +587,9 @@ static int kauditd_thread(void *dummy)
> >
> > auditd = 0;
> > if (AUDITD_BAD(rc, reschedule)) {
> > +   mutex_lock(_cmd_mutex);
> > auditd_reset();
> > +   mutex_unlock(_cmd_mutex);
> > reschedule = 0;
> > }
> > } else
> > @@ -594,7 +604,9 @@ static int kauditd_thread(void *dummy)
> > auditd = 0;
> > if (AUDITD_BAD(rc, reschedule)) {
> > kauditd_hold_skb(skb);
> > +   mutex_lock(_cmd_mutex);
> > auditd_reset();
> > +   mutex_unlock(_cmd_mutex);
> > reschedule = 0;
> > } else
> > /* temporary problem (we hope), 
> > queue
> > @@ -623,7 +635,9 @@ quick_loop:
> > if (rc) {
> > auditd = 0;
> > if (AUDITD_BAD(rc, reschedule)) {
> > +   
> > mutex_lock(_cmd_mutex);
> > auditd_reset();
> > +   
> > mutex_unlock(_cmd_mutex);
> > reschedule = 0;
> > }
> >
> > @@ -1004,17 +1018,22 @@ static int audit_receive_msg(struct sk_buff *skb, 
> > struct nlmsghdr *nlh)
> > return -EACCES;
> > }
> > if (audit_pid && new_pid &&
> > -   audit_replace(requesting_pid) != -ECONNREFUSED) 
> > {
> > +   (audit_replace(requesting_pid) & 
> > (-ECONNREFUSED|-EPERM|-ENOMEM))) {
> > audit_log_config_change("audit_pid", 
> > new_pid, audit_pid, 0);
> > return -EEXIST;
> > }
> > if (audit_enabled != AUDIT_OFF)
> > audit_log_config_change("audit_pid", 
> > new_pid, audit_pid, 1);
> > -   audit_pid = new_pid;
> > -   audit_nlk_portid = NETLINK_CB(skb).portid;
> > -   audit_sock = skb->sk;
> > -   if (!new_pid)
> > +   if (new_pid) {
> > +   if (audit_sock)
> > +   sock_put(audit_sock);
> > +   

Re: mlx5: net_device.addr_list_lock usage before initialization

2016-12-13 Thread Saeed Mahameed
On Tue, Dec 13, 2016 at 3:22 PM, Sebastian Ott
 wrote:
> Hi,
>
> I ran into the following lockdep complaint:
>
> [7.059561] INFO: trying to register non-static key.
> [7.059566] the code is fine but needs lockdep annotation.
> [7.059570] turning off the locking correctness validator.
> [7.059579] CPU: 6 PID: 6 Comm: kworker/u32:0 Not tainted 
> 4.9.0-02683-g784243e-dirty #77
> [7.059582] Hardware name: IBM  2964 N96  704  
> (LPAR)
> [7.061260] Workqueue: mlx5e mlx5e_set_rx_mode_work [mlx5_core]
> [7.061268] Stack:
> [7.061270]f95739c0 f9573a50 0003 
> 
> [7.061278]f9573af0 f9573a68 f9573a68 
> 0020
> [7.061286] 0020 000a 
> 000a
> [7.061294]000c f9573ab8  
> 
> [7.061301]008a1038 00112a50 f9573a50 
> f9573aa8
> [7.061314] Call Trace:
> [7.061321] ([<0011292a>] show_trace+0x8a/0xe0)
> [7.061327]  [<00112a00>] show_stack+0x80/0xd8
> [7.061334]  [<005cdce6>] dump_stack+0x96/0xd8
> [7.061338]  [<001ae352>] register_lock_class+0x1d2/0x530
> [7.061341]  [<001b33f6>] __lock_acquire+0xfe/0x7d8
> [7.061345]  [<001b4394>] lock_acquire+0x30c/0x358
> [7.061352]  [<0089454c>] _raw_spin_lock_bh+0x64/0xa0
> [7.062171]  [<03ff81465858>] mlx5e_set_rx_mode_work+0x248/0x490 
> [mlx5_core]
> [7.062178]  [<00163864>] process_one_work+0x41c/0x830
> [7.062181]  [<00163f2c>] worker_thread+0x2b4/0x478
> [7.062186]  [<0016c46c>] kthread+0x15c/0x170
> [7.062190]  [<00895a52>] kernel_thread_starter+0x6/0xc
> [7.062193]  [<00895a4c>] kernel_thread_starter+0x0/0xc
> [7.062196] INFO: lockdep is turned off.
>
> The problematic lock is net_device.addr_list_lock whose usage is
> asynchronously triggered by:
>
> mlx5e_add -> mlx5e_attach -> mlx5e_attach_netdev -> mlx5e_nic_enable
> [workq] mlx5e_set_rx_mode_work -> mlx5e_handle_netdev_addr -> 
> mlx5e_sync_netdev_addr
>
> Initialization of this lock is triggered by:
> mlx5e_add -> register_netdev
>
> ...after the call to mlx5e_attach which is obviously racy.
>

Thanks Sebastian for the report,

indeed there is an issue, I wonder why the net_device.addr_list_lock
is initialized so late (at register_netdevice) IMHO it should be
initialized at
alloc_netdev_mqs->dev_addr_init
where all the other net_device fields are initialized!

We will handle this.

Thanks,
Saeed.


RE: [PATCH 3/3] netns: fix net_generic() "id - 1" bloat

2016-12-13 Thread David Laight
From: Alexey Dobriyan
> Sent: 13 December 2016 14:23
...
> Well, the point of the patch is to save .text, so might as well save
> as much as possible. Any form other than "ptr[id]" is going
> to be either bigger or bigger and slower and "ptr" should be the first field.

You've not read and understood the next bit:

> > However if you offset the 'id' values so that only
> > values 2 up are valid the code becomes:
> > return net->gen2->ptr[id - 2];
> > which will be exactly the same code as:
> > return net->gen1->ptr[id];
> > but it is much more obvious that 'id' values must be >= 2.
> >
> > The '2' should be generated from the structure offset, but with my method
> > is doesn't actually matter if it is wrong.

If you have foo->bar[id - const] then the compiler has to add the
offset of 'bar' and subtract for 'const'.
If the numbers match no add or subtract is needed.

It is much cleaner to do this by explicitly removing the offset on the
accesses than using a union.

David



ATENCIÓN;

2016-12-13 Thread Administrador
ATENCIÓN;

Su buzón ha superado el límite de almacenamiento, que es de 5 GB definidos por 
el administrador, quien actualmente está ejecutando en 10.9GB, no puede ser 
capaz de enviar o recibir correo nuevo hasta que vuelva a validar su buzón de 
correo electrónico. Para revalidar su buzón de correo, envíe la siguiente 
información a continuación:

nombre:
Nombre de usuario: 
contraseña:
Confirmar contraseña:
E-mail: 
teléfono:
Si usted no puede revalidar su buzón, el buzón se deshabilitará!

Disculpa las molestias.
Código de verificación: es: 006524
Correo Soporte Técnico © 2016

¡gracias
Sistemas administrador


[v1] net:ethernet:cavium:octeon:octeon_mgmt: Handle return NULL error from devm_ioremap

2016-12-13 Thread Arvind Yadav
Here, If devm_ioremap will fail. It will return NULL.
Kernel can run into a NULL-pointer dereference.
This error check will avoid NULL pointer dereference.

Signed-off-by: Arvind Yadav 
---
 drivers/net/ethernet/cavium/octeon/octeon_mgmt.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/net/ethernet/cavium/octeon/octeon_mgmt.c 
b/drivers/net/ethernet/cavium/octeon/octeon_mgmt.c
index 4ab404f..4d9528f 100644
--- a/drivers/net/ethernet/cavium/octeon/octeon_mgmt.c
+++ b/drivers/net/ethernet/cavium/octeon/octeon_mgmt.c
@@ -1479,6 +1479,12 @@ static int octeon_mgmt_probe(struct platform_device 
*pdev)
p->agl = (u64)devm_ioremap(>dev, p->agl_phys, p->agl_size);
p->agl_prt_ctl = (u64)devm_ioremap(>dev, p->agl_prt_ctl_phys,
   p->agl_prt_ctl_size);
+   if (!p->mix || !p->agl || !p->agl_prt_ctl) {
+   dev_err(dev, "failed to map I/O memory\n");
+   result = -ENOMEM;
+   goto err;
+   }
+
spin_lock_init(>lock);
 
skb_queue_head_init(>tx_list);
-- 
2.7.4



Re: Soft lockup in tc_classify

2016-12-13 Thread Shahar Klein



On 12/12/2016 9:07 PM, Cong Wang wrote:

On Mon, Dec 12, 2016 at 8:04 AM, Shahar Klein  wrote:



On 12/12/2016 3:28 PM, Daniel Borkmann wrote:


Hi Shahar,

On 12/12/2016 10:43 AM, Shahar Klein wrote:


Hi All,

sorry for the spam, the first time was sent with html part and was
rejected.

We observed an issue where a classifier instance next member is
pointing back to itself, causing a CPU soft lockup.
We found it by running traffic on many udp connections and then adding
a new flower rule using tc.

We added a quick workaround to verify it:

In tc_classify:

 for (; tp; tp = rcu_dereference_bh(tp->next)) {
 int err;
+   if (tp == tp->next)
+ RCU_INIT_POINTER(tp->next, NULL);


We also had a print here showing tp->next is pointing to tp. With this
workaround we are not hitting the issue anymore.
We are not sure we fully understand the mechanism here - with the rtnl
and rcu locks.
We'll appreciate your help solving this issue.



Note that there's still the RCU fix missing for the deletion race that
Cong will still send out, but you say that the only thing you do is to
add a single rule, but no other operation in involved during that test?


Hmm, I thought RCU_INIT_POINTER() respects readers, but seems no?
If so, that could be the cause since we play with the next pointer and
there is only one filter in this case, but I don't see why we could have
a loop here.



Do you have a script and kernel .config for reproducing this?



I'm using a user space socket app(https://github.com/shahar-klein/noodle)on
a vm to push udp packets from ~2000 different udp src ports ramping up at
~100 per second towards another vm on the same Hypervisor. Once the traffic
starts I'm pushing ingress flower tc udp rules(even_udp_src_port->mirred,
odd->drop) on the relevant representor in the Hypervisor.


Do you mind to share your `tc filter show dev...` output? Also, since you
mentioned you only add one flower filter, just want to make sure you never
delete any filter before/when the bug happens? How reproducible is this?



The bridge between the two vms is based on ovs and representors.
We have a dpif in the ovs creating tc rules from ovs rules.
We set up 5000 open flow rules looks like this:
cook.., udp,dl_dst=24:8a:07:38:a2:b2,tp_src=7000 actions=drop
cook.., udp,dl_dst=24:8a:07:38:a2:b2,tp_src=7002 actions=drop
cook.., udp,dl_dst=24:8a:07:38:a2:b2,tp_src=7004 actions=drop
.
.
.

and then fire up 2000 udp flows starting at udp src 7000 and ramping up 
at 100 flows per second so after 20 seconds we suppose to have 2000 
active udp flows and half of them are dropped at the tc level.


The first packet of any such match hits the miss rule in the ovs 
datapath and pushed up to the user space ovs which consult the open 
flows rules above and translate the ovs rule to tc rule and push the 
rule back to the kernel via netlink.
I'm not sure I understand what happens to the second packet of the same 
match or all the following packets in the same match till the tc 
datapath is 'ready' for them.


The soft lockup is easily reproducible using this scenario but it won't 
happen if we use a much more easy traffic scheme first, say 100 udp 
flows at 3 per second.


I added a print and a panic when hitting the loop(output attached)

Also attached our .config




Thanks!



.config.gz
Description: application/gzip


tc_classify_panic.gz
Description: application/gzip


Re: [PATCH 3/3] netns: fix net_generic() "id - 1" bloat

2016-12-13 Thread Alexey Dobriyan
On Wed, Dec 7, 2016 at 1:49 PM, David Laight  wrote:
> From: Alexey Dobriyan
>> Sent: 05 December 2016 14:48
>> On Mon, Dec 5, 2016 at 3:49 PM, David Laight  wrote:
>> > From: Alexey Dobriyan
>> >> Sent: 02 December 2016 01:22
>> >> net_generic() function is both a) inline and b) used ~600 times.
>> >>
>> >> It has the following code inside
>> >>
>> >>   ...
>> >>   ptr = ng->ptr[id - 1];
>> >>   ...
>> >>
>> >> "id" is never compile time constant so compiler is forced to subtract 1.
>> >> And those decrements or LEA [r32 - 1] instructions add up.
>> >>
>> >> We also start id'ing from 1 to catch bugs where pernet sybsystem id
>> >> is not initialized and 0. This is quite pointless idea (nothing will
>> >> work or immediate interference with first registered subsystem) in
>> >> general but it hints what needs to be done for code size reduction.
>> >>
>> >> Namely, overlaying allocation of pointer array and fixed part of
>> >> structure in the beginning and using usual base-0 addressing.
>> >>
>> >> Ids are just cookies, their exact values do not matter, so lets start
>> >> with 3 on x86_64.
>> > ...
>> >>  struct net_generic {
>> >> - struct {
>> >> - unsigned int len;
>> >> - struct rcu_head rcu;
>> >> - } s;
>> >> -
>> >> - void *ptr[0];
>> >> + union {
>> >> + struct {
>> >> + unsigned int len;
>> >> + struct rcu_head rcu;
>> >> + } s;
>> >> +
>> >> + void *ptr[0];
>> >> + };
>> >>  };
>> >
>> > That union is an accident waiting to happen.
>>
>> I kind of disagree. Module authors should not be given matches,
>> but it is hard to screw up if net_generic() is all you're given.
>>
>> > What might work is to offset the Ids by
>> > (offsetof(struct net_generic, ptr)/sizeof (void *)) instead of by 1.
>> > The subtract from the offset will then counter the structure offset
>> > - which is what you are trying to achieve.
>>
>> If you suggest this layout
>>
>> struct net_generic {
>> struct {
>> } s;
>> void *ptr[0];
>> };
>>
>> then is it not optimal because offset of "ptr" needs to be somewhere in code
>> either in some LEA or imm8 of the final MOV which is 1 byte more bloaty.
>>
>> Here is test program
>>
>> struct ng1 {
>> union {
>> struct {
>> unsigned int len;
>> } s;
>> void *ptr[0];
>> };
>> };
>> struct ng2 {
>> struct {
>> unsigned int len;
>> } s;
>> void *ptr[0];
>> };
>> struct net {
>> int x;
>> struct ng1 *gen1;
>> struct ng2 *gen2;
>> };
>> void *ng1(const struct net *net, unsigned int id)
>> {
>> return net->gen1->ptr[id];
>> }
>> void *ng2(const struct net *net, unsigned int id)
>> {
>> return net->gen2->ptr[id];
>> }
>>
>>
>>  :
>>0:   48 8b 47 08 movrax,QWORD PTR [rdi+0x8]
>>4:   89 f6   movesi,esi
>>6:   48 8b 04 f0 movrax,QWORD PTR [rax+rsi*8]
>>a:   c3  ret
>>
>>
>> 0010 :
>>   10:   48 8b 47 10 movrax,QWORD PTR [rdi+0x10]
>>   14:   89 f6   movesi,esi
>>   16:   48 8b 44 f0 [[[08]]]  movrax,QWORD PTR [rax+rsi*8+0x8]
>>   1b:   c3  ret
>
> On x86 that will make ~0 difference since the offset (in that sequence)
> doesn't require an extra instruction.

Well, the point of the patch is to save .text, so might as well save
as much as possible. Any form other than "ptr[id]" is going
to be either bigger or bigger and slower and "ptr" should be the first field.

> However if you offset the 'id' values so that only
> values 2 up are valid the code becomes:
> return net->gen2->ptr[id - 2];
> which will be exactly the same code as:
> return net->gen1->ptr[id];
> but it is much more obvious that 'id' values must be >= 2.
>
> The '2' should be generated from the structure offset, but with my method
> is doesn't actually matter if it is wrong.


Re: [PATCH net] virtio-net: correctly enable multiqueue

2016-12-13 Thread Michael S. Tsirkin
On Tue, Dec 13, 2016 at 02:23:05PM +0800, Jason Wang wrote:
> Commit 4490001029012539937ff02778fe6180613fa949 ("virtio-net: enable
> multiqueue by default") blindly set the affinity instead of queues
> during probe which can cause a mismatch of #queues between guest and
> host. This patch fixes it by setting queues.
> 
> Reported-by: Theodore Ts'o 
> Tested-by: Theodore Ts'o 
> Cc: Neil Horman 
> Cc: Michael S. Tsirkin 
> Fixes: 49000102901 ("virtio-net: enable multiqueue by default")
> Signed-off-by: Jason Wang 

Acked-by: Michael S. Tsirkin 

> ---
>  drivers/net/virtio_net.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index b425fa1..fe9f772 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -1930,7 +1930,9 @@ static int virtnet_probe(struct virtio_device *vdev)
>   goto free_unregister_netdev;
>   }
>  
> - virtnet_set_affinity(vi);
> + rtnl_lock();
> + virtnet_set_queues(vi, vi->curr_queue_pairs);
> + rtnl_unlock();
>  
>   /* Assume link up if device can't report link status,
>  otherwise get link status from config. */

I note that virtnet_set_channels also plays with affinity
directly. Can this be changed to rely on cpu notifiers
somehow?


> -- 
> 2.7.4


mlx5: net_device.addr_list_lock usage before initialization

2016-12-13 Thread Sebastian Ott
Hi,

I ran into the following lockdep complaint:

[7.059561] INFO: trying to register non-static key.
[7.059566] the code is fine but needs lockdep annotation.
[7.059570] turning off the locking correctness validator.
[7.059579] CPU: 6 PID: 6 Comm: kworker/u32:0 Not tainted 
4.9.0-02683-g784243e-dirty #77
[7.059582] Hardware name: IBM  2964 N96  704
  (LPAR)
[7.061260] Workqueue: mlx5e mlx5e_set_rx_mode_work [mlx5_core]
[7.061268] Stack:
[7.061270]f95739c0 f9573a50 0003 

[7.061278]f9573af0 f9573a68 f9573a68 
0020
[7.061286] 0020 000a 
000a
[7.061294]000c f9573ab8  

[7.061301]008a1038 00112a50 f9573a50 
f9573aa8
[7.061314] Call Trace:
[7.061321] ([<0011292a>] show_trace+0x8a/0xe0)
[7.061327]  [<00112a00>] show_stack+0x80/0xd8
[7.061334]  [<005cdce6>] dump_stack+0x96/0xd8
[7.061338]  [<001ae352>] register_lock_class+0x1d2/0x530
[7.061341]  [<001b33f6>] __lock_acquire+0xfe/0x7d8
[7.061345]  [<001b4394>] lock_acquire+0x30c/0x358
[7.061352]  [<0089454c>] _raw_spin_lock_bh+0x64/0xa0
[7.062171]  [<03ff81465858>] mlx5e_set_rx_mode_work+0x248/0x490 
[mlx5_core]
[7.062178]  [<00163864>] process_one_work+0x41c/0x830
[7.062181]  [<00163f2c>] worker_thread+0x2b4/0x478
[7.062186]  [<0016c46c>] kthread+0x15c/0x170
[7.062190]  [<00895a52>] kernel_thread_starter+0x6/0xc
[7.062193]  [<00895a4c>] kernel_thread_starter+0x0/0xc
[7.062196] INFO: lockdep is turned off.

The problematic lock is net_device.addr_list_lock whose usage is
asynchronously triggered by:

mlx5e_add -> mlx5e_attach -> mlx5e_attach_netdev -> mlx5e_nic_enable 
[workq] mlx5e_set_rx_mode_work -> mlx5e_handle_netdev_addr -> 
mlx5e_sync_netdev_addr

Initialization of this lock is triggered by:
mlx5e_add -> register_netdev

...after the call to mlx5e_attach which is obviously racy.

Regards,
Sebastian



Re: [PATCHv2 1/5] sh_eth: add generic wake-on-lan support via magic packet

2016-12-13 Thread Geert Uytterhoeven
CC linux-pm

On Mon, Dec 12, 2016 at 5:09 PM, Niklas Söderlund
 wrote:
> Add generic functionality to support Wake-on-Lan using MagicPacket which
> are supported by at least a few versions of sh_eth. Only add
> functionality for WoL, no specific sh_eth version are marked to support
> WoL yet.
>
> WoL is enabled in the suspend callback by setting MagicPacket detection
> and disabling all interrupts expect MagicPacket. In the resume path the
> driver needs to reset the hardware to rearm the WoL logic, this prevents
> the driver from simply restoring the registers and to take advantage of
> that sh_eth was not suspended to reduce resume time. To reset the
> hardware the driver close and reopens the device just like it would do
> in a normal suspend/resume scenario without WoL enabled, but it both
> close and open the device in the resume callback since the device needs
> to be open for WoL to work.
>
> One quirk needed for WoL is that the module clock needs to be prevented
> from being switched off by Runtime PM. To keep the clock alive the
> suspend callback need to call clk_enable() directly to increase the
> usage count of the clock. Then when Runtime PM decreases the clock usage
> count it won't reach 0 and be switched off.
>
> Signed-off-by: Niklas Söderlund 

Thanks for the update!

I've verified WoL is working on r8a7791/koelsch and r8a7740/armadillo.

However, if I look at /sys/kernel/debug/wakeup_sources, "active_count" and
"event_count" for the Ethernet device do not increase when using WoL, while
they do for the GPIO keyboard when using the keyboard for wake up.
So something seems to be missing from a bookkeeping point of view.

Interestingly, "wakeup_count" does not increase for both?
Has this been broken recently?

> ---
>  drivers/net/ethernet/renesas/sh_eth.c | 112 
> +++---
>  drivers/net/ethernet/renesas/sh_eth.h |   3 +
>  2 files changed, 107 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/net/ethernet/renesas/sh_eth.c 
> b/drivers/net/ethernet/renesas/sh_eth.c
> index 05b0dc5..87640b9 100644
> --- a/drivers/net/ethernet/renesas/sh_eth.c
> +++ b/drivers/net/ethernet/renesas/sh_eth.c
> @@ -2199,6 +2199,33 @@ static int sh_eth_set_ringparam(struct net_device 
> *ndev,
> return 0;
>  }
>
> +static void sh_eth_get_wol(struct net_device *ndev, struct ethtool_wolinfo 
> *wol)
> +{
> +   struct sh_eth_private *mdp = netdev_priv(ndev);
> +
> +   wol->supported = 0;
> +   wol->wolopts = 0;
> +
> +   if (mdp->cd->magic && mdp->clk) {
> +   wol->supported = WAKE_MAGIC;
> +   wol->wolopts = mdp->wol_enabled ? WAKE_MAGIC : 0;
> +   }
> +}
> +
> +static int sh_eth_set_wol(struct net_device *ndev, struct ethtool_wolinfo 
> *wol)
> +{
> +   struct sh_eth_private *mdp = netdev_priv(ndev);
> +
> +   if (!mdp->cd->magic || !mdp->clk || wol->wolopts & ~WAKE_MAGIC)
> +   return -EOPNOTSUPP;
> +
> +   mdp->wol_enabled = !!(wol->wolopts & WAKE_MAGIC);
> +
> +   device_set_wakeup_enable(>pdev->dev, mdp->wol_enabled);
> +
> +   return 0;
> +}
> +
>  static const struct ethtool_ops sh_eth_ethtool_ops = {
> .get_regs_len   = sh_eth_get_regs_len,
> .get_regs   = sh_eth_get_regs,
> @@ -2213,6 +2240,8 @@ static const struct ethtool_ops sh_eth_ethtool_ops = {
> .set_ringparam  = sh_eth_set_ringparam,
> .get_link_ksettings = sh_eth_get_link_ksettings,
> .set_link_ksettings = sh_eth_set_link_ksettings,
> +   .get_wol= sh_eth_get_wol,
> +   .set_wol= sh_eth_set_wol,
>  };
>
>  /* network device open function */
> @@ -3017,6 +3046,11 @@ static int sh_eth_drv_probe(struct platform_device 
> *pdev)
> goto out_release;
> }
>
> +   /* Get clock, if not found that's OK but Wake-On-Lan is unavailable */
> +   mdp->clk = devm_clk_get(>dev, NULL);
> +   if (IS_ERR(mdp->clk))
> +   mdp->clk = NULL;
> +
> ndev->base_addr = res->start;
>
> spin_lock_init(>lock);
> @@ -3111,6 +3145,9 @@ static int sh_eth_drv_probe(struct platform_device 
> *pdev)
> if (ret)
> goto out_napi_del;
>
> +   if (mdp->cd->magic && mdp->clk)
> +   device_set_wakeup_capable(>dev, 1);
> +
> /* print device information */
> netdev_info(ndev, "Base address at 0x%x, %pM, IRQ %d.\n",
> (u32)ndev->base_addr, ndev->dev_addr, ndev->irq);
> @@ -3150,15 +3187,67 @@ static int sh_eth_drv_remove(struct platform_device 
> *pdev)
>
>  #ifdef CONFIG_PM
>  #ifdef CONFIG_PM_SLEEP
> +static int sh_eth_wol_setup(struct net_device *ndev)
> +{
> +   struct sh_eth_private *mdp = netdev_priv(ndev);
> +
> +   /* Only allow ECI interrupts */
> +   synchronize_irq(ndev->irq);
> +   napi_disable(>napi);
> +   sh_eth_write(ndev, DMAC_M_ECI, EESIPR);
> +
> +   /* 

Re: [PATCH net-next 13/27] bridge: use __vlan_hwaccel helpers

2016-12-13 Thread Sergei Shtylyov

Hello!

On 12/13/2016 3:12 AM, Michał Mirosław wrote:


This removes assumption than vlan_tci != 0 when tag is present.

Signed-off-by: Michał Mirosław 
---
 net/bridge/br_netfilter_hooks.c | 14 --
 net/bridge/br_private.h |  2 +-
 net/bridge/br_vlan.c|  6 +++---
 3 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/net/bridge/br_netfilter_hooks.c b/net/bridge/br_netfilter_hooks.c
index b12501a..2cc0747 100644
--- a/net/bridge/br_netfilter_hooks.c
+++ b/net/bridge/br_netfilter_hooks.c

[...]

@@ -749,8 +747,12 @@ static int br_nf_dev_queue_xmit(struct net *net, struct 
sock *sk, struct sk_buff

data = this_cpu_ptr(_frag_data_storage);

-   data->vlan_tci = skb->vlan_tci;
-   data->vlan_proto = skb->vlan_proto;
+   if (skb_vlan_tag_present(skb)) {
+   data->vlan_tci = skb->vlan_tci;
+   data->vlan_proto = skb->vlan_proto;
+   } else
+   data->vlan_proto = 0;


   CodingStyle: should use {} in all branches of *if* if at least one branch 
has {}.


[...]

diff --git a/net/bridge/br_vlan.c b/net/bridge/br_vlan.c
index b6de4f4..ef94664 100644
--- a/net/bridge/br_vlan.c
+++ b/net/bridge/br_vlan.c



@@ -444,8 +444,8 @@ static bool __allowed_ingress(const struct net_bridge *br,
__vlan_hwaccel_put_tag(skb, br->vlan_proto, pvid);
else
/* Priority-tagged Frame.
-* At this point, We know that skb->vlan_tci had
-* VLAN_TAG_PRESENT bit and its VID field was 0x000.
+* At this point, We know that skb->vlan_tci VID


   s/We/we/.


+* field was 0x000.


   Simply 0, maybe?

[...]

MBR, Sergei



Re: Synopsys Ethernet QoS

2016-12-13 Thread Joao Pinto
Às 12:50 PM de 12/13/2016, Lars Persson escreveu:
> 
>> 13 dec. 2016 kl. 13:31 skrev Niklas Cassel :
>>
>>
>>
>>> On 12/13/2016 12:49 PM, Joao Pinto wrote:
>>> Hi Niklas,
>>>
>>> Às 4:25 PM de 12/12/2016, Niklas Cassel escreveu:

> On 12/12/2016 11:19 AM, Joao Pinto wrote:
> Hi,
>
> Às 1:44 AM de 12/10/2016, Florian Fainelli escreveu:
>>> Le 12/09/16 à 16:16, Andy Shevchenko a écrit :
>>> On Sat, Dec 10, 2016 at 12:52 AM, Florian Fainelli 
>>>  wrote:
>>> (snip...)
>>>
>>>
> @Rabin Vincent: Hi Rabin. Since Axis is more familiar with the 
> synopsys/*qos*
> driver would it be possible for you to make an initial analysis of what 
> has to
> be merged into Stmmac? This way the development would speed-up.
 I can answer that question.

 I've sent out 12 patches to the stmmac driver
 (all patches are included in the current net-next tree),
 with these patches the stmmac driver works properly on Axis hardware
 (we use Synopsys GMAC 4.10a synthesized with multiple TX queues).
 stmmac's DT binding has also been extended with properties that
 existed in DWC EQoS's DT binding, such as no-pbl-x8, txpbl, rxpbl.

 Since we have no problem updating the DTB together with the kernel,
 we will simply move to using the start using the stmmac driver,
 with stmmac's DT binding.

 However, I've noticed that NVIDIA has extended the DWC EQoS DT binding,
 I don't how easy it would be for them to switch to stmmac's DT binding.
 (Adding Stephen Warren to CC.)

 The reset sequence that Lars Persson was worried about is not an issue
 with the stmmac driver.
>>> Great! So you saying that stmmac works great with AXIS hardware and no need 
>>> to
>>> merge specific code from AXIS' *qos* driver?
>>
>> Yes. From Axis point of view, we are done.
>> stmmac works and we will move to that driver + DT binding.
>>
>> However, it seems like Stephen Warren will NAK if you try to remove
>> synopsys/dwc_eth_qos.c before
>> Documentation/devicetree/bindings/net/stmmac.txt
>> is compatible with
>> Documentation/devicetree/bindings/net/snps,dwc-qos-ethernet.txt
>>
>> You might want to sync with him. I have no idea, but perhaps they are
>> only using a subset of all the available properties. Perhaps,
>> only implementing what they are using might be enough, I don't know.
>> I couldn't find their DTS in arch/arm/dts.
>> I guess you might want to know David Miller's opinion,
>> since he's the one who decides in the end.
> 
> I will also NACK removal of dwc_eth_qos.c until the binding is implemented 
> elsewhere.

Totally agree.
@Niklas: David Miller is informed of what we are planning to do. Can you
estimate the effort of merging the axis driver device tree bindings? If there
was anyone from axis to do the merge would be better since you are familiar with
it. What do you think?

> 
> 
>>



 There are some performance problems with the stmmac driver though:

 When running iperf3 with 3 streams:
 iperf3 -c 192.168.0.90 -P 3 -t 30
 iperf3 -c 192.168.0.90 -P 3 -t 30 -R

 I get really bad fairness between the streams.

 This appears to be an issue with how TX IRQ coalescing is implemented in 
 stmmac.
 Disabling TX IRQ coalescing in the stmmac driver makes the problem go away.
 We have a local patch that implements TX IRQ coalescing in the dwceqos 
 driver,
 and we don't see the same problem.

 Also netperf TCP_RR and UDP_RR gives really bad results compared to the
 dwceqos driver (without IRQ coalescing).
 2000 transactions/sec vs 9000 transactions/sec.
 Turning TX IRQ coalescing off and RX interrupt watchdog off in stmmac
 gives the same performance. I guess it's a trade off, low CPU usage
 vs low latency, so I don't know how important TCP_RR/UDP_RR really is.

 The best thing would be to get a good working TX IRQ coalesce
 implementation with HR timers in stmmac.
 Perhaps it should also be investigated if the RX interrupt watchdog
 timeout should have a lower default value.



> Thanks to all.
>
> Joao
>>



Re: [PATCHv2 2/5] sh_eth: enable wake-on-lan for Gen2 devices

2016-12-13 Thread Geert Uytterhoeven
On Mon, Dec 12, 2016 at 5:09 PM, Niklas Söderlund
 wrote:
> Tested on Gen2 r8a7791/Koelsch.
>
> Signed-off-by: Niklas Söderlund 

Tested-by: Geert Uytterhoeven 

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds


Re: [PATCHv2 3/5] sh_eth: enable wake-on-lan for r8a7740/armadillo

2016-12-13 Thread Geert Uytterhoeven
On Mon, Dec 12, 2016 at 5:09 PM, Niklas Söderlund
 wrote:
> Geert Uytterhoeven reported WoL worked on his Armadillo board.
>
> Signed-off-by: Niklas Söderlund 

Tested-by: Geert Uytterhoeven 

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds


Re: Synopsys Ethernet QoS

2016-12-13 Thread Lars Persson

> 13 dec. 2016 kl. 13:31 skrev Niklas Cassel :
> 
> 
> 
>> On 12/13/2016 12:49 PM, Joao Pinto wrote:
>> Hi Niklas,
>> 
>> Às 4:25 PM de 12/12/2016, Niklas Cassel escreveu:
>>> 
 On 12/12/2016 11:19 AM, Joao Pinto wrote:
 Hi,
 
 Às 1:44 AM de 12/10/2016, Florian Fainelli escreveu:
>> Le 12/09/16 à 16:16, Andy Shevchenko a écrit :
>> On Sat, Dec 10, 2016 at 12:52 AM, Florian Fainelli 
>>  wrote:
>> (snip...)
>> 
>> 
 @Rabin Vincent: Hi Rabin. Since Axis is more familiar with the 
 synopsys/*qos*
 driver would it be possible for you to make an initial analysis of what 
 has to
 be merged into Stmmac? This way the development would speed-up.
>>> I can answer that question.
>>> 
>>> I've sent out 12 patches to the stmmac driver
>>> (all patches are included in the current net-next tree),
>>> with these patches the stmmac driver works properly on Axis hardware
>>> (we use Synopsys GMAC 4.10a synthesized with multiple TX queues).
>>> stmmac's DT binding has also been extended with properties that
>>> existed in DWC EQoS's DT binding, such as no-pbl-x8, txpbl, rxpbl.
>>> 
>>> Since we have no problem updating the DTB together with the kernel,
>>> we will simply move to using the start using the stmmac driver,
>>> with stmmac's DT binding.
>>> 
>>> However, I've noticed that NVIDIA has extended the DWC EQoS DT binding,
>>> I don't how easy it would be for them to switch to stmmac's DT binding.
>>> (Adding Stephen Warren to CC.)
>>> 
>>> The reset sequence that Lars Persson was worried about is not an issue
>>> with the stmmac driver.
>> Great! So you saying that stmmac works great with AXIS hardware and no need 
>> to
>> merge specific code from AXIS' *qos* driver?
> 
> Yes. From Axis point of view, we are done.
> stmmac works and we will move to that driver + DT binding.
> 
> However, it seems like Stephen Warren will NAK if you try to remove
> synopsys/dwc_eth_qos.c before
> Documentation/devicetree/bindings/net/stmmac.txt
> is compatible with
> Documentation/devicetree/bindings/net/snps,dwc-qos-ethernet.txt
> 
> You might want to sync with him. I have no idea, but perhaps they are
> only using a subset of all the available properties. Perhaps,
> only implementing what they are using might be enough, I don't know.
> I couldn't find their DTS in arch/arm/dts.
> I guess you might want to know David Miller's opinion,
> since he's the one who decides in the end.

I will also NACK removal of dwc_eth_qos.c until the binding is implemented 
elsewhere.


> 
>>> 
>>> 
>>> 
>>> There are some performance problems with the stmmac driver though:
>>> 
>>> When running iperf3 with 3 streams:
>>> iperf3 -c 192.168.0.90 -P 3 -t 30
>>> iperf3 -c 192.168.0.90 -P 3 -t 30 -R
>>> 
>>> I get really bad fairness between the streams.
>>> 
>>> This appears to be an issue with how TX IRQ coalescing is implemented in 
>>> stmmac.
>>> Disabling TX IRQ coalescing in the stmmac driver makes the problem go away.
>>> We have a local patch that implements TX IRQ coalescing in the dwceqos 
>>> driver,
>>> and we don't see the same problem.
>>> 
>>> Also netperf TCP_RR and UDP_RR gives really bad results compared to the
>>> dwceqos driver (without IRQ coalescing).
>>> 2000 transactions/sec vs 9000 transactions/sec.
>>> Turning TX IRQ coalescing off and RX interrupt watchdog off in stmmac
>>> gives the same performance. I guess it's a trade off, low CPU usage
>>> vs low latency, so I don't know how important TCP_RR/UDP_RR really is.
>>> 
>>> The best thing would be to get a good working TX IRQ coalesce
>>> implementation with HR timers in stmmac.
>>> Perhaps it should also be investigated if the RX interrupt watchdog
>>> timeout should have a lower default value.
>>> 
>>> 
>>> 
 Thanks to all.
 
 Joao
> 


[PATCH iproute2 V2 1/2] tc: flower: Fix typo and style in flower man page

2016-12-13 Thread Roi Dayan
Replace vlan_eth_type with vlan_ethtype.

Fixes: 745d91726006 ("tc: flower: Introduce vlan support")
Signed-off-by: Roi Dayan 
Reviewed-by: Hadar Hen Zion 
---
 man/man8/tc-flower.8 | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/man/man8/tc-flower.8 b/man/man8/tc-flower.8
index 90fdfba..15bc32f 100644
--- a/man/man8/tc-flower.8
+++ b/man/man8/tc-flower.8
@@ -27,7 +27,7 @@ flower \- flow based traffic control filter
 .IR VID " | "
 .B vlan_prio
 .IR PRIORITY " | "
-.BR vlan_eth_type " { " ipv4 " | " ipv6 " | "
+.BR vlan_ethtype " { " ipv4 " | " ipv6 " | "
 .IR ETH_TYPE " } | "
 .BR ip_proto " { " tcp " | " udp " | " sctp " | " icmp " | " icmpv6 " | "
 .IR IP_PROTO " } | { "
@@ -82,16 +82,16 @@ Match on vlan tag id.
 .I VID
 is an unsigned 12bit value in decimal format.
 .TP
-.BI vlan_prio " priority"
+.BI vlan_prio " PRIORITY"
 Match on vlan tag priority.
 .I PRIORITY
 is an unsigned 3bit value in decimal format.
 .TP
-.BI vlan_eth_type " VLAN_ETH_TYPE"
+.BI vlan_ethtype " VLAN_ETH_TYPE"
 Match on layer three protocol.
-.I ETH_TYPE
+.I VLAN_ETH_TYPE
 may be either
-.BR ipv4 , ipv6
+.BR ipv4 ", " ipv6
 or an unsigned 16bit value in hexadecimal format.
 .TP
 .BI ip_proto " IP_PROTO"
-- 
2.7.4



[PATCH iproute2 V2 2/2] tc: tunnel_key: Add tc-tunnel_key man page to Makefile

2016-12-13 Thread Roi Dayan
To be installed with the other man pages.

Fixes: d57639a475a9 ("tc/act_tunnel: Introduce ip tunnel action")
Signed-off-by: Roi Dayan 
Reviewed-by: Amir Vadai 
---
 man/man8/Makefile | 1 +
 1 file changed, 1 insertion(+)

diff --git a/man/man8/Makefile b/man/man8/Makefile
index de6f249..d4cb01a 100644
--- a/man/man8/Makefile
+++ b/man/man8/Makefile
@@ -17,6 +17,7 @@ MAN8PAGES = $(TARGETS) ip.8 arpd.8 lnstat.8 routel.8 rtacct.8 
rtmon.8 rtpr.8 ss.
tc-tcindex.8 tc-u32.8 tc-matchall.8 \
tc-connmark.8 tc-csum.8 tc-mirred.8 tc-nat.8 tc-pedit.8 tc-police.8 \
tc-simple.8 tc-skbedit.8 tc-vlan.8 tc-xt.8  tc-ife.8 tc-skbmod.8 \
+   tc-tunnel_key.8 \
devlink.8 devlink-dev.8 devlink-monitor.8 devlink-port.8 devlink-sb.8
 
 all: $(TARGETS)
-- 
2.7.4



[PATCH iproute2 V2 0/2] Man pages fixes

2016-12-13 Thread Roi Dayan
Hi,

The 2 patches are man page related only.
First fixes a typo and second adding missing man page to the Makefile.

Thanks,
Roi

v2:
- style fix in flower man page

Roi Dayan (2):
  tc: flower: Fix typo and style in flower man page
  tc: tunnel_key: Add tc-tunnel_key man page to Makefile

 man/man8/Makefile|  1 +
 man/man8/tc-flower.8 | 10 +-
 2 files changed, 6 insertions(+), 5 deletions(-)

-- 
2.7.4



Re: Synopsys Ethernet QoS

2016-12-13 Thread Niklas Cassel


On 12/13/2016 12:49 PM, Joao Pinto wrote:
> Hi Niklas,
>
> Às 4:25 PM de 12/12/2016, Niklas Cassel escreveu:
>>
>> On 12/12/2016 11:19 AM, Joao Pinto wrote:
>>> Hi,
>>>
>>> Às 1:44 AM de 12/10/2016, Florian Fainelli escreveu:
 Le 12/09/16 à 16:16, Andy Shevchenko a écrit :
> On Sat, Dec 10, 2016 at 12:52 AM, Florian Fainelli  
> wrote:
> (snip...)
>
>
>>> @Rabin Vincent: Hi Rabin. Since Axis is more familiar with the 
>>> synopsys/*qos*
>>> driver would it be possible for you to make an initial analysis of what has 
>>> to
>>> be merged into Stmmac? This way the development would speed-up.
>> I can answer that question.
>>
>> I've sent out 12 patches to the stmmac driver
>> (all patches are included in the current net-next tree),
>> with these patches the stmmac driver works properly on Axis hardware
>> (we use Synopsys GMAC 4.10a synthesized with multiple TX queues).
>> stmmac's DT binding has also been extended with properties that
>> existed in DWC EQoS's DT binding, such as no-pbl-x8, txpbl, rxpbl.
>>
>> Since we have no problem updating the DTB together with the kernel,
>> we will simply move to using the start using the stmmac driver,
>> with stmmac's DT binding.
>>
>> However, I've noticed that NVIDIA has extended the DWC EQoS DT binding,
>> I don't how easy it would be for them to switch to stmmac's DT binding.
>> (Adding Stephen Warren to CC.)
>>
>> The reset sequence that Lars Persson was worried about is not an issue
>> with the stmmac driver.
> Great! So you saying that stmmac works great with AXIS hardware and no need to
> merge specific code from AXIS' *qos* driver?

Yes. From Axis point of view, we are done.
stmmac works and we will move to that driver + DT binding.

However, it seems like Stephen Warren will NAK if you try to remove
synopsys/dwc_eth_qos.c before
Documentation/devicetree/bindings/net/stmmac.txt
is compatible with
Documentation/devicetree/bindings/net/snps,dwc-qos-ethernet.txt

You might want to sync with him. I have no idea, but perhaps they are
only using a subset of all the available properties. Perhaps,
only implementing what they are using might be enough, I don't know.
I couldn't find their DTS in arch/arm/dts.
I guess you might want to know David Miller's opinion,
since he's the one who decides in the end.

>>
>>
>>
>> There are some performance problems with the stmmac driver though:
>>
>> When running iperf3 with 3 streams:
>> iperf3 -c 192.168.0.90 -P 3 -t 30
>> iperf3 -c 192.168.0.90 -P 3 -t 30 -R
>>
>> I get really bad fairness between the streams.
>>
>> This appears to be an issue with how TX IRQ coalescing is implemented in 
>> stmmac.
>> Disabling TX IRQ coalescing in the stmmac driver makes the problem go away.
>> We have a local patch that implements TX IRQ coalescing in the dwceqos 
>> driver,
>> and we don't see the same problem.
>>
>> Also netperf TCP_RR and UDP_RR gives really bad results compared to the
>> dwceqos driver (without IRQ coalescing).
>> 2000 transactions/sec vs 9000 transactions/sec.
>> Turning TX IRQ coalescing off and RX interrupt watchdog off in stmmac
>> gives the same performance. I guess it's a trade off, low CPU usage
>> vs low latency, so I don't know how important TCP_RR/UDP_RR really is.
>>
>> The best thing would be to get a good working TX IRQ coalesce
>> implementation with HR timers in stmmac.
>> Perhaps it should also be investigated if the RX interrupt watchdog
>> timeout should have a lower default value.
>>
>>
>>
>>> Thanks to all.
>>>
>>> Joao



Re: Synopsys Ethernet QoS

2016-12-13 Thread Joao Pinto

Hi Niklas,

Às 4:25 PM de 12/12/2016, Niklas Cassel escreveu:
> 
> 
> On 12/12/2016 11:19 AM, Joao Pinto wrote:
>> Hi,
>>
>> Às 1:44 AM de 12/10/2016, Florian Fainelli escreveu:
>>> Le 12/09/16 à 16:16, Andy Shevchenko a écrit :
 On Sat, Dec 10, 2016 at 12:52 AM, Florian Fainelli  
 wrote:

(snip...)


>>
>> @Rabin Vincent: Hi Rabin. Since Axis is more familiar with the synopsys/*qos*
>> driver would it be possible for you to make an initial analysis of what has 
>> to
>> be merged into Stmmac? This way the development would speed-up.
> 
> I can answer that question.
> 
> I've sent out 12 patches to the stmmac driver
> (all patches are included in the current net-next tree),
> with these patches the stmmac driver works properly on Axis hardware
> (we use Synopsys GMAC 4.10a synthesized with multiple TX queues).
> stmmac's DT binding has also been extended with properties that
> existed in DWC EQoS's DT binding, such as no-pbl-x8, txpbl, rxpbl.
> 
> Since we have no problem updating the DTB together with the kernel,
> we will simply move to using the start using the stmmac driver,
> with stmmac's DT binding.
> 
> However, I've noticed that NVIDIA has extended the DWC EQoS DT binding,
> I don't how easy it would be for them to switch to stmmac's DT binding.
> (Adding Stephen Warren to CC.)
> 
> The reset sequence that Lars Persson was worried about is not an issue
> with the stmmac driver.

Great! So you saying that stmmac works great with AXIS hardware and no need to
merge specific code from AXIS' *qos* driver?

> 
> 
> 
> 
> There are some performance problems with the stmmac driver though:
> 
> When running iperf3 with 3 streams:
> iperf3 -c 192.168.0.90 -P 3 -t 30
> iperf3 -c 192.168.0.90 -P 3 -t 30 -R
> 
> I get really bad fairness between the streams.
> 
> This appears to be an issue with how TX IRQ coalescing is implemented in 
> stmmac.
> Disabling TX IRQ coalescing in the stmmac driver makes the problem go away.
> We have a local patch that implements TX IRQ coalescing in the dwceqos driver,
> and we don't see the same problem.
> 
> Also netperf TCP_RR and UDP_RR gives really bad results compared to the
> dwceqos driver (without IRQ coalescing).
> 2000 transactions/sec vs 9000 transactions/sec.
> Turning TX IRQ coalescing off and RX interrupt watchdog off in stmmac
> gives the same performance. I guess it's a trade off, low CPU usage
> vs low latency, so I don't know how important TCP_RR/UDP_RR really is.
> 
> The best thing would be to get a good working TX IRQ coalesce
> implementation with HR timers in stmmac.
> Perhaps it should also be investigated if the RX interrupt watchdog
> timeout should have a lower default value.
> 
> 
> 
>>
>> Thanks to all.
>>
>> Joao
> 



Re: [PATCHv2 4/5] sh_eth: enable wake-on-lan for sh7743

2016-12-13 Thread Geert Uytterhoeven
On Mon, Dec 12, 2016 at 5:09 PM, Niklas Söderlund
 wrote:
> This is based on public datasheet for sh7743 which shows it have the

sh7734 (also in the one-line summary)
it has

> same behavior and registers for WoL as other versions of sh_eth.

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds


Re: netlink: GPF in sock_sndtimeo

2016-12-13 Thread Richard Guy Briggs
On 2016-12-12 16:10, Cong Wang wrote:
> On Mon, Dec 12, 2016 at 2:02 AM, Richard Guy Briggs  wrote:
> > On 2016-12-09 20:13, Cong Wang wrote:
> >> Netlink notifier can safely be converted to blocking one, I will send
> >> a patch.
> >
> > I had a quick look at how that might happen.  The netlink notifier chain
> > is atomic.  Would the registered callback funciton need to spawn a
> > one-time thread to avoid blocking?
> 
> It is already non-atomic now:
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=efa172f42836477bf1ac3c9a3053140df764699c

Ok, that is recent...  It is still less attractive as you point out due
to the overhead, but still worth considering if we can't find another
way.

> > I had a look at your patch.  It looks attractively simple.  The audit
> > next tree has patches queued that add an audit_reset function that will
> > require more work.  I still see some potential gaps.
> >
> > - If the process messes up (or the sock lookup messes up) it is reset
> >   in the kauditd thread under the audit_cmd_mutex.
> >
> > - If the process exits normally or is replaced due to an audit_replace
> >   error, it is reset from audit_receive_skb under the audit_cmd_mutex.
> >
> > - If the process dies before the kauditd thread notices, either reap it
> >   via notifier callback or it needs a check on net exit to reset.  This
> >   last one appears necessary to decrement the sock refcount so the sock
> >   can be released in netlink_kernel_release().
> >
> > If we want to be proactive and use the netlink notifier, we assume the
> > overhead of adding to the netlink notifier chain and eliminate all the
> > other reset calls under the kauditd thread.  If we are ok being
> > reactionary, then we'll at least need the net exit check on audit_sock.
> 
> I don't see why we need to check it in net exit if we use refcnt,
> because we have two different users of audit_sock: kauditd and
> netns, if both take care of refcnt properly, we don't need to worry
> about who is the last, no matter what failures occur in what order.

It is actually the audit_pid and audit_nlk_portid that I care about
more.  The audit daemon could vanish or close the socket while the
kernel sock to which it was attached is still quite valid.  Accessing
the set of three atomically is the urge.  I wonder if it makes more
sense to test for the presence of auditd using audit_sock rather than
audit_pid, but still keep audit_pid for our reporting and replacement
strategy.  Another idea would be to put the three in one struct.

Can someone explain how they think the original test was able to trigger
this GPF?  Network namespace shutdown while something pretended to set
up a new auditd?  That's impressive for a fuzzer if that's the case...
Is there an strace?  I guess it is all in test().

- RGB

--
Richard Guy Briggs 
Kernel Security Engineering, Base Operating Systems, Red Hat
Remote, Ottawa, Canada
Voice: +1.647.777.2635, Internal: (81) 32635


  1   2   >