[PATCH] ip: find correct route for socket which is not bound to a device

2015-09-16 Thread Wengang Wang
For multicast, we should also find a valid route (and thus get a meaningful
PMTU) for packets sent on a socket which is not bound to a device
(sk_bound_dev_if being 0).

From the man page of socket(7):

   SO_BINDTODEVICE
Bind this socket to a particular device like “eth0”, as
specified in the passed interface name.  If the name is an
empty string or the option length is zero, the socket
device binding is removed. The  passed  option is  a
variable-length null-terminated interface name string with
the maximum size of IFNAMSIZ.  If a socket is bound to an
interface, only packets received from that particular
interface are processed by the socket. Note that this works
only for some socket types, particularly AF_INET sockets.
It is not supported for packet sockets (use normal bind(2)
there).

The man page does not say that packets from a socket which is not bound to a
device will not be routed.

I hit a problem where all multicast packets were dropped by the kernel on the
sender host. The lower layer is IPoIB with an MTU of 7000, and I was sending
4096-byte multicast packets. Inside IPoIB the first send is dropped because
it exceeds the internal packet size limit mcast_mtu, which is 2044.
So IPoIB calls ip_rt_update_pmtu (indirectly) to set the path MTU. A
correct route is configured for the multicast destination, so setting the
PMTU succeeded, and the next multicast packet (to the same target) was
expected to succeed (it would be fragmented according to the PMTU just set).
But the second and later multicast packets were dropped too. The
reason is that the lookup (fib_lookup) is skipped because
the socket is not bound to a device (sk_bound_dev_if being 0). With
the patch proposed here applied, it works fine.
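
As a rough illustration of the fragmentation the sender expects once the PMTU
is in place, the sketch below counts IPv4 fragments. It is not kernel code.
It assumes a 20-byte IPv4 header without options, that the 4096-byte message
is carried over UDP (8-byte header), and that the PMTU ends up equal to
mcast_mtu; none of these details are stated above, and the function name is
made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Number of IPv4 fragments needed to carry l3_payload bytes (everything
 * after the IP header) over a path with the given PMTU.
 * Every fragment except the last must carry a multiple of 8 payload bytes.
 */
size_t ipv4_fragment_count(size_t l3_payload, size_t pmtu, size_t ip_hdr_len)
{
    /* A usable PMTU must fit the header plus at least one 8-byte unit. */
    assert(pmtu >= ip_hdr_len + 8);

    /* Fits in a single packet: no fragmentation needed. */
    if (l3_payload + ip_hdr_len <= pmtu)
        return 1;

    /* Largest per-fragment payload, rounded down to a multiple of 8. */
    size_t per_frag = ((pmtu - ip_hdr_len) / 8) * 8;

    /* Ceiling division. */
    return (l3_payload + per_frag - 1) / per_frag;
}
```

Under those assumptions, 4096 + 8 = 4104 bytes over a PMTU of 2044 gives a
per-fragment payload of 2024 bytes and therefore three fragments, while the
same datagram fits unfragmented into the 7000-byte interface MTU.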

Signed-off-by: Wengang Wang 
---
 net/ipv4/route.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 5f4a556..032481a 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2097,7 +2097,7 @@ struct rtable *__ip_route_output_key(struct net *net, 
struct flowi4 *fl4)
 */
 
fl4->flowi4_oif = dev_out->ifindex;
-   goto make_route;
+   goto lookup;
}
 
if (!(fl4->flowi4_flags & FLOWI_FLAG_ANYSRC)) {
@@ -2153,6 +2153,7 @@ struct rtable *__ip_route_output_key(struct net *net, 
struct flowi4 *fl4)
goto make_route;
}
 
+lookup:
if (fib_lookup(net, fl4, &res, 0)) {
res.fi = NULL;
res.table = NULL;
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 net-next 0/2] bpf: performance improvements

2015-09-16 Thread Alexei Starovoitov
v1->v2: dropped redundant iff_up check in patch 2

At plumbers we discussed different options on how to get rid of skb_clone
from bpf_clone_redirect(), the patch 2 implements the best option.
Patch 1 adds 'integrated exts' to cls_bpf to improve performance by
combining simple actions into bpf classifier.

Alexei Starovoitov (1):
  bpf: add bpf_redirect() helper

Daniel Borkmann (1):
  cls_bpf: introduce integrated actions

 include/net/sch_generic.h|3 ++-
 include/uapi/linux/bpf.h |9 +++
 include/uapi/linux/pkt_cls.h |4 +++
 net/core/dev.c   |8 ++
 net/core/filter.c|   58 +++
 net/sched/act_bpf.c  |1 +
 net/sched/cls_bpf.c  |   61 ++
 samples/bpf/bpf_helpers.h|4 +++
 samples/bpf/tcbpf1_kern.c|   24 -
 9 files changed, 159 insertions(+), 13 deletions(-)

-- 
1.7.9.5



[PATCH v2 net-next 2/2] bpf: add bpf_redirect() helper

2015-09-16 Thread Alexei Starovoitov
Existing bpf_clone_redirect() helper clones skb before redirecting
it to RX or TX of destination netdev.
Introduce bpf_redirect() helper that does that without cloning.

Benchmarked with two hosts using 10G ixgbe NICs.
One host is doing line rate pktgen.
Another host is configured as:
$ tc qdisc add dev $dev ingress
$ tc filter add dev $dev root pref 10 u32 match u32 0 0 flowid 1:2 \
   action bpf run object-file tcbpf1_kern.o section clone_redirect_xmit drop
so it receives the packet on $dev and immediately xmits it on $dev + 1
The section 'clone_redirect_xmit' in tcbpf1_kern.o file has the program
that does bpf_clone_redirect() and performance is 2.0 Mpps

$ tc filter add dev $dev root pref 10 u32 match u32 0 0 flowid 1:2 \
   action bpf run object-file tcbpf1_kern.o section redirect_xmit drop
which is using bpf_redirect() - 2.4 Mpps

and using cls_bpf with integrated actions as:
$ tc filter add dev $dev root pref 10 \
  bpf run object-file tcbpf1_kern.o section redirect_xmit integ_act classid 1
performance is 2.5 Mpps

To summarize:
u32+act_bpf using clone_redirect - 2.0 Mpps
u32+act_bpf using redirect - 2.4 Mpps
cls_bpf using redirect - 2.5 Mpps

For comparison linux bridge in this setup is doing 2.1 Mpps
and ixgbe rx + drop in ip_rcv - 7.8 Mpps
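
For a rough sense of scale, the throughput figures above can be turned into a
per-packet time budget. The helper below is only an illustrative sketch; its
name is invented here and it is not part of the patch.

```c
#include <assert.h>

/*
 * Convert a packet rate in Mpps (million packets per second) into the
 * average time spent per packet, in nanoseconds.
 * 1 Mpps = 1e6 packets/s, so one packet takes 1e9 / (mpps * 1e6) ns.
 */
double ns_per_packet(double mpps)
{
    assert(mpps > 0.0);
    return 1000.0 / mpps;
}
```

With the numbers reported above, 2.0 Mpps corresponds to 500 ns per packet,
2.4 Mpps to roughly 417 ns and 2.5 Mpps to 400 ns, so the gap between
clone_redirect and cls_bpf with redirect is about 100 ns per packet.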

Signed-off-by: Alexei Starovoitov 
Acked-by: Daniel Borkmann 
---
This approach is using per_cpu scratch area to store ifindex and flags.
The other alternatives discussed at plumbers are slower and more intrusive.
v1->v2: dropped redundant iff_up check
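
The scratch-area idea can be modelled in user space roughly as follows. This
is only a sketch under stated assumptions: thread-local storage stands in for
per-CPU data, and model_bpf_redirect(), model_consume_redirect() and
REDIRECT_FLAG_INGRESS are hypothetical names, not the kernel's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TC_ACT_REDIRECT        7      /* value taken from the pkt_cls.h hunk */
#define REDIRECT_FLAG_INGRESS  1u     /* hypothetical name for "bit 0" */

struct redirect_info {
    uint32_t ifindex;
    uint32_t flags;
};

/* Stand-in for the per-CPU scratch area: one slot per thread. */
static _Thread_local struct redirect_info scratch;

/* Helper side: only record where the packet should go, no clone. */
int model_bpf_redirect(uint32_t ifindex, uint32_t flags)
{
    assert(ifindex != 0);          /* ifindex 0 is not a valid device */
    scratch.ifindex = ifindex;
    scratch.flags = flags;
    return TC_ACT_REDIRECT;
}

/* Caller side: after the program returns TC_ACT_REDIRECT, read the slot. */
struct redirect_info model_consume_redirect(bool *to_ingress)
{
    *to_ingress = (scratch.flags & REDIRECT_FLAG_INGRESS) != 0;
    return scratch;
}
```

The point of the design is that the helper and its consumer run back to back
on the same CPU, so a tiny fixed-size slot is enough to pass the target and
no skb has to be cloned.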

 include/net/sch_generic.h|1 +
 include/uapi/linux/bpf.h |8 
 include/uapi/linux/pkt_cls.h |1 +
 net/core/dev.c   |8 
 net/core/filter.c|   44 ++
 net/sched/act_bpf.c  |1 +
 net/sched/cls_bpf.c  |1 +
 samples/bpf/bpf_helpers.h|4 
 samples/bpf/tcbpf1_kern.c|   24 ++-
 9 files changed, 91 insertions(+), 1 deletion(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index da61febb9091..4c79ce8c1f92 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -402,6 +402,7 @@ void __qdisc_calculate_pkt_len(struct sk_buff *skb,
   const struct qdisc_size_table *stab);
 bool tcf_destroy(struct tcf_proto *tp, bool force);
 void tcf_destroy_chain(struct tcf_proto __rcu **fl);
+int skb_do_redirect(struct sk_buff *);
 
 /* Reset all TX qdiscs greater then index of a device.  */
 static inline void qdisc_reset_all_tx_gt(struct net_device *dev, unsigned int 
i)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 2fbd1c71fa3b..4ec0b5488294 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -272,6 +272,14 @@ enum bpf_func_id {
BPF_FUNC_skb_get_tunnel_key,
BPF_FUNC_skb_set_tunnel_key,
BPF_FUNC_perf_event_read,   /* u64 bpf_perf_event_read(&map, index) */
+   /**
+* bpf_redirect(ifindex, flags) - redirect to another netdev
+* @ifindex: ifindex of the net device
+* @flags: bit 0 - if set, redirect to ingress instead of egress
+* other bits - reserved
+* Return: TC_ACT_REDIRECT
+*/
+   BPF_FUNC_redirect,
__BPF_FUNC_MAX_ID,
 };
 
diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h
index 0a262a83f9d4..439873775d49 100644
--- a/include/uapi/linux/pkt_cls.h
+++ b/include/uapi/linux/pkt_cls.h
@@ -87,6 +87,7 @@ enum {
 #define TC_ACT_STOLEN  4
 #define TC_ACT_QUEUED  5
 #define TC_ACT_REPEAT  6
+#define TC_ACT_REDIRECT  7
 #define TC_ACT_JUMP  0x1000
 
 /* Action type identifiers*/
diff --git a/net/core/dev.c b/net/core/dev.c
index 877c84834d81..d6a492e57874 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3668,6 +3668,14 @@ static inline struct sk_buff *handle_ing(struct sk_buff 
*skb,
case TC_ACT_QUEUED:
kfree_skb(skb);
return NULL;
+   case TC_ACT_REDIRECT:
+   /* skb_mac_header check was done by cls/act_bpf, so
+* we can safely push the L2 header back before
+* redirecting to another netdev
+*/
+   __skb_push(skb, skb->mac_len);
+   skb_do_redirect(skb);
+   return NULL;
default:
break;
}
diff --git a/net/core/filter.c b/net/core/filter.c
index 971d6ba89758..da3f3d94d6e9 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -1427,6 +1427,48 @@ const struct bpf_func_proto bpf_clone_redirect_proto = {
.arg3_type  = ARG_ANYTHING,
 };
 
+struct redirect_info {
+   u32 ifindex;
+   u32 flags;
+};
+
+static DEFINE_PER_CPU(struct redirect_info, redirect_info);
+static u64 bpf_redirect(u64 ifindex, u64 flags, u64 r3, u64 r4, u64 r5)
+{
+   struct redirect_info *ri = 

[PATCH v2 net-next 1/2] cls_bpf: introduce integrated actions

2015-09-16 Thread Alexei Starovoitov
From: Daniel Borkmann 

Often cls_bpf classifier is used with single action drop attached.
Optimize this use case and let cls_bpf return both classid and action.
For backwards compatibility reasons enable this feature under
TCA_BPF_FLAG_ACT_DIRECT flag.

Then more interesting programs like the following are easier to write:
int cls_bpf_prog(struct __sk_buff *skb)
{
  /* classify arp, ip, ipv6 into different traffic classes
   * and drop all other packets
   */
  switch (skb->protocol) {
  case htons(ETH_P_ARP):
skb->tc_classid = 1;
break;
  case htons(ETH_P_IP):
skb->tc_classid = 2;
break;
  case htons(ETH_P_IPV6):
skb->tc_classid = 3;
break;
  default:
return TC_ACT_SHOT;
  }

  return TC_ACT_OK;
}
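
Going by the filter.c changes in this patch, tc_classid can only be written
(not read) by tc classifier/action programs and is not accessible at all from
socket filters. A small user-space model of that rule, with enum and function
names invented for illustration:

```c
#include <assert.h>
#include <stdbool.h>

enum access_type { ACCESS_READ, ACCESS_WRITE };
enum prog_kind   { PROG_SOCKET_FILTER, PROG_TC_CLS_ACT };

/*
 * Mirrors the two checks added for offsetof(struct __sk_buff, tc_classid):
 *  - socket filters: always rejected
 *  - tc cls/act programs: accepted only for writes
 */
bool tc_classid_access_ok(enum prog_kind kind, enum access_type type)
{
    assert(type == ACCESS_READ || type == ACCESS_WRITE);

    if (kind == PROG_SOCKET_FILTER)
        return false;

    return type == ACCESS_WRITE;
}
```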

Joint work with Daniel Borkmann.

Signed-off-by: Daniel Borkmann 
Signed-off-by: Alexei Starovoitov 
---
v1->v2: no changes

 include/net/sch_generic.h|2 +-
 include/uapi/linux/bpf.h |1 +
 include/uapi/linux/pkt_cls.h |3 +++
 net/core/filter.c|   14 ++
 net/sched/cls_bpf.c  |   60 ++
 5 files changed, 68 insertions(+), 12 deletions(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 444faa89a55f..da61febb9091 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -251,7 +251,7 @@ struct tcf_proto {
 struct qdisc_skb_cb {
unsigned int pkt_len;
u16 slave_dev_queue_mapping;
-   u16 _pad;
+   u16 tc_classid;
 #define QDISC_CB_PRIV_LEN 20
unsigned char   data[QDISC_CB_PRIV_LEN];
 };
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 92a48e2d5461..2fbd1c71fa3b 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -293,6 +293,7 @@ struct __sk_buff {
__u32 tc_index;
__u32 cb[5];
__u32 hash;
+   __u32 tc_classid;
 };
 
 struct bpf_tunnel_key {
diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h
index 4f0d1bc3647d..0a262a83f9d4 100644
--- a/include/uapi/linux/pkt_cls.h
+++ b/include/uapi/linux/pkt_cls.h
@@ -373,6 +373,8 @@ enum {
 
 /* BPF classifier */
 
+#define TCA_BPF_FLAG_ACT_DIRECT (1 << 0)
+
 enum {
TCA_BPF_UNSPEC,
TCA_BPF_ACT,
@@ -382,6 +384,7 @@ enum {
TCA_BPF_OPS,
TCA_BPF_FD,
TCA_BPF_NAME,
+   TCA_BPF_FLAGS,
__TCA_BPF_MAX,
 };
 
diff --git a/net/core/filter.c b/net/core/filter.c
index 13079f03902e..971d6ba89758 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -1632,6 +1632,9 @@ static bool __is_valid_access(int off, int size, enum 
bpf_access_type type)
 static bool sk_filter_is_valid_access(int off, int size,
  enum bpf_access_type type)
 {
+   if (off == offsetof(struct __sk_buff, tc_classid))
+   return false;
+
if (type == BPF_WRITE) {
switch (off) {
case offsetof(struct __sk_buff, cb[0]) ...
@@ -1648,6 +1651,9 @@ static bool sk_filter_is_valid_access(int off, int size,
 static bool tc_cls_act_is_valid_access(int off, int size,
   enum bpf_access_type type)
 {
+   if (off == offsetof(struct __sk_buff, tc_classid))
+   return type == BPF_WRITE ? true : false;
+
if (type == BPF_WRITE) {
switch (off) {
case offsetof(struct __sk_buff, mark):
@@ -1760,6 +1766,14 @@ static u32 bpf_net_convert_ctx_access(enum 
bpf_access_type type, int dst_reg,
*insn++ = BPF_LDX_MEM(BPF_W, dst_reg, src_reg, ctx_off);
break;
 
+   case offsetof(struct __sk_buff, tc_classid):
+   ctx_off -= offsetof(struct __sk_buff, tc_classid);
+   ctx_off += offsetof(struct sk_buff, cb);
+   ctx_off += offsetof(struct qdisc_skb_cb, tc_classid);
+   WARN_ON(type != BPF_WRITE);
+   *insn++ = BPF_STX_MEM(BPF_H, dst_reg, src_reg, ctx_off);
+   break;
+
case offsetof(struct __sk_buff, tc_index):
 #ifdef CONFIG_NET_SCHED
BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, tc_index) != 2);
diff --git a/net/sched/cls_bpf.c b/net/sched/cls_bpf.c
index e5168f8b9640..77b0ef148256 100644
--- a/net/sched/cls_bpf.c
+++ b/net/sched/cls_bpf.c
@@ -38,6 +38,7 @@ struct cls_bpf_prog {
struct bpf_prog *filter;
struct list_head link;
struct tcf_result res;
+   bool exts_integrated;
struct tcf_exts exts;
u32 handle;
union {
@@ -52,6 +53,7 @@ struct cls_bpf_prog {
 
 static const struct nla_policy bpf_policy[TCA_BPF_MAX + 1] = {
[TCA_BPF_CLASSID]   = { .type = NLA_U32 },
+   [TCA_BPF_FLAGS] = { .type = NLA_U32 },
[TCA_BPF_FD]= { .type = NLA_U32 },
[TCA_BPF_NAME] 

[PATCH v2 1/1] eventfd: implementation of EFD_MASK flag

2015-09-16 Thread Damian Hobson-Garcia
From: Martin Sustrik 

When implementing network protocols in user space, one has to implement
fake file descriptors to represent the sockets for the protocol.

Polling on such fake file descriptors is a problem (poll/select/epoll
accept only true file descriptors) and forces protocol implementers to use
various workarounds resulting in complex, non-standard and convoluted APIs.

More generally, the ability to create full-blown file descriptors for
userspace-to-userspace signalling is missing. While eventfd(2) goes half
the way towards this goal, it has the following shortcomings:

I.  There's no way to signal POLLPRI, POLLHUP etc.
II. There's no way to signal arbitrary combination of POLL* flags. Most
notably, simultaneous !POLLIN and !POLLOUT, which is a perfectly valid
combination for a network protocol (rx buffer is empty and tx buffer is
full), cannot be signaled using eventfd.
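
To make the second point concrete, here is a sketch of how a user-space
protocol might derive its readiness mask from buffer state. The function and
its parameters are hypothetical and only illustrate why an "empty" mask is a
legitimate state.

```c
#include <assert.h>
#include <poll.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Compute the poll events a user-space socket would like to report.
 *  rx_queued:   bytes waiting to be read by the application
 *  tx_free:     bytes of free space in the transmit buffer
 *  peer_closed: non-zero once the peer has hung up
 */
uint32_t proto_poll_mask(size_t rx_queued, size_t tx_free, int peer_closed)
{
    uint32_t events = 0;

    if (rx_queued > 0)
        events |= POLLIN;      /* something to read */
    if (tx_free > 0)
        events |= POLLOUT;     /* room to write */
    if (peer_closed)
        events |= POLLHUP;     /* connection is gone */

    return events;
}
```

With an empty receive buffer and a full transmit buffer the result is 0,
which is exactly the "!POLLIN and !POLLOUT" combination described above.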

This patch implements new EFD_MASK flag which solves the above problems.

Additionally, to provide a way to associate user-space state with eventfd
object, it allows to attach user-space data to the file descriptor.

The semantics of EFD_MASK are as follows:

eventfd(2):

If eventfd is created with EFD_MASK flag set, it is initialised in such a
way as to signal no events on the file descriptor when it is polled on.
The 'initval' argument is ignored.

write(2):

User is allowed to write only buffers containing the following structure:

struct efd_mask {
  uint32_t events;
};

The value of 'events' should be any combination of event flags as defined
by poll(2) function (POLLIN, POLLOUT, POLLERR, POLLHUP etc.) Specified
events will be signaled when polling (select, poll, epoll) on the eventfd
is done later on.

read(2):

read is not supported and will fail with EINVAL.

select(2), poll(2) and similar:

When polling on the eventfd marked by EFD_MASK flag, all the events
specified in last written 'events' field shall be signaled.
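
The semantics listed above can be summarised in a small user-space model. It
is a sketch for reasoning about the behaviour, not the kernel implementation:
the model_* names are invented, and the rejection of wrongly sized buffers
and of event bits outside the EFD_MASK_VALID_EVENTS set is an assumption
based on the description and the macro in the diff.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stdint.h>
#include <string.h>

struct efd_mask {
    uint32_t events;
};

/* Same set as EFD_MASK_VALID_EVENTS in the patch. */
#define MODEL_VALID_EVENTS (POLLIN | POLLPRI | POLLOUT | POLLERR | POLLHUP)

struct model_eventfd {
    uint32_t events;   /* last mask written */
};

/* eventfd(2) with EFD_MASK: starts with no events, initval ignored. */
void model_eventfd_init(struct model_eventfd *efd)
{
    efd->events = 0;
}

/* write(2): accept exactly one struct efd_mask and remember it. */
long model_eventfd_write(struct model_eventfd *efd, const void *buf, size_t len)
{
    struct efd_mask m;

    if (len != sizeof(m))                       /* assumed size check */
        return -EINVAL;
    memcpy(&m, buf, sizeof(m));
    if (m.events & ~(uint32_t)MODEL_VALID_EVENTS)   /* assumed validation */
        return -EINVAL;

    efd->events = m.events;
    return (long)sizeof(m);
}

/* read(2): not supported in EFD_MASK mode. */
long model_eventfd_read(const struct model_eventfd *efd)
{
    (void)efd;
    return -EINVAL;
}

/* poll(2) and friends: report the last written mask. */
uint32_t model_eventfd_poll(const struct model_eventfd *efd)
{
    return efd->events;
}
```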

Signed-off-by: Martin Sustrik 

[dhobs...@igel.co.jp: Rebased, and resubmitted for Linux 4.3]
Signed-off-by: Damian Hobson-Garcia 
---
 fs/eventfd.c | 102 ++-
 include/linux/eventfd.h  |  16 +--
 include/uapi/linux/eventfd.h |  39 +
 3 files changed, 132 insertions(+), 25 deletions(-)
 create mode 100644 include/uapi/linux/eventfd.h

diff --git a/fs/eventfd.c b/fs/eventfd.c
index 8d0c0df..1a6a066 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -2,6 +2,7 @@
  *  fs/eventfd.c
  *
  *  Copyright (C) 2007  Davide Libenzi 
+ *  Copyright (C) 2013  Martin Sustrik 
  *
  */
 
@@ -22,18 +23,31 @@
 #include 
 #include 
 
+#define EFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
+#define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE | EFD_MASK)
+#define EFD_MASK_VALID_EVENTS (POLLIN | POLLPRI | POLLOUT | POLLERR | POLLHUP)
+
 struct eventfd_ctx {
struct kref kref;
wait_queue_head_t wqh;
-   /*
-* Every time that a write(2) is performed on an eventfd, the
-* value of the __u64 being written is added to "count" and a
-* wakeup is performed on "wqh". A read(2) will return the "count"
-* value to userspace, and will reset "count" to zero. The kernel
-* side eventfd_signal() also, adds to the "count" counter and
-* issue a wakeup.
-*/
-   __u64 count;
+   union {
+   /*
+* Every time that a write(2) is performed on an eventfd, the
+* value of the __u64 being written is added to "count" and a
+* wakeup is performed on "wqh". A read(2) will return the
+* "count" value to userspace, and will reset "count" to zero.
+* The kernel side eventfd_signal() also, adds to the "count"
+* counter and issue a wakeup.
+*/
+   __u64 count;
+
+   /*
+* When using eventfd in EFD_MASK mode this structure stores the
+* current events to be signaled on the eventfd (events member)
+* along with opaque user-defined data (data member).
+*/
+   struct efd_mask mask;
+   };
unsigned int flags;
 };
 
@@ -134,6 +148,14 @@ static unsigned int eventfd_poll(struct file *file, 
poll_table *wait)
return events;
 }
 
+static unsigned int eventfd_mask_poll(struct file *file, poll_table *wait)
+{
+   struct eventfd_ctx *ctx = file->private_data;
+
+   poll_wait(file, &ctx->wqh, wait);
+   return ctx->mask.events;
+}
+
 static void eventfd_ctx_do_read(struct eventfd_ctx *ctx, __u64 *cnt)
 {
*cnt = (ctx->flags & EFD_SEMAPHORE) ? 1 : ctx->count;
@@ -239,6 +261,14 @@ static ssize_t eventfd_read(struct file *file, char __user 
*buf, size_t count,
return put_user(cnt, (__u64 __user *) buf) ? -EFAULT : sizeof(cnt);
 }
 

[PATCH v2 0/1] Generalize poll events from eventfd

2015-09-16 Thread Damian Hobson-Garcia
Using eventfd, user space can generate POLLIN/POLLOUT events, but some
applications may want to generate POLLPRI/POLLERR events as well.
This patch submission aims to generalize the events generated by an
eventfd. This is a resubmission of a patch from Feb 2013[1]. The original
discussion trailed off without any conclusion, but the original author
has recently confirmed[2] that this functionality is still useful, so I
volunteered to rebase and resubmit the patch for discussion.

[1] https://lkml.org/lkml/2013/2/18/147
[2] https://lkml.org/lkml/2015/7/9/153

Changes in v2
-------------

* rebased on Linux v4.3-rc1
* Move file operation implementations for EFD_MASK to a separate structure
* Remove 'data' element from efd_mask structure
* read() is no longer supported when EFD_MASK is set (fails with EINVAL)
* eventfd_ctx_fileget() now returns EINVAL when EFD_MASK is set, eliminating
  the possibility of triggering the original BUG_ON() macros which have now
  been removed.

Thank you,
Damian

Martin Sustrik (1):
  eventfd: implementation of EFD_MASK flag

 fs/eventfd.c | 91 ++--
 include/linux/eventfd.h  | 16 +---
 include/uapi/linux/eventfd.h | 40 +++
 3 files changed, 121 insertions(+), 26 deletions(-)
 create mode 100644 include/uapi/linux/eventfd.h

-- 
1.9.1



Re: kernel 4.2 : "bridge vlan" command return empty result (works with kernel 4.1.3)

2015-09-16 Thread Alexandre DERUMIER
>> Do you have a bond in your system?

Yes, indeed.
Removing the bond fixes the problem.

I'll try your patch today.


Thanks !

Alexandre

----- Original Message -----
From: "roopa"
To: "aderumier"
Cc: "netdev", "Scott Feldman"

Sent: Tuesday, 15 September 2015 21:02:34
Subject: Re: kernel 4.2 : "bridge vlan" command return empty result (works with kernel 4.1.3)

On 9/15/15, 10:39 AM, Alexandre DERUMIER wrote: 
> Hi, 
> 
> Since kernel 4.2, the "bridge vlan" command returns an empty result.
> 
> 
> kernel 4.1.3 
>  
> # bridge vlan 
> port vlan ids 
> eth0 1 PVID Egress Untagged 
> 90 
> 91 
> 92 
> 93 
> 94 
> 95 
> 96 
> 97 
> 98 
> 99 
> 100 
> 
> vmbr0 1 PVID Egress Untagged 
> 94 
> 
> 
> 
> kernel 4.2 
>  
> # bridge vlan 
> port vlan ids 
> 
> 
> 
> Note that VLANs are working correctly; it seems that only the display is affected.
> 
> tcpdump -e -i vmbr0 
> 
> 19:38:08.005055 00:08:7c:bd:ae:40 (oui Unknown) > 00:18:8b:7c:c8:37 (oui 
> Unknown), ethertype 802.1Q (0x8100), length 64: vlan 94, p 0, ethertype IPv4, 
> 172.20.0.17.52299 > kvmtest2.odiso.net.ssh: Flags [.], ack 339613, win 5523, 
> length 0 
> 19:38:08.007730 00:08:7c:bd:ae:40 (oui Unknown) > 00:18:8b:7c:c8:37 (oui 
> Unknown), ethertype 802.1Q (0x8100), length 64: vlan 94, p 0, ethertype IPv4, 
> 172.20.0.17.52299 > kvmtest2.odiso.net.ssh: Flags [.], ack 342145, win 5568, 
> length 0 
> 19:38:08.010977 00:08:7c:bd:ae:40 (oui Unknown) > 00:18:8b:7c:c8:37 (oui 
> Unknown), ethertype 802.1Q (0x8100), length 64: vlan 94, p 0, ethertype IPv4, 
> 172.20.0.17.52299 > kvmtest2.odiso.net.ssh: Flags [.], ack 344677, win 5614, 
> length 0 
> 19:3 
I was able to reproduce this when there is a bond in the system. 

Looks like this was due to 85fdb956726ff2a ("switchdev: cut over to new 
switchdev_port_bridge_getlink"). 
When CONFIG_SWITCHDEV is off, nodes that use switchdev api for 
ndo_bridge_getlink (example, bonds, teams, rocker) can return 
-EOPNOTSUPP. The problem went away on my box with the following patch. I 
will submit an official patch in a bit. 
Do you have a bond in your system?

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c 
index 01ced4a..bdb3842 100644 
--- a/net/core/rtnetlink.c 
+++ b/net/core/rtnetlink.c 
@@ -3013,6 +3013,7 @@ static int rtnl_bridge_getlink(struct sk_buff 
*skb, struct 
u32 portid = NETLINK_CB(cb->skb).portid; 
u32 seq = cb->nlh->nlmsg_seq; 
u32 filter_mask = 0; 
+ int err; 

if (nlmsg_len(cb->nlh) > sizeof(struct ifinfomsg)) { 
struct nlattr *extfilt; 
@@ -3033,20 +3034,25 @@ static int rtnl_bridge_getlink(struct sk_buff 
*skb, stru 
struct net_device *br_dev = 
netdev_master_upper_dev_get(dev); 

if (br_dev && br_dev->netdev_ops->ndo_bridge_getlink) { 
- if (idx >= cb->args[0] && 
- br_dev->netdev_ops->ndo_bridge_getlink( 
- skb, portid, seq, dev, filter_mask, 
- NLM_F_MULTI) < 0) 
- break; 
+ if (idx >= cb->args[0]) { 
+ err = 
br_dev->netdev_ops->ndo_bridge_getlink( 
+ skb, portid, seq, dev, 
+ filter_mask, NLM_F_MULTI); 
+ if ( err < 0 && err != -EOPNOTSUPP) 
+ break; 
+ } 
idx++; 
} 

if (ops->ndo_bridge_getlink) { 
- if (idx >= cb->args[0] && 
- ops->ndo_bridge_getlink(skb, portid, seq, dev, 
- filter_mask, 
- NLM_F_MULTI) < 0) 
- break; 
+ if (idx >= cb->args[0]) { 
+ err = ops->ndo_bridge_getlink(skb, portid, 
+ seq, dev, 
+ filter_mask, 
+ NLM_F_MULTI); 
+ if ( err < 0 && err != -EOPNOTSUPP) 
+ break; 
+ } 
idx++; 
} 
} 
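
The effect of the change above can be modelled in user space. In this sketch
(all names are invented and the kernel's real loop does more), a device whose
getlink callback returns -EOPNOTSUPP either aborts the whole dump, as before
the patch, or is simply skipped, as after it.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for ndo_bridge_getlink: 0 on success, negative errno on error. */
typedef int (*getlink_fn)(int ifindex);

int getlink_ok(int ifindex)          { (void)ifindex; return 0; }
int getlink_unsupported(int ifindex) { (void)ifindex; return -EOPNOTSUPP; }

/*
 * Walk all devices and count how many were dumped successfully.
 * skip_unsupported == false models the old behaviour (any error stops
 * the dump); true models the patched behaviour (-EOPNOTSUPP is ignored).
 */
size_t dump_links(const getlink_fn *ops, size_t n, bool skip_unsupported)
{
    size_t dumped = 0;

    assert(ops != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        int err = ops[i]((int)i + 1);

        if (err < 0) {
            if (skip_unsupported && err == -EOPNOTSUPP)
                continue;       /* this device has nothing to report */
            break;              /* real error: stop the dump */
        }
        dumped++;
    }
    return dumped;
}
```

If the first device in the list is a bond that returns -EOPNOTSUPP, the old
loop stops before reporting anything, which matches the empty "bridge vlan"
output, while the patched loop still reports the remaining devices.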


Re: [PATCH v2 net-next 2/2] bpf: add bpf_redirect() helper

2015-09-16 Thread John Fastabend
On 15-09-15 11:05 PM, Alexei Starovoitov wrote:
> Existing bpf_clone_redirect() helper clones skb before redirecting
> it to RX or TX of destination netdev.
> Introduce bpf_redirect() helper that does that without cloning.
> 
> Benchmarked with two hosts using 10G ixgbe NICs.
> One host is doing line rate pktgen.
> Another host is configured as:
> $ tc qdisc add dev $dev ingress
> $ tc filter add dev $dev root pref 10 u32 match u32 0 0 flowid 1:2 \
>action bpf run object-file tcbpf1_kern.o section clone_redirect_xmit drop
> so it receives the packet on $dev and immediately xmits it on $dev + 1
> The section 'clone_redirect_xmit' in tcbpf1_kern.o file has the program
> that does bpf_clone_redirect() and performance is 2.0 Mpps
> 
> $ tc filter add dev $dev root pref 10 u32 match u32 0 0 flowid 1:2 \
>action bpf run object-file tcbpf1_kern.o section redirect_xmit drop
> which is using bpf_redirect() - 2.4 Mpps
> 
> and using cls_bpf with integrated actions as:
> $ tc filter add dev $dev root pref 10 \
>   bpf run object-file tcbpf1_kern.o section redirect_xmit integ_act classid 1
> performance is 2.5 Mpps
> 
> To summarize:
> u32+act_bpf using clone_redirect - 2.0 Mpps
> u32+act_bpf using redirect - 2.4 Mpps
> cls_bpf using redirect - 2.5 Mpps
> 
> For comparison linux bridge in this setup is doing 2.1 Mpps
> and ixgbe rx + drop in ip_rcv - 7.8 Mpps
> 
> Signed-off-by: Alexei Starovoitov 
> Acked-by: Daniel Borkmann 
> ---
> This approach is using per_cpu scratch area to store ifindex and flags.
> The other alternatives discussed at plumbers are slower and more intrusive.
> v1->v2: dropped redundant iff_up check
> 
>  include/net/sch_generic.h|1 +
>  include/uapi/linux/bpf.h |8 
>  include/uapi/linux/pkt_cls.h |1 +
>  net/core/dev.c   |8 
>  net/core/filter.c|   44 
> ++
>  net/sched/act_bpf.c  |1 +
>  net/sched/cls_bpf.c  |1 +
>  samples/bpf/bpf_helpers.h|4 
>  samples/bpf/tcbpf1_kern.c|   24 ++-
>  9 files changed, 91 insertions(+), 1 deletion(-)
> 

Acked-by: John Fastabend 




Re: bnx2x - occasional high packet loss (on LAN)

2015-09-16 Thread Nikola Ciprich
On Wed, Sep 16, 2015 at 08:15:41AM +, Ariel Elior wrote:
> Hi Nikola,
> Please provide dmesg output from your system.
> Thanks,
> Ariel

Hello Ariel,

here it is:

http://nik.lbox.cz/download/dmesg.txt

BR

nik


> 


-- 
-
Ing. Nikola CIPRICH
LinuxBox.cz, s.r.o.
28.rijna 168, 709 00 Ostrava

tel.:   +420 591 166 214
fax:+420 596 621 273
mobil:  +420 777 093 799
www.linuxbox.cz

mobil servis: +420 737 238 656
email servis: ser...@linuxbox.cz
-




[PATCH v4] add stealth mode

2015-09-16 Thread Matteo Croce
Add option to disable any reply not related to a listening socket,
like RST/ACK for TCP and ICMP Port-Unreachable for UDP.
Also disables ICMP replies to echo request and timestamp.
The stealth mode can be enabled selectively for a single interface.
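
The intended behaviour can be summarised as a small decision function. This
is a simplified user-space model, not the patch itself: the enum and function
names are invented, and other settings that may also suppress replies (for
example icmp_echo_ignore_all) are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>

enum probe_kind {
    PROBE_TCP_NO_LISTENER,   /* would normally be answered with RST/ACK */
    PROBE_UDP_NO_LISTENER,   /* would normally get ICMP Port-Unreachable */
    PROBE_ICMP_ECHO,         /* ping */
    PROBE_ICMP_TIMESTAMP,
    PROBE_TO_LISTENER,       /* traffic for a socket that is listening */
};

/* Returns true if the host sends any reply to the given incoming packet. */
bool host_replies(enum probe_kind kind, bool stealth)
{
    assert(kind <= PROBE_TO_LISTENER);

    /* Traffic for a listening socket is never affected. */
    if (kind == PROBE_TO_LISTENER)
        return true;

    /* Everything else is answered only when stealth mode is off. */
    return !stealth;
}
```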

Signed-off-by: Matteo Croce 
---
rebased on 4.3-rc1

 Documentation/networking/ip-sysctl.txt | 14 ++
 include/linux/inetdevice.h |  1 +
 include/linux/ipv6.h   |  1 +
 include/uapi/linux/ip.h|  1 +
 net/ipv4/devinet.c |  1 +
 net/ipv4/icmp.c|  6 ++
 net/ipv4/ip_input.c|  5 +++--
 net/ipv4/tcp_ipv4.c|  3 ++-
 net/ipv4/udp.c |  4 +++-
 net/ipv6/addrconf.c|  7 +++
 net/ipv6/icmp.c|  3 ++-
 net/ipv6/ip6_input.c   |  5 +++--
 net/ipv6/tcp_ipv6.c|  2 +-
 net/ipv6/udp.c |  3 ++-
 14 files changed, 47 insertions(+), 9 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt 
b/Documentation/networking/ip-sysctl.txt
index ebe94f2..1d46adc 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -1206,6 +1206,13 @@ igmp_link_local_mcast_reports - BOOLEAN
224.0.0.X range.
Default TRUE
 
+stealth - BOOLEAN
+   Disable any reply not related to a listening socket,
+   like RST/ACK for TCP and ICMP Port-Unreachable for UDP.
+   Also disables ICMP replies to echo requests and timestamp
+   and ICMP errors for unknown protocols.
+   Default value is 0.
+
 Alexey Kuznetsov.
 kuz...@ms2.inr.ac.ru
 
@@ -1635,6 +1642,13 @@ stable_secret - IPv6 address
 
By default the stable secret is unset.
 
+stealth - BOOLEAN
+   Disable any reply not related to a listening socket,
+   like RST/ACK for TCP and ICMP Port-Unreachable for UDP.
+   Also disables ICMPv6 replies to echo requests
+   and ICMP errors for unknown protocols.
+   Default value is 0.
+
 icmp/*:
 ratelimit - INTEGER
Limit the maximal rates for sending ICMPv6 packets.
diff --git a/include/linux/inetdevice.h b/include/linux/inetdevice.h
index a4328ce..a64c01e 100644
--- a/include/linux/inetdevice.h
+++ b/include/linux/inetdevice.h
@@ -128,6 +128,7 @@ static inline void ipv4_devconf_setall(struct in_device 
*in_dev)
 #define IN_DEV_ARP_ANNOUNCE(in_dev)IN_DEV_MAXCONF((in_dev), ARP_ANNOUNCE)
 #define IN_DEV_ARP_IGNORE(in_dev)  IN_DEV_MAXCONF((in_dev), ARP_IGNORE)
 #define IN_DEV_ARP_NOTIFY(in_dev)  IN_DEV_MAXCONF((in_dev), ARP_NOTIFY)
+#define IN_DEV_STEALTH(in_dev) IN_DEV_MAXCONF((in_dev), STEALTH)
 
 struct in_ifaddr {
struct hlist_node   hash;
diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
index f1f32af..a9d0172 100644
--- a/include/linux/ipv6.h
+++ b/include/linux/ipv6.h
@@ -55,6 +55,7 @@ struct ipv6_devconf {
__s32   ndisc_notify;
__s32   suppress_frag_ndisc;
__s32   accept_ra_mtu;
+   __s32   stealth;
struct ipv6_stable_secret {
bool initialized;
struct in6_addr secret;
diff --git a/include/uapi/linux/ip.h b/include/uapi/linux/ip.h
index 08f894d..4acbf99 100644
--- a/include/uapi/linux/ip.h
+++ b/include/uapi/linux/ip.h
@@ -165,6 +165,7 @@ enum
IPV4_DEVCONF_IGMPV2_UNSOLICITED_REPORT_INTERVAL,
IPV4_DEVCONF_IGMPV3_UNSOLICITED_REPORT_INTERVAL,
IPV4_DEVCONF_IGNORE_ROUTES_WITH_LINKDOWN,
+   IPV4_DEVCONF_STEALTH,
__IPV4_DEVCONF_MAX
 };
 
diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index 2d9cb17..6d9c080 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -2190,6 +2190,7 @@ static struct devinet_sysctl_table {
  "promote_secondaries"),
DEVINET_SYSCTL_FLUSHING_ENTRY(ROUTE_LOCALNET,
  "route_localnet"),
+   DEVINET_SYSCTL_RW_ENTRY(STEALTH, "stealth"),
},
 };
 
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index 79fe05b..4cd35b2 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -889,6 +889,9 @@ static bool icmp_echo(struct sk_buff *skb)
 {
struct net *net;
 
+   if (IN_DEV_STEALTH(skb->dev->ip_ptr))
+   return true;
+
net = dev_net(skb_dst(skb)->dev);
if (!net->ipv4.sysctl_icmp_echo_ignore_all) {
struct icmp_bxm icmp_param;
@@ -922,6 +925,9 @@ static bool icmp_timestamp(struct sk_buff *skb)
if (skb->len < 4)
goto out_err;
 
+   if (IN_DEV_STEALTH(skb->dev->ip_ptr))
+   return true;
+
/*
 *  Fill in the current time as ms since midnight UT:
 */
diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index f4fc8a7..e75f250 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ 

Re: [PATCH v2 1/1] eventfd: implementation of EFD_MASK flag

2015-09-16 Thread Martin Sustrik

On 2015-09-16 08:27, Damian Hobson-Garcia wrote:

From: Martin Sustrik 

When implementing network protocols in user space, one has to implement
fake file descriptors to represent the sockets for the protocol.

Polling on such fake file descriptors is a problem (poll/select/epoll
accept only true file descriptors) and forces protocol implementers to use
various workarounds resulting in complex, non-standard and convoluted APIs.

More generally, the ability to create full-blown file descriptors for
userspace-to-userspace signalling is missing. While eventfd(2) goes half
the way towards this goal, it has the following shortcomings:

I.  There's no way to signal POLLPRI, POLLHUP etc.
II. There's no way to signal arbitrary combination of POLL* flags. Most
notably, simultaneous !POLLIN and !POLLOUT, which is a perfectly valid
combination for a network protocol (rx buffer is empty and tx buffer is
full), cannot be signaled using eventfd.

This patch implements new EFD_MASK flag which solves the above problems.

Additionally, to provide a way to associate user-space state with eventfd
object, it allows to attach user-space data to the file descriptor.


The above paragraph is a leftover from the past. The functionality no
longer exists.




The semantics of EFD_MASK are as follows:

eventfd(2):

If eventfd is created with EFD_MASK flag set, it is initialised in such a
way as to signal no events on the file descriptor when it is polled on.
The 'initval' argument is ignored.

write(2):

User is allowed to write only buffers containing the following structure:


struct efd_mask {
  uint32_t events;
};


Is it worth having a struct here? Why not just uint32_t?

Martin



The value of 'events' should be any combination of event flags as defined
by poll(2) function (POLLIN, POLLOUT, POLLERR, POLLHUP etc.) Specified
events will be signaled when polling (select, poll, epoll) on the eventfd
is done later on.

read(2):

read is not supported and will fail with EINVAL.

select(2), poll(2) and similar:

When polling on the eventfd marked by EFD_MASK flag, all the events
specified in last written 'events' field shall be signaled.

Signed-off-by: Martin Sustrik 

[dhobs...@igel.co.jp: Rebased, and resubmitted for Linux 4.3]
Signed-off-by: Damian Hobson-Garcia 
---
 fs/eventfd.c | 102 ++-
 include/linux/eventfd.h  |  16 +--
 include/uapi/linux/eventfd.h |  39 +
 3 files changed, 132 insertions(+), 25 deletions(-)
 create mode 100644 include/uapi/linux/eventfd.h

diff --git a/fs/eventfd.c b/fs/eventfd.c
index 8d0c0df..1a6a066 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -2,6 +2,7 @@
  *  fs/eventfd.c
  *
  *  Copyright (C) 2007  Davide Libenzi 
+ *  Copyright (C) 2013  Martin Sustrik 
  *
  */

@@ -22,18 +23,31 @@
 #include 
 #include 

+#define EFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
+#define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE | EFD_MASK)
+#define EFD_MASK_VALID_EVENTS (POLLIN | POLLPRI | POLLOUT | POLLERR | POLLHUP)
+
 struct eventfd_ctx {
struct kref kref;
wait_queue_head_t wqh;
-   /*
-* Every time that a write(2) is performed on an eventfd, the
-* value of the __u64 being written is added to "count" and a
-* wakeup is performed on "wqh". A read(2) will return the "count"
-* value to userspace, and will reset "count" to zero. The kernel
-* side eventfd_signal() also, adds to the "count" counter and
-* issue a wakeup.
-*/
-   __u64 count;
+   union {
+   /*
+* Every time that a write(2) is performed on an eventfd, the
+* value of the __u64 being written is added to "count" and a
+* wakeup is performed on "wqh". A read(2) will return the
+* "count" value to userspace, and will reset "count" to zero.
+* The kernel side eventfd_signal() also, adds to the "count"
+* counter and issue a wakeup.
+*/
+   __u64 count;
+
+   /*
+* When using eventfd in EFD_MASK mode this structure stores the
+* current events to be signaled on the eventfd (events member)
+* along with opaque user-defined data (data member).
+*/
+   struct efd_mask mask;
+   };
unsigned int flags;
 };

@@ -134,6 +148,14 @@ static unsigned int eventfd_poll(struct file *file, poll_table *wait)
return events;
 }

+static unsigned int eventfd_mask_poll(struct file *file, poll_table *wait)
+{
+   struct eventfd_ctx *ctx = file->private_data;
+
+   poll_wait(file, &ctx->wqh, wait);
+   return ctx->mask.events;
+}
+
 static void eventfd_ctx_do_read(struct eventfd_ctx *ctx, __u64 *cnt)
 {
*cnt = 

Re: [PATCH 1/4] stmmac: replace all pr_xxx by their dev_xxx counterpart

2015-09-16 Thread LABBE Corentin
On Wed, Sep 09, 2015 at 09:14:42AM -0700, Joe Perches wrote:
> On Wed, 2015-09-09 at 15:14 +0200, LABBE Corentin wrote:
> > The stmmac driver use lots of pr_xxx functions to print information.
> > This is bad since we cannot know which device logs the information.
> > (moreover if two stmmac device are present)
> []
> > So this patch replace all pr_xxx by their dev_xxx counterpart.
> 
> Using
>   netdev_<level>(priv->dev, ...
> instead of
>   dev_<level>(priv->device,
> 
> would be more consistent with other ethernet devices.
> 
> > diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> []
> > @@ -298,7 +298,7 @@ bool stmmac_eee_init(struct stmmac_priv *priv)
> >  */
> > spin_lock_irqsave(&priv->lock, flags);
> > if (priv->eee_active) {
> > -   pr_debug("stmmac: disable EEE\n");
> > +   dev_dbg(priv->device, "disable EEE\n");
> 
>   netdev_dbg(priv->dev, ...)
> 
> > @@ -657,10 +657,10 @@ static int stmmac_init_ptp(struct stmmac_priv *priv)
> > priv->adv_ts = 1;
> >  
> > if (netif_msg_hw(priv) && priv->dma_cap.time_stamp)
> > -   pr_debug("IEEE 1588-2002 Time Stamp supported\n");
> > +   dev_dbg(priv->device, "IEEE 1588-2002 Time Stamp supported\n");
> 
> And these netif_msg_ could be
> 
>   if (priv->dma_cap.timestamp)
>   netif_dbg(priv, hw, priv->dev, ...);
> 
> 

Hello

My main goal is to improve logging from
[0.796804] stmmaceth 1c5.ethernet: no reset control found
[0.802635]  Ring mode enabled
[0.805713]  No HW DMA feature register supported
[0.810239]  Normal descriptors
[0.813577]  TX Checksum insertion supported
[   23.615074] eth0: device MAC address aa:65:84:d5:a3:58
[   23.704326]  RX IPC Checksum Offload disabled
[   23.704349]  No MAC Management Counters available

to that:
[0.788147] sun7i-dwmac 1c5.ethernet (unnamed net_device) (uninitialized): no reset control found
[0.797400] sun7i-dwmac 1c5.ethernet (unnamed net_device) (uninitialized): Ring mode enabled
[0.806211] sun7i-dwmac 1c5.ethernet (unnamed net_device) (uninitialized): No HW DMA feature register supported
[0.816658] sun7i-dwmac 1c5.ethernet (unnamed net_device) (uninitialized): Normal descriptors
[0.825522] sun7i-dwmac 1c5.ethernet (unnamed net_device) (uninitialized): TX Checksum insertion supported
[   12.971725] sun7i-dwmac 1c5.ethernet eth0: device MAC address 3e:62:18:6f:c7:f4
[   13.056902] sun7i-dwmac 1c5.ethernet eth0: RX IPC Checksum Offload disabled
[   13.056929] sun7i-dwmac 1c5.ethernet eth0: No MAC Management Counters available

But by using the netdev_ functions, the first five lines are not "pretty"
because of the "(unnamed net_device) (uninitialized)" prefix.
Could I switch back to dev_xxx for these, since they are "early device
logging", and so make them prettier?

Best regards

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] net: smc91x: convert pxa dma to dmaengine

2015-09-16 Thread Robert Jarzmik
David Miller  writes:

> From: Robert Jarzmik 
> Date: Thu, 10 Sep 2015 21:26:04 +0200
>
>> Convert the dma transfers to be dmaengine based, now pxa has a dmaengine
>> slave driver. This makes this driver a bit more PXA agnostic.
>> 
>> The driver was tested on pxa27x (mainstone) and pxa310 (zylonite),
>> ie. only pxa platforms.
>> 
>> Signed-off-by: Robert Jarzmik 
>> Cc: Russell King 
>> Cc: Arnd Bergmann 
>> ---
>> This has potential to break other platform such as Neponset, Idp,
>> halibut and qsd8x50, so I added Russell and Arnd as they were discussing
>> smc91x support last February.
>

> Is someone testing whether such platforms break or not?  I'm waiting for
> that before I consider applying this patch.

My understanding is that Russell is the only one left testing them, or at least
he was the only one complaining about a breakage lately on neponset.

I can wait several weeks for Russell to have a bit of time to try : I know it
will compile correctly at least for neponset, and I know almost all the code is
under #ifdef CONFIG_ARCH_PXA. And still I would feel far more comfortable if it
was tested, just as you.

Cheers.

-- 
Robert


request for stable inclusion

2015-09-16 Thread Or Gerlitz

Hi Dave,

Commit 9293267 "net/mlx4_core: Capping number of requested MSIXs to
MAX_MSIX" fixes a bug under which the driver doesn't really start on
a machine with > 32 cores.


The bug was introduced in 4.2-rc1 but the fix missed 4.2 -- could you 
please push it to 4.2 -stable?


If you prefer that we will submit it directly there, fine too.

thanks,

Or.


vhost: build failure

2015-09-16 Thread Sudip Mukherjee
Hi,
While crosscompiling the kernel for openrisc with allmodconfig the build
failed with the error:
drivers/vhost/vhost.c: In function 'vhost_vring_ioctl':
drivers/vhost/vhost.c:818:3: error: call to '__compiletime_assert_818' declared with attribute error: BUILD_BUG_ON failed: __alignof__ *vq->avail > VRING_AVAIL_ALIGN_SIZE

Can you please give me any idea about what the problem might be and how
it can be solved.

You can see the build log at:
https://travis-ci.org/sudipm-mukherjee/parport/jobs/80365425

regards
sudip


Re: NFS/TCP/IPv6 acting strangely in 4.2

2015-09-16 Thread Willy Tarreau
Hi,

On Wed, Sep 16, 2015 at 06:53:57AM +, Damien Thébault wrote:
> On Fri, 2015-09-11 at 12:38 +0100, Russell King - ARM Linux wrote:
> > I have a recent Marvell Armada 388 board here which uses the mvneta
> > driver.  I'm seeing some weird effects with NFS with it acting as a
> > client.
> 
> Hello,
> 
> I'm upgrading a Marvell Armada 370 board using the mvneta driver from
> 4.0 to 4.2 and noticed issues with NFS booting.
> Basically, most of the time init returns with an error code, or
> programs segfault or throw illegal instructions.
> 
> Since it worked fine on 4.0 I bisected until I found commit
> a84e32894191cfcbffa54180d78d7d4654d56c20 "net: mvneta: fix refilling
> for Rx DMA buffers".
> 
> If I revert this commit, everything seems to get back to normal.
> Could you try it ? The two issues look very similar.

I'm not sure but I'm seeing that the accounting was changed by this
patch without being certain of the implications; if the revert above
works, it would be nice to try to only apply this just to see if
that's indeed an accounting error or not :

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index 62e48bc..4205867 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -1463,6 +1463,7 @@ static int mvneta_rx(struct mvneta_port *pp, int rx_todo,
 {
struct net_device *dev = pp->dev;
int rx_done;
+   int missed = 0;
u32 rcvd_pkts = 0;
u32 rcvd_bytes = 0;
 
@@ -1527,6 +1528,7 @@ static int mvneta_rx(struct mvneta_port *pp, int rx_todo,
if (err) {
netdev_err(dev, "Linux processing - Can't refill\n");
rxq->missed++;
+   missed++;
goto err_drop_frame;
}
 
@@ -1561,7 +1563,7 @@ static int mvneta_rx(struct mvneta_port *pp, int rx_todo,
}
 
/* Update rxq management counters */
-   mvneta_rxq_desc_num_update(pp, rxq, rx_done, rx_done);
+   mvneta_rxq_desc_num_update(pp, rxq, rx_done, rx_done - missed);
 
return rx_done;
 }
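
The accounting idea in the diff above can be shown with a small model. This
is a sketch only, assuming the last two arguments of
mvneta_rxq_desc_num_update() are the number of descriptors processed and the
number of descriptors that received a fresh buffer; rx_round and rx_account
are hypothetical names.

```c
#include <assert.h>

/* Hypothetical model of one RX poll round. */
struct rx_round {
	int processed;	/* descriptors consumed (rx_done in the diff) */
	int refilled;	/* descriptors that got a fresh buffer */
};

/* Descriptors whose refill failed must not be reported as refilled. */
static struct rx_round rx_account(int rx_done, int missed)
{
	struct rx_round r;

	assert(rx_done >= 0 && missed >= 0 && missed <= rx_done);
	r.processed = rx_done;
	r.refilled = rx_done - missed;
	return r;
}
```

With 64 descriptors processed and 3 failed refills, the model reports 64
processed and 61 refilled, where the unpatched call would report 64 for both.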

Regards,
Willy



Re: vhost: build failure

2015-09-16 Thread Michael S. Tsirkin
On Wed, Sep 16, 2015 at 01:50:08PM +0530, Sudip Mukherjee wrote:
> Hi,
> While crosscompiling the kernel for openrisc with allmodconfig the build
> failed with the error:
> drivers/vhost/vhost.c: In function 'vhost_vring_ioctl':
> drivers/vhost/vhost.c:818:3: error: call to '__compiletime_assert_818' 
> declared with attribute error: BUILD_BUG_ON failed: __alignof__
> *vq->avail > VRING_AVAIL_ALIGN_SIZE
> 
> Can you please give me any idea about what the problem might be and how
> it can be solved.
> 
> You can see the build log at:
> https://travis-ci.org/sudipm-mukherjee/parport/jobs/80365425
> 
> regards
> sudip

Yes - I think I saw this already.
I think the openrisc cross-compiler is broken.

VRING_AVAIL_ALIGN_SIZE is 2

*vq->avail is:

struct vring_avail {
__virtio16 flags;
__virtio16 idx;
__virtio16 ring[];
};

And __virtio16 is just a u16 with some sparse annotations.

Looking at openrisc architecture document:
Operand:            Length   addr[3:0] if aligned
Halfword (or half)  2 bytes  Xxx0

Type   C-TYPE        Sizeof  Alignment  Openrisc Equivalent
Short  Signed short  2       2          Signed halfword

and

16.1.2
Aggregates and Unions
Aggregates (structures and arrays) and unions assume the alignment of their most
strictly aligned element.

So to me, it looks like your gcc violates the ABI
by adding alignment requirements > 2.
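
The expectation above can be checked in userspace with C11 _Alignof. The
sketch below is an illustration, not the kernel check: vring_avail_model
replaces __virtio16 with uint16_t, and avail_alignment and
avail_alignment_ok are hypothetical helpers. On an ABI that follows the rule
quoted above, the aggregate takes the alignment of its most strictly aligned
member, which is 2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_AVAIL_ALIGN 2	/* VRING_AVAIL_ALIGN_SIZE per the message above */

/* Sanity check on the member type itself. */
static_assert(_Alignof(uint16_t) <= MODEL_AVAIL_ALIGN,
	      "uint16_t is expected to need at most 2-byte alignment");

/* Userspace stand-in for struct vring_avail: only 16-bit members. */
struct vring_avail_model {
	uint16_t flags;
	uint16_t idx;
	uint16_t ring[];	/* flexible array member, as in the original */
};

/* Alignment the compiler actually assigns to the aggregate. */
static size_t avail_alignment(void)
{
	return _Alignof(struct vring_avail_model);
}

/* Inverse of the BUILD_BUG_ON condition: non-zero means the build would pass. */
static int avail_alignment_ok(void)
{
	return avail_alignment() <= MODEL_AVAIL_ALIGN;
}
```

A compiler that pads structure alignment up to the word size would make
avail_alignment_ok() return 0, which matches the reported build failure.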

-- 
MST


[linux-next] oops in ip_route_input_noref

2015-09-16 Thread Sergey Senozhatsky
Hi,

4.3.0-rc1-next-20150916

oops after removal of rndis usb device

...
8146c052:   00 
8146c053:   0f b6 55 8a movzbl -0x76(%rbp),%edx
8146c057:   49 8b bf e8 01 00 00mov0x1e8(%r15),%rdi
8146c05e:   45 89 d1mov%r10d,%r9d
8146c061:   44 89 f6mov%r14d,%esi
8146c064:   44 88 95 70 ff ff ffmov%r10b,-0x90(%rbp)
8146c06b:   0f 95 c1setne  %cl
8146c06e:   81 ce 00 00 00 80   or $0x8000,%esi
8146c074:   41 83 e1 01 and$0x1,%r9d
8146c078:   45 31 c0xor%r8d,%r8d
8146c07b:   e8 49 d5 ff ff  callq  814695c9 

8146c080:   48 85 c0test   %rax,%rax
8146c083:   49 89 c5mov%rax,%r13
8146c086:   75 0a   jne8146c092 
<ip_route_input_noref+0xa75>
8146c088:   bb 97 ff ff ff  mov$0xff97,%ebx
8146c08d:   e9 06 f8 ff ff  jmpq   8146b898 
<ip_route_input_noref+0x27b>
8146c092:   48 c7 40 58 a3 95 46movq   
$0x814695a3,0x58(%rax)
8146c099:   81 
8146c09a:   c6 80 a2 00 00 00 01movb   $0x1,0xa2(%rax)
8146c0a1:   48 8b 45 98 mov-0x68(%rbp),%rax
8146c0a5:   44 8a 95 70 ff ff ffmov-0x90(%rbp),%r10b
8146c0ac:   48 85 c0test   %rax,%rax
8146c0af:   74 0a   je 8146c0bb 
<ip_route_input_noref+0xa9e>
8146c0b1:   8b 40 10mov0x10(%rax),%eax
^^^
8146c0b4:   41 89 85 b0 00 00 00mov%eax,0xb0(%r13)
8146c0bb:   65 ff 05 9e 54 ba 7eincl   %gs:0x7eba549e(%rip) 
   # 11560 
8146c0c2:   80 7d 8a 07 cmpb   $0x7,-0x76(%rbp)
8146c0c6:   75 1a   jne8146c0e2 
<ip_route_input_noref+0xac5>
8146c0c8:   41 81 a5 9c 00 00 00andl   $0x7fff,0x9c(%r13)
8146c0cf:   ff ff ff 7f 
8146c0d3:   f7 db   neg%ebx
8146c0d5:   49 c7 45 50 b1 96 46movq   
$0x814696b1,0x50(%r13)
8146c0dc:   81 
8146c0dd:   66 41 89 5d 64  mov%bx,0x64(%r13)
8146c0e2:   45 84 d2test   %r10b,%r10b
8146c0e5:   74 29   je 8146c110 
<ip_route_input_noref+0xaf3>
8146c0e7:   0f b6 7d 89 movzbl -0x77(%rbp),%edi
8146c0eb:   4c 89 eemov%r13,%rsi
8146c0ee:   48 ff c7inc%rdi
8146c0f1:   48 6b ff 60 imul   $0x60,%rdi,%rdi
8146c0f5:   48 03 7d 90 add-0x70(%rbp),%rdi
8146c0f9:   e8 10 d1 ff ff  callq  8146920e 

8146c0fe:   84 c0   test   %al,%al
8146c100:   75 0e   jne8146c110 
<ip_route_input_noref+0xaf3>
8146c102:   66 41 83 4d 60 10   orw$0x10,0x60(%r13)
8146c108:   4c 89 efmov%r13,%rdi
8146c10b:   e8 7d cc ff ff  callq  81468d8d 

8146c110:   4d 89 6c 24 58  mov%r13,0x58(%r12)
8146c115:   31 db   xor%ebx,%ebx
8146c117:   e9 7c f7 ff ff  jmpq   8146b898 
<ip_route_input_noref+0x27b>
8146c11c:   bb 8f ff ff ff  mov$0xff8f,%ebx
8146c121:   c6 45 8a 07 movb   $0x7,-0x76(%rbp)
8146c125:   48 c7 45 90 00 00 00movq   $0x0,-0x70(%rbp)
...

addr2line -e vmlinux -i 0x8146c0b1
net/ipv4/route.c:1815
net/ipv4/route.c:1905


which seems to be this line ip_route_input_noref()->ip_route_input_slow():
...
1813 rth->rt_is_input = 1;
1814 if (res.table)
1815 rth->rt_table_id = res.table->tb_id;
1816
...
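
One possible reading of the disassembly, which nothing in this report
confirms: the marked load only runs after "test %rax,%rax" found a non-NULL
value, so the pointer may have been stale rather than NULL. The userspace
sketch below shows the defensive pattern under that assumption. All names
(table_model, result_model, route_model, result_init, route_fill) are
hypothetical stand-ins, not the kernel structures.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct table_model { unsigned int tb_id; };
struct result_model { struct table_model *table; };
struct route_model { unsigned int rt_table_id; int rt_is_input; };

/* Clear the result before any lookup so 'table' is NULL unless a lookup set it. */
static void result_init(struct result_model *res)
{
	assert(res != NULL);
	memset(res, 0, sizeof(*res));
}

/* Same shape as the quoted lines: copy the id only when a table is present. */
static void route_fill(struct route_model *rth, const struct result_model *res)
{
	assert(rth != NULL && res != NULL);
	rth->rt_is_input = 1;
	if (res->table)
		rth->rt_table_id = res->table->tb_id;
}
```

The NULL test in route_fill() only helps if every path that skips the lookup
leaves the pointer NULL, which is what result_init() guarantees in this model.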


added by b7503e0cdb5dbec5d201aa69dc14679b5ae8

net: Add FIB table id to rtable

Add the FIB table id to rtable to make the information available for
IPv4 as it is for IPv6.


-ss


Re: NFS/TCP/IPv6 acting strangely in 4.2

2015-09-16 Thread Gregory CLEMENT
Hi Damien,
 
 On mer., sept. 16 2015, Damien Thébault  wrote:

> On Fri, 2015-09-11 at 12:38 +0100, Russell King - ARM Linux wrote:
>> I have a recent Marvell Armada 388 board here which uses the mvneta
>> driver.  I'm seeing some weird effects with NFS with it acting as a
>> client.
>
> Hello,
>
> I'm upgrading a Marvell Armada 370 board using the mvneta driver from
> 4.0 to 4.2 and noticed issues with NFS booting.
> Basically, most of the time init returns with an error code, or
> programs segfault or throw illegal instructions.
>
> Since it worked fine on 4.0 I bisected until I found commit
> a84e32894191cfcbffa54180d78d7d4654d56c20 "net: mvneta: fix refilling
> for Rx DMA buffers".
>
> If I revert this commit, everything seems to get back to normal.
> Could you try it ? The two issues look very similar.

Actually there was a bug with this commit, but a fix had been submitted
and accepted yesterday, you can find it here:
https://patchwork.ozlabs.org/patch/518111/.

Thanks,

Gregory

>
> Regards

-- 
Gregory Clement, Free Electrons
Kernel, drivers, real-time and embedded Linux
development, consulting, training and support.
http://free-electrons.com


Re: [PATCH v2 1/1] eventfd: implementation of EFD_MASK flag

2015-09-16 Thread Damian Hobson-Garcia
Hi Martin,

On 2015-09-16 3:51 PM, Martin Sustrik wrote:
> On 2015-09-16 08:27, Damian Hobson-Garcia wrote:
>>
>> Additionally, to provide a way to associate user-space state with eventfd
>> object, it allows to attach user-space data to the file descriptor.
> 
> The above paragraph is a leftover from the past. The functionality no
> longer exists.
> 
Oops, I forgot to delete that part. I'll get rid of it.

>>
>> The semantics of EFD_MASK are as follows:
>>
>> eventfd(2):
>>
>> If eventfd is created with EFD_MASK flag set, it is initialised in such a
>> way as to signal no events on the file descriptor when it is polled on.
>> The 'initval' argument is ignored.
>>
>> write(2):
>>
>> User is allowed to write only buffers containing the following structure:
>>
>> struct efd_mask {
>>   uint32_t events;
>> };
> 
> Is it worth having a struct here? Why not just uint32_t?
As it stands right now, no, the struct doesn't really add anything.
uint32_t should be just fine.
> 
> Martin

Damian


Re: [PATCH] net: Fix vti use case with oif in dst lookups

2015-09-16 Thread Steffen Klassert
On Tue, Sep 15, 2015 at 03:10:50PM -0700, David Ahern wrote:
> Steffen reported that the recent change to add oif to dst lookups breaks
> the VTI use case. The problem is that with the oif set in the flow struct
> the comparison to the nh_oif is triggered. Fix by splitting the
> FLOWI_FLAG_VRFSRC into 2 flags -- one that triggers the vrf device cache
> bypass (FLOWI_FLAG_VRFSRC) and another telling the lookup to not compare
> nh oif (FLOWI_FLAG_SKIP_NH_OIF).
> 
> Fixes: 42a7b32b73d6 ("xfrm: Add oif to dst lookups")
> 
> Signed-off-by: David Ahern 

This works, thanks a lot for the quick fix!

> ---
> IPv6 does not show this problem for me. So no change is added for IPv6.
> If your mileage varies let me know and I'll take another look.

IPv6 works just fine as it is, so no change needed.

Acked-by: Steffen Klassert 



Re: [PATCH v2 1/3] net: irda: pxaficp_ir: use sched_clock() for time management

2015-09-16 Thread Robert Jarzmik
David Miller  writes:

> From: Robert Jarzmik 
> Date: Sat, 12 Sep 2015 13:45:22 +0200
>
>> Instead of using directly the OS timer through direct register access,
>> use the standard sched_clock(), which will end up in OSCR reading
>> anyway.
>> 
>> This is a first step for direct access register removal and machine
>> specific code removal from this driver.
>> 
>> Signed-off-by: Robert Jarzmik 
>
> What is the granularity of the OSCR register?
It's 307ns (ie. 3.25MHz clock).

> If it is not nanoseconds, then you need to adjust calculations
> such as this one:
Tell me if the 307ns requires something I should adjust.

My understanding is that the flow will be :
 sched_clock()
   rd->read_sched_clock() (cyc_to_ns() transformed for return)
 pxa_read_sched_clock()
   readl_relaxed(OSCR)

I didn't see any timings issue, as the flow looks equivalent to the readl(OSCR),
but I might have overlooked something.
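
For reference, the 307ns figure follows directly from the 3.25MHz rate. The
sketch below converts raw timer ticks to nanoseconds with a plain division.
It is an illustration only: oscr_ticks_to_ns is a hypothetical name, and the
kernel's sched_clock code uses a precomputed multiply/shift rather than a
division.

```c
#include <assert.h>
#include <stdint.h>

#define OSCR_HZ 3250000ULL			/* 3.25 MHz, as stated above */
#define MODEL_NSEC_PER_SEC 1000000000ULL

/* Convert raw timer ticks to nanoseconds, truncating the fraction. */
static uint64_t oscr_ticks_to_ns(uint64_t ticks)
{
	/* Keep ticks * 1e9 within 64 bits. */
	assert(ticks <= UINT64_MAX / MODEL_NSEC_PER_SEC);
	return ticks * MODEL_NSEC_PER_SEC / OSCR_HZ;
}
```

One tick maps to 307ns (307.69 truncated) and 13 ticks map to exactly 4000ns,
so any comparison that used to be made in raw ticks has to be made in
nanoseconds once the value comes from sched_clock().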

Cheers.

-- 
Robert


Re: vhost: build failure

2015-09-16 Thread Sudip Mukherjee
On Wed, Sep 16, 2015 at 11:36:45AM +0300, Michael S. Tsirkin wrote:
> On Wed, Sep 16, 2015 at 01:50:08PM +0530, Sudip Mukherjee wrote:
> > Hi,
> > While crosscompiling the kernel for openrisc with allmodconfig the build
> > failed with the error:
> > drivers/vhost/vhost.c: In function 'vhost_vring_ioctl':
> > drivers/vhost/vhost.c:818:3: error: call to '__compiletime_assert_818' 
> > declared with attribute error: BUILD_BUG_ON failed: __alignof__
> > *vq->avail > VRING_AVAIL_ALIGN_SIZE
> > 
> > Can you please give me any idea about what the problem might be and how
> > it can be solved.
> > 
> > You can see the build log at:
> > https://travis-ci.org/sudipm-mukherjee/parport/jobs/80365425
> > 
> > regards
> > sudip
> 
> Yes - I think I saw this already.
> I think the openrisc cross-compiler is broken.
I thought so. Thanks for the quick reply. I will open a bug in gcc and
let's see what they say.

regards
sudip


Re: NFS/TCP/IPv6 acting strangely in 4.2

2015-09-16 Thread Damien Thébault
On Fri, 2015-09-11 at 12:38 +0100, Russell King - ARM Linux wrote:
> I have a recent Marvell Armada 388 board here which uses the mvneta
> driver.  I'm seeing some weird effects with NFS with it acting as a
> client.

Hello,

I'm upgrading a Marvell Armada 370 board using the mvneta driver from
4.0 to 4.2 and noticed issues with NFS booting.
Basically, most of the time init returns with an error code, or
programs segfault or throw illegal instructions.

Since it worked fine on 4.0 I bisected until I found commit
a84e32894191cfcbffa54180d78d7d4654d56c20 "net: mvneta: fix refilling
for Rx DMA buffers".

If I revert this commit, everything seems to get back to normal.
Could you try it ? The two issues look very similar.

Regards
-- 
Damien Thébault



Re: NFS/TCP/IPv6 acting strangely in 4.2

2015-09-16 Thread Damien Thébault
On Wed, 2015-09-16 at 09:15 +0200, Gregory CLEMENT wrote:
> > Since it worked fine on 4.0 I bisected until I found commit
> > a84e32894191cfcbffa54180d78d7d4654d56c20 "net: mvneta: fix
> > refilling
> > for Rx DMA buffers".
> > 
> > If I revert this commit, everything seems to get back to normal.
> > Could you try it ? The two issues look very similar.
> Actually there was a bug with this commit, but a fix had been
> submitted
> and accepted yesterday, you can find it here:
> https://patchwork.ozlabs.org/patch/518111/.

Hello, this indeed fixes the issue for me, thanks.
-- 
Damien Thébault
R&D Engineer
VITEC

T. +33 1 46 73 06 06
F. +33 9 59 85 99 92
E. damien.theba...@vitec.com
http://www.vitec.com



RE: bnx2x - occasional high packet loss (on LAN)

2015-09-16 Thread Ariel Elior
Hi Nikola,
Please provide dmesg output from your system.
Thanks,
Ariel

> -Original Message-
> From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org] On
> Behalf Of Nikola Ciprich
> Sent: Tuesday, September 15, 2015 7:17 AM
> To: netdev 
> Cc: n...@linuxbox.cz
> Subject: bnx2x - occasional high packet loss (on LAN)
> 
> Hello,
> 
> I'm trying to track strange issue with one of our servers and
> like to ask for recommendations..
> 
> I've got three node cluster (nodes A..C) interconnected with stacked broadcom
> ICX6610. eth0 of each box is connected to first switch, eth1 to second one,
> bonding set as follows: "mode=802.3ad lacp_rate=fast xmit_hash_policy=layer2+3
> miimon=100"
> 
> It happened few times, that suddenly eth1 on box A started misbehaving and
> communication
> with other nodes (ie flood ping) started dropping up to 30% packets. When 
> this port
> has been shut on both sides, problem immediately vanished.
> 
> We've tried replacing card, cable and using different port on switch, but 
> problem
> repeated again yesterday..
> 
> Since it's "only" loss, and not link loss, bonding doesn't help me much..
> 
> however during weekend, port also had strange link issue:
> 
> Sep 12 15:23:45 remrprv1a kernel: [676373.296786] bnx2x :03:00.1 eth1: NIC
> Link is Down
> Sep 12 15:23:46 remrprv1a kernel: [676373.356638] bond0: link status 
> definitely
> down for interface eth1, disabling it
> Sep 12 15:23:46 remrprv1a kernel: [676374.299571] bnx2x :03:00.1 eth1: NIC
> Link is Up, 1 Mbps full duplex, Flow control: ON - receive & transmit
> Sep 12 15:23:47 remrprv1a kernel: [676374.364428] bond0: link status 
> definitely up
> for interface eth1, 1 Mbps full duplex
> Sep 12 15:23:47 remrprv1a kernel: [676374.372902] bond0: first active 
> interface up!
> Sep 12 15:24:24 remrprv1a kernel: [676411.402511] bnx2x :03:00.1 eth1: NIC
> Link is Down
> Sep 12 15:24:24 remrprv1a kernel: [676411.407422] bond0: link status 
> definitely
> down for interface eth1, disabling it
> Sep 12 15:24:25 remrprv1a kernel: [676412.405311] bnx2x :03:00.1 eth1: NIC
> Link is Up, 1 Mbps full duplex, Flow control: ON - receive & transmit
> Sep 12 15:24:25 remrprv1a kernel: [676412.408123] bond0: link status 
> definitely up
> for interface eth1, 0 Mbps full duplex
> Sep 12 15:24:51 remrprv1a kernel: [676438.477641] bnx2x :03:00.1 eth1: NIC
> Link is Down
> Sep 12 15:24:51 remrprv1a kernel: [676438.528513] bond0: link status 
> definitely
> down for interface eth1, disabling it
> Sep 12 15:24:52 remrprv1a kernel: [676439.480472] bnx2x :03:00.1 eth1: NIC
> Link is Up, 1 Mbps full duplex, Flow control: ON - receive & transmit
> Sep 12 15:24:52 remrprv1a kernel: [676439.536282] bond0: link status 
> definitely up
> for interface eth1, 1 Mbps full duplex
> 
> 0mbps link speed is quite weird I guess..
> 
> all three boxes are the same, running centos6 based system, 4.0.5 x86_64 
> kernel.
> 
> The only difference I noticed on them is, that irqbalance was enabled on 
> problematic
> box and not on the others.. So I disabled it and rebooted the box.. The 
> problem is,
> I can't really wait for the problem to reappear, so I'd like to ask, has 
> anybody
> seen similar problem? I of so, was it fixed in some newer kernel release? I 
> haven't
> found mention in the changelogs, but still.. or does somebody have a hint on 
> what
> else
> I should check?
> 
> I'll try to reproduce this on test system (enabling irqbalance and doing some 
> network
> benchmarks, but I'd be most happy if I could prevent it on this production 
> system..)
> 
> thanks a lot for any advance
> 
> with best regards
> 
> nikola ciprich
> 
> PS: here's lspci -vv of eths.. should I provide any further information, 
> please let me
> know:
> 
> http://nik.lbox.cz/download/lspci.txt
> 
> --
> -
> Ing. Nikola CIPRICH
> LinuxBox.cz, s.r.o.
> 28.rijna 168, 709 00 Ostrava
> 
> tel.:   +420 591 166 214
> fax:+420 596 621 273
> mobil:  +420 777 093 799
> www.linuxbox.cz
> 
> mobil servis: +420 737 238 656
> email servis: ser...@linuxbox.cz
> -


Re: xfrm4_garbage_collect reaching limit

2015-09-16 Thread Steffen Klassert
On Mon, Sep 14, 2015 at 11:14:59PM -0400, Dan Streetman wrote:
> On Fri, Sep 11, 2015 at 5:48 AM, Steffen Klassert
>  wrote:
> >
> >> Possibly the
> >> default value of xfrm4_gc_thresh could be set proportional to
> >> num_online_cpus(), but that doesn't help when cpus are onlined after
> >> boot.
> >
> > This could be an option, we could change the xfrm4_gc_thresh value with
> > a cpu notifier callback if more cpus come up after boot.
> 
> the issue there is, if the value is changed by the user, does a cpu
> hotplug reset it back to default...

What about the patch below? With this we are independent of the number
of cpus. It should cover most, if not all use cases.

While we are at it, we could think about increasing the flowcache
percpu limit. This value was chosen back in 2003, so maybe we could
have more than 4k cache entries per cpu these days.


Subject: [PATCH RFC] xfrm: Let the flowcache handle its size by default.

The xfrm flowcache size is limited by the flowcache limit
(4096 * number of online cpus) and the xfrm garbage collector
threshold (2 * 32768), whatever is reached first. This means
that we can hit the garbage collector limit only on systems
with more than 16 cpus. On such systems we simply refuse
new allocations if we reach the limit, so new flows are dropped.
On systems with 16 or fewer cpus, we hit the flowcache limit.
In this case, we shrink the flow cache instead of refusing new
flows.

We increase the xfrm garbage collector threshold to INT_MAX
to get the same behaviour, independent of the number of cpus.

The xfrm garbage collector threshold can still be set below
the flowcache limit to reduce the memory usage of the flowcache.
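
The arithmetic behind the 16-cpu crossover can be sketched as follows. The
constants come from the description above (4096 entries per cpu, refusal at
twice gc_thresh); the helper names flowcache_limit, gc_refuse_limit and
flowcache_hits_first are hypothetical and only illustrate the comparison.

```c
#include <assert.h>
#include <limits.h>

#define FLOWCACHE_PER_CPU 4096LL	/* per-cpu flowcache limit from the text */
#define OLD_GC_THRESH 32768LL		/* value removed by the patch */
#define NEW_GC_THRESH ((long long)INT_MAX)	/* value added by the patch */

/* Total flowcache limit for a given number of online cpus. */
static long long flowcache_limit(long long cpus)
{
	assert(cpus > 0);
	return FLOWCACHE_PER_CPU * cpus;
}

/* New allocations are refused at twice the garbage collector threshold. */
static long long gc_refuse_limit(long long gc_thresh)
{
	return 2 * gc_thresh;
}

/* Non-zero when the flowcache limit is reached first, so the cache shrinks. */
static int flowcache_hits_first(long long cpus, long long gc_thresh)
{
	return flowcache_limit(cpus) <= gc_refuse_limit(gc_thresh);
}
```

With the old threshold, 16 cpus give 4096 * 16 = 65536 = 2 * 32768, so the
flowcache limit still applies, while 17 cpus reach the refusal limit first.
With INT_MAX the flowcache limit is reached first for any realistic cpu count.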

Signed-off-by: Steffen Klassert 
---
 Documentation/networking/ip-sysctl.txt | 6 --
 net/ipv4/xfrm4_policy.c| 2 +-
 net/ipv6/xfrm6_policy.c| 2 +-
 3 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index ebe94f2..260f30b 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -1199,7 +1199,8 @@ tag - INTEGER
 xfrm4_gc_thresh - INTEGER
The threshold at which we will start garbage collecting for IPv4
destination cache entries.  At twice this value the system will
-   refuse new allocations.
+   refuse new allocations. The value must be set below the flowcache
+   limit (4096 * number of online cpus) to take effect.
 
 igmp_link_local_mcast_reports - BOOLEAN
Enable IGMP reports for link local multicast groups in the
@@ -1645,7 +1646,8 @@ ratelimit - INTEGER
 xfrm6_gc_thresh - INTEGER
The threshold at which we will start garbage collecting for IPv6
destination cache entries.  At twice this value the system will
-   refuse new allocations.
+   refuse new allocations. The value must be set below the flowcache
+   limit (4096 * number of online cpus) to take effect.
 
 
 IPv6 Update by:
diff --git a/net/ipv4/xfrm4_policy.c b/net/ipv4/xfrm4_policy.c
index 1e06c4f..3dffc73 100644
--- a/net/ipv4/xfrm4_policy.c
+++ b/net/ipv4/xfrm4_policy.c
@@ -248,7 +248,7 @@ static struct dst_ops xfrm4_dst_ops = {
.destroy =  xfrm4_dst_destroy,
.ifdown =   xfrm4_dst_ifdown,
.local_out =__ip_local_out,
-   .gc_thresh =32768,
+   .gc_thresh =INT_MAX,
 };
 
 static struct xfrm_policy_afinfo xfrm4_policy_afinfo = {
diff --git a/net/ipv6/xfrm6_policy.c b/net/ipv6/xfrm6_policy.c
index f10b940..e9af39a 100644
--- a/net/ipv6/xfrm6_policy.c
+++ b/net/ipv6/xfrm6_policy.c
@@ -289,7 +289,7 @@ static struct dst_ops xfrm6_dst_ops = {
.destroy =  xfrm6_dst_destroy,
.ifdown =   xfrm6_dst_ifdown,
.local_out =__ip6_local_out,
-   .gc_thresh =32768,
+   .gc_thresh =INT_MAX,
 };
 
 static struct xfrm_policy_afinfo xfrm6_policy_afinfo = {
-- 
1.9.1



[RESEND PATCH] net: ks8851: Export OF module alias information

2015-09-16 Thread Javier Martinez Canillas
Drivers need to export the OF id table and have it built into
the module, or udev won't have the necessary information to autoload
the driver module when the device is registered via OF.

Signed-off-by: Javier Martinez Canillas 

---

 drivers/net/ethernet/micrel/ks8851.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/micrel/ks8851.c b/drivers/net/ethernet/micrel/ks8851.c
index 66d4ab703f45..60f43ec22175 100644
--- a/drivers/net/ethernet/micrel/ks8851.c
+++ b/drivers/net/ethernet/micrel/ks8851.c
@@ -1601,6 +1601,7 @@ static const struct of_device_id ks8851_match_table[] = {
{ .compatible = "micrel,ks8851" },
{ }
 };
+MODULE_DEVICE_TABLE(of, ks8851_match_table);
 
 static struct spi_driver ks8851_driver = {
.driver = {
-- 
2.4.3



Re: [PATCH] solos-pci: Increase headroom on received packets

2015-09-16 Thread Eric Dumazet
On Wed, 2015-09-16 at 11:25 +0100, David Woodhouse wrote:
> A comment in include/linux/skbuff.h says that:
> 
>  * Various parts of the networking layer expect at least 32 bytes of
>  * headroom, you should not reduce this.
> 
> This was demonstrated by a panic when handling fragmented IPv6 packets:
> http://marc.info/?l=linux-netdev&m=144236093519172&w=2
> 
> It's not entirely clear if that comment is still valid — and if it is,
> perhaps netif_rx() ought to be enforcing it with a warning.
> 
> But either way, it is rather stupid from a performance point of view
> for us to be receiving packets into a buffer which doesn't have enough
> room to prepend an Ethernet header — it means that *every* incoming
> packet is going to need to be reallocated. So let's fix that.
> 
> Signed-off-by: David Woodhouse 
> --- 
> Tested in the DMA code path; I don't believe the DMA-capable devices
> can still be used in MMIO mode. Simon, Guy, would you be able to test
> the MMIO version?

You should use netdev_alloc_skb() : This helper is better for rx skbs,
as it allows for better packing of frames in GRO or TCP stack.

Also netdev_alloc_skb_ip_align() might handle the NET_IP_ALIGN stuff
for arches that care.




Re: [PATCH v4] add stealth mode

2015-09-16 Thread Florian Westphal
Matteo Croce  wrote:
> Add option to disable any reply not related to a listening socket,
> like RST/ACK for TCP and ICMP Port-Unreachable for UDP.
> Also disables ICMP replies to echo request and timestamp.
> The stealth mode can be enabled selectively for a single interface.

I think it would make more sense to extend the socket match
in xtables if it can't be used to achieve this already.

seems like
*filter
:INPUT ACCEPT [0:0]
-A INPUT -p tcp -m socket --nowildcard -j ACCEPT
-A INPUT -p tcp -j DROP
COMMIT

Already does what you want for tcp, udp should work too.
I'd much rather see xtables and/or nftables extended
with whatever feature(s) are needed to configure such a policy
rather than pushing this into the core network stack.


Re: [linux-next] oops in ip_route_input_noref

2015-09-16 Thread Richard Alpe
On 2015-09-16 11:24, Sergey Senozhatsky wrote:
> Hi,
> 
> 4.3.0-rc1-next-20150916
> 
> oops after removal of rndis usb device
> 
> ...
> 8146c052:   00
> 8146c053:   0f b6 55 8a             movzbl -0x76(%rbp),%edx
> 8146c057:   49 8b bf e8 01 00 00    mov    0x1e8(%r15),%rdi
> 8146c05e:   45 89 d1                mov    %r10d,%r9d
> 8146c061:   44 89 f6                mov    %r14d,%esi
> 8146c064:   44 88 95 70 ff ff ff    mov    %r10b,-0x90(%rbp)
> 8146c06b:   0f 95 c1                setne  %cl
> 8146c06e:   81 ce 00 00 00 80       or     $0x8000,%esi
> 8146c074:   41 83 e1 01             and    $0x1,%r9d
> 8146c078:   45 31 c0                xor    %r8d,%r8d
> 8146c07b:   e8 49 d5 ff ff          callq  814695c9
> 8146c080:   48 85 c0                test   %rax,%rax
> 8146c083:   49 89 c5                mov    %rax,%r13
> 8146c086:   75 0a                   jne    8146c092 <ip_route_input_noref+0xa75>
> 8146c088:   bb 97 ff ff ff          mov    $0xff97,%ebx
> 8146c08d:   e9 06 f8 ff ff          jmpq   8146b898 <ip_route_input_noref+0x27b>
> 8146c092:   48 c7 40 58 a3 95 46    movq   $0x814695a3,0x58(%rax)
> 8146c099:   81
> 8146c09a:   c6 80 a2 00 00 00 01    movb   $0x1,0xa2(%rax)
> 8146c0a1:   48 8b 45 98             mov    -0x68(%rbp),%rax
> 8146c0a5:   44 8a 95 70 ff ff ff    mov    -0x90(%rbp),%r10b
> 8146c0ac:   48 85 c0                test   %rax,%rax
> 8146c0af:   74 0a                   je     8146c0bb <ip_route_input_noref+0xa9e>
> 8146c0b1:   8b 40 10                mov    0x10(%rax),%eax
>                                     ^^^
> 8146c0b4:   41 89 85 b0 00 00 00    mov    %eax,0xb0(%r13)
> 8146c0bb:   65 ff 05 9e 54 ba 7e    incl   %gs:0x7eba549e(%rip)   # 11560
> 8146c0c2:   80 7d 8a 07             cmpb   $0x7,-0x76(%rbp)
> 8146c0c6:   75 1a                   jne    8146c0e2 <ip_route_input_noref+0xac5>
> 8146c0c8:   41 81 a5 9c 00 00 00    andl   $0x7fff,0x9c(%r13)
> 8146c0cf:   ff ff ff 7f
> 8146c0d3:   f7 db                   neg    %ebx
> 8146c0d5:   49 c7 45 50 b1 96 46    movq   $0x814696b1,0x50(%r13)
> 8146c0dc:   81
> 8146c0dd:   66 41 89 5d 64          mov    %bx,0x64(%r13)
> 8146c0e2:   45 84 d2                test   %r10b,%r10b
> 8146c0e5:   74 29                   je     8146c110 <ip_route_input_noref+0xaf3>
> 8146c0e7:   0f b6 7d 89             movzbl -0x77(%rbp),%edi
> 8146c0eb:   4c 89 ee                mov    %r13,%rsi
> 8146c0ee:   48 ff c7                inc    %rdi
> 8146c0f1:   48 6b ff 60             imul   $0x60,%rdi,%rdi
> 8146c0f5:   48 03 7d 90             add    -0x70(%rbp),%rdi
> 8146c0f9:   e8 10 d1 ff ff          callq  8146920e
> 8146c0fe:   84 c0                   test   %al,%al
> 8146c100:   75 0e                   jne    8146c110 <ip_route_input_noref+0xaf3>
> 8146c102:   66 41 83 4d 60 10       orw    $0x10,0x60(%r13)
> 8146c108:   4c 89 ef                mov    %r13,%rdi
> 8146c10b:   e8 7d cc ff ff          callq  81468d8d
> 8146c110:   4d 89 6c 24 58          mov    %r13,0x58(%r12)
> 8146c115:   31 db                   xor    %ebx,%ebx
> 8146c117:   e9 7c f7 ff ff          jmpq   8146b898 <ip_route_input_noref+0x27b>
> 8146c11c:   bb 8f ff ff ff          mov    $0xff8f,%ebx
> 8146c121:   c6 45 8a 07             movb   $0x7,-0x76(%rbp)
> 8146c125:   48 c7 45 90 00 00 00    movq   $0x0,-0x70(%rbp)
> ...
> 
> addr2line -e vmlinux -i 0x8146c0b1
> net/ipv4/route.c:1815
> net/ipv4/route.c:1905
> 
> 
> which seems to be this line ip_route_input_noref()->ip_route_input_slow():
> ...
> 1813 rth->rt_is_input = 1;
> 1814 if (res.table)
> 1815 rth->rt_table_id = res.table->tb_id;
> 1816
> ...
> 
> 
> added by b7503e0cdb5dbec5d201aa69dc14679b5ae8
> 
> net: Add FIB table id to rtable
> 
> Add the FIB table id to rtable to make the information available for
> IPv4 as it is for IPv6.
> 
> 
>   -ss
> --

Re: [ANNOUNCE] libnftnl 1.0.4 release

2015-09-16 Thread Jan Engelhardt

On Wednesday 2015-09-16 13:50, Pablo Neira Ayuso wrote:
>The Netfilter project proudly presents:
>
>libnftnl 1.0.4

$ git diff libnftnl-1.0.3..libnftnl-1.0.4 src/libnftnl.map
diff --git a/src/libnftnl.map b/src/libnftnl.map
index be7b998..14ec88c 100644
--- a/src/libnftnl.map
+++ b/src/libnftnl.map
@@ -124,10 +123,12 @@ global:
   nft_set_attr_is_set;
   nft_set_attr_set;
   nft_set_attr_set_u32;
+  nft_set_attr_set_u64;
   nft_set_attr_set_str;
   nft_set_attr_get;
   nft_set_attr_get_str;
   nft_set_attr_get_u32;
+  nft_set_attr_get_u64;
   nft_set_nlmsg_build_payload;
   nft_set_nlmsg_parse;
   nft_set_parse;

You broke the ABI. A program that uses nft_set_attr_set_u64 and is
built against libnftnl-1.0.4 is marked to be compatible with the
"LIBNFTNL_1.0" symbol group, but this is incorrect, since the
nft_set_attr_set_u64 symbol did not previously exist.

Existing symbol groups in .map must not be extended. Always start
a new group.


Re: [PATCH 16/31] net/cavium/liquidio: use kmemdup rather than duplicating its implementation

2015-09-16 Thread Andrzej Hajda
Ping.

Regards
Andrzej

On 08/07/2015 09:59 AM, Andrzej Hajda wrote:
> The patch was generated using fixed coccinelle semantic patch
> scripts/coccinelle/api/memdup.cocci [1].
>
> [1]: http://permalink.gmane.org/gmane.linux.kernel/2014320
>
> Signed-off-by: Andrzej Hajda 
> ---
>  drivers/net/ethernet/cavium/liquidio/octeon_device.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
>
> diff --git a/drivers/net/ethernet/cavium/liquidio/octeon_device.c b/drivers/net/ethernet/cavium/liquidio/octeon_device.c
> index f67641a..8e23e3f 100644
> --- a/drivers/net/ethernet/cavium/liquidio/octeon_device.c
> +++ b/drivers/net/ethernet/cavium/liquidio/octeon_device.c
> @@ -602,12 +602,10 @@ int octeon_download_firmware(struct octeon_device *oct, const u8 *data,
>   snprintf(oct->fw_info.liquidio_firmware_version, 32, "LIQUIDIO: %s",
>h->version);
>  
> - buffer = kmalloc(size, GFP_KERNEL);
> + buffer = kmemdup(data, size, GFP_KERNEL);
>   if (!buffer)
>   return -ENOMEM;
>  
> - memcpy(buffer, data, size);
> -
>   p = buffer + sizeof(struct octeon_firmware_file_header);
>  
>   /* load all images */
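
For readers unfamiliar with the helper: kmemdup() allocates a buffer and copies the source into it in one call, which is why the quoted patch can drop the separate memcpy(). A minimal userspace sketch of the same pattern follows. It assumes malloc() in place of kmalloc() (so there is no GFP flag argument), and memdup_sketch() is a hypothetical name used only for this illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Userspace analogue of the kernel's kmemdup(): allocate len bytes and
 * copy src into them. Returns NULL on allocation failure, so a caller
 * keeps the same "if (!buffer) return -ENOMEM;" style check as in the
 * patch. memdup_sketch() is a hypothetical name, not a kernel function.
 */
static void *memdup_sketch(const void *src, size_t len)
{
    void *p;

    assert(src != NULL);
    p = malloc(len);
    if (p)
        memcpy(p, src, len);
    return p;
}
```

The caller owns the returned buffer and releases it with free(), much as the driver would release its copy with kfree().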



Re: [net-next PATCH] net: bridge: fix for bridging 802.1Q without REORDER_HDR

2015-09-16 Thread Phil Sutter
On Tue, Sep 15, 2015 at 10:36:41PM -0400, Vlad Yasevich wrote:
> On 09/15/2015 02:17 PM, Phil Sutter wrote:
> > On Tue, Sep 15, 2015 at 11:11:53AM -0400, Vlad Yasevich wrote:
> >> On 09/14/2015 04:06 PM, Phil Sutter wrote:
> >>> On Mon, Sep 14, 2015 at 02:21:10PM -0400, Vlad Yasevich wrote:
>  On 09/11/2015 04:20 PM, Phil Sutter wrote:
> > On Fri, Sep 11, 2015 at 12:24:45PM -0700, Stephen Hemminger wrote:
> >> On Fri, 11 Sep 2015 21:22:03 +0200
> >> Phil Sutter  wrote:
> >>
> >>> When forwarding packets from an 802.1Q interface with REORDER_HDR set to
> >>> zero, the VLAN header previously inserted by vlan_do_receive() needs to
> >>> be stripped from the packet and the mac_header adjustment undone,
> >>> otherwise a tagged frame with first four bytes missing will be
> >>> transmitted.
> >>>
> >>> Signed-off-by: Phil Sutter 
> >>> ---
> >>>  net/bridge/br_input.c | 10 ++
> >>>  1 file changed, 10 insertions(+)
> >>>
> >>> diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c
> >>> index f921a5d..e4e3fc7 100644
> >>> --- a/net/bridge/br_input.c
> >>> +++ b/net/bridge/br_input.c
> >>> @@ -288,6 +288,16 @@ rx_handler_result_t br_handle_frame(struct sk_buff **pskb)
> >>>   }
> >>>  
> >>>  forward:
> >>> + if (is_vlan_dev(skb->dev) &&
> >>> + !(vlan_dev_priv(skb->dev)->flags & VLAN_FLAG_REORDER_HDR)) {
> >>> + unsigned int offset = skb->data - skb_mac_header(skb);
> >>> +
> >>> + skb_push(skb, offset);
> >>> + memmove(skb->data + VLAN_HLEN, skb->data, 2 * ETH_ALEN);
> >>> + skb->mac_header += VLAN_HLEN;
> >>> + skb_pull(skb, offset);
> >>> + skb_reset_mac_len(skb);
> >>> + }
> >>>   switch (p->state) {
> >>>   case BR_STATE_FORWARDING:
> >>>   rhook = rcu_dereference(br_should_route_hook);
> >>
> >> Thanks for finding this. Is this a new thing or has it always been there?
> >
> > Sorry, I didn't check if this is a regression or not. Seen initially
> > with RHEL7's kernel-3.10.0-229.7.2, which due to the massive backporting
> > is by far not as old as it might seem. But it's surely not a brand new
> > problem of net-next or so.
> >
> > Since nowadays no sane mind touches REORDER_HDR (there was originally a
> > bug in NetworkManager which defaulted this to 0), it may very well be
> > there for a long time already.
> >
> >> Sorry, this looks so special case it doesn't seem like a good idea.
> >> Something is broken in VLAN handling if this is required.
> >
> > It is so ugly, I wish I had found a better way to fix the problem. Well,
> > maybe I miss something:
> >
> > - packet enters __netif_receive_skb_core():
> >   - skb->protocol is set to ETH_P_8021Q, so:
> > - packet is untagged
> > - skb->vlan_tci set
> > - skb->protocol set to 'real' protocol
> >   - skb_vlan_tag_present(skb) == true, so:
> > - vlan_do_receive() is called:
> >   - tags the packet again
> >   - zeroes vlan_tci
> > - goto another_round
> > - __netif_receive_skb_core(), round 2:
> >   - skb->protocol is not ETH_P_8021Q -> no untagging
> >   - skb_vlan_tag_present(skb) == false -> no vlan_do_receive()
> >   - rx_handler handler (== br_handle_frame) is called
> >
> > IMO the root of all evil is the existence of REORDER_HDR itself. It
> > causes an skb which should have been untagged to be passed along with
> > VLAN header present and code dealing with it needs to clean up the mess.
> 
>  So the problem here appears to be the code in br_dev_queue_push_xmit().
>  It assumes that MAC_HLEN worth of data has been removed from the skb,
>  which is normal in case of normal VLAN processing.  However, without
>  REORDER_HDR set this is no longer the case.  In this case, the ethernet
>  header is shifted 4 bytes, and when we push it back we miss the 4 bytes
>  of the destination mac address...
> >>>
> >>> Please note that vlan_do_receive() also inserts the VLAN header in
> >>> between ethernet header and IP header, therefore:
> >>>
>  I wonder if it would be safe to just use skb->mac_len.
> >>>
> >>> Given this works, the bridge would still forward a tagged frame which
> >>> should have been untagged in the first place.
> >>>
> >>> I just wondered where this added VLAN header is dropped if the interface
> >>> does not belong to a bridge, but then realized that further packet
> >>> processing simply ignores the ethernet header (and everything following
> >>> it). So unless I forget something, this should indeed be a
> >>> bridge-specific problem.
> >>>
> >>
> >> Looks like macvtap 
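
The memmove() in the patch quoted above can be tried outside the kernel. The following is a userspace sketch, assuming a plain byte array in place of an skb and a frame laid out as destination MAC, source MAC, 802.1Q tag, then EtherType. strip_vlan_tag() is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ETH_ALEN  6   /* bytes per MAC address */
#define VLAN_HLEN 4   /* TPID (2 bytes) + TCI (2 bytes) */

/*
 * Strip the 802.1Q tag from a frame laid out as
 *   dst MAC | src MAC | TPID | TCI | EtherType | payload
 * by sliding both MAC addresses forward over the tag, as the memmove()
 * in the patch does. Returns the new start of the frame, which lies
 * VLAN_HLEN bytes further into the buffer (the patch adjusts
 * skb->mac_header by the same amount).
 */
static unsigned char *strip_vlan_tag(unsigned char *frame, size_t len)
{
    assert(len >= 2 * ETH_ALEN + VLAN_HLEN + 2);
    memmove(frame + VLAN_HLEN, frame, 2 * ETH_ALEN);
    return frame + VLAN_HLEN;
}
```

After the call, the returned pointer addresses an untagged frame: the two MAC addresses are immediately followed by the original EtherType.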

Re: [PATCH v4] add stealth mode

2015-09-16 Thread Eric Dumazet
On Wed, 2015-09-16 at 11:54 +0200, Matteo Croce wrote:
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index 93898e0..fe62ae0 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -77,6 +77,7 @@
>  #include 
>  
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -1652,7 +1653,7 @@ csum_error:
>   TCP_INC_STATS_BH(net, TCP_MIB_CSUMERRORS);
>  bad_packet:
>   TCP_INC_STATS_BH(net, TCP_MIB_INERRS);
> - } else {
> + } else if (!IN_DEV_STEALTH(skb->dev->ip_ptr)) {
>   tcp_v4_send_reset(NULL, skb);
>   }


It is illegal to deref skb->dev->ip_ptr without proper accessor /
annotations.

Check 

struct in_device *in_dev = __in_dev_get_rcu(skb->dev); 

(Same remarks in other places of your patch)





[PATCH v8 0/4] can: Allwinner A10/A20 CAN Controller support - Summary

2015-09-16 Thread Gerhard Bertelsmann
Hi,

Please find attached the next version of my patch set. I have
incorporated all remarks from Maxime Ripard into the new version.

Please review, test, and report any bugs you find.

The patchset applies to all recent Kernel versions (4.x, next etc.).

[PATCH v8 1/4] Device Tree Binding Documentation
[PATCH v8 2/4] Defconfig multi_v7
[PATCH v8 3/4] Defconfig sunxi
[PATCH v8 4/4] Kernel Module

History:
v8: sun4i_can.c: rename sunxi to sun4i
    dt: use sun4i-a10-can as identifier
    can_open: don't use shared IRQ

v7: set_normal_mode: stripped (code inserted in can_stop)
    set_reset_mode: stripped (code inserted in can_start)
    sunxi_can_start: reworked
    sunxi_can_stop: function added
    sunxi_can_err: don't skip if skb alloc fails
    sunxican_bittiming_const: use netdev_dbg instead of netdev_info
    sunxican_probe: CAN_CTRLMODE_PRESUME_ACK

v6: renamed the driver to sun4i as suggested by Maxime Ripard
    removed module version
    removed suspend and resume
    moved clk enable from can_start into open / should be balanced
      between enabling and disabling now
    freeing resources on error

v5: fix license
    modify prefix to mode select defines
    enable and disable clock in sunxican_get_berr_counter
    delete set_normal_mode at the end of sunxi_can_start
    removed sunxican_id_table
    use devm_clk_get instead of clk_get
    use devm_ioremap_resource to simplify probe and remove
    make set-normal-mode and set-reset-mode more readable

v4: defines prefixed with SUNXI_
    sunxi_can_write_cmdreg tweaked
    loops in set_xxx_mode reworked
    add return value to set_xxx_mode
    sunxican_start_xmit reworked
    struct platform_driver stripped
    moved set_bittiming into open
    moved clock start into open
    add clock stop to close
    suspend reworked
    resume reworked
    fixed double counting bug

v3: changed error state change handling (thx Andri for the hint)
    use bittiming function correctly (no need to call it)
    strip down priv (suggested by Marc)
    scripts/checkpatch.pl -> no matches anymore
    sparse -> no errors or warnings anymore
v2: cleaning
v1: initial

Signed-off-by: Gerhard Bertelsmann 
---

 .../devicetree/bindings/net/can/sun4i_can.txt  |  38 +
 arch/arm/configs/multi_v7_defconfig|   1 +
 arch/arm/configs/sunxi_defconfig   |   2 +
 drivers/net/can/Kconfig|  10 +
 drivers/net/can/Makefile   |   1 +
 drivers/net/can/sun4i_can.c| 857 +
 6 files changed, 909 insertions(+)


[PATCH v8 3/4] can: Allwinner A10/A20 CAN Controller support - Defconfig

2015-09-16 Thread Gerhard Bertelsmann
Defconfig sunxi for Allwinner A10/A20 CAN driver

Signed-off-by: Gerhard Bertelsmann 
---

 arch/arm/configs/sunxi_defconfig   |   2 +
 1 file changed, 2 insertions(+)


diff --git a/arch/arm/configs/sunxi_defconfig b/arch/arm/configs/sunxi_defconfig
index 51eea22..fe020a5 100644
--- a/arch/arm/configs/sunxi_defconfig
+++ b/arch/arm/configs/sunxi_defconfig
@@ -31,6 +31,8 @@ CONFIG_IP_PNP_BOOTP=y
 # CONFIG_INET_LRO is not set
 # CONFIG_INET_DIAG is not set
 # CONFIG_IPV6 is not set
+CONFIG_CAN=y
+CONFIG_CAN_SUN4I=y
 # CONFIG_WIRELESS is not set
 CONFIG_DEVTMPFS=y
 CONFIG_DEVTMPFS_MOUNT=y


[PATCH v8 2/4] can: Allwinner A10/A20 CAN Controller support - Defconfig

2015-09-16 Thread Gerhard Bertelsmann
Defconfig multi_v7 for Allwinner A10/A20 CAN driver

Signed-off-by: Gerhard Bertelsmann 
---

 arch/arm/configs/multi_v7_defconfig|   1 +
 1 file changed, 1 insertion(+)


diff --git a/arch/arm/configs/multi_v7_defconfig b/arch/arm/configs/multi_v7_defconfig
index 03deb7f..14eb6b9 100644
--- a/arch/arm/configs/multi_v7_defconfig
+++ b/arch/arm/configs/multi_v7_defconfig
@@ -153,6 +153,7 @@ CONFIG_CAN_DEV=y
 CONFIG_CAN_AT91=m
 CONFIG_CAN_XILINXCAN=y
 CONFIG_CAN_MCP251X=y
+CONFIG_CAN_SUN4I=y
 CONFIG_BT=m
 CONFIG_BT_MRVL=m
 CONFIG_BT_MRVL_SDIO=m


[PATCH v8 1/3] can: Allwinner A10/A20 CAN Controller support - Devicetree bindings

2015-09-16 Thread Gerhard Bertelsmann
Devicetree bindings for Allwinner A10/A20 CAN

Signed-off-by: Gerhard Bertelsmann 
---

 .../devicetree/bindings/net/can/sun4i_can.txt  |  38 +
 1 file changed, 38 insertions(+)


diff --git a/Documentation/devicetree/bindings/net/can/sun4i_can.txt b/Documentation/devicetree/bindings/net/can/sun4i_can.txt
new file mode 100644
index 000..cd0f50c
--- /dev/null
+++ b/Documentation/devicetree/bindings/net/can/sun4i_can.txt
@@ -0,0 +1,38 @@
+Allwinner A10/A20 CAN controller Device Tree Bindings
+-----------------------------------------------------
+
+Required properties:
+- compatible: "allwinner,sun4i-a10-can"
+- reg: physical base address and size of the Allwinner A10/A20 CAN register map.
+- interrupts: interrupt specifier for the sole interrupt.
+- clock: phandle and clock specifier.
+
+
+Example
+-------
+
+SoC common .dtsi file:
+
+   can0_pins_a: can0@0 {
+   allwinner,pins = "PH20","PH21";
+   allwinner,function = "can";
+   allwinner,drive = <0>;
+   allwinner,pull = <0>;
+   };
+...
+   can0: can@01c2bc00 {
+   compatible = "allwinner,sun4i-a10-can";
+   reg = <0x01c2bc00 0x400>;
+   interrupts = <0 26 4>;
+   clocks = <_gates 4>;
+   status = "disabled";
+   };
+
+Board specific .dts file:
+
+   can0: can@01c2bc00 {
+   pinctrl-names = "default";
+   pinctrl-0 = <&can0_pins_a>;
+   status = "okay";
+   };
+


Re: [PATCH] solos-pci: Increase headroom on received packets

2015-09-16 Thread David Woodhouse
On Wed, 2015-09-16 at 03:53 -0700, Eric Dumazet wrote:
> You should use netdev_alloc_skb() : This helper is better for rx skbs,
> as it allows for better packing of frames in GRO or TCP stack.

OK, thanks. I don't have a netdev (this is an ATM device) but I can use
dev_alloc_skb().

> Also netdev_alloc_skb_ip_align() might handle the NET_IP_ALIGN stuff
> for arches that care.

I'd briefly considered NET_IP_ALIGN but decided against it because this
isn't Ethernet and my hardware header is a nice sane 8 bytes, not 14.

But actually, the primary use cases for this are PPPoATM — with 2 bytes
of PPP frame type, and PPPoE over BR2684 — with 14 bytes of Ethernet
header. So NET_IP_ALIGN would actually make sense.

Unfortunately the FPGA can't do DMA to unaligned addresses, so I can't
do it in the DMA case. I can do it for the MMIO code path though (which
I still haven't tested).
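
The alignment argument above comes down to simple arithmetic. A small sketch follows, assuming NET_IP_ALIGN is 2 (in the kernel the value is architecture-dependent) and that the buffer itself starts on a 4-byte boundary. ip_hdr_aligned() and alignment_examples() are hypothetical helpers used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NET_IP_ALIGN_SKETCH 2   /* assumed value; arch-dependent in the kernel */

/*
 * Return true if an IP header that follows link_hdr_len bytes of
 * link-layer header ends up 4-byte aligned when the packet data starts
 * pad bytes into a 4-byte-aligned buffer.
 */
static bool ip_hdr_aligned(size_t pad, size_t link_hdr_len)
{
    return (pad + link_hdr_len) % 4 == 0;
}

/* The two header sizes mentioned in the mail, with and without padding. */
static void alignment_examples(void)
{
    assert(ip_hdr_aligned(NET_IP_ALIGN_SKETCH, 2));    /* PPP frame type: 2 + 2 = 4 */
    assert(ip_hdr_aligned(NET_IP_ALIGN_SKETCH, 14));   /* Ethernet: 2 + 14 = 16 */
    assert(!ip_hdr_aligned(0, 2));                     /* no pad: offset 2 */
    assert(!ip_hdr_aligned(0, 14));                    /* no pad: offset 14 */
}
```

Under these assumptions a 2-byte pad helps both the 2-byte PPP frame type and the 14-byte Ethernet header, which is the point made above for the MMIO path.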

I'll send a new patch in a moment...

-- 
dwmw2





Re: IPv6 routing/fragmentation panic

2015-09-16 Thread David Woodhouse
On Wed, 2015-09-16 at 01:48 +0200, Florian Westphal wrote:
> 
> What I don't understand is why you see this with fragmented ipv6 
> packets only (and not with all ipv6 forwarded skbs).
> 
> Something like this copy-pastry from ip_finish_output2 should fix it:

That works; thanks.

Tested-by: David Woodhouse 

A little extra debugging output shows that the offending fragments were
arriving here with skb_headroom(skb)==10. Which is reasonable, being
the Solos ADSL card's header of 8 bytes followed by 2 bytes of PPP
frame type.

The non-fragmented packets, on the other hand, are arriving with a
headroom of 42 bytes. Could something else already have reallocated
them before they get that far? (Do we have any way to gather statistics
on such reallocations? It seems that might be useful for performance
investigation.)
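
The headroom figures above can be illustrated with a toy model. This is a userspace sketch under stated assumptions: struct toy_buf and the toy_*() helpers are hypothetical stand-ins loosely modelled on skb_headroom(), skb_reserve() and skb_push(), and the toy push reports failure where the real skb_push() would treat an underrun as a bug.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for an skb: data points somewhere inside [head, end). */
struct toy_buf {
    unsigned char *head;   /* start of the allocation */
    unsigned char *data;   /* start of the packet data */
    unsigned char *end;    /* one past the end of the allocation */
};

/* Bytes available in front of the packet data (cf. skb_headroom()). */
static size_t toy_headroom(const struct toy_buf *b)
{
    return (size_t)(b->data - b->head);
}

/* Leave len bytes of headroom in an empty buffer (cf. skb_reserve()). */
static void toy_reserve(struct toy_buf *b, size_t len)
{
    assert(len <= (size_t)(b->end - b->data));
    b->data += len;
}

/*
 * Try to prepend a len-byte header in place (cf. skb_push()).
 * Returns false when the headroom is too small, which is the case
 * where the kernel would have to reallocate the buffer first.
 */
static bool toy_push(struct toy_buf *b, size_t len)
{
    if (toy_headroom(b) < len)
        return false;
    b->data -= len;
    return true;
}
```

With 10 bytes of headroom, pushing a 14-byte Ethernet header fails in this model; with 42 bytes it succeeds and leaves 28.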

Johannes and I were talking on IRC yesterday about trying to make this
kind of thing easier to reproduce without odd hardware. We postulated a
skb_torture() function which, when an appropriate debugging option was
enabled, would randomly screw around with the skb in various
interesting ways — shifting the data down so that there's no headroom,
deliberately making it *non-linear*, temporarily cloning it and freeing
the clone a couple of seconds later, etc.

Then we could insert calls to skb_torture() in interesting places like
netif_rx(), ip6_finish_output2() and anywhere else that seems
appropriate (perhaps with flags to indicate *what* kind of torture is
permissible in certain locations). And see what breaks...
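
As a rough illustration of the idea, one of the torture modes mentioned above (removing all headroom) might look like this in userspace. Everything here is hypothetical: struct toy_pkt, toy_torture() and the flag name are invented for the sketch, and only one kind of abuse is modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flag: shift the data down so that no headroom is left. */
#define TORTURE_NO_HEADROOM 0x1u

/* Toy stand-in for an skb holding len bytes of packet data. */
struct toy_pkt {
    unsigned char *head;   /* start of the allocation */
    unsigned char *data;   /* start of the packet data */
    size_t len;            /* bytes of packet data */
};

/*
 * Hypothetical toy_torture(): apply the kinds of abuse selected by
 * flags. Only TORTURE_NO_HEADROOM is modelled; it moves the packet
 * data to the very start of the allocation, preserving its contents.
 */
static void toy_torture(struct toy_pkt *p, unsigned int flags)
{
    if ((flags & TORTURE_NO_HEADROOM) && p->data != p->head) {
        memmove(p->head, p->data, p->len);
        p->data = p->head;
    }
}
```

The flags argument mirrors the suggestion that callers could indicate which kinds of torture are permissible at a given call site.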

-- 
David Woodhouse                            Open Source Technology Centre
david.woodho...@intel.com                              Intel Corporation





Re: [PATCH v4] add stealth mode

2015-09-16 Thread Daniel Borkmann

On 09/16/2015 11:54 AM, Matteo Croce wrote:

> Add option to disable any reply not related to a listening socket,
> like RST/ACK for TCP and ICMP Port-Unreachable for UDP.
> Also disables ICMP replies to echo request and timestamp.
> The stealth mode can be enabled selectively for a single interface.
>
> Signed-off-by: Matteo Croce
> ---
> rebased on 4.3-rc1
>
>   Documentation/networking/ip-sysctl.txt | 14 ++
>   include/linux/inetdevice.h             |  1 +
>   include/linux/ipv6.h                   |  1 +
>   include/uapi/linux/ip.h                |  1 +
>   net/ipv4/devinet.c                     |  1 +
>   net/ipv4/icmp.c                        |  6 ++
>   net/ipv4/ip_input.c                    |  5 +++--
>   net/ipv4/tcp_ipv4.c                    |  3 ++-
>   net/ipv4/udp.c                         |  4 +++-
>   net/ipv6/addrconf.c                    |  7 +++
>   net/ipv6/icmp.c                        |  3 ++-
>   net/ipv6/ip6_input.c                   |  5 +++--
>   net/ipv6/tcp_ipv6.c                    |  2 +-
>   net/ipv6/udp.c                         |  3 ++-
>   14 files changed, 47 insertions(+), 9 deletions(-)
>
> diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
> index ebe94f2..1d46adc 100644
> --- a/Documentation/networking/ip-sysctl.txt
> +++ b/Documentation/networking/ip-sysctl.txt
> @@ -1206,6 +1206,13 @@ igmp_link_local_mcast_reports - BOOLEAN
>         224.0.0.X range.
>         Default TRUE
>
> +stealth - BOOLEAN
> +       Disable any reply not related to a listening socket,
> +       like RST/ACK for TCP and ICMP Port-Unreachable for UDP.
> +       Also disables ICMP replies to echo requests and timestamp
> +       and ICMP errors for unknown protocols.
> +       Default value is 0.
> +


Hmm, what about all other protocols besides TCP/UDP such as SCTP, DCCP,
etc? It seems it gives false expectations in such cases when the user
enables being "stealth", but finds out it has no effect at all there ...
nmap f.e. has a couple of scanning options for SCTP, and at least SCTP
is still relevant in telco space.

I know this question has been asked before, but the only answer on this
was so far: "well, I've never played with SCTP before" ... :/


[PATCH] solos-pci: Increase headroom on received packets

2015-09-16 Thread David Woodhouse
A comment in include/linux/skbuff.h says that:

 * Various parts of the networking layer expect at least 32 bytes of
 * headroom, you should not reduce this.

This was demonstrated by a panic when handling fragmented IPv6 packets:
http://marc.info/?l=linux-netdev&m=144236093519172&w=2

It's not entirely clear if that comment is still valid — and if it is,
perhaps netif_rx() ought to be enforcing it with a warning.

But either way, it is rather stupid from a performance point of view
for us to be receiving packets into a buffer which doesn't have enough
room to prepend an Ethernet header — it means that *every* incoming
packet is going to need to be reallocated. So let's fix that.

Signed-off-by: David Woodhouse 
--- 
Tested in the DMA code path; I don't believe the DMA-capable devices
can still be used in MMIO mode. Simon, Guy, would you be able to test
the MMIO version?

diff --git a/drivers/atm/solos-pci.c b/drivers/atm/solos-pci.c
index 74e18b0..be8225e 100644
--- a/drivers/atm/solos-pci.c
+++ b/drivers/atm/solos-pci.c
@@ -805,13 +805,13 @@ static void solos_bh(unsigned long card_arg)
                     continue;
                 }
 
-                skb = alloc_skb(size + 1, GFP_ATOMIC);
+                skb = alloc_skb(size + NET_SKB_PAD + 1, GFP_ATOMIC);
                 if (!skb) {
                     if (net_ratelimit())
                         dev_warn(&card->dev->dev, "Failed to allocate sk_buff for RX\n");
                     continue;
                 }
-
+                skb_reserve(skb, NET_SKB_PAD);
                 memcpy_fromio(skb_put(skb, size),
                               RX_BUF(card, port) + sizeof(*header),
                               size);
@@ -869,8 +869,10 @@ static void solos_bh(unsigned long card_arg)
         /* Allocate RX skbs for any ports which need them */
         if (card->using_dma && card->atmdev[port] &&
             !card->rx_skb[port]) {
-            struct sk_buff *skb = alloc_skb(RX_DMA_SIZE, GFP_ATOMIC);
+            struct sk_buff *skb = alloc_skb(RX_DMA_SIZE + NET_SKB_PAD,
+                                            GFP_ATOMIC);
             if (skb) {
+                skb_reserve(skb, NET_SKB_PAD);
                 SKB_CB(skb)->dma_addr =
                     dma_map_single(&card->dev->dev, skb->data,
                                    RX_DMA_SIZE, DMA_FROM_DEVICE);

-- 
David Woodhouse                            Open Source Technology Centre
david.woodho...@intel.com                              Intel Corporation





[PATCH v2] solos-pci: Increase headroom on received packets

2015-09-16 Thread David Woodhouse
A comment in include/linux/skbuff.h says that:

 * Various parts of the networking layer expect at least 32 bytes of
 * headroom, you should not reduce this.

This was demonstrated by a panic when handling fragmented IPv6 packets:
http://marc.info/?l=linux-netdev&m=144236093519172&w=2

It's not entirely clear if that comment is still valid — and if it is,
perhaps netif_rx() ought to be enforcing it with a warning.

But either way, it is rather stupid from a performance point of view
for us to be receiving packets into a buffer which doesn't have enough
room to prepend an Ethernet header — it means that *every* incoming
packet is going to need to be reallocated. So let's fix that.

Signed-off-by: David Woodhouse 
---
diff --git a/drivers/atm/solos-pci.c b/drivers/atm/solos-pci.c
index 74e18b0..3d7fb65 100644
--- a/drivers/atm/solos-pci.c
+++ b/drivers/atm/solos-pci.c
@@ -805,7 +805,12 @@ static void solos_bh(unsigned long card_arg)
                     continue;
                 }
 
-                skb = alloc_skb(size + 1, GFP_ATOMIC);
+                /* Use netdev_alloc_skb() because it adds NET_SKB_PAD of
+                 * headroom, and ensures we can route packets back out an
+                 * Ethernet interface (for example) without having to
+                 * reallocate. Adding NET_IP_ALIGN also ensures that both
+                 * PPPoATM and PPPoEoBR2684 packets end up aligned. */
+                skb = netdev_alloc_skb_ip_align(NULL, size + 1);
                 if (!skb) {
                     if (net_ratelimit())
                         dev_warn(&card->dev->dev, "Failed to allocate sk_buff for RX\n");
@@ -869,7 +874,10 @@ static void solos_bh(unsigned long card_arg)
         /* Allocate RX skbs for any ports which need them */
         if (card->using_dma && card->atmdev[port] &&
             !card->rx_skb[port]) {
-            struct sk_buff *skb = alloc_skb(RX_DMA_SIZE, GFP_ATOMIC);
+            /* Unlike the MMIO case (qv) we can't add NET_IP_ALIGN
+             * here; the FPGA can only DMA to addresses which are
+             * aligned to 4 bytes. */
+            struct sk_buff *skb = dev_alloc_skb(RX_DMA_SIZE);
             if (skb) {
                 SKB_CB(skb)->dma_addr =
                     dma_map_single(&card->dev->dev, skb->data,

-- 
David Woodhouse                            Open Source Technology Centre
david.woodho...@intel.com                              Intel Corporation





Re: [PATCH v2 net] net/mlx4_en: really allow to change RSS key

2015-09-16 Thread Or Gerlitz
On Wed, Sep 16, 2015 at 4:29 AM, Eric Dumazet  wrote:
> From: Eric Dumazet 
>
> When changing rss key, we do not want to overwrite user provided key
> by the one provided by netdev_rss_key_fill(), which is the host random
> key generated at boot time.
>
> Fixes: 947cbb0ac242 ("net/mlx4_en: Support for configurable RSS hash function")
> Signed-off-by: Eric Dumazet 
> Cc: Eyal Perry 
> CC: Amir Vadai 

Acked-by: Or Gerlitz 

Dave, can you please push it to -stable of >= 3.19 ?


[ANNOUNCE] libnftnl 1.0.4 release

2015-09-16 Thread Pablo Neira Ayuso
Hi!

The Netfilter project proudly presents:

libnftnl 1.0.4

libnftnl is a userspace library providing a low-level netlink
programming interface (API) to the in-kernel nf_tables subsystem. The
library libnftnl has been previously known as libnftables. This
library is currently used by the nft command line tool.

This release comes with new features available up to 4.2, see
ChangeLog for more details.

In this release, we have renamed most of the library symbols to use
the nftnl_ prefix while keeping aliases to the old ones. We would like
to reserve the nft_ prefix for our higher level library which should
land anytime soon. We have kept aliases around to reduce the impact of
these changes, but they will be deprecated soon. Sorry for the
inconvenience in any case.

You can download this library from:

http://www.netfilter.org/projects/libnftnl/downloads.html
ftp://ftp.netfilter.org/pub/libnftnl/

Thanks!
Alvaro Neira (12):
  ruleset: clean up the variable names in the xml/json parsing functions
  src: don't create iterator with empty list
  ruleset: refactor nft_ruleset_*_parse_ruleset()
  set: refactor code in json parse function
  rule: don't release the tree parameter from nft_jansson_parse_rule()
  ruleset: fix leak in json/xml in set lists
  ruleset: fix crash if we free sets included in the set_list
  ruleset: crash from error path when we build the xml/json tree
  xml: test if the root node name is initialized
  examples: add nft-ruleset-parse-file
  ruleset: add nft_ruleset_ctx_free
  parser: Add operation not supported error message

Alvaro Neira Ayuso (4):
  buffer: fix missing XML string tag in nft_buf_close
  src: add command tag in JSON/XML export support
  src: add support to import JSON/XML with the new command tag
  tests: update JSON/XML tests with the new syntax

Arturo Borrero Gonzalez (1):
  expr: dynset: fix json/xml parsing

Balazs Scheidler (1):
  expr: redir: fix snprintf to return the number of bytes printed

Carlos Falgueras García (1):
  src: fix memory leaks at nft_[object]_nlmsg_parse

Pablo Neira Ayuso (17):
  src: add missing include in utils.c
  ruleset: fix more leaks in error path
  src: split internal.h is smaller files
  Makefile: internal.h now resides in include
  src: restore static array with expression operations
  src: add batch abstraction
  table: add netdev family support
  chain: add netdev family support
  expr: immediate: fix leak in expression destroy path
  src: introduce nftnl_* aliases for all existing functions
  src: rename existing functions to use the nftnl_ prefix
  src: add compat header file definitions
  src: rename nftnl_rule_expr to nftnl_expr
  src: rename NFTNL_RULE_EXPR_ATTR to NFTNL_EXPR_
  src: get rid of _ATTR_ infix in new nfntl_ definitions
  src: get rid of _attr_ infix in new nftnl_ definitions
  bump version to 1.0.4

Patrick McHardy (11):
  list: fix prefetch dummy
  set: add support for set timeouts
  set_elem: add timeout support
  set: print set elem timeout information
  set_elem: add support for userdata
  expr: add support for the dynset expr
  headers: resync headers for new register definitions
  data: increase maximum possible data size
  expr: seperate expression parsing and building functions
  set_elem: support expressions attached to set elements
  dynset: support expression templates



Re: [PATCH 27/31] net/tipc: use kmemdup rather than duplicating its implementation

2015-09-16 Thread Andrzej Hajda
Ping.

Regards
Andrzej

On 08/07/2015 09:59 AM, Andrzej Hajda wrote:
> The patch was generated using fixed coccinelle semantic patch
> scripts/coccinelle/api/memdup.cocci [1].
>
> [1]: http://permalink.gmane.org/gmane.linux.kernel/2014320
>
> Signed-off-by: Andrzej Hajda 
> ---
>  net/tipc/server.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/net/tipc/server.c b/net/tipc/server.c
> index 922e04a..c187cad 100644
> --- a/net/tipc/server.c
> +++ b/net/tipc/server.c
> @@ -411,13 +411,12 @@ static struct outqueue_entry *tipc_alloc_entry(void 
> *data, int len)
>   if (!entry)
>   return NULL;
>  
> - buf = kmalloc(len, GFP_ATOMIC);
> + buf = kmemdup(data, len, GFP_ATOMIC);
>   if (!buf) {
>   kfree(entry);
>   return NULL;
>   }
>  
> - memcpy(buf, data, len);
>   entry->iov.iov_base = buf;
>   entry->iov.iov_len = len;
>  
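
For readers unfamiliar with kmemdup(), the change above replaces an
allocate-then-copy pair with a single call. A minimal userspace sketch of
the same idea follows; memdup_sketch() is a hypothetical name, and
malloc()/memcpy() stand in for kmalloc()/memcpy(), so this illustrates the
pattern and is not the kernel implementation.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Userspace analogue of kmemdup(): allocate len bytes and copy src
 * into them. Returns NULL on allocation failure; the caller owns and
 * frees the result. (Hypothetical helper, for illustration only.)
 */
static void *memdup_sketch(const void *src, size_t len)
{
	void *p;

	assert(src != NULL);

	p = malloc(len);
	if (p)
		memcpy(p, src, len);
	return p;
}
```

As in the patch, the caller keeps a single NULL check on the returned
buffer; the separate copy step disappears.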



[PATCH v8 4/4] can: Allwinner A10/A20 CAN Controller support - Kernel module

2015-09-16 Thread Gerhard Bertelsmann
Kernel module for Allwinner A10/A20 CAN

Signed-off-by: Gerhard Bertelsmann 
---

 drivers/net/can/Kconfig|  10 +
 drivers/net/can/Makefile   |   1 +
 drivers/net/can/sun4i_can.c| 857 +
 3 files changed, 868 insertions(+)


diff --git a/drivers/net/can/Kconfig b/drivers/net/can/Kconfig
index e8c96b8..6d04183 100644
--- a/drivers/net/can/Kconfig
+++ b/drivers/net/can/Kconfig
@@ -129,6 +129,16 @@ config CAN_RCAR
  To compile this driver as a module, choose M here: the module will
  be called rcar_can.
 
+config CAN_SUN4I
+   tristate "Allwinner A10 CAN controller"
+   depends on MACH_SUN4I || MACH_SUN7I || COMPILE_TEST
+   ---help---
+ Say Y here if you want to use CAN controller found on Allwinner
+ A10/A20 SoCs.
+
+ To compile this driver as a module, choose M here: the module will
+ be called sun4i_can.
+
 config CAN_XILINXCAN
tristate "Xilinx CAN"
depends on ARCH_ZYNQ || ARM64 || MICROBLAZE || COMPILE_TEST
diff --git a/drivers/net/can/Makefile b/drivers/net/can/Makefile
index c533c62..1f21cef 100644
--- a/drivers/net/can/Makefile
+++ b/drivers/net/can/Makefile
@@ -27,6 +27,7 @@ obj-$(CONFIG_CAN_FLEXCAN) += flexcan.o
 obj-$(CONFIG_PCH_CAN)  += pch_can.o
 obj-$(CONFIG_CAN_GRCAN)+= grcan.o
 obj-$(CONFIG_CAN_RCAR) += rcar_can.o
+obj-$(CONFIG_CAN_SUN4I)+= sun4i_can.o
 obj-$(CONFIG_CAN_XILINXCAN)+= xilinx_can.o
 
 subdir-ccflags-y += -D__CHECK_ENDIAN__
diff --git a/drivers/net/can/sun4i_can.c b/drivers/net/can/sun4i_can.c
new file mode 100644
index 000..10d8497
--- /dev/null
+++ b/drivers/net/can/sun4i_can.c
@@ -0,0 +1,857 @@
+/*
+ * sun4i_can.c - CAN bus controller driver for Allwinner SUN4I based SoCs
+ *
+ * Copyright (C) 2013 Peter Chen
+ * Copyright (C) 2015 Gerhard Bertelsmann
+ * All rights reserved.
+ *
+ * Parts of this software are based on (derived from) the SJA1000 code by:
+ *   Copyright (C) 2014 Oliver Hartkopp 
+ *   Copyright (C) 2007 Wolfgang Grandegger 
+ *   Copyright (C) 2002-2007 Volkswagen Group Electronic Research
+ *   Copyright (C) 2003 Matthias Brukner, Trajet Gmbh, Rebenring 33,
+ *   38106 Braunschweig, GERMANY
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *notice, this list of conditions and the following disclaimer in the
+ *documentation and/or other materials provided with the distribution.
+ * 3. Neither the name of Volkswagen nor the names of its contributors
+ *may be used to endorse or promote products derived from this software
+ *without specific prior written permission.
+ *
+ * Alternatively, provided that this notice is retained in full, this
+ * software may be distributed under the terms of the GNU General
+ * Public License ("GPL") version 2, in which case the provisions of the
+ * GPL apply INSTEAD OF those given above.
+ *
+ * The provided data structures and external interfaces from this code
+ * are not restricted to be used by modules with a GPL compatible license.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH
+ * DAMAGE.
+ *
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define DRV_NAME "sun4i_can"
+
+/* Registers address (physical base address 0x01C2BC00) */
+#define SUN4I_REG_MSEL_ADDR    0x0000  /* CAN Mode Select */
+#define SUN4I_REG_CMD_ADDR 0x0004  /* CAN Command */
+#define SUN4I_REG_STA_ADDR 0x0008  /* CAN Status */
+#define SUN4I_REG_INT_ADDR 0x000c  /* CAN Interrupt Flag */
+#define SUN4I_REG_INTEN_ADDR   0x0010  /* CAN Interrupt Enable */
+#define SUN4I_REG_BTIME_ADDR   0x0014  /* CAN Bus Timing 0 */
+#define SUN4I_REG_TEWL_ADDR0x0018  /* 

Re: [PATCH RFC] solos-pci: Fix BUG() with shared skb

2015-09-16 Thread Simon Arlott
On Tue, September 15, 2015 20:10, David Woodhouse wrote:
> On Wed, 2013-09-04 at 21:41 +0100, David Woodhouse wrote:
>> +++ b/drivers/atm/solos-pci.c
>> @@ -1145,19 +1145,19 @@ static int psend(struct atm_vcc *vcc, struct sk_buff 
>> *skb)
>> +>   > if (skb_headroom(skb) < sizeof(*header)) {
>> +>   >   > struct sk_buff *nskb;
>> +
>> +>   >   > nskb = skb_realloc_headroom(skb, sizeof(*header));
>> +>   >   > if (!nskb) {
>> +>   >   >   > solos_pop(vcc, skb);
>> +>   >   >   > return -ENOMEM;
>> +>   >   > }
>> +>   >   > if (skb->truesize != nskb->truesize)
>> +>   >   >   > atm_force_charge(vcc, nskb->truesize - skb->truesize);
>> +
>> +>   >   > dev_kfree_skb_any(skb);
>> +>   >   > skb = nskb;
>>  >   > }
>
> Simon, did you ever test this?
> Can you still (tell me how to) reproduce the original problem? I think
> that sending on br2684 was necessary but not sufficient...?

I'm currently using this but without the call to atm_force_charge().

I don't know how to reproduce the BUG() but it hasn't happened again.

-- 
Simon Arlott


Experiences with slub bulk use-case for network stack

2015-09-16 Thread Jesper Dangaard Brouer

Hint: this leads up to discussing whether the current bulk *ALLOC* API
needs to be changed...

Alex and I have been working hard on a practical use-case for SLAB
bulking (mostly slUb) in the network stack.  Here is a summary of
what we have learned so far.

Bulk free'ing SKBs during TX completion is a big and easy win.

Specifically for slUb, the normal path for freeing these objects (which
are not on c->freelist) requires a locked double_cmpxchg per object.
The bulk free (via the detached freelist patch) allows all objects
belonging to the same slab-page to be free'ed with a single locked
double_cmpxchg. Thus, the bulk free speedup is quite an improvement.

The slUb alloc is hard to beat on speed:
 * accessing c->freelist, local cmpxchg 9 cycles (38% of cost)
 * c->freelist is refilled with single locked cmpxchg

In micro benchmarking it looks like we can beat alloc, because we do a
local_irq_{disable,enable} (cost 7 cycles) and then pull out all
objects in c->freelist, thus saving 9 cycles per object (counting
from the 2nd object).

However, in practical use-cases we are seeing the single object alloc
win over bulk alloc; we believe this to be due to prefetching.  When
c->freelist gets (semi) cache-cold, it becomes more expensive to walk
the freelist (which is a basic singly linked list to the next free
object).

For bulk alloc the full freelist is walked (right away) and objects
pulled out into the array.  For normal single object alloc only a
single object is returned, but it does a prefetch on the next object
pointer.  Thus, next time single alloc is called the object will have
been prefetched.  Doing prefetch in bulk alloc only helps a little, as
it does not have enough "time" between accessing/walking the freelist
for objects.

So, how can we solve this and make bulk alloc faster?


Alex and I had the idea of having bulk alloc return an "allocator
specific cache" data-structure (and adding some helpers to access it).

In the slUb case, the freelist is a singly linked pointer list.  In
the network stack the skb objects have a skb->next pointer, which is
located at the same position as the freelist pointer.  Thus, the
freelist could simply be returned directly and interpreted as a
skb-list.  The helper API would then do the prefetching when pulling
out objects.

For the slUb case, we would simply cmpxchg either c->freelist or
page->freelist with a NULL ptr, and then own all objects on the
freelist. This also reduces the time we keep IRQs disabled.

API wise, we don't (necessarily) know how many objects are on the
freelist (without first walking the list, which would cause stalls on
data, which we are trying to avoid).

Thus, the API of always returning the exact number of requested
objects will not work...
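
The proposal above can be sketched in userspace C. This is only an
illustration under stated assumptions: struct obj, struct cache,
bulk_detach() and list_pop() are hypothetical names, a plain pointer swap
stands in for the cmpxchg, and __builtin_prefetch() (a GCC/Clang builtin)
stands in for the kernel's prefetch helper.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical object: the first word doubles as the freelist link,
 * mirroring how skb->next sits where the slUb freelist pointer lives.
 */
struct obj {
	struct obj *next;
	int payload;
};

/* Hypothetical per-cpu cache holding a singly linked freelist. */
struct cache {
	struct obj *freelist;
};

/*
 * Bulk alloc: take ownership of the whole freelist in O(1), without
 * walking it. The number of objects obtained is not known here.
 * (Single-threaded stand-in for a cmpxchg against NULL.)
 */
static struct obj *bulk_detach(struct cache *c)
{
	struct obj *list;

	assert(c != NULL);
	list = c->freelist;
	c->freelist = NULL;
	return list;
}

/*
 * Helper: pop one object off a detached list and prefetch the one
 * after it, so the next pop is more likely to find it cache-warm.
 */
static struct obj *list_pop(struct obj **list)
{
	struct obj *o;

	assert(list != NULL);
	o = *list;
	if (!o)
		return NULL;
	*list = o->next;
	if (*list)
		__builtin_prefetch(*list);
	o->next = NULL;
	return o;
}
```

Note that the caller only learns how many objects it got by popping until
list_pop() returns NULL, which is the API consequence described above.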

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

(related to http://thread.gmane.org/gmane.linux.kernel.mm/137469)



Re: [PATCH v4] add stealth mode

2015-09-16 Thread Daniel Borkmann

On 09/16/2015 12:45 PM, Matteo Croce wrote:

2015-09-16 12:26 GMT+02:00 Daniel Borkmann :

On 09/16/2015 11:54 AM, Matteo Croce wrote:


Add option to disable any reply not related to a listening socket,
like RST/ACK for TCP and ICMP Port-Unreachable for UDP.
Also disables ICMP replies to echo request and timestamp.
The stealth mode can be enabled selectively for a single interface.

Signed-off-by: Matteo Croce 
---
rebased on 4.3-rc1

   Documentation/networking/ip-sysctl.txt | 14 ++
   include/linux/inetdevice.h |  1 +
   include/linux/ipv6.h   |  1 +
   include/uapi/linux/ip.h|  1 +
   net/ipv4/devinet.c |  1 +
   net/ipv4/icmp.c|  6 ++
   net/ipv4/ip_input.c|  5 +++--
   net/ipv4/tcp_ipv4.c|  3 ++-
   net/ipv4/udp.c |  4 +++-
   net/ipv6/addrconf.c|  7 +++
   net/ipv6/icmp.c|  3 ++-
   net/ipv6/ip6_input.c   |  5 +++--
   net/ipv6/tcp_ipv6.c|  2 +-
   net/ipv6/udp.c |  3 ++-
   14 files changed, 47 insertions(+), 9 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt
b/Documentation/networking/ip-sysctl.txt
index ebe94f2..1d46adc 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -1206,6 +1206,13 @@ igmp_link_local_mcast_reports - BOOLEAN
 224.0.0.X range.
 Default TRUE

+stealth - BOOLEAN
+   Disable any reply not related to a listening socket,
+   like RST/ACK for TCP and ICMP Port-Unreachable for UDP.
+   Also disables ICMP replies to echo requests and timestamp
+   and ICMP errors for unknown protocols.
+   Default value is 0.
+


Hmm, what about all other protocols besides TCP/UDP such as SCTP, DCCP,
etc? It seems it gives false expectations in such cases when the user
enables being "stealth", but finds out it has no effect at all there ...
nmap f.e. has a couple of scanning options for SCTP, and at least SCTP
is still relevant in telco space.

I know this question has been asked before, but the only answer on this
was so far: "well, I've never played with SCTP before" ... :/


Right, I was thinking to add them in a later version


I feel, there would be many follow-ups. :/ Architecturally on the bigger
picture, nft and its connection tracker would be the much better place for
such policies, and it also provides matches for various protocols already.

What has been tried to address this more generically f.e. inside netfilter
subsystem, and why is it absolutely not possible to extend this functionality
over there?

Sorry if my question is stubborn, but from reading over the old threads
it still is not fully clear to me.

Thanks again,
Daniel


[PATCH net] bna: check for dma mapping errors

2015-09-16 Thread Ivan Vecera
Check for DMA mapping errors, recover from them and register them in
ethtool stats like other errors.

Cc: Rasesh Mody 
Signed-off-by: Ivan Vecera 
---
 drivers/net/ethernet/brocade/bna/bna_tx_rx.c|  2 ++
 drivers/net/ethernet/brocade/bna/bna_types.h|  1 +
 drivers/net/ethernet/brocade/bna/bnad.c | 29 -
 drivers/net/ethernet/brocade/bna/bnad.h |  2 ++
 drivers/net/ethernet/brocade/bna/bnad_ethtool.c |  4 
 5 files changed, 37 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/brocade/bna/bna_tx_rx.c 
b/drivers/net/ethernet/brocade/bna/bna_tx_rx.c
index 5d0753c..04b0d16 100644
--- a/drivers/net/ethernet/brocade/bna/bna_tx_rx.c
+++ b/drivers/net/ethernet/brocade/bna/bna_tx_rx.c
@@ -2400,6 +2400,7 @@ bna_rx_create(struct bna *bna, struct bnad *bnad,
q0->rcb->id = 0;
q0->rx_packets = q0->rx_bytes = 0;
q0->rx_packets_with_error = q0->rxbuf_alloc_failed = 0;
+   q0->rxbuf_map_failed = 0;
 
bna_rxq_qpt_setup(q0, rxp, dpage_count, PAGE_SIZE,
_mem[i], _mem[i], _mem[i]);
@@ -2428,6 +2429,7 @@ bna_rx_create(struct bna *bna, struct bnad *bnad,
: rx_cfg->q1_buf_size;
q1->rx_packets = q1->rx_bytes = 0;
q1->rx_packets_with_error = q1->rxbuf_alloc_failed = 0;
+   q1->rxbuf_map_failed = 0;
 
bna_rxq_qpt_setup(q1, rxp, hpage_count, PAGE_SIZE,
_mem[i], _mem[i],
diff --git a/drivers/net/ethernet/brocade/bna/bna_types.h 
b/drivers/net/ethernet/brocade/bna/bna_types.h
index e0e797f..c438d03 100644
--- a/drivers/net/ethernet/brocade/bna/bna_types.h
+++ b/drivers/net/ethernet/brocade/bna/bna_types.h
@@ -587,6 +587,7 @@ struct bna_rxq {
u64 rx_bytes;
u64 rx_packets_with_error;
u64 rxbuf_alloc_failed;
+   u64 rxbuf_map_failed;
 };
 
 /* RxQ pair */
diff --git a/drivers/net/ethernet/brocade/bna/bnad.c 
b/drivers/net/ethernet/brocade/bna/bnad.c
index 506047c..21a0cfc 100644
--- a/drivers/net/ethernet/brocade/bna/bnad.c
+++ b/drivers/net/ethernet/brocade/bna/bnad.c
@@ -399,7 +399,13 @@ bnad_rxq_refill_page(struct bnad *bnad, struct bna_rcb 
*rcb, u32 nalloc)
}
 
dma_addr = dma_map_page(>pcidev->dev, page, page_offset,
-   unmap_q->map_size, DMA_FROM_DEVICE);
+   unmap_q->map_size, DMA_FROM_DEVICE);
+   if (dma_mapping_error(>pcidev->dev, dma_addr)) {
+   put_page(page);
+   BNAD_UPDATE_CTR(bnad, rxbuf_map_failed);
+   rcb->rxq->rxbuf_map_failed++;
+   goto finishing;
+   }
 
unmap->page = page;
unmap->page_offset = page_offset;
@@ -454,8 +460,15 @@ bnad_rxq_refill_skb(struct bnad *bnad, struct bna_rcb 
*rcb, u32 nalloc)
rcb->rxq->rxbuf_alloc_failed++;
goto finishing;
}
+
dma_addr = dma_map_single(>pcidev->dev, skb->data,
  buff_sz, DMA_FROM_DEVICE);
+   if (dma_mapping_error(>pcidev->dev, dma_addr)) {
+   dev_kfree_skb_any(skb);
+   BNAD_UPDATE_CTR(bnad, rxbuf_map_failed);
+   rcb->rxq->rxbuf_map_failed++;
+   goto finishing;
+   }
 
unmap->skb = skb;
dma_unmap_addr_set(>vector, dma_addr, dma_addr);
@@ -3025,6 +3038,11 @@ bnad_start_xmit(struct sk_buff *skb, struct net_device 
*netdev)
unmap = head_unmap;
dma_addr = dma_map_single(>pcidev->dev, skb->data,
  len, DMA_TO_DEVICE);
+   if (dma_mapping_error(>pcidev->dev, dma_addr)) {
+   dev_kfree_skb_any(skb);
+   BNAD_UPDATE_CTR(bnad, tx_skb_map_failed);
+   return NETDEV_TX_OK;
+   }
BNA_SET_DMA_ADDR(dma_addr, >vector[0].host_addr);
txqent->vector[0].length = htons(len);
dma_unmap_addr_set(>vectors[0], dma_addr, dma_addr);
@@ -3056,6 +3074,15 @@ bnad_start_xmit(struct sk_buff *skb, struct net_device 
*netdev)
 
dma_addr = skb_frag_dma_map(>pcidev->dev, frag,
0, size, DMA_TO_DEVICE);
+   if (dma_mapping_error(>pcidev->dev, dma_addr)) {
+   /* Undo the changes starting at tcb->producer_index */
+   bnad_tx_buff_unmap(bnad, unmap_q, q_depth,
+  tcb->producer_index);
+   dev_kfree_skb_any(skb);
+   BNAD_UPDATE_CTR(bnad, tx_skb_map_failed);
+   

Re: IPv6 routing/fragmentation panic

2015-09-16 Thread Florian Westphal
David Woodhouse  wrote:
> On Wed, 2015-09-16 at 01:48 +0200, Florian Westphal wrote:
> > 
> > What I don't understand is why you see this with fragmented ipv6 
> > packets only (and not with all ipv6 forwarded skbs).
> > 
> > Something like this copy-pastry from ip_finish_output2 should fix it:
> 
> That works; thanks.
> 
> Tested-by: David Woodhouse 
> 
> A little extra debugging output shows that the offending fragments were
> arriving here with skb_headroom(skb)==10. Which is reasonable, being
> the Solos ADSL card's header of 8 bytes followed by 2 bytes of PPP
> frame type.
> 
> The non-fragmented packets, on the other hand, are arriving with a
> headroom of 42 bytes. Could something else already have reallocated
> them before they get that far?

Yep.  I missed

if (skb_cow(skb, dst->dev->hard_header_len)) {

call in ip6_forward().

Problem is of course that we only expand headroom of the skb
and not of the fragment(s) stored in that skbs frag list.

So we have several options for a fix.

- expand headroom in ip6_finish_output2, like we do for ipv4
- expand headroom in ip6_fragment
- defer to slowpath if frags don't have enough headroom.

The latter is the smallest patch and would not add test for locally
generated, non-fragmented skbs.

(not even compile tested)
David, could you test this?  I'd do an official patch submission then.

diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -586,6 +586,7 @@ int ip6_fragment(struct sock *sk, struct sk_buff *skb,
frag_id = ipv6_select_ident(net, _hdr(skb)->daddr,
_hdr(skb)->saddr);
 
+   hroom = LL_RESERVED_SPACE(rt->dst.dev);
if (skb_has_frag_list(skb)) {
int first_len = skb_pagelen(skb);
struct sk_buff *frag2;
@@ -599,7 +600,7 @@ int ip6_fragment(struct sock *sk, struct sk_buff *skb,
/* Correct geometry. */
if (frag->len > mtu ||
((frag->len & 7) && frag->next) ||
-   skb_headroom(frag) < hlen)
+   skb_headroom(frag) < (hlen + hroom))
goto slow_path_clean;
 
/* Partially cloned skb? */
@@ -724,7 +725,6 @@ slow_path:
 */
 
*prevhdr = NEXTHDR_FRAGMENT;
-   hroom = LL_RESERVED_SPACE(rt->dst.dev);
troom = rt->dst.dev->needed_tailroom;
 
/*


Re: [linux-next] oops in ip_route_input_noref

2015-09-16 Thread Richard Alpe
On 2015-09-16 15:07, David Ahern wrote:
> On 9/16/15 5:50 AM, Richard Alpe wrote:
>> On 2015-09-16 11:24, Sergey Senozhatsky wrote:
>>> Hi,
>>>
>>> 4.3.0-rc1-next-20150916
>>>
>>> oops after removal of rndis usb device
> 
> Hi Sergey:
> 
> Is this with KVM or baremetal?
> 
> -8<-
> thanks for the analysis
> 
>>> addr2line -e vmlinux -i 0x8146c0b1
>>> net/ipv4/route.c:1815
>>> net/ipv4/route.c:1905
>>>
>>>
>>> which seems to be this line ip_route_input_noref()->ip_route_input_slow():
>>> ...
>>> 1813 rth->rt_is_input = 1;
>>> 1814 if (res.table)
>>> 1815 rth->rt_table_id = res.table->tb_id;
>>> 1816
>>> ...
>>>
>>>
>>> added by b7503e0cdb5dbec5d201aa69dc14679b5ae8
>>>
>>>  net: Add FIB table id to rtable
>>>
>>>  Add the FIB table id to rtable to make the information available for
>>>  IPv4 as it is for IPv6.
>>>
>>>
>>> -ss
> 
> Hi Richard:
> 
>> I too get an Oops in ip_route_input_noref(). It happens occasionally during
>> bootup.
>> KVM environment using virtio driver. Let me know if you need any additional 
>> info or
>> if you want me to try to bisect it.
>>
>> Starting network...
>> ...
>> [0.877040] BUG: unable to handle kernel NULL pointer dereference at 
>> 0056
>> [0.877597] IP: [] ip_route_input_noref+0x1a2/0xb00
> 
> Can you send me your kernel config and qemu command line? KVM with virtio 
> networking is a primary test vehicle, and I did not encounter this at all.
Sure thing. Not sure how ppl normally provide files on netdev but I'm just going
to go ahead and paste them here :)

$ ps aux | grep kvm
qemu-system-x86_64 -enable-kvm -name tipc-medium-node1 -S -machine 
pc-0.14,accel=kvm,usb=off -m 1024 -realtime mlock=off -smp 
2,sockets=2,cores=1,threads=1 -uuid cdec478a-5f0d-49f1-b25e-fac4ca0b290c 
-no-user-config -nodefaults -chardev 
socket,id=charmonitor,path=/var/lib/libvirt/qemu/tipc-medium-node1.monitor,server,nowait
 -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown 
-boot order=n,menu=on,strict=on -device 
piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -netdev tap,fd=25,id=hostnet0 
-device e1000,netdev=hostnet0,id=net0,mac=00:0f:ff:10:04:01,bus=pci.0,addr=0x3 
-chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 
-vnc 127.0.0.1:28101 -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device 
virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x4 -msg timestamp=on

$ cat .config
#
# Automatically generated file; DO NOT EDIT.
# Linux/x86 3.12.28 Kernel Configuration
#
CONFIG_64BIT=y
CONFIG_X86_64=y
CONFIG_X86=y
CONFIG_INSTRUCTION_DECODER=y
CONFIG_OUTPUT_FORMAT="elf64-x86-64"
CONFIG_ARCH_DEFCONFIG="arch/x86/configs/x86_64_defconfig"
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_HAVE_LATENCYTOP_SUPPORT=y
CONFIG_MMU=y
CONFIG_NEED_DMA_MAP_STATE=y
CONFIG_NEED_SG_DMA_LENGTH=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_BUG_RELATIVE_POINTERS=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_ARCH_HAS_CPU_RELAX=y
CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y
CONFIG_ARCH_HAS_CPU_AUTOPROBE=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK=y
CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK=y
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
CONFIG_ARCH_WANT_HUGE_PMD_SHARE=y
CONFIG_ARCH_WANT_GENERAL_HUGETLB=y
CONFIG_ZONE_DMA32=y
CONFIG_AUDIT_ARCH=y
CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y
CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
CONFIG_X86_64_SMP=y
CONFIG_X86_HT=y
CONFIG_ARCH_HWEIGHT_CFLAGS="-fcall-saved-rdi -fcall-saved-rsi -fcall-saved-rdx 
-fcall-saved-rcx -fcall-saved-r8 -fcall-saved-r9 -fcall-saved-r10 
-fcall-saved-r11"
CONFIG_ARCH_CPU_PROBE_RELEASE=y
CONFIG_ARCH_SUPPORTS_UPROBES=y
CONFIG_SUSE_KERNEL=y
# CONFIG_SUSE_KERNEL_SUPPORTED is not set
# CONFIG_SPLIT_PACKAGE is not set
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"
CONFIG_IRQ_WORK=y
CONFIG_BUILDTIME_EXTABLE_SORT=y

#
# General setup
#
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_CROSS_COMPILE=""
# CONFIG_COMPILE_TEST is not set
CONFIG_LOCALVERSION=""
# CONFIG_LOCALVERSION_AUTO is not set
CONFIG_HAVE_KERNEL_GZIP=y
CONFIG_HAVE_KERNEL_BZIP2=y
CONFIG_HAVE_KERNEL_LZMA=y
CONFIG_HAVE_KERNEL_XZ=y
CONFIG_HAVE_KERNEL_LZO=y
CONFIG_HAVE_KERNEL_LZ4=y
CONFIG_KERNEL_GZIP=y
# CONFIG_KERNEL_BZIP2 is not set
# CONFIG_KERNEL_LZMA is not set
# CONFIG_KERNEL_XZ is not set
# CONFIG_KERNEL_LZO is not set
# CONFIG_KERNEL_LZ4 is not set
CONFIG_DEFAULT_HOSTNAME=&qu

Re: [linux-next] oops in ip_route_input_noref

2015-09-16 Thread Richard Alpe
On 2015-09-16 15:53, Richard Alpe wrote:
> On 2015-09-16 15:07, David Ahern wrote:
>> On 9/16/15 5:50 AM, Richard Alpe wrote:
>>> On 2015-09-16 11:24, Sergey Senozhatsky wrote:
>>>> Hi,
>>>>
>>>> 4.3.0-rc1-next-20150916
>>>>
>>>> oops after removal of rndis usb device
>>
>> Hi Sergey:
>>
>> Is this with KVM or baremetal?
>>
>> -8<-
>> thanks for the analysis
>>
>>>> addr2line -e vmlinux -i 0x8146c0b1
>>>> net/ipv4/route.c:1815
>>>> net/ipv4/route.c:1905
>>>>
>>>>
>>>> which seems to be this line ip_route_input_noref()->ip_route_input_slow():
>>>> ...
>>>> 1813 rth->rt_is_input = 1;
>>>> 1814 if (res.table)
>>>> 1815 rth->rt_table_id = res.table->tb_id;
>>>> 1816
>>>> ...
>>>>
>>>>
>>>> added by b7503e0cdb5dbec5d201aa69dc14679b5ae8
>>>>
>>>>  net: Add FIB table id to rtable
>>>>
>>>>  Add the FIB table id to rtable to make the information available for
>>>>  IPv4 as it is for IPv6.
>>>>
>>>>
>>>> -ss
>>
>> Hi Richard:
>>
>>> I too get an Oops in ip_route_input_noref(). It happens occasionally during
>>> bootup.
>>> KVM environment using virtio driver. Let me know if you need any additional 
>>> info or
>>> if you want me to try to bisect it.
>>>
>>> Starting network...
>>> ...
>>> [0.877040] BUG: unable to handle kernel NULL pointer dereference at 
>>> 0056
>>> [0.877597] IP: [] ip_route_input_noref+0x1a2/0xb00
>>
>> Can you send me your kernel config and qemu command line? KVM with virtio 
>> networking is a primary test vehicle, and I did not encounter this at all.
> Sure thing. Not sure how ppl normally provide files on netdev but I'm just 
> going
> to go ahead and paste them here :)
> 
> $ ps aux | grep kvm
> qemu-system-x86_64 -enable-kvm -name tipc-medium-node1 -S -machine 
> pc-0.14,accel=kvm,usb=off -m 1024 -realtime mlock=off -smp 
> 2,sockets=2,cores=1,threads=1 -uuid cdec478a-5f0d-49f1-b25e-fac4ca0b290c 
> -no-user-config -nodefaults -chardev 
> socket,id=charmonitor,path=/var/lib/libvirt/qemu/tipc-medium-node1.monitor,server,nowait
>  -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown 
> -boot order=n,menu=on,strict=on -device 
> piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -netdev tap,fd=25,id=hostnet0 
> -device 
> e1000,netdev=hostnet0,id=net0,mac=00:0f:ff:10:04:01,bus=pci.0,addr=0x3 
> -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 
> -vnc 127.0.0.1:28101 -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device 
> virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x4 -msg timestamp=on

Sorry about that, the kvm cmdline was a copy-paste error. Here's the right one
using virtio.

$ ps aux | grep qemu
qemu-system-x86_64 -enable-kvm -name tipc-large-node16 -S -machine 
pc-0.14,accel=kvm,usb=off -m 1024 -realtime mlock=off -smp 
2,sockets=2,cores=1,threads=1 -uuid 5c2ffa5f-fc39-47a2-9868-9ef93bada31a 
-no-user-config -nodefaults -chardev 
socket,id=charmonitor,path=/var/lib/libvirt/qemu/tipc-large-node16.monitor,server,nowait
 -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown 
-boot order=n,menu=on,strict=on -device 
piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -netdev 
tap,fd=25,id=hostnet0,vhost=on,vhostfd=48 -device 
virtio-net-pci,netdev=hostnet0,id=net0,mac=00:0f:ff:10:05:16,bus=pci.0,addr=0x3 
-chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 
-vnc 127.0.0.1:29116 -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device 
virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x4 -msg timestamp=on

Regards
Richard

> 
> $ cat .config
> #
> # Automatically generated file; DO NOT EDIT.
> # Linux/x86 3.12.28 Kernel Configuration
> #
> CONFIG_64BIT=y
> CONFIG_X86_64=y
> CONFIG_X86=y
> CONFIG_INSTRUCTION_DECODER=y
> CONFIG_OUTPUT_FORMAT="elf64-x86-64"
> CONFIG_ARCH_DEFCONFIG="arch/x86/configs/x86_64_defconfig"
> CONFIG_LOCKDEP_SUPPORT=y
> CONFIG_STACKTRACE_SUPPORT=y
> CONFIG_HAVE_LATENCYTOP_SUPPORT=y
> CONFIG_MMU=y
> CONFIG_NEED_DMA_MAP_STATE=y
> CONFIG_NEED_SG_DMA_LENGTH=y
> CONFIG_GENERIC_ISA_DMA=y
> CONFIG_GENERIC_BUG=y
> CONFIG_GENERIC_BUG_RELATIVE_POINTERS=y
> CONFIG_GENERIC_HWEIGHT=y
> CONFIG_ARCH_MAY_HAVE_PC_FDC=y
> CONFIG_RWSEM_XCHGADD_ALGORITHM=y
> CONFIG_GENERIC_CALIBRATE_DELAY=y
> CONFIG_ARCH_HAS_CPU_RELAX=y

Re: IPv6 routing/fragmentation panic

2015-09-16 Thread David Woodhouse
On Wed, 2015-09-16 at 15:27 +0200, Florian Westphal wrote:
> @@ -599,7 +600,7 @@ int ip6_fragment(struct sock *sk, struct sk_buff
> *skb,
> /* Correct geometry. */
> if (frag->len > mtu ||
> ((frag->len & 7) && frag->next) ||
> -   skb_headroom(frag) < hlen)
> +   skb_headroom(frag) < (hlen + hroom))
> goto slow_path_clean;
>  
> /* Partially cloned skb? */

My test is 'ping -s 2000', and I end up with a fragment of 1280 bytes
followed by a fragment of 776 bytes.

The test cited above is only actually running on the latter fragment
(which for some reason is fine and has headroom of 58 bytes).

The first, larger, fragment isn't being checked. And that's the one
with only 10 bytes of headroom.

[   62.027984] has frag list
[   62.030616] line 604 check frag ddc5fcc0 len 776 headroom 58 (hlen 40 hroom 
16) 
[   62.036720] line 678 send skb ded050c0 len 1280 headroom 10  
  
[   62.041096] skbuff: skb_under_panic: text:c125f9ca len:1294 put:14 head:dec89
000 data:dec88ffc tail:0xdec8950a end:0xdec89f50 dev:br-lan 
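
To make the numbers in this log concrete, here is a userspace sketch of the
"correct geometry" test from the quoted hunk. struct frag_info and
frag_fast_path_ok() are hypothetical stand-ins for the sk_buff fields and
the inline condition in ip6_fragment(); this is not the kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the sk_buff fields the check looks at. */
struct frag_info {
	unsigned int len;      /* fragment length in bytes */
	unsigned int headroom; /* free bytes in front of the data */
	bool has_next;         /* another fragment follows this one */
};

/*
 * Returns true if the fragment may stay on the fast path: it fits in
 * the MTU, every fragment but the last is a multiple of 8 bytes, and
 * there is room for both the IPv6 headers (hlen) and the link-layer
 * header (hroom).
 */
static bool frag_fast_path_ok(const struct frag_info *f, unsigned int mtu,
			      unsigned int hlen, unsigned int hroom)
{
	assert(f != NULL);

	if (f->len > mtu)
		return false;
	if ((f->len & 7) && f->has_next)
		return false;
	if (f->headroom < hlen + hroom)
		return false;
	return true;
}
```

With hlen 40 and hroom 16 from the log (and an assumed mtu of 1280), the
776-byte fragment with 58 bytes of headroom passes, while a 1280-byte skb
with 10 bytes of headroom would be rejected if the same test were applied
to it.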

-- 
dwmw2





Re: [linux-next] oops in ip_route_input_noref

2015-09-16 Thread David Ahern

On 9/16/15 5:50 AM, Richard Alpe wrote:

On 2015-09-16 11:24, Sergey Senozhatsky wrote:

Hi,

4.3.0-rc1-next-20150916

oops after removal of rndis usb device


Hi Sergey:

Is this with KVM or baremetal?

-8<-
thanks for the analysis


addr2line -e vmlinux -i 0x8146c0b1
net/ipv4/route.c:1815
net/ipv4/route.c:1905


which seems to be this line ip_route_input_noref()->ip_route_input_slow():
...
1813 rth->rt_is_input = 1;
1814 if (res.table)
1815 rth->rt_table_id = res.table->tb_id;
1816
...


added by b7503e0cdb5dbec5d201aa69dc14679b5ae8

 net: Add FIB table id to rtable

 Add the FIB table id to rtable to make the information available for
 IPv4 as it is for IPv6.


-ss


Hi Richard:


I too get an Oops in ip_route_input_noref(). It happens occasionally during
bootup.
KVM environment using virtio driver. Let me know if you need any additional 
info or
if you want me to try to bisect it.

Starting network...
...
[0.877040] BUG: unable to handle kernel NULL pointer dereference at 
0056
[0.877597] IP: [] ip_route_input_noref+0x1a2/0xb00


Can you send me your kernel config and qemu command line? KVM with 
virtio networking is a primary test vehicle, and I did not encounter 
this at all.


Thanks,
David


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] ARCNET: fix hard_header_len limit

2015-09-16 Thread Michael Grzeschik
On Wed, Aug 05, 2015 at 05:34:51PM +0200, Michael Grzeschik wrote:
> On Thu, Jul 30, 2015 at 11:16:36AM -0700, David Miller wrote:
> > From: Michael Grzeschik 
> > Date: Thu, 30 Jul 2015 15:34:36 +0200
> > 
> > > The commit <9c7077622dd9> ("packet: make packet_snd fail on len smaller
> > > than l2 header") adds the check for minimum packet length of the used l2.
> > > For arcnet the hardware header length is not the complete archdr which
> > > includes hard + soft header. This patch changes the length to
> > > sizeof(arc_hardware).
> > > 
> > > Signed-off-by: Michael Grzeschik 
> > 
> > The hard header len is used for other purposes as well, are you sure
> > those don't get broken by this change?
> 
> Its meaning is to represent the amount of the hardware (link layer)
> data of one packet.
> 
> Which other purposes do you mean?
> Can you point to some code?
> 
> > Code assumes that if the data at the SKB mac pointer is taken, for
> > dev->hard_header_len bytes, that is exactly the link layer header.
> > And that this can be used to compare two MAC headers, copy the
> > MAC header from one packet to another, etc.
> 
> The link layer size of arcnet is 4 bytes long. 1 byte source, 1 byte
> dest and two offset bytes. As described by struct arc_hardware in
> if_arcnet.h . The above condition is fulfilled when the mac pointer
> is 0.
> 
> The bytes of struct archdr that follow have a variable meaning
> depending on the protocol used and are represented by a union
> (network layer).
> 
> In the case of raw packets, the payload comes immediately after the
> hard_header.
> 
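
To make the size argument concrete, here is a sketch with user-space mirrors of the two structures. The layout of `toy_arc_hardware` follows the description above (one source byte, one destination byte, two offset bytes). The soft-header union in `toy_archdr` is a made-up placeholder, since the real union members are not shown in this thread, and all `toy_` names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Link-layer (hardware) header as described: 4 bytes in total. */
struct toy_arc_hardware {
	uint8_t source;		/* source address */
	uint8_t dest;		/* destination address */
	uint8_t offset[2];	/* offset bytes */
};

/* Placeholder for the protocol-dependent soft header (assumption). */
union toy_arc_soft {
	uint8_t raw[4];
	struct {
		uint8_t proto;
		uint8_t split_flag;
		uint16_t sequence;
	} example;
};

/* Full header: hard part followed by the protocol-dependent part. */
struct toy_archdr {
	struct toy_arc_hardware hard;
	union toy_arc_soft soft;
};

static_assert(sizeof(struct toy_arc_hardware) == 4,
	      "hardware header is expected to be 4 bytes");

/* Minimum packet length if only the hardware header is required. */
static unsigned int toy_min_len_hard_only(void)
{
	return sizeof(struct toy_arc_hardware);
}

/* Minimum packet length if the whole archdr is required. */
static unsigned int toy_min_len_full(void)
{
	return sizeof(struct toy_archdr);
}
```

If the minimum length is taken from the full header, a raw packet whose 4-byte link-layer header is complete but whose payload is short would be rejected by a "length smaller than l2 header" check.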

Ping!

I have the cleanup patches from Joe Perches and several ARCNET patches
on top, waiting to be posted on the list. What is your opinion on the
maintainer request I sent some weeks ago?

Michael

-- 
Pengutronix e.K.   | |
Industrial Linux Solutions | http://www.pengutronix.de/  |
Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-0|
Amtsgericht Hildesheim, HRA 2686   | Fax:   +49-5121-206917- |


Re: [patch net-next 2/3] mlxsw: expose EMAD transactions statistics via debugfs

2015-09-16 Thread Marcelo Ricardo Leitner
On Thu, Aug 27, 2015 at 08:40:29AM +0200, Jiri Pirko wrote:
> Thu, Aug 27, 2015 at 08:36:03AM CEST, da...@davemloft.net wrote:
> >From: Jiri Pirko 
> >Date: Thu, 27 Aug 2015 08:27:04 +0200
> >
> >> I'm not saying it is not possible, it certainly is. But I think that
> >> for example rocker internals have no value to default user, he
> >> should not care and he cannot find out what is going on there
> >> without knowledge of rocker.c code. The question is, do we need some
> >> standard interface to expose random debugging data? I don't think
> >> so, I think that debugfs is exactly the tool to be used in that
> >> case.
> >
> >If it is only interesting to rocker.c maintainer, he can keep a local
> >patch he applies when he needs such a facility.
> >
> >This discussion is becoming circular.
> >
> >If it's useful, it needs a well defined interface.
> >
> >If it's not useful, it doesn't belong in the tree.
> >
> >Therefore, debugfs is useless.
> 
> Fair enough.

Late reply, sorry, but another idea is to leave the stats in place (as
they were going to be calculated even with debugfs unmounted) and (for
now at least) fetch them with systemtap, perf or something like that.
Then the stats are there for when you need them and with an interface as
flexible as it can get. Even if you happen to do a post-mortem analysis,
the info would at least be there.

  Marcelo



Re: [linux-next] oops in ip_route_input_noref

2015-09-16 Thread David Ahern

On 9/16/15 3:24 AM, Sergey Senozhatsky wrote:

Hi,

4.3.0-rc1-next-20150916

oops after removal of rndis usb device


Sergey:

Can you send me the oops output?

Thanks,
David



Re: [linux-next] oops in ip_route_input_noref

2015-09-16 Thread David Ahern

On 9/16/15 7:53 AM, Richard Alpe wrote:

I too get an Oops in ip_route_input_noref(). It happens occasionally during
bootup. KVM environment using virtio driver. Let me know if you need any
additional info or if you want me to try to bisect it.

Starting network...
...
[0.877040] BUG: unable to handle kernel NULL pointer dereference at 0056
[0.877597] IP: [] ip_route_input_noref+0x1a2/0xb00


Can you send me your kernel config and qemu command line? KVM with virtio 
networking is a primary test vehicle, and I did not encounter this at all.

Sure thing. Not sure how ppl normally provide files on netdev but I'm just going
to go ahead and paste them here :)


An attachment for the config is better than inline.



$ ps aux | grep kvm
qemu-system-x86_64 -enable-kvm -name tipc-medium-node1 -S -machine 
pc-0.14,accel=kvm,usb=off -m 1024 -realtime mlock=off -smp 
2,sockets=2,cores=1,threads=1 -uuid cdec478a-5f0d-49f1-b25e-fac4ca0b290c 
-no-user-config -nodefaults -chardev 
socket,id=charmonitor,path=/var/lib/libvirt/qemu/tipc-medium-node1.monitor,server,nowait
 -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown 
-boot order=n,menu=on,strict=on -device 
piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -netdev tap,fd=25,id=hostnet0 
-device e1000,netdev=hostnet0,id=net0,mac=00:0f:ff:10:04:01,bus=pci.0,addr=0x3 
-chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 
-vnc 127.0.0.1:28101 -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device 
virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x4 -msg timestamp=on

$ cat .config
#
# Automatically generated file; DO NOT EDIT.
# Linux/x86 3.12.28 Kernel Configuration
#


3.12.28? That should say this for net-next:

# Linux/x86 4.2.0 Kernel Configuration

Or are you reporting a problem with 3.12.28?

David


Re: [net-next:master 6/12] include/linux/usb/cdc.h:23: error: redefinition of 'struct usb_cdc_parsed_header'

2015-09-16 Thread Fengguang Wu
On Tue, Sep 15, 2015 at 01:27:42PM -0700, David Miller wrote:
> From: kbuild test robot 
> Date: Wed, 16 Sep 2015 03:57:11 +0800
> 
> > All error/warnings (new ones prefixed by >>):
> > 
> >In file included from drivers/usb/gadget/function/u_ether.h:20,
> > from drivers/usb/gadget/legacy/cdc2.c:16:
> >include/linux/usb/cdc.h:47: warning: 'struct usb_interface' declared 
> > inside parameter list
> >include/linux/usb/cdc.h:47: warning: its scope is only this definition 
> > or declaration, which is probably not what you want
> >In file included from drivers/usb/gadget/function/u_serial.h:16,
> > from drivers/usb/gadget/legacy/cdc2.c:17:
> >>> include/linux/usb/cdc.h:23: error: redefinition of 'struct 
> >>> usb_cdc_parsed_header'
> >include/linux/usb/cdc.h:47: warning: 'struct usb_interface' declared 
> > inside parameter list
> >>> include/linux/usb/cdc.h:47: error: conflicting types for 
> >>> 'cdc_parse_cdc_header'
> >include/linux/usb/cdc.h:47: error: previous declaration of 
> > 'cdc_parse_cdc_header' was here
> 
> This may be a side effect of the initial warning, does this reproduce with
> that fixed?  Please show me what the warning looks like in that case.

Dave, net-next/master commit ad1e7b97b3 ("cdc: Fix build warning.")
still has errors.

The problem is that the header file <linux/usb/cdc.h> is included twice.
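
A header that defines a struct and is reached through two include paths needs an include guard, or must be included only once. The sketch below simulates a double inclusion inside a single file. The guard macro `EXAMPLE_CDC_H`, the struct body and `example_parse_header` are illustrative assumptions, not the contents of the real header.

```c
#include <assert.h>

/* --- first inclusion of the hypothetical header --- */
#ifndef EXAMPLE_CDC_H
#define EXAMPLE_CDC_H

/* Simplified stand-in for the parsed-header struct. */
struct example_parsed_header {
	int has_union_desc;
	int has_ether_desc;
};

int example_parse_header(struct example_parsed_header *hdr, int flags);

#endif /* EXAMPLE_CDC_H */

/* --- second inclusion: the guard turns this copy into a no-op --- */
#ifndef EXAMPLE_CDC_H
#define EXAMPLE_CDC_H

struct example_parsed_header {
	int has_union_desc;
	int has_ether_desc;
};

int example_parse_header(struct example_parsed_header *hdr, int flags);

#endif /* EXAMPLE_CDC_H */

/* Toy implementation so the declaration above has a definition. */
int example_parse_header(struct example_parsed_header *hdr, int flags)
{
	hdr->has_union_desc = (flags & 1) != 0;
	hdr->has_ether_desc = (flags & 2) != 0;
	return hdr->has_union_desc + hdr->has_ether_desc;
}
```

Without the `#ifndef` pair, the second copy of the struct would be a redefinition, and the second function declaration would then refer to a different struct type.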

recent_errors
├── arm-arm5
│   ├── include-linux-usb-cdc.h:error:conflicting-types-for-cdc_parse_cdc_header
│   └── include-linux-usb-cdc.h:error:redefinition-of-struct-usb_cdc_parsed_header
├── arm-arm67
│   ├── include-linux-usb-cdc.h:error:conflicting-types-for-cdc_parse_cdc_header
│   └── include-linux-usb-cdc.h:error:redefinition-of-struct-usb_cdc_parsed_header
├── arm-mmp
│   ├── include-linux-usb-cdc.h:error:conflicting-types-for-cdc_parse_cdc_header
│   └── include-linux-usb-cdc.h:error:redefinition-of-struct-usb_cdc_parsed_header
├── arm-omap2plus_defconfig
│   ├── include-linux-usb-cdc.h:error:conflicting-types-for-cdc_parse_cdc_header
│   └── include-linux-usb-cdc.h:error:redefinition-of-struct-usb_cdc_parsed_header
├── avr32-atngw100_defconfig
│   └── include-linux-usb-cdc.h:error:redefinition-of-struct-usb_cdc_parsed_header
├── avr32-atstk1006_defconfig
│   └── include-linux-usb-cdc.h:error:redefinition-of-struct-usb_cdc_parsed_header
└── i386-allmodconfig
    ├── include-linux-usb-cdc.h:error:conflicting-types-for-cdc_parse_cdc_header
    └── include-linux-usb-cdc.h:error:redefinition-of-struct-usb_cdc_parsed_header

The error messages are now:

In file included from drivers/usb/gadget/function/u_ether.h:20:0,
                 from drivers/usb/gadget/function/f_ncm.c:26:
include/linux/usb/cdc.h:23:8: error: redefinition of 'struct usb_cdc_parsed_header'
 struct usb_cdc_parsed_header {
        ^
In file included from drivers/usb/gadget/function/f_ncm.c:24:0:
include/linux/usb/cdc.h:23:8: note: originally defined here
 struct usb_cdc_parsed_header {
        ^
In file included from drivers/usb/gadget/function/u_ether.h:20:0,
                 from drivers/usb/gadget/function/f_ncm.c:26:
include/linux/usb/cdc.h:44:5: error: conflicting types for 'cdc_parse_cdc_header'
 int cdc_parse_cdc_header(struct usb_cdc_parsed_header *hdr,
     ^
In file included from drivers/usb/gadget/function/f_ncm.c:24:0:
include/linux/usb/cdc.h:44:5: note: previous declaration of 'cdc_parse_cdc_header' was here
 int cdc_parse_cdc_header(struct usb_cdc_parsed_header *hdr,
     ^

Thanks,
Fengguang


Re: IPv6 routing/fragmentation panic

2015-09-16 Thread David Woodhouse
On Wed, 2015-09-16 at 15:27 +0200, Florian Westphal wrote:
> 
> David, could you test this?  I'd do an official patch submission
> then.

Compiles. Doesn't fix the problem.

-- 
dwmw2





Re: [linux-next] oops in ip_route_input_noref

2015-09-16 Thread David Ahern

On 9/16/15 7:59 AM, Richard Alpe wrote:

Sorry about that kvm cmdline was a copy-paste error. Here's the right one using 
virtio.


I was just about to respond to that as well...


Re: [linux-next] oops in ip_route_input_noref

2015-09-16 Thread Richard Alpe
On 2015-09-16 15:57, David Ahern wrote:
> On 9/16/15 7:53 AM, Richard Alpe wrote:
>>>> I too get an Oops in ip_route_input_noref(). It happens occasionally during
>>>> bootup. KVM environment using virtio driver. Let me know if you need any
>>>> additional info or if you want me to try to bisect it.
>>>>
>>>> Starting network...
>>>> ...
>>>> [0.877040] BUG: unable to handle kernel NULL pointer dereference at 0056
>>>> [0.877597] IP: [] ip_route_input_noref+0x1a2/0xb00
>>>
>>> Can you send me your kernel config and qemu command line? KVM with virtio 
>>> networking is a primary test vehicle, and I did not encounter this at all.
>> Sure thing. Not sure how ppl normally provide files on netdev but I'm just 
>> going
>> to go ahead and paste them here :)
> 
> An attachment for the config is better than inline.
Fantastic day today, I managed to mess up two out of two copy pastes.
Sorry about that.. Here is the proper kconfig as .gz :)

Regards
Richard

> 
>>
>> $ ps aux | grep kvm
>> qemu-system-x86_64 -enable-kvm -name tipc-medium-node1 -S -machine 
>> pc-0.14,accel=kvm,usb=off -m 1024 -realtime mlock=off -smp 
>> 2,sockets=2,cores=1,threads=1 -uuid cdec478a-5f0d-49f1-b25e-fac4ca0b290c 
>> -no-user-config -nodefaults -chardev 
>> socket,id=charmonitor,path=/var/lib/libvirt/qemu/tipc-medium-node1.monitor,server,nowait
>>  -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown 
>> -boot order=n,menu=on,strict=on -device 
>> piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -netdev tap,fd=25,id=hostnet0 
>> -device 
>> e1000,netdev=hostnet0,id=net0,mac=00:0f:ff:10:04:01,bus=pci.0,addr=0x3 
>> -chardev pty,id=charserial0 -device 
>> isa-serial,chardev=charserial0,id=serial0 -vnc 127.0.0.1:28101 -device 
>> cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device 
>> virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x4 -msg timestamp=on
>>
>> $ cat .config
>> #
>> # Automatically generated file; DO NOT EDIT.
>> # Linux/x86 3.12.28 Kernel Configuration
>> #
> 
> 3.12.28? That should say this for net-next:
> 
> # Linux/x86 4.2.0 Kernel Configuration
> 
> Or are you reporting a problem with 3.12.28?
> 
> David



config.gz
Description: application/gzip


Re: [PATCH 2/2] airo: Implement netif_carrier_on/off

2015-09-16 Thread Sergei Shtylyov

Hello.

On 9/15/2015 6:18 PM, Ondrej Zary wrote:


Add calls to netif_carrier_on and netif_carrier_off

Signed-off-by: Ondrej Zary 
---
  drivers/net/wireless/airo.c |6 +-
  1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/net/wireless/airo.c b/drivers/net/wireless/airo.c
index a8f2767..629245c 100644
--- a/drivers/net/wireless/airo.c
+++ b/drivers/net/wireless/airo.c

[...]

@@ -3277,7 +3278,9 @@ static void airo_handle_link(struct airo_info *ai)
eth_zero_addr(wrqu.ap_addr.sa_data);
wrqu.ap_addr.sa_family = ARPHRD_ETHER;
wireless_send_event(ai->dev, SIOCGIWAP, &wrqu, NULL);
-   }
+   netif_carrier_off(ai->dev);
+   } else
+   netif_carrier_off(ai->dev);


   Need {} in all branches, according to Documentation/CodingStyle.


  }

  static void airo_handle_rx(struct airo_info *ai)

[...]

MBR, Sergei



Re: [linux-next] oops in ip_route_input_noref

2015-09-16 Thread Fabio Estevam
On Wed, Sep 16, 2015 at 6:24 AM, Sergey Senozhatsky
<sergey.senozhatsky.w...@gmail.com> wrote:

> added by b7503e0cdb5dbec5d201aa69dc14679b5ae8
>
> net: Add FIB table id to rtable
>
> Add the FIB table id to rtable to make the information available for
> IPv4 as it is for IPv6.

I see the same issue here when booting a mx25 ARM processor via NFS.

defconfig is arch/arm/configs/imx_v4_v5_defconfig.

It happens in 100% of the boots and the log is:

fec 50038000.ethernet eth0: Link is Up - 100Mbps/Full - flow control rx/tx
Sending DHCP requests .
Unable to handle kernel NULL pointer dereference at virtual address 0007
pgd = c0004000
[0007] *pgd=
Internal error: Oops: 1 [#1] PREEMPT ARM
Modules linked in:
CPU: 0 PID: 0 Comm: swapper Not tainted 4.3.0-rc1-next-20150916-dirty #96
Hardware name: Freescale i.MX25 (Device Tree Support)
task: c06ac1d0 ti: c06a8000 task.ti: c06a8000
PC is at ip_route_input_noref+0x3d8/0x808
LR is at __local_bh_enable_ip+0x5c/0xdc
pc : []lr : []psr: a013
sp : c06a9cb0  ip : 000a  fp : 
r10: c39b7000  r9 : c39c8d00  r8 : 1e00a8c0
r7 : c39c04a0  r6 :   r5 : c3969a00  r4 : ff8f
r3 :   r2 : 0001  r1 : c0438410  r0 : c3969a00
Flags: NzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
Control: 0005317f  Table: 80004000  DAC: 0053
Process swapper (pid: 0, stack limit = 0xc06a8190)
Stack: (0xc06a9cb0 to 0xc06aa000)
9ca0:   0044 c06ab93c
9cc0: 0100a8c0 c06f0540 c3a8f74e  0002 00070044 c043bba8 
9ce0:  c06a9d24  0002  1e00 0100a8c0 
9d00:   0100a8c0 1e00a8c0 c3a8f720 0001  c3a8f74e
9d20: c39c04a0 c39c04a0 002e c3a8f720 0001  c06fa580 c043bbbc
9d40: c39b7000 c06a9d60 c3a8f720 c39b7000 c39c04a0 c06f0540 002e c043c3a0
9d60: c3a8f896 c06929f4 c39b7000 c30d3ce0 c06a9d78 0100a8c0 c0401190 c06ab9d8
9d80: 0008 c39b7048 0008 c06ab9d8 c39b7000 c39b705c c39c04a0 c040dfa0
9da0: c39c04a0 c3a8f74e 002e c39b753c c39c04a0 c39c04a0 c06ab9c0 c39b705c
9dc0: c39b7520 0008     c39c04a0 c04117c0
9de0: 08e0 c39c04a0 c39c04a0 c39b7520 0001   
9e00: c39b7000 c0411400   0003 c04120cc c485d000 0800
9e20: c39c04a0 c02fde40  c06ac1d0 c06b1c38 0040 c39b7030 c39b7040
9e40: c3943000   0002 c30d39e0   
9e60: c39b74b8 0040 c39b7460 c39b7520 c06b3ce0 c383de14 e1e6cf80 c39b7520
9e80: 0001 0040 012c c06a9ea8 8c4d c06b4500 c06fa580 c0411bdc
9ea0: c06a9eb0 8c4d c06a9ea8 c06a9ea8 c06a9eb0 c06a9eb0 0001 
9ec0: 0008 0003 c06fd76c c06fa8d0 0101 0004 000c c001bc24
9ee0: c39d2080  0001 000a 8c4c 0020  
9f00: c06cf3a4  0001 c06a9f58 41069264 c3802000  c001c148
9f20:  c004c958 c06a9f58 c06fd284 c06a9f58  c06a9f8c c06fa69d
9f40: c06b3034 c0009404 c000ac20 6013  c04b5c64  0005317f
9f60: 0005217f 6013 c06aa0f4 c06fae98 c06fa69d c06fae98 c06fa69d 41069264
9f80: c06b3034  60d3 c06a9fa8 c000ac30 c000ac20 6013 
9fa0: 0053 c06fae98  c0041724 c06ac1d0   c065ebc4
9fc0:    c065e670  c06978bc  c06fd174
9fe0: c06aa094 c06978b8 c06ad120 80004000 80695fb8 80008048  
[] (ip_route_input_noref) from [] (ip_rcv_finish+0xe8/0x31c)
[] (ip_rcv_finish) from [] (ip_rcv+0x2b4/0x3d4)
[] (ip_rcv) from [] (__netif_receive_skb_core+0x304/0x944)
[] (__netif_receive_skb_core) from [] (netif_receive_skb_internal+0x28/0x78)
[] (netif_receive_skb_internal) from [] (napi_gro_receive+0x88/0x130)
[] (napi_gro_receive) from [] (fec_enet_rx_napi+0x404/0xa78)
[] (fec_enet_rx_napi) from [] (net_rx_action+0xf8/0x334)
[] (net_rx_action) from [] (__do_softirq+0x11c/0x3a0)
[] (__do_softirq) from [] (irq_exit+0xac/0xf8)
[] (irq_exit) from [] (__handle_domain_irq+0x64/0xd0)
[] (__handle_domain_irq) from [] (avic_handle_irq+0x34/0x54)
[] (avic_handle_irq) from [] (__irq_svc+0x44/0x78)
Exception stack(0xc06a9f58 to 0xc06a9fa0)
9f40:    0005317f
9f60: 0005217f 6013 c06aa0f4 c06fae98 c06fa69d c06fae98 c06fa69d 41069264
9f80: c06b3034  60d3 c06a9fa8 c000ac30 c000ac20 6013 
[] (__irq_svc) from [] (arch_cpu_idle+0x28/0x44)
[] (arch_cpu_idle) from [] (cpu_startup_entry+0x118/0x2bc)
[] (cpu_startup_entry) from [] (start_kernel+0x308/0x368)
[] (start_kernel) from [<80008048>] (0x80008048)
Code: e3a02001 e353 e585102c e5c5205e (15933008)
---[ end trace 443993f61e8bf0a0 ]---
Kernel panic - not syncing: Fatal exception in interrupt
---[ end Kernel panic - not syncing: Fatal exception in interrupt

Re: [linux-next] oops in ip_route_input_noref

2015-09-16 Thread David Ahern

On 9/16/15 9:00 AM, Fabio Estevam wrote:

On Wed, Sep 16, 2015 at 6:24 AM, Sergey Senozhatsky
 wrote:


added by b7503e0cdb5dbec5d201aa69dc14679b5ae8

 net: Add FIB table id to rtable

 Add the FIB table id to rtable to make the information available for
 IPv4 as it is for IPv6.


I see the same issue here when booting a mx25 ARM processor via NFS.

defconfig is arch/arm/configs/imx_v4_v5_defconfig.



I am still not able to reproduce. While I work on a full Cumulus image 
for other test cases here's a patch to try; eagle eye Nikolay noted a 
potential use without init in the maze of goto's.


Thanks,
David
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index da427a4a33fe..80f7c5b7b832 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1712,6 +1712,7 @@ static int ip_route_input_slow(struct sk_buff *skb, __be32 daddr, __be32 saddr,
goto martian_source;
 
res.fi = NULL;
+   res.table = NULL;
if (ipv4_is_lbcast(daddr) || (saddr == 0 && daddr == 0))
goto brd_input;
 
@@ -1834,6 +1835,7 @@ out:  return err;
RT_CACHE_STAT_INC(in_no_route);
res.type = RTN_UNREACHABLE;
res.fi = NULL;
+   res.table = NULL;
goto local_input;
 
/*


[PATCH net] ipv6: ip6_fragment: fix headroom tests and skb leak

2015-09-16 Thread Florian Westphal
David Woodhouse reports skb_under_panic when we try to push ethernet
header to fragmented ipv6 skbs:

 skbuff: skb_under_panic: text:c1277f1e len:1294 put:14 head:dec98000
 data:dec97ffc tail:0xdec9850a end:0xdec98f40 dev:br-lan
[..]
ip6_finish_output2+0x196/0x4da

David further debugged this:
  [..] offending fragments were arriving here with skb_headroom(skb)==10.
  Which is reasonable, being the Solos ADSL card's header of 8 bytes
  followed by 2 bytes of PPP frame type.

The problem is that if netfilter ipv6 defragmentation is used, skb_cow()
in ip6_forward will only see reassembled skb.

Therefore, headroom is overestimated by 8 bytes (we pulled fragment
header) and we don't check the skbs in the frag_list either.

We can't do these checks in netfilter defrag since outdev isn't known yet.

Furthermore, existing tests in ip6_fragment did not consider the fragment
or ipv6 header size when checking headroom of the fraglist skbs.

While at it, also fix a skb leak on memory allocation -- ip6_fragment
must consume the skb.

I tested this with an e1000 driver hacked to not allocate additional headroom
(we end up in slowpath, since LL_RESERVED_SPACE is 16).

If 2 bytes of headroom are allocated, fastpath is taken (14 byte
ethernet header was pulled, so 16 byte headroom available in all
fragments).
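
The fast-path conditions can be checked against numbers reported earlier in the thread (hlen 40, hroom 16, fragment headroom 58, first skb headroom 10). This is a stand-alone sketch with hypothetical helper names. `TOY_FRAG_HDR_LEN` assumes the 8-byte fragment header mentioned above.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_FRAG_HDR_LEN 8	/* size of the IPv6 fragment header */

/* First skb: needs room for the link-layer header plus a fragment header. */
static bool toy_first_skb_fast_ok(unsigned int headroom, unsigned int hroom)
{
	return headroom >= hroom + TOY_FRAG_HDR_LEN;
}

/*
 * frag_list skbs: additionally need room for the hlen bytes of
 * IPv6 header that are placed in front of each fragment.
 */
static bool toy_frag_fast_ok(unsigned int headroom, unsigned int hlen,
			     unsigned int hroom)
{
	return headroom >= hlen + hroom + TOY_FRAG_HDR_LEN;
}
```

With those values both skbs fail the test (10 < 24 and 58 < 64), so the slow path is taken instead of pushing headers into too little headroom.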

Reported-by: David Woodhouse 
Diagnosed-by: David Woodhouse 
Signed-off-by: Florian Westphal 
---
 net/ipv6/ip6_output.c | 14 --
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 26ea479..92b1aa3 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -586,20 +586,22 @@ int ip6_fragment(struct sock *sk, struct sk_buff *skb,
frag_id = ipv6_select_ident(net, &ipv6_hdr(skb)->daddr,
&ipv6_hdr(skb)->saddr);
 
+   hroom = LL_RESERVED_SPACE(rt->dst.dev);
if (skb_has_frag_list(skb)) {
int first_len = skb_pagelen(skb);
struct sk_buff *frag2;
 
if (first_len - hlen > mtu ||
((first_len - hlen) & 7) ||
-   skb_cloned(skb))
+   skb_cloned(skb) ||
+   skb_headroom(skb) < (hroom + sizeof(struct frag_hdr)))
goto slow_path;
 
skb_walk_frags(skb, frag) {
/* Correct geometry. */
if (frag->len > mtu ||
((frag->len & 7) && frag->next) ||
-   skb_headroom(frag) < hlen)
+   skb_headroom(frag) < (hlen + hroom + sizeof(struct frag_hdr)))
goto slow_path_clean;
 
/* Partially cloned skb? */
@@ -616,8 +618,6 @@ int ip6_fragment(struct sock *sk, struct sk_buff *skb,
 
err = 0;
offset = 0;
-   frag = skb_shinfo(skb)->frag_list;
-   skb_frag_list_init(skb);
/* BUILD HEADER */
 
*prevhdr = NEXTHDR_FRAGMENT;
@@ -625,8 +625,11 @@ int ip6_fragment(struct sock *sk, struct sk_buff *skb,
if (!tmp_hdr) {
IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)),
  IPSTATS_MIB_FRAGFAILS);
-   return -ENOMEM;
+   err = -ENOMEM;
+   goto fail;
}
+   frag = skb_shinfo(skb)->frag_list;
+   skb_frag_list_init(skb);
 
__skb_pull(skb, hlen);
fh = (struct frag_hdr *)__skb_push(skb, sizeof(struct frag_hdr));
@@ -723,7 +726,6 @@ slow_path:
 */
 
*prevhdr = NEXTHDR_FRAGMENT;
-   hroom = LL_RESERVED_SPACE(rt->dst.dev);
troom = rt->dst.dev->needed_tailroom;
 
/*
-- 
2.0.5



[PATCH net-next] net: Initialize table in fib result

2015-09-16 Thread David Ahern
Sergey, Richard and Fabio reported an oops in ip_route_input_noref. e.g., from 
Richard:

[0.877040] BUG: unable to handle kernel NULL pointer dereference at 0056
[0.877597] IP: [] ip_route_input_noref+0x1a2/0xb00
[0.877597] PGD 3fa14067 PUD 3fa6e067 PMD 0
[0.877597] Oops:  [#1] SMP
[0.877597] Modules linked in: virtio_net virtio_pci virtio_ring virtio
[0.877597] CPU: 1 PID: 119 Comm: ifconfig Not tainted 4.2.0+ #1
[0.877597] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[0.877597] task: 88003fab0bc0 ti: 88003faa8000 task.ti: 88003faa8000
[0.877597] RIP: 0010:[]  [] ip_route_input_noref+0x1a2/0xb00
[0.877597] RSP: 0018:88003ed03ba0  EFLAGS: 00010202
[0.877597] RAX: 0046 RBX: ff8f RCX: 0020
[0.877597] RDX: 88003fab50b8 RSI: 0200 RDI: 8152b4b8
[0.877597] RBP: 88003ed03c50 R08:  R09: 
[0.877597] R10:  R11:  R12: 88003fab6f00
[0.877597] R13: 88003fab5000 R14:  R15: 81cb5600
[0.877597] FS:  7f6de5751700() GS:88003ed0() knlGS:
[0.877597] CS:  0010 DS:  ES:  CR0: 80050033
[0.877597] CR2: 0056 CR3: 3fa6d000 CR4: 06e0
[0.877597] Stack:
[0.877597]   0046 88003fffa600 88003ed03be0
[0.877597]  88003f9e2c00 697da8c0017da8c0 8800 0007fd00
[0.877597]   0046  0004
[0.877597] Call Trace:
[0.877597]  
[0.877597]  [] ? cpumask_next_and+0x2f/0x40
[0.877597]  [] arp_process+0x39c/0x690
[0.877597]  [] arp_rcv+0x13e/0x170
[0.877597]  [] __netif_receive_skb_core+0x60c/0xa00
[0.877597]  [] ? __build_skb+0x25/0x100
[0.877597]  [] ? __build_skb+0x25/0x100
[0.877597]  [] __netif_receive_skb+0x16/0x70
[0.877597]  [] netif_receive_skb_internal+0x28/0x90
[0.877597]  [] napi_gro_receive+0x7f/0xd0
[0.877597]  [] virtnet_receive+0x256/0x910 [virtio_net]
[0.877597]  [] virtnet_poll+0x18/0x80 [virtio_net]
[0.877597]  [] net_rx_action+0x1dd/0x2f0
[0.877597]  [] __do_softirq+0x98/0x260
[0.877597]  [] do_softirq_own_stack+0x1c/0x30

The root cause is that res.table is used uninitialized.

Thanks to Nikolay for noticing the uninitialized use amongst the maze of
gotos.
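
The pattern behind the fix can be shown in a reduced form. The sketch below is not the kernel code: `toy_fib_result`, `toy_fib_table` and the helper names are hypothetical. It illustrates why an `if (res.table)` guard only helps when every path that skips the lookup has set the pointer to NULL first; otherwise `table` holds whatever was on the stack, and the guard can pass on a garbage non-NULL value.

```c
#include <assert.h>
#include <stddef.h>

struct toy_fib_table {
	unsigned int tb_id;
};

struct toy_fib_result {
	void *fi;
	struct toy_fib_table *table;
};

/* Mirrors the fixed paths: clear both pointers before skipping the lookup. */
static void toy_result_reset(struct toy_fib_result *res)
{
	assert(res != NULL);
	res->fi = NULL;
	res->table = NULL;
}

/* Mirrors the consumer: only dereference the table when one was set. */
static unsigned int toy_table_id(const struct toy_fib_result *res,
				 unsigned int fallback)
{
	if (res->table)
		return res->table->tb_id;
	return fallback;
}
```

After `toy_result_reset()` the consumer falls back cleanly instead of dereferencing a leftover pointer.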

Fixes: b7503e0cdb5d ("net: Add FIB table id to rtable")
Reported-by: Sergey Senozhatsky 
Reported-by: Richard Alpe 
Reported-by: Fabio Estevam 
Signed-off-by: David Ahern 
---
 net/ipv4/route.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index da427a4a33fe..80f7c5b7b832 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1712,6 +1712,7 @@ static int ip_route_input_slow(struct sk_buff *skb, __be32 daddr, __be32 saddr,
goto martian_source;
 
res.fi = NULL;
+   res.table = NULL;
if (ipv4_is_lbcast(daddr) || (saddr == 0 && daddr == 0))
goto brd_input;
 
@@ -1834,6 +1835,7 @@ out:  return err;
RT_CACHE_STAT_INC(in_no_route);
res.type = RTN_UNREACHABLE;
res.fi = NULL;
+   res.table = NULL;
goto local_input;
 
/*
-- 
2.3.2 (Apple Git-55)



Re: Experiences with slub bulk use-case for network stack

2015-09-16 Thread Christoph Lameter
On Wed, 16 Sep 2015, Jesper Dangaard Brouer wrote:

>
> Hint, this leads up to discussing if current bulk *ALLOC* API need to
> be changed...
>
> Alex and I have been working hard on practical use-case for SLAB
> bulking (mostly slUb), in the network stack.  Here is a summary of
> what we have learned so far.

SLAB refers to the SLAB allocator which is one slab allocator and SLUB is
another slab allocator.

Please keep that consistent, otherwise things get confusing.

> Bulk free'ing SKBs during TX completion is a big and easy win.
>
> Specifically for slUb, normal path for freeing these objects (which
> are not on c->freelist) require a locked double_cmpxchg per object.
> The bulk free (via detached freelist patch) allow to free all objects
> belonging to the same slab-page, to be free'ed with a single locked
> double_cmpxchg. Thus, the bulk free speedup is quite an improvement.

Yep.

> Alex and I had the idea of bulk alloc returns an "allocator specific
> cache" data-structure (and we add some helpers to access this).

Maybe add some Macros to handle this?

> In the slUb case, the freelist is a single linked pointer list.  In
> the network stack the skb objects have a skb->next pointer, which is
> located at the same position as freelist pointer.  Thus, simply
> returning the freelist directly, could be interpreted as a skb-list.
> The helper API would then do the prefetching, when pulling out
> objects.

The problem with the SLUB case is that the objects must be on the same
slab page.

> For the slUb case, we would simply cmpxchg either c->freelist or
> page->freelist with a NULL ptr, and then own all objects on the
> freelist. This also reduce the time we keep IRQs disabled.

You don't need to disable interrupts for the cmpxchges. There is additional
state in the page struct though so the updates must be done carefully.
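
The idea of taking the whole freelist at once can be sketched in user space with C11 atomics. This is an illustration only, with hypothetical names (`toy_obj`, `toy_freelist_*`). It uses a plain atomic exchange on a single list head and ignores the additional state in the page struct mentioned above, which a real implementation would have to update together with the freelist.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Free object with an intrusive next pointer, like a freelist entry. */
struct toy_obj {
	struct toy_obj *next;
};

struct toy_freelist {
	_Atomic(struct toy_obj *) head;
};

/* Push one object with a compare-and-swap loop. */
static void toy_freelist_push(struct toy_freelist *fl, struct toy_obj *obj)
{
	struct toy_obj *old = atomic_load(&fl->head);

	assert(obj != NULL);
	do {
		obj->next = old;
	} while (!atomic_compare_exchange_weak(&fl->head, &old, obj));
}

/* Detach the entire list in one atomic step; the caller then owns it. */
static struct toy_obj *toy_freelist_take_all(struct toy_freelist *fl)
{
	return atomic_exchange(&fl->head, (struct toy_obj *)NULL);
}

/* Walk a detached list; no atomics needed once it is privately owned. */
static size_t toy_list_len(const struct toy_obj *obj)
{
	size_t n = 0;

	for (; obj; obj = obj->next)
		n++;
	return n;
}
```

Once the list is detached, the caller can hand out objects one by one without further atomic operations on the shared head.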



[PATCH RFC 0/3] Allow postponed netfilter handling for socket matches

2015-09-16 Thread Daniel Mack
I'm re-addressing the issue of matching socket meta information for
non-established sockets that has been discussed a while ago:

  http://article.gmane.org/gmane.comp.security.firewalls.netfilter.devel/56877

Being able to reliably match on net_cls cgroup ids is crucial in
order to build per-application or per-container firewall rules
which don't leak ingress packets. Such a feature would be very
useful to have.

A previous attempt to fix the currently existing issues was to call
out to the early demuxing helper functions from the meta matching
callbacks, but that doesn't suffice because it doesn't address the
case of multicast UDP and other, more complex lookup methods 
implemented in various protocol handlers.

This patch set outlines a different approach by adding a flag to
'struct sk_buff' called 'nf_postponed'. This flag is set by
nft_meta_get_eval() in case a decision cannot be made due to a missing
skb->sk. skbs flagged that way will then be run through the netfilter
chain processor again after the protocol handlers did the real socket
lookup. A small addition to 'struct nft_pktinfo' is needed so that the
matching callbacks can access the socket that was passed into
nf_hook().

Note that the new flag does not actually bloat 'struct sk_buff',
because it still fits into the 'flags1' bitfield. Also, the extra
netfilter chain iteration will not be done by any subsequent packet in
the same stream, as for those, the early demux code will set skb->sk.

The patch set is obviously not yet finished, because a lot more
protocol handlers need to be patched. Right now, I only addressed
tcp_ipv4. Before I do that, I want to get some feedback on the
approach, so please let me know what you think.


Thanks,
Daniel


Daniel Mack (3):
  netfilter: add socket to struct nft_pktinfo
  netfilter: nft_meta: mark skbs for postponed filter processing
  net: tcp_ipv4: re-run netfilter chains for marked skbs

 include/linux/skbuff.h|  3 ++-
 include/net/netfilter/nf_tables.h |  2 ++
 net/ipv4/tcp_ipv4.c   | 10 ++
 net/netfilter/nft_meta.c  |  9 ++---
 4 files changed, 20 insertions(+), 4 deletions(-)

-- 
2.5.0



[PATCH RFC 2/3] netfilter: nft_meta: mark skbs for postponed filter processing

2015-09-16 Thread Daniel Mack
When the cgroup matching code in nft_meta is called without a socket to
look at, it currently bails out and lets the packet pass. This is bad,
because the reason for skb->sk being NULL is simply that the packet was
directed to a socket that hasn't been looked up yet by early demux.

This patch does two things:

 a) it uses the newly introduced pkt->sk pointer rather than skb->sk
to check for the net class ID. This allows us to look at the socket
the user passed into nf_hook().

 b) in case the socket can't be accessed, it marks the skb as
'nf_postponed', so that later dispatchers have a chance to
re-iterate the chain for such packets, after a full demux was
conducted.

Note that the added flag in 'struct sk_buff' does not increase the size
of the struct, as it fits in the 'flags1' bitfield.

Signed-off-by: Daniel Mack 
---
 include/linux/skbuff.h   | 3 ++-
 net/netfilter/nft_meta.c | 9 ++---
 2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 2738d35..3590101 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -584,7 +584,8 @@ struct sk_buff {
fclone:2,
peeked:1,
head_frag:1,
-   xmit_more:1;
+   xmit_more:1,
+   nf_postponed:1;
/* one bit hole */
kmemcheck_bitfield_end(flags1);
 
diff --git a/net/netfilter/nft_meta.c b/net/netfilter/nft_meta.c
index cb2f13e..33b8d23 100644
--- a/net/netfilter/nft_meta.c
+++ b/net/netfilter/nft_meta.c
@@ -29,8 +29,9 @@ void nft_meta_get_eval(const struct nft_expr *expr,
   const struct nft_pktinfo *pkt)
 {
const struct nft_meta *priv = nft_expr_priv(expr);
-   const struct sk_buff *skb = pkt->skb;
const struct net_device *in = pkt->in, *out = pkt->out;
+   struct sk_buff *skb = pkt->skb;
+   struct sock *sk = pkt->sk;
u32 *dest = &regs->data[priv->dreg];
 
switch (priv->key) {
@@ -168,9 +169,11 @@ void nft_meta_get_eval(const struct nft_expr *expr,
break;
 #ifdef CONFIG_CGROUP_NET_CLASSID
case NFT_META_CGROUP:
-   if (skb->sk == NULL || !sk_fullsock(skb->sk))
+   if (sk == NULL || !sk_fullsock(sk)) {
+   skb->nf_postponed = 1;
goto err;
-   *dest = skb->sk->sk_classid;
+   }
+   *dest = sk->sk_classid;
break;
 #endif
default:
-- 
2.5.0
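
The size claim in the commit message can be checked with a small stand-alone
sketch. The field widths fclone:2, peeked:1, head_frag:1 and xmit_more:1 are
taken from the hunk above; the two leading one-bit fields (named cloned and
nohdr here) are an assumption made so that seven bits are in use, as the
"one bit hole" comment suggests. Bit-fields of type unsigned char are a
compiler extension that GCC and Clang accept.

```c
#include <stddef.h>

/* Seven flag bits in one byte, leaving a one-bit hole. */
struct flags1_before {
	unsigned char cloned:1,		/* assumed, not shown in the hunk */
		      nohdr:1,		/* assumed, not shown in the hunk */
		      fclone:2,
		      peeked:1,
		      head_frag:1,
		      xmit_more:1;
	/* one bit hole */
};

/* The same byte with the hole filled by the new flag. */
struct flags1_after {
	unsigned char cloned:1,
		      nohdr:1,
		      fclone:2,
		      peeked:1,
		      head_frag:1,
		      xmit_more:1,
		      nf_postponed:1;
};

/* Fails to compile if the extra bit made the struct grow. */
_Static_assert(sizeof(struct flags1_before) == sizeof(struct flags1_after),
	       "nf_postponed must fit into the existing hole");
```

Once the eighth bit is used, no hole is left in this model, so the
"one bit hole" comment kept as context in the hunk would no longer apply.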



[PATCH RFC 3/3] net: tcp_ipv4: re-run netfilter chains for marked skbs

2015-09-16 Thread Daniel Mack
When an skb has been marked for later re-iteration through netfilter,
do that after __inet_lookup_skb() has been called. This allows packets
sent to unconnected sockets to be filtered reliably.

Note that this will never happen for subsequent packets in the same
stream, as skb->sk will be set due to early demux, and hence
skb->nf_postponed will remain 0.

Signed-off-by: Daniel Mack 
---
 net/ipv4/tcp_ipv4.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 93898e0..61e0cb4 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -78,6 +78,7 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -1594,6 +1595,15 @@ int tcp_v4_rcv(struct sk_buff *skb)
if (!sk)
goto no_tcp_socket;
 
+   if (unlikely(skb->nf_postponed)) {
+   ret = nf_hook(NFPROTO_IPV4, NF_INET_LOCAL_IN, sk,
+ skb, skb->dev, NULL, NULL);
+   if (ret != 1) {
+   sock_put(sk);
+   return 0;
+   }
+   }
+
 process:
if (sk->sk_state == TCP_TIME_WAIT)
goto do_time_wait;
-- 
2.5.0



Re: [PATCH net-next] net: Initialize table in fib result

2015-09-16 Thread Nikolay Aleksandrov
On 09/16/2015 05:38 PM, David Ahern wrote:
> Sergey, Richard and Fabio reported an oops in ip_route_input_noref. e.g., 
> from Richard:
> 
> [0.877040] BUG: unable to handle kernel NULL pointer dereference at 
> 0056
> [0.877597] IP: [] ip_route_input_noref+0x1a2/0xb00
> [0.877597] PGD 3fa14067 PUD 3fa6e067 PMD 0
> [0.877597] Oops:  [#1] SMP
> [0.877597] Modules linked in: virtio_net virtio_pci virtio_ring virtio
> [0.877597] CPU: 1 PID: 119 Comm: ifconfig Not tainted 4.2.0+ #1
> [0.877597] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> [0.877597] task: 88003fab0bc0 ti: 88003faa8000 task.ti: 
> 88003faa8000
> [0.877597] RIP: 0010:[]  [] 
> ip_route_input_noref+0x1a2/0xb00
> [0.877597] RSP: 0018:88003ed03ba0  EFLAGS: 00010202
> [0.877597] RAX: 0046 RBX: ff8f RCX: 
> 0020
> [0.877597] RDX: 88003fab50b8 RSI: 0200 RDI: 
> 8152b4b8
> [0.877597] RBP: 88003ed03c50 R08:  R09: 
> 
> [0.877597] R10:  R11:  R12: 
> 88003fab6f00
> [0.877597] R13: 88003fab5000 R14:  R15: 
> 81cb5600
> [0.877597] FS:  7f6de5751700() GS:88003ed0() 
> knlGS:
> [0.877597] CS:  0010 DS:  ES:  CR0: 80050033
> [0.877597] CR2: 0056 CR3: 3fa6d000 CR4: 
> 06e0
> [0.877597] Stack:
> [0.877597]   0046 88003fffa600 
> 88003ed03be0
> [0.877597]  88003f9e2c00 697da8c0017da8c0 8800 
> 0007fd00
> [0.877597]   0046  
> 0004
> [0.877597] Call Trace:
> [0.877597]  
> [0.877597]  [] ? cpumask_next_and+0x2f/0x40
> [0.877597]  [] arp_process+0x39c/0x690
> [0.877597]  [] arp_rcv+0x13e/0x170
> [0.877597]  [] __netif_receive_skb_core+0x60c/0xa00
> [0.877597]  [] ? __build_skb+0x25/0x100
> [0.877597]  [] ? __build_skb+0x25/0x100
> [0.877597]  [] __netif_receive_skb+0x16/0x70
> [0.877597]  [] netif_receive_skb_internal+0x28/0x90
> [0.877597]  [] napi_gro_receive+0x7f/0xd0
> [0.877597]  [] virtnet_receive+0x256/0x910 [virtio_net]
> [0.877597]  [] virtnet_poll+0x18/0x80 [virtio_net]
> [0.877597]  [] net_rx_action+0x1dd/0x2f0
> [0.877597]  [] __do_softirq+0x98/0x260
> [0.877597]  [] do_softirq_own_stack+0x1c/0x30
> 
> The root cause is use of res.table uninitialized.
> 
> Thanks to Nikolay for noticing the uninitialized use amongst the maze of
> gotos.
> 
> Fixes: b7503e0cdb5d ("net: Add FIB table id to rtable")
> Reported-by: Sergey Senozhatsky 
> Reported-by: Richard Alpe 
> Reported-by: Fabio Estevam 
> Signed-off-by: David Ahern 
> ---
>  net/ipv4/route.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
Just to have it documented: I don't think we need the second NULLing,
but it doesn't hurt.

Thanks,
Signed-off-by: Nikolay Aleksandrov 




[PATCH RFC 1/3] netfilter: add socket to struct nft_pktinfo

2015-09-16 Thread Daniel Mack
The high-level netfilter hook API already enables users to pass a socket,
but that information is lost when the chains are walked.

In order to let internal eval callbacks use the passed socket rather than
skb->sk, add a pointer of type 'struct sock' to 'struct nft_pktinfo' and
set that field via nft_set_pktinfo().

This allows us to run filter chains from situations where skb->sk is unset.
Fall back to skb->sk in case state->sk is NULL, so filter callbacks can be
written in a generic way.

Signed-off-by: Daniel Mack 
---
 include/net/netfilter/nf_tables.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/net/netfilter/nf_tables.h 
b/include/net/netfilter/nf_tables.h
index aa8bee7..05e97ed 100644
--- a/include/net/netfilter/nf_tables.h
+++ b/include/net/netfilter/nf_tables.h
@@ -13,6 +13,7 @@
 #define NFT_JUMP_STACK_SIZE16
 
 struct nft_pktinfo {
+   struct sock *sk;
struct sk_buff  *skb;
const struct net_device *in;
const struct net_device *out;
@@ -29,6 +30,7 @@ static inline void nft_set_pktinfo(struct nft_pktinfo *pkt,
   struct sk_buff *skb,
   const struct nf_hook_state *state)
 {
+   pkt->sk = state->sk ?: skb->sk;
pkt->skb = skb;
pkt->in = pkt->xt.in = state->in;
pkt->out = pkt->xt.out = state->out;
-- 
2.5.0
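
The fallback rule from this patch can be sketched outside the kernel as
follows. The struct and function names ending in _model are invented for
illustration. The patch uses the GNU "a ?: b" shorthand; the sketch spells it
out as a standard conditional expression, which gives the same result here.

```c
#include <stddef.h>

/* Illustrative stand-ins, not the kernel definitions. */
struct sock_model { int id; };
struct skb_model { struct sock_model *sk; };
struct hook_state_model { struct sock_model *sk; };

struct pktinfo_model {
	struct sock_model *sk;
	struct skb_model *skb;
};

static void set_pktinfo_model(struct pktinfo_model *pkt,
			      struct skb_model *skb,
			      const struct hook_state_model *state)
{
	/* Prefer the socket handed to the hook, else the one on the skb. */
	pkt->sk = state->sk ? state->sk : skb->sk;
	pkt->skb = skb;
}
```

With this rule an eval callback only has to look at pkt->sk, regardless of
whether the socket came from the hook caller or from early demux.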



Re: IPv6 routing/fragmentation panic

2015-09-16 Thread Florian Westphal
David Woodhouse  wrote:
> > if (frag->len > mtu ||
> > ((frag->len & 7) && frag->next) ||
> > -   skb_headroom(frag) < hlen)
> > +   skb_headroom(frag) < (hlen + hroom))
> > goto slow_path_clean;
> >  
> > /* Partially cloned skb? */
> 
> My test is 'ping -s 2000', and I end up with a fragment of 1280 bytes
> followed by a fragment of 776 bytes.
> 
> The test cited above is only actually running on the latter fragment
> (which for some reason is fine and has headroom of 58 bytes).
> 
> The first, larger, fragment isn't being checked. And that's the one
> with only 10 bytes of headroom.

Thanks for this detailed analysis.
I've sent a patch that should address all of these issues.

Turns out that all tests are wrong in your case.

ip6_fragment doesn't expand headroom, since this skb had the ipv6
fragment header pulled, so that part thinks there are 18 bytes
available (we later push the frag header back when sending fragments).

The 'skb_headroom(frag) < hlen' test is wrong since it accounts for neither the
device header length nor the fragment header that we need to push.
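
A rough user-space sketch of the check being discussed, applied to every
fragment including the first one, might look like this. It is not the patch
that was sent: struct frag_model and both functions are invented names, and
hlen and hroom simply follow the parameter names in the quoted hunk (header
bytes to push, and link-layer reserve).

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative model of one fragment. */
struct frag_model {
	size_t len;		/* fragment length in bytes */
	size_t headroom;	/* bytes available in front of the data */
	bool has_next;		/* another fragment follows */
};

/* True if this fragment can stay on the fast path. */
static bool frag_fast_path_ok(const struct frag_model *frag, size_t mtu,
			      size_t hlen, size_t hroom)
{
	if (frag->len > mtu)
		return false;
	if ((frag->len & 7) && frag->has_next)
		return false;
	/* Need room for the headers we push plus the device header. */
	if (frag->headroom < hlen + hroom)
		return false;
	return true;
}

/* The fast path is only usable if every fragment passes, first included. */
static bool all_frags_fast_path_ok(const struct frag_model *frags, size_t n,
				   size_t mtu, size_t hlen, size_t hroom)
{
	for (size_t i = 0; i < n; i++)
		if (!frag_fast_path_ok(&frags[i], mtu, hlen, hroom))
			return false;
	return true;
}
```

Feeding in the numbers reported above (1280 bytes with 10 bytes of headroom,
then 776 bytes with 58 bytes of headroom), the first fragment fails for any
realistic hlen + hroom, so the whole packet would have to take the slow path.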


[PATCH net-next RFC] net: increase LL_MAX_HEADER for Hyper-V

2015-09-16 Thread Vitaly Kuznetsov
Commit b08cc79155fc26d0d112b1470d1ece5034651a4b ("hv_netvsc: Eliminate
 memory allocation in the packet send path") introduced skb headroom
request for Hyper-V netvsc driver:

   max_needed_headroom = sizeof(struct hv_netvsc_packet) +
   sizeof(struct rndis_message) +
   NDIS_VLAN_PPI_SIZE + NDIS_CSUM_PPI_SIZE +
   NDIS_LSO_PPI_SIZE + NDIS_HASH_PPI_SIZE;
   ...
   net->needed_headroom = max_needed_headroom;

max_needed_headroom is 220 bytes, which significantly exceeds the
LL_MAX_HEADER setting. This causes each skb to be cloned on the send path,
e.g. in the IPv4 case we fall into the following clause
(ip_finish_output2()):

if (unlikely(skb_headroom(skb) < hh_len && dev->header_ops)) {
...
skb2 = skb_realloc_headroom(skb, LL_RESERVED_SPACE(dev));
...
}

leading to a significant performance regression. Increase LL_MAX_HEADER
to make it suitable for netvsc; use 224 so that it is 16-byte aligned.
Alternatively, we could (partially) revert the commit which introduced the skb
headroom request, restoring manual memory allocation on the transmit path.

Signed-off-by: Vitaly Kuznetsov 
---
 include/linux/netdevice.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 88a0069..7233790 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -132,7 +132,9 @@ static inline bool dev_xmit_complete(int rc)
  * used.
  */
 
-#if defined(CONFIG_WLAN) || IS_ENABLED(CONFIG_AX25)
+#if IS_ENABLED(CONFIG_HYPERV_NET)
+# define LL_MAX_HEADER 224
+#elif defined(CONFIG_WLAN) || IS_ENABLED(CONFIG_AX25)
 # if defined(CONFIG_MAC80211_MESH)
 #  define LL_MAX_HEADER 128
 # else
-- 
2.4.3
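
The arithmetic behind the chosen value can be sketched as below: 220 bytes
rounded up to the next multiple of 16 gives 224. The helper names are invented
for illustration; the value 128 is the CONFIG_MAC80211_MESH setting visible in
the hunk above and is used only as a point of comparison.

```c
#include <stddef.h>

/* Round n up to the next multiple of a (a must be non-zero). */
static size_t align_up(size_t n, size_t a)
{
	return (n + a - 1) / a * a;
}

/* Extra bytes reserved per skb when the limit grows from old to new. */
static size_t extra_headroom(size_t new_limit, size_t old_limit)
{
	return new_limit > old_limit ? new_limit - old_limit : 0;
}
```

With these helpers, align_up(220, 16) yields 224, and growing the limit from
128 to 224 reserves 96 additional bytes in front of each affected skb.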



[PATCH iproute2] man ip-link: Fix wording in VLAN reorder_hdr explanation

2015-09-16 Thread Vadim Kochan
From: Vadim Kochan 

Signed-off-by: Vadim Kochan 
---
 man/man8/ip-link.8.in | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/man/man8/ip-link.8.in b/man/man8/ip-link.8.in
index 1896eb6..4928249 100644
--- a/man/man8/ip-link.8.in
+++ b/man/man8/ip-link.8.in
@@ -327,7 +327,7 @@ physical device (if this device does not support VLAN 
offloading), the similar
 on the RX direction - by default the packet will be untagged before being
 received by VLAN device. Reordering allows to accelerate tagging on egress and
 to hide VLAN header on ingress so the packet looks like regular Ethernet 
packet,
-at the same time it might be confusing while the packet sniffing as the VLAN 
header
+at the same time it might be confusing for packet capture as the VLAN header
 does not exist within the packet.
 
 VLAN offloading can be checked by
-- 
2.4.2



Re: [PATCH] man ip-link: Add more explanation about vlan reordering

2015-09-16 Thread Vadim Kochan
On Wed, Aug 26, 2015 at 04:27:48PM +0100, Jeremy Harris wrote:
> On 17/08/15 20:22, Vadim Kochan wrote:
> > +.BR reorder_hdr " is " on
> > +then VLAN header will be not inserted immediately but only before passing 
> > to the
> > +physical device (if this device does not support VLAN offloading), the 
> > similar
> > +on the RX direction - by default the packet will be untagged before being
> > +received by VLAN device. Reordering allows to accelerate tagging on egress 
> > and
> > +to hide VLAN header on ingress so the packet looks like regular Ethernet 
> > packet,
> > +at the same time it might be confusing while the packet sniffing as the 
> > VLAN header
>   ^
> 
> Does not read well.  "for packet capture" perhaps?
> -- 
> Jeremy
> 
> 

Hi Jeremy,

Thanks for your comment. I have sent a patch; let me know if it is correct now.
And sorry for the late response.

Thanks,
Vadim Kochan


Re: [PATCH net-next] net: Initialize table in fib result

2015-09-16 Thread David Ahern

On 9/16/15 9:56 AM, Nikolay Aleksandrov wrote:

> Just to have it documented: I don't think we need the second NULLing,
> but it doesn't hurt.


I think we do. After the second one there is a goto to local_input, which
uses res.table. The second goto is reachable 'if
!IN_DEV_FORWARD(in_dev)', in which case res.table is valid but should not
be. In short, if fi is reset, table should be too.




Re: [PATCH net-next] net: Initialize table in fib result

2015-09-16 Thread Fabio Estevam
On Wed, Sep 16, 2015 at 12:38 PM, David Ahern  wrote:

> The root cause is use of res.table uninitialized.
>
> Thanks to Nikolay for noticing the uninitialized use amongst the maze of
> gotos.
>
> Fixes: b7503e0cdb5d ("net: Add FIB table id to rtable")
> Reported-by: Sergey Senozhatsky 
> Reported-by: Richard Alpe 
> Reported-by: Fabio Estevam 
> Signed-off-by: David Ahern 

Thanks, David. I am able to NFS boot again:

Tested-by: Fabio Estevam 


RE: [PATCH net-next RFC] net: increase LL_MAX_HEADER for Hyper-V

2015-09-16 Thread Haiyang Zhang


> -Original Message-
> From: Vitaly Kuznetsov [mailto:vkuzn...@redhat.com]
> Sent: Wednesday, September 16, 2015 11:50 AM
> To: netdev@vger.kernel.org
> Cc: David S. Miller ; linux-ker...@vger.kernel.org;
> KY Srinivasan ; Haiyang Zhang
> ; Jason Wang 
> Subject: [PATCH net-next RFC] net: increase LL_MAX_HEADER for Hyper-V
> 
> Commit b08cc79155fc26d0d112b1470d1ece5034651a4b ("hv_netvsc: Eliminate
>  memory allocation in the packet send path") introduced skb headroom
> request for Hyper-V netvsc driver:
> 
>max_needed_headroom = sizeof(struct hv_netvsc_packet) +
>sizeof(struct rndis_message) +
>NDIS_VLAN_PPI_SIZE + NDIS_CSUM_PPI_SIZE +
>NDIS_LSO_PPI_SIZE + NDIS_HASH_PPI_SIZE;
>...
>net->needed_headroom = max_needed_headroom;
> 
> max_needed_headroom is 220 bytes, it significantly exceeds the
> LL_MAX_HEADER setting. This causes each skb to be cloned on send path,
> e.g. for IPv4 case we fall into the following clause
> (ip_finish_output2()):
> 
> if (unlikely(skb_headroom(skb) < hh_len && dev->header_ops)) {
> ...
> skb2 = skb_realloc_headroom(skb, LL_RESERVED_SPACE(dev));
> ...
> }
> 
> leading to a significant performance regression. Increase LL_MAX_HEADER
> to make it suitable for netvsc, make it 224 to be 16-aligned.
> Alternatively we could (partially) revert the commit which introduced
> skb
> headroom request restoring manual memory allocation on transmit path.
> 
> Signed-off-by: Vitaly Kuznetsov 
> ---
>  include/linux/netdevice.h | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 88a0069..7233790 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -132,7 +132,9 @@ static inline bool dev_xmit_complete(int rc)
>   *   used.
>   */
> 
> -#if defined(CONFIG_WLAN) || IS_ENABLED(CONFIG_AX25)
> +#if IS_ENABLED(CONFIG_HYPERV_NET)
> +# define LL_MAX_HEADER 224
> +#elif defined(CONFIG_WLAN) || IS_ENABLED(CONFIG_AX25)
>  # if defined(CONFIG_MAC80211_MESH)
>  #  define LL_MAX_HEADER 128
>  # else

Thanks for the patch.
So that we don't forget to update the 224 value when we add more fields
to the netvsc header, I suggest that we define a macro in netdevice.h,
such as:
#define HVNETVSC_MAX_HEADER 224
#define LL_MAX_HEADER HVNETVSC_MAX_HEADER

Also, put a note in the netvsc code saying that the header reservation must
not exceed HVNETVSC_MAX_HEADER; otherwise HVNETVSC_MAX_HEADER needs updating.

Thanks,
- Haiyang
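
One way to make such a note self-enforcing is a compile-time check, sketched
here in plain C11. HVNETVSC_MAX_HEADER is the macro name suggested above;
MODEL_NETVSC_HEADROOM and netvsc_headroom_fits() are hypothetical names, and
220 is the figure quoted in the patch description rather than a value computed
from the real structures. In kernel code, BUILD_BUG_ON() would be the usual
tool for the same purpose.

```c
#include <stdbool.h>
#include <stddef.h>

#define HVNETVSC_MAX_HEADER 224

/* Hypothetical stand-in for the driver's computed headroom request. */
enum { MODEL_NETVSC_HEADROOM = 220 };

/* Breaks the build as soon as the reservation outgrows the limit. */
_Static_assert(MODEL_NETVSC_HEADROOM <= HVNETVSC_MAX_HEADER,
	       "netvsc headroom exceeds HVNETVSC_MAX_HEADER; update the macro");

/* Run-time form of the same check, e.g. for a value known only at probe. */
static bool netvsc_headroom_fits(size_t needed)
{
	return needed <= HVNETVSC_MAX_HEADER;
}
```

The point of the sketch is that a stale limit would be caught at build time
rather than relying on someone reading a comment.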






[PATCH net-next v2] net: Initialize table in fib result

2015-09-16 Thread David Ahern
Sergey, Richard and Fabio reported an oops in ip_route_input_noref. e.g., from 
Richard:

[0.877040] BUG: unable to handle kernel NULL pointer dereference at 
0056
[0.877597] IP: [] ip_route_input_noref+0x1a2/0xb00
[0.877597] PGD 3fa14067 PUD 3fa6e067 PMD 0
[0.877597] Oops:  [#1] SMP
[0.877597] Modules linked in: virtio_net virtio_pci virtio_ring virtio
[0.877597] CPU: 1 PID: 119 Comm: ifconfig Not tainted 4.2.0+ #1
[0.877597] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[0.877597] task: 88003fab0bc0 ti: 88003faa8000 task.ti: 
88003faa8000
[0.877597] RIP: 0010:[]  [] 
ip_route_input_noref+0x1a2/0xb00
[0.877597] RSP: 0018:88003ed03ba0  EFLAGS: 00010202
[0.877597] RAX: 0046 RBX: ff8f RCX: 0020
[0.877597] RDX: 88003fab50b8 RSI: 0200 RDI: 8152b4b8
[0.877597] RBP: 88003ed03c50 R08:  R09: 
[0.877597] R10:  R11:  R12: 88003fab6f00
[0.877597] R13: 88003fab5000 R14:  R15: 81cb5600
[0.877597] FS:  7f6de5751700() GS:88003ed0() 
knlGS:
[0.877597] CS:  0010 DS:  ES:  CR0: 80050033
[0.877597] CR2: 0056 CR3: 3fa6d000 CR4: 06e0
[0.877597] Stack:
[0.877597]   0046 88003fffa600 
88003ed03be0
[0.877597]  88003f9e2c00 697da8c0017da8c0 8800 
0007fd00
[0.877597]   0046  
0004
[0.877597] Call Trace:
[0.877597]  
[0.877597]  [] ? cpumask_next_and+0x2f/0x40
[0.877597]  [] arp_process+0x39c/0x690
[0.877597]  [] arp_rcv+0x13e/0x170
[0.877597]  [] __netif_receive_skb_core+0x60c/0xa00
[0.877597]  [] ? __build_skb+0x25/0x100
[0.877597]  [] ? __build_skb+0x25/0x100
[0.877597]  [] __netif_receive_skb+0x16/0x70
[0.877597]  [] netif_receive_skb_internal+0x28/0x90
[0.877597]  [] napi_gro_receive+0x7f/0xd0
[0.877597]  [] virtnet_receive+0x256/0x910 [virtio_net]
[0.877597]  [] virtnet_poll+0x18/0x80 [virtio_net]
[0.877597]  [] net_rx_action+0x1dd/0x2f0
[0.877597]  [] __do_softirq+0x98/0x260
[0.877597]  [] do_softirq_own_stack+0x1c/0x30

The root cause is use of res.table uninitialized.

Thanks to Nikolay for noticing the uninitialized use amongst the maze of
gotos.

As Nikolay pointed out the second initialization is not required to fix
the oops, but rather to fix a related problem where a valid lookup should
be invalidated before creating the rth entry.

Fixes: b7503e0cdb5d ("net: Add FIB table id to rtable")
Reported-by: Sergey Senozhatsky 
Reported-by: Richard Alpe 
Reported-by: Fabio Estevam 
Tested-by: Fabio Estevam 
Signed-off-by: David Ahern 
---
v2:
- clarification in the commit message regarding the second initialization

 net/ipv4/route.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index da427a4a33fe..80f7c5b7b832 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1712,6 +1712,7 @@ static int ip_route_input_slow(struct sk_buff *skb, 
__be32 daddr, __be32 saddr,
goto martian_source;
 
res.fi = NULL;
+   res.table = NULL;
if (ipv4_is_lbcast(daddr) || (saddr == 0 && daddr == 0))
goto brd_input;
 
@@ -1834,6 +1835,7 @@ out:  return err;
RT_CACHE_STAT_INC(in_no_route);
res.type = RTN_UNREACHABLE;
res.fi = NULL;
+   res.table = NULL;
goto local_input;
 
/*
-- 
2.3.2 (Apple Git-55)
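
The effect of the two added lines can be illustrated with a small user-space
model of a function that jumps to a shared label from several places. All
names below are invented for illustration; the only assumption about the real
code is that the shared label reads res.table.

```c
#include <stddef.h>

/* Illustrative stand-ins, not the kernel definitions. */
struct fib_info_model { int unused; };
struct fib_table_model { unsigned int tb_id; };

struct fib_result_model {
	struct fib_info_model *fi;
	struct fib_table_model *table;
};

enum route_path { PATH_BROADCAST, PATH_NO_ROUTE, PATH_LOOKUP_OK };

/* Returns the table id seen at the shared label, or 0 if there is none. */
static unsigned int route_input_model(enum route_path path,
				      struct fib_table_model *found)
{
	struct fib_result_model res;	/* on the stack, not zeroed */

	res.fi = NULL;
	res.table = NULL;		/* models the first added line */
	if (path == PATH_BROADCAST)
		goto local_input;	/* taken before any lookup ran */

	res.table = found;		/* models a successful lookup */
	if (path == PATH_NO_ROUTE) {
		res.fi = NULL;
		res.table = NULL;	/* models the second added line */
		goto local_input;
	}

local_input:
	return res.table ? res.table->tb_id : 0;
}
```

Without the first assignment the broadcast path would read an indeterminate
pointer at the label, which matches the reported oops; without the second, the
no-route path would carry over a table from a lookup that is being discarded.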



Re: [PATCH net-next v2] net: Initialize table in fib result

2015-09-16 Thread Nikolay Aleksandrov
On 09/16/2015 06:16 PM, David Ahern wrote:
> Sergey, Richard and Fabio reported an oops in ip_route_input_noref. e.g., 
> from Richard:
> 
> [0.877040] BUG: unable to handle kernel NULL pointer dereference at 
> 0056
> [0.877597] IP: [] ip_route_input_noref+0x1a2/0xb00
> [0.877597] PGD 3fa14067 PUD 3fa6e067 PMD 0
> [0.877597] Oops:  [#1] SMP
> [0.877597] Modules linked in: virtio_net virtio_pci virtio_ring virtio
> [0.877597] CPU: 1 PID: 119 Comm: ifconfig Not tainted 4.2.0+ #1
> [0.877597] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> [0.877597] task: 88003fab0bc0 ti: 88003faa8000 task.ti: 
> 88003faa8000
> [0.877597] RIP: 0010:[]  [] 
> ip_route_input_noref+0x1a2/0xb00
> [0.877597] RSP: 0018:88003ed03ba0  EFLAGS: 00010202
> [0.877597] RAX: 0046 RBX: ff8f RCX: 
> 0020
> [0.877597] RDX: 88003fab50b8 RSI: 0200 RDI: 
> 8152b4b8
> [0.877597] RBP: 88003ed03c50 R08:  R09: 
> 
> [0.877597] R10:  R11:  R12: 
> 88003fab6f00
> [0.877597] R13: 88003fab5000 R14:  R15: 
> 81cb5600
> [0.877597] FS:  7f6de5751700() GS:88003ed0() 
> knlGS:
> [0.877597] CS:  0010 DS:  ES:  CR0: 80050033
> [0.877597] CR2: 0056 CR3: 3fa6d000 CR4: 
> 06e0
> [0.877597] Stack:
> [0.877597]   0046 88003fffa600 
> 88003ed03be0
> [0.877597]  88003f9e2c00 697da8c0017da8c0 8800 
> 0007fd00
> [0.877597]   0046  
> 0004
> [0.877597] Call Trace:
> [0.877597]  
> [0.877597]  [] ? cpumask_next_and+0x2f/0x40
> [0.877597]  [] arp_process+0x39c/0x690
> [0.877597]  [] arp_rcv+0x13e/0x170
> [0.877597]  [] __netif_receive_skb_core+0x60c/0xa00
> [0.877597]  [] ? __build_skb+0x25/0x100
> [0.877597]  [] ? __build_skb+0x25/0x100
> [0.877597]  [] __netif_receive_skb+0x16/0x70
> [0.877597]  [] netif_receive_skb_internal+0x28/0x90
> [0.877597]  [] napi_gro_receive+0x7f/0xd0
> [0.877597]  [] virtnet_receive+0x256/0x910 [virtio_net]
> [0.877597]  [] virtnet_poll+0x18/0x80 [virtio_net]
> [0.877597]  [] net_rx_action+0x1dd/0x2f0
> [0.877597]  [] __do_softirq+0x98/0x260
> [0.877597]  [] do_softirq_own_stack+0x1c/0x30
> 
> The root cause is use of res.table uninitialized.
> 
> Thanks to Nikolay for noticing the uninitialized use amongst the maze of
> gotos.
> 
> As Nikolay pointed out the second initialization is not required to fix
> the oops, but rather to fix a related problem where a valid lookup should
> be invalidated before creating the rth entry.
> 
> Fixes: b7503e0cdb5d ("net: Add FIB table id to rtable")
> Reported-by: Sergey Senozhatsky 
> Reported-by: Richard Alpe 
> Reported-by: Fabio Estevam 
> Tested-by: Fabio Estevam 
> Signed-off-by: David Ahern 
> ---
> v2:
> - clarification in the commit message regarding the second initialization
> 
>  net/ipv4/route.c | 2 ++
>  1 file changed, 2 insertions(+)
> 

Thanks again!

Signed-off-by: Nikolay Aleksandrov 



RE: [PATCH net-next RFC] net: increase LL_MAX_HEADER for Hyper-V

2015-09-16 Thread David Laight
From: Haiyang Zhang
> Sent: 16 September 2015 17:09
> > -Original Message-
> > From: Vitaly Kuznetsov [mailto:vkuzn...@redhat.com]
> > Sent: Wednesday, September 16, 2015 11:50 AM
> > To: netdev@vger.kernel.org
> > Cc: David S. Miller ; linux-ker...@vger.kernel.org;
> > KY Srinivasan ; Haiyang Zhang
> > ; Jason Wang 
> > Subject: [PATCH net-next RFC] net: increase LL_MAX_HEADER for Hyper-V
> >
> > Commit b08cc79155fc26d0d112b1470d1ece5034651a4b ("hv_netvsc: Eliminate
> >  memory allocation in the packet send path") introduced skb headroom
> > request for Hyper-V netvsc driver:
> >
> >max_needed_headroom = sizeof(struct hv_netvsc_packet) +
> >sizeof(struct rndis_message) +
> >NDIS_VLAN_PPI_SIZE + NDIS_CSUM_PPI_SIZE +
> >NDIS_LSO_PPI_SIZE + NDIS_HASH_PPI_SIZE;
> >...
> >net->needed_headroom = max_needed_headroom;
> >
> > max_needed_headroom is 220 bytes, it significantly exceeds the
> > LL_MAX_HEADER setting. This causes each skb to be cloned on send path,
> > e.g. for IPv4 case we fall into the following clause
> > (ip_finish_output2()):
> >
> > if (unlikely(skb_headroom(skb) < hh_len && dev->header_ops)) {
> > ...
> > skb2 = skb_realloc_headroom(skb, LL_RESERVED_SPACE(dev));
> > ...
> > }
> >
> > leading to a significant performance regression. Increase LL_MAX_HEADER
> > to make it suitable for netvsc, make it 224 to be 16-aligned.
> > Alternatively we could (partially) revert the commit which introduced
> > skb
> > headroom request restoring manual memory allocation on transmit path.
> >
> > Signed-off-by: Vitaly Kuznetsov 
> > ---
> >  include/linux/netdevice.h | 4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > index 88a0069..7233790 100644
> > --- a/include/linux/netdevice.h
> > +++ b/include/linux/netdevice.h
> > @@ -132,7 +132,9 @@ static inline bool dev_xmit_complete(int rc)
> >   * used.
> >   */
> >
> > -#if defined(CONFIG_WLAN) || IS_ENABLED(CONFIG_AX25)
> > +#if IS_ENABLED(CONFIG_HYPERV_NET)
> > +# define LL_MAX_HEADER 224
> > +#elif defined(CONFIG_WLAN) || IS_ENABLED(CONFIG_AX25)
> >  # if defined(CONFIG_MAC80211_MESH)
> >  #  define LL_MAX_HEADER 128
> >  # else
> 
> Thanks for the patch.
> To avoid we forget to update that 224 number when we add more things
> into netvsc header, I suggest that we define a macro in netdevice.h such
> as:
> #define HVNETVSC_MAX_HEADER 224
> #define LL_MAX_HEADER HVNETVSC_MAX_HEADER
> 
> And, put a note in netvsc code saying the header reservation shouldn't
> exceed HVNETVSC_MAX_HEADER, or you need to update HVNETVSC_MAX_HEADER.

Am I right in thinking this is adding an extra 96 unused bytes to the front
of almost all skbs just so that hyper-v can make its link level header
contiguous with whatever follows (IP header?)?

Doesn't sound ideal.

David



[PATCH v2] atm: deal with setting entry before mkip was called

2015-09-16 Thread Sasha Levin
If ATMARP_MKIP was not called before ATMARP_ENCAP, the VCC descriptor is
nonexistent and we end up dereferencing a NULL pointer:

[1033173.491930] kasan: GPF could be caused by NULL-ptr deref or user memory 
accessirq event stamp: 123386
[1033173.493678] general protection fault:  [#1] PREEMPT SMP 
DEBUG_PAGEALLOC KASAN
[1033173.493689] Modules linked in:
[1033173.493697] CPU: 9 PID: 23815 Comm: trinity-c64 Not tainted 
4.2.0-next-20150911-sasha-00043-g353d875-dirty #2545
[1033173.493706] task: 8800630c4000 ti: 88006311 task.ti: 
88006311
[1033173.493823] RIP: clip_ioctl (net/atm/clip.c:320 net/atm/clip.c:689)
[1033173.493826] RSP: 0018:880063117a88  EFLAGS: 00010203
[1033173.493828] RAX: dc00 RBX:  RCX: 
000c
[1033173.493830] RDX: 0002 RSI: b3f10720 RDI: 
0014
[1033173.493832] RBP: 880063117b80 R08: 88047574d9a4 R09: 

[1033173.493834] R10:  R11:  R12: 
11000c622f53
[1033173.493836] R13: 8800cb905500 R14: 8808d6da2000 R15: 
fdfd
[1033173.493840] FS:  7fa56b92d700() GS:88047800() 
knlGS:
[1033173.493843] CS:  0010 DS:  ES:  CR0: 8005003b
[1033173.493845] CR2:  CR3: 630e8000 CR4: 
06a0
[1033173.493855] Stack:
[1033173.493862]  b0b60444 eaea 41b58ab3 
b3c3ce32
[1033173.493867]  b0b6f3e0 b0b60444 b5ea2e50 
11000c622f5e
[1033173.493873]  8800630c4cd8 000ee09a b3ec4888 
b5ea2de8
[1033173.493874] Call Trace:
[1033173.494108] do_vcc_ioctl (net/atm/ioctl.c:170)
[1033173.494113] vcc_ioctl (net/atm/ioctl.c:189)
[1033173.494116] svc_ioctl (net/atm/svc.c:605)
[1033173.494200] sock_do_ioctl (net/socket.c:874)
[1033173.494204] sock_ioctl (net/socket.c:958)
[1033173.494244] do_vfs_ioctl (fs/ioctl.c:43 fs/ioctl.c:607)
[1033173.494290] SyS_ioctl (fs/ioctl.c:622 fs/ioctl.c:613)
[1033173.494295] entry_SYSCALL_64_fastpath (arch/x86/entry/entry_64.S:186)
[1033173.494362] Code: fa 48 c1 ea 03 80 3c 02 00 0f 85 50 09 00 00 49 8b 9e 60 
06 00 00 48 b8 00 00 00 00 00 fc ff df 48 8d 7b 14 48 89 fa 48 c1 ea 03 <0f> b6 
04 02 48 89 fa 83 e2 07 38 d0 7f 08 84 c0 0f 85 14 09 00
All code

   0:   fa  cli
   1:   48 c1 ea 03 shr$0x3,%rdx
   5:   80 3c 02 00 cmpb   $0x0,(%rdx,%rax,1)
   9:   0f 85 50 09 00 00   jne0x95f
   f:   49 8b 9e 60 06 00 00mov0x660(%r14),%rbx
  16:   48 b8 00 00 00 00 00movabs $0xdc00,%rax
  1d:   fc ff df
  20:   48 8d 7b 14 lea0x14(%rbx),%rdi
  24:   48 89 famov%rdi,%rdx
  27:   48 c1 ea 03 shr$0x3,%rdx
  2b:*  0f b6 04 02 movzbl (%rdx,%rax,1),%eax   <-- 
trapping instruction
  2f:   48 89 famov%rdi,%rdx
  32:   83 e2 07and$0x7,%edx
  35:   38 d0   cmp%dl,%al
  37:   7f 08   jg 0x41
  39:   84 c0   test   %al,%al
  3b:   0f 85 14 09 00 00   jne0x955

Code starting with the faulting instruction
===
   0:   0f b6 04 02 movzbl (%rdx,%rax,1),%eax
   4:   48 89 famov%rdi,%rdx
   7:   83 e2 07and$0x7,%edx
   a:   38 d0   cmp%dl,%al
   c:   7f 08   jg 0x16
   e:   84 c0   test   %al,%al
  10:   0f 85 14 09 00 00   jne0x92a
[1033173.494366] RIP clip_ioctl (net/atm/clip.c:320 net/atm/clip.c:689)
[1033173.494368]  RSP 

Signed-off-by: Sasha Levin 
---
 net/atm/clip.c |3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/atm/clip.c b/net/atm/clip.c
index 17e55df..4407b2f 100644
--- a/net/atm/clip.c
+++ b/net/atm/clip.c
@@ -317,6 +317,9 @@ static int clip_constructor(struct neighbour *neigh)
 
 static int clip_encap(struct atm_vcc *vcc, int mode)
 {
+   if (!CLIP_VCC(vcc))
+   return -EBADFD;
+
CLIP_VCC(vcc)->encap = mode;
return 0;
 }
-- 
1.7.10.4
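
The guard added by this patch can be modelled in user space as below. The
struct names and clip_vcc_of() are invented for illustration; the sketch
assumes, as the patch implies, that the per-VCC CLIP state hangs off a pointer
that stays NULL until the MKIP step has run. EBADFD is Linux-specific, so a
fallback definition is provided for other systems.

```c
#include <errno.h>
#include <stddef.h>

#ifndef EBADFD
#define EBADFD 77	/* Linux value, used only where errno.h lacks it */
#endif

/* Illustrative stand-ins, not the kernel definitions. */
struct clip_vcc_model { int encap; };
struct atm_vcc_model { void *user_back; };	/* NULL until MKIP ran */

static struct clip_vcc_model *clip_vcc_of(struct atm_vcc_model *vcc)
{
	return vcc->user_back;
}

static int clip_encap_model(struct atm_vcc_model *vcc, int mode)
{
	struct clip_vcc_model *clip_vcc = clip_vcc_of(vcc);

	if (!clip_vcc)		/* ENCAP requested before MKIP */
		return -EBADFD;

	clip_vcc->encap = mode;
	return 0;
}
```

Calling the model with an unprepared VCC returns -EBADFD instead of writing
through a NULL pointer, which is the behaviour the patch aims for.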



RE: [PATCH net-next RFC] net: increase LL_MAX_HEADER for Hyper-V

2015-09-16 Thread KY Srinivasan


> -Original Message-
> From: David Laight [mailto:david.lai...@aculab.com]
> Sent: Wednesday, September 16, 2015 9:25 AM
> To: Haiyang Zhang ; Vitaly Kuznetsov
> ; netdev@vger.kernel.org
> Cc: David S. Miller ; linux-ker...@vger.kernel.org;
> KY Srinivasan ; Jason Wang 
> Subject: RE: [PATCH net-next RFC] net: increase LL_MAX_HEADER for Hyper-
> V
> 
> From: Haiyang Zhang
> > Sent: 16 September 2015 17:09
> > > -Original Message-
> > > From: Vitaly Kuznetsov [mailto:vkuzn...@redhat.com]
> > > Sent: Wednesday, September 16, 2015 11:50 AM
> > > To: netdev@vger.kernel.org
> > > Cc: David S. Miller ; linux-
> ker...@vger.kernel.org;
> > > KY Srinivasan ; Haiyang Zhang
> > > ; Jason Wang 
> > > Subject: [PATCH net-next RFC] net: increase LL_MAX_HEADER for Hyper-
> V
> > >
> > > Commit b08cc79155fc26d0d112b1470d1ece5034651a4b ("hv_netvsc:
> Eliminate
> > >  memory allocation in the packet send path") introduced skb headroom
> > > request for Hyper-V netvsc driver:
> > >
> > >max_needed_headroom = sizeof(struct hv_netvsc_packet) +
> > >sizeof(struct rndis_message) +
> > >NDIS_VLAN_PPI_SIZE + NDIS_CSUM_PPI_SIZE +
> > >NDIS_LSO_PPI_SIZE + NDIS_HASH_PPI_SIZE;
> > >...
> > >net->needed_headroom = max_needed_headroom;
> > >
> > > max_needed_headroom is 220 bytes, it significantly exceeds the
> > > LL_MAX_HEADER setting. This causes each skb to be cloned on send
> path,
> > > e.g. for IPv4 case we fall into the following clause
> > > (ip_finish_output2()):
> > >
> > > if (unlikely(skb_headroom(skb) < hh_len && dev->header_ops)) {
> > > ...
> > > skb2 = skb_realloc_headroom(skb, LL_RESERVED_SPACE(dev));
> > > ...
> > > }
> > >
> > > leading to a significant performance regression. Increase
> LL_MAX_HEADER
> > > to make it suitable for netvsc, make it 224 to be 16-aligned.
> > > Alternatively we could (partially) revert the commit which introduced
> > > skb
> > > headroom request restoring manual memory allocation on transmit path.
> > >
> > > Signed-off-by: Vitaly Kuznetsov 
> > > ---
> > >  include/linux/netdevice.h | 4 +++-
> > >  1 file changed, 3 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > > index 88a0069..7233790 100644
> > > --- a/include/linux/netdevice.h
> > > +++ b/include/linux/netdevice.h
> > > @@ -132,7 +132,9 @@ static inline bool dev_xmit_complete(int rc)
> > >   *   used.
> > >   */
> > >
> > > -#if defined(CONFIG_WLAN) || IS_ENABLED(CONFIG_AX25)
> > > +#if IS_ENABLED(CONFIG_HYPERV_NET)
> > > +# define LL_MAX_HEADER 224
> > > +#elif defined(CONFIG_WLAN) || IS_ENABLED(CONFIG_AX25)
> > >  # if defined(CONFIG_MAC80211_MESH)
> > >  #  define LL_MAX_HEADER 128
> > >  # else
> >
> > Thanks for the patch.
> > To avoid forgetting to update that 224 number when we add more things
> > into netvsc header, I suggest that we define a macro in netdevice.h such
> > as:
> > #define HVNETVSC_MAX_HEADER 224
> > #define LL_MAX_HEADER HVNETVSC_MAX_HEADER
> >
> > And, put a note in netvsc code saying the header reservation shouldn't
> > exceed HVNETVSC_MAX_HEADER, or you need to update
> HVNETVSC_MAX_HEADER.
> 
> Am I right in thinking this is adding an extra 96 unused bytes to the front
> of almost all skb just so that hyper-v can make its link level header
> contiguous with whatever follows (IP header ?).
> 
> Doesn't sound ideal.

Remote NDIS is the protocol used to send packets from the guest to the host.
Every packet needs to be decorated with the RNDIS header, and the maximum room
needed for the RNDIS header is the headroom we want.

K. Y
> 
>   David



Re: [net-next:master 6/12] include/linux/usb/cdc.h:23: error: redefinition of 'struct usb_cdc_parsed_header'

2015-09-16 Thread David Miller
From: Fengguang Wu 
Date: Wed, 16 Sep 2015 21:06:58 +0800

> On Tue, Sep 15, 2015 at 01:27:42PM -0700, David Miller wrote:
>> From: kbuild test robot 
>> Date: Wed, 16 Sep 2015 03:57:11 +0800
>> 
>> > All error/warnings (new ones prefixed by >>):
>> > 
>> >In file included from drivers/usb/gadget/function/u_ether.h:20,
>> > from drivers/usb/gadget/legacy/cdc2.c:16:
>> >include/linux/usb/cdc.h:47: warning: 'struct usb_interface' declared 
>> > inside parameter list
>> >include/linux/usb/cdc.h:47: warning: its scope is only this definition 
>> > or declaration, which is probably not what you want
>> >In file included from drivers/usb/gadget/function/u_serial.h:16,
>> > from drivers/usb/gadget/legacy/cdc2.c:17:
>> >>> include/linux/usb/cdc.h:23: error: redefinition of 'struct 
>> >>> usb_cdc_parsed_header'
>> >include/linux/usb/cdc.h:47: warning: 'struct usb_interface' declared 
>> > inside parameter list
>> >>> include/linux/usb/cdc.h:47: error: conflicting types for 
>> >>> 'cdc_parse_cdc_header'
>> >include/linux/usb/cdc.h:47: error: previous declaration of 
>> > 'cdc_parse_cdc_header' was here
>> 
>> This may be a side effect of the initial warning, does this reproduce with
>> that fixed?  Please show me what the warning looks like in that case.
> 
> Dave, net-next/master commit ad1e7b97b3 ("cdc: Fix build warning.")
> still has errors.
> 
> The problem is, the header file  is included twice.

That's not possible after the patch I committed from Stephen Rothwell
which adds proper include guards:


commit b84ee0d7f375ed7840c7c110d46eac24cf94b2a2
Author: Stephen Rothwell 
Date:   Wed Sep 16 11:10:16 2015 +1000

cdc: add header guards

Signed-off-by: Stephen Rothwell 
Signed-off-by: David S. Miller 

diff --git a/include/linux/usb/cdc.h b/include/linux/usb/cdc.h
index 959d0c8..b5706f9 100644
--- a/include/linux/usb/cdc.h
+++ b/include/linux/usb/cdc.h
@@ -7,6 +7,8 @@
  * modify it under the terms of the GNU General Public License
  * version 2 as published by the Free Software Foundation.
  */
+#ifndef __LINUX_USB_CDC_H
+#define __LINUX_USB_CDC_H
 
 #include 
 
@@ -45,3 +47,5 @@ int cdc_parse_cdc_header(struct usb_cdc_parsed_header *hdr,
struct usb_interface *intf,
u8 *buffer,
int buflen);
+
+#endif /* __LINUX_USB_CDC_H */
diff --git a/include/uapi/linux/usb/cdc.h b/include/uapi/linux/usb/cdc.h
index b6a9cdd..e2bc417 100644
--- a/include/uapi/linux/usb/cdc.h
+++ b/include/uapi/linux/usb/cdc.h
@@ -6,8 +6,8 @@
  * firmware based USB peripherals.
  */
 
-#ifndef __LINUX_USB_CDC_H
-#define __LINUX_USB_CDC_H
+#ifndef __UAPI_LINUX_USB_CDC_H
+#define __UAPI_LINUX_USB_CDC_H
 
 #include 
 
@@ -444,4 +444,4 @@ struct usb_cdc_ncm_ndp_input_size {
 #define USB_CDC_NCM_CRC_NOT_APPENDED   0x00
 #define USB_CDC_NCM_CRC_APPENDED   0x01
 
-#endif /* __LINUX_USB_CDC_H */
+#endif /* __UAPI_LINUX_USB_CDC_H */
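
For readers following the header-guard fix, here is a minimal sketch of the mechanism, pasting a header body twice into one translation unit to imitate a double inclusion. The names demo_widget and DEMO_WIDGET_H are hypothetical and only serve the illustration; they are not part of the kernel headers touched above.

```c
#include <assert.h>

/* First "inclusion" of a hypothetical header body. */
#ifndef DEMO_WIDGET_H
#define DEMO_WIDGET_H

struct demo_widget {
	int id;
};

static int demo_widget_id(const struct demo_widget *w)
{
	return w->id;
}

#endif /* DEMO_WIDGET_H */

/*
 * Second "inclusion" of the same body. DEMO_WIDGET_H is already defined,
 * so the preprocessor skips everything up to the #endif. Without the
 * guard lines the compiler would see the struct twice and report
 * "redefinition of 'struct demo_widget'".
 */
#ifndef DEMO_WIDGET_H
#define DEMO_WIDGET_H

struct demo_widget {
	int id;
};

static int demo_widget_id(const struct demo_widget *w)
{
	return w->id;
}

#endif /* DEMO_WIDGET_H */
```

The diff above also renames the uapi guard. A plausible reason is that two different headers sharing one guard macro would cause whichever is included second to be skipped entirely, so each header needs its own macro name.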


[PATCH net-next] net: fix cdc-phonet.c dependency and build error

2015-09-16 Thread Randy Dunlap
From: Randy Dunlap <rdun...@infradead.org>

Fix build error caused by missing Kconfig dependency:

ERROR: "cdc_parse_cdc_header" [drivers/net/usb/cdc-phonet.ko] undefined!

Reported-by: Fengguang Wu <fengguang...@intel.com>
Signed-off-by: Randy Dunlap <rdun...@infradead.org>
---
 drivers/net/usb/Kconfig |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- linux-next-20150916.orig/drivers/net/usb/Kconfig
+++ linux-next-20150916/drivers/net/usb/Kconfig
@@ -541,7 +541,7 @@ config USB_NET_INT51X1
 
 config USB_CDC_PHONET
 	tristate "CDC Phonet support"
-	depends on PHONET
+	depends on PHONET && USB_USBNET
 	help
 	  Choose this option to support the Phonet interface to a Nokia
 	  cellular modem, as found on most Nokia handsets with the


Re: [PATCH net-next RFC] net: increase LL_MAX_HEADER for Hyper-V

2015-09-16 Thread David Miller
From: David Laight 
Date: Wed, 16 Sep 2015 16:25:03 +

> Am I right in thinking this is adding an extra 96 unused bytes to the front
> of almost all skb just so that hyper-v can make its link level header
> contiguous with whatever follows (IP header ?).
> 
> Doesn't sound ideal.

Agreed, this is ridiculous, and the entire stack will incur this cost
just because hyperv is enabled in the kernel config.
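
The arithmetic behind this objection can be checked with a small standalone sketch. It assumes the 128-byte LL_MAX_HEADER baseline implied by the 96-byte figure quoted above; other kernel configurations may start from a different value. The helper names are hypothetical and are not kernel API.

```c
#include <assert.h>

/* Figures taken from the thread. */
#define OLD_LL_MAX_HEADER 128u /* assumed baseline (CONFIG_MAC80211_MESH case) */
#define NETVSC_HEADROOM   220u /* max_needed_headroom reported for netvsc */

/* Round x up to the next multiple of align; align must be a power of two. */
static unsigned int round_up_pow2(unsigned int x, unsigned int align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (x + align - 1) & ~(align - 1);
}

/*
 * Same shape as the test in ip_finish_output2(): when the skb has less
 * headroom than the device asks for, the head must be reallocated.
 */
static int needs_realloc(unsigned int skb_headroom, unsigned int hh_len)
{
	return skb_headroom < hh_len;
}

/* Extra bytes reserved in front of every skb if the limit is raised. */
static unsigned int extra_bytes_per_skb(unsigned int new_max)
{
	return new_max - OLD_LL_MAX_HEADER;
}
```

Calling round_up_pow2(NETVSC_HEADROOM, 16) yields the proposed 224, and extra_bytes_per_skb(224) yields the 96 bytes that every skb would carry whether or not it ever reaches a netvsc device.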


Re: [PATCH RFC 0/3] Allow postponed netfilter handling for socket matches

2015-09-16 Thread Florian Westphal
Daniel Mack  wrote:
> I'm re-addressing the issue of matching socket meta information for
> non-established sockets that has been discussed a while ago:
> 
>   http://article.gmane.org/gmane.comp.security.firewalls.netfilter.devel/56877
> 
> Being able to reliably match on net_cls cgroup ids is crucial in
> order to build a per-application or per-container firewall rules
> which don't leak ingress packets. Such a feature would be very
> useful to have.

Could you clarify what 'which don't leak ingress packets' means?

> A previous attempt to fix the currently existing issues was to call
> out to the early demuxing helper functions from the meta matching
> callbacks, but that doesn't suffice because it doesn't address the
> case of multicast UDP and other, more complex lookup methods
> implemented in various protocol handlers.

Yes, but see below.

> This patch set outlines a different approach by adding a flag to
> 'struct sk_buff' called 'nf_postponed'. This flag is set by
> nft_meta_get_eval() in case a decision cannot be made due to a missing
> skb->sk. skbs flagged that way will then be run through the netfilter
> chain processor again after the protocol handlers did the real socket
> lookup. A small addition to 'struct nft_pktinfo' is needed so that the
> matching callbacks can access the socket that was passed into
> nf_hook().
> 
> Note that the new flag does not actually bloat 'struct sk_buff',
> because it still fits into the 'flags1' bitfield. Also, the extra
> netfilter chain iteration will not be done by any subsequent packet in
> the same stream, as for those, the early demux code will set skb->sk.
> 
> The patch set is obviously not yet finished, because a lot more
> protocol handlers need to be patched. Right now, I only addressed
> tcp_ipv4. Before I do that, I want to get some feedback on the
> approach, so please let me know what you think.

I think there are several issues.

implementation problems:
- I'm not sure it's legal to call the input hook with skb->sk locked;
  some matches might want to acquire it.
- what makes NFT_META_CGROUP special? (or was that just an example?)

design issues:
The assumption seems to be that a given skb can always be mapped to a
particular socket, and hence a cgroup.

That's not necessarily the case, e.g. with broad-/multicasting or when
the socket is e.g. in timewait state.

Some skbs will now travel INPUT hooks twice.

And once you'd extend this so that we re-invoke nf hooks for mcast
packets, for each socket they've been received on, you change netfilter
behaviour again (one skb, one traversal -> n traversals of ruleset, one
for each sk).

I think that this makes it a non-starter, sorry.

I would much rather see nft_demux_{udp,tcp,sctp,dccp,...}.c which moves
early-demux-esque code into the nft ruleset.

Then you could do something like

nft add rule ip filter input meta l4proto tcp demux meta cgroup 42

The caveat being that even in this case we cannot guarantee
that skb->sk is set afterwards, or that a cgroup can be derived from it.

Iff you absolutely need this, I'd seriously entertain the idea of adding
NFPROTO_L4_TCP, etc, ... or, maybe better, allow attaching an nft ruleset
as a socket filter.

But really, at that point, a much better question would be whether net
cgroups are the answer to whatever the question was, or what problem we
are attempting to address here...


Re: [PATCH] net: qdisc: enhance default_qdisc documentation

2015-09-16 Thread Cong Wang
On Tue, Sep 15, 2015 at 1:33 AM, Phil Sutter  wrote:
> Aside from some lingual cleanup, point out which interfaces are not or
> partly covered by this setting.
>
> Signed-off-by: Phil Sutter 

Acked-by: Cong Wang 

It is also worth explaining what the default qdisc means, but we can do
that in another patch.

Thanks.


[PATCH net-next v2] xen-netfront: always set num queues if possible

2015-09-16 Thread Charles (Chas) Williams
If netfront connects with two (or more) queues and then reconnects with
only one queue it fails to delete or rewrite the multi-queue-num-queues
key and netback will try to use the wrong number of queues.

Always write the num-queues field if the backend has multi-queue support.

Signed-off-by: Chas Williams <3ch...@gmail.com>
---
 drivers/net/xen-netfront.c | 17 ++---
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
index f821a97..9bf63c2 100644
--- a/drivers/net/xen-netfront.c
+++ b/drivers/net/xen-netfront.c
@@ -1819,19 +1819,22 @@ again:
 		goto destroy_ring;
 	}
 
-	if (num_queues == 1) {
-		err = write_queue_xenstore_keys(&info->queues[0], &xbt, 0); /* flat */
-		if (err)
-			goto abort_transaction_no_dev_fatal;
-	} else {
+	if (xenbus_exists(XBT_NIL,
+			  info->xbdev->otherend, "multi-queue-max-queues")) {
 		/* Write the number of queues */
-		err = xenbus_printf(xbt, dev->nodename, "multi-queue-num-queues",
-				    "%u", num_queues);
+		err = xenbus_printf(xbt, dev->nodename,
+				    "multi-queue-num-queues", "%u", num_queues);
 		if (err) {
 			message = "writing multi-queue-num-queues";
 			goto abort_transaction_no_dev_fatal;
 		}
+	}
 
+	if (num_queues == 1) {
+		err = write_queue_xenstore_keys(&info->queues[0], &xbt, 0); /* flat */
+		if (err)
+			goto abort_transaction_no_dev_fatal;
+	} else {
 		/* Write the keys for each queue */
 		for (i = 0; i < num_queues; ++i) {
 			queue = &info->queues[i];
-- 
2.1.0
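
The stale-key problem described in this patch can be modelled with a toy sketch. It replaces xenstore with a single in-memory key and assumes that the backend treats a missing multi-queue-num-queues key as "one queue"; that default, and all names below, are assumptions made for illustration rather than the driver's actual code.

```c
#include <assert.h>

/* Toy stand-in for the frontend's xenstore node: it tracks one key only. */
struct toy_store {
	int has_num_queues;      /* is multi-queue-num-queues present? */
	unsigned int num_queues; /* its value when present */
};

/* Old behaviour: the key is written only when more than one queue is used. */
static void connect_old(struct toy_store *s, unsigned int num_queues)
{
	if (num_queues != 1) {
		s->has_num_queues = 1;
		s->num_queues = num_queues;
	}
}

/* Patched behaviour: write the key whenever the backend supports multi-queue. */
static void connect_new(struct toy_store *s, int backend_multiqueue,
			unsigned int num_queues)
{
	if (backend_multiqueue) {
		s->has_num_queues = 1;
		s->num_queues = num_queues;
	}
}

/* What the backend would conclude (assumption: a missing key means 1). */
static unsigned int backend_view(const struct toy_store *s)
{
	return s->has_num_queues ? s->num_queues : 1;
}
```

Connecting with two queues and then reconnecting with one leaves backend_view() at 2 under connect_old(), which is the mismatch the commit message describes; connect_new() overwrites the key and the view drops to 1.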





Re: [PATCH net-next RFC] net: increase LL_MAX_HEADER for Hyper-V

2015-09-16 Thread Alexander Duyck

On 09/16/2015 10:55 AM, KY Srinivasan wrote:




> Remote NDIS is the protocol used to send packets from the guest to the host.
> Every packet needs to be decorated with the RNDIS header, and the maximum room
> needed for the RNDIS header is the headroom we want.


I think we get that.  The question is does the Remote NDIS header and 
packet info actually need to be a part of the header data?  I would 
argue that it probably doesn't.


So for example in netvsc_start_xmit it looks like you are calling 
init_page_array in order to populate a set of page buffers, but the 
first buffer for the Remote NDIS protocol is populated as a separate 
page and offset.  As such it doesn't seem like it necessarily needs to 
be a part of the header data but could be maintained perhaps in a 
separate ring buffer, or perhaps just be a separate page that you break 
up to use for each header.


- Alex
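
One way to picture the "separate page that you break up to use for each header" idea is the sketch below. It is only an illustration under stated assumptions: a 4096-byte page, 224-byte slots (220 rounded up to 16), and transmit completions that arrive in the order packets were sent. The names hdr_pool, hdr_pool_get and hdr_pool_put are hypothetical and do not exist in the netvsc driver.

```c
#include <assert.h>
#include <stddef.h>

#define HDR_SLOT_SIZE 224u  /* assumed: 220-byte header rounded up to 16 */
#define HDR_PAGE_SIZE 4096u /* assumed page size */
#define HDR_SLOTS     (HDR_PAGE_SIZE / HDR_SLOT_SIZE) /* 18 slots per page */

/* One page carved into fixed-size header slots, handed out as a ring. */
struct hdr_pool {
	unsigned char page[HDR_PAGE_SIZE];
	unsigned int head;   /* index of the next slot to hand out */
	unsigned int in_use; /* slots handed out and not yet released */
};

/* Take the next slot, or return NULL when every slot is still in flight. */
static void *hdr_pool_get(struct hdr_pool *p)
{
	void *slot;

	if (p->in_use == HDR_SLOTS)
		return NULL;
	slot = p->page + (size_t)p->head * HDR_SLOT_SIZE;
	p->head = (p->head + 1) % HDR_SLOTS;
	p->in_use++;
	return slot;
}

/* Release the oldest outstanding slot (relies on in-order completion). */
static void hdr_pool_put(struct hdr_pool *p)
{
	assert(p->in_use > 0);
	p->in_use--;
}
```

With such a pool the RNDIS header would live outside the skb, so no per-packet allocation and no enlarged LL_MAX_HEADER would be needed; the cost is that the sender must cope with hdr_pool_get() returning NULL when all slots are busy.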


[PATCH net 5/6] lan78xx: Create lan78xx_get_mdix_status() and lan78xx_set_mdix_status() for MDIX control.

2015-09-16 Thread Woojung.Huh
Create lan78xx_get_mdix_status() and lan78xx_set_mdix_status() for MDIX control.

Signed-off-by: Woojung Huh 
---
 drivers/net/usb/lan78xx.c | 90 +++
 1 file changed, 52 insertions(+), 38 deletions(-)

diff --git a/drivers/net/usb/lan78xx.c b/drivers/net/usb/lan78xx.c
index 517264d..9102c71 100644
--- a/drivers/net/usb/lan78xx.c
+++ b/drivers/net/usb/lan78xx.c
@@ -1175,6 +1175,55 @@ static void lan78xx_set_msglevel(struct net_device *net, u32 level)
dev->msg_enable = level;
 }
 
+static int lan78xx_get_mdix_status(struct net_device *net)
+{
+   struct phy_device *phydev = net->phydev;
+   int buf;
+
+   phy_write(phydev, LAN88XX_EXT_PAGE_ACCESS, LAN88XX_EXT_PAGE_SPACE_1);
+   buf = phy_read(phydev, LAN88XX_EXT_MODE_CTRL);
+   phy_write(phydev, LAN88XX_EXT_PAGE_ACCESS, LAN88XX_EXT_PAGE_SPACE_0);
+
+   return buf;
+}
+
+static void lan78xx_set_mdix_status(struct net_device *net, __u8 mdix_ctrl)
+{
+   struct lan78xx_net *dev = netdev_priv(net);
+   struct phy_device *phydev = net->phydev;
+   int buf;
+
+   if (mdix_ctrl == ETH_TP_MDI) {
+   phy_write(phydev, LAN88XX_EXT_PAGE_ACCESS,
+ LAN88XX_EXT_PAGE_SPACE_1);
+   buf = phy_read(phydev, LAN88XX_EXT_MODE_CTRL);
+   buf &= ~LAN88XX_EXT_MODE_CTRL_MDIX_MASK_;
+   phy_write(phydev, LAN88XX_EXT_MODE_CTRL,
+ buf | LAN88XX_EXT_MODE_CTRL_MDI_);
+   phy_write(phydev, LAN88XX_EXT_PAGE_ACCESS,
+ LAN88XX_EXT_PAGE_SPACE_0);
+   } else if (mdix_ctrl == ETH_TP_MDI_X) {
+   phy_write(phydev, LAN88XX_EXT_PAGE_ACCESS,
+ LAN88XX_EXT_PAGE_SPACE_1);
+   buf = phy_read(phydev, LAN88XX_EXT_MODE_CTRL);
+   buf &= ~LAN88XX_EXT_MODE_CTRL_MDIX_MASK_;
+   phy_write(phydev, LAN88XX_EXT_MODE_CTRL,
+ buf | LAN88XX_EXT_MODE_CTRL_MDI_X_);
+   phy_write(phydev, LAN88XX_EXT_PAGE_ACCESS,
+ LAN88XX_EXT_PAGE_SPACE_0);
+   } else if (mdix_ctrl == ETH_TP_MDI_AUTO) {
+   phy_write(phydev, LAN88XX_EXT_PAGE_ACCESS,
+ LAN88XX_EXT_PAGE_SPACE_1);
+   buf = phy_read(phydev, LAN88XX_EXT_MODE_CTRL);
+   buf &= ~LAN88XX_EXT_MODE_CTRL_MDIX_MASK_;
+   phy_write(phydev, LAN88XX_EXT_MODE_CTRL,
+ buf | LAN88XX_EXT_MODE_CTRL_AUTO_MDIX_);
+   phy_write(phydev, LAN88XX_EXT_PAGE_ACCESS,
+ LAN88XX_EXT_PAGE_SPACE_0);
+   }
+   dev->mdix_ctrl = mdix_ctrl;
+}
+
 static int lan78xx_get_settings(struct net_device *net, struct ethtool_cmd *cmd)
 {
struct lan78xx_net *dev = netdev_priv(net);
@@ -1188,9 +1237,7 @@ static int lan78xx_get_settings(struct net_device *net, struct ethtool_cmd *cmd)
 
ret = phy_ethtool_gset(phydev, cmd);
 
-   phy_write(phydev, LAN88XX_EXT_PAGE_ACCESS, LAN88XX_EXT_PAGE_SPACE_1);
-   buf = phy_read(phydev, LAN88XX_EXT_MODE_CTRL);
-   phy_write(phydev, LAN88XX_EXT_PAGE_ACCESS, LAN88XX_EXT_PAGE_SPACE_0);
+   buf = lan78xx_get_mdix_status(net);
 
buf &= LAN88XX_EXT_MODE_CTRL_MDIX_MASK_;
if (buf == LAN88XX_EXT_MODE_CTRL_AUTO_MDIX_) {
@@ -1221,34 +1268,7 @@ static int lan78xx_set_settings(struct net_device *net, struct ethtool_cmd *cmd)
return ret;
 
if (dev->mdix_ctrl != cmd->eth_tp_mdix_ctrl) {
-   if (cmd->eth_tp_mdix_ctrl == ETH_TP_MDI) {
-   phy_write(phydev, LAN88XX_EXT_PAGE_ACCESS,
- LAN88XX_EXT_PAGE_SPACE_1);
-   temp = phy_read(phydev, LAN88XX_EXT_MODE_CTRL);
-   temp &= ~LAN88XX_EXT_MODE_CTRL_MDIX_MASK_;
-   phy_write(phydev, LAN88XX_EXT_MODE_CTRL,
- temp | LAN88XX_EXT_MODE_CTRL_MDI_);
-   phy_write(phydev, LAN88XX_EXT_PAGE_ACCESS,
- LAN88XX_EXT_PAGE_SPACE_0);
-   } else if (cmd->eth_tp_mdix_ctrl == ETH_TP_MDI_X) {
-   phy_write(phydev, LAN88XX_EXT_PAGE_ACCESS,
- LAN88XX_EXT_PAGE_SPACE_1);
-   temp = phy_read(phydev, LAN88XX_EXT_MODE_CTRL);
-   temp &= ~LAN88XX_EXT_MODE_CTRL_MDIX_MASK_;
-   phy_write(phydev, LAN88XX_EXT_MODE_CTRL,
- temp | LAN88XX_EXT_MODE_CTRL_MDI_X_);
-   phy_write(phydev, LAN88XX_EXT_PAGE_ACCESS,
- LAN88XX_EXT_PAGE_SPACE_0);
-   } else if (cmd->eth_tp_mdix_ctrl == ETH_TP_MDI_AUTO) {
-   phy_write(phydev, LAN88XX_EXT_PAGE_ACCESS,
- LAN88XX_EXT_PAGE_SPACE_1);
-   temp = phy_read(phydev, 

Re: [PATCH net-next RFC] net: increase LL_MAX_HEADER for Hyper-V

2015-09-16 Thread Alexander Duyck

On 09/16/2015 03:57 PM, KY Srinivasan wrote:




>>> Remote NDIS is the protocol used to send packets from the guest to the host.
>>> Every packet needs to be decorated with the RNDIS header, and the maximum room
>>> needed for the RNDIS header is the headroom we want.
>>
>> I think we get that.  The question is does the Remote NDIS header and
>> packet info actually need to be a part of the header data?  I would
>> argue that it probably doesn't.
>>
>> So for example in netvsc_start_xmit it looks like you are calling
>> init_page_array in order to populate a set of page buffers, but the
>> first buffer for the Remote NDIS protocol is populated as a separate
>> page and offset.  As such it doesn't seem like it necessarily needs to
>> be a part of the header data but could be maintained perhaps in a
>> separate ring buffer, or perhaps just be a separate page that you break
>> up to use for each header.
>
> You are right; the rndis header can be built as a separate fragment and sent.
> Indeed this is what we were doing earlier - on the outgoing path we would
> allocate memory for the rndis header. My goal was to avoid this allocation
> on every packet being sent and 

Re: [PATCH net-next v2] net: Initialize table in fib result

2015-09-16 Thread Florian Fainelli
On 16/09/15 09:16, David Ahern wrote:
> The root cause is use of res.table uninitialized.
> 
> Thanks to Nikolay for noticing the uninitialized use amongst the maze of
> gotos.
> 
> As Nikolay pointed out the second initialization is not required to fix
> the oops, but rather to fix a related problem where a valid lookup should
> be invalidated before creating the rth entry.
> 
> Fixes: b7503e0cdb5d ("net: Add FIB table id to rtable")
> Reported-by: Sergey Senozhatsky 
> Reported-by: Richard Alpe 
> Reported-by: Fabio Estevam 
> Tested-by: Fabio Estevam 
> Signed-off-by: David Ahern 

There are enough Tested-by tags, but thanks for fixing this!
-- 
Florian


RE: [PATCH net-next RFC] net: increase LL_MAX_HEADER for Hyper-V

2015-09-16 Thread KY Srinivasan


> -Original Message-
> From: Alexander Duyck [mailto:alexander.du...@gmail.com]
> Sent: Wednesday, September 16, 2015 2:39 PM
> To: KY Srinivasan ; David Laight
> ; Haiyang Zhang ;
> Vitaly Kuznetsov ; netdev@vger.kernel.org
> Cc: David S. Miller ; linux-ker...@vger.kernel.org;
> Jason Wang 
> Subject: Re: [PATCH net-next RFC] net: increase LL_MAX_HEADER for Hyper-V
> 
> On 09/16/2015 10:55 AM, KY Srinivasan wrote:
> >
> >> -Original Message-
> >> From: David Laight [mailto:david.lai...@aculab.com]
> >> Sent: Wednesday, September 16, 2015 9:25 AM
> >> To: Haiyang Zhang ; Vitaly Kuznetsov
> >> ; netdev@vger.kernel.org
> >> Cc: David S. Miller ; linux-ker...@vger.kernel.org;
> >> KY Srinivasan ; Jason Wang 
> >> Subject: RE: [PATCH net-next RFC] net: increase LL_MAX_HEADER for Hyper-
> >> V
> >>
> >> From: Haiyang Zhang
> >>> Sent: 16 September 2015 17:09
>  -Original Message-
>  From: Vitaly Kuznetsov [mailto:vkuzn...@redhat.com]
>  Sent: Wednesday, September 16, 2015 11:50 AM
>  To: netdev@vger.kernel.org
>  Cc: David S. Miller ; linux-
> >> ker...@vger.kernel.org;
>  KY Srinivasan ; Haiyang Zhang
>  ; Jason Wang 
>  Subject: [PATCH net-next RFC] net: increase LL_MAX_HEADER for Hyper-
> >> V
>  Commit b08cc79155fc26d0d112b1470d1ece5034651a4b ("hv_netvsc:
> >> Eliminate
>    memory allocation in the packet send path") introduced skb headroom
>  request for Hyper-V netvsc driver:
> 
>  max_needed_headroom = sizeof(struct hv_netvsc_packet) +
>  sizeof(struct rndis_message) +
>  NDIS_VLAN_PPI_SIZE + NDIS_CSUM_PPI_SIZE +
>  NDIS_LSO_PPI_SIZE + NDIS_HASH_PPI_SIZE;
>  ...
>  net->needed_headroom = max_needed_headroom;
> 
>  max_needed_headroom is 220 bytes, it significantly exceeds the
>  LL_MAX_HEADER setting. This causes each skb to be cloned on send
> >> path,
>  e.g. for IPv4 case we fall into the following clause
>  (ip_finish_output2()):
> 
>  if (unlikely(skb_headroom(skb) < hh_len && dev->header_ops)) {
>   ...
>   skb2 = skb_realloc_headroom(skb, LL_RESERVED_SPACE(dev));
>   ...
>  }
> 
>  leading to a significant performance regression. Increase
> >> LL_MAX_HEADER
>  to make it suitable for netvsc, make it 224 to be 16-aligned.
>  Alternatively we could (partially) revert the commit which introduced
>  skb
>  headroom request restoring manual memory allocation on transmit path.
> 
>  Signed-off-by: Vitaly Kuznetsov 
>  ---
>    include/linux/netdevice.h | 4 +++-
>    1 file changed, 3 insertions(+), 1 deletion(-)
> 
>  diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>  index 88a0069..7233790 100644
>  --- a/include/linux/netdevice.h
>  +++ b/include/linux/netdevice.h
>  @@ -132,7 +132,9 @@ static inline bool dev_xmit_complete(int rc)
> * used.
> */
> 
>  -#if defined(CONFIG_WLAN) || IS_ENABLED(CONFIG_AX25)
>  +#if IS_ENABLED(CONFIG_HYPERV_NET)
>  +# define LL_MAX_HEADER 224
>  +#elif defined(CONFIG_WLAN) || IS_ENABLED(CONFIG_AX25)
>    # if defined(CONFIG_MAC80211_MESH)
>    #  define LL_MAX_HEADER 128
>    # else
> >>> Thanks for the patch.
> >>> To avoid forgetting to update that 224 number when we add more things
> >>> into netvsc header, I suggest that we define a macro in netdevice.h such
> >>> as:
> >>> #define HVNETVSC_MAX_HEADER 224
> >>> #define LL_MAX_HEADER HVNETVSC_MAX_HEADER
> >>>
> >>> And, put a note in netvsc code saying the header reservation shouldn't
> >>> exceed HVNETVSC_MAX_HEADER, or you need to update
> >> HVNETVSC_MAX_HEADER.
> >>
> >> Am I right in thinking this is adding an extra 96 unused bytes to the front
> >> of almost all skb just so that hyper-v can make its link level header
> >> contiguous with whatever follows (IP header ?).
> >>
> >> Doesn't sound ideal.
> > Remote NDIS is the protocol used to send packets from the guest to the host.
> > Every packet needs to be decorated with the RNDIS header, and the maximum room
> > needed for the RNDIS header is the headroom we want.
> 
> I think we get that.  The question is does the Remote NDIS header and
> packet info actually need to be a part of the header data?  I would
> argue that it probably doesn't.
> 
> So for example in netvsc_start_xmit it looks like you are calling
> init_page_array in order to populate a set of page buffers, but the
> first buffer for the Remote NDIS 

RE: [PATCH net-next RFC] net: increase LL_MAX_HEADER for Hyper-V

2015-09-16 Thread KY Srinivasan


> -Original Message-
> From: Alexander Duyck [mailto:alexander.du...@gmail.com]
> Sent: Wednesday, September 16, 2015 4:49 PM
> To: KY Srinivasan ; David Laight
> ; Haiyang Zhang ;
> Vitaly Kuznetsov ; netdev@vger.kernel.org
> Cc: David S. Miller ; linux-ker...@vger.kernel.org;
> Jason Wang 
> Subject: Re: [PATCH net-next RFC] net: increase LL_MAX_HEADER for Hyper-V
> 
> On 09/16/2015 03:57 PM, KY Srinivasan wrote:
> >
> >> -Original Message-
> >> From: Alexander Duyck [mailto:alexander.du...@gmail.com]
> >> Sent: Wednesday, September 16, 2015 2:39 PM
> >> To: KY Srinivasan ; David Laight
> >> ; Haiyang Zhang ;
> >> Vitaly Kuznetsov ; netdev@vger.kernel.org
> >> Cc: David S. Miller ; linux-ker...@vger.kernel.org;
> >> Jason Wang 
> >> Subject: Re: [PATCH net-next RFC] net: increase LL_MAX_HEADER for Hyper-V
> >>
> >> On 09/16/2015 10:55 AM, KY Srinivasan wrote:
>  -Original Message-
>  From: David Laight [mailto:david.lai...@aculab.com]
>  Sent: Wednesday, September 16, 2015 9:25 AM
>  To: Haiyang Zhang ; Vitaly Kuznetsov
>  ; netdev@vger.kernel.org
>  Cc: David S. Miller ; linux-ker...@vger.kernel.org;
>  KY Srinivasan ; Jason Wang
>  Subject: RE: [PATCH net-next RFC] net: increase LL_MAX_HEADER for Hyper-V
> 
>  From: Haiyang Zhang
> > Sent: 16 September 2015 17:09
> >> -Original Message-
> >> From: Vitaly Kuznetsov [mailto:vkuzn...@redhat.com]
> >> Sent: Wednesday, September 16, 2015 11:50 AM
> >> To: netdev@vger.kernel.org
> >> Cc: David S. Miller ; linux-ker...@vger.kernel.org;
> >> KY Srinivasan ; Haiyang Zhang ; Jason Wang
> >> Subject: [PATCH net-next RFC] net: increase LL_MAX_HEADER for Hyper-V
> >>
> >> Commit b08cc79155fc26d0d112b1470d1ece5034651a4b ("hv_netvsc: Eliminate
> >> memory allocation in the packet send path") introduced skb headroom
> >> request for Hyper-V netvsc driver:
> >>
> >>  max_needed_headroom = sizeof(struct hv_netvsc_packet) +
> >>  sizeof(struct rndis_message) +
> >>  NDIS_VLAN_PPI_SIZE + NDIS_CSUM_PPI_SIZE +
> >>  NDIS_LSO_PPI_SIZE + NDIS_HASH_PPI_SIZE;
> >>  ...
> >>  net->needed_headroom = max_needed_headroom;
> >>
> >> max_needed_headroom is 220 bytes, which significantly exceeds the
> >> LL_MAX_HEADER setting. This causes each skb to be cloned on the send
> >> path, e.g. for the IPv4 case we fall into the following clause
> >> (ip_finish_output2()):
> >>
> >> if (unlikely(skb_headroom(skb) < hh_len && dev->header_ops)) {
> >>   ...
> >>   skb2 = skb_realloc_headroom(skb, LL_RESERVED_SPACE(dev));
> >>   ...
> >> }
> >>
> >> leading to a significant performance regression. Increase LL_MAX_HEADER
> >> to make it suitable for netvsc, and make it 224 so that it is 16-aligned.
> >> Alternatively, we could (partially) revert the commit which introduced
> >> the skb headroom request, restoring manual memory allocation on the
> >> transmit path.
> >> Signed-off-by: Vitaly Kuznetsov 
> >> ---
> >>  include/linux/netdevice.h | 4 +++-
> >>  1 file changed, 3 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> >> index 88a0069..7233790 100644
> >> --- a/include/linux/netdevice.h
> >> +++ b/include/linux/netdevice.h
> >> @@ -132,7 +132,9 @@ static inline bool dev_xmit_complete(int rc)
> >> *  used.
> >> */
> >>
> >> -#if defined(CONFIG_WLAN) || IS_ENABLED(CONFIG_AX25)
> >> +#if IS_ENABLED(CONFIG_HYPERV_NET)
> >> +# define LL_MAX_HEADER 224
> >> +#elif defined(CONFIG_WLAN) || IS_ENABLED(CONFIG_AX25)
> >>  # if defined(CONFIG_MAC80211_MESH)
> >>  #  define LL_MAX_HEADER 128
> >>  # else
> > Thanks for the patch.
> > To make sure we don't forget to update that 224 number when we add more things
> > into netvsc header, I suggest that we define a macro in netdevice.h such
> > as:
> > #define HVNETVSC_MAX_HEADER 224
> > #define LL_MAX_HEADER HVNETVSC_MAX_HEADER
> >
> > And, put a note in netvsc code saying the header reservation shouldn't
> > exceed HVNETVSC_MAX_HEADER, or you need to update HVNETVSC_MAX_HEADER.
> 
>  Am 
