Re: [PATCHv2 net-next] dropwatch: Support monitoring of dropped frames

2020-08-05 Thread Neil Horman
On Tue, Aug 04, 2020 at 04:14:14PM -0700, David Miller wrote:
> From: izabela.bakoll...@gmail.com
> Date: Tue,  4 Aug 2020 18:09:08 +0200
> 
> > @@ -1315,6 +1334,53 @@ static int net_dm_cmd_trace(struct sk_buff *skb,
> > return -EOPNOTSUPP;
> >  }
> >  
> > +static int net_dm_interface_start(struct net *net, const char *ifname)
> > +{
> > +   struct net_device *nd = dev_get_by_name(net, ifname);
> > +
> > +   if (nd)
> > +   interface = nd;
> > +   else
> > +   return -ENODEV;
> > +
> > +   return 0;
> > +}
> > +
> > +static int net_dm_interface_stop(struct net *net, const char *ifname)
> > +{
> > +   dev_put(interface);
> > +   interface = NULL;
> > +
> > +   return 0;
> > +}
> 
> Where is the netdev notifier that will drop this reference if the network
> device is unregistered?
> 
See the changes to dropmon_net_event in the patch.  It's there under the case for
NETDEV_UNREGISTER.
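For reference, a minimal sketch of what that case looks like (simplified;
it assumes the patch's global 'interface' pointer and elides whatever
locking the full patch uses around it):

    static int dropmon_net_event(struct notifier_block *ev_block,
                                 unsigned long event, void *ptr)
    {
            struct net_device *dev = netdev_notifier_info_to_dev(ptr);

            switch (event) {
            case NETDEV_UNREGISTER:
                    /* drop our reference before the device goes away */
                    if (interface == dev) {
                            dev_put(interface);
                            interface = NULL;
                    }
                    break;
            }
            return NOTIFY_DONE;
    }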

Neil


Re: [Linux-kernel-mentees] [PATCHv2 net-next] dropwatch: Support monitoring of dropped frames

2020-08-04 Thread Neil Horman
On Tue, Aug 04, 2020 at 02:28:28PM -0700, Cong Wang wrote:
> On Tue, Aug 4, 2020 at 9:14 AM  wrote:
> >
> > From: Izabela Bakollari 
> >
> > Dropwatch is a utility that monitors dropped frames by having userspace
> > record them over the dropwatch protocol to a file. This augmentation
> > allows live monitoring of dropped frames using tools like tcpdump.
> >
> > With this feature, dropwatch allows two additional commands (start and
> > stop interface) which allow the assignment of a net_device to the
> > dropwatch protocol. When assigned, dropwatch will clone dropped frames,
> > and receive them on the assigned interface, allowing tools like tcpdump
> > to monitor for them.
> >
> > To use this feature, create a dummy ethernet interface (ip link add dev
> > dummy0 type dummy), assign it to the dropwatch kernel subsystem using
> > these new commands, and then monitor dropped frames in real time by
> > running tcpdump -i dummy0.
> 
> drop monitor is already able to send dropped packets to user-space,
> and wireshark already catches up with this feature:
> 
> https://code.wireshark.org/review/gitweb?p=wireshark.git;a=commitdiff;h=a94a860c0644ec3b8a129fd243674a2e376ce1c8
> 
> So what you propose here seems pretty much a duplicate?
> 
I had asked Izabela to implement this feature as an alternative approach to
doing live capture of dropped packets, as part of the Linux Foundation
mentorship program.  I'm supportive of this additional feature, as the added code
is fairly minimal and allows for the use of other user space packet monitoring
tools without additional code changes (i.e. tcpdump/snort/etc can now monitor
dropped packets without the need to augment those tools with netlink capture
code).
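
For illustration, the delivery step amounts to cloning each dropped frame
and injecting the clone into the assigned device's receive path. A rough
sketch (hypothetical helper name; 'interface' is the patch's global
assignment):

    static void net_dm_deliver_copy(struct sk_buff *skb)
    {
            struct sk_buff *nskb;

            if (!interface)
                    return;

            nskb = skb_clone(skb, GFP_ATOMIC);
            if (!nskb)
                    return;

            /* make the clone appear on the assigned device (e.g. dummy0),
             * where tcpdump/snort/etc can capture it */
            nskb->dev = interface;
            netif_rx(nskb);
    }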

Best
Neil 
> Thanks.
> 



Re: [PATCH] sctp: Fix sk_buff leak when receiving a datagram

2020-06-13 Thread Neil Horman
On Sat, Jun 13, 2020 at 08:39:25PM +0800, Xiyu Yang wrote:
> In sctp_skb_recv_datagram(), the function fetches a sk_buff object from
> the receive queue into "skb" by calling skb_peek() or __skb_dequeue()
> and returns it to the caller.
> 
> However, when __skb_dequeue() succeeds, the function forgets to take a
> reference count on the "skb" object before returning it, causing a
> potential memory leak in the caller function.
> 
> Fix this issue by calling refcount_inc after __skb_dequeue() has
> successfully executed.
> 
> Signed-off-by: Xiyu Yang 
> Signed-off-by: Xin Tan 
> ---
>  net/sctp/socket.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/net/sctp/socket.c b/net/sctp/socket.c
> index d57e1a002ffc..4c8f0b83efd0 100644
> --- a/net/sctp/socket.c
> +++ b/net/sctp/socket.c
> @@ -8990,6 +8990,8 @@ struct sk_buff *sctp_skb_recv_datagram(struct sock *sk, 
> int flags,
>   refcount_inc(>users);
>   } else {
>   skb = __skb_dequeue(>sk_receive_queue);
> + if (skb)
> + refcount_inc(>users);
For completeness, you should probably use skb_get here, rather than refcount_inc
directly.

Also, I'm not entirely sure I see how a memory leak can happen here.  We take an
extra reference in the skb_peek clause of this code area because, if we return an
skb that continues to exist on the sk_receive_queue list, we legitimately have
two users for the skb (the user who called sctp_skb_recv_datagram(...,MSG_PEEK),
and the potential next caller who will actually dequeue the skb).

In the else clause however, that condition doesn't exist.  The user count for
the skb should already be 1 (if the caller is the only user of the skb), or more
than 1 (if one or more callers have gotten a reference to the message using
MSG_PEEK).

I don't think this code is needed, and in fact it will actually cause memory leaks,
because there's no subsequent skb_unref() call to drop the reference you are adding
here.
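
To make the two paths concrete, here is a simplified sketch of the function
being discussed (locking elided; skb_get() is the suggested spelling of the
open-coded refcount_inc()):

    /* simplified from net/sctp/socket.c:sctp_skb_recv_datagram() */
    if (flags & MSG_PEEK) {
            skb = skb_peek(&sk->sk_receive_queue);
            if (skb)
                    skb_get(skb);   /* skb stays queued: two legitimate users */
    } else {
            /* the caller becomes the sole owner of the unlinked skb; an
             * extra reference here with no later skb_unref() would itself
             * leak the skb */
            skb = __skb_dequeue(&sk->sk_receive_queue);
    }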

Neil

>   }
>  
>   if (skb)
> -- 
> 2.7.4
> 
> 


Re: [PATCH] sctp: Replace zero-length array with flexible-array

2020-05-07 Thread Neil Horman
> - struct sctp_fwdtsn_skip skip[0];
> + struct sctp_fwdtsn_skip skip[];
>  };
>  
>  struct sctp_fwdtsn_chunk {
> @@ -611,7 +611,7 @@ struct sctp_ifwdtsn_skip {
>  
>  struct sctp_ifwdtsn_hdr {
>   __be32 new_cum_tsn;
> - struct sctp_ifwdtsn_skip skip[0];
> + struct sctp_ifwdtsn_skip skip[];
>  };
>  
>  struct sctp_ifwdtsn_chunk {
> @@ -658,7 +658,7 @@ struct sctp_addip_param {
>  
>  struct sctp_addiphdr {
>   __be32  serial;
> - __u8params[0];
> + __u8params[];
>  };
>  
>  struct sctp_addip_chunk {
> @@ -718,7 +718,7 @@ struct sctp_addip_chunk {
>  struct sctp_authhdr {
>   __be16 shkey_id;
>   __be16 hmac_id;
> - __u8   hmac[0];
> + __u8   hmac[];
>  };
>  
>  struct sctp_auth_chunk {
> @@ -733,7 +733,7 @@ struct sctp_infox {
>  
>  struct sctp_reconf_chunk {
>   struct sctp_chunkhdr chunk_hdr;
> -     __u8 params[0];
> + __u8 params[];
>  };
>  
>  struct sctp_strreset_outreq {
> @@ -741,13 +741,13 @@ struct sctp_strreset_outreq {
>   __be32 request_seq;
>   __be32 response_seq;
>   __be32 send_reset_at_tsn;
> - __be16 list_of_streams[0];
> + __be16 list_of_streams[];
>  };
>  
>  struct sctp_strreset_inreq {
>   struct sctp_paramhdr param_hdr;
>   __be32 request_seq;
> - __be16 list_of_streams[0];
> + __be16 list_of_streams[];
>  };
>  
>  struct sctp_strreset_tsnreq {
> 
> 
Acked-by: Neil Horman 


Re: [PATCH 07/15] drop_monitor: work around gcc-10 stringop-overflow warning

2020-05-01 Thread Neil Horman
On Thu, Apr 30, 2020 at 11:30:49PM +0200, Arnd Bergmann wrote:
> The current gcc-10 snapshot produces a false-positive warning:
> 
> net/core/drop_monitor.c: In function 'trace_drop_common.constprop':
> cc1: error: writing 8 bytes into a region of size 0 
> [-Werror=stringop-overflow=]
> In file included from net/core/drop_monitor.c:23:
> include/uapi/linux/net_dropmon.h:36:8: note: at offset 0 to object 'entries' 
> with size 4 declared here
>36 |  __u32 entries;
>   |^~~
> 
> I reported this in the gcc bugzilla, but in case it does not get
> fixed in the release, work around it by using a temporary variable.
> 
> Fixes: 9a8afc8d3962 ("Network Drop Monitor: Adding drop monitor 
> implementation & Netlink protocol")
> Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94881
> Signed-off-by: Arnd Bergmann 
> ---
>  net/core/drop_monitor.c | 11 +++
>  1 file changed, 7 insertions(+), 4 deletions(-)
> 
> diff --git a/net/core/drop_monitor.c b/net/core/drop_monitor.c
> index 8e33cec9fc4e..2ee7bc4c9e03 100644
> --- a/net/core/drop_monitor.c
> +++ b/net/core/drop_monitor.c
> @@ -213,6 +213,7 @@ static void sched_send_work(struct timer_list *t)
>  static void trace_drop_common(struct sk_buff *skb, void *location)
>  {
>   struct net_dm_alert_msg *msg;
> + struct net_dm_drop_point *point;
>   struct nlmsghdr *nlh;
>   struct nlattr *nla;
>   int i;
> @@ -231,11 +232,13 @@ static void trace_drop_common(struct sk_buff *skb, void 
> *location)
>   nlh = (struct nlmsghdr *)dskb->data;
>   nla = genlmsg_data(nlmsg_data(nlh));
>   msg = nla_data(nla);
> + point = msg->points;
>   for (i = 0; i < msg->entries; i++) {
> > - if (!memcmp(&location, msg->points[i].pc, sizeof(void *))) {
> > - msg->points[i].count++;
> > + if (!memcmp(&location, &point->pc, sizeof(void *))) {
> + point->count++;
>   goto out;
>   }
> + point++;
>   }
>   if (msg->entries == dm_hit_limit)
>   goto out;
> @@ -244,8 +247,8 @@ static void trace_drop_common(struct sk_buff *skb, void 
> *location)
>*/
>   __nla_reserve_nohdr(dskb, sizeof(struct net_dm_drop_point));
>   nla->nla_len += NLA_ALIGN(sizeof(struct net_dm_drop_point));
> > - memcpy(msg->points[msg->entries].pc, &location, sizeof(void *));
> > - msg->points[msg->entries].count = 1;
> > + memcpy(point->pc, &location, sizeof(void *));
> + point->count = 1;
>   msg->entries++;
>  
> >   if (!timer_pending(&data->send_timer)) {
Acked-by: Neil Horman 


Re: [PATCH 1/4] net: sctp: Rename fallthrough label to unhandled

2019-10-11 Thread Neil Horman
On Sat, Oct 05, 2019 at 09:46:41AM -0700, Joe Perches wrote:
> fallthrough may become a pseudo reserved keyword so this only use of
> fallthrough is better renamed to allow it.
> 
> Signed-off-by: Joe Perches 
> ---
>  net/sctp/sm_make_chunk.c | 12 ++--
>  1 file changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/net/sctp/sm_make_chunk.c b/net/sctp/sm_make_chunk.c
> index e41ed2e0ae7d..48d63956a68c 100644
> --- a/net/sctp/sm_make_chunk.c
> +++ b/net/sctp/sm_make_chunk.c
> @@ -2155,7 +2155,7 @@ static enum sctp_ierror sctp_verify_param(struct net 
> *net,
>   case SCTP_PARAM_SET_PRIMARY:
>   if (ep->asconf_enable)
>   break;
> - goto fallthrough;
> + goto unhandled;
>  
>   case SCTP_PARAM_HOST_NAME_ADDRESS:
>   /* Tell the peer, we won't support this param.  */
> @@ -2166,11 +2166,11 @@ static enum sctp_ierror sctp_verify_param(struct net 
> *net,
>   case SCTP_PARAM_FWD_TSN_SUPPORT:
>   if (ep->prsctp_enable)
>   break;
> - goto fallthrough;
> + goto unhandled;
>  
>   case SCTP_PARAM_RANDOM:
>   if (!ep->auth_enable)
> - goto fallthrough;
> + goto unhandled;
>  
>   /* SCTP-AUTH: Secion 6.1
>* If the random number is not 32 byte long the association
> @@ -2187,7 +2187,7 @@ static enum sctp_ierror sctp_verify_param(struct net 
> *net,
>  
>   case SCTP_PARAM_CHUNKS:
>   if (!ep->auth_enable)
> - goto fallthrough;
> + goto unhandled;
>  
>   /* SCTP-AUTH: Section 3.2
>* The CHUNKS parameter MUST be included once in the INIT or
> @@ -2203,7 +2203,7 @@ static enum sctp_ierror sctp_verify_param(struct net 
> *net,
>  
>   case SCTP_PARAM_HMAC_ALGO:
>   if (!ep->auth_enable)
> - goto fallthrough;
> + goto unhandled;
>  
>   hmacs = (struct sctp_hmac_algo_param *)param.p;
>   n_elt = (ntohs(param.p->length) -
> @@ -2226,7 +2226,7 @@ static enum sctp_ierror sctp_verify_param(struct net 
> *net,
>   retval = SCTP_IERROR_ABORT;
>   }
>   break;
> -fallthrough:
> +unhandled:
>   default:
>   pr_debug("%s: unrecognized param:%d for chunk:%d\n",
>__func__, ntohs(param.p->type), cid);
> -- 
> 2.15.0
> 
> 
I'm still not a fan of the pseudo keyword fallthrough, but I don't have
a problem in renaming the label, so

Acked-by: Neil Horman 
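
For context, the rename matters because the proposed pseudo-keyword is a
preprocessor macro: any label spelled fallthrough would be expanded into the
attribute and no longer compile. A minimal compilable sketch (assuming
gcc >= 7 for the attribute; the function itself is hypothetical):

    #define fallthrough __attribute__((__fallthrough__))

    static int check_param(int type, int auth_enable)
    {
            switch (type) {
            case 1:
                    if (!auth_enable)
                            goto unhandled; /* was: goto fallthrough */
                    auth_enable = 2;
                    fallthrough;            /* the new pseudo-keyword */
            case 2:
                    return auth_enable;
            default:
    unhandled:
                    return -1;
            }
    }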



Re: [PATCH] lib/generic-radix-tree.c: add kmemleak annotations

2019-10-04 Thread Neil Horman
On Thu, Oct 03, 2019 at 11:50:39PM -0700, Eric Biggers wrote:
> From: Eric Biggers 
> 
> Kmemleak is falsely reporting a leak of the slab allocation in
> sctp_stream_init_ext():
> 
> BUG: memory leak
> unreferenced object 0x8881114f5d80 (size 96):
>comm "syz-executor934", pid 7160, jiffies 4294993058 (age 31.950s)
>hex dump (first 32 bytes):
>  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
>  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
>backtrace:
>  [<ce7a1326>] kmemleak_alloc_recursive  
> include/linux/kmemleak.h:55 [inline]
>  [<ce7a1326>] slab_post_alloc_hook mm/slab.h:439 [inline]
>  [<ce7a1326>] slab_alloc mm/slab.c:3326 [inline]
>  [<ce7a1326>] kmem_cache_alloc_trace+0x13d/0x280 mm/slab.c:3553
>  [<7abb7ac9>] kmalloc include/linux/slab.h:547 [inline]
>  [<7abb7ac9>] kzalloc include/linux/slab.h:742 [inline]
>  [<7abb7ac9>] sctp_stream_init_ext+0x2b/0xa0  
> net/sctp/stream.c:157
>  [<48ecb9c1>] sctp_sendmsg_to_asoc+0x946/0xa00  
> net/sctp/socket.c:1882
>  [<4483ca2b>] sctp_sendmsg+0x2a8/0x990 net/sctp/socket.c:2102
>  [...]
> 
> But it's freed later.  Kmemleak misses the allocation because its
> pointer is stored in the generic radix tree sctp_stream::out, and the
> generic radix tree uses raw pages which aren't tracked by kmemleak.
> 
> Fix this by adding the kmemleak hooks to the generic radix tree code.
> 
> Reported-by: syzbot+7f3b6b106be8dcdcd...@syzkaller.appspotmail.com
> Signed-off-by: Eric Biggers 
> ---
>  lib/generic-radix-tree.c | 32 ++--
>  1 file changed, 26 insertions(+), 6 deletions(-)
> 
> diff --git a/lib/generic-radix-tree.c b/lib/generic-radix-tree.c
> index ae25e2fa2187..f25eb111c051 100644
> --- a/lib/generic-radix-tree.c
> +++ b/lib/generic-radix-tree.c
> @@ -2,6 +2,7 @@
>  #include <linux/export.h>
>  #include <linux/generic-radix-tree.h>
>  #include <linux/gfp.h>
> +#include <linux/kmemleak.h>
>  
>  #define GENRADIX_ARY (PAGE_SIZE / sizeof(struct genradix_node *))
>  #define GENRADIX_ARY_SHIFT   ilog2(GENRADIX_ARY)
> @@ -75,6 +76,27 @@ void *__genradix_ptr(struct __genradix *radix, size_t 
> offset)
>  }
>  EXPORT_SYMBOL(__genradix_ptr);
>  
> +static inline struct genradix_node *genradix_alloc_node(gfp_t gfp_mask)
> +{
> + struct genradix_node *node;
> +
> + node = (struct genradix_node *)__get_free_page(gfp_mask|__GFP_ZERO);
> +
> + /*
> +  * We're using pages (not slab allocations) directly for kernel data
> +  * structures, so we need to explicitly inform kmemleak of them in order
> +  * to avoid false positive memory leak reports.
> +  */
> + kmemleak_alloc(node, PAGE_SIZE, 1, gfp_mask);
> + return node;
> +}
> +
> +static inline void genradix_free_node(struct genradix_node *node)
> +{
> + kmemleak_free(node);
> + free_page((unsigned long)node);
> +}
> +
>  /*
>   * Returns pointer to the specified byte @offset within @radix, allocating 
> it if
>   * necessary - newly allocated slots are always zeroed out:
> @@ -97,8 +119,7 @@ void *__genradix_ptr_alloc(struct __genradix *radix, 
> size_t offset,
>   break;
>  
>   if (!new_node) {
> - new_node = (void *)
> - __get_free_page(gfp_mask|__GFP_ZERO);
> + new_node = genradix_alloc_node(gfp_mask);
>   if (!new_node)
>   return NULL;
>   }
> @@ -121,8 +142,7 @@ void *__genradix_ptr_alloc(struct __genradix *radix, 
> size_t offset,
>   n = READ_ONCE(*p);
>   if (!n) {
>   if (!new_node) {
> - new_node = (void *)
> - __get_free_page(gfp_mask|__GFP_ZERO);
> + new_node = genradix_alloc_node(gfp_mask);
>   if (!new_node)
>   return NULL;
>   }
> @@ -133,7 +153,7 @@ void *__genradix_ptr_alloc(struct __genradix *radix, 
> size_t offset,
>   }
>  
>   if (new_node)
> - free_page((unsigned long) new_node);
> + genradix_free_node(new_node);
>  
>   return &n->data[offset];
>  }
> @@ -191,7 +211,7 @@ static void genradix_free_recurse(struct genradix_node 
> *n, unsigned level)
>   genradix_free_recurse(n->children[i], level - 
> 1);
>   }
>  
> - free_page((unsigned long) n);
> + genradix_free_node(n);
>  }
>  
>  int __genradix_prealloc(struct __genradix *radix, size_t size,
> -- 
> 2.23.0
> 
> 
Acked-by: Neil Horman 



Re: [PATCH net 0/2] fix memory leak for sctp_do_bind

2019-09-10 Thread Neil Horman
On Tue, Sep 10, 2019 at 03:13:41PM +0800, Mao Wenan wrote:
> First patch is to do cleanup, remove redundant assignment,
> second patch is to fix memory leak for sctp_do_bind if failed
> to bind address.
> 
> Mao Wenan (2):
>   sctp: remove redundant assignment when call sctp_get_port_local
>   sctp: destroy bucket if failed to bind addr
> 
>  net/sctp/socket.c | 13 +++--
>  1 file changed, 7 insertions(+), 6 deletions(-)
> 
> -- 
> 2.20.1
> 
> 
Series
Acked-by: Neil Horman 



[tip: x86/apic] x86/apic/vector: Warn when vector space exhaustion breaks affinity

2019-08-28 Thread tip-bot2 for Neil Horman
The following commit has been merged into the x86/apic branch of tip:

Commit-ID: 743dac494d61d991967ebcfab92e4f80dc7583b3
Gitweb:
https://git.kernel.org/tip/743dac494d61d991967ebcfab92e4f80dc7583b3
Author: Neil Horman 
AuthorDate: Thu, 22 Aug 2019 10:34:21 -04:00
Committer: Thomas Gleixner 
CommitterDate: Wed, 28 Aug 2019 14:44:08 +02:00

x86/apic/vector: Warn when vector space exhaustion breaks affinity

On x86, CPUs are limited in the number of interrupts they can have affined
to them as they only support 256 interrupt vectors per CPU. 32 vectors are
reserved for the CPU and the kernel reserves another 22 for internal
purposes. That leaves 202 vectors for assignment to devices.

When an interrupt is set up or the affinity is changed by the kernel or the
administrator, the vector assignment code attempts to honor the requested
affinity mask. If the vector space on the CPUs in that affinity mask is
exhausted the code falls back to a wider set of CPUs and assigns a vector
on a CPU outside of the requested affinity mask silently.

While the effective affinity is reflected in the corresponding
/proc/irq/$N/effective_affinity* files, the silent breakage of the requested
affinity can lead to unexpected behaviour for administrators.

Add a pr_warn() when this happens so that administrators get at least
informed about it in the syslog.

[ tglx: Massaged changelog and made the pr_warn() more informative ]

Reported-by: dju...@redhat.com
Signed-off-by: Neil Horman 
Signed-off-by: Thomas Gleixner 
Tested-by: dju...@redhat.com
Link: https://lkml.kernel.org/r/20190822143421.9535-1-nhor...@tuxdriver.com
---
 arch/x86/kernel/apic/vector.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index fdacb86..2c5676b 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -398,6 +398,17 @@ static int activate_reserved(struct irq_data *irqd)
if (!irqd_can_reserve(irqd))
apicd->can_reserve = false;
}
+
+   /*
+* Check to ensure that the effective affinity mask is a subset of
+* the user supplied affinity mask, and warn the user if it is not
+*/
+   if (!cpumask_subset(irq_data_get_effective_affinity_mask(irqd),
+   irq_data_get_affinity_mask(irqd))) {
+   pr_warn("irq %u: Affinity broken due to vector space exhaustion.\n",
+   irqd->irq);
+   }
+
return ret;
 }
 


[PATCH V2] x86: Add irq spillover warning

2019-08-22 Thread Neil Horman
On Intel hardware, cpus are limited in the number of irqs they can
have affined to them (currently 240), based on section 10.5.2 of:
https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-3a-part-1-manual.pdf

assign_irq_vector_any_locked() will attempt to honor the affinity
request, but if the 240-vector limitation documented above is crossed, a
new mask will silently be selected that is potentially outside the
requested cpu set.  This can lead to unexpected behavior for administrators.

Mitigate this problem by checking the affinity mask after it's been
assigned in activate_reserved() so that administrators get a logged warning
about the change.

Tested successfully by the reporter.

Change Notes:
V1->V2)
* Moved the check for this condition to activate_reserved from
do_IRQ, taking it out of the hot path (requested by t...@lintronix.de)

Signed-off-by: Neil Horman 
Reported-by: dju...@redhat.com
CC: Thomas Gleixner 
CC: Ingo Molnar 
CC: Borislav Petkov 
CC: "H. Peter Anvin" 
CC: x...@kernel.org
---
 arch/x86/kernel/apic/vector.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index fdacb864c3dd..b8ed0406d41f 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -398,6 +398,16 @@ static int activate_reserved(struct irq_data *irqd)
if (!irqd_can_reserve(irqd))
apicd->can_reserve = false;
}
+
+   /*
+* Check to ensure that the effective affinity mask is a subset of
+* the user supplied affinity mask, and warn the user if it is not
+*/
+   if (!cpumask_subset(irq_data_get_effective_affinity_mask(irqd),
+irq_data_get_affinity_mask(irqd)))
+   pr_warn("irq %d has been assigned to a cpu outside of its user affinity mask\n",
+   irqd->irq);
+
return ret;
 }
 
-- 
2.21.0



Re: [PATCH] net: sctp: Rename fallthrough label to unhandled

2019-08-04 Thread Neil Horman
On Fri, Aug 02, 2019 at 04:19:32PM -0700, David Miller wrote:
> From: Joe Perches 
> Date: Fri, 02 Aug 2019 10:47:34 -0700
> 
> > On Wed, 2019-07-31 at 08:16 -0400, Neil Horman wrote:
> >> On Wed, Jul 31, 2019 at 04:32:43AM -0700, Joe Perches wrote:
> >> > On Wed, 2019-07-31 at 07:19 -0400, Neil Horman wrote:
> >> > > On Tue, Jul 30, 2019 at 10:04:37PM -0700, Joe Perches wrote:
> >> > > > fallthrough may become a pseudo reserved keyword so this only use of
> >> > > > fallthrough is better renamed to allow it.
> > 
> > Can you or any other maintainer apply this patch
> > or ack it so David Miller can apply it?
> 
> I, like others, don't like the lack of __ in the keyword.  It's kind of
> ridiculous the problems it creates to pollute the global namespace like
> that, and yes, also inconsistent with other shorthands for builtins.
> 
FWIW, I acked the sctp patch, because the use of the word fallthrough as a
label isn't that important to me; unhandled is just as good, so I'm ok with
that change.

But, as I stated in the other thread, I agree: making a macro out of fallthrough
without clearly naming it using a macro convention like __ is not something I'm
ok with.
Neil



Re: [PATCH] net: sctp: Rename fallthrough label to unhandled

2019-08-02 Thread Neil Horman
On Tue, Jul 30, 2019 at 10:04:37PM -0700, Joe Perches wrote:
> fallthrough may become a pseudo reserved keyword so this only use of
> fallthrough is better renamed to allow it.
> 
> Signed-off-by: Joe Perches 
> ---
>  net/sctp/sm_make_chunk.c | 12 ++--
>  1 file changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/net/sctp/sm_make_chunk.c b/net/sctp/sm_make_chunk.c
> index 36bd8a6e82df..3fdcaa2fbf12 100644
> --- a/net/sctp/sm_make_chunk.c
> +++ b/net/sctp/sm_make_chunk.c
> @@ -2152,7 +2152,7 @@ static enum sctp_ierror sctp_verify_param(struct net 
> *net,
>   case SCTP_PARAM_SET_PRIMARY:
>   if (net->sctp.addip_enable)
>   break;
> - goto fallthrough;
> + goto unhandled;
>  
>   case SCTP_PARAM_HOST_NAME_ADDRESS:
>   /* Tell the peer, we won't support this param.  */
> @@ -2163,11 +2163,11 @@ static enum sctp_ierror sctp_verify_param(struct net 
> *net,
>   case SCTP_PARAM_FWD_TSN_SUPPORT:
>   if (ep->prsctp_enable)
>   break;
> - goto fallthrough;
> + goto unhandled;
>  
>   case SCTP_PARAM_RANDOM:
>   if (!ep->auth_enable)
> - goto fallthrough;
> + goto unhandled;
>  
>   /* SCTP-AUTH: Secion 6.1
>* If the random number is not 32 byte long the association
> @@ -2184,7 +2184,7 @@ static enum sctp_ierror sctp_verify_param(struct net 
> *net,
>  
>   case SCTP_PARAM_CHUNKS:
>   if (!ep->auth_enable)
> - goto fallthrough;
> + goto unhandled;
>  
>   /* SCTP-AUTH: Section 3.2
>* The CHUNKS parameter MUST be included once in the INIT or
> @@ -2200,7 +2200,7 @@ static enum sctp_ierror sctp_verify_param(struct net 
> *net,
>  
>   case SCTP_PARAM_HMAC_ALGO:
>   if (!ep->auth_enable)
> - goto fallthrough;
> + goto unhandled;
>  
>   hmacs = (struct sctp_hmac_algo_param *)param.p;
>   n_elt = (ntohs(param.p->length) -
> @@ -2223,7 +2223,7 @@ static enum sctp_ierror sctp_verify_param(struct net 
> *net,
>   retval = SCTP_IERROR_ABORT;
>   }
>   break;
> -fallthrough:
> +unhandled:
>   default:
>   pr_debug("%s: unrecognized param:%d for chunk:%d\n",
>__func__, ntohs(param.p->type), cid);
> -- 
> 2.15.0
> 
> 
Yeah, it seems reasonable to me, though I'm still not comfortable with defining
fallthrough as a macrotized keyword, but thats a debate for another thread

Acked-by: Neil Horman 



Re: [RFC PATCH] compiler_attributes.h: Add 'fallthrough' pseudo keyword for switch/case use

2019-08-02 Thread Neil Horman
On Thu, Aug 01, 2019 at 10:26:29PM +0200, Miguel Ojeda wrote:
> On Thu, Aug 1, 2019 at 10:10 PM  wrote:
> >
> > I'm not disagreeing... I think using a macro makes sense.
> 
> It is either a macro or waiting for 5+ years (while we keep using the
> comment style) :-)
> 
> In case it helps to make one's mind about whether to go for it or not,
> I summarized the advantages and a few other details in the patch I
> sent in October:
> 
>   
> https://github.com/ojeda/linux/commit/668f011a2706ea555987e263f609a5deba9c7fc4
> 
> It would be nice, however, to discuss whether we want __fallthrough or
> fallthrough. The former is consistent with the rest of compiler
> attributes and makes it clear it is not a keyword, the latter is
> consistent with "break", "goto" and "return", as Joe's patch explains.
> 
I was having this conversation with Joe, and I agree, I like the idea of
macroing up the fall through attribute, but naming it __fallthrough seems more
consistent to me with the other attribute macros.  I also feel like it's more
recognizable as a macro.  Naming it fallthrough just makes it look like someone
forgot to put /**/'s around it to me.

Neil

> Cheers,
> Miguel
> 


Re: [PATCH] net: sctp: Rename fallthrough label to unhandled

2019-08-01 Thread Neil Horman
On Thu, Aug 01, 2019 at 10:42:31AM -0700, Joe Perches wrote:
> On Thu, 2019-08-01 at 06:50 -0400, Neil Horman wrote:
> > On Wed, Jul 31, 2019 at 03:23:46PM -0700, Joe Perches wrote:
> []
> > You can say that if you want, but you made the point that you think the
> > macro as you have written is more readable.  Your example illustrates
> > though that /* fallthrough */ is a pretty common comment, and not
> > prefixing it makes it look like someone didn't add a comment that they
> > meant to.  The __ prefix is standard practice for defining macros to
> > attributes (212 instances of it by my count).  I don't mind rewriting
> > the goto labels at all, but I think consistency is valuable.
> 
> Hey Neil.
> 
> Perhaps you want to make this argument on the RFC patch thread
> that introduces the fallthrough pseudo-keyword.
> 
> https://lore.kernel.org/patchwork/patch/1108577/
> 
> 
> 
Sure, but it will have to wait for tomorrow at this point, as I need to run to
an appointment.

Best
Neil



Re: [PATCH] net: sctp: Rename fallthrough label to unhandled

2019-08-01 Thread Neil Horman
On Wed, Jul 31, 2019 at 03:23:46PM -0700, Joe Perches wrote:
> On Wed, 2019-07-31 at 16:58 -0400, Neil Horman wrote:
> > On Wed, Jul 31, 2019 at 09:35:31AM -0700, Joe Perches wrote:
> > > On Wed, 2019-07-31 at 08:16 -0400, Neil Horman wrote:
> > > > On Wed, Jul 31, 2019 at 04:32:43AM -0700, Joe Perches wrote:
> > > > > On Wed, 2019-07-31 at 07:19 -0400, Neil Horman wrote:
> > > > > > On Tue, Jul 30, 2019 at 10:04:37PM -0700, Joe Perches wrote:
> > > > > > > fallthrough may become a pseudo reserved keyword so this only use 
> > > > > > > of
> > > > > > > fallthrough is better renamed to allow it.
> > > > > > > 
> > > > > > > Signed-off-by: Joe Perches 
> > > > > > Are you referring to the __attribute__((fallthrough)) statement 
> > > > > > that gcc
> > > > > > supports?  If so the compiler should by all rights be able to 
> > > > > > differentiate
> > > > > > between a null statement attribute and an explicit goto and label 
> > > > > > without the
> > > > > > need for renaming here.  Or are you referring to something else?
> > > > > 
> > > > > Hi.
> > > > > 
> > > > > I sent after this a patch that adds
> > > > > 
> > > > > # define fallthrough __attribute__((__fallthrough__))
> > > > > 
> > > > > https://lore.kernel.org/patchwork/patch/1108577/
> > > > > 
> > > > > So this rename is a prerequisite to adding this #define.
> > > > > 
> > > > why not just define __fallthrough instead, like we do for all the other
> > > > attributes we alias (i.e. __read_mostly, __protected_by, __unused, 
> > > > __exception,
> > > > etc)
> > > 
> > > Because it's not as intelligible when used as a statement.
> > I think that's somewhat debatable.  __fallthrough to me looks like an 
> > internal
> > macro, whereas fallthrough looks like a comment someone forgot to /* */
> 
> 
> I'd rather see:
> 
>   switch (foo) {
>   case FOO:
>   bar |= baz;
>   fallthrough;
>   case BAR:
>   bar |= qux;
>   break;
>   default:
>   error();
>   }
> 
> than
> 
>   switch (foo) {
>   case FOO:
>   bar |= baz;
>   __fallthrough;
>   case BAR:
>   bar |= qux;
>   break;
>   default:
>   error();
>   }
> 
> or esoecially
> 
>   switch (foo) {
>   case FOO:
>   bar |= baz;
>   /* fallthrough */;
>   case BAR:
>   bar |= qux;
>   break;
>   default:
>   error();
>   }
> 
> but , bikeshed ahoy!...
You can say that if you want, but you made the point that you think the macro
as you have written is more readable.  Your example illustrates though that /*
fallthrough */ is a pretty common comment, and not prefixing it makes it look
like someone didn't add a comment that they meant to.  The __ prefix is standard
practice for defining macros to attributes (212 instances of it by my count).  I
don't mind rewriting the goto labels at all, but I think consistency is
valuable.

Neil

> 
> 
> 


Re: [PATCH] net: sctp: Rename fallthrough label to unhandled

2019-07-31 Thread Neil Horman
On Wed, Jul 31, 2019 at 09:35:31AM -0700, Joe Perches wrote:
> On Wed, 2019-07-31 at 08:16 -0400, Neil Horman wrote:
> > On Wed, Jul 31, 2019 at 04:32:43AM -0700, Joe Perches wrote:
> > > On Wed, 2019-07-31 at 07:19 -0400, Neil Horman wrote:
> > > > On Tue, Jul 30, 2019 at 10:04:37PM -0700, Joe Perches wrote:
> > > > > fallthrough may become a pseudo reserved keyword so this only use of
> > > > > fallthrough is better renamed to allow it.
> > > > > 
> > > > > Signed-off-by: Joe Perches 
> > > > Are you referring to the __attribute__((fallthrough)) statement that gcc
> > > > supports?  If so the compiler should by all rights be able to 
> > > > differentiate
> > > > between a null statement attribute and an explicit goto and label 
> > > > without the
> > > > need for renaming here.  Or are you referring to something else?
> > > 
> > > Hi.
> > > 
> > > I sent after this a patch that adds
> > > 
> > > # define fallthrough __attribute__((__fallthrough__))
> > > 
> > > https://lore.kernel.org/patchwork/patch/1108577/
> > > 
> > > So this rename is a prerequisite to adding this #define.
> > > 
> > why not just define __fallthrough instead, like we do for all the other
> > attributes we alias (i.e. __read_mostly, __protected_by, __unused, 
> > __exception,
> > etc)
> 
> Because it's not as intelligible when used as a statement.
I think that's somewhat debatable.  __fallthrough to me looks like an internal
macro, whereas fallthrough looks like a comment someone forgot to /* */

Neil

> 
> 
> 
> 


Re: [PATCH] net: sctp: Rename fallthrough label to unhandled

2019-07-31 Thread Neil Horman
On Wed, Jul 31, 2019 at 04:32:43AM -0700, Joe Perches wrote:
> On Wed, 2019-07-31 at 07:19 -0400, Neil Horman wrote:
> > On Tue, Jul 30, 2019 at 10:04:37PM -0700, Joe Perches wrote:
> > > fallthrough may become a pseudo reserved keyword so this only use of
> > > fallthrough is better renamed to allow it.
> > > 
> > > Signed-off-by: Joe Perches 
> > Are you referring to the __attribute__((fallthrough)) statement that gcc
> > supports?  If so the compiler should by all rights be able to differentiate
> > between a null statement attribute and an explicit goto and label without the
> > need for renaming here.  Or are you referring to something else?
> 
> Hi.
> 
> I sent after this a patch that adds
> 
> # define fallthrough __attribute__((__fallthrough__))
> 
> https://lore.kernel.org/patchwork/patch/1108577/
> 
> So this rename is a prerequisite to adding this #define.
> 
why not just define __fallthrough instead, like we do for all the other
attributes we alias (i.e. __read_mostly, __protected_by, __unused, __exception,
etc)

Neil

> > > diff --git a/net/sctp/sm_make_chunk.c b/net/sctp/sm_make_chunk.c
> []
> > > @@ -2152,7 +2152,7 @@ static enum sctp_ierror sctp_verify_param(struct 
> > > net *net,
> > >   case SCTP_PARAM_SET_PRIMARY:
> > >   if (net->sctp.addip_enable)
> > >   break;
> > > - goto fallthrough;
> > > + goto unhandled;
> 
> etc...
> 
> 
> 


Re: [PATCH] net: sctp: Rename fallthrough label to unhandled

2019-07-31 Thread Neil Horman
On Tue, Jul 30, 2019 at 10:04:37PM -0700, Joe Perches wrote:
> fallthrough may become a pseudo reserved keyword so this only use of
> fallthrough is better renamed to allow it.
> 
> Signed-off-by: Joe Perches 
Are you referring to the __attribute__((fallthrough)) statement that gcc
supports?  If so the compiler should by all rights be able to differentiate
between a null statement attribute and an explicit goto and label without the
need for renaming here.  Or are you referring to something else?

Neil

> ---
>  net/sctp/sm_make_chunk.c | 12 ++--
>  1 file changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/net/sctp/sm_make_chunk.c b/net/sctp/sm_make_chunk.c
> index 36bd8a6e82df..3fdcaa2fbf12 100644
> --- a/net/sctp/sm_make_chunk.c
> +++ b/net/sctp/sm_make_chunk.c
> @@ -2152,7 +2152,7 @@ static enum sctp_ierror sctp_verify_param(struct net 
> *net,
>   case SCTP_PARAM_SET_PRIMARY:
>   if (net->sctp.addip_enable)
>   break;
> - goto fallthrough;
> + goto unhandled;
>  
>   case SCTP_PARAM_HOST_NAME_ADDRESS:
>   /* Tell the peer, we won't support this param.  */
> @@ -2163,11 +2163,11 @@ static enum sctp_ierror sctp_verify_param(struct net 
> *net,
>   case SCTP_PARAM_FWD_TSN_SUPPORT:
>   if (ep->prsctp_enable)
>   break;
> - goto fallthrough;
> + goto unhandled;
>  
>   case SCTP_PARAM_RANDOM:
>   if (!ep->auth_enable)
> - goto fallthrough;
> + goto unhandled;
>  
>   /* SCTP-AUTH: Secion 6.1
>* If the random number is not 32 byte long the association
> @@ -2184,7 +2184,7 @@ static enum sctp_ierror sctp_verify_param(struct net 
> *net,
>  
>   case SCTP_PARAM_CHUNKS:
>   if (!ep->auth_enable)
> - goto fallthrough;
> + goto unhandled;
>  
>   /* SCTP-AUTH: Section 3.2
>* The CHUNKS parameter MUST be included once in the INIT or
> @@ -2200,7 +2200,7 @@ static enum sctp_ierror sctp_verify_param(struct net 
> *net,
>  
>   case SCTP_PARAM_HMAC_ALGO:
>   if (!ep->auth_enable)
> - goto fallthrough;
> + goto unhandled;
>  
>   hmacs = (struct sctp_hmac_algo_param *)param.p;
>   n_elt = (ntohs(param.p->length) -
> @@ -2223,7 +2223,7 @@ static enum sctp_ierror sctp_verify_param(struct net 
> *net,
>   retval = SCTP_IERROR_ABORT;
>   }
>   break;
> -fallthrough:
> +unhandled:
>   default:
>   pr_debug("%s: unrecognized param:%d for chunk:%d\n",
>__func__, ntohs(param.p->type), cid);
> -- 
> 2.15.0
> 
> 


Re: [PATCH] x86: Add irq spillover warning

2019-07-16 Thread Neil Horman
On Tue, Jul 16, 2019 at 09:05:30PM +0200, Thomas Gleixner wrote:
> Neil,
> 
> On Tue, 16 Jul 2019, Neil Horman wrote:
> > On Tue, Jul 16, 2019 at 05:57:31PM +0200, Thomas Gleixner wrote:
> > > On Tue, 16 Jul 2019, Neil Horman wrote:
> > > > If a cpu has more than this number of interrupts affined to it, they
> > > > will spill over to other cpus, which potentially may be outside of their
> > > > affinity mask.
> > > 
> > > Spill over?
> > > 
> > > The kernel decides to pick a vector on a CPU outside of the affinity when
> > > it runs out of vectors on the CPUs in the affinity mask.
> > > 
> > Yes.
> > 
> > > Please explain issues technically correct.
> > > 
> > I don't know what you mean by this.  I explained it above, and you clearly
> > understood it.
> 
> It took me a while to grok it. Simply because I first thought it's some
> hardware issue. And of course after confusion settled I knew what it is,
> but just because I know that code like the back of my hand.
> 
> > > > Given that this might cause unexpected behavior on
> > > > performance sensitive systems, warn the user should this condition occur
> > > > so that corrective action can be taken
> > > 
> > > > @@ -244,6 +244,14 @@ __visible unsigned int __irq_entry do_IRQ(struct 
> > > > pt_regs *regs)
> > > 
> > > Why on earth warn in the interrupt delivery hotpath? Just because it's the
> > > place which really needs extra instructions and extra cache lines on
> > > performance sensitive systems, right?
> > > 
Because there's already a check of the same variety in do_IRQ, but if the
information is available outside the hotpath, I was unaware, and am happy to
update this patch to reflect that.
> 
> Which check are you referring to?
> 
This one:

    if (desc != VECTOR_RETRIGGERED) {
            pr_emerg_ratelimited("%s: %d.%d No irq handler for vector\n",
                                 __func__, smp_processor_id(),
                                 vector);

I figured it was already checking one condition, another wouldn't hurt too much,
but no worries, I'm redoing this in activate_reserved now.

Best
Neil

> Thanks,
> 
>   tglx
> 


Re: [PATCH] x86: Add irq spillover warning

2019-07-16 Thread Neil Horman
On Tue, Jul 16, 2019 at 05:57:31PM +0200, Thomas Gleixner wrote:
> Neil,
> 
> On Tue, 16 Jul 2019, Neil Horman wrote:
> 
> > On Intel hardware, cpus are limited in the number of irqs they can
> > have affined to them (currently 240), based on section 10.5.2 of:
> > https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-3a-part-1-manual.pdf
> 
> That reference is really not useful to explain the problem and the number
> of vectors is neither. Please explain the conceptual issue.
>  
You seem to have already done that below.  Not really sure what more you are
asking for here.

> > If a cpu has more than this number of interrupts affined to it, they
> > will spill over to other cpus, which potentially may be outside of their
> > affinity mask.
> 
> Spill over?
> 
> The kernel decides to pick a vector on a CPU outside of the affinity when
> it runs out of vectors on the CPUs in the affinity mask.
> 
Yes.

> Please explain issues technically correct.
> 
I don't know what you mean by this.  I explained it above, and you clearly
understood it.

> > Given that this might cause unexpected behavior on
> > performance sensitive systems, warn the user should this condition occur
> > so that corrective action can be taken
> 
> > @@ -244,6 +244,14 @@ __visible unsigned int __irq_entry do_IRQ(struct 
> > pt_regs *regs)
> 
> Why on earth warn in the interrupt delivery hotpath? Just because it's the
> place which really needs extra instructions and extra cache lines on
> performance sensitive systems, right?
> 
Because there's already a check of the same variety in do_IRQ, but if the
information is available outside the hotpath, I was unaware, and am happy to
update this patch to reflect that.

> The fact that the kernel ran out of vectors for the CPUs in the affinity
> mask is already known when the vector is allocated in activate_reserved().
> 
> So there is an obvious place to put such a warning and it's certainly not
> do_IRQ().
> 
Sure

Thanks
Neil

> Thanks,
> 
>   tglx
> 


[PATCH] x86: Add irq spillover warning

2019-07-16 Thread Neil Horman
On Intel hardware, cpus are limited in the number of irqs they can
have affined to them (currently 240), based on section 10.5.2 of:
https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-3a-part-1-manual.pdf

If a cpu has more than this number of interrupts affined to it, they
will spill over to other cpus, which potentially may be outside of their
affinity mask.  Given that this might cause unexpected behavior on
performance sensitive systems, warn the user should this condition occur
so that corrective action can be taken

Signed-off-by: Neil Horman 
Reported-by: dju...@redhat.com
CC: Thomas Gleixner 
CC: Ingo Molnar 
CC: Borislav Petkov 
CC: "H. Peter Anvin" 
CC: x...@kernel.org
---
 arch/x86/kernel/irq.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index 9b68b5b00ac9..ac7ed32de3d5 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -244,6 +244,14 @@ __visible unsigned int __irq_entry do_IRQ(struct pt_regs 
*regs)
 
desc = __this_cpu_read(vector_irq[vector]);
 
+   /*
+* Intel processors are limited in the number of irqs they can address.
+* If we affine too many irqs to a given cpu, they can silently spill to
+* another cpu outside of their affinity mask. Warn the user when this occurs
+*/
+   if (unlikely(!cpumask_test_cpu(smp_processor_id(), &desc->irq_common_data.affinity)))
+   pr_emerg_ratelimited("%s: %d.%d handled outside of affinity mask\n");
+
if (!handle_irq(desc, regs)) {
ack_APIC_irq();
 
-- 
2.21.0



Re: [PATCH] net: sctp: fix warning "NULL check before some freeing functions is not needed"

2019-07-16 Thread Neil Horman
On Tue, Jul 16, 2019 at 07:50:02AM +0530, Hariprasad Kelam wrote:
> This patch removes NULL checks before calling kfree.
> 
> fixes below issues reported by coccicheck
> net/sctp/sm_make_chunk.c:2586:3-8: WARNING: NULL check before some
> freeing functions is not needed.
> net/sctp/sm_make_chunk.c:2652:3-8: WARNING: NULL check before some
> freeing functions is not needed.
> net/sctp/sm_make_chunk.c:2667:3-8: WARNING: NULL check before some
> freeing functions is not needed.
> net/sctp/sm_make_chunk.c:2684:3-8: WARNING: NULL check before some
> freeing functions is not needed.
> 
> Signed-off-by: Hariprasad Kelam 
> ---
>  net/sctp/sm_make_chunk.c | 12 
>  1 file changed, 4 insertions(+), 8 deletions(-)
> 
> diff --git a/net/sctp/sm_make_chunk.c b/net/sctp/sm_make_chunk.c
> index ed39396..36bd8a6e 100644
> --- a/net/sctp/sm_make_chunk.c
> +++ b/net/sctp/sm_make_chunk.c
> @@ -2582,8 +2582,7 @@ static int sctp_process_param(struct sctp_association 
> *asoc,
>   case SCTP_PARAM_STATE_COOKIE:
>   asoc->peer.cookie_len =
>   ntohs(param.p->length) - sizeof(struct sctp_paramhdr);
> - if (asoc->peer.cookie)
> - kfree(asoc->peer.cookie);
> + kfree(asoc->peer.cookie);
>   asoc->peer.cookie = kmemdup(param.cookie->body, 
> asoc->peer.cookie_len, gfp);
>   if (!asoc->peer.cookie)
>   retval = 0;
> @@ -2648,8 +2647,7 @@ static int sctp_process_param(struct sctp_association 
> *asoc,
>   goto fall_through;
>  
>   /* Save peer's random parameter */
> - if (asoc->peer.peer_random)
> - kfree(asoc->peer.peer_random);
> + kfree(asoc->peer.peer_random);
>   asoc->peer.peer_random = kmemdup(param.p,
>   ntohs(param.p->length), gfp);
>   if (!asoc->peer.peer_random) {
> @@ -2663,8 +2661,7 @@ static int sctp_process_param(struct sctp_association 
> *asoc,
>   goto fall_through;
>  
>   /* Save peer's HMAC list */
> - if (asoc->peer.peer_hmacs)
> - kfree(asoc->peer.peer_hmacs);
> + kfree(asoc->peer.peer_hmacs);
>   asoc->peer.peer_hmacs = kmemdup(param.p,
>   ntohs(param.p->length), gfp);
>   if (!asoc->peer.peer_hmacs) {
> @@ -2680,8 +2677,7 @@ static int sctp_process_param(struct sctp_association 
> *asoc,
>   if (!ep->auth_enable)
>   goto fall_through;
>  
> - if (asoc->peer.peer_chunks)
> - kfree(asoc->peer.peer_chunks);
> +     kfree(asoc->peer.peer_chunks);
>   asoc->peer.peer_chunks = kmemdup(param.p,
>   ntohs(param.p->length), gfp);
>   if (!asoc->peer.peer_chunks)
> -- 
> 2.7.4
> 
> 

Acked-by: Neil Horman 
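
For reference, the cleanup is safe because kfree(NULL) is defined to be a
no-op, so each guarded call reduces to the bare call:

    /* before */
    if (asoc->peer.cookie)
            kfree(asoc->peer.cookie);

    /* after: kfree() already tolerates NULL */
    kfree(asoc->peer.cookie);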


Re: [PATCH 19/87] i2c: busses: Remove call to memset after dmam_alloc_coherent

2019-06-28 Thread Neil Horman
On Fri, Jun 28, 2019 at 01:36:51AM +0800, Fuqian Huang wrote:
> In commit af7ddd8a627c
> ("Merge tag 'dma-mapping-4.21' of 
> git://git.infradead.org/users/hch/dma-mapping"),
> dmam_alloc_coherent has already zeroed the memory.
> So memset is not needed.
> 
> Signed-off-by: Fuqian Huang 
> ---
>  drivers/i2c/busses/i2c-ismt.c | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/drivers/i2c/busses/i2c-ismt.c b/drivers/i2c/busses/i2c-ismt.c
> index 02d23edb2fb1..2f95e25a10f7 100644
> --- a/drivers/i2c/busses/i2c-ismt.c
> +++ b/drivers/i2c/busses/i2c-ismt.c
> @@ -781,8 +781,6 @@ static int ismt_dev_init(struct ismt_priv *priv)
>   if (!priv->hw)
>   return -ENOMEM;
>  
> - memset(priv->hw, 0, (ISMT_DESC_ENTRIES * sizeof(struct ismt_desc)));
> -
>   priv->head = 0;
>   init_completion(&priv->cmp);
>  
> -- 
> 2.11.0
> 
> 
Acked-by: Neil Horman 
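
For reference, after the dma-mapping change cited above, dmam_alloc_coherent()
returns zeroed memory, so the allocation site reduces to something like the
following (sketch; the dma handle field name is assumed):

    priv->hw = dmam_alloc_coherent(&pdev->dev,
                                   ISMT_DESC_ENTRIES * sizeof(struct ismt_desc),
                                   &priv->io_rng_dma, GFP_KERNEL);
    if (!priv->hw)
            return -ENOMEM;
    /* no memset() needed: the buffer is already zero-filled */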


Re: [PATCH] sctp: Add rcu lock to protect dst entry in sctp_transport_route

2019-06-13 Thread Neil Horman
On Thu, Jun 13, 2019 at 10:37:51AM +0800, Su Yanjun wrote:
> 
> On 2019/6/12 21:13, Neil Horman wrote:
> > On Tue, Jun 11, 2019 at 10:33:17AM +0800, Su Yanjun wrote:
> > > On 2019/6/10 19:12, Neil Horman wrote:
> > > > On Mon, Jun 10, 2019 at 11:20:00AM +0800, Su Yanjun wrote:
> > > > > syzbot found a crash in rt_cache_valid. Problem is that when more
> > > > > threads release dst in sctp_transport_route, the route cache can
> > > > > be freed.
> > > > > 
> > > > > As follows,
> > > > > p1:
> > > > > sctp_transport_route
> > > > > dst_release
> > > > > get_dst
> > > > > 
> > > > > p2:
> > > > > sctp_transport_route
> > > > > dst_release
> > > > > get_dst
> > > > > ...
> > > > > 
> > > > > If enough threads call dst_release, dst->refcnt will reach 0;
> > > > > the rcu softirq will then reclaim the dst entry, and get_dst will
> > > > > use the freed memory.
> > > > > 
> > > > > This patch adds rcu lock to protect the dst_entry here.
> > > > > 
> > > > > Fixes: 6e91b578bf3f("sctp: re-use sctp_transport_pmtu in
> > > > sctp_transport_route")
> > > > > Signed-off-by: Su Yanjun 
> > > > > Reported-by: syzbot+a9e23ea2aa21044c2...@syzkaller.appspotmail.com
> > > > > ---
> > > > >net/sctp/transport.c | 5 +
> > > > >1 file changed, 5 insertions(+)
> > > > > 
> > > > > diff --git a/net/sctp/transport.c b/net/sctp/transport.c
> > > > > index ad158d3..5ad7e20 100644
> > > > > --- a/net/sctp/transport.c
> > > > > +++ b/net/sctp/transport.c
> > > > > @@ -308,8 +308,13 @@ void sctp_transport_route(struct sctp_transport *transport,
> > > > >   struct sctp_association *asoc = transport->asoc;
> > > > >   struct sctp_af *af = transport->af_specific;
> > > > > + /* When dst entry is being released, route cache may be referred
> > > > > +  * again. Add rcu lock here to protect dst entry.
> > > > > +  */
> > > > > + rcu_read_lock();
> > > > >   sctp_transport_dst_release(transport);
> > > > >   af->get_dst(transport, saddr, &transport->fl, sctp_opt2sk(opt));
> > > > > + rcu_read_unlock();
> > > > What is the exact error that syzbot reported?  This doesn't seem like it
> > > > fixes
> > > BUG: KASAN: slab-out-of-bounds in rt_cache_valid+0x158/0x190
> > > net/ipv4/route.c:1556
> > > Read of size 2 at addr 8880654f3ac7 by task syz-executor.0/26603
> > > 
> > > CPU: 0 PID: 26603 Comm: syz-executor.0 Not tainted 5.2.0-rc2+ #9
> > > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> > > Google 01/01/2011
> > > Call Trace:
> > >   __dump_stack lib/dump_stack.c:77 [inline]
> > >   dump_stack+0x172/0x1f0 lib/dump_stack.c:113
> > >   print_address_description.cold+0x7c/0x20d mm/kasan/report.c:188
> > >   __kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317
> > >   kasan_report+0x12/0x20 mm/kasan/common.c:614
> > >   __asan_report_load2_noabort+0x14/0x20 mm/kasan/generic_report.c:130
> > >   rt_cache_valid+0x158/0x190 net/ipv4/route.c:1556
> > >   __mkroute_output net/ipv4/route.c:2332 [inline]
> > >   ip_route_output_key_hash_rcu+0x819/0x2d50 net/ipv4/route.c:2564
> > >   ip_route_output_key_hash+0x1ef/0x360 net/ipv4/route.c:2393
> > >   __ip_route_output_key include/net/route.h:125 [inline]
> > >   ip_route_output_flow+0x28/0xc0 net/ipv4/route.c:2651
> > >   ip_route_output_key include/net/route.h:135 [inline]
> > >   sctp_v4_get_dst+0x467/0x1260 net/sctp/protocol.c:435
> > >   sctp_transport_route+0x12d/0x360 net/sctp/transport.c:297
> > >   sctp_assoc_add_peer+0x53e/0xfc0 net/sctp/associola.c:663
> > >   sctp_process_param net/sctp/sm_make_chunk.c:2531 [inline]
> > >   sctp_process_init+0x2491/0x2b10 net/sctp/sm_make_chunk.c:2344
> > >   sctp_cmd_process_init net/sctp/sm_sideeffect.c:667 [inline]
> > >   sctp_cmd_interpreter net/sctp/sm_sideeffect.c:1369 [inline]
> > >   sctp_side_effects net/sctp/sm_sideeffect.c:1179 [inline]
> > >   sctp_do_sm+0x3a30/0x50e0 net/sctp/sm_sideeffect.c:1150
> > >   sctp_assoc_bh_rcv+0x343/0x660 net/sctp/associola.c:1059

Re: [PATCH] sctp: Add rcu lock to protect dst entry in sctp_transport_route

2019-06-12 Thread Neil Horman
On Tue, Jun 11, 2019 at 10:33:17AM +0800, Su Yanjun wrote:
> 
> On 2019/6/10 19:12, Neil Horman wrote:
> > On Mon, Jun 10, 2019 at 11:20:00AM +0800, Su Yanjun wrote:
> > > syzbot found a crash in rt_cache_valid. Problem is that when more
> > > threads release dst in sctp_transport_route, the route cache can
> > > be freed.
> > > 
> > > As follows,
> > > p1:
> > > sctp_transport_route
> > >dst_release
> > >get_dst
> > > 
> > > p2:
> > > sctp_transport_route
> > >dst_release
> > >get_dst
> > > ...
> > > 
> > > If enough threads call dst_release, dst->refcnt will reach 0;
> > > the rcu softirq will then reclaim the dst entry, and get_dst will
> > > use the freed memory.
> > > 
> > > This patch adds rcu lock to protect the dst_entry here.
> > > 
> > > Fixes: 6e91b578bf3f("sctp: re-use sctp_transport_pmtu in
> > sctp_transport_route")
> > > Signed-off-by: Su Yanjun 
> > > Reported-by: syzbot+a9e23ea2aa21044c2...@syzkaller.appspotmail.com
> > > ---
> > >   net/sctp/transport.c | 5 +
> > >   1 file changed, 5 insertions(+)
> > > 
> > > diff --git a/net/sctp/transport.c b/net/sctp/transport.c
> > > index ad158d3..5ad7e20 100644
> > > --- a/net/sctp/transport.c
> > > +++ b/net/sctp/transport.c
> > > @@ -308,8 +308,13 @@ void sctp_transport_route(struct sctp_transport *transport,
> > >   struct sctp_association *asoc = transport->asoc;
> > >   struct sctp_af *af = transport->af_specific;
> > > + /* When dst entry is being released, route cache may be referred
> > > +  * again. Add rcu lock here to protect dst entry.
> > > +  */
> > > + rcu_read_lock();
> > >   sctp_transport_dst_release(transport);
> > >   af->get_dst(transport, saddr, &transport->fl, sctp_opt2sk(opt));
> > > + rcu_read_unlock();
> > What is the exact error that syzbot reported?  This doesn't seem like it
> > fixes
> BUG: KASAN: slab-out-of-bounds in rt_cache_valid+0x158/0x190
> net/ipv4/route.c:1556
> Read of size 2 at addr 8880654f3ac7 by task syz-executor.0/26603
> 
> CPU: 0 PID: 26603 Comm: syz-executor.0 Not tainted 5.2.0-rc2+ #9
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:77 [inline]
>  dump_stack+0x172/0x1f0 lib/dump_stack.c:113
>  print_address_description.cold+0x7c/0x20d mm/kasan/report.c:188
>  __kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317
>  kasan_report+0x12/0x20 mm/kasan/common.c:614
>  __asan_report_load2_noabort+0x14/0x20 mm/kasan/generic_report.c:130
>  rt_cache_valid+0x158/0x190 net/ipv4/route.c:1556
>  __mkroute_output net/ipv4/route.c:2332 [inline]
>  ip_route_output_key_hash_rcu+0x819/0x2d50 net/ipv4/route.c:2564
>  ip_route_output_key_hash+0x1ef/0x360 net/ipv4/route.c:2393
>  __ip_route_output_key include/net/route.h:125 [inline]
>  ip_route_output_flow+0x28/0xc0 net/ipv4/route.c:2651
>  ip_route_output_key include/net/route.h:135 [inline]
>  sctp_v4_get_dst+0x467/0x1260 net/sctp/protocol.c:435
>  sctp_transport_route+0x12d/0x360 net/sctp/transport.c:297
>  sctp_assoc_add_peer+0x53e/0xfc0 net/sctp/associola.c:663
>  sctp_process_param net/sctp/sm_make_chunk.c:2531 [inline]
>  sctp_process_init+0x2491/0x2b10 net/sctp/sm_make_chunk.c:2344
>  sctp_cmd_process_init net/sctp/sm_sideeffect.c:667 [inline]
>  sctp_cmd_interpreter net/sctp/sm_sideeffect.c:1369 [inline]
>  sctp_side_effects net/sctp/sm_sideeffect.c:1179 [inline]
>  sctp_do_sm+0x3a30/0x50e0 net/sctp/sm_sideeffect.c:1150
>  sctp_assoc_bh_rcv+0x343/0x660 net/sctp/associola.c:1059
>  sctp_inq_push+0x1e4/0x280 net/sctp/inqueue.c:80
>  sctp_backlog_rcv+0x196/0xbe0 net/sctp/input.c:339
>  sk_backlog_rcv include/net/sock.h:945 [inline]
>  __release_sock+0x129/0x390 net/core/sock.c:2412
>  release_sock+0x59/0x1c0 net/core/sock.c:2928
>  sctp_wait_for_connect+0x316/0x540 net/sctp/socket.c:9039
>  __sctp_connect+0xab2/0xcd0 net/sctp/socket.c:1226
>  __sctp_setsockopt_connectx+0x133/0x1a0 net/sctp/socket.c:1334
>  sctp_setsockopt_connectx_old net/sctp/socket.c:1350 [inline]
>  sctp_setsockopt net/sctp/socket.c:4644 [inline]
>  sctp_setsockopt+0x22c0/0x6d10 net/sctp/socket.c:4608
>  compat_sock_common_setsockopt+0x106/0x140 net/core/sock.c:3137
>  __compat_sys_setsockopt+0x185/0x380 net/compat.c:383
>  __do_compat_sys_setsockopt net/compat.c:396 [inline]
>  __se_compat_sys_setsockopt net/compat.c:393 [inline]
>  __ia32_compat_sys_setsockopt+0xbd/0x150

Re: [PATCH] sctp: Add rcu lock to protect dst entry in sctp_transport_route

2019-06-10 Thread Neil Horman
On Mon, Jun 10, 2019 at 11:20:00AM +0800, Su Yanjun wrote:
> syzbot found a crash in rt_cache_valid. Problem is that when more
> threads release dst in sctp_transport_route, the route cache can
> be freed.
> 
> As follows,
> p1:
> sctp_transport_route
>   dst_release
>   get_dst
> 
> p2:
> sctp_transport_route
>   dst_release
>   get_dst
> ...
> 
> If enough threads call dst_release, dst->refcnt will reach 0;
> the rcu softirq will then reclaim the dst entry, and get_dst will
> use the freed memory.
> 
> This patch adds rcu lock to protect the dst_entry here.
> 
> Fixes: 6e91b578bf3f("sctp: re-use sctp_transport_pmtu in 
> sctp_transport_route")
> Signed-off-by: Su Yanjun 
> Reported-by: syzbot+a9e23ea2aa21044c2...@syzkaller.appspotmail.com
> ---
>  net/sctp/transport.c | 5 +
>  1 file changed, 5 insertions(+)
> 
> diff --git a/net/sctp/transport.c b/net/sctp/transport.c
> index ad158d3..5ad7e20 100644
> --- a/net/sctp/transport.c
> +++ b/net/sctp/transport.c
> @@ -308,8 +308,13 @@ void sctp_transport_route(struct sctp_transport *transport,
>   struct sctp_association *asoc = transport->asoc;
>   struct sctp_af *af = transport->af_specific;
>  
> + /* When dst entry is being released, route cache may be referred
> +  * again. Add rcu lock here to protect dst entry.
> +  */
> + rcu_read_lock();
>   sctp_transport_dst_release(transport);
>   af->get_dst(transport, saddr, &transport->fl, sctp_opt2sk(opt));
> + rcu_read_unlock();
>  
What is the exact error that syzbot reported?  This doesn't seem like it fixes
anything.  Based on what you've said above, we have multiple processes looking
up and releasing routes in parallel (which IIRC should never happen, as only one
process should traverse the sctp state machine for a given association at any
one time).  Protecting the lookup/release operations with a read side rcu lock
won't fix that.  

Neil

>   if (saddr)
>   memcpy(&transport->saddr, saddr, sizeof(union sctp_addr));
> -- 
> 2.7.4
> 
> 
> 
> 


Re: [PATCH v3] coredump: Split pipe command whitespace before expanding template

2019-06-06 Thread Neil Horman
> + *argv = kmalloc_array(argvs, sizeof(**argv), GFP_KERNEL);
> + if (!(*argv))
> + return -ENOMEM;
> + (*argv)[(*argc)++] = 0;
>   ++pat_ptr;
> + }
>  
>   /* Repeat as long as we have more pattern to process and more output
>  space */
>   while (*pat_ptr) {
> + /*
> +  * Split on spaces before doing template expansion so that
> +  * %e and %E don't get split if they have spaces in them
> +  */
> + if (ispipe) {
> + if (isspace(*pat_ptr)) {
> + was_space = true;
> + pat_ptr++;
> + continue;
> + } else if (was_space) {
> + was_space = false;
> + err = cn_printf(cn, "%c", '\0');
> + if (err)
> + return err;
> + (*argv)[(*argc)++] = cn->used;
> + }
> + }
>   if (*pat_ptr != '%') {
>   err = cn_printf(cn, "%c", *pat_ptr++);
>   } else {
> @@ -546,6 +572,8 @@ void do_coredump(const kernel_siginfo_t *siginfo)
>   struct cred *cred;
>   int retval = 0;
>   int ispipe;
> + size_t *argv = NULL;
> + int argc = 0;
>   struct files_struct *displaced;
>   /* require nonrelative corefile path and be extra careful */
>   bool need_suid_safe = false;
> @@ -592,9 +620,10 @@ void do_coredump(const kernel_siginfo_t *siginfo)
>  
>   old_cred = override_creds(cred);
>  
> - ispipe = format_corename(&cn, &cprm);
> + ispipe = format_corename(&cn, &cprm, &argv, &argc);
>  
>   if (ispipe) {
> + int argi;
>   int dump_count;
>   char **helper_argv;
>   struct subprocess_info *sub_info;
> @@ -637,12 +666,16 @@ void do_coredump(const kernel_siginfo_t *siginfo)
>   goto fail_dropcount;
>   }
>  
> - helper_argv = argv_split(GFP_KERNEL, cn.corename, NULL);
> + helper_argv = kmalloc_array(argc + 1, sizeof(*helper_argv),
> + GFP_KERNEL);
>   if (!helper_argv) {
>   printk(KERN_WARNING "%s failed to allocate memory\n",
>  __func__);
>   goto fail_dropcount;
>   }
> + for (argi = 0; argi < argc; argi++)
> + helper_argv[argi] = cn.corename + argv[argi];
> + helper_argv[argi] = NULL;
>  
>   retval = -ENOMEM;
>   sub_info = call_usermodehelper_setup(helper_argv[0],
> @@ -652,7 +685,7 @@ void do_coredump(const kernel_siginfo_t *siginfo)
>   retval = call_usermodehelper_exec(sub_info,
>     UMH_WAIT_EXEC);
>  
> - argv_free(helper_argv);
> + kfree(helper_argv);
>   if (retval) {
>   printk(KERN_INFO "Core dump to |%s pipe failed\n",
>  cn.corename);
> @@ -766,6 +799,7 @@ void do_coredump(const kernel_siginfo_t *siginfo)
>   if (ispipe)
>   atomic_dec(&core_dump_count);
>  fail_unlock:
> + kfree(argv);
>   kfree(cn.corename);
>   coredump_finish(mm, core_dumped);
>   revert_creds(old_cred);
> -- 
> 2.20.1
> 
> 
Acked-by: Neil Horman 
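
The offset-based argv above sidesteps the fact that cn.corename can be
krealloc()ed (and therefore move) while arguments are still being collected,
which is why offsets rather than pointers are stored until the very end.  A
standalone userspace sketch of the same trick (buffer contents hypothetical):

        #include <stddef.h>
        #include <stdio.h>

        int main(void)
        {
                /* corename as format_corename() might leave it: one
                 * NUL-terminated string per argument.  Note "my prog": a
                 * %e with a space in it survives because splitting
                 * happened before template expansion. */
                char corename[] = "/bin/dumper\0my prog\0" "1234";
                size_t argv_off[] = { 0, 12, 20 }; /* offsets, not pointers */
                char *helper_argv[4];
                int i;

                for (i = 0; i < 3; i++)
                        helper_argv[i] = corename + argv_off[i];
                helper_argv[i] = NULL;

                for (i = 0; helper_argv[i]; i++)
                        printf("argv[%d] = \"%s\"\n", i, helper_argv[i]);
                return 0;
        }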



Re: [PATCH net-next] net: Drop unlikely before IS_ERR(_OR_NULL)

2019-06-05 Thread Neil Horman
On Wed, Jun 05, 2019 at 10:24:26PM +0800, Kefeng Wang wrote:
> IS_ERR(_OR_NULL) already contains an 'unlikely' compiler hint,
> so there is no need to add it again at the call sites. Drop it.
> 
> Cc: "David S. Miller" 
> Cc: Alexey Kuznetsov 
> Cc: Hideaki YOSHIFUJI 
> Cc: Vlad Yasevich 
> Cc: Neil Horman 
> Cc: Marcelo Ricardo Leitner 
> Cc: net...@vger.kernel.org
> Cc: linux-s...@vger.kernel.org
> Signed-off-by: Kefeng Wang 
> ---
>  include/net/udp.h   | 2 +-
>  net/ipv4/fib_semantics.c| 2 +-
>  net/ipv4/inet_hashtables.c  | 2 +-
>  net/ipv4/udp.c  | 2 +-
>  net/ipv4/udp_offload.c  | 2 +-
>  net/ipv6/inet6_hashtables.c | 2 +-
>  net/ipv6/udp.c  | 2 +-
>  net/sctp/socket.c   | 4 ++--
>  8 files changed, 9 insertions(+), 9 deletions(-)
> 
> diff --git a/include/net/udp.h b/include/net/udp.h
> index 79d141d2103b..bad74f780831 100644
> --- a/include/net/udp.h
> +++ b/include/net/udp.h
> @@ -480,7 +480,7 @@ static inline struct sk_buff *udp_rcv_segment(struct sock 
> *sk,
>* CB fragment
>*/
>   segs = __skb_gso_segment(skb, features, false);
> - if (unlikely(IS_ERR_OR_NULL(segs))) {
> + if (IS_ERR_OR_NULL(segs)) {
>   int segs_nr = skb_shinfo(skb)->gso_segs;
>  
>   atomic_add(segs_nr, &sk->sk_drops);
> diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
> index b80410673915..cd35bd0a4d8a 100644
> --- a/net/ipv4/fib_semantics.c
> +++ b/net/ipv4/fib_semantics.c
> @@ -1295,7 +1295,7 @@ struct fib_info *fib_create_info(struct fib_config *cfg,
>   goto failure;
>   fi->fib_metrics = ip_fib_metrics_init(fi->fib_net, cfg->fc_mx,
> cfg->fc_mx_len, extack);
> - if (unlikely(IS_ERR(fi->fib_metrics))) {
> + if (IS_ERR(fi->fib_metrics)) {
>   err = PTR_ERR(fi->fib_metrics);
>   kfree(fi);
>   return ERR_PTR(err);
> diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> index c4503073248b..97824864e40d 100644
> --- a/net/ipv4/inet_hashtables.c
> +++ b/net/ipv4/inet_hashtables.c
> @@ -316,7 +316,7 @@ struct sock *__inet_lookup_listener(struct net *net,
>   saddr, sport, htonl(INADDR_ANY), hnum,
>   dif, sdif);
>  done:
> - if (unlikely(IS_ERR(result)))
> + if (IS_ERR(result))
>   return NULL;
>   return result;
>  }
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index 189144346cd4..8983afe2fe9e 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -478,7 +478,7 @@ struct sock *__udp4_lib_lookup(struct net *net, __be32 
> saddr,
> htonl(INADDR_ANY), hnum, dif, sdif,
> exact_dif, hslot2, skb);
>   }
> - if (unlikely(IS_ERR(result)))
> + if (IS_ERR(result))
>   return NULL;
>   return result;
>  }
> diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
> index 06b3e2c1fcdc..0112f64faf69 100644
> --- a/net/ipv4/udp_offload.c
> +++ b/net/ipv4/udp_offload.c
> @@ -208,7 +208,7 @@ struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
>   gso_skb->destructor = NULL;
>  
>   segs = skb_segment(gso_skb, features);
> - if (unlikely(IS_ERR_OR_NULL(segs))) {
> + if (IS_ERR_OR_NULL(segs)) {
>   if (copy_dtor)
>   gso_skb->destructor = sock_wfree;
>   return segs;
> diff --git a/net/ipv6/inet6_hashtables.c b/net/ipv6/inet6_hashtables.c
> index b2a55f300318..cf60fae9533b 100644
> --- a/net/ipv6/inet6_hashtables.c
> +++ b/net/ipv6/inet6_hashtables.c
> @@ -174,7 +174,7 @@ struct sock *inet6_lookup_listener(struct net *net,
>saddr, sport, &in6addr_any, hnum,
>dif, sdif);
>  done:
> - if (unlikely(IS_ERR(result)))
> + if (IS_ERR(result))
>   return NULL;
>   return result;
>  }
> diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
> index b3418a7c5c74..693518350f79 100644
> --- a/net/ipv6/udp.c
> +++ b/net/ipv6/udp.c
> @@ -215,7 +215,7 @@ struct sock *__udp6_lib_lookup(struct net *net,
> exact_dif, hslot2,
> skb);
>   }
> - if (unlikely(IS_ERR(result)))
> + if (IS_ERR(result))
>   return NULL;
>   return result;
>  }
> diff --git a/net/sctp/socket.c b/net/sctp/socket.c
> index 39ea0a37af09..c7b0f51c19d5 100644
> --- a/net/sctp/socket.c
> +++ b/net/sctp/socket.c
>
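
For reference, the relevant definitions (paraphrased from
include/linux/err.h of that era) already carry the hint, which is the whole
point of the cleanup:

        #define MAX_ERRNO       4095

        #define IS_ERR_VALUE(x) unlikely((unsigned long)(void *)(x) >= (unsigned long)-MAX_ERRNO)

        static inline bool __must_check IS_ERR(const void *ptr)
        {
                return IS_ERR_VALUE((unsigned long)ptr);
        }

        static inline bool __must_check IS_ERR_OR_NULL(const void *ptr)
        {
                return unlikely(!ptr) || IS_ERR_VALUE((unsigned long)ptr);
        }

so a call site written as unlikely(IS_ERR(...)) just nests one hint inside
another.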

Re: [PATCH] net: sctp: fix memory leak in sctp_send_reset_streams

2019-06-02 Thread Neil Horman
On Sun, Jun 02, 2019 at 11:44:29AM +0800, Hillf Danton wrote:
> 
> syzbot found the following crash on:
> 
> HEAD commit:036e3431 Merge git://git.kernel.org/pub/scm/linux/kernel/g..
> git tree:   upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=153cff12a0
> kernel config:  https://syzkaller.appspot.com/x/.config?x=8f0f63a62bb5b13c
> dashboard link: https://syzkaller.appspot.com/bug?extid=6ad9c3bd0a218a2ab41d
> compiler:   gcc (GCC) 9.0.0 20181231 (experimental)
> syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=12561c86a0
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=15b76fd8a0
> 
> executing program
> executing program
> executing program
> executing program
> executing program
> BUG: memory leak
> unreferenced object 0x888123894820 (size 32):
>   comm "syz-executor045", pid 7267, jiffies 4294943559 (age 13.660s)
>   hex dump (first 32 bytes):
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
>   backtrace:
> [<c7e71c69>] kmemleak_alloc_recursive
> include/linux/kmemleak.h:55 [inline]
> [<c7e71c69>] slab_post_alloc_hook mm/slab.h:439 [inline]
> [<c7e71c69>] slab_alloc mm/slab.c:3326 [inline]
> [<c7e71c69>] __do_kmalloc mm/slab.c:3658 [inline]
> [<c7e71c69>] __kmalloc+0x161/0x2c0 mm/slab.c:3669
> [<3250ed8e>] kmalloc_array include/linux/slab.h:670 [inline]
> [<3250ed8e>] kcalloc include/linux/slab.h:681 [inline]
> [<3250ed8e>] sctp_send_reset_streams+0x1ab/0x5a0 
> net/sctp/stream.c:302
> [<cd899c6e>] sctp_setsockopt_reset_streams net/sctp/socket.c:4314 
> [inline]
> [<cd899c6e>] sctp_setsockopt net/sctp/socket.c:4765 [inline]
> [<cd899c6e>] sctp_setsockopt+0xc23/0x2bf0 net/sctp/socket.c:4608
> [<ff3a21a2>] sock_common_setsockopt+0x38/0x50 net/core/sock.c:3130
> [<9eb87ae7>] __sys_setsockopt+0x98/0x120 net/socket.c:2078
> [<e0ede6ca>] __do_sys_setsockopt net/socket.c:2089 [inline]
> [<e0ede6ca>] __se_sys_setsockopt net/socket.c:2086 [inline]
> [<e0ede6ca>] __x64_sys_setsockopt+0x26/0x30 net/socket.c:2086
> [<c61155f5>] do_syscall_64+0x76/0x1a0 arch/x86/entry/common.c:301
> [<e540958c>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> 
> It was introduced in commit d570a59c5b5f ("sctp: only allow the out stream
> reset when the stream outq is empty"), in order to check stream outqs before
> sending SCTP_STRRESET_IN_PROGRESS back to the peer of the stream. EAGAIN is
> returned, however, without releasing the nstr_list allocation, if any outq
> is found to be non-empty.
> 
> Freeing the slab in question before bailing out fixes it.
> 
> Fixes: d570a59c5b5f ("sctp: only allow the out stream reset when the stream 
> outq is empty")
> Reported-by: syzbot 
> Reported-by: Marcelo Ricardo Leitner 
> Tested-by: Marcelo Ricardo Leitner 
> Cc: Xin Long 
> Cc: Neil Horman 
> Cc: Vlad Yasevich 
> Cc: Eric Dumazet 
> Signed-off-by: Hillf Danton 
> ---
> net/sctp/stream.c | 1 +
> 1 file changed, 1 insertion(+)
> 
> diff --git a/net/sctp/stream.c b/net/sctp/stream.c
> index 93ed078..d3e2f03 100644
> --- a/net/sctp/stream.c
> +++ b/net/sctp/stream.c
> @@ -310,6 +310,7 @@ int sctp_send_reset_streams(struct sctp_association *asoc,
> 
>   if (out && !sctp_stream_outq_is_empty(stream, str_nums, nstr_list)) {
>   retval = -EAGAIN;
> + kfree(nstr_list);
>   goto out;
>   }
> 
> --
> 
> 
Acked-by: Neil Horman 
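
The targeted kfree() above is the minimal fix; a standalone sketch of the
goto-unwind alternative, which funnels every failure path through a single
cleanup point so no branch can leak the list (illustrative only, not the
applied patch):

        #include <errno.h>
        #include <stdio.h>
        #include <stdlib.h>

        static int reset_streams(size_t str_nums, int outq_busy)
        {
                unsigned short *nstr_list;
                int retval = -ENOMEM;

                nstr_list = calloc(str_nums, sizeof(*nstr_list));
                if (!nstr_list)
                        return retval;

                if (outq_busy) { /* stands in for the outq-not-empty check */
                        retval = -EAGAIN;
                        goto out;
                }

                /* build and send the stream reset request here */
                retval = 0;
        out:
                free(nstr_list); /* single exit point for the allocation */
                return retval;
        }

        int main(void)
        {
                printf("busy=%d idle=%d\n",
                       reset_streams(4, 1), reset_streams(4, 0));
                return 0;
        }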


Re: memory leak in sctp_process_init

2019-05-31 Thread Neil Horman
On Fri, May 31, 2019 at 09:42:42AM -0300, Marcelo Ricardo Leitner wrote:
> On Thu, May 30, 2019 at 03:56:34PM -0400, Neil Horman wrote:
> > On Thu, May 30, 2019 at 12:17:05PM -0300, Marcelo Ricardo Leitner wrote:
> ...
> > > --- a/net/sctp/sm_sideeffect.c
> > > +++ b/net/sctp/sm_sideeffect.c
> > > @@ -898,6 +898,11 @@ static void sctp_cmd_new_state(struct sctp_cmd_seq 
> > > *cmds,
> > >   asoc->rto_initial;
> > >   }
> > >  
> > > + if (sctp_state(asoc, ESTABLISHED)) {
> > > + kfree(asoc->peer.cookie);
> > > + asoc->peer.cookie = NULL;
> > > + }
> > > +
> > Not sure I follow why this is needed.  It doesn't hurt anything of course, 
> > but
> > if we're freeing in sctp_association_free, we don't need to duplicate the
> > operation here, do we?
> 
> This one would be to avoid storing the cookie throughout the entire
> association lifetime, as the cookie is only needed during the
> handshake.
> While the free in sctp_association_free will handle the freeing in
> case the association never enters established state.
> 

Ok, I see we do that with the peer_random and other allocated values as well
when they are no longer needed, but ew, I hate freeing in multiple places like
that.  I'll fix this up on Monday, but I wonder if we can't consolidate that
somehow.

Neil

> > >   if (sctp_state(asoc, ESTABLISHED) ||
> > >   sctp_state(asoc, CLOSED) ||
> > >   sctp_state(asoc, SHUTDOWN_RECEIVED)) {
> > > 
> > > Also untested, just sharing the idea.
> > > 
> > >   Marcelo
> > > 
> 
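
One way to do the consolidation Neil mentions would be a small helper shared
by sctp_cmd_new_state() and sctp_association_free() (sketch only; the helper
name is made up):

        static void sctp_assoc_free_cookie(struct sctp_association *asoc)
        {
                kfree(asoc->peer.cookie);
                asoc->peer.cookie = NULL;
        }

Since kfree(NULL) is a no-op and the pointer is reset, it stays safe no
matter which caller runs first.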


Re: memory leak in sctp_process_init

2019-05-30 Thread Neil Horman
On Thu, May 30, 2019 at 12:17:05PM -0300, Marcelo Ricardo Leitner wrote:
> On Thu, May 30, 2019 at 10:20:11AM -0400, Neil Horman wrote:
> > On Wed, May 29, 2019 at 08:37:57PM -0300, Marcelo Ricardo Leitner wrote:
> > > On Wed, May 29, 2019 at 03:07:09PM -0400, Neil Horman wrote:
> > > > --- a/net/sctp/sm_make_chunk.c
> > > > +++ b/net/sctp/sm_make_chunk.c
> > > > @@ -2419,9 +2419,12 @@ int sctp_process_init(struct sctp_association 
> > > > *asoc, struct sctp_chunk *chunk,
> > > > /* Copy cookie in case we need to resend COOKIE-ECHO. */
> > > > cookie = asoc->peer.cookie;
> > > > if (cookie) {
> > > > +   if (asoc->peer.cookie_allocated)
> > > > +   kfree(cookie);
> > > > asoc->peer.cookie = kmemdup(cookie, 
> > > > asoc->peer.cookie_len, gfp);
> > > > if (!asoc->peer.cookie)
> > > > goto clean_up;
> > > > +   asoc->peer.cookie_allocated=1;
> > > > }
> > > >  
> > > > /* RFC 2960 7.2.1 The initial value of ssthresh MAY be 
> > > > arbitrarily
> > > 
> > > What if we kmemdup directly at sctp_process_param(), as it's done for
> > > others already? Like SCTP_PARAM_RANDOM and SCTP_PARAM_HMAC_ALGO. I
> > > don't see a reason for SCTP_PARAM_STATE_COOKIE to be different
> > > here. This way it would be always allocated, and ready to be kfreed.
> > > 
> > > We still need to free it after the handshake, btw.
> > > 
> > >   Marcelo
> > > 
> > 
> > Still untested, but something like this?
> > 
> 
> Yes, just..
> 
> > 
> > diff --git a/net/sctp/associola.c b/net/sctp/associola.c
> > index d2c7d0d2abc1..718b9917844e 100644
> > --- a/net/sctp/associola.c
> > +++ b/net/sctp/associola.c
> > @@ -393,6 +393,7 @@ void sctp_association_free(struct sctp_association 
> > *asoc)
> > kfree(asoc->peer.peer_random);
> > kfree(asoc->peer.peer_chunks);
> > kfree(asoc->peer.peer_hmacs);
> > +   kfree(asoc->peer.cookie);
> 
> this chunk is not needed because it is freed right above the first
> kfree() in the context here.
> 
Ah, thanks, missed that.

> >  
> > /* Release the transport structures. */
> > list_for_each_safe(pos, temp, &asoc->peer.transport_addr_list) {
> > diff --git a/net/sctp/sm_make_chunk.c b/net/sctp/sm_make_chunk.c
> > index 72e74503f9fc..ff365f22a3c1 100644
> > --- a/net/sctp/sm_make_chunk.c
> > +++ b/net/sctp/sm_make_chunk.c
> > @@ -2431,14 +2431,6 @@ int sctp_process_init(struct sctp_association *asoc, 
> > struct sctp_chunk *chunk,
> > /* Peer Rwnd   : Current calculated value of the peer's rwnd.  */
> > asoc->peer.rwnd = asoc->peer.i.a_rwnd;
> >  
> > -   /* Copy cookie in case we need to resend COOKIE-ECHO. */
> > -   cookie = asoc->peer.cookie;
> > -   if (cookie) {
> > -   asoc->peer.cookie = kmemdup(cookie, asoc->peer.cookie_len, gfp);
> > -   if (!asoc->peer.cookie)
> > -   goto clean_up;
> > -   }
> > -
> > /* RFC 2960 7.2.1 The initial value of ssthresh MAY be arbitrarily
> >  * high (for example, implementations MAY use the size of the receiver
> >  * advertised window).
> > @@ -2607,7 +2599,9 @@ static int sctp_process_param(struct sctp_association 
> > *asoc,
> > case SCTP_PARAM_STATE_COOKIE:
> > asoc->peer.cookie_len =
> > ntohs(param.p->length) - sizeof(struct sctp_paramhdr);
> > -   asoc->peer.cookie = param.cookie->body;
> > +   asoc->peer.cookie = kmemdup(param.cookie->body, 
> > asoc->peer.cookie_len, gfp);
> > +   if (!asoc->peer.cookie)
> > +   retval = 0;
> > break;
> >  
> > case SCTP_PARAM_HEARTBEAT_INFO:
> 
> Plus:
> 
> --- a/net/sctp/sm_sideeffect.c
> +++ b/net/sctp/sm_sideeffect.c
> @@ -898,6 +898,11 @@ static void sctp_cmd_new_state(struct sctp_cmd_seq *cmds,
>   asoc->rto_initial;
>   }
>  
> + if (sctp_state(asoc, ESTABLISHED)) {
> + kfree(asoc->peer.cookie);
> + asoc->peer.cookie = NULL;
> + }
> +
Not sure I follow why this is needed.  It doesn't hurt anything of course, but
if we're freeing in sctp_association_free, we don't need to duplicate the
operation here, do we?
>   if (sctp_state(asoc, ESTABLISHED) ||
>   sctp_state(asoc, CLOSED) ||
>   sctp_state(asoc, SHUTDOWN_RECEIVED)) {
> 
> Also untested, just sharing the idea.
> 
>   Marcelo
> 


Re: memory leak in sctp_process_init

2019-05-30 Thread Neil Horman
On Wed, May 29, 2019 at 08:37:57PM -0300, Marcelo Ricardo Leitner wrote:
> On Wed, May 29, 2019 at 03:07:09PM -0400, Neil Horman wrote:
> > --- a/net/sctp/sm_make_chunk.c
> > +++ b/net/sctp/sm_make_chunk.c
> > @@ -2419,9 +2419,12 @@ int sctp_process_init(struct sctp_association *asoc, 
> > struct sctp_chunk *chunk,
> > /* Copy cookie in case we need to resend COOKIE-ECHO. */
> > cookie = asoc->peer.cookie;
> > if (cookie) {
> > +   if (asoc->peer.cookie_allocated)
> > +   kfree(cookie);
> > asoc->peer.cookie = kmemdup(cookie, asoc->peer.cookie_len, gfp);
> > if (!asoc->peer.cookie)
> > goto clean_up;
> > +   asoc->peer.cookie_allocated=1;
> > }
> >  
> > /* RFC 2960 7.2.1 The initial value of ssthresh MAY be arbitrarily
> 
> What if we kmemdup directly at sctp_process_param(), as it's done for
> others already? Like SCTP_PARAM_RANDOM and SCTP_PARAM_HMAC_ALGO. I
> don't see a reason for SCTP_PARAM_STATE_COOKIE to be different
> here. This way it would be always allocated, and ready to be kfreed.
> 
> We still need to free it after the handshake, btw.
> 
>   Marcelo
> 

Still untested, but something like this?


diff --git a/net/sctp/associola.c b/net/sctp/associola.c
index d2c7d0d2abc1..718b9917844e 100644
--- a/net/sctp/associola.c
+++ b/net/sctp/associola.c
@@ -393,6 +393,7 @@ void sctp_association_free(struct sctp_association *asoc)
kfree(asoc->peer.peer_random);
kfree(asoc->peer.peer_chunks);
kfree(asoc->peer.peer_hmacs);
+   kfree(asoc->peer.cookie);
 
/* Release the transport structures. */
list_for_each_safe(pos, temp, &asoc->peer.transport_addr_list) {
diff --git a/net/sctp/sm_make_chunk.c b/net/sctp/sm_make_chunk.c
index 72e74503f9fc..ff365f22a3c1 100644
--- a/net/sctp/sm_make_chunk.c
+++ b/net/sctp/sm_make_chunk.c
@@ -2431,14 +2431,6 @@ int sctp_process_init(struct sctp_association *asoc, 
struct sctp_chunk *chunk,
/* Peer Rwnd   : Current calculated value of the peer's rwnd.  */
asoc->peer.rwnd = asoc->peer.i.a_rwnd;
 
-   /* Copy cookie in case we need to resend COOKIE-ECHO. */
-   cookie = asoc->peer.cookie;
-   if (cookie) {
-   asoc->peer.cookie = kmemdup(cookie, asoc->peer.cookie_len, gfp);
-   if (!asoc->peer.cookie)
-   goto clean_up;
-   }
-
/* RFC 2960 7.2.1 The initial value of ssthresh MAY be arbitrarily
 * high (for example, implementations MAY use the size of the receiver
 * advertised window).
@@ -2607,7 +2599,9 @@ static int sctp_process_param(struct sctp_association 
*asoc,
case SCTP_PARAM_STATE_COOKIE:
asoc->peer.cookie_len =
ntohs(param.p->length) - sizeof(struct sctp_paramhdr);
-   asoc->peer.cookie = param.cookie->body;
+   asoc->peer.cookie = kmemdup(param.cookie->body, 
asoc->peer.cookie_len, gfp);
+   if (!asoc->peer.cookie)
+   retval = 0;
break;
 
case SCTP_PARAM_HEARTBEAT_INFO:
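
Putting the pieces of this thread together, the intended lifetime of the
duplicated cookie looks like this (a summary of the discussion, not code
from the patch):

        /*
         * sctp_process_param(SCTP_PARAM_STATE_COOKIE)
         *     asoc->peer.cookie = kmemdup(...);  private copy, may fail
         * sctp_cmd_new_state(ESTABLISHED)
         *     kfree + NULL the cookie            handshake done, copy unneeded
         * sctp_association_free()
         *     kfree(asoc->peer.cookie)           covers asocs that never
         *                                        reached ESTABLISHED
         */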


Re: memory leak in sctp_process_init

2019-05-30 Thread Neil Horman
On Wed, May 29, 2019 at 08:37:57PM -0300, Marcelo Ricardo Leitner wrote:
> On Wed, May 29, 2019 at 03:07:09PM -0400, Neil Horman wrote:
> > --- a/net/sctp/sm_make_chunk.c
> > +++ b/net/sctp/sm_make_chunk.c
> > @@ -2419,9 +2419,12 @@ int sctp_process_init(struct sctp_association *asoc, 
> > struct sctp_chunk *chunk,
> > /* Copy cookie in case we need to resend COOKIE-ECHO. */
> > cookie = asoc->peer.cookie;
> > if (cookie) {
> > +   if (asoc->peer.cookie_allocated)
> > +   kfree(cookie);
> > asoc->peer.cookie = kmemdup(cookie, asoc->peer.cookie_len, gfp);
> > if (!asoc->peer.cookie)
> > goto clean_up;
> > +   asoc->peer.cookie_allocated=1;
> > }
> >  
> > /* RFC 2960 7.2.1 The initial value of ssthresh MAY be arbitrarily
> 
> What if we kmemdup directly at sctp_process_param(), as it's done for
> others already? Like SCTP_PARAM_RANDOM and SCTP_PARAM_HMAC_ALGO. I
> don't see a reason for SCTP_PARAM_STATE_COOKIE to be different
> here. This way it would be always allocated, and ready to be kfreed.
> 
> We still need to free it after the handshake, btw.
> 
Yeah, that makes sense, I'll give that a shot.
Neil

>   Marcelo
> 


Re: [PATCH] Fix xoring of arch_get_random_long into crng->state array

2019-05-30 Thread Neil Horman
On Wed, May 29, 2019 at 11:12:01PM -0400, Theodore Ts'o wrote:
> On Tue, Apr 02, 2019 at 06:00:25PM -0400, Neil Horman wrote:
> > When _extract_crng is called, any arch that has a registered
> > arch_get_random_long method attempts to mix an unsigned long value into
> > the crng->state buffer, but it only mixes in 32 of the 64 bits available,
> > because the state buffer is an array of u32 values, even though 2 u32
> > slots are expected to be filled (owing to the fact that it expects
> > indexes 14 and 15 to be filled).
> 
> Index 15 does get initialized; in fact, it's changed each time
> crng_reseed() is called.
> 
> The way things currently work is that we use state[12] and state[13]
> as a 64-bit counter (it gets incremented each time we call
> _extract_crng), and state[14] and state[15] are nonce values.  After
> crng->state has been in use for five minutes, we reseed the crng by
> grabbing randomness from the input pool, and using that to initialize
> state[4..15].  (State[0..3] are always set to the ChaCha20 constant of
> "expand 32-byte k".)
> 
> If the CPU provides an RDRAND-like instruction (which can be the case
> for x86, PPC, and S390), we xor it into state[14].  Whether we xor any
> extra entropy into state[15], to be honest, really doesn't matter much.
> I think I was trying to keep things simple, and it wasn't worth it to
> call RDRAND twice on a 32-bit x86.  (And there isn't an
> arch_get_random_long_long.  :-)
> 
> Why do we do this at all?  Well, the goal was to feed in some
> contributing randomness from RDRAND when we turn the CRNG crank.  (The
> reason why we don't just XOR the RDRAND output into the output of the
> CRNG is mainly to assuage critics that a hypothetical RDRAND backdoor
> has access to the CPU registers.  So we perturb the inputs to the
> CRNG, on the theory that if malicious firmware can reverse
> CHACHA20... we've got bigger problems.  :-)  We get up to 20 bytes out
> of a single turn of the CRNG crank, so whether we mix in 4 bytes or 8
> bytes from RDRAND, we're never going to be depending on RDRAND
> completely in any case.
> 
> The bottom line is that I'm not at all convinced it worth the effort
> to mix in 8 bytes versus 4 bytes from RDRAND.  This is really a CRNG,
> and the RDRAND inputs really don't change that.
> 
Ok, so what I'm getting is that the exclusion of the second 32-bit word here
from &crng->state[15] isn't an oversight; it's just skipped because it's not
worth taking the time for the extra write there, and this is not a bug.  I'm ok
with that.

Thanks for the explanation
Neil

> - Ted
> 
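
The state layout Ted describes, collected in one place (indexes into the
crng->state[16] array):

        /*
         * state[0..3]    always the ChaCha20 constant "expand 32-byte k"
         * state[4..15]   refilled from the input pool on each reseed; within
         *                that range:
         * state[12..13]  used as a 64-bit counter, bumped per _extract_crng()
         * state[14..15]  used as the nonce; state[14] additionally gets the
         *                RDRAND-style value xored in
         */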


Re: memory leak in sctp_process_init

2019-05-29 Thread Neil Horman
On Tue, May 28, 2019 at 07:15:50AM -0400, Neil Horman wrote:
> On Mon, May 27, 2019 at 10:36:00PM -0300, Marcelo Ricardo Leitner wrote:
> > On Mon, May 27, 2019 at 05:48:06PM -0700, syzbot wrote:
> > > Hello,
> > > 
> > > syzbot found the following crash on:
> > > 
> > > HEAD commit:9c7db500 Merge tag 'selinux-pr-20190521' of 
> > > git://git.kern..
> > > git tree:   upstream
> > > console output: https://syzkaller.appspot.com/x/log.txt?x=10388530a0
> > > kernel config:  https://syzkaller.appspot.com/x/.config?x=61dd9e15a761691d
> > > dashboard link: 
> > > https://syzkaller.appspot.com/bug?extid=f7e9153b037eac9b1df8
> > > compiler:   gcc (GCC) 9.0.0 20181231 (experimental)
> > > syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=10e32f8ca0
> > > C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=177fa530a0
> > > 
> > > IMPORTANT: if you fix the bug, please add the following tag to the commit:
> > > Reported-by: syzbot+f7e9153b037eac9b1...@syzkaller.appspotmail.com
> > > 
> > >  0 to HW filter on device batadv0
> > > executing program
> > > executing program
> > > executing program
> > > BUG: memory leak
> > > unreferenced object 0x88810ef68400 (size 1024):
> > >   comm "syz-executor273", pid 7046, jiffies 4294945598 (age 28.770s)
> > >   hex dump (first 32 bytes):
> > > 1d de 28 8d de 0b 1b e3 b5 c2 f9 68 fd 1a 97 25  ..(h...%
> > > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
> > >   backtrace:
> > > [<a02cebbd>] kmemleak_alloc_recursive
> > > include/linux/kmemleak.h:55 [inline]
> > > [<a02cebbd>] slab_post_alloc_hook mm/slab.h:439 [inline]
> > > [<a02cebbd>] slab_alloc mm/slab.c:3326 [inline]
> > > [<a02cebbd>] __do_kmalloc mm/slab.c:3658 [inline]
> > > [<a02cebbd>] __kmalloc_track_caller+0x15d/0x2c0 mm/slab.c:3675
> > > [<9e6245e6>] kmemdup+0x27/0x60 mm/util.c:119
> > > [<dfdc5d2d>] kmemdup include/linux/string.h:432 [inline]
> > > [<dfdc5d2d>] sctp_process_init+0xa7e/0xc20
> > > net/sctp/sm_make_chunk.c:2437
> > > [<b58b62f8>] sctp_cmd_process_init 
> > > net/sctp/sm_sideeffect.c:682
> > > [inline]
> > > [<b58b62f8>] sctp_cmd_interpreter 
> > > net/sctp/sm_sideeffect.c:1384
> > > [inline]
> > > [<b58b62f8>] sctp_side_effects net/sctp/sm_sideeffect.c:1194
> > > [inline]
> > > [<b58b62f8>] sctp_do_sm+0xbdc/0x1d60
> > > net/sctp/sm_sideeffect.c:1165
> > 
> > Note that this is on the client side. It was handling the INIT_ACK
> > chunk, from sctp_sf_do_5_1C_ack().
> > 
> > I'm not seeing anything else other than sctp_association_free()
> > releasing this memory. This means 2 things:
> > - Every time the cookie is retransmitted, it leaks. As shown by the
> >   repetitive leaks here.
> > - The cookie remains allocated throughout the association, which is
> >   also not good as that's a 1k that we could have released back to the
> >   system right after the handshake.
> > 
> >   Marcelo
> > 
> If we have an INIT chunk bundled with a COOKIE_ECHO chunk in the same packet,
> this might occur.  Processing for each chunk (via sctp_cmd_process_init and
> sctp_sf_do_5_1D_ce, which both call sctp_process_init) would cause a second
> write to asoc->peer.cookie, leaving the first write (set via kmemdup) orphaned
> and leaked.  Seems like we should set a flag to determine if we've already
> cloned the cookie, and free the old one if it's set.  If we wanted to do that
> on the cheap, we might be able to get away with checking
> asoc->stream->[in|out]cnt for being non-zero as an indicator of whether we've
> already cloned the cookie.
> 
> Neil
> 
> 

Completely untested, but can you give this patch a shot?


diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
index 0767701ef362..a5772d72eb87 100644
--- a/include/net/sctp/structs.h
+++ b/include/net/sctp/structs.h
@@ -1701,6 +1701,7 @@ struct sctp_association {
__u8    sack_needed:1, /* Do we need to sack the peer? */
        sack_generation:1,
-       zero_window_announced:1;
+       zero_window_announced:1,
+       cookie_allocated:1;
__u32   sack_cnt;
 
__u32   adaptation_ind;  /* Adaptation Code point. */

Re: [PATCH net-next] sctp: deduplicate identical skb_checksum_ops

2019-05-29 Thread Neil Horman
On Wed, May 29, 2019 at 05:39:41PM +0200, Matteo Croce wrote:
> The same skb_checksum_ops struct is defined twice in two different places,
> leading to code duplication. Declare it as a global variable in a common
> header instead of allocating it on the stack on each function call.
> bloat-o-meter reports a slight code shrink.
> 
> add/remove: 1/1 grow/shrink: 0/10 up/down: 128/-1282 (-1154)
> Function                     old     new   delta
> sctp_csum_ops                  -     128    +128
> crc32c_csum_ops               16       -     -16
> sctp_rcv                    6616    6583     -33
> sctp_packet_pack            4542    4504     -38
> nf_conntrack_sctp_packet    4980    4926     -54
> execute_masked_set_action   6453    6389     -64
> tcf_csum_sctp                575     428    -147
> sctp_gso_segment            1292    1126    -166
> sctp_csum_check              579     412    -167
> sctp_snat_handler            957     772    -185
> sctp_dnat_handler           1321    1132    -189
> l4proto_manip_pkt           2536    2313    -223
> Total: Before=359297613, After=359296459, chg -0.00%
> 
> Reviewed-by: Xin Long 
> Signed-off-by: Matteo Croce 
> ---
>  include/net/sctp/checksum.h | 12 +++-
>  net/sctp/offload.c  |  7 +--
>  2 files changed, 8 insertions(+), 11 deletions(-)
> 
> diff --git a/include/net/sctp/checksum.h b/include/net/sctp/checksum.h
> index 314699333bec..5a9bb09f32b6 100644
> --- a/include/net/sctp/checksum.h
> +++ b/include/net/sctp/checksum.h
> @@ -43,19 +43,21 @@ static inline __wsum sctp_csum_combine(__wsum csum, 
> __wsum csum2,
>  (__force __u32)csum2, len);
>  }
>  
> +static const struct skb_checksum_ops sctp_csum_ops = {
> + .update  = sctp_csum_update,
> + .combine = sctp_csum_combine,
> +};
> +
>  static inline __le32 sctp_compute_cksum(const struct sk_buff *skb,
>   unsigned int offset)
>  {
>   struct sctphdr *sh = (struct sctphdr *)(skb->data + offset);
> - const struct skb_checksum_ops ops = {
> - .update  = sctp_csum_update,
> - .combine = sctp_csum_combine,
> - };
>   __le32 old = sh->checksum;
>   __wsum new;
>  
>   sh->checksum = 0;
> - new = ~__skb_checksum(skb, offset, skb->len - offset, ~(__wsum)0, &ops);
> + new = ~__skb_checksum(skb, offset, skb->len - offset, ~(__wsum)0,
> +   &sctp_csum_ops);
>   sh->checksum = old;
>  
>   return cpu_to_le32((__force __u32)new);
> diff --git a/net/sctp/offload.c b/net/sctp/offload.c
> index edfcf16e704c..dac46dfadab5 100644
> --- a/net/sctp/offload.c
> +++ b/net/sctp/offload.c
> @@ -103,11 +103,6 @@ static const struct net_offload sctp6_offload = {
>   },
>  };
>  
> -static const struct skb_checksum_ops crc32c_csum_ops = {
> - .update  = sctp_csum_update,
> - .combine = sctp_csum_combine,
> -};
> -
>  int __init sctp_offload_init(void)
>  {
>   int ret;
> @@ -120,7 +115,7 @@ int __init sctp_offload_init(void)
>   if (ret)
>   goto ipv4;
>  
> - crc32c_csum_stub = &crc32c_csum_ops;
> + crc32c_csum_stub = &sctp_csum_ops;
>   return ret;
>  
>  ipv4:
> -- 
> 2.21.0
> 
> 
Acked-by: Neil Horman 



Re: [PATCH] Fix xoring of arch_get_random_long into crng->state array

2019-05-29 Thread Neil Horman
On Wed, May 29, 2019 at 03:57:07PM +, David Laight wrote:
> From: Neil Horman [mailto:nhor...@tuxdriver.com]
> > Sent: 29 May 2019 16:52
> > On Wed, May 29, 2019 at 01:51:24PM +, David Laight wrote:
> > > From: Neil Horman
> > > > Sent: 29 May 2019 14:42
> > > > On Tue, Apr 02, 2019 at 06:00:25PM -0400, Neil Horman wrote:
> > > > > When _extract_crng is called, any arch that has a registered
> > > > > arch_get_random_long method attempts to mix an unsigned long value
> > > > > into the crng->state buffer, but it only mixes in 32 of the 64 bits
> > > > > available, because the state buffer is an array of u32 values, even
> > > > > though 2 u32 slots are expected to be filled (owing to the fact that
> > > > > it expects indexes 14 and 15 to be filled).
> > > > >
> > > > > Bring the expected behavior into alignment by casting index 14 to an
> > > > > unsigned long pointer, and xoring that in instead.
> > > ...
> > > > > diff --git a/drivers/char/random.c b/drivers/char/random.c
> > > > > index 38c6d1af6d1c..8178618458ac 100644
> > > > > --- a/drivers/char/random.c
> > > > > +++ b/drivers/char/random.c
> > > > > @@ -975,14 +975,16 @@ static void _extract_crng(struct crng_state 
> > > > > *crng,
> > > > > __u8 out[CHACHA_BLOCK_SIZE])
> > > > >  {
> > > > >   unsigned long v, flags;
> > > > > -
> > > > > + unsigned long *archrnd;
> > > > >   if (crng_ready() &&
> > > > >   (time_after(crng_global_init_time, crng->init_time) ||
> > > > >time_after(jiffies, crng->init_time + 
> > > > > CRNG_RESEED_INTERVAL)))
> > > > >   crng_reseed(crng, crng == _crng ? _pool : 
> > > > >   crng_reseed(crng, crng == &primary_crng ? &input_pool : NULL);
> > > > >   spin_lock_irqsave(&crng->lock, flags);
> > > > > - if (arch_get_random_long(&v))
> > > > > - crng->state[14] ^= v;
> > > > > + if (arch_get_random_long(&v)) {
> > > > > + archrnd = (unsigned long *)&crng->state[14];
> > > > > + *archrnd ^= v;
> > > > > + }
> > > Isn't that likely to generate a misaligned memory access?
> > >
> > I'm not quite sure how it would; crng->state is an array of __u32s, and so
> > every even element should be on a 64-bit boundary.
> 
> Only if the first item is aligned
> Add a u32 before it and you'll probably flip the alignment.
> 
Sure (assuming no padding by the compiler of leading elements), but that's not
the case here; state is the first member of the struct.  I suppose we could add
an __attribute__((aligned(8))) to the member if you think it would help.

Neil

>   David
> 
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 
> 1PT, UK
> Registration No: 1397386 (Wales)
> 
> 
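
The alignment question is easy to check in isolation; a standalone sketch
(the struct name is made up, but the layout mirrors crng_state, whose
state[] array is the first member), including the aligned attribute being
suggested:

        #include <stdint.h>
        #include <stdio.h>

        struct crng_like {
                uint32_t state[16]; /* first member, as in struct crng_state */
        } __attribute__((aligned(8)));

        int main(void)
        {
                struct crng_like c;
                /* an even index, so 8-byte aligned whenever state[0] is */
                unsigned long *p = (unsigned long *)&c.state[14];

                printf("aligned for unsigned long: %d\n",
                       ((uintptr_t)p % sizeof(unsigned long)) == 0);
                return 0;
        }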


Re: [PATCH] Fix xoring of arch_get_random_long into crng->state array

2019-05-29 Thread Neil Horman
On Wed, May 29, 2019 at 01:51:24PM +, David Laight wrote:
> From: Neil Horman
> > Sent: 29 May 2019 14:42
> > On Tue, Apr 02, 2019 at 06:00:25PM -0400, Neil Horman wrote:
> > > When _extract_crng is called, any arch that has a registered
> > > arch_get_random_long method attempts to mix an unsigned long value into
> > > the crng->state buffer, but it only mixes in 32 of the 64 bits available,
> > > because the state buffer is an array of u32 values, even though 2 u32
> > > slots are expected to be filled (owing to the fact that it expects
> > > indexes 14 and 15 to be filled).
> > >
> > > Bring the expected behavior into alignment by casting index 14 to an
> > > unsigned long pointer, and xoring that in instead.
> ...
> > > diff --git a/drivers/char/random.c b/drivers/char/random.c
> > > index 38c6d1af6d1c..8178618458ac 100644
> > > --- a/drivers/char/random.c
> > > +++ b/drivers/char/random.c
> > > @@ -975,14 +975,16 @@ static void _extract_crng(struct crng_state *crng,
> > > __u8 out[CHACHA_BLOCK_SIZE])
> > >  {
> > >   unsigned long v, flags;
> > > -
> > > + unsigned long *archrnd;
> > >   if (crng_ready() &&
> > >   (time_after(crng_global_init_time, crng->init_time) ||
> > >time_after(jiffies, crng->init_time + CRNG_RESEED_INTERVAL)))
> > >   crng_reseed(crng, crng == &primary_crng ? &input_pool : NULL);
> > >   spin_lock_irqsave(&crng->lock, flags);
> > > - if (arch_get_random_long(&v))
> > > - crng->state[14] ^= v;
> > > + if (arch_get_random_long(&v)) {
> > > + archrnd = (unsigned long *)&crng->state[14];
> > > + *archrnd ^= v;
> > > + }
> 
> Isn't that likely to generate a misaligned memory access?
> 
I'm not quite sure how it would; crng->state is an array of __u32s, and so every
even element should be on a 64-bit boundary.

Neil

>   David
> 
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 
> 1PT, UK
> Registration No: 1397386 (Wales)
> 
> 


Re: [PATCH] Fix xoring of arch_get_random_long into crng->state array

2019-05-29 Thread Neil Horman
On Tue, Apr 02, 2019 at 06:00:25PM -0400, Neil Horman wrote:
> When _extract_crng is called, any arch that has a registered
> arch_get_random_long method attempts to mix an unsigned long value into
> the crng->state buffer, but it only mixes in 32 of the 64 bits available,
> because the state buffer is an array of u32 values, even though 2 u32
> slots are expected to be filled (owing to the fact that it expects
> indexes 14 and 15 to be filled).
> 
> Bring the expected behavior into alignment by casting index 14 to an
> unsigned long pointer, and xoring that in instead.
> 
> Tested successfully by myself
> 
> Signed-off-by: Neil Horman 
> Reported-by: Steve Grubb 
> CC: "Theodore Ts'o" 
> CC: Arnd Bergmann 
> CC: Greg Kroah-Hartman 
> ---
>  drivers/char/random.c | 8 +---
>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/char/random.c b/drivers/char/random.c
> index 38c6d1af6d1c..8178618458ac 100644
> --- a/drivers/char/random.c
> +++ b/drivers/char/random.c
> @@ -975,14 +975,16 @@ static void _extract_crng(struct crng_state *crng,
> __u8 out[CHACHA_BLOCK_SIZE])
>  {
>   unsigned long v, flags;
> -
> + unsigned long *archrnd;
>   if (crng_ready() &&
>   (time_after(crng_global_init_time, crng->init_time) ||
>time_after(jiffies, crng->init_time + CRNG_RESEED_INTERVAL)))
>   crng_reseed(crng, crng == &primary_crng ? &input_pool : NULL);
>   spin_lock_irqsave(&crng->lock, flags);
> - if (arch_get_random_long(&v))
> - crng->state[14] ^= v;
> + if (arch_get_random_long(&v)) {
> + archrnd = (unsigned long *)&crng->state[14];
> + *archrnd ^= v;
> + }
>   chacha20_block(&crng->state[0], out);
>   if (crng->state[12] == 0)
>   crng->state[13]++;
> -- 
> 2.20.1
> 
> 

Ping, Arnd, Ted, Greg, any comment here?
Neil



Re: memory leak in sctp_process_init

2019-05-28 Thread Neil Horman
On Mon, May 27, 2019 at 10:36:00PM -0300, Marcelo Ricardo Leitner wrote:
> On Mon, May 27, 2019 at 05:48:06PM -0700, syzbot wrote:
> > Hello,
> > 
> > syzbot found the following crash on:
> > 
> > HEAD commit:9c7db500 Merge tag 'selinux-pr-20190521' of git://git.kern..
> > git tree:   upstream
> > console output: https://syzkaller.appspot.com/x/log.txt?x=10388530a0
> > kernel config:  https://syzkaller.appspot.com/x/.config?x=61dd9e15a761691d
> > dashboard link: https://syzkaller.appspot.com/bug?extid=f7e9153b037eac9b1df8
> > compiler:   gcc (GCC) 9.0.0 20181231 (experimental)
> > syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=10e32f8ca0
> > C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=177fa530a0
> > 
> > IMPORTANT: if you fix the bug, please add the following tag to the commit:
> > Reported-by: syzbot+f7e9153b037eac9b1...@syzkaller.appspotmail.com
> > 
> >  0 to HW filter on device batadv0
> > executing program
> > executing program
> > executing program
> > BUG: memory leak
> > unreferenced object 0x88810ef68400 (size 1024):
> >   comm "syz-executor273", pid 7046, jiffies 4294945598 (age 28.770s)
> >   hex dump (first 32 bytes):
> > 1d de 28 8d de 0b 1b e3 b5 c2 f9 68 fd 1a 97 25  ..(h...%
> > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
> >   backtrace:
> > [] kmemleak_alloc_recursive
> > include/linux/kmemleak.h:55 [inline]
> > [] slab_post_alloc_hook mm/slab.h:439 [inline]
> > [] slab_alloc mm/slab.c:3326 [inline]
> > [] __do_kmalloc mm/slab.c:3658 [inline]
> > [] __kmalloc_track_caller+0x15d/0x2c0 mm/slab.c:3675
> > [<9e6245e6>] kmemdup+0x27/0x60 mm/util.c:119
> > [] kmemdup include/linux/string.h:432 [inline]
> > [] sctp_process_init+0xa7e/0xc20
> > net/sctp/sm_make_chunk.c:2437
> > [] sctp_cmd_process_init net/sctp/sm_sideeffect.c:682
> > [inline]
> > [] sctp_cmd_interpreter net/sctp/sm_sideeffect.c:1384
> > [inline]
> > [] sctp_side_effects net/sctp/sm_sideeffect.c:1194
> > [inline]
> > [] sctp_do_sm+0xbdc/0x1d60
> > net/sctp/sm_sideeffect.c:1165
> 
> Note that this is on the client side. It was handling the INIT_ACK
> chunk, from sctp_sf_do_5_1C_ack().
> 
> I'm not seeing anything else other than sctp_association_free()
> releasing this memory. This means 2 things:
> - Every time the cookie is retransmitted, it leaks. As shown by the
>   repetitive leaks here.
> - The cookie remains allocated throughout the association, which is
>   also not good as that's a 1k that we could have released back to the
>   system right after the handshake.
> 
>   Marcelo
> 
If we have an INIT chunk bundled with a COOKIE_ECHO chunk in the same packet,
this might occur.  Processing for each chunk (via sctp_cmd_process_init and
sctp_sf_do_5_1D_ce, which both call sctp_process_init) would cause a second
write to asoc->peer.cookie, leaving the first write (set via kmemdup) orphaned
and leaked.  Seems like we should set a flag to determine if we've already
cloned the cookie, and free the old one if it's set.  If we wanted to do that
on the cheap, we might be able to get away with checking
asoc->stream->[in|out]cnt for being non-zero as an indicator of whether we've
already cloned the cookie.

Neil



Re: [PATCH net-next 10/12] net: hns3: Add handling of MAC tunnel interruption

2019-04-19 Thread Neil Horman
On Fri, Apr 19, 2019 at 11:05:45AM +0800, Huazhong Tan wrote:
> From: Weihang Li 
> 
> MAC tnl interrupts are different from other types of RAS and MSI-X
> errors, because some bits, such as OVF/LR/RF, will occur during link up
> and down.
> 
> The driver should clear the status of all MAC tnl interrupt bits but
> shouldn't print any message that would mislead the users.
> 
> In case the link goes down and comes back up in a short time for some
> reason, we record when these events occurred, and users can query them
> via debugfs.
> 
> Signed-off-by: Weihang Li 
> Signed-off-by: Peng Li 
> ---
>>
>bool en)
>  {
> @@ -1611,6 +1636,7 @@ pci_ers_result_t hclge_handle_hw_ras_error(struct 
> hnae3_ae_dev *ae_dev)
>  int hclge_handle_hw_msix_error(struct hclge_dev *hdev,
>  unsigned long *reset_requests)
>  {
> + struct hclge_mac_tnl_stats mac_tnl_stats;
>   struct device *dev = &hdev->pdev->dev;
>   u32 mpf_bd_num, pf_bd_num, bd_num;
>   enum hnae3_reset_type reset_level;
> @@ -1745,6 +1771,31 @@ int hclge_handle_hw_msix_error(struct hclge_dev *hdev,
>   set_bit(HNAE3_GLOBAL_RESET, reset_requests);
>   }
>  
> + /* query and clear mac tnl interruptions */
> + hclge_cmd_setup_basic_desc(&desc[0], HCLGE_OPC_QUERY_MAC_TNL_INT,
> +true);
> + ret = hclge_cmd_send(&hdev->hw, &desc[0], 1);
Is this running in interrupt context ever?  I don't think it is, but the
function name makes me think otherwise.  If it is, this could be unsafe as you
take a spinlock in hclge_cmd_send, which is protected against bottom halves, but
not interrupts.  That could cause a deadlock if there is a path to get here
directly from an interrupt context.
Neil
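
The hazard being asked about, in miniature (a hypothetical trace; the open
question is whether the right-hand column can ever happen):

        process/softirq context              hard IRQ on the same CPU
        -----------------------              -------------------------
        hclge_cmd_send()
          spin_lock_bh(&hw->cmq.csq.lock)
                                      --->   handler eventually reaches
                                             hclge_cmd_send()
                                               spin_lock_bh(...)  spins on a
                                               lock this CPU already holds

spin_lock_bh() masks softirqs but not hard interrupts, so
spin_lock_irqsave() would be required if such a path exists.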



Re: [PATCH ghak90 V6 05/10] audit: add contid support for signalling the audit daemon

2019-04-09 Thread Neil Horman
On Tue, Apr 09, 2019 at 02:57:50PM +0200, Ondrej Mosnacek wrote:
> On Tue, Apr 9, 2019 at 5:40 AM Richard Guy Briggs  wrote:
> > Add audit container identifier support to the action of signalling the
> > audit daemon.
> >
> > Since this would need to add an element to the audit_sig_info struct,
> > a new record type AUDIT_SIGNAL_INFO2 was created with a new
> > audit_sig_info2 struct.  Corresponding support is required in the
> > userspace code to reflect the new record request and reply type.
> > An older userspace won't break since it won't know to request this
> > record type.
> >
> > Signed-off-by: Richard Guy Briggs 
> 
> This looks good to me.
> 
> Reviewed-by: Ondrej Mosnacek 
> 
> Although I'm wondering if we shouldn't try to future-proof the
> AUDIT_SIGNAL_INFO2 format somehow, so that we don't need to add
> another AUDIT_SIGNAL_INFO3 when the need arises to add yet-another
> identifier to it... The simplest solution I can come up with is to add
> a "version" field at the beginning (set to 2 initially), then v<X>_len
> at the beginning of data for version <X>. But maybe this is too
> complicated for too little gain...
> 
So, I'm not sure how often this needs to be revised (if it's not often, this may
be just fine), but if future-proofing is warranted, it might be worthwhile to
just use the netlink TLV encoding that's available today.  The kernel has a
suite of nla_put_ macros (like nla_put_u32()), and the userspace netlink
library can parse those messages fairly easily.  It would let you send
arbitrary-length messages with a terminator type at the end of the array.
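
Roughly what that TLV encoding could look like for this reply (a sketch:
the AUDIT_A_* attribute numbers are made up, while nla_put_u32() and
nla_put_u64_64bit() are the stock helpers from include/net/netlink.h):

        static int audit_put_sig_info2(struct sk_buff *skb,
                                       const struct audit_sig_info2 *s)
        {
                if (nla_put_u32(skb, AUDIT_A_UID, s->uid) ||
                    nla_put_u32(skb, AUDIT_A_PID, s->pid) ||
                    nla_put_u64_64bit(skb, AUDIT_A_CID, s->cid, AUDIT_A_PAD))
                        return -EMSGSIZE;
                return 0;
        }

A future identifier then becomes one more attribute type instead of an
AUDIT_SIGNAL_INFO3.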

That said, I don't think we want to do that right now for just this message.  A
better approach would be to do this now, and in a subsequent patch, create an
AUDIT version 2 netlink protocol that converts all the messages we send to that
format for consistency.  Such a change would be large and warrant its own patch
set and review.

I'm good with this patch as it is

Acked-by: Neil Horman 

> > ---
> >  include/linux/audit.h   |  7 +++
> >  include/uapi/linux/audit.h  |  1 +
> >  kernel/audit.c  | 27 +++
> >  kernel/audit.h  |  1 +
> >  kernel/auditsc.c|  1 +
> >  security/selinux/nlmsgtab.c |  1 +
> >  6 files changed, 38 insertions(+)
> >
> > diff --git a/include/linux/audit.h b/include/linux/audit.h
> > index 43438192ca2a..c2dec9157463 100644
> > --- a/include/linux/audit.h
> > +++ b/include/linux/audit.h
> > @@ -37,6 +37,13 @@ struct audit_sig_info {
> > char    ctx[0];
> >  };
> >
> > +struct audit_sig_info2 {
> > +   uid_t   uid;
> > +   pid_t   pid;
> > +   u64 cid;
> > +   char    ctx[0];
> > +};
> > +
> >  struct audit_buffer;
> >  struct audit_context;
> >  struct inode;
> > diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h
> > index 55fde9970762..10cc67926cf1 100644
> > --- a/include/uapi/linux/audit.h
> > +++ b/include/uapi/linux/audit.h
> > @@ -72,6 +72,7 @@
> >  #define AUDIT_SET_FEATURE  1018/* Turn an audit feature on or off 
> > */
> >  #define AUDIT_GET_FEATURE  1019/* Get which features are enabled */
> >  #define AUDIT_CONTAINER_OP 1020/* Define the container id and info 
> > */
> > +#define AUDIT_SIGNAL_INFO2 1021/* Get info auditd signal sender */
> >
> >  #define AUDIT_FIRST_USER_MSG   1100/* Userspace messages mostly 
> > uninteresting to kernel */
> >  #define AUDIT_USER_AVC 1107/* We filter this differently */
> > diff --git a/kernel/audit.c b/kernel/audit.c
> > index 3e0af53f3c4d..87e1d367f98c 100644
> > --- a/kernel/audit.c
> > +++ b/kernel/audit.c
> > @@ -138,6 +138,7 @@ struct audit_net {
> >  kuid_t audit_sig_uid = INVALID_UID;
> >  pid_t  audit_sig_pid = -1;
> > u32     audit_sig_sid = 0;
> > +u64    audit_sig_cid = AUDIT_CID_UNSET;
> >
> >  /* Records can be lost in several ways:
> > 0) [suppressed in audit_alloc]
> > @@ -1097,6 +1098,7 @@ static int audit_netlink_ok(struct sk_buff *skb, u16 
> > msg_type)
> > case AUDIT_ADD_RULE:
> > case AUDIT_DEL_RULE:
> > case AUDIT_SIGNAL_INFO:
> > +   case AUDIT_SIGNAL_INFO2:
> > case AUDIT_TTY_GET:
> > case AUDIT_TTY_SET:
> > case AUDIT_TRIM:
> > @@ -1260,6 +1262,7 @@ static int audit_receive_msg(struct sk_buff *skb, 
> > struct nlmsghdr *nlh)
> > struct audit_buffer *

[PATCH] Fix xoring of arch_get_random_long into crng->state array

2019-04-02 Thread Neil Horman
When _extract_crng is called, any arch that has a registered
arch_get_random_long method attempts to mix an unsigned long value into
the crng->state buffer, but it only mixes in 32 of the 64 bits available,
because the state buffer is an array of u32 values, even though 2 u32
slots are expected to be filled (owing to the fact that it expects
indexes 14 and 15 to be filled).

Bring the expected behavior into alignment by casting index 14 to an
unsigned long pointer, and xoring that in instead.

Tested successfully by myself

Signed-off-by: Neil Horman 
Reported-by: Steve Grubb 
CC: "Theodore Ts'o" 
CC: Arnd Bergmann 
CC: Greg Kroah-Hartman 
---
 drivers/char/random.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 38c6d1af6d1c..8178618458ac 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -975,14 +975,16 @@ static void _extract_crng(struct crng_state *crng,
  __u8 out[CHACHA_BLOCK_SIZE])
 {
unsigned long v, flags;
-
+   unsigned long *archrnd;
if (crng_ready() &&
(time_after(crng_global_init_time, crng->init_time) ||
 time_after(jiffies, crng->init_time + CRNG_RESEED_INTERVAL)))
crng_reseed(crng, crng == &primary_crng ? &input_pool : NULL);
spin_lock_irqsave(&crng->lock, flags);
-   if (arch_get_random_long(&v))
-   crng->state[14] ^= v;
+   if (arch_get_random_long(&v)) {
+   archrnd = (unsigned long *)&crng->state[14];
+   *archrnd ^= v;
+   }
chacha20_block(&crng->state[0], out);
if (crng->state[12] == 0)
crng->state[13]++;
-- 
2.20.1



Re: [PATCH v19,RESEND 01/27] x86/cpufeatures: Add Intel-defined SGX feature bit

2019-03-20 Thread Neil Horman
On Wed, Mar 20, 2019 at 06:20:53PM +0200, Jarkko Sakkinen wrote:
> From: Kai Huang 
> 
> X86_FEATURE_SGX reflects whether or not the CPU supports Intel's
> Software Guard eXtensions (SGX).
> 
> Signed-off-by: Kai Huang 
> Co-developed-by: Jarkko Sakkinen 
> Signed-off-by: Jarkko Sakkinen 
> Reviewed-by: Borislav Petkov 
> ---
>  arch/x86/include/asm/cpufeatures.h   | 1 +
>  arch/x86/include/asm/disabled-features.h | 8 +++-
>  2 files changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/include/asm/cpufeatures.h 
> b/arch/x86/include/asm/cpufeatures.h
> index 981ff9479648..a16325db4cff 100644
> --- a/arch/x86/include/asm/cpufeatures.h
> +++ b/arch/x86/include/asm/cpufeatures.h
> @@ -236,6 +236,7 @@
>  /* Intel-defined CPU features, CPUID level 0x0007:0 (EBX), word 9 */
>  #define X86_FEATURE_FSGSBASE ( 9*32+ 0) /* RDFSBASE, WRFSBASE, 
> RDGSBASE, WRGSBASE instructions*/
>  #define X86_FEATURE_TSC_ADJUST   ( 9*32+ 1) /* TSC adjustment 
> MSR 0x3B */
> +#define X86_FEATURE_SGX  ( 9*32+ 2) /* Software Guard 
> Extensions */
>  #define X86_FEATURE_BMI1 ( 9*32+ 3) /* 1st group bit 
> manipulation extensions */
>  #define X86_FEATURE_HLE  ( 9*32+ 4) /* Hardware Lock 
> Elision */
>  #define X86_FEATURE_AVX2 ( 9*32+ 5) /* AVX2 instructions */
> diff --git a/arch/x86/include/asm/disabled-features.h 
> b/arch/x86/include/asm/disabled-features.h
> index a5ea841cc6d2..74de07d0f390 100644
> --- a/arch/x86/include/asm/disabled-features.h
> +++ b/arch/x86/include/asm/disabled-features.h
> @@ -62,6 +62,12 @@
>  # define DISABLE_PTI (1 << (X86_FEATURE_PTI & 31))
>  #endif
>  
> +#ifdef CONFIG_INTEL_SGX
> +# define DISABLE_SGX_CORE0
> +#else
> +# define DISABLE_SGX_CORE(1 << (X86_FEATURE_SGX & 31))
> +#endif
> +
>  /*
>   * Make sure to add features to the correct mask
>   */
> @@ -74,7 +80,7 @@
>  #define DISABLED_MASK6   0
>  #define DISABLED_MASK7   (DISABLE_PTI)
>  #define DISABLED_MASK8   0
> -#define DISABLED_MASK9   (DISABLE_MPX|DISABLE_SMAP)
> +#define DISABLED_MASK9   (DISABLE_MPX|DISABLE_SMAP|DISABLE_SGX_CORE)
>  #define DISABLED_MASK10  0
>  #define DISABLED_MASK11  0
>  #define DISABLED_MASK12  0
> -- 
> 2.19.1
> 
Just out of curiosity, would it be worthwhile to separate out the cpufeature
patches here to post and integrate them separately?  It would at least reduce
the size of this patch set slightly, as these aren't controversial changes

Neil
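
For context, this disabled-features plumbing is what lets SGX call sites
compile out statically: with CONFIG_INTEL_SGX=n the bit lands in
DISABLED_MASK9, and a guard like the following (sketch) constant-folds to
"always -ENODEV", letting the compiler discard everything behind it:

        if (!cpu_feature_enabled(X86_FEATURE_SGX))
                return -ENODEV;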



Re: [PATCH ghak90 V5 08/10] audit: add containerid filtering

2019-03-18 Thread Neil Horman
> + if (IS_ERR(str))
> + goto exit_free;
> + f->val64 = ((u64 *)str)[0];
> + break;
>   }
>   }
>  
> @@ -664,6 +673,11 @@ static struct audit_rule_data 
> *audit_krule_to_data(struct audit_krule *krule)
>   data->buflen += data->values[i] =
>   audit_pack_string(&bufp, audit_mark_path(krule->exe));
>   break;
> + case AUDIT_CONTID:
> + data->buflen += data->values[i] = sizeof(u64);
> + for (i = 0; i < sizeof(u64); i++)
> + ((char *)bufp)[i] = ((char *)&f->val64)[i];
> + break;
>   case AUDIT_LOGINUID_SET:
>   if (krule->pflags & AUDIT_LOGINUID_LEGACY && !f->val) {
>   data->fields[i] = AUDIT_LOGINUID;
> @@ -750,6 +764,10 @@ static int audit_compare_rule(struct audit_krule *a, 
> struct audit_krule *b)
>   if (!gid_eq(a->fields[i].gid, b->fields[i].gid))
>   return 1;
>   break;
> + case AUDIT_CONTID:
> + if (a->fields[i].val64 != b->fields[i].val64)
> + return 1;
> + break;
>   default:
>   if (a->fields[i].val != b->fields[i].val)
>   return 1;
> @@ -1206,6 +1224,31 @@ int audit_comparator(u32 left, u32 op, u32 right)
>   }
>  }
>  
> +int audit_comparator64(u64 left, u32 op, u64 right)
> +{
> + switch (op) {
> + case Audit_equal:
> + return (left == right);
> + case Audit_not_equal:
> + return (left != right);
> + case Audit_lt:
> + return (left < right);
> + case Audit_le:
> + return (left <= right);
> + case Audit_gt:
> + return (left > right);
> + case Audit_ge:
> + return (left >= right);
> + case Audit_bitmask:
> + return (left & right);
> + case Audit_bittest:
> + return ((left & right) == right);
> + default:
> + BUG();
> + return 0;
> + }
> +}
> +
>  int audit_uid_comparator(kuid_t left, u32 op, kuid_t right)
>  {
>   switch (op) {
> @@ -1344,6 +1387,10 @@ int audit_filter(int msgtype, unsigned int listtype)
>   result = 
> audit_comparator(audit_loginuid_set(current),
> f->op, f->val);
>   break;
> + case AUDIT_CONTID:
> + result = 
> audit_comparator64(audit_get_contid(current),
> +   f->op, f->val64);
> + break;
>   case AUDIT_MSGTYPE:
>   result = audit_comparator(msgtype, f->op, 
> f->val);
>   break;
> diff --git a/kernel/auditsc.c b/kernel/auditsc.c
> index aa5d13b4fbbb..2d74238e9638 100644
> --- a/kernel/auditsc.c
> +++ b/kernel/auditsc.c
> @@ -616,6 +616,9 @@ static int audit_filter_rules(struct task_struct *tsk,
>   case AUDIT_LOGINUID_SET:
>   result = audit_comparator(audit_loginuid_set(tsk), 
> f->op, f->val);
>   break;
> + case AUDIT_CONTID:
> + result = audit_comparator64(audit_get_contid(tsk), 
> f->op, f->val64);
> + break;
>   case AUDIT_SUBJ_USER:
>   case AUDIT_SUBJ_ROLE:
>   case AUDIT_SUBJ_TYPE:
> -- 
> 1.8.3.1
> 
> 
Acked-by: Neil Horman 
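
Two small notes on the AUDIT_CONTID arm of audit_krule_to_data() above: the
byte-at-a-time loop is just an open-coded unaligned store, equivalent to the
sketch

        memcpy(bufp, &f->val64, sizeof(u64));

and the loop appears to reuse the enclosing field iterator i as its own
counter, which would need a separate index variable in a final version.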


Re: general protection fault in sctp_sched_rr_dequeue

2019-03-06 Thread Neil Horman
On Wed, Mar 06, 2019 at 06:43:48PM +0800, Xin Long wrote:
> On Wed, Mar 6, 2019 at 9:42 AM syzbot
>  wrote:
> >
> > Hello,
> >
> > syzbot found the following crash on:
> >
> > HEAD commit:63bdf4284c38 Merge branch 'linus' of git://git.kernel.org/..
> > git tree:   upstream
> > console output: https://syzkaller.appspot.com/x/log.txt?x=100347cb20
> > kernel config:  https://syzkaller.appspot.com/x/.config?x=872be05707464aaa
> > dashboard link: https://syzkaller.appspot.com/bug?extid=4c9934f20522c0efd657
> > compiler:   gcc (GCC) 9.0.0 20181231 (experimental)
> > syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=11cd9b0320
> > C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=127de8e720
> >
> > IMPORTANT: if you fix the bug, please add the following tag to the commit:
> > Reported-by: syzbot+4c9934f20522c0efd...@syzkaller.appspotmail.com
> >
> > kauditd_printk_skb: 2 callbacks suppressed
> > audit: type=1400 audit(1551833288.424:35): avc:  denied  { map } for
> > pid=8035 comm="bash" path="/bin/bash" dev="sda1" ino=1457
> > scontext=unconfined_u:system_r:insmod_t:s0-s0:c0.c1023
> > tcontext=system_u:object_r:file_t:s0 tclass=file permissive=1
> > audit: type=1400 audit(1551833294.934:36): avc:  denied  { map } for
> > pid=8047 comm="syz-executor778" path="/root/syz-executor778173561"
> > dev="sda1" ino=16484 scontext=unconfined_u:system_r:insmod_t:s0-s0:c0.c1023
> > tcontext=unconfined_u:object_r:user_home_t:s0 tclass=file permissive=1
> > kasan: CONFIG_KASAN_INLINE enabled
> > kasan: GPF could be caused by NULL-ptr deref or user memory access
> > general protection fault:  [#1] PREEMPT SMP KASAN
> > CPU: 1 PID: 8047 Comm: syz-executor778 Not tainted 5.0.0+ #7
> > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> > Google 01/01/2011
> > RIP: 0010:sctp_sched_rr_dequeue+0xd3/0x170 net/sctp/stream_sched_rr.c:141
> The panic was caused by sched->init() resetting stream->rr_next to NULL,
> even though outq->out_chunk_list is not empty.
> 
> We should remove the sched->init() from sctp_stream_init(), since
> all sched info was moved into sout->ext and sctp_stream_alloc_out()
> will not affect it.
> 
I think what you're saying is we can just let sctp_outq_init handle the stream
scheduler initialization, correct?  If so, ACK to that approach.
Neil
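
The failure sequence being described, spelled out (reconstructed from the
report and Xin's analysis above):

        1. DATA chunks are queued, and the rr scheduler's stream->rr_next
           points at a stream that still has chunks on outq->out_chunk_list
        2. sctp_stream_init() runs again for the association and calls
           sched->init(), resetting stream->rr_next to NULL
        3. sctp_outq_flush() -> sctp_sched_rr_dequeue() dereferences the now
           NULL rr_next and takes a general protection fault

Leaving scheduler initialization to sctp_outq_init() ties the scheduler
state to the lifetime of the queue it schedules, which avoids the reset.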

> > Code: ea 03 80 3c 02 00 0f 85 a2 00 00 00 48 8b 5b 08 e8 62 20 ee fa 48 8d
> > 7b 30 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <80> 3c 02 00 75
> > 53 4c 8b 6b 30 4c 89 e7 49 83 ed 18 4c 89 ee e8 b4
> > RSP: 0018:88809eacf040 EFLAGS: 00010206
> > RAX: dc00 RBX:  RCX: 8679cd9f
> > RDX: 0006 RSI: 8681c41e RDI: 0030
> > RBP: 88809eacf058 R08: 8880a12bc300 R09: 0002
> > R10: ed1015d25bcf R11: 8880ae92de7b R12: 88807cae6ca0
> > R13: 88807cae6580 R14: dc00 R15: 88809eacf198
> > FS:  01865880() GS:8880ae90() knlGS:
> > CS:  0010 DS:  ES:  CR0: 80050033
> > CR2: 55af2d491150 CR3: 8dd7b000 CR4: 001406e0
> > DR0:  DR1:  DR2: 
> > DR3:  DR6: fffe0ff0 DR7: 0400
> > Call Trace:
> >   sctp_outq_dequeue_data net/sctp/outqueue.c:90 [inline]
> >   sctp_outq_flush_data net/sctp/outqueue.c:1079 [inline]
> >   sctp_outq_flush+0xba2/0x2790 net/sctp/outqueue.c:1205
> >   sctp_outq_uncork+0x6c/0x80 net/sctp/outqueue.c:772
> >   sctp_cmd_interpreter net/sctp/sm_sideeffect.c:1820 [inline]
> >   sctp_side_effects net/sctp/sm_sideeffect.c:1220 [inline]
> >   sctp_do_sm+0x513/0x5390 net/sctp/sm_sideeffect.c:1191
> >   sctp_assoc_bh_rcv+0x343/0x660 net/sctp/associola.c:1074
> >   sctp_inq_push+0x1ea/0x290 net/sctp/inqueue.c:95
> >   sctp_backlog_rcv+0x189/0xbc0 net/sctp/input.c:354
> >   sk_backlog_rcv include/net/sock.h:937 [inline]
> >   __release_sock+0x12e/0x3a0 net/core/sock.c:2413
> >   release_sock+0x59/0x1c0 net/core/sock.c:2929
> >   sctp_wait_for_connect+0x316/0x540 net/sctp/socket.c:8999
> >   sctp_sendmsg_to_asoc+0x13e2/0x17d0 net/sctp/socket.c:1968
> >   sctp_sendmsg+0x10a9/0x17e0 net/sctp/socket.c:2114
> >   inet_sendmsg+0x147/0x5d0 net/ipv4/af_inet.c:798
> >   sock_sendmsg_nosec net/socket.c:622 [inline]
> >   sock_sendmsg+0xdd/0x130 net/socket.c:632
> >   ___sys_sendmsg+0x806/0x930 net/socket.c:2137
> >   __sys_sendmsg+0x105/0x1d0 net/socket.c:2175
> >   __do_sys_sendmsg net/socket.c:2184 [inline]
> >   __se_sys_sendmsg net/socket.c:2182 [inline]
> >   __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2182
> >   do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
> >   entry_SYSCALL_64_after_hwframe+0x49/0xbe
> > RIP: 0033:0x440159
> > Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8 48 89 f7
> > 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff
> > ff 0f 83 fb 13 fc ff c3 66 2e 0f 1f 84 00 00 00 00
> 

Re: [PATCH net] sctp: get sctphdr by offset in sctp_compute_cksum

2019-02-27 Thread Neil Horman
On Wed, Feb 27, 2019 at 08:53:26PM +0800, Xin Long wrote:
> On Tue, Feb 26, 2019 at 8:29 PM Neil Horman  wrote:
> >
> > On Tue, Feb 26, 2019 at 12:15:54AM +0800, Xin Long wrote:
> > > On Mon, Feb 25, 2019 at 10:08 PM Neil Horman  
> > > wrote:
> > > >
> > > > On Mon, Feb 25, 2019 at 09:20:44PM +0800, Xin Long wrote:
> > > > > On Mon, Feb 25, 2019 at 8:47 PM Neil Horman  
> > > > > wrote:
> > > > > >
> > > > > > On Mon, Feb 25, 2019 at 07:25:37PM +0800, Xin Long wrote:
> > > > > > > sctp_hdr(skb) only works when skb->transport_header is set 
> > > > > > > properly.
> > > > > > >
> > > > > > > But in the path of nf_conntrack_in: sctp_packet() -> sctp_error()
> > > > > > >
> > > > > > > skb->transport_header is not guaranteed to be the right value for
> > > > > > > sctp.  It will cause the checksum check to fail for sctp packets.
> > > > > > >
> > > > > > > So fix it by using offset, which is always right in all places.
> > > > > > >
> > > > > > > Fixes: e6d8b64b34aa ("net: sctp: fix and consolidate SCTP 
> > > > > > > checksumming code")
> > > > > > > Reported-by: Li Shuang 
> > > > > > > Signed-off-by: Xin Long 
> > > > > > > ---
> > > > > > >  include/net/sctp/checksum.h | 2 +-
> > > > > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > > >
> > > > > > > diff --git a/include/net/sctp/checksum.h 
> > > > > > > b/include/net/sctp/checksum.h
> > > > > > > index 32ee65a..1c6e6c0 100644
> > > > > > > --- a/include/net/sctp/checksum.h
> > > > > > > +++ b/include/net/sctp/checksum.h
> > > > > > > @@ -61,7 +61,7 @@ static inline __wsum sctp_csum_combine(__wsum 
> > > > > > > csum, __wsum csum2,
> > > > > > >  static inline __le32 sctp_compute_cksum(const struct sk_buff 
> > > > > > > *skb,
> > > > > > >   unsigned int offset)
> > > > > > >  {
> > > > > > > - struct sctphdr *sh = sctp_hdr(skb);
> > > > > > > + struct sctphdr *sh = (struct sctphdr *)(skb->data + offset);
> > > > > > >   const struct skb_checksum_ops ops = {
> > > > > > >   .update  = sctp_csum_update,
> > > > > > >   .combine = sctp_csum_combine,
> > > > > > > --
> > > > > > > 2.1.0
> > > > > > >
> > > > > > >
> > > > > > Shouldn't you use skb_set_transport_header and skb_transport_header 
> > > > > > here?
> > > > > you mean:
> > > > > skb_set_transport_header(skb, offset);
> > > > > sh = sctp_hdr(skb);
> > > > > ?
> > > > >
> > > > > There's no place counting on here to set transport_header.
> > > > > It will be a kinda redundant job, yet skb is 'const'.
> > > > >
> > > > I'm not sure what you mean by "theres no place counting here".  We have 
> > > > the
> > > > transport header offset, and you're doing the exact same computation 
> > > > that that
> > > > function does.  It seems like we should use it in case the underlying
> > > > implementation changes.
> > > 1. skb_set_transport_header() and sctp_hdr() are like:
> > > skb->transport_header = skb->data - skb->head;
> > > skb->transport_header += offset
> > > sh = skb->head + skb->transport_header;
> > >
> > > 2. in this patch:
> > > sh = (struct sctphdr *)(skb->data + offset);  only
> > >
> > > I think the 2nd one is better.
> > >
> > > I feel it's weird to set transport_header here if it's only for
> > > sctp_hdr(skb) in here.
> > >
> > > As for "underlying implementation changes", I don't know exactly the case
> > > but there are quite a few places doing things like:
> > > *hdr = (struct *hdr *)(skb->data + hdroff);
> > >
> > > I'd think it's safe. no?
> > >
> > Safe, yes, it just doesn't seem right.  I know y

Re: [PATCH net] sctp: get sctphdr by offset in sctp_compute_cksum

2019-02-26 Thread Neil Horman
On Tue, Feb 26, 2019 at 12:15:54AM +0800, Xin Long wrote:
> On Mon, Feb 25, 2019 at 10:08 PM Neil Horman  wrote:
> >
> > On Mon, Feb 25, 2019 at 09:20:44PM +0800, Xin Long wrote:
> > > On Mon, Feb 25, 2019 at 8:47 PM Neil Horman  wrote:
> > > >
> > > > On Mon, Feb 25, 2019 at 07:25:37PM +0800, Xin Long wrote:
> > > > > sctp_hdr(skb) only works when skb->transport_header is set properly.
> > > > >
> > > > > But in the path of nf_conntrack_in: sctp_packet() -> sctp_error()
> > > > >
> > > > > skb->transport_header is not guaranteed to be right value for sctp.
> > > > > It will cause to fail to check the checksum for sctp packets.
> > > > >
> > > > > So fix it by using offset, which is always right in all places.
> > > > >
> > > > > Fixes: e6d8b64b34aa ("net: sctp: fix and consolidate SCTP 
> > > > > checksumming code")
> > > > > Reported-by: Li Shuang 
> > > > > Signed-off-by: Xin Long 
> > > > > ---
> > > > >  include/net/sctp/checksum.h | 2 +-
> > > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/include/net/sctp/checksum.h b/include/net/sctp/checksum.h
> > > > > index 32ee65a..1c6e6c0 100644
> > > > > --- a/include/net/sctp/checksum.h
> > > > > +++ b/include/net/sctp/checksum.h
> > > > > @@ -61,7 +61,7 @@ static inline __wsum sctp_csum_combine(__wsum csum, 
> > > > > __wsum csum2,
> > > > >  static inline __le32 sctp_compute_cksum(const struct sk_buff *skb,
> > > > >   unsigned int offset)
> > > > >  {
> > > > > - struct sctphdr *sh = sctp_hdr(skb);
> > > > > + struct sctphdr *sh = (struct sctphdr *)(skb->data + offset);
> > > > >   const struct skb_checksum_ops ops = {
> > > > >   .update  = sctp_csum_update,
> > > > >   .combine = sctp_csum_combine,
> > > > > --
> > > > > 2.1.0
> > > > >
> > > > >
> > > > Shouldn't you use skb_set_transport_header and skb_transport_header 
> > > > here?
> > > you mean:
> > > skb_set_transport_header(skb, offset);
> > > sh = sctp_hdr(skb);
> > > ?
> > >
> > > There's no place counting on here to set transport_header.
> > > It will be a kinda redundant job, yet skb is 'const'.
> > >
> > I'm not sure what you mean by "theres no place counting here".  We have the
> > transport header offset, and you're doing the exact same computation that 
> > that
> > function does.  It seems like we should use it in case the underlying
> > implementation changes.
> 1. skb_set_transport_header() and sctp_hdr() are like:
> skb->transport_header = skb->data - skb->head;
> skb->transport_header += offset
> sh = skb->head + skb->transport_header;
> 
> 2. in this patch:
> sh = (struct sctphdr *)(skb->data + offset);  only
> 
> I think the 2nd one is better.
> 
> I feel it's weird to set transport_header here if it's only for
> sctp_hdr(skb) in here.
> 
> As for "underlying implementation changes", I don't know exactly the case
> but there are quite a few places doing things like:
> *hdr = (struct *hdr *)(skb->data + hdroff);
> 
> I'd think it's safe. no?
> 
Safe, yes, it just doesn't seem right.  I know you've pointed out several places
below that rapidly compute transport offsets in a one-off fashion, but at the
same time, the other primary transports (tcp and udp) all seem to use the
transport header to do their work (linearizing as necessary, which sctp also
does in sctp_rcv, at least in most cases).
> >
> > I understand what you are saying regarding the use of a const variable 
> > there,
> > but perhaps thats an argument for removing the const storage classifier.  
> > Better
> > still, it would be good to figure out why all paths to this function don't
> > already set the transport header offset to begin with (addressing your 
> > redundant
> > comment)
> The issue was reported when going to nf_conntrack by br_netfilter's
> bridge-nf-call-iptables.
> As you can see on nf_conntrack_in() path, even iphdr is got by:
>iph = skb_header_pointer(skb, nhoff, sizeof(_iph), &_iph);
> It's impossible to set skb->transport_header when we're not sure iphdr
> in linearized memory.
> 
But if the skb isn't linearized, computing the transport header manually isn't
going to help you anyway.  You can see that in skb_header_pointer.  If the
offset they are trying to get to is outside the bounds of the length of the skb
(i.e. the fragmented case), it calls skb_copy_bits to linearize the needed
segment.  It seems we should be doing something similar.  In most cases we are
already linearized from sctp_rcv (possibly all, I need to think about that). All
I'm really saying is that by using the skb APIs we insulate ourselves from
potential changes in how skbs might work in the future.  I'm not strictly bound
to setting the transport header, but we should definitely be getting the
transport header via the skb utility functions wherever possible.
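
To illustrate the safe-fetch pattern (a sketch only; the real checksum code
would need more than this, since sctp_compute_cksum also writes the checksum
field back into the header), the first thing the function could do is:

	struct sctphdr _sh;
	const struct sctphdr *sh;

	/* Falls back to skb_copy_bits() if the header straddles fragments */
	sh = skb_header_pointer(skb, offset, sizeof(_sh), &_sh);
	if (!sh)
		return 0;	/* truncated packet; treat as a bad checksum */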

Neil



Re: [PATCH net] sctp: get sctphdr by offset in sctp_compute_cksum

2019-02-25 Thread Neil Horman
On Mon, Feb 25, 2019 at 09:20:44PM +0800, Xin Long wrote:
> On Mon, Feb 25, 2019 at 8:47 PM Neil Horman  wrote:
> >
> > On Mon, Feb 25, 2019 at 07:25:37PM +0800, Xin Long wrote:
> > > sctp_hdr(skb) only works when skb->transport_header is set properly.
> > >
> > > But in the path of nf_conntrack_in: sctp_packet() -> sctp_error()
> > >
> > > skb->transport_header is not guaranteed to be right value for sctp.
> > > It will cause to fail to check the checksum for sctp packets.
> > >
> > > So fix it by using offset, which is always right in all places.
> > >
> > > Fixes: e6d8b64b34aa ("net: sctp: fix and consolidate SCTP checksumming 
> > > code")
> > > Reported-by: Li Shuang 
> > > Signed-off-by: Xin Long 
> > > ---
> > >  include/net/sctp/checksum.h | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/include/net/sctp/checksum.h b/include/net/sctp/checksum.h
> > > index 32ee65a..1c6e6c0 100644
> > > --- a/include/net/sctp/checksum.h
> > > +++ b/include/net/sctp/checksum.h
> > > @@ -61,7 +61,7 @@ static inline __wsum sctp_csum_combine(__wsum csum, 
> > > __wsum csum2,
> > >  static inline __le32 sctp_compute_cksum(const struct sk_buff *skb,
> > >   unsigned int offset)
> > >  {
> > > - struct sctphdr *sh = sctp_hdr(skb);
> > > + struct sctphdr *sh = (struct sctphdr *)(skb->data + offset);
> > >   const struct skb_checksum_ops ops = {
> > >   .update  = sctp_csum_update,
> > >   .combine = sctp_csum_combine,
> > > --
> > > 2.1.0
> > >
> > >
> > Shouldn't you use skb_set_transport_header and skb_transport_header here?
> you mean:
> skb_set_transport_header(skb, offset);
> sh = sctp_hdr(skb);
> ?
> 
> There's no place counting on here to set transport_header.
> It will be a kinda redundant job, yet skb is 'const'.
> 
I'm not sure what you mean by "theres no place counting here".  We have the
transport header offset, and you're doing the exact same computation that that
function does.  It seems like we should use it in case the underlying
implementation changes. 

I understand what you are saying regarding the use of a const variable there,
but perhaps that's an argument for removing the const storage classifier.  Better
still, it would be good to figure out why all paths to this function don't
already set the transport header offset to begin with (addressing your comment
about redundancy)

Regards
Neil



Re: [PATCH net] sctp: get sctphdr by offset in sctp_compute_cksum

2019-02-25 Thread Neil Horman
On Mon, Feb 25, 2019 at 07:25:37PM +0800, Xin Long wrote:
> sctp_hdr(skb) only works when skb->transport_header is set properly.
> 
> But in the path of nf_conntrack_in: sctp_packet() -> sctp_error()
> 
> skb->transport_header is not guaranteed to be right value for sctp.
> It will cause to fail to check the checksum for sctp packets.
> 
> So fix it by using offset, which is always right in all places.
> 
> Fixes: e6d8b64b34aa ("net: sctp: fix and consolidate SCTP checksumming code")
> Reported-by: Li Shuang 
> Signed-off-by: Xin Long 
> ---
>  include/net/sctp/checksum.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/include/net/sctp/checksum.h b/include/net/sctp/checksum.h
> index 32ee65a..1c6e6c0 100644
> --- a/include/net/sctp/checksum.h
> +++ b/include/net/sctp/checksum.h
> @@ -61,7 +61,7 @@ static inline __wsum sctp_csum_combine(__wsum csum, __wsum 
> csum2,
>  static inline __le32 sctp_compute_cksum(const struct sk_buff *skb,
>   unsigned int offset)
>  {
> - struct sctphdr *sh = sctp_hdr(skb);
> + struct sctphdr *sh = (struct sctphdr *)(skb->data + offset);
>   const struct skb_checksum_ops ops = {
>   .update  = sctp_csum_update,
>   .combine = sctp_csum_combine,
> -- 
> 2.1.0
> 
> 
Shouldn't you use skb_set_transport_header and skb_transport_header here?

Neil



Re: [PATCH 08/15] perf script python: add Python3 support to net_dropmonitor.py

2019-02-25 Thread Neil Horman
On Fri, Feb 22, 2019 at 03:06:12PM -0800, Tony Jones wrote:
> Support both Python2 and Python3 in the net_dropmonitor.py script
> 
> There may be differences in the ordering of output lines due to
> differences in dictionary ordering etc.  However the format within lines
> should be unchanged.
> 
> The use of 'from __future__' implies the minimum supported Python2 version
> is now v2.6
> 
> Signed-off-by: Tony Jones 
> Signed-off-by: Seeteena Thoufeek 
> Cc: Neil Horman 
> ---
>  tools/perf/scripts/python/net_dropmonitor.py | 10 ++
>  1 file changed, 6 insertions(+), 4 deletions(-)
> 
> diff --git a/tools/perf/scripts/python/net_dropmonitor.py 
> b/tools/perf/scripts/python/net_dropmonitor.py
> index a150164b44a3..212557a02c50 100755
> --- a/tools/perf/scripts/python/net_dropmonitor.py
> +++ b/tools/perf/scripts/python/net_dropmonitor.py
> @@ -1,6 +1,8 @@
>  # Monitor the system for dropped packets and proudce a report of drop 
> locations and counts
>  # SPDX-License-Identifier: GPL-2.0
>  
> +from __future__ import print_function
> +
>  import os
>  import sys
>  
> @@ -50,19 +52,19 @@ def get_sym(sloc):
>   return (None, 0)
>  
>  def print_drop_table():
> - print "%25s %25s %25s" % ("LOCATION", "OFFSET", "COUNT")
> + print("%25s %25s %25s" % ("LOCATION", "OFFSET", "COUNT"))
>   for i in drop_log.keys():
>   (sym, off) = get_sym(i)
>   if sym == None:
>   sym = i
> - print "%25s %25s %25s" % (sym, off, drop_log[i])
> + print("%25s %25s %25s" % (sym, off, drop_log[i]))
>  
>  
>  def trace_begin():
> - print "Starting trace (Ctrl-C to dump results)"
> + print("Starting trace (Ctrl-C to dump results)")
>  
>  def trace_end():
> - print "Gathering kallsyms data"
> + print("Gathering kallsyms data")
>   get_kallsyms_table()
>   print_drop_table()
>  
> -- 
> 2.20.1
> 
> 
Acked-by: Neil Horman 


Re: [PATCH net] sctp: set stream ext to NULL after freeing it in sctp_stream_outq_migrate

2019-02-12 Thread Neil Horman
On Tue, Feb 12, 2019 at 06:51:01PM +0800, Xin Long wrote:
> In sctp_stream_init(), after sctp_stream_outq_migrate() freed the
> surplus streams' ext, but sctp_stream_alloc_out() returns -ENOMEM,
> stream->outcnt will not be set to 'outcnt'.
> 
> With the bigger value on stream->outcnt, when closing the assoc and
> freeing its streams, the ext of those surplus streams will be freed
> again since those stream exts were not set to NULL after freeing in
> sctp_stream_outq_migrate(). Then the invalid-free issue reported by
> syzbot would be triggered.
> 
> We fix it by simply setting them to NULL after freeing.
> 
> Fixes: 5e32a431 ("sctp: introduce stream scheduler foundations")
> Reported-by: syzbot+58e480e7b28f2d890...@syzkaller.appspotmail.com
> Signed-off-by: Xin Long 
> ---
>  net/sctp/stream.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/net/sctp/stream.c b/net/sctp/stream.c
> index f246331..2936ed1 100644
> --- a/net/sctp/stream.c
> +++ b/net/sctp/stream.c
> @@ -144,8 +144,10 @@ static void sctp_stream_outq_migrate(struct sctp_stream 
> *stream,
>   }
>   }
>  
> - for (i = outcnt; i < stream->outcnt; i++)
> + for (i = outcnt; i < stream->outcnt; i++) {
>   kfree(SCTP_SO(stream, i)->ext);
> + SCTP_SO(stream, i)->ext = NULL;
> +     }
>  }
>  
>  static int sctp_stream_alloc_out(struct sctp_stream *stream, __u16 outcnt,
> -- 
> 2.1.0
> 
> 
Acked-by: Neil Horman 



Re: [PATCH net] sctp: call gso_reset_checksum when computing checksum in sctp_gso_segment

2019-02-12 Thread Neil Horman
On Tue, Feb 12, 2019 at 06:47:30PM +0800, Xin Long wrote:
> Jianlin reported a panic when running sctp gso over gre over vlan device:
> 
>   [   84.772930] RIP: 0010:do_csum+0x6d/0x170
>   [   84.790605] Call Trace:
>   [   84.791054]  csum_partial+0xd/0x20
>   [   84.791657]  gre_gso_segment+0x2c3/0x390
>   [   84.792364]  inet_gso_segment+0x161/0x3e0
>   [   84.793071]  skb_mac_gso_segment+0xb8/0x120
>   [   84.793846]  __skb_gso_segment+0x7e/0x180
>   [   84.794581]  validate_xmit_skb+0x141/0x2e0
>   [   84.795297]  __dev_queue_xmit+0x258/0x8f0
>   [   84.795949]  ? eth_header+0x26/0xc0
>   [   84.796581]  ip_finish_output2+0x196/0x430
>   [   84.797295]  ? skb_gso_validate_network_len+0x11/0x80
>   [   84.798183]  ? ip_finish_output+0x169/0x270
>   [   84.798875]  ip_output+0x6c/0xe0
>   [   84.799413]  ? ip_append_data.part.50+0xc0/0xc0
>   [   84.800145]  iptunnel_xmit+0x144/0x1c0
>   [   84.800814]  ip_tunnel_xmit+0x62d/0x930 [ip_tunnel]
>   [   84.801699]  gre_tap_xmit+0xac/0xf0 [ip_gre]
>   [   84.802395]  dev_hard_start_xmit+0xa5/0x210
>   [   84.803086]  sch_direct_xmit+0x14f/0x340
>   [   84.803733]  __dev_queue_xmit+0x799/0x8f0
>   [   84.804472]  ip_finish_output2+0x2e0/0x430
>   [   84.805255]  ? skb_gso_validate_network_len+0x11/0x80
>   [   84.806154]  ip_output+0x6c/0xe0
>   [   84.806721]  ? ip_append_data.part.50+0xc0/0xc0
>   [   84.807516]  sctp_packet_transmit+0x716/0xa10 [sctp]
>   [   84.808337]  sctp_outq_flush+0xd7/0x880 [sctp]
> 
> It was caused by SKB_GSO_CB(skb)->csum_start not set in sctp_gso_segment.
> sctp_gso_segment() calls skb_segment() with 'feature | NETIF_F_HW_CSUM',
> which causes SKB_GSO_CB(skb)->csum_start not to be set in skb_segment().
> 
> For TCP/UDP, when feature supports HW_CSUM, CHECKSUM_PARTIAL will be set
> and gso_reset_checksum will be called to set SKB_GSO_CB(skb)->csum_start.
> 
> So SCTP should do the same as TCP/UDP, to call gso_reset_checksum() when
> computing checksum in sctp_gso_segment.
> 
> Reported-by: Jianlin Shi 
> Signed-off-by: Xin Long 
> ---
>  net/sctp/offload.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/net/sctp/offload.c b/net/sctp/offload.c
> index 123e9f2..edfcf16 100644
> --- a/net/sctp/offload.c
> +++ b/net/sctp/offload.c
> @@ -36,6 +36,7 @@ static __le32 sctp_gso_make_checksum(struct sk_buff *skb)
>  {
>   skb->ip_summed = CHECKSUM_NONE;
>   skb->csum_not_inet = 0;
> + gso_reset_checksum(skb, ~0);
>   return sctp_compute_cksum(skb, skb_transport_offset(skb));
>  }
>  
> -- 
> 2.1.0
> 
> 
> 
Acked-by: Neil Horman 



Re: [PATCH net] sctp: make sctp_setsockopt_events() less strict about the option length

2019-02-11 Thread Neil Horman
On Sun, Feb 10, 2019 at 10:46:16AM -0200, Marcelo Ricardo Leitner wrote:
> On Sat, Feb 09, 2019 at 03:12:17PM -0800, David Miller wrote:
> > From: Marcelo Ricardo Leitner 
> > Date: Wed, 6 Feb 2019 18:37:54 -0200
> > 
> > > On Wed, Feb 06, 2019 at 12:14:30PM -0800, Julien Gomes wrote:
> > >> Make sctp_setsockopt_events() able to accept sctp_event_subscribe
> > >> structures longer than the current definitions.
> > >> 
> > >> This should prevent unjustified setsockopt() failures due to struct
> > >> sctp_event_subscribe extensions (as in 4.11 and 4.12) when using
> > >> binaries that should be compatible, but were built with later kernel
> > >> uapi headers.
> > > 
> > > Not sure if we support backwards compatibility like this?
> > 
> > What a complete mess we have here.
> > 
> > Use new socket option numbers next time, do not change the size and/or
> > layout of existing socket options.
> 
> What about reusing the same socket option, but defining a new struct?
> Say, MYSOCKOPT supports struct mysockopt, struct mysockopt2, struct
> mysockopt3...
> 
> That way we have a clear definition of the user's intent.
> 
That's possible, but I think that's pretty equivalent to what Dave's saying, in
that he wants us to identify all the sizes of this struct from the git history
and act on them accordingly.  Having internal versions of the struct seems like
a fine way to get there, but I think we need to consider how we got to this
situation before we go down the implementation path.

> > 
> > This whole thread, if you read it, is basically "if we compatability
> > this way, that breaks, and if we do compatability this other way oh
> > shit this other thing doesn't work."
> > 
> > I think we really need to specifically check for the difference sizes
> > that existed one by one, clear out the part not given by the user, and
> > backport this as far back as possible in a way that in the older kernels
> > we see if the user is actually trying to use the new features and if so
> > error out.
> 
> I'm afraid clearing out may not be enough, though seems it's the best
> we can do so far. If the struct is allocated but not fully initialized
> via a memset, but by setting its fields one by one, the remaining new
> fields will be left uninitinialized.
> 

I'm not sure this even makes sense.  Currently (as I understand it), the issue
we are facing is the one in which an application is built against a newer kernel
and run on an older one, the implication being that the application will
pass in a buffer that is larger than what the kernel expects.  In that
situation, clearing isn't needed; all that's needed (I think) is a scan of the
space between sizeof(kernel struct version) and sizeof(userspace struct
version) to see if any bits are non-zero.  If they are, we error out; otherwise,
we ignore the space and move forward as though that overage doesn't exist.

Mind you, I'm not (yet) advocating for that approach, just trying to clarify
what's needed.
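
Something like this is the check I have in mind (a rough sketch, assuming 'buf'
is a kernel-side copy of all optlen bytes from userspace):

	/* Accept a larger-than-known struct only if every byte past the
	 * size this kernel knows about is zero.
	 */
	if (optlen > sizeof(struct sctp_event_subscribe) &&
	    memchr_inv(buf + sizeof(struct sctp_event_subscribe), 0,
		       optlen - sizeof(struct sctp_event_subscribe)))
		return -EINVAL;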
> > 
> > Which, btw, is terrible behavior.  Newly compiled apps should work on
> > older kernels if they don't try to use the new features, and if they
> 
> One use case here is: a given distro is using kernel X and app Foo is
> built against it. Then upgrades to X+1, Foo is patched to fix an issue
> and is rebuilt against X+1. The user upgrades Foo package but for
> whatever reason, doesn't upgrade kernel or reboot the system. Here,
> Foo doesn't work anymore until the new kernel is also running.
> 
Yes, thats the use case that we're trying to address.

> > can the ones that want to try to use the new features should be able
> > to fall back when that feature isn't available in a non-ambiguous
> > and precisely defined way.
> > 
> > The fact that the use of the new feature is hidden in the new
> > structure elements is really rotten.
> > 
> > This patch, at best, needs some work and definitely a longer and more
> > detailed commit message.
> 
FWIW, before we decide on a course of action, I think I need to point out that,
over the last 10 years, we've extended this structure 6 times, in the following
commits:
0f3fffd8ab1db
7e8616d8e7731
e1cdd553d482c
35ea82d611da5
c95129d127c6d
b444153fb5a64

The first two I believe were modifications from the period when sctp was
actually being integrated into the kernel, but the last 4 were definitely done
during more recent development periods and went in without any commentary about
the impact to UAPI compatibility.  The check for optlen > sizeof(struct
sctp_event_subscribe) was added back in 2008, and while not spelled out, seems
pretty clearly directed at enforcing compatibility with older applications, not
compatibility with newer applications running on older kernels.

I really worry about situations in which we need to support applications
expecting features that the running kernel doesn't have.  In this particular
situation it seems like a fixable thing, but I could envision situations in
which we just can't do it, and I don't want to 

Re: [PATCH net] sctp: make sctp_setsockopt_events() less strict about the option length

2019-02-08 Thread Neil Horman
On Fri, Feb 08, 2019 at 09:53:03AM +, David Laight wrote:
> From: 'Marcelo Ricardo Leitner'
> > Sent: 07 February 2019 17:47
> ...
> > > > Maybe what we want(ed) here then is explicit versioning, to have the 3
> > > > definitions available. Then the application is able to use, say struct
> > > > sctp_event_subscribe, and be happy with it, while there is struct
> > > > sctp_event_subscribe_v2 and struct sctp_event_subscribe_v3 there too.
> > > >
> > > > But it's too late for that now because that would break applications
> > > > already using the new fields in sctp_event_subscribe.
> > >
> > > It is probably better to break the recompilation of the few programs
> > > that use the new fields (and have them not work on old kernels)
> > > than to stop recompilations of old programs stop working on old
> > > kernels or have requested new options silently ignored.
> > 
> > I got confused here, not sure what you mean. Seems there is one "stop"
> > word too many.
> 
> More confusing than I intended...
> 
> With the current kernel and headers a 'new program' (one that
> needs the new options) will fail to run on an old kernel - which is good.
> However a recompilation of an 'old program' (that doesn't use
> the new options) will also fail to run on an old kernel - which is bad.
> 
I disagree with this, at least as a unilateral statement.  I would assert that
an old program, within the constraints of the issue being discussed here, will
run perfectly well when built and run against a new kernel.

At issue is the size of the structure sctp_event_subscribe, and the fact that in
several instances over the last few years, its been extended to be larger and
encompass more events to subscribe to.

Nominally an application will use this structure (roughly) as follows:

...
struct sctp_event_subscribe events;
size_t evsize = sizeof(events);

memset(&events, 0, sizeof(events));

events.sctp_send_failure_event = 1; /* example event subscription */

if (setsockopt(sctp_fd, SOL_SCTP, SCTP_EVENTS, &events, evsize) < 0) {
	/* do error recovery */
}




Assume this code will be built and run against kernel versions A and B, in
which:
A) has a struct sctp_event_subscribe with a size of 9 bytes
B) has a struct sctp_event_subscribe with a size of 10 bytes (due to the added
field sctp_sender_dry_event)

That gives us 4 cases to handle

1) Application build against kernel A and run on kernel A.  This works fine, the
sizes of the struct in question will always match

2) Application is built against kernel A and run on kernel B.  In this case,
everything will work because the application passes a buffer of size 9, and the
kernel accepts it, because it allows for buffers to be shorter than the current
struct sctp_event_subscribe size. The kernel simply operates on the options
available in the buffer.  The application is none the wiser, because it has no
knoweldge of the new option, nor should it because it was built against kernel
A, that never offered that option

3) Application is built against kernel B and run on kernel B.  This works fine
for the same reason as (1).

4) Application is built against kernel B and run on kernel A.  This will break,
and rightly so, because the application is passing a buffer that is larger than
what the kernel expects.  The application is passing in a buffer that is
incompatible with what the running kernel expects.

We could look into ways to detect the cases in which this might be
'ok', but I don't see why we should bother, because at some point it's still an
error to pass in an incompatible buffer.  In my mind this is no different than
trying to run a program that allocates hugepages on a kernel that doesn't
support hugepages (just to make up an example).  Applications built against
newer kernels can't expect all the features/semantics/etc to be identical to
older kernels.
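
For reference, the rejection in case 4 comes from the existing guard in
sctp_setsockopt_events():

	if (optlen > sizeof(struct sctp_event_subscribe))
		return -EINVAL;

i.e. kernel A sees a 10 byte buffer against its 9 byte struct and rightly
refuses it.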

> Changing the kernel to ignore extra events flags breaks the 'new'
> program.
> 
It shouldn't.  Assuming you have a program built against headers from kernel B
(above), if you set a field in the structure that only exists in kernel B and
try to run it on kernel A, you will get an EINVAL return, which is correct
behavior, because you are attempting to deliver information to the kernel that
kernel A (the running kernel) doesn't know about.

> Versioning the structure now (even though it should have been done
> earlier) won't change the behaviour of existing binaries.
> 
I won't disagree about the niceness of versioning, but that ship has sailed.

> However a recompilation of an 'old' program would use the 'old'
> structure and work on old kernels.
To be clear,  this is situation (1) above, and yeah, running on the kernel you
built your application against should always work from a compatibility
standpoint. 

> Attempts to recompile a 'new' program will fail - until the structure
> name (or some #define to enable the extra fields) is changed.
> 
Yes, but this is always the case for structures that 

Re: [PATCH net] sctp: make sctp_setsockopt_events() less strict about the option length

2019-02-07 Thread Neil Horman
On Wed, Feb 06, 2019 at 01:48:44PM -0800, Julien Gomes wrote:
> 
> 
> On 2/6/19 1:39 PM, Neil Horman wrote:
> > On Wed, Feb 06, 2019 at 01:26:55PM -0800, Julien Gomes wrote:
> >>
> >>
> >> On 2/6/19 1:07 PM, Marcelo Ricardo Leitner wrote:
> >>> On Wed, Feb 06, 2019 at 12:48:38PM -0800, Julien Gomes wrote:
> >>>>
> >>>>
> >>>> On 2/6/19 12:37 PM, Marcelo Ricardo Leitner wrote:
> >>>>> On Wed, Feb 06, 2019 at 12:14:30PM -0800, Julien Gomes wrote:
> >>>>>> Make sctp_setsockopt_events() able to accept sctp_event_subscribe
> >>>>>> structures longer than the current definitions.
> >>>>>>
> >>>>>> This should prevent unjustified setsockopt() failures due to struct
> >>>>>> sctp_event_subscribe extensions (as in 4.11 and 4.12) when using
> >>>>>> binaries that should be compatible, but were built with later kernel
> >>>>>> uapi headers.
> >>>>>
> >>>>> Not sure if we support backwards compatibility like this?
> >>>>>
> >>>>> My issue with this change is that by doing this, application will have
> >>>>> no clue if the new bits were ignored or not and it may think that an
> >>>>> event is enabled while it is not.
> >>>>>
> >>>>> A workaround would be to do a getsockopt and check the size that was
> >>>>> returned. But then, it might as well use the right struct here in the
> >>>>> first place.
> >>>>>
> >>>>> I'm seeing current implementation as an implicitly versioned argument:
> >>>>> it will always accept setsockopt calls with an old struct (v4.11 or
> >>>>> v4.12), but if the user tries to use v3 on a v1-only system, it will
> >>>>> be rejected. Pretty much like using a newer setsockopt on an old
> >>>>> system.
> >>>>
> >>>> With the current implementation, given sources that say are supposed to
> >>>> run on a 4.9 kernel (no use of any newer field added in 4.11 or 4.12),
> >>>> we can't rebuild the exact same sources on a 4.19 kernel and still run
> >>>> them on 4.9 without messing with structures re-definition.
> >>>
> >>> Maybe what we want(ed) here then is explicit versioning, to have the 3
> >>> definitions available. Then the application is able to use, say struct
> >>> sctp_event_subscribe, and be happy with it, while there is struct
> >>> sctp_event_subscribe_v2 and struct sctp_event_subscribe_v3 there too.
> >>>
> >>> But it's too late for that now because that would break applications
> >>> already using the new fields in sctp_event_subscribe.
> >>
> >> Right.
> >>
> >>>
> >>>>
> >>>> I understand your point, but this still looks like a sort of uapi
> >>>> breakage to me.
> >>>
> >>> Not disagreeing. I really just don't know how supported that is.
> >>> Willing to know so I can pay more attention to this on future changes.
> >>>
> >>> Btw, is this the only occurrence?
> >>
> >> Can't really say, this is one I witnessed, I haven't really looked for
> >> others.
> >>
> >>>
> >>>>
> >>>>
> >>>> I also had another way to work-around this in mind, by copying optlen
> >>>> bytes and checking that any additional field (not included in the
> >>>> "current" kernel structure definition) is not set, returning EINVAL in
> >>>> such case to keep a similar to current behavior.
> >>>> The issue with this is that I didn't find a suitable (ie not totally
> >>>> arbitrary such as "twice the existing structure size") upper limit to
> >>>> optlen.
> >>>
> >>> Seems interesting. Why would it need that upper limit to optlen?
> >>>
> >>> Say struct v1 had 4 bytes, v3 now had 12. The user supplies 12 bytes
> >>> to the kernel that only knows about 4 bytes. It can check that (12-4)
> >>> bytes in the end, make sure no bit is on and use only the first 4.
> >>>
> >>> The fact that it was 12 or 200 shouldn't matter, should it? As long as
> >>> the (200-4) bytes are 0'ed, only the first 4 will be used and it
> >>> should be ok, otherwise EINVAL. No need to know

Re: [PATCH net] sctp: make sctp_setsockopt_events() less strict about the option length

2019-02-07 Thread Neil Horman
On Wed, Feb 06, 2019 at 01:48:42PM -0800, Julien Gomes wrote:
> 
> 
> On 2/6/19 1:23 PM, Neil Horman wrote:
> > On Wed, Feb 06, 2019 at 07:07:23PM -0200, Marcelo Ricardo Leitner wrote:
> >> On Wed, Feb 06, 2019 at 12:48:38PM -0800, Julien Gomes wrote:
> >>>
> >>>
> >>> On 2/6/19 12:37 PM, Marcelo Ricardo Leitner wrote:
> >>>> On Wed, Feb 06, 2019 at 12:14:30PM -0800, Julien Gomes wrote:
> >>>>> Make sctp_setsockopt_events() able to accept sctp_event_subscribe
> >>>>> structures longer than the current definitions.
> >>>>>
> >>>>> This should prevent unjustified setsockopt() failures due to struct
> >>>>> sctp_event_subscribe extensions (as in 4.11 and 4.12) when using
> >>>>> binaries that should be compatible, but were built with later kernel
> >>>>> uapi headers.
> >>>>
> >>>> Not sure if we support backwards compatibility like this?
> >>>>
> >>>> My issue with this change is that by doing this, application will have
> >>>> no clue if the new bits were ignored or not and it may think that an
> >>>> event is enabled while it is not.
> >>>>
> >>>> A workaround would be to do a getsockopt and check the size that was
> >>>> returned. But then, it might as well use the right struct here in the
> >>>> first place.
> >>>>
> >>>> I'm seeing current implementation as an implicitly versioned argument:
> >>>> it will always accept setsockopt calls with an old struct (v4.11 or
> >>>> v4.12), but if the user tries to use v3 on a v1-only system, it will
> >>>> be rejected. Pretty much like using a newer setsockopt on an old
> >>>> system.
> >>>
> >>> With the current implementation, given sources that say are supposed to
> >>> run on a 4.9 kernel (no use of any newer field added in 4.11 or 4.12),
> >>> we can't rebuild the exact same sources on a 4.19 kernel and still run
> >>> them on 4.9 without messing with structures re-definition.
> >>
> >> Maybe what we want(ed) here then is explicit versioning, to have the 3
> >> definitions available. Then the application is able to use, say struct
> >> sctp_event_subscribe, and be happy with it, while there is struct
> >> sctp_event_subscribe_v2 and struct sctp_event_subscribe_v3 there too.
> >>
> >> But it's too late for that now because that would break applications
> >> already using the new fields in sctp_event_subscribe.
> >>
> > Yeah, I'm not supportive of codifying that knoweldge in the kernel.  If we 
> > were
> > to support bi-directional versioning, I would encode it into lksctp-tools 
> > rather
> > than the kernel.
> 
> I'm not sure that forcing a library on users is a good reason to break UAPI.
> 
That's a misleading statement.  We've never supported running newer applications
on older kernels, and no one is forcing anyone to use the lksctp-tools library;
I was only suggesting that, if we were to support this compatibility, that might
be a place to offer it.

It's also worth noting that we have precedent for this.  If you look at the git
log, this particular structure has been extended about 6 times in the life of
sctp.

> > 
> >>>
> >>> I understand your point, but this still looks like a sort of uapi
> >>> breakage to me.
> >>
> >> Not disagreeing. I really just don't know how supported that is.
> >> Willing to know so I can pay more attention to this on future changes.
> >>
> >> Btw, is this the only occurrence?
> >>
> > No, I think there are a few others (maybe paddrparams?)
> > 
> >>>
> >>>
> >>> I also had another way to work-around this in mind, by copying optlen
> >>> bytes and checking that any additional field (not included in the
> >>> "current" kernel structure definition) is not set, returning EINVAL in
> >>> such case to keep a similar to current behavior.
> >>> The issue with this is that I didn't find a suitable (ie not totally
> >>> arbitrary such as "twice the existing structure size") upper limit to
> >>> optlen.
> >>
> >> Seems interesting. Why would it need that upper limit to optlen?
> >>
> > I think the thought was to differentiate between someone passing a legit 
> > larger
> > structure from a newer UAPI, from someone just passing in a massive
> >

Re: [PATCH net] sctp: make sctp_setsockopt_events() less strict about the option length

2019-02-06 Thread Neil Horman
On Wed, Feb 06, 2019 at 01:26:55PM -0800, Julien Gomes wrote:
> 
> 
> On 2/6/19 1:07 PM, Marcelo Ricardo Leitner wrote:
> > On Wed, Feb 06, 2019 at 12:48:38PM -0800, Julien Gomes wrote:
> >>
> >>
> >> On 2/6/19 12:37 PM, Marcelo Ricardo Leitner wrote:
> >>> On Wed, Feb 06, 2019 at 12:14:30PM -0800, Julien Gomes wrote:
>  Make sctp_setsockopt_events() able to accept sctp_event_subscribe
>  structures longer than the current definitions.
> 
>  This should prevent unjustified setsockopt() failures due to struct
>  sctp_event_subscribe extensions (as in 4.11 and 4.12) when using
>  binaries that should be compatible, but were built with later kernel
>  uapi headers.
> >>>
> >>> Not sure if we support backwards compatibility like this?
> >>>
> >>> My issue with this change is that by doing this, application will have
> >>> no clue if the new bits were ignored or not and it may think that an
> >>> event is enabled while it is not.
> >>>
> >>> A workaround would be to do a getsockopt and check the size that was
> >>> returned. But then, it might as well use the right struct here in the
> >>> first place.
> >>>
> >>> I'm seeing current implementation as an implicitly versioned argument:
> >>> it will always accept setsockopt calls with an old struct (v4.11 or
> >>> v4.12), but if the user tries to use v3 on a v1-only system, it will
> >>> be rejected. Pretty much like using a newer setsockopt on an old
> >>> system.
> >>
> >> With the current implementation, given sources that say are supposed to
> >> run on a 4.9 kernel (no use of any newer field added in 4.11 or 4.12),
> >> we can't rebuild the exact same sources on a 4.19 kernel and still run
> >> them on 4.9 without messing with structures re-definition.
> > 
> > Maybe what we want(ed) here then is explicit versioning, to have the 3
> > definitions available. Then the application is able to use, say struct
> > sctp_event_subscribe, and be happy with it, while there is struct
> > sctp_event_subscribe_v2 and struct sctp_event_subscribe_v3 there too.
> > 
> > But it's too late for that now because that would break applications
> > already using the new fields in sctp_event_subscribe.
> 
> Right.
> 
> > 
> >>
> >> I understand your point, but this still looks like a sort of uapi
> >> breakage to me.
> > 
> > Not disagreeing. I really just don't know how supported that is.
> > Willing to know so I can pay more attention to this on future changes.
> > 
> > Btw, is this the only occurrence?
> 
> Can't really say, this is one I witnessed, I haven't really looked for
> others.
> 
> > 
> >>
> >>
> >> I also had another way to work-around this in mind, by copying optlen
> >> bytes and checking that any additional field (not included in the
> >> "current" kernel structure definition) is not set, returning EINVAL in
> >> such case to keep a similar to current behavior.
> >> The issue with this is that I didn't find a suitable (ie not totally
> >> arbitrary such as "twice the existing structure size") upper limit to
> >> optlen.
> > 
> > Seems interesting. Why would it need that upper limit to optlen?
> > 
> > Say struct v1 had 4 bytes, v3 now had 12. The user supplies 12 bytes
> > to the kernel that only knows about 4 bytes. It can check that (12-4)
> > bytes in the end, make sure no bit is on and use only the first 4.
> > 
> > The fact that it was 12 or 200 shouldn't matter, should it? As long as
> > the (200-4) bytes are 0'ed, only the first 4 will be used and it
> > should be ok, otherwise EINVAL. No need to know how big the current
> > current actually is because it wouldn't be validating that here: just
> > that it can safely use the first 4 bytes.
> 
> The upper limit concern is more regarding the call to copy_from_user
> with an unrestricted/unchecked value.
copy_from_user should be safe to copy an arbitrary amount; the only restriction
is that optlen can't exceed the size of the buffer receiving the data in the
kernel.  From that standpoint your patch is safe.  However, that exposes the
problem of checking any tail data in the userspace buffer.  That is to say, if
you want to ensure that any extra data that gets sent from userspace isn't
'set', you would have to copy that extra data in consumable chunks and check
them individually, and that screams DOS to me (i.e. imagine a user passing in a
4GB buffer, and having to wait for the kernel to copy each X-sized chunk,
looking for non-zero values).
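
Just to spell out why that bothers me, the tail check would have to look
something like this (rough sketch; the chunk size and names are made up):

	char chunk[256];
	const char __user *p = optval + sizeof(struct sctp_event_subscribe);
	size_t left = optlen - sizeof(struct sctp_event_subscribe);

	while (left) {
		size_t n = min(left, sizeof(chunk));

		if (copy_from_user(chunk, p, n))
			return -EFAULT;
		if (memchr_inv(chunk, 0, n))
			return -EINVAL;	/* unknown event bits set */
		p += n;
		left -= n;
	}

An unprivileged caller gets to make the kernel copy and scan as much memory as
it cares to hand us.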

> I am not sure of how much of a risk/how exploitable this could be,
> that's why I cautiously wanted to limit it in the first place just in case.
> 
> > 
> >>
> >>>
> 
>  Signed-off-by: Julien Gomes 
>  ---
>   net/sctp/socket.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
>  diff --git a/net/sctp/socket.c b/net/sctp/socket.c
>  index 9644bdc8e85c..f9717e2789da 100644
>  --- a/net/sctp/socket.c
>  +++ b/net/sctp/socket.c
>  @@ -2311,7 +2311,7 @@ static int sctp_setsockopt_events(struct 

Re: [PATCH net] sctp: make sctp_setsockopt_events() less strict about the option length

2019-02-06 Thread Neil Horman
On Wed, Feb 06, 2019 at 07:07:23PM -0200, Marcelo Ricardo Leitner wrote:
> On Wed, Feb 06, 2019 at 12:48:38PM -0800, Julien Gomes wrote:
> > 
> > 
> > On 2/6/19 12:37 PM, Marcelo Ricardo Leitner wrote:
> > > On Wed, Feb 06, 2019 at 12:14:30PM -0800, Julien Gomes wrote:
> > >> Make sctp_setsockopt_events() able to accept sctp_event_subscribe
> > >> structures longer than the current definitions.
> > >>
> > >> This should prevent unjustified setsockopt() failures due to struct
> > >> sctp_event_subscribe extensions (as in 4.11 and 4.12) when using
> > >> binaries that should be compatible, but were built with later kernel
> > >> uapi headers.
> > > 
> > > Not sure if we support backwards compatibility like this?
> > > 
> > > My issue with this change is that by doing this, application will have
> > > no clue if the new bits were ignored or not and it may think that an
> > > event is enabled while it is not.
> > > 
> > > A workaround would be to do a getsockopt and check the size that was
> > > returned. But then, it might as well use the right struct here in the
> > > first place.
> > > 
> > > I'm seeing current implementation as an implicitly versioned argument:
> > > it will always accept setsockopt calls with an old struct (v4.11 or
> > > v4.12), but if the user tries to use v3 on a v1-only system, it will
> > > be rejected. Pretty much like using a newer setsockopt on an old
> > > system.
> > 
> > With the current implementation, given sources that say are supposed to
> > run on a 4.9 kernel (no use of any newer field added in 4.11 or 4.12),
> > we can't rebuild the exact same sources on a 4.19 kernel and still run
> > them on 4.9 without messing with structures re-definition.
> 
> Maybe what we want(ed) here then is explicit versioning, to have the 3
> definitions available. Then the application is able to use, say struct
> sctp_event_subscribe, and be happy with it, while there is struct
> sctp_event_subscribe_v2 and struct sctp_event_subscribe_v3 there too.
> 
> But it's too late for that now because that would break applications
> already using the new fields in sctp_event_subscribe.
> 
Yeah, I'm not supportive of codifying that knoweldge in the kernel.  If we were
to support bi-directional versioning, I would encode it into lksctp-tools rather
than the kernel.

> > 
> > I understand your point, but this still looks like a sort of uapi
> > breakage to me.
> 
> Not disagreeing. I really just don't know how supported that is.
> Willing to know so I can pay more attention to this on future changes.
> 
> Btw, is this the only occurrence?
> 
No, I think there are a few others (maybe paddrparams?)

> > 
> > 
> > I also had another way to work-around this in mind, by copying optlen
> > bytes and checking that any additional field (not included in the
> > "current" kernel structure definition) is not set, returning EINVAL in
> > such case to keep a similar to current behavior.
> > The issue with this is that I didn't find a suitable (ie not totally
> > arbitrary such as "twice the existing structure size") upper limit to
> > optlen.
> 
> Seems interesting. Why would it need that upper limit to optlen?
> 
I think the thought was to differentiate someone passing a legitimate larger
structure from a newer UAPI from someone just passing in a massive,
inappropriately sized buffer (even if the return on both is the same).

> Say struct v1 had 4 bytes, v3 now had 12. The user supplies 12 bytes
> to the kernel that only knows about 4 bytes. It can check that (12-4)
> bytes in the end, make sure no bit is on and use only the first 4.
> 
> The fact that it was 12 or 200 shouldn't matter, should it? As long as
> the (200-4) bytes are 0'ed, only the first 4 will be used and it
> should be ok, otherwise EINVAL. No need to know how big the current
> current actually is because it wouldn't be validating that here: just
> that it can safely use the first 4 bytes.
> 
I'm less than excited about making the kernel check an unbounded user space
buffer; that seems like a potential DOS attack from an unprivileged user to
me.  I'm also still hung up on the notion that, no matter how we do this, this
patch is going into the latest kernel, so it will only work on a kernel that
already understands the most recent set of subscriptions.  It would only help
if we someday extended this struct again, someone built against that newer
UAPI, and then tried to run it on a kernel that had this patch.

FWIW, there is an existing implied method to determine the available
subscription events: sctp_getsockopt_events clamps the size of the output
buffer, and returns that information in the optlen field via put_user.  An
application that was built against the 4.19 UAPI could pass in the 4.19
sctp_event_subscribe struct to sctp_getsockopt_events, and read the output
length, which would inform the application of the events that the kernel is
capable of reporting, and limit itself to only using those events.  It's not a
perfect 
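
To make that probing approach concrete, something like this (hypothetical
userspace code, untested) is what I'm picturing:

	struct sctp_event_subscribe events;	/* from the 4.19 headers */
	socklen_t len = sizeof(events);

	if (getsockopt(fd, SOL_SCTP, SCTP_EVENTS, &events, &len) == 0) {
		/* 'len' now reflects how much of the struct the running
		 * kernel understands; only subscribe to events within the
		 * first 'len' bytes.
		 */
	}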

Re: [PATCH net] sctp: make sctp_setsockopt_events() less strict about the option length

2019-02-06 Thread Neil Horman
On Wed, Feb 06, 2019 at 12:48:38PM -0800, Julien Gomes wrote:
> 
> 
> On 2/6/19 12:37 PM, Marcelo Ricardo Leitner wrote:
> > On Wed, Feb 06, 2019 at 12:14:30PM -0800, Julien Gomes wrote:
> >> Make sctp_setsockopt_events() able to accept sctp_event_subscribe
> >> structures longer than the current definitions.
> >>
> >> This should prevent unjustified setsockopt() failures due to struct
> >> sctp_event_subscribe extensions (as in 4.11 and 4.12) when using
> >> binaries that should be compatible, but were built with later kernel
> >> uapi headers.
> > 
> > Not sure if we support backwards compatibility like this?
> > 
> > My issue with this change is that by doing this, application will have
> > no clue if the new bits were ignored or not and it may think that an
> > event is enabled while it is not.
> > 
> > A workaround would be to do a getsockopt and check the size that was
> > returned. But then, it might as well use the right struct here in the
> > first place.
> > 
> > I'm seeing current implementation as an implicitly versioned argument:
> > it will always accept setsockopt calls with an old struct (v4.11 or
> > v4.12), but if the user tries to use v3 on a v1-only system, it will
> > be rejected. Pretty much like using a newer setsockopt on an old
> > system.
> 
> With the current implementation, given sources that say are supposed to
> run on a 4.9 kernel (no use of any newer field added in 4.11 or 4.12),
What given sources say that?  I understand it might be expected, but this is a
common concern with the setsockopt method on many protocols; it just so happens
that sctp extends its option structures more than other protocols do.

> we can't rebuild the exact same sources on a 4.19 kernel and still run
> them on 4.9 without messing with structures re-definition.
> 
Right, put another way: we support backward compatibility with older userspace
applications, but not newer ones.  I.e. if you build an application against the
4.9 SCTP API, it should work with the 4.19 UAPI, but not vice versa, which
seems to be what you are trying to do here.

> I understand your point, but this still looks like a sort of uapi
> breakage to me.
> 
> 
> I also had another way to work-around this in mind, by copying optlen
> bytes and checking that any additional field (not included in the
> "current" kernel structure definition) is not set, returning EINVAL in
> such case to keep a similar to current behavior.
> The issue with this is that I didn't find a suitable (ie not totally
> arbitrary such as "twice the existing structure size") upper limit to
> optlen.
> 
There is no real upper limit to the size of the structure in this case, and
IIRC this isn't the only sockopt structure that can be extended for SCTP in this
way.

I really don't see a sane way to allow newer userspaces to be compatible with
older kernels here.  If we were to do it, I would suggest moving the
responsibility for that feature into lksctp-tools, versioning that library such
that the corresponding symbols are versioned to translate the application's view
of the socket option structs to the size and format that the running kernel
understands.  Note that I'm not really advocating for that, as it seems like a
fast-moving target, but if we were to do it I think that would be the most sane
way to handle it.
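
For what it's worth, the mechanism I'm picturing there is plain symbol
versioning.  A sketch, with entirely hypothetical names (lksctp-tools has no
such call today):

	/* Old binaries keep resolving to the old-layout translation,
	 * newly built binaries get the new one.
	 */
	int sctp_subscribe_events_v1(int fd,
				     const struct sctp_event_subscribe_v1 *ev);
	int sctp_subscribe_events_v2(int fd,
				     const struct sctp_event_subscribe_v2 *ev);

	__asm__(".symver sctp_subscribe_events_v1, sctp_subscribe_events@SCTP_1.0");
	__asm__(".symver sctp_subscribe_events_v2, sctp_subscribe_events@@SCTP_1.1");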

Neil

> > 
> >>
> >> Signed-off-by: Julien Gomes 
> >> ---
> >>  net/sctp/socket.c | 2 +-
> >>  1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/net/sctp/socket.c b/net/sctp/socket.c
> >> index 9644bdc8e85c..f9717e2789da 100644
> >> --- a/net/sctp/socket.c
> >> +++ b/net/sctp/socket.c
> >> @@ -2311,7 +2311,7 @@ static int sctp_setsockopt_events(struct sock *sk, 
> >> char __user *optval,
> >>int i;
> >>  
> >>if (optlen > sizeof(struct sctp_event_subscribe))
> >> -  return -EINVAL;
> >> +  optlen = sizeof(struct sctp_event_subscribe);
> >>  
> >>if (copy_from_user(&subscribe, optval, optlen))
> >>return -EFAULT;
> >> -- 
> >> 2.20.1
> >>
> 
> 


Re: [PATCH net] sctp: make sctp_setsockopt_events() less strict about the option length

2019-02-06 Thread Neil Horman
On Wed, Feb 06, 2019 at 12:14:30PM -0800, Julien Gomes wrote:
> Make sctp_setsockopt_events() able to accept sctp_event_subscribe
> structures longer than the current definitions.
> 
> This should prevent unjustified setsockopt() failures due to struct
> sctp_event_subscribe extensions (as in 4.11 and 4.12) when using
> binaries that should be compatible, but were built with later kernel
> uapi headers.
> 
> Signed-off-by: Julien Gomes 
> ---
>  net/sctp/socket.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/sctp/socket.c b/net/sctp/socket.c
> index 9644bdc8e85c..f9717e2789da 100644
> --- a/net/sctp/socket.c
> +++ b/net/sctp/socket.c
> @@ -2311,7 +2311,7 @@ static int sctp_setsockopt_events(struct sock *sk, char 
> __user *optval,
>   int i;
>  
>   if (optlen > sizeof(struct sctp_event_subscribe))
> - return -EINVAL;
> + optlen = sizeof(struct sctp_event_subscribe);
>  
I'm not sure I like this.  If you have a userspace application built against
more recent uapi headers than the kernel you are actually running on, then by
definition you won't have this check in place, and you'll get EINVAL returns
anyway.  If you just backport this patch to an older kernel, you'll not get the
EINVAL return, but you will get silent failures on event subscriptions that your
application thinks exist, but the kernel doesn't recognize.

This would make sense if you had a way to communicate back to user space the
unrecognized options, but since we don't (currently) have that, I would rather
see the EINVAL returned than just have things not work.

Neil

>   if (copy_from_user(&subscribe, optval, optlen))
>   return -EFAULT;
> -- 
> 2.20.1
> 
> 


Re: [PATCHv3 net] sctp: check and update stream->out_curr when allocating stream_out

2019-02-03 Thread Neil Horman
On Mon, Feb 04, 2019 at 03:27:58AM +0800, Xin Long wrote:
> Now when using stream reconfig to add out streams, stream->out
> will get re-allocated, and all old streams' information will
> be copied to the new ones and the old ones will be freed.
> 
> So without stream->out_curr updated, next time when trying to
> send from stream->out_curr stream, a panic would be caused.
> 
> This patch is to check and update stream->out_curr when
> allocating stream_out.
> 
> v1->v2:
>   - define fa_index() to get elem index from stream->out_curr.
> v2->v3:
>   - repost with no change.
> 
> Fixes: 5e32a431 ("sctp: introduce stream scheduler foundations")
> Reported-by: Ying Xu 
> Reported-by: syzbot+e33a3a138267ca119...@syzkaller.appspotmail.com
> Signed-off-by: Xin Long 
> ---
>  net/sctp/stream.c | 20 
>  1 file changed, 20 insertions(+)
> 
> diff --git a/net/sctp/stream.c b/net/sctp/stream.c
> index 80e0ae5..f246331 100644
> --- a/net/sctp/stream.c
> +++ b/net/sctp/stream.c
> @@ -84,6 +84,19 @@ static void fa_zero(struct flex_array *fa, size_t index, 
> size_t count)
>   }
>  }
>  
> +static size_t fa_index(struct flex_array *fa, void *elem, size_t count)
> +{
> + size_t index = 0;
> +
> + while (count--) {
> + if (elem == flex_array_get(fa, index))
> + break;
> + index++;
> + }
> +
> + return index;
> +}
> +
>  /* Migrates chunks from stream queues to new stream queues if needed,
>   * but not across associations. Also, removes those chunks to streams
>   * higher than the new max.
> @@ -147,6 +160,13 @@ static int sctp_stream_alloc_out(struct sctp_stream 
> *stream, __u16 outcnt,
>  
>   if (stream->out) {
>   fa_copy(out, stream->out, 0, min(outcnt, stream->outcnt));
> + if (stream->out_curr) {
> + size_t index = fa_index(stream->out, stream->out_curr,
> + stream->outcnt);
> +
> +         BUG_ON(index == stream->outcnt);
> + stream->out_curr = flex_array_get(out, index);
> + }
>   fa_free(stream->out);
>   }
>  
> -- 
> 2.1.0
> 
> 
Acked-by: Neil Horman 



Re: [PATCH 6/7] sctp: Convert to genradix

2019-01-18 Thread Neil Horman
On Fri, Jan 18, 2019 at 11:35:34AM -0200, Marcelo Ricardo Leitner wrote:
> On Fri, Jan 18, 2019 at 08:10:22AM -0500, Neil Horman wrote:
> > On Tue, Jan 15, 2019 at 12:29:26PM -0200, Marcelo Ricardo Leitner wrote:
> > > On Mon, Dec 17, 2018 at 04:00:21PM -0500, Kent Overstreet wrote:
> > > > On Mon, Dec 17, 2018 at 12:50:01PM -0800, Andrew Morton wrote:
> > > > > On Mon, 17 Dec 2018 08:19:28 -0500 Kent Overstreet 
> > > > >  wrote:
> > > > > 
> > > > > > @@ -535,9 +470,6 @@ int sctp_send_add_streams(struct 
> > > > > > sctp_association *asoc,
> > > > > > goto out;
> > > > > > }
> > > > > >  
> > > > > > -   stream->incnt = incnt;
> > > > > > -   stream->outcnt = outcnt;
> > > > > > -
> > > > > > asoc->strreset_outstanding = !!out + !!in;
> > > > > >  
> > > > > 
> > > > > I'm seeing a reject here for some reason.  Using todays's linux-next,
> > > > > but there are no changes against net/sctp/stream.c in -next.  The
> > > > > assignment to stream->incnt has disappeared.  I did this:
> > > > > 
> > > > > @@ -535,8 +470,6 @@ int sctp_send_add_streams(struct sctp_as
> > > > >   goto out;
> > > > >   }
> > > > >  
> > > > > - stream->outcnt = outcnt;
> > > > > -
> > > > >   asoc->strreset_outstanding = !!out + !!in;
> > > > >  
> > > > >  out:
> > > > > 
> > > > > 
> > > > > We're at 4.20-rc7 and this series is rather large.  I'll merge them 
> > > > > all
> > > > > to see what happens, but I don't think it's 4.21-rc1 material?
> > > > 
> > > > Yeah, agreed. Thanks!
> > > 
> > > Ping? Where did this go?
> > > 
> > As I recall kent reposted his series convirting flex_arrays to radix trees 
> > such
> > that it included sctp's uses.
> 
> That should be this patchset already.  Or you mean another (re)post,
> v2 or so? If yes then I missed it somehow but I still see the
> flex_array in v5.0-rc2:
> 
> net-next]$ git ls-tree v5.0-rc2 -- lib/flex_array.c
> 100644 blob 2eed22fa507c7cb0756d7ef643f8a3454eb455eclib/flex_array.c
> 
> and missing the convertion on sctp stack.
> 
> Current patch is the only one I could find with sctp being included:
> https://lore.kernel.org/lkml/20181217210021.GA7144@kmo-pixel/T/#re4f8656af37431a376044399681a8771375a4405
> 
Correct, and it doesn't seem to be included yet, though I acked Kent's patch as I
recall.  Not sure what the holdup is at this point.
Neil

> > 
> > I think xin needs to repost the sctp reallocation patch to make use of those
> > radix trees appropriately still (assuming any additional work still needs 
> > to be
> > done)
> 
> This came up on another converstation with him and the lack of this
> convertion is blocking him on at least one fix, on one that the fix
> using flex_arrays is one and genradix is another.
> 
>   Marcelo
> 
> > 
> > Neil
> > 
> > >   Marcelo
> > > 
> 


Re: [PATCH 6/7] sctp: Convert to genradix

2019-01-18 Thread Neil Horman
On Tue, Jan 15, 2019 at 12:29:26PM -0200, Marcelo Ricardo Leitner wrote:
> On Mon, Dec 17, 2018 at 04:00:21PM -0500, Kent Overstreet wrote:
> > On Mon, Dec 17, 2018 at 12:50:01PM -0800, Andrew Morton wrote:
> > > On Mon, 17 Dec 2018 08:19:28 -0500 Kent Overstreet 
> > >  wrote:
> > > 
> > > > @@ -535,9 +470,6 @@ int sctp_send_add_streams(struct sctp_association 
> > > > *asoc,
> > > > goto out;
> > > > }
> > > >  
> > > > -   stream->incnt = incnt;
> > > > -   stream->outcnt = outcnt;
> > > > -
> > > > asoc->strreset_outstanding = !!out + !!in;
> > > >  
> > > 
> > > I'm seeing a reject here for some reason.  Using todays's linux-next,
> > > but there are no changes against net/sctp/stream.c in -next.  The
> > > assignment to stream->incnt has disappeared.  I did this:
> > > 
> > > @@ -535,8 +470,6 @@ int sctp_send_add_streams(struct sctp_as
> > >   goto out;
> > >   }
> > >  
> > > - stream->outcnt = outcnt;
> > > -
> > >   asoc->strreset_outstanding = !!out + !!in;
> > >  
> > >  out:
> > > 
> > > 
> > > We're at 4.20-rc7 and this series is rather large.  I'll merge them all
> > > to see what happens, but I don't think it's 4.21-rc1 material?
> > 
> > Yeah, agreed. Thanks!
> 
> Ping? Where did this go?
> 
As I recall, Kent reposted his series converting flex_arrays to radix trees such
that it included sctp's uses.

I think Xin needs to repost the sctp reallocation patch to make use of those
radix trees appropriately still (assuming any additional work still needs to be
done).
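
For reference, the conversion in question amounts to swapping the flex_array
calls for the generic radix tree API from Kent's series.  A minimal sketch,
assuming the <linux/generic-radix-tree.h> interface as it later landed
(stream_entry is an illustrative stand-in, not the real net/sctp structure):

#include <linux/generic-radix-tree.h>

struct stream_entry {                   /* illustrative per-stream state */
        int state;
};

static int example_grow(unsigned int outcnt)
{
        GENRADIX(struct stream_entry) streams;
        struct stream_entry *e;

        genradix_init(&streams);
        /* allocates intermediate nodes on demand; may fail under memory
         * pressure, but never moves entries that already exist */
        e = genradix_ptr_alloc(&streams, outcnt - 1, GFP_KERNEL);
        if (!e) {
                genradix_free(&streams);
                return -ENOMEM;
        }
        e->state = 0;
        genradix_free(&streams);
        return 0;
}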

Neil

>   Marcelo
> 


Re: [PATCH net] sctp: allocate sctp_sockaddr_entry with kzalloc

2019-01-14 Thread Neil Horman
On Mon, Jan 14, 2019 at 06:34:02PM +0800, Xin Long wrote:
> The similar issue as fixed in Commit 4a2eb0c37b47 ("sctp: initialize
> sin6_flowinfo for ipv6 addrs in sctp_inet6addr_event") also exists
> in sctp_inetaddr_event, as Alexander noticed.
> 
> To fix it, allocate sctp_sockaddr_entry with kzalloc for both sctp
> ipv4 and ipv6 addresses, as does in sctp_v4/6_copy_addrlist().
> 
> Reported-by: Alexander Potapenko 
> Signed-off-by: Xin Long 
> ---
>  net/sctp/ipv6.c | 5 +
>  net/sctp/protocol.c | 4 +---
>  2 files changed, 2 insertions(+), 7 deletions(-)
> 
> diff --git a/net/sctp/ipv6.c b/net/sctp/ipv6.c
> index b9ed271..ed8e006 100644
> --- a/net/sctp/ipv6.c
> +++ b/net/sctp/ipv6.c
> @@ -97,11 +97,9 @@ static int sctp_inet6addr_event(struct notifier_block 
> *this, unsigned long ev,
>  
>   switch (ev) {
>   case NETDEV_UP:
> - addr = kmalloc(sizeof(struct sctp_sockaddr_entry), GFP_ATOMIC);
> + addr = kzalloc(sizeof(*addr), GFP_ATOMIC);
>   if (addr) {
>   addr->a.v6.sin6_family = AF_INET6;
> - addr->a.v6.sin6_port = 0;
> - addr->a.v6.sin6_flowinfo = 0;
>   addr->a.v6.sin6_addr = ifa->addr;
>   addr->a.v6.sin6_scope_id = ifa->idev->dev->ifindex;
>   addr->valid = 1;
> @@ -434,7 +432,6 @@ static void sctp_v6_copy_addrlist(struct list_head 
> *addrlist,
>   addr = kzalloc(sizeof(*addr), GFP_ATOMIC);
>   if (addr) {
>   addr->a.v6.sin6_family = AF_INET6;
> - addr->a.v6.sin6_port = 0;
>   addr->a.v6.sin6_addr = ifp->addr;
>   addr->a.v6.sin6_scope_id = dev->ifindex;
>   addr->valid = 1;
> diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c
> index d5878ae..4e0eeb1 100644
> --- a/net/sctp/protocol.c
> +++ b/net/sctp/protocol.c
> @@ -101,7 +101,6 @@ static void sctp_v4_copy_addrlist(struct list_head 
> *addrlist,
>   addr = kzalloc(sizeof(*addr), GFP_ATOMIC);
>   if (addr) {
>   addr->a.v4.sin_family = AF_INET;
> - addr->a.v4.sin_port = 0;
>   addr->a.v4.sin_addr.s_addr = ifa->ifa_local;
>   addr->valid = 1;
>   INIT_LIST_HEAD(&addr->list);
> @@ -776,10 +775,9 @@ static int sctp_inetaddr_event(struct notifier_block 
> *this, unsigned long ev,
>  
>   switch (ev) {
>   case NETDEV_UP:
> - addr = kmalloc(sizeof(struct sctp_sockaddr_entry), GFP_ATOMIC);
> + addr = kzalloc(sizeof(*addr), GFP_ATOMIC);
>   if (addr) {
>   addr->a.v4.sin_family = AF_INET;
> - addr->a.v4.sin_port = 0;
>   addr->a.v4.sin_addr.s_addr = ifa->ifa_local;
>   addr->valid = 1;
>   spin_lock_bh(&net->sctp.local_addr_lock);
> -- 
> 2.1.0
> 
> 
Acked-by: Neil Horman 



Re: [PATCH 6/6] Drop flex_arrays

2018-12-18 Thread Neil Horman
On Mon, Dec 17, 2018 at 07:50:11AM -0500, Kent Overstreet wrote:
> On Thu, Dec 13, 2018 at 01:09:17PM -0500, Neil Horman wrote:
> > On Thu, Dec 13, 2018 at 08:45:33AM -0800, Matthew Wilcox wrote:
> > > On Thu, Dec 13, 2018 at 10:51:49AM -0500, Neil Horman wrote:
> > > > On Thu, Dec 13, 2018 at 06:41:11AM -0800, Matthew Wilcox wrote:
> > > > > On Thu, Dec 13, 2018 at 09:30:47PM +0900, Xin Long wrote:
> > > > > > On Sat, Sep 8, 2018 at 1:57 AM Kent Overstreet
> > > > > >  wrote:
> > > > > > >
> > > > > > > All existing users have been converted to generic radix trees
> > > > > > NAK, SCTP is still using flex_arrays,
> > > > > > # grep flex_array net/sctp/*
> > > > > > 
> > > > > > This patch will break the build.
> > > > > 
> > > > > sctp added that user after this patch was sent.  Please stop adding
> > > > > flexarray users!
> > > > > 
> > > > > This particular user should probably have just used kvmalloc.
> > > > > 
> > > > 
> > > > No, I don't think thats right.
> > > > 
> > > > This appears to have been sent on September 7th.  Commit
> > > > 0d493b4d0be352b5e361e4fa0bc3efe952d8b10e, which added the use of 
> > > > flex_arrays to
> > > > sctp, seems to have been merged on August 10th, a month prior.
> > > 
> > > Are you seriously suggesting anybody sending cleanups needs to be
> > > monitoring every single email list to see if anybody has added a new user?
> > > Removing the flexarray has been advertised since May.
> > > https://lkml.org/lkml/2018/5/22/1142
> > > 
> > I don't see how thats any more egregious than everyone else having to 
> > monitor
> > for removals of code thats in the tree at some indeterminate future.  The 
> > long and the short of it
> > is that a new flex_array user was added in the intervening 7 months that 
> > this
> > patch has been waiting to go in, and it will now break if merged.  I'm 
> > sorry we
> > started using it during that time, but it got missed by everyone in the 
> > chain
> > that merged it, and hasn't been noticed in the 4 months since.  It is what 
> > it
> > is, and now it needs to be undone. 
> > 
> > > > regardless, however, sctp has a current in-tree use of flex_arrays, and 
> > > > merging
> > > > this patch will break the build without a respin.
> > > 
> > > Great.  I await your patch to replace the flexarray usage.
> > Sure, we'll get to it as soon as we can, or, if you are in a hurry, you can
> > replace the same usage, like you've done for all the other users in this 
> > series.
> 
> This is really my fault for slacking on getting generic-radix-trees in, and
> given that the sctp code has been merged I'll do the conversion.
> 
Thank you, I appreciate that.

> However.
> 
> Looking at the sctp code, honestly, wtf is going on here.
> 
> sctp_send_add_streams() calls sctp_stream_alloc_out() when it wants to make 
> the
> out flex_array bigger - ok, this makes sense, you're using a flex_array 
> because
> you want something resizable.
> 
> But wait, look what it actually does - it unconditionally frees the old flex
> array and preallocates a new one and copies the contents of the old one over.
> 
> Without, as far as I can tell, any locking whatsoever.
> 
> Was this code tested? Reviewed?
> 
Yup, both sides are protected by the socket lock associated with the sctp
connection.  It's taken in the sctp_setsockopt function, which is the path
through which we update/reallocate these flex arrays, and it's also taken on
transmit in sctp_sendmsg, and on receive in sctp_rcv (via bh_lock_sock).
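
A rough sketch of that locking pattern, with the sctp specifics elided (the
example_* names here are placeholders, not the real net/sctp entry points):

#include <net/sock.h>

/* setsockopt path: holds the socket lock across the reallocation, so the
 * data paths below never observe a half-rebuilt stream array */
static int example_setsockopt(struct sock *sk)
{
        lock_sock(sk);
        /* ...free the old flex_array and install the new one here... */
        release_sock(sk);
        return 0;
}

/* receive path (softirq context): takes the same socket spinlock via
 * bh_lock_sock(), serializing against the reallocation above */
static void example_rcv(struct sock *sk)
{
        bh_lock_sock(sk);
        /* ...safe to dereference the stream arrays here... */
        bh_unlock_sock(sk);
}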

Neil



Re: [PATCH 1/1] net-next/hinic:optmize rx refill buffer mechanism

2018-12-17 Thread Neil Horman
On Sun, Dec 16, 2018 at 10:32:34PM +, Xue Chaojing wrote:
> In rx_alloc_pkts(), there is no need to schedule a different tasklet for
> refill and it will cause some extra overhead. this patch remove it.
> 
> Suggested-by: Neil Horman 
> Signed-off-by: Xue Chaojing 
> ---
>  drivers/net/ethernet/huawei/hinic/hinic_rx.c | 23 +---
>  drivers/net/ethernet/huawei/hinic/hinic_rx.h |  2 --
>  2 files changed, 5 insertions(+), 20 deletions(-)
> 
I thought I had responded to this already, but this still looks like the same
patch.  While I'm supportive of the removal of the second tasklet (as it's not
needed), this patch still doesn't address the holes that can develop in your rx
ring buffer.  By always receiving a frame, and then just filling as many of the
remaining buffers as possible, you open yourself to the possibility that, in low
memory conditions, you will wind up with an empty rx ring that will result in
hanging your NIC.  You need to, for every received frame, pre-allocate a
replacement buffer, and, should that allocation fail, return the received frame
to the ring, recording a drop in the process.  That way your ring buffer will
never have holes in it.
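
Roughly, the per-frame pattern being asked for looks like the sketch below;
alloc_rx_buf(), deliver_buf() and the rx_ring layout are placeholders, not the
hinic driver's real helpers:

#include <linux/gfp.h>

extern void *alloc_rx_buf(gfp_t gfp);   /* placeholder allocator */
extern void deliver_buf(void *buf);     /* placeholder receive path */

struct rx_ring {
        void            **bufs;         /* one posted buffer per descriptor */
        unsigned long   drops;
};

static void rx_one(struct rx_ring *ring, unsigned int idx)
{
        void *newbuf = alloc_rx_buf(GFP_ATOMIC);        /* pre-allocate first */

        if (!newbuf) {
                ring->drops++;          /* record the drop... */
                return;                 /* ...and recycle the posted buffer */
        }
        deliver_buf(ring->bufs[idx]);   /* hand the filled buffer up */
        ring->bufs[idx] = newbuf;       /* the slot is never left empty */
}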

Neil
> diff --git a/drivers/net/ethernet/huawei/hinic/hinic_rx.c 
> b/drivers/net/ethernet/huawei/hinic/hinic_rx.c
> index f86f2e693224..0098b206e7e9 100644
> --- a/drivers/net/ethernet/huawei/hinic/hinic_rx.c
> +++ b/drivers/net/ethernet/huawei/hinic/hinic_rx.c
> @@ -43,6 +43,7 @@
>  #define RX_IRQ_NO_LLI_TIMER 0
>  #define RX_IRQ_NO_CREDIT0
>  #define RX_IRQ_NO_RESEND_TIMER  0
> +#define HINIC_RX_BUFFER_WRITE   16
>  
>  /**
>   * hinic_rxq_clean_stats - Clean the statistics of specific queue
> @@ -229,7 +230,6 @@ static int rx_alloc_pkts(struct hinic_rxq *rxq)
>   wmb();  /* write all the wqes before update PI */
>  
>   hinic_rq_update(rxq->rq, prod_idx);
> - tasklet_schedule(&rxq->rx_task);
>   }
>  
>   return i;
> @@ -258,17 +258,6 @@ static void free_all_rx_skbs(struct hinic_rxq *rxq)
>   }
>  }
>  
> -/**
> - * rx_alloc_task - tasklet for queue allocation
> - * @data: rx queue
> - **/
> -static void rx_alloc_task(unsigned long data)
> -{
> - struct hinic_rxq *rxq = (struct hinic_rxq *)data;
> -
> - (void)rx_alloc_pkts(rxq);
> -}
> -
>  /**
>   * rx_recv_jumbo_pkt - Rx handler for jumbo pkt
>   * @rxq: rx queue
> @@ -333,6 +322,7 @@ static int rxq_recv(struct hinic_rxq *rxq, int budget)
>   struct hinic_qp *qp = container_of(rxq->rq, struct hinic_qp, rq);
>   u64 pkt_len = 0, rx_bytes = 0;
>   struct hinic_rq_wqe *rq_wqe;
> + unsigned int free_wqebbs;
>   int num_wqes, pkts = 0;
>   struct hinic_sge sge;
>   struct sk_buff *skb;
> @@ -376,8 +366,9 @@ static int rxq_recv(struct hinic_rxq *rxq, int budget)
>   rx_bytes += pkt_len;
>   }
>  
> - if (pkts)
> - tasklet_schedule(&rxq->rx_task); /* rx_alloc_pkts */
> + free_wqebbs = hinic_get_rq_free_wqebbs(rxq->rq);
> + if (free_wqebbs > HINIC_RX_BUFFER_WRITE)
> + rx_alloc_pkts(rxq);
>  
>   u64_stats_update_begin(&rxq->rxq_stats.syncp);
>   rxq->rxq_stats.pkts += pkts;
> @@ -494,8 +485,6 @@ int hinic_init_rxq(struct hinic_rxq *rxq, struct hinic_rq 
> *rq,
>  
>   sprintf(rxq->irq_name, "hinic_rxq%d", qp->q_id);
>  
> - tasklet_init(&rxq->rx_task, rx_alloc_task, (unsigned long)rxq);
> -
>   pkts = rx_alloc_pkts(rxq);
>   if (!pkts) {
>   err = -ENOMEM;
> @@ -512,7 +501,6 @@ int hinic_init_rxq(struct hinic_rxq *rxq, struct hinic_rq 
> *rq,
>  
>  err_req_rx_irq:
>  err_rx_pkts:
> - tasklet_kill(&rxq->rx_task);
>   free_all_rx_skbs(rxq);
>   devm_kfree(&netdev->dev, rxq->irq_name);
>   return err;
> @@ -528,7 +516,6 @@ void hinic_clean_rxq(struct hinic_rxq *rxq)
>  
>   rx_free_irq(rxq);
>  
> - tasklet_kill(&rxq->rx_task);
>   free_all_rx_skbs(rxq);
>   devm_kfree(&netdev->dev, rxq->irq_name);
>  }
> diff --git a/drivers/net/ethernet/huawei/hinic/hinic_rx.h 
> b/drivers/net/ethernet/huawei/hinic/hinic_rx.h
> index ab3fabab91b2..f8ed3fa6c8ee 100644
> --- a/drivers/net/ethernet/huawei/hinic/hinic_rx.h
> +++ b/drivers/net/ethernet/huawei/hinic/hinic_rx.h
> @@ -42,8 +42,6 @@ struct hinic_rxq {
>  
>   char*irq_name;
>  
> - struct tasklet_struct   rx_task;
> -
>   struct napi_struct  napi;
>  };
>  
> -- 
> 2.17.1
> 


Re: [PATCH 6/6] Drop flex_arrays

2018-12-13 Thread Neil Horman
On Thu, Dec 13, 2018 at 08:45:33AM -0800, Matthew Wilcox wrote:
> On Thu, Dec 13, 2018 at 10:51:49AM -0500, Neil Horman wrote:
> > On Thu, Dec 13, 2018 at 06:41:11AM -0800, Matthew Wilcox wrote:
> > > On Thu, Dec 13, 2018 at 09:30:47PM +0900, Xin Long wrote:
> > > > On Sat, Sep 8, 2018 at 1:57 AM Kent Overstreet
> > > >  wrote:
> > > > >
> > > > > All existing users have been converted to generic radix trees
> > > > NAK, SCTP is still using flex_arrays,
> > > > # grep flex_array net/sctp/*
> > > > 
> > > > This patch will break the build.
> > > 
> > > sctp added that user after this patch was sent.  Please stop adding
> > > flexarray users!
> > > 
> > > This particular user should probably have just used kvmalloc.
> > > 
> > 
> > No, I don't think thats right.
> > 
> > This appears to have been sent on September 7th.  Commit
> > 0d493b4d0be352b5e361e4fa0bc3efe952d8b10e, which added the use of 
> > flex_arrays to
> > sctp, seems to have been merged on August 10th, a month prior.
> 
> Are you seriously suggesting anybody sending cleanups needs to be
> monitoring every single email list to see if anybody has added a new user?
> Removing the flexarray has been advertised since May.
> https://lkml.org/lkml/2018/5/22/1142
> 
I don't see how that's any more egregious than everyone else having to monitor
for removals of code that's in the tree at some indeterminate future point.  The
long and the short of it is that a new flex_array user was added in the
intervening 7 months that this patch has been waiting to go in, and it will now
break if merged.  I'm sorry we started using it during that time, but it got
missed by everyone in the chain that merged it, and hasn't been noticed in the 4
months since.  It is what it is, and now it needs to be undone.

> > regardless, however, sctp has a current in-tree use of flex_arrays, and 
> > merging
> > this patch will break the build without a respin.
> 
> Great.  I await your patch to replace the flexarray usage.
Sure, we'll get to it as soon as we can, or, if you are in a hurry, you can
replace the same usage, like you've done for all the other users in this series.

Neil


Re: [PATCH 6/6] Drop flex_arrays

2018-12-13 Thread Neil Horman
On Thu, Dec 13, 2018 at 06:41:11AM -0800, Matthew Wilcox wrote:
> On Thu, Dec 13, 2018 at 09:30:47PM +0900, Xin Long wrote:
> > On Sat, Sep 8, 2018 at 1:57 AM Kent Overstreet
> >  wrote:
> > >
> > > All existing users have been converted to generic radix trees
> > NAK, SCTP is still using flex_arrays,
> > # grep flex_array net/sctp/*
> > 
> > This patch will break the build.
> 
> sctp added that user after this patch was sent.  Please stop adding
> flexarray users!
> 
> This particular user should probably have just used kvmalloc.
> 

No, I don't think that's right.

This appears to have been sent on September 7th.  Commit
0d493b4d0be352b5e361e4fa0bc3efe952d8b10e, which added the use of flex_arrays to
sctp, seems to have been merged on August 10th, a month prior.

Regardless, however, sctp has a current in-tree use of flex_arrays, and merging
this patch will break the build without a respin.

Neil




Re: [PATCH 1/1] net-next/hinic:optmize rx refill buffer mechanism

2018-12-13 Thread Neil Horman
On Wed, Dec 12, 2018 at 05:40:23PM +, Xue Chaojing wrote:
> There is no need to schedule a different tasklet for refill,
> This patch remove it.
> 
> Suggested-by: Neil Horman 
> Signed-off-by: Xue Chaojing 
> ---
>  drivers/net/ethernet/huawei/hinic/hinic_rx.c | 23 +---
>  drivers/net/ethernet/huawei/hinic/hinic_rx.h |  2 --
>  2 files changed, 5 insertions(+), 20 deletions(-)
> 
> diff --git a/drivers/net/ethernet/huawei/hinic/hinic_rx.c 
> b/drivers/net/ethernet/huawei/hinic/hinic_rx.c
> index f86f2e693224..0098b206e7e9 100644
> --- a/drivers/net/ethernet/huawei/hinic/hinic_rx.c
> +++ b/drivers/net/ethernet/huawei/hinic/hinic_rx.c
> @@ -43,6 +43,7 @@
>  #define RX_IRQ_NO_LLI_TIMER 0
>  #define RX_IRQ_NO_CREDIT0
>  #define RX_IRQ_NO_RESEND_TIMER  0
> +#define HINIC_RX_BUFFER_WRITE   16
>  
>  /**
>   * hinic_rxq_clean_stats - Clean the statistics of specific queue
> @@ -229,7 +230,6 @@ static int rx_alloc_pkts(struct hinic_rxq *rxq)
>   wmb();  /* write all the wqes before update PI */
>  
>   hinic_rq_update(rxq->rq, prod_idx);
> - tasklet_schedule(&rxq->rx_task);
>   }
>  
>   return i;
> @@ -258,17 +258,6 @@ static void free_all_rx_skbs(struct hinic_rxq *rxq)
>   }
>  }
>  
> -/**
> - * rx_alloc_task - tasklet for queue allocation
> - * @data: rx queue
> - **/
> -static void rx_alloc_task(unsigned long data)
> -{
> - struct hinic_rxq *rxq = (struct hinic_rxq *)data;
> -
> - (void)rx_alloc_pkts(rxq);
> -}
> -
>  /**
>   * rx_recv_jumbo_pkt - Rx handler for jumbo pkt
>   * @rxq: rx queue
> @@ -333,6 +322,7 @@ static int rxq_recv(struct hinic_rxq *rxq, int budget)
>   struct hinic_qp *qp = container_of(rxq->rq, struct hinic_qp, rq);
>   u64 pkt_len = 0, rx_bytes = 0;
>   struct hinic_rq_wqe *rq_wqe;
> + unsigned int free_wqebbs;
>   int num_wqes, pkts = 0;
>   struct hinic_sge sge;
>   struct sk_buff *skb;
> @@ -376,8 +366,9 @@ static int rxq_recv(struct hinic_rxq *rxq, int budget)
>   rx_bytes += pkt_len;
>   }
>  
> - if (pkts)
> - tasklet_schedule(&rxq->rx_task); /* rx_alloc_pkts */
> + free_wqebbs = hinic_get_rq_free_wqebbs(rxq->rq);
> + if (free_wqebbs > HINIC_RX_BUFFER_WRITE)
> + rx_alloc_pkts(rxq);
>  
>   u64_stats_update_begin(&rxq->rxq_stats.syncp);
>   rxq->rxq_stats.pkts += pkts;
> @@ -494,8 +485,6 @@ int hinic_init_rxq(struct hinic_rxq *rxq, struct hinic_rq 
> *rq,
>  
>   sprintf(rxq->irq_name, "hinic_rxq%d", qp->q_id);
>  
> - tasklet_init(&rxq->rx_task, rx_alloc_task, (unsigned long)rxq);
> -
>   pkts = rx_alloc_pkts(rxq);
>   if (!pkts) {
>   err = -ENOMEM;
> @@ -512,7 +501,6 @@ int hinic_init_rxq(struct hinic_rxq *rxq, struct hinic_rq 
> *rq,
>  
>  err_req_rx_irq:
>  err_rx_pkts:
> - tasklet_kill(&rxq->rx_task);
>   free_all_rx_skbs(rxq);
>   devm_kfree(&netdev->dev, rxq->irq_name);
>   return err;
> @@ -528,7 +516,6 @@ void hinic_clean_rxq(struct hinic_rxq *rxq)
>  
>   rx_free_irq(rxq);
>  
> - tasklet_kill(&rxq->rx_task);
>   free_all_rx_skbs(rxq);
>   devm_kfree(&netdev->dev, rxq->irq_name);
>  }
> diff --git a/drivers/net/ethernet/huawei/hinic/hinic_rx.h 
> b/drivers/net/ethernet/huawei/hinic/hinic_rx.h
> index ab3fabab91b2..f8ed3fa6c8ee 100644
> --- a/drivers/net/ethernet/huawei/hinic/hinic_rx.h
> +++ b/drivers/net/ethernet/huawei/hinic/hinic_rx.h
> @@ -42,8 +42,6 @@ struct hinic_rxq {
>  
>   char*irq_name;
>  
> - struct tasklet_struct   rx_task;
> -
>   struct napi_struct  napi;
>  };
>  
> -- 
> 2.17.1
> 
I like that you're getting rid of the extra tasklet, but the other part of this
is properly refilling your rx ring.  The way you have this coded, you always
blindly just receive the incoming frame, even if your refill operation fails.
If you get a long enough period in which you are memory constrained, you will
wind up with an empty rx ring, which isn't good.  With this patch, if your ring
becomes empty, then you will stop receiving frames (no buffers to put them in),
which in turn will prevent further attempts to refill the ring.  The result is
effectively a hang in traffic reception that is only solvable by a NIC reset.

Common practice is to, for each skb that you intend to receive:

1) Allocate a replacement buffer/skb
2a) If allocation succeeds, receive the buffer currently on the ring, and
replace it with the buffer from (1)
2b) If allocation fails, record a frame drop, mark the existing buffer as clean,
and move on

This process ensures that the ring never has any gaps in it, preventing the
above hang condition.
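
A minimal sketch of steps (1), (2a) and (2b) above, using skbs and placeholder
names (example_rxq, rx_bufs) rather than the hinic driver's real structures:

#include <linux/skbuff.h>

struct example_rxq {
        struct sk_buff  **rx_bufs;      /* one preposted skb per descriptor */
        unsigned int    buf_len;
        unsigned long   drops;
};

/* Returns the received skb to pass up the stack, or NULL if the frame had
 * to be dropped because no replacement buffer could be allocated. */
static struct sk_buff *example_rx_one(struct example_rxq *rxq, unsigned int idx)
{
        struct sk_buff *newskb = dev_alloc_skb(rxq->buf_len);   /* step 1 */
        struct sk_buff *skb;

        if (!newskb) {
                /* step 2b: leave the posted skb on the ring (mark the
                 * descriptor clean for reuse) and record the drop */
                rxq->drops++;
                return NULL;
        }
        /* step 2a: take the filled skb and replace it in the same slot,
         * so the ring never develops a hole */
        skb = rxq->rx_bufs[idx];
        rxq->rx_bufs[idx] = newskb;
        return skb;
}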

Neil
 




Re: [PATCHv2 net 0/3] net: add support for flex_array_resize in flex_array

2018-12-12 Thread Neil Horman
On Tue, Dec 11, 2018 at 10:50:00PM -0800, David Miller wrote:
> From: Xin Long 
> Date: Fri,  7 Dec 2018 14:30:32 +0800
> 
> > Without the support for the total_nr_elements's growing or shrinking
> > dynamically, flex_array is not that 'flexible'. Like when users want
> > to change the size, they have to redo flex_array_alloc and copy all
> > the elements from the old to the new one.  The worse thing is every
> > element's memory gets changed.
> > 
> > To implement flex_array_resize based on current code, the difficult
> > thing is to process the size border of FLEX_ARRAY_BASE_BYTES_LEFT,
> > where the base data memory may change to an array for the 2nd level
> > data memory for growing, likewise for shrinking.
> > 
> > To make this part easier, we separate the base data memory and define
> > FLEX_ARRAY_BASE_SIZE as a same value of FLEX_ARRAY_PART_SIZE, as Neil
> > suggested.  When new size is crossing the border, the base memory is
> > allocated as the array for the 2nd level data memory and its part[0]
> > is pointed to the old base memory, and do the opposite for shrinking.
> > 
> > But it doesn't do any memory allocation or shrinking for elements in
> > flex_array_resize, as which should be done by flex_array_prealloc or
> > flex_array_shrink called by users.  No memory leaks can be caused by
> > that.
> > 
> > SCTP has benefited a lot from flex_array_resize() for managing its
> > stream memory so far.
> > 
> > v1->v2:
> >   Cc LKML and more developers.
> 
> So I don't know what to do about this series.
> 
> One of the responses stated that it has been proposed to remove flex_array
> and I don't know what to make of that, nor can I tell if that makes this
> series inappropriate or not.
> 


I suggest Xin respond to message-id
<20180523011821.12165-6-kent.overstr...@gmail.com>
and send a NAK, indicating that his patch seems like it will break the build,
as, looking through it, it never removes flex_array calls from the sctp code.
If Kent reposts with a conversion of the sctp code to radix trees, we're done.
If not, you can move forward with this commit.

Neil



Re: [PATCH net] sctp: initialize sin6_flowinfo for ipv6 addrs in sctp_inet6addr_event

2018-12-10 Thread Neil Horman
On Mon, Dec 10, 2018 at 06:00:52PM +0800, Xin Long wrote:
> syzbot reported a kernel-infoleak, which is caused by an uninitialized
> field(sin6_flowinfo) of addr->a.v6 in sctp_inet6addr_event().
> The call trace is as below:
> 
>   BUG: KMSAN: kernel-infoleak in _copy_to_user+0x19a/0x230 lib/usercopy.c:33
>   CPU: 1 PID: 8164 Comm: syz-executor2 Not tainted 4.20.0-rc3+ #95
>   Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
>   Google 01/01/2011
>   Call Trace:
> __dump_stack lib/dump_stack.c:77 [inline]
> dump_stack+0x32d/0x480 lib/dump_stack.c:113
> kmsan_report+0x12c/0x290 mm/kmsan/kmsan.c:683
> kmsan_internal_check_memory+0x32a/0xa50 mm/kmsan/kmsan.c:743
> kmsan_copy_to_user+0x78/0xd0 mm/kmsan/kmsan_hooks.c:634
> _copy_to_user+0x19a/0x230 lib/usercopy.c:33
> copy_to_user include/linux/uaccess.h:183 [inline]
> sctp_getsockopt_local_addrs net/sctp/socket.c:5998 [inline]
> sctp_getsockopt+0x15248/0x186f0 net/sctp/socket.c:7477
> sock_common_getsockopt+0x13f/0x180 net/core/sock.c:2937
> __sys_getsockopt+0x489/0x550 net/socket.c:1939
> __do_sys_getsockopt net/socket.c:1950 [inline]
> __se_sys_getsockopt+0xe1/0x100 net/socket.c:1947
> __x64_sys_getsockopt+0x62/0x80 net/socket.c:1947
> do_syscall_64+0xcf/0x110 arch/x86/entry/common.c:291
> entry_SYSCALL_64_after_hwframe+0x63/0xe7
> 
> sin6_flowinfo is not really used by SCTP, so it will be fixed by simply
> setting it to 0.
> 
> The issue exists since very beginning.
> Thanks Alexander for the reproducer provided.
> 
> Reported-by: syzbot+ad5d327e6936a2e28...@syzkaller.appspotmail.com
> Signed-off-by: Xin Long 
> ---
>  net/sctp/ipv6.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/net/sctp/ipv6.c b/net/sctp/ipv6.c
> index fc6c5e4..7f0539d 100644
> --- a/net/sctp/ipv6.c
> +++ b/net/sctp/ipv6.c
> @@ -101,6 +101,7 @@ static int sctp_inet6addr_event(struct notifier_block 
> *this, unsigned long ev,
>   if (addr) {
>   addr->a.v6.sin6_family = AF_INET6;
>   addr->a.v6.sin6_port = 0;
> + addr->a.v6.sin6_flowinfo = 0;
>   addr->a.v6.sin6_addr = ifa->addr;
>   addr->a.v6.sin6_scope_id = ifa->idev->dev->ifindex;
>   addr->valid = 1;
> -- 
> 2.1.0
> 
> 
Acked-by: Neil Horman 



Re: [PATCH] vmw_pvrdma: Release netdev when vmxnet3 module is removed

2018-07-01 Thread Neil Horman
On Sat, Jun 30, 2018 at 10:15:07PM +0300, Dan Carpenter wrote:
> Hi Neil,
> 
> I love your patch! Perhaps something to improve:
> 
> url:
> https://github.com/0day-ci/linux/commits/Neil-Horman/vmw_pvrdma-Release-netdev-when-vmxnet3-module-is-removed/20180628-232414
> 
> smatch warnings:
> drivers/infiniband/hw/vmw_pvrdma/pvrdma_main.c:987 pvrdma_pci_probe() warn: 
> variable dereferenced before check 'dev->netdev' (see line 985)
> 
Appreciate the smatch check, but this was caught by visual review and fixed in 
V2 already.

Best
Neil



[PATCH v2] vmw_pvrdma: Release netdev when vmxnet3 module is removed

2018-06-29 Thread Neil Horman
On repeated module load/unload cycles, it's possible for the pvrdma
driver to encounter this crash:

...
[  297.032448] RIP: 0010:[]  []
netdev_walk_all_upper_dev_rcu+0x50/0xb0
[  297.034078] RSP: 0018:95087780bd08  EFLAGS: 00010286
[  297.034986] RAX:  RBX:  RCX: 95087a0c
[  297.036196] RDX: 95087a0c RSI: 839e44e0 RDI: 950835d0c000
[  297.037421] RBP: 95087780bd40 R08: 95087a0e0ea0 R09: abddacd03f8e0ea0
[  297.038636] R10: abddacd03f8e0ea0 R11: ef5901e9dbc0 R12: 95087a0c
[  297.039854] R13: 839e44e0 R14: 95087a0c R15: 950835d0c828
[  297.041071] FS:  () GS:95087fc0() 
knlGS:
[  297.042443] CS:  0010 DS:  ES:  CR0: 80050033
[  297.043429] CR2: ffe8 CR3: 7a652000 CR4: 003607f0
[  297.044674] DR0:  DR1:  DR2: 
[  297.045893] DR3:  DR6: fffe0ff0 DR7: 0400
[  297.047109] Call Trace:
[  297.047545]  [] netdev_has_upper_dev_all_rcu+0x18/0x20
[  297.048691]  [] is_eth_port_of_netdev+0x2f/0xa0 [ib_core]
[  297.049886]  [] ? 
is_eth_active_slave_of_bonding_rcu+0x70/0x70 [ib_core]
...

This occurs because vmw_pvrdma on probe stores a pointer to the netdev
that exists on function 0 of the same bus/device/slot (which represents
the vmxnet3 ethernet driver).  However, it never removes this pointer if
the vmxnet3 module is removed, leading to crashes resulting from use
after free dereferencing incidents like the one above.

The fix is pretty straightforward.  vmw_pvrdma should listen for
NETDEV_REGISTER and NETDEV_UNREGISTER events in its event listener code
block, and update the stored netdev pointer accordingly.  This solution
has been tested by myself and the reporter with successful results.
This fix also allows the pvrdma driver to find its underlying ethernet
device in the event that vmxnet3 is loaded after pvrdma, which it was
not able to do before.

Signed-off-by: Neil Horman 
Reported-by: ruq...@redhat.com
CC: Adit Ranadive 
CC: VMware PV-Drivers 
CC: Doug Ledford 
CC: Jason Gunthorpe 
CC: linux-kernel@vger.kernel.org

---
Change notes

v2)
 * Move dev_hold in pvrdma_pci_probe to below null check (aditr)
 * Add dev_puts to probe error path and pvrdma_pci_remove (jgg)
 * Cleaned up some checkpatch warnings (nhorman)
---
 .../infiniband/hw/vmw_pvrdma/pvrdma_main.c| 39 ++-
 1 file changed, 37 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/vmw_pvrdma/pvrdma_main.c 
b/drivers/infiniband/hw/vmw_pvrdma/pvrdma_main.c
index 0be33a81bbe6..970d24d887c2 100644
--- a/drivers/infiniband/hw/vmw_pvrdma/pvrdma_main.c
+++ b/drivers/infiniband/hw/vmw_pvrdma/pvrdma_main.c
@@ -699,8 +699,12 @@ static int pvrdma_del_gid(const struct ib_gid_attr *attr, 
void **context)
 }
 
 static void pvrdma_netdevice_event_handle(struct pvrdma_dev *dev,
+ struct net_device *ndev,
  unsigned long event)
 {
+   struct pci_dev *pdev_net;
+   unsigned int slot;
+
switch (event) {
case NETDEV_REBOOT:
case NETDEV_DOWN:
@@ -718,6 +722,24 @@ static void pvrdma_netdevice_event_handle(struct 
pvrdma_dev *dev,
else
pvrdma_dispatch_event(dev, 1, IB_EVENT_PORT_ACTIVE);
break;
+   case NETDEV_UNREGISTER:
+   dev_put(dev->netdev);
+   dev->netdev = NULL;
+   break;
+   case NETDEV_REGISTER:
+   /* vmxnet3 will have same bus, slot. But func will be 0 */
+   slot = PCI_SLOT(dev->pdev->devfn);
+   pdev_net = pci_get_slot(dev->pdev->bus,
+   PCI_DEVFN(slot, 0));
+   if ((dev->netdev == NULL) &&
+   (pci_get_drvdata(pdev_net) == ndev)) {
+   /* this is our netdev */
+   dev->netdev = ndev;
+   dev_hold(ndev);
+   }
+   pci_dev_put(pdev_net);
+   break;
+
default:
dev_dbg(&dev->pdev->dev, "ignore netdevice event %ld on %s\n",
event, dev->ib_dev.name);
@@ -734,8 +756,11 @@ static void pvrdma_netdevice_event_work(struct work_struct 
*work)
 
mutex_lock(&pvrdma_device_list_lock);
list_for_each_entry(dev, &pvrdma_device_list, device_link) {
-   if (dev->netdev == netdev_work->event_netdev) {
-   pvrdma_netdevice_event_handle(dev, netdev_work->event);
+   if ((netdev_work->event == NETDEV_REGISTER) ||
+   (dev->netdev == netdev_work->event_netdev)) {
+   pvrdma_netdevice_event_handle(dev,
+ netdev_work->event_netdev,

Re: [PATCH] vmw_pvrdma: Release netdev when vmxnet3 module is removed

2018-06-29 Thread Neil Horman
On Thu, Jun 28, 2018 at 09:15:46PM +, Adit Ranadive wrote:
> On 6/28/18, 1:37 PM, "Jason Gunthorpe"  wrote:
> > On Thu, Jun 28, 2018 at 03:45:26PM -0400, Neil Horman wrote:
> > > On Thu, Jun 28, 2018 at 12:59:46PM -0600, Jason Gunthorpe wrote:
> > > > On Thu, Jun 28, 2018 at 09:59:38AM -0400, Neil Horman wrote:
> > > > > On repeated module load/unload cycles, its possible for the pvrmda
> > > > > driver to encounter this crash:
> 
> > > > > @@ -962,6 +982,7 @@ static int pvrdma_pci_probe(struct pci_dev *pdev,
> > > > >   }
> > > > >  
> > > > >   dev->netdev = pci_get_drvdata(pdev_net);
> > > > > + dev_hold(dev->netdev);
> 
> That doesn't seem right. If the vmxnet3 driver isn't loaded at all or failed
> to create a netdev, you would be requesting a hold on a NULL netdev. What if
> you moved this to after the if(!dev->netdev) check?
> 
You're correct, I was thinking that there was a null check in dev_hold, but
there isn't; it needs to be moved after the !dev->netdev check, and released in
the error path.
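
Concretely, the probe path would end up ordered roughly like this (a sketch of
the intended v2 change, not the final patch):

        dev->netdev = pci_get_drvdata(pdev_net);
        pci_dev_put(pdev_net);
        if (!dev->netdev) {
                dev_err(&pdev->dev, "failed to get vmxnet3 device\n");
                ret = -ENODEV;
                goto err_free_cq_ring;
        }
        /* take the reference only once the netdev is known to be valid; the
         * error paths and pvrdma_pci_remove() must then drop it again */
        dev_hold(dev->netdev);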

> > > > >   pci_dev_put(pdev_net);
> > > > >   if (!dev->netdev) {
> > > > >   dev_err(&pdev->dev, "failed to get vmxnet3 device\n");
> > > > 
> > > > I see a lot of new dev_hold's here, where are the matching
> > > > dev_puts()?
> > > > 
> > I'm not sure I'd call 2 alot, but sure, there is a new dev_hold in the
> > pvrdma_pci_probe routine, to hold a reference to the netdev that is looked 
> > up
> > there.  It is balanced by the NETDEV_UNREGISTER case in
> > pvrdma_netdevice_event_handle.  The UNREGISTER clause is also balancing the
> > NETDEV_REGISTER case of the hanlder that looks up the matching netdev 
> > should a
> > new device be registered.  Note that we will only hold a single device at a
> > time, because a given pvrdma device only recongnizes a single vmxnet3 device
> > (the one on function 0 of its own bus/device tuple).
> > 
> > I don't see how the dev_hold in pvrdma_pci_probe is undone during
> > error unwind (eg goto err_free_cq_ring)
> > 
> > And I don't see how it is put when pvrdma_pci_remove() is called.
> 
> That's right. These seem missing as well. 
> 
yup



Re: [PATCH] vmw_pvrdma: Release netdev when vmxnet3 module is removed

2018-06-29 Thread Neil Horman
On Thu, Jun 28, 2018 at 02:37:09PM -0600, Jason Gunthorpe wrote:
> On Thu, Jun 28, 2018 at 03:45:26PM -0400, Neil Horman wrote:
> > On Thu, Jun 28, 2018 at 12:59:46PM -0600, Jason Gunthorpe wrote:
> > > On Thu, Jun 28, 2018 at 09:59:38AM -0400, Neil Horman wrote:
> > > > On repeated module load/unload cycles, its possible for the pvrmda
> > > > driver to encounter this crash:
> > > > 
> > > > ...
> > > > 297.032448] RIP: 0010:[]  [] 
> > > > netdev_walk_all_upper_dev_rcu+0x50/0xb0
> > > > [  297.034078] RSP: 0018:95087780bd08  EFLAGS: 00010286
> > > > [  297.034986] RAX:  RBX:  RCX: 
> > > > 95087a0c
> > > > [  297.036196] RDX: 95087a0c RSI: 839e44e0 RDI: 
> > > > 950835d0c000
> > > > [  297.037421] RBP: 95087780bd40 R08: 95087a0e0ea0 R09: 
> > > > abddacd03f8e0ea0
> > > > [  297.038636] R10: abddacd03f8e0ea0 R11: ef5901e9dbc0 R12: 
> > > > 95087a0c
> > > > [  297.039854] R13: 839e44e0 R14: 95087a0c R15: 
> > > > 950835d0c828
> > > > [  297.041071] FS:  () GS:95087fc0() 
> > > > knlGS:
> > > > [  297.042443] CS:  0010 DS:  ES:  CR0: 80050033
> > > > [  297.043429] CR2: ffe8 CR3: 7a652000 CR4: 
> > > > 003607f0
> > > > [  297.044674] DR0:  DR1:  DR2: 
> > > > 
> > > > [  297.045893] DR3:  DR6: fffe0ff0 DR7: 
> > > > 0400
> > > > [  297.047109] Call Trace:
> > > > [  297.047545]  [] 
> > > > netdev_has_upper_dev_all_rcu+0x18/0x20
> > > > [  297.048691]  [] is_eth_port_of_netdev+0x2f/0xa0 
> > > > [ib_core]
> > > > [  297.049886]  [] ? 
> > > > is_eth_active_slave_of_bonding_rcu+0x70/0x70 [ib_core]
> > > > ...
> > > > 
> > > > This occurs because vmw_pvrdma on probe stores a pointer to the netdev
> > > > that exists on function 0 of the same bus/device/slot (which represents
> > > > the vmxnet3 ethernet driver).  However, it never removes this pointer if
> > > > the vmxnet3 module is removed, leading to crashes resulting from use
> > > > after free dereferencing incidents like the one above.
> > > > 
> > > > The fix is pretty straightforward.  vmw_pvrdma should listen for
> > > > NETDEV_REGISTER and NETDEV_UNREGISTER events in its event listener code
> > > > block, and update the stored netdev pointer accordingly.  This solution
> > > > has been tested by myself and the reporter with successful results.
> > > > This fix also allows the pvrdma driver to find its underlying ethernet
> > > > device in the event that vmxnet3 is loaded after pvrdma, which it was
> > > > not able to do before.
> > > > 
> > > > Signed-off-by: Neil Horman 
> > > > Reported-by: ruq...@redhat.com
> > > > CC: Adit Ranadive 
> > > > CC: VMware PV-Drivers 
> > > > CC: Doug Ledford 
> > > > CC: Jason Gunthorpe 
> > > > CC: linux-kernel@vger.kernel.org
> > > >  .../infiniband/hw/vmw_pvrdma/pvrdma_main.c| 25 +--
> > > >  1 file changed, 23 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/drivers/infiniband/hw/vmw_pvrdma/pvrdma_main.c 
> > > > b/drivers/infiniband/hw/vmw_pvrdma/pvrdma_main.c
> > > > index 0be33a81bbe6..5b4782078a74 100644
> > > > +++ b/drivers/infiniband/hw/vmw_pvrdma/pvrdma_main.c
> > > > @@ -699,8 +699,12 @@ static int pvrdma_del_gid(const struct ib_gid_attr 
> > > > *attr, void **context)
> > > >  }
> > > >  
> > > >  static void pvrdma_netdevice_event_handle(struct pvrdma_dev *dev,
> > > > + struct net_device *ndev,
> > > >   unsigned long event)
> > > >  {
> > > > +   struct pci_dev *pdev_net;
> > > > +
> > > > +
> > > > switch (event) {
> > > > case NETDEV_REBOOT:
> > > > case NETDEV_DOWN:
> > > > @@ -718,6 +722,21 @@ static void pvrdma_netdevice_event_handle(struct 
> > > > pvrdma_dev *dev,
> > > > else
> > > >  

Re: [PATCH] vmw_pvrdma: Release netdev when vmxnet3 module is removed

2018-06-28 Thread Neil Horman
On Thu, Jun 28, 2018 at 12:59:46PM -0600, Jason Gunthorpe wrote:
> On Thu, Jun 28, 2018 at 09:59:38AM -0400, Neil Horman wrote:
> > On repeated module load/unload cycles, its possible for the pvrmda
> > driver to encounter this crash:
> > 
> > ...
> > 297.032448] RIP: 0010:[]  [] 
> > netdev_walk_all_upper_dev_rcu+0x50/0xb0
> > [  297.034078] RSP: 0018:95087780bd08  EFLAGS: 00010286
> > [  297.034986] RAX:  RBX:  RCX: 
> > 95087a0c
> > [  297.036196] RDX: 95087a0c RSI: 839e44e0 RDI: 
> > 950835d0c000
> > [  297.037421] RBP: 95087780bd40 R08: 95087a0e0ea0 R09: 
> > abddacd03f8e0ea0
> > [  297.038636] R10: abddacd03f8e0ea0 R11: ef5901e9dbc0 R12: 
> > 95087a0c
> > [  297.039854] R13: 839e44e0 R14: 95087a0c R15: 
> > 950835d0c828
> > [  297.041071] FS:  () GS:95087fc0() 
> > knlGS:
> > [  297.042443] CS:  0010 DS:  ES:  CR0: 80050033
> > [  297.043429] CR2: ffe8 CR3: 7a652000 CR4: 
> > 003607f0
> > [  297.044674] DR0:  DR1:  DR2: 
> > 
> > [  297.045893] DR3:  DR6: fffe0ff0 DR7: 
> > 0400
> > [  297.047109] Call Trace:
> > [  297.047545]  [] netdev_has_upper_dev_all_rcu+0x18/0x20
> > [  297.048691]  [] is_eth_port_of_netdev+0x2f/0xa0 
> > [ib_core]
> > [  297.049886]  [] ? 
> > is_eth_active_slave_of_bonding_rcu+0x70/0x70 [ib_core]
> > ...
> > 
> > This occurs because vmw_pvrdma on probe stores a pointer to the netdev
> > that exists on function 0 of the same bus/device/slot (which represents
> > the vmxnet3 ethernet driver).  However, it never removes this pointer if
> > the vmxnet3 module is removed, leading to crashes resulting from use
> > after free dereferencing incidents like the one above.
> > 
> > The fix is pretty straightforward.  vmw_pvrdma should listen for
> > NETDEV_REGISTER and NETDEV_UNREGISTER events in its event listener code
> > block, and update the stored netdev pointer accordingly.  This solution
> > has been tested by myself and the reporter with successful results.
> > This fix also allows the pvrdma driver to find its underlying ethernet
> > device in the event that vmxnet3 is loaded after pvrdma, which it was
> > not able to do before.
> > 
> > Signed-off-by: Neil Horman 
> > Reported-by: ruq...@redhat.com
> > CC: Adit Ranadive 
> > CC: VMware PV-Drivers 
> > CC: Doug Ledford 
> > CC: Jason Gunthorpe 
> > CC: linux-kernel@vger.kernel.org
> >  .../infiniband/hw/vmw_pvrdma/pvrdma_main.c| 25 +--
> >  1 file changed, 23 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/infiniband/hw/vmw_pvrdma/pvrdma_main.c 
> > b/drivers/infiniband/hw/vmw_pvrdma/pvrdma_main.c
> > index 0be33a81bbe6..5b4782078a74 100644
> > +++ b/drivers/infiniband/hw/vmw_pvrdma/pvrdma_main.c
> > @@ -699,8 +699,12 @@ static int pvrdma_del_gid(const struct ib_gid_attr 
> > *attr, void **context)
> >  }
> >  
> >  static void pvrdma_netdevice_event_handle(struct pvrdma_dev *dev,
> > + struct net_device *ndev,
> >   unsigned long event)
> >  {
> > +   struct pci_dev *pdev_net;
> > +
> > +
> > switch (event) {
> > case NETDEV_REBOOT:
> > case NETDEV_DOWN:
> > @@ -718,6 +722,21 @@ static void pvrdma_netdevice_event_handle(struct 
> > pvrdma_dev *dev,
> > else
> > pvrdma_dispatch_event(dev, 1, IB_EVENT_PORT_ACTIVE);
> > break;
> > +   case NETDEV_UNREGISTER:
> > +   dev_put(dev->netdev);
> > +   dev->netdev = NULL;
> > +   break;
> > +   case NETDEV_REGISTER:
> > +   /* Paired vmxnet3 will have same bus, slot. But func will be 0 
> > */
> > +   pdev_net = pci_get_slot(dev->pdev->bus, 
> > PCI_DEVFN(PCI_SLOT(dev->pdev->devfn), 0));
> > +   if ((dev->netdev == NULL) && (pci_get_drvdata(pdev_net) == 
> > ndev)) {
> > +   /* this is our netdev */
> > +   dev->netdev = ndev;
> > +   dev_hold(ndev);
> > +   }
> > +   pci_dev_put(pdev_net);
> > +   break;
> > +
> > default:
> > dev_dbg(&dev->pdev->dev, "ignore netdevice ev

[PATCH] vmw_pvrdma: Release netdev when vmxnet3 module is removed

2018-06-28 Thread Neil Horman
On repeated module load/unload cycles, it's possible for the pvrdma
driver to encounter this crash:

...
[  297.032448] RIP: 0010:[]  []
netdev_walk_all_upper_dev_rcu+0x50/0xb0
[  297.034078] RSP: 0018:95087780bd08  EFLAGS: 00010286
[  297.034986] RAX:  RBX:  RCX: 95087a0c
[  297.036196] RDX: 95087a0c RSI: 839e44e0 RDI: 950835d0c000
[  297.037421] RBP: 95087780bd40 R08: 95087a0e0ea0 R09: abddacd03f8e0ea0
[  297.038636] R10: abddacd03f8e0ea0 R11: ef5901e9dbc0 R12: 95087a0c
[  297.039854] R13: 839e44e0 R14: 95087a0c R15: 950835d0c828
[  297.041071] FS:  () GS:95087fc0() 
knlGS:
[  297.042443] CS:  0010 DS:  ES:  CR0: 80050033
[  297.043429] CR2: ffe8 CR3: 7a652000 CR4: 003607f0
[  297.044674] DR0:  DR1:  DR2: 
[  297.045893] DR3:  DR6: fffe0ff0 DR7: 0400
[  297.047109] Call Trace:
[  297.047545]  [] netdev_has_upper_dev_all_rcu+0x18/0x20
[  297.048691]  [] is_eth_port_of_netdev+0x2f/0xa0 [ib_core]
[  297.049886]  [] ? 
is_eth_active_slave_of_bonding_rcu+0x70/0x70 [ib_core]
...

This occurs because vmw_pvrdma on probe stores a pointer to the netdev
that exists on function 0 of the same bus/device/slot (which represents
the vmxnet3 ethernet driver).  However, it never removes this pointer if
the vmxnet3 module is removed, leading to crashes resulting from use
after free dereferencing incidents like the one above.

The fix is pretty straightforward.  vmw_pvrdma should listen for
NETDEV_REGISTER and NETDEV_UNREGISTER events in its event listener code
block, and update the stored netdev pointer accordingly.  This solution
has been tested by myself and the reporter with successful results.
This fix also allows the pvrdma driver to find its underlying ethernet
device in the event that vmxnet3 is loaded after pvrdma, which it was
not able to do before.

Signed-off-by: Neil Horman 
Reported-by: ruq...@redhat.com
CC: Adit Ranadive 
CC: VMware PV-Drivers 
CC: Doug Ledford 
CC: Jason Gunthorpe 
CC: linux-kernel@vger.kernel.org
---
 .../infiniband/hw/vmw_pvrdma/pvrdma_main.c| 25 +--
 1 file changed, 23 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/vmw_pvrdma/pvrdma_main.c 
b/drivers/infiniband/hw/vmw_pvrdma/pvrdma_main.c
index 0be33a81bbe6..5b4782078a74 100644
--- a/drivers/infiniband/hw/vmw_pvrdma/pvrdma_main.c
+++ b/drivers/infiniband/hw/vmw_pvrdma/pvrdma_main.c
@@ -699,8 +699,12 @@ static int pvrdma_del_gid(const struct ib_gid_attr *attr, 
void **context)
 }
 
 static void pvrdma_netdevice_event_handle(struct pvrdma_dev *dev,
+ struct net_device *ndev,
  unsigned long event)
 {
+   struct pci_dev *pdev_net;
+
+
switch (event) {
case NETDEV_REBOOT:
case NETDEV_DOWN:
@@ -718,6 +722,21 @@ static void pvrdma_netdevice_event_handle(struct 
pvrdma_dev *dev,
else
pvrdma_dispatch_event(dev, 1, IB_EVENT_PORT_ACTIVE);
break;
+   case NETDEV_UNREGISTER:
+   dev_put(dev->netdev);
+   dev->netdev = NULL;
+   break;
+   case NETDEV_REGISTER:
+   /* Paired vmxnet3 will have same bus, slot. But func will be 0 
*/
+   pdev_net = pci_get_slot(dev->pdev->bus, 
PCI_DEVFN(PCI_SLOT(dev->pdev->devfn), 0));
+   if ((dev->netdev == NULL) && (pci_get_drvdata(pdev_net) == 
ndev)) {
+   /* this is our netdev */
+   dev->netdev = ndev;
+   dev_hold(ndev);
+   }
+   pci_dev_put(pdev_net);
+   break;
+
default:
dev_dbg(&dev->pdev->dev, "ignore netdevice event %ld on %s\n",
event, dev->ib_dev.name);
@@ -734,8 +753,9 @@ static void pvrdma_netdevice_event_work(struct work_struct 
*work)
 
mutex_lock(&pvrdma_device_list_lock);
list_for_each_entry(dev, &pvrdma_device_list, device_link) {
-   if (dev->netdev == netdev_work->event_netdev) {
-   pvrdma_netdevice_event_handle(dev, netdev_work->event);
+   if ((netdev_work->event == NETDEV_REGISTER) ||
+   (dev->netdev == netdev_work->event_netdev)) {
+   pvrdma_netdevice_event_handle(dev, 
netdev_work->event_netdev, netdev_work->event);
break;
}
}
@@ -962,6 +982,7 @@ static int pvrdma_pci_probe(struct pci_dev *pdev,
}
 
dev->netdev = pci_get_drvdata(pdev_net);
+   dev_hold(dev->netdev);
pci_dev_put(pdev_net);
if (!dev->netdev) {
dev_err(&pdev->dev, "failed to get vmxnet3 device\n");
-- 
2.17.1



Re: [PATCH v11 09/13] x86, sgx: basic routines for enclave page cache

2018-06-25 Thread Neil Horman
On Mon, Jun 25, 2018 at 12:21:22PM +0300, Jarkko Sakkinen wrote:
> On Wed, 2018-06-20 at 06:21 -0700, Sean Christopherson wrote:
> > On Fri, 2018-06-08 at 19:09 +0200, Jarkko Sakkinen wrote:
> > > SGX has a set of data structures to maintain information about the 
> > > enclaves
> > > and their security properties. BIOS reserves a fixed size region of
> > > physical memory for these structures by setting Processor Reserved Memory
> > > Range Registers (PRMRR). This memory area is called Enclave Page Cache
> > > (EPC).
> > > 
> > > This commit implements the basic routines to allocate and free pages from
> > > different EPC banks. There is also a swapper thread ksgxswapd for EPC 
> > > pages
> > > that gets woken up by sgx_alloc_page() when we run below the low 
> > > watermark.
> > > The swapper thread continues swapping pages up until it reaches the high
> > > watermark.
> > > 
> > > Each subsystem that uses SGX must provide a set of callbacks for EPC
> > > pages that are used to reclaim, block and write an EPC page. Kernel
> > > takes the responsibility of maintaining LRU cache for them.
> > > 
> > > Signed-off-by: Jarkko Sakkinen 
> > > ---
> > >  arch/x86/include/asm/sgx.h  |  67 +
> > >  arch/x86/include/asm/sgx_arch.h | 224 
> > >  arch/x86/kernel/cpu/intel_sgx.c | 443 +++-
> > >  3 files changed, 732 insertions(+), 2 deletions(-)
> > >  create mode 100644 arch/x86/include/asm/sgx_arch.h
> > 
> > ...
> > 
> > > +struct sgx_pcmd {
> > > + struct sgx_secinfo secinfo;
> > > + uint64_t enclave_id;
> > > + uint8_t reserved[40];
> > > + uint8_t mac[16];
> > > +};
> > 
> > sgx_pcmd has a 128-byte alignment requirement.  I think it's
> > worth specifying here as sgx_pcmd is small enough that it could
> > be put on the stack, e.g. by KVM when trapping and executing
> > ELD* on behalf of a guest VM.
> > 
> > In fact, it probably makes sense to add alignment attributes
> > to all SGX structs for self-documentation purposes, even though
> > many of them will never be allocated statically or on the stack.
> 
> I agree with this. It also documents stuff so that you don't have
> to look it up from the SDM.
> 
> Neil: this should also clear your concerns.
> 
Agreed
Neil

> /Jarkko
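
[Editor's note: a sketch of the annotation being agreed on here, assuming the field layout from the quoted patch; 128 bytes is the architectural alignment Sean cites.]

struct sgx_pcmd {
	struct sgx_secinfo secinfo;
	uint64_t enclave_id;
	uint8_t reserved[40];
	uint8_t mac[16];
} __attribute__((aligned(128)));	/* safe on the stack, e.g. in KVM's ELD* path */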


Re: [intel-sgx-kernel-dev] [PATCH v11 13/13] intel_sgx: in-kernel launch enclave

2018-06-21 Thread Neil Horman
On Thu, Jun 21, 2018 at 08:32:25AM -0400, Nathaniel McCallum wrote:
> On Wed, Jun 20, 2018 at 5:02 PM Sean Christopherson
>  wrote:
> >
> > On Wed, Jun 20, 2018 at 11:39:00AM -0700, Jethro Beekman wrote:
> > > On 2018-06-20 11:16, Jethro Beekman wrote:
> > > > > This last bit is also repeated in different words in Table 35-2 and
> > > > > Section 42.2.2. The MSRs are *not writable* before the write-lock bit
> > > > > itself is locked. Meaning the MSRs are either locked with Intel's key
> > > > > hash, or not locked at all.
> > >
> > > Actually, this might be a documentation bug. I have some test hardware 
> > > and I
> > > was able to configure the MSRs in the BIOS and then read the MSRs after 
> > > boot
> > > like this:
> > >
> > > MSR 0x3a 0x00040005
> > > MSR 0x8c 0x20180620
> > > MSR 0x8d 0x20180620
> > > MSR 0x8e 0x20180620
> > > MSR 0x8f 0x20180620
> > >
> > > Since this is not production hardware, it could also be a CPU bug of 
> > > course.
> > >
> > > If it is indeed possible to configure AND lock the MSR values to non-Intel
> > > values, I'm very much in favor of Nathaniels proposal to treat the launch
> > > enclave like any other firmware blob.
> >
> > It's not a CPU or documentation bug (though the latter is arguable).
> > SGX has an activation step that is triggered by doing a WRMSR(0x7a)
> > with bit 0 set.  Until SGX is activated, the SGX related bits in
> > IA32_FEATURE_CONTROL cannot be set, i.e. SGX can't be enabled.  But,
> > the LE hash MSRs are fully writable prior to activation, e.g. to
> > allow firmware to lock down the LE key with a non-Intel value.
> >
> > So yes, it's possible to lock the MSRs to a non-Intel value.  The
> > obvious caveat is that whatever blob is used to write the MSRs would
> > need be executed prior to activation.
> 
> This implies that it should be possible to create MSR activation (and
> an embedded launch enclave?) entirely as a UEFI module. The kernel
> would still get to manage who has access to /dev/sgx and other
> important non-cryptographic policy details. Users would still be able
> to control the cryptographic policy details (via BIOS Secure Boot
> configuration that exists today). Distributions could still control
> cryptographic policy details via signing of the UEFI module with their
> own Secure Boot key (or using something like shim). The UEFI module
> (and possibly the external launch enclave) could be distributed via
> linux-firmware.
> 
> Andy/Neil, does this work for you?
> 
I need some time to digest it.  Who, in your mind, is writing the UEFI module?
Is that the firmware vendor or the IHV?

Neil

> > As for the SDM, it's a documentation... omission?  SGX activation
> > is intentionally omitted from the SDM.  The intended usage model is
> > that firmware will always do the activation (if it wants SGX enabled),
> > i.e. post-firmware software will only ever "see" SGX as disabled or
> > in the fully activated state, and so the SDM doesn't describe SGX
> > behavior prior to activation.  I believe the activation process, or
> > at least what is required from firmware, is documented in the BIOS
> > writer's guide.
> >
> > > Jethro Beekman | Fortanix
> > >
> >
> >
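
[Editor's note: a rough sketch of the pre-activation MSR programming discussed in this thread. The 0x8C-0x8F indices are the IA32_SGXLEPUBKEYHASH registers from Jethro's dump; in practice this would run from firmware/UEFI before activation locks the MSRs, not from a booted kernel. Illustrative only.]

#include <linux/types.h>
#include <asm/msr.h>

/* hash[] holds the SHA-256 of the launch enclave signing key's modulus */
static void example_set_le_pubkey_hash(const u64 hash[4])
{
	int i;

	for (i = 0; i < 4; i++)
		wrmsrl(0x8c + i, hash[i]);	/* IA32_SGXLEPUBKEYHASH0..3 */
}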


Re: [PATCH v11 09/13] x86, sgx: basic routines for enclave page cache

2018-06-19 Thread Neil Horman
On Tue, Jun 19, 2018 at 05:57:53PM +0300, Jarkko Sakkinen wrote:
> On Fri, Jun 08, 2018 at 11:24:12AM -0700, Dave Hansen wrote:
> > On 06/08/2018 10:09 AM, Jarkko Sakkinen wrote:
> > > SGX has a set of data structures to maintain information about the 
> > > enclaves
> > > and their security properties. BIOS reserves a fixed size region of
> > > physical memory for these structures by setting Processor Reserved Memory
> > > Range Registers (PRMRR). This memory area is called Enclave Page Cache
> > > (EPC).
> > > 
> > > This commit implements the basic routines to allocate and free pages from
> > > different EPC banks. There is also a swapper thread ksgxswapd for EPC 
> > > pages
> > > that gets woken up by sgx_alloc_page() when we run below the low 
> > > watermark.
> > > The swapper thread continues swapping pages up until it reaches the high
> > > watermark.
> > 
> > Yay!  A new memory manager in arch-specific code.
> > 
> > > Each subsystem that uses SGX must provide a set of callbacks for EPC
> > > pages that are used to reclaim, block and write an EPC page. Kernel
> > > takes the responsibility of maintaining LRU cache for them.
> > 
> > What does a "subsystem that uses SGX" mean?  Do we have one of those
> > already?
> 
> Driver and KVM.
> 
> > > +struct sgx_secs {
> > > + uint64_t size;
> > > + uint64_t base;
> > > + uint32_t ssaframesize;
> > > + uint32_t miscselect;
> > > + uint8_t reserved1[SGX_SECS_RESERVED1_SIZE];
> > > + uint64_t attributes;
> > > + uint64_t xfrm;
> > > + uint32_t mrenclave[8];
> > > + uint8_t reserved2[SGX_SECS_RESERVED2_SIZE];
> > > + uint32_t mrsigner[8];
> > > + uint8_t reserved3[SGX_SECS_RESERVED3_SIZE];
> > > + uint16_t isvvprodid;
> > > + uint16_t isvsvn;
> > > + uint8_t reserved4[SGX_SECS_RESERVED4_SIZE];
> > > +};
> > 
> > This is a hardware structure, right?  Doesn't it need to be packed?
> 
> Everything is aligned properly in this struct.
> 
I don't think you can guarantee that.  I understand that the reserved sizes are
likely computed to turn those u8's into 32/64 byte regions, but the uint16_t
isvvprodid and isvsvn might get padded out to 32- or 64-bit boundaries depending
on the processor you build for.

And even so, it's susceptible to being
misaligned if a new version of the hardware adds or removes elements.  Adding a
packed attribute seems like a safe approach (or at least a no-op in the current
state).

Neil
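
[Editor's note: a sketch of the packed-attribute suggestion, with the field list abbreviated from the quoted patch. The 4096-byte check assumes the architectural one-page SECS size; a mismatch then fails at compile time instead of at EPC access time.]

struct sgx_secs {
	uint64_t size;
	uint64_t base;
	/* ... remaining fields exactly as in the quoted patch ... */
	uint16_t isvvprodid;
	uint16_t isvsvn;
	uint8_t reserved4[SGX_SECS_RESERVED4_SIZE];
} __packed;

/* let the compiler prove the layout matches the hardware page */
static_assert(sizeof(struct sgx_secs) == 4096, "SECS must be one EPC page");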


Re: [intel-sgx-kernel-dev] [PATCH v11 13/13] intel_sgx: in-kernel launch enclave

2018-06-19 Thread Neil Horman
On Mon, Jun 18, 2018 at 02:58:59PM -0700, Andy Lutomirski wrote:
> On Tue, Jun 12, 2018 at 10:45 AM Neil Horman  wrote:
> >
> > On Mon, Jun 11, 2018 at 09:55:29PM -0700, Andy Lutomirski wrote:
> > > On Mon, Jun 11, 2018 at 4:52 AM Neil Horman  wrote:
> > > >
> > > > On Sun, Jun 10, 2018 at 10:17:13PM -0700, Andy Lutomirski wrote:
> > > > > > On Jun 9, 2018, at 10:39 PM, Andy Lutomirski  
> > > > > > wrote:
> > > > > >
> > > > > > On Fri, Jun 8, 2018 at 10:32 AM Jarkko Sakkinen
> > > > > >  wrote:
> > > > > >>
> > > > > >> The Launch Enclave (LE) generates cryptographic launch tokens for 
> > > > > >> user
> > > > > >> enclaves. A launch token is used by EINIT to check whether the 
> > > > > >> enclave
> > > > > >> is authorized to launch or not. By having its own launch enclave, 
> > > > > >> Linux
> > > > > >> has full control of the enclave launch process.
> > > > > >>
> > > > > >> LE is wrapped into a user space proxy program that reads enclave
> > > > > >> signatures outputs launch tokens. The kernel-side glue code is
> > > > > >> implemented by using the user space helper framework.  The IPC 
> > > > > >> between
> > > > > >> the LE proxy program and kernel is handled with an anonymous inode.
> > > > > >>
> > > > > >> The commit also adds enclave signing tool that is used by kbuild to
> > > > > >> measure and sign the launch enclave. CONFIG_INTEL_SGX_SIGNING_KEY 
> > > > > >> points
> > > > > >> to a PEM-file for the 3072-bit RSA key that is used as the LE 
> > > > > >> public key
> > > > > >> pair. The default location is:
> > > > > >>
> > > > > >>  drivers/platform/x86/intel_sgx/sgx_signing_key.pem
> > > > > >>
> > > > > >> If the default key does not exist kbuild will generate a random 
> > > > > >> key and
> > > > > >> place it to this location. KBUILD_SGX_SIGN_PIN can be used to 
> > > > > >> specify
> > > > > >> the passphrase for the LE public key.
> > > > > >
> > > > > > It seems to me that it might be more useful to just commit a key 
> > > > > > pair
> > > > > > into the kernel.  As far as I know, there is no security whatsoever
> > > > > > gained by keeping the private key private, so why not make
> > > > > > reproducible builds easier by simply fixing the key?
> > > > >
> > > > > Having thought about this some more, I think that you should
> > > > > completely remove support for specifying a key. Provide a fixed key
> > > > > pair, hard code the cache, and call it a day. If you make the key
> > > > > configurable, every vendor that has any vendor keys (Debian, Ubuntu,
> > > > > Fedora, Red Hat, SuSE, Clear Linux, etc) will see that config option
> > > > > and set up their own key pair for no gain whatsoever.  Instead, it'll
> > > > > give some illusion of security and it'll slow down operations in a VM
> > > > > guest due to swapping out the values of the MSRs.  And, if the code to
> > > > > support a locked MSR that just happens to have the right value stays
> > > > > in the kernel, then we'll risk having vendors actually ship one
> > > > > distro's public key hash, and that will seriously suck.
> > > > >
> > > > If you hard code the key pair however, doesn't that imply that anyone 
> > > > can sign a
> > > > user space binary as a launch enclave, and potentially gain control of 
> > > > the token
> > > > granting process?
> > >
> > > Yes and no.
> > >
> > > First of all, the kernel driver shouldn't be allowing user code to
> > > launch a launch enclave regardless of signature.  I haven't gotten far
> > > enough in reviewing the code to see whether that's the case, but if
> > > it's not, it should be fixed before it's merged.
> > >
> > Ok, I agree with you here.
> >
> > > But keep in mind that control of the token granting process is not the
> > > same thing as control over the right to lau


Re: [intel-sgx-kernel-dev] [PATCH v11 13/13] intel_sgx: in-kernel launch enclave

2018-06-12 Thread Neil Horman
On Mon, Jun 11, 2018 at 09:55:29PM -0700, Andy Lutomirski wrote:
> On Mon, Jun 11, 2018 at 4:52 AM Neil Horman  wrote:
> >
> > On Sun, Jun 10, 2018 at 10:17:13PM -0700, Andy Lutomirski wrote:
> > > > On Jun 9, 2018, at 10:39 PM, Andy Lutomirski  wrote:
> > > >
> > > > On Fri, Jun 8, 2018 at 10:32 AM Jarkko Sakkinen
> > > >  wrote:
> > > >>
> > > >> The Launch Enclave (LE) generates cryptographic launch tokens for user
> > > >> enclaves. A launch token is used by EINIT to check whether the enclave
> > > >> is authorized to launch or not. By having its own launch enclave, Linux
> > > >> has full control of the enclave launch process.
> > > >>
> > > >> LE is wrapped into a user space proxy program that reads enclave
> > > >> signatures outputs launch tokens. The kernel-side glue code is
> > > >> implemented by using the user space helper framework.  The IPC between
> > > >> the LE proxy program and kernel is handled with an anonymous inode.
> > > >>
> > > >> The commit also adds enclave signing tool that is used by kbuild to
> > > >> measure and sign the launch enclave. CONFIG_INTEL_SGX_SIGNING_KEY 
> > > >> points
> > > >> to a PEM-file for the 3072-bit RSA key that is used as the LE public 
> > > >> key
> > > >> pair. The default location is:
> > > >>
> > > >>  drivers/platform/x86/intel_sgx/sgx_signing_key.pem
> > > >>
> > > >> If the default key does not exist kbuild will generate a random key and
> > > >> place it to this location. KBUILD_SGX_SIGN_PIN can be used to specify
> > > >> the passphrase for the LE public key.
> > > >
> > > > It seems to me that it might be more useful to just commit a key pair
> > > > into the kernel.  As far as I know, there is no security whatsoever
> > > > gained by keeping the private key private, so why not make
> > > > reproducible builds easier by simply fixing the key?
> > >
> > > Having thought about this some more, I think that you should
> > > completely remove support for specifying a key. Provide a fixed key
> > > pair, hard code the cache, and call it a day. If you make the key
> > > configurable, every vendor that has any vendor keys (Debian, Ubuntu,
> > > Fedora, Red Hat, SuSE, Clear Linux, etc) will see that config option
> > > and set up their own key pair for no gain whatsoever.  Instead, it'll
> > > give some illusion of security and it'll slow down operations in a VM
> > > guest due to swapping out the values of the MSRs.  And, if the code to
> > > support a locked MSR that just happens to have the right value stays
> > > in the kernel, then we'll risk having vendors actually ship one
> > > distro's public key hash, and that will seriously suck.
> > >
> > If you hard code the key pair however, doesn't that imply that anyone can 
> > sign a
> > user space binary as a launch enclave, and potentially gain control of the 
> > token
> > granting process?
> 
> Yes and no.
> 
> First of all, the kernel driver shouldn't be allowing user code to
> launch a launch enclave regardless of signature.  I haven't gotten far
> enough in reviewing the code to see whether that's the case, but if
> it's not, it should be fixed before it's merged.
> 
Ok, I agree with you here.

> But keep in mind that control of the token granting process is not the
> same thing as control over the right to launch an enclave.  On systems
> without the LE hash MSRs, Intel controls the token granting process
> and, barring some attack, an enclave that isn't blessed by Intel can't
> be launched.  Support for that model will not be merged into upstream
> Linux.  But on systems that have the LE hash MSRs and leave them
> unlocked, there is effectively no hardware-enforced launch control.
> Instead we have good old kernel policy.  If a user wants to launch an
> enclave, they need to get the kernel to launch the enclave, and the
> kernel needs to apply its policy.  The patch here (the in-kernel
> launch enclave) has a wide-open policy.
> 

Right, I also agree here.  Systems without Flexible Launch Control are a
non-starter; we're only considering FLC systems here.

> So, as a practical matter, if every distro has their own LE key and
> keeps it totally safe, then a system that locks the MSRs to one
> distro's key makes it quite annoying to run another distro's intel_sgx
> driver, but there is no effect on the actual security 


Re: [intel-sgx-kernel-dev] [PATCH v11 13/13] intel_sgx: in-kernel launch enclave

2018-06-11 Thread Neil Horman
On Sun, Jun 10, 2018 at 10:17:13PM -0700, Andy Lutomirski wrote:
> > On Jun 9, 2018, at 10:39 PM, Andy Lutomirski  wrote:
> >
> > On Fri, Jun 8, 2018 at 10:32 AM Jarkko Sakkinen
> >  wrote:
> >>
> >> The Launch Enclave (LE) generates cryptographic launch tokens for user
> >> enclaves. A launch token is used by EINIT to check whether the enclave
> >> is authorized to launch or not. By having its own launch enclave, Linux
> >> has full control of the enclave launch process.
> >>
> >> LE is wrapped into a user space proxy program that reads enclave
> >> signatures outputs launch tokens. The kernel-side glue code is
> >> implemented by using the user space helper framework.  The IPC between
> >> the LE proxy program and kernel is handled with an anonymous inode.
> >>
> >> The commit also adds enclave signing tool that is used by kbuild to
> >> measure and sign the launch enclave. CONFIG_INTEL_SGX_SIGNING_KEY points
> >> to a PEM-file for the 3072-bit RSA key that is used as the LE public key
> >> pair. The default location is:
> >>
> >>  drivers/platform/x86/intel_sgx/sgx_signing_key.pem
> >>
> >> If the default key does not exist kbuild will generate a random key and
> >> place it to this location. KBUILD_SGX_SIGN_PIN can be used to specify
> >> the passphrase for the LE public key.
> >
> > It seems to me that it might be more useful to just commit a key pair
> > into the kernel.  As far as I know, there is no security whatsoever
> > gained by keeping the private key private, so why not make
> > reproducible builds easier by simply fixing the key?
> 
> Having thought about this some more, I think that you should
> completely remove support for specifying a key. Provide a fixed key
> pair, hard code the cache, and call it a day. If you make the key
> configurable, every vendor that has any vendor keys (Debian, Ubuntu,
> Fedora, Red Hat, SuSE, Clear Linux, etc) will see that config option
> and set up their own key pair for no gain whatsoever.  Instead, it'll
> give some illusion of security and it'll slow down operations in a VM
> guest due to swapping out the values of the MSRs.  And, if the code to
> support a locked MSR that just happens to have the right value stays
> in the kernel, then we'll risk having vendors actually ship one
> distro's public key hash, and that will seriously suck.
> 
If you hard-code the key pair, however, doesn't that imply that anyone can sign a
user space binary as a launch enclave, and potentially gain control of the token
granting process?  It was my understanding that the value of the key pair was
that the end user was guaranteed autonomy and security over which processes
could start enclaves.  By publishing a fixed key pair, it seems to remove that
ability.

What would be nicer (I think) would be the ability to specify both the public and
the private key at run time.  The use case here is not one in which a vendor or
OS distribution ships a key pair, but one in which a downstream user doesn't
want a vendor/OS distribution to have any cryptographic information installed on
their system.

Neil



Re: [PATCH v11 07/13] x86, sgx: detect Intel SGX

2018-06-11 Thread Neil Horman
On Fri, Jun 08, 2018 at 07:09:42PM +0200, Jarkko Sakkinen wrote:
> From: Sean Christopherson 
> 
> Intel(R) SGX is a set of CPU instructions that can be used by applications
> to set aside private regions of code and data. The code outside the enclave
> is disallowed to access the memory inside the enclave by the CPU access
> control.
> 
> This commit adds the check for SGX to arch/x86 and a new config option,
> INTEL_SGX_CORE. Exposes a boolean variable 'sgx_enabled' to query whether
> or not the SGX support is available.
> 
> Signed-off-by: Sean Christopherson 
> Reviewed-by: Jarkko Sakkinen 
> Tested-by: Jarkko Sakkinen 
> Signed-off-by: Jarkko Sakkinen 
> ---
>  arch/x86/Kconfig| 19 
>  arch/x86/include/asm/sgx.h  | 25 
>  arch/x86/include/asm/sgx_pr.h   | 20 +
>  arch/x86/kernel/cpu/Makefile|  1 +
>  arch/x86/kernel/cpu/intel_sgx.c | 53 +
>  5 files changed, 118 insertions(+)
>  create mode 100644 arch/x86/include/asm/sgx.h
>  create mode 100644 arch/x86/include/asm/sgx_pr.h
>  create mode 100644 arch/x86/kernel/cpu/intel_sgx.c
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index c07f492b871a..42015d5366ef 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1925,6 +1925,25 @@ config X86_INTEL_MEMORY_PROTECTION_KEYS
>  
> If unsure, say y.
>  
> +config INTEL_SGX_CORE
> + prompt "Intel SGX core functionality"
> + def_bool n
> + depends on X86_64 && CPU_SUP_INTEL
> + help
> + Intel Software Guard eXtensions (SGX) is a set of CPU instructions
> + that allows ring 3 applications to create enclaves; private regions
> + of memory that are protected, by hardware, from unauthorized access
> + and/or modification.
> +
> + This option enables kernel recognition of SGX, high-level management
> + of the Enclave Page Cache (EPC), tracking and writing of SGX Launch
> + Enclave Hash MSRs, and allows for virtualization of SGX via KVM. By
> + itself, this option does not provide SGX support to userspace.
> +
> + For details, see Documentation/x86/intel_sgx.rst
> +
> + If unsure, say N.
> +
>  config EFI
>   bool "EFI runtime service support"
>   depends on ACPI
> diff --git a/arch/x86/include/asm/sgx.h b/arch/x86/include/asm/sgx.h
> new file mode 100644
> index ..fa3e6e0eb8af
> --- /dev/null
> +++ b/arch/x86/include/asm/sgx.h
> @@ -0,0 +1,25 @@
> +// SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause)
> +// Copyright(c) 2016-18 Intel Corporation.
> +//
> +// Authors:
> +//
> +// Jarkko Sakkinen 
> +// Suresh Siddha 
> +// Sean Christopherson 
> +
> +#ifndef _ASM_X86_SGX_H
> +#define _ASM_X86_SGX_H
> +
> +#include 
> +
> +#define SGX_CPUID 0x12
> +
Agree with Dave, this can just be removed and you can use the feature macro from
cpuid.h instead.

Neil
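
[Editor's note: a sketch of the cpuid.h approach suggested here, assuming an X86_FEATURE_SGX bit (CPUID.(EAX=07H,ECX=0):EBX[2]) defined in asm/cpufeatures.h, in place of the patch's private SGX_CPUID constant.]

#include <asm/cpufeature.h>

static bool example_sgx_supported(void)
{
	/* set by the common CPUID scan at boot; no raw cpuid needed here */
	return boot_cpu_has(X86_FEATURE_SGX);
}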



Re: INFO: rcu detected stall in kfree_skbmem

2018-05-14 Thread Neil Horman
On Fri, May 11, 2018 at 12:00:38PM +0200, Dmitry Vyukov wrote:
> On Mon, Apr 30, 2018 at 8:09 PM, syzbot
>  wrote:
> > Hello,
> >
> > syzbot found the following crash on:
> >
> > HEAD commit:5d1365940a68 Merge
> > git://git.kernel.org/pub/scm/linux/kerne...
> > git tree:   net-next
> > console output: https://syzkaller.appspot.com/x/log.txt?id=5667997129637888
> > kernel config:
> > https://syzkaller.appspot.com/x/.config?id=-5947642240294114534
> > dashboard link: https://syzkaller.appspot.com/bug?extid=fc78715ba3b3257caf6a
> > compiler:   gcc (GCC) 8.0.1 20180413 (experimental)
> >
> > Unfortunately, I don't have any reproducer for this crash yet.
> 
> This looks sctp-related, +sctp maintainers.
> 
Looking at the entire trace, it appears that we are getting caught in the
kfree_skb that is triggered in enqueue_to_backlog, which occurs when our
rx backlog list grows over netdev_max_backlog packets.  That suggests to me that
whatever test(s) is/are causing this trace are queuing up a large number of
frames to be sent over the loopback interface, and they are never/rarely getting
received.  Looking higher up the stack, in the sctp_generate_heartbeat_event
function, we (in addition to the rcu_read_lock in sctp_v6_xmit) also hold the
socket lock during the entirety of the xmit operation.  Is it possible that we
are just enqueuing so many frames for xmit that we block progress of other
threads using the same socket for long enough to cross the RCU self-detected
stall boundary?  While it's not a fix per se, it might be a worthwhile test to
limit the number of frames we flush in a single pass.

Neil

> > IMPORTANT: if you fix the bug, please add the following tag to the commit:
> > Reported-by: syzbot+fc78715ba3b3257ca...@syzkaller.appspotmail.com
> >
> > INFO: rcu_sched self-detected stall on CPU
> > 1-...!: (1 GPs behind) idle=a3e/1/4611686018427387908
> > softirq=71980/71983 fqs=33
> >  (t=125000 jiffies g=39438 c=39437 q=958)
> > rcu_sched kthread starved for 124829 jiffies! g39438 c39437 f0x0
> > RCU_GP_WAIT_FQS(3) ->state=0x0 ->cpu=0
> > RCU grace-period kthread stack dump:
> > rcu_sched   R  running task23768 9  2 0x8000
> > Call Trace:
> >  context_switch kernel/sched/core.c:2848 [inline]
> >  __schedule+0x801/0x1e30 kernel/sched/core.c:3490
> >  schedule+0xef/0x430 kernel/sched/core.c:3549
> >  schedule_timeout+0x138/0x240 kernel/time/timer.c:1801
> >  rcu_gp_kthread+0x6b5/0x1940 kernel/rcu/tree.c:2231
> >  kthread+0x345/0x410 kernel/kthread.c:238
> >  ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:411
> > NMI backtrace for cpu 1
> > CPU: 1 PID: 20560 Comm: syz-executor4 Not tainted 4.16.0+ #1
> > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> > Google 01/01/2011
> > Call Trace:
> >  
> >  __dump_stack lib/dump_stack.c:77 [inline]
> >  dump_stack+0x1b9/0x294 lib/dump_stack.c:113
> >  nmi_cpu_backtrace.cold.4+0x19/0xce lib/nmi_backtrace.c:103
> >  nmi_trigger_cpumask_backtrace+0x151/0x192 lib/nmi_backtrace.c:62
> >  arch_trigger_cpumask_backtrace+0x14/0x20 arch/x86/kernel/apic/hw_nmi.c:38
> >  trigger_single_cpu_backtrace include/linux/nmi.h:156 [inline]
> >  rcu_dump_cpu_stacks+0x175/0x1c2 kernel/rcu/tree.c:1376
> >  print_cpu_stall kernel/rcu/tree.c:1525 [inline]
> >  check_cpu_stall.isra.61.cold.80+0x36c/0x59a kernel/rcu/tree.c:1593
> >  __rcu_pending kernel/rcu/tree.c:3356 [inline]
> >  rcu_pending kernel/rcu/tree.c:3401 [inline]
> >  rcu_check_callbacks+0x21b/0xad0 kernel/rcu/tree.c:2763
> >  update_process_times+0x2d/0x70 kernel/time/timer.c:1636
> >  tick_sched_handle+0x9f/0x180 kernel/time/tick-sched.c:173
> >  tick_sched_timer+0x45/0x130 kernel/time/tick-sched.c:1283
> >  __run_hrtimer kernel/time/hrtimer.c:1386 [inline]
> >  __hrtimer_run_queues+0x3e3/0x10a0 kernel/time/hrtimer.c:1448
> >  hrtimer_interrupt+0x286/0x650 kernel/time/hrtimer.c:1506
> >  local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1025 [inline]
> >  smp_apic_timer_interrupt+0x15d/0x710 arch/x86/kernel/apic/apic.c:1050
> >  apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:862
> > RIP: 0010:arch_local_irq_restore arch/x86/include/asm/paravirt.h:783
> > [inline]
> > RIP: 0010:kmem_cache_free+0xb3/0x2d0 mm/slab.c:3757
> > RSP: 0018:8801db105228 EFLAGS: 0282 ORIG_RAX: ff13
> > RAX: 0007 RBX: 8800b055c940 RCX: 11003b2345a5
> > RDX:  RSI: 8801d91a2d80 RDI: 0282
> > RBP: 8801db105248 R08: 8801d91a2cb8 R09: 0002
> > R10: 8801d91a2480 R11:  R12: 8801d9848e40
> > R13: 0282 R14: 85b7f27c R15: 
> >  kfree_skbmem+0x13c/0x210 net/core/skbuff.c:582
> >  __kfree_skb net/core/skbuff.c:642 [inline]
> >  kfree_skb+0x19d/0x560 net/core/skbuff.c:659
> >  enqueue_to_backlog+0x2fc/0xc90 net/core/dev.c:3968
> >  netif_rx_internal+0x14d/0xae0 net/core/dev.c:4181
> >  netif_rx+0xba/0x400 net/core/dev.c:4206
> > 
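
[Editor's note: an illustrative sketch of the "limit the frames we flush in a single pass" test Neil proposes above -- not the actual sctp code. The idea is to cap how much one pass transmits while the socket lock is held and defer the remainder, so other threads (and RCU) can make progress. All example_* names are hypothetical.]

#include <linux/skbuff.h>

#define EXAMPLE_FLUSH_BUDGET 64

struct example_queue {
	struct sk_buff_head head;
};

void example_xmit(struct sk_buff *skb);			/* hypothetical transmit */
void example_reschedule(struct example_queue *q);	/* finish in a later pass */

static void example_flush(struct example_queue *q)
{
	struct sk_buff *skb;
	int budget = EXAMPLE_FLUSH_BUDGET;

	/* transmit at most EXAMPLE_FLUSH_BUDGET frames per pass */
	while (budget-- > 0 && (skb = skb_dequeue(&q->head)) != NULL)
		example_xmit(skb);

	/* anything left over is handled later, off this lock hold */
	if (!skb_queue_empty(&q->head))
		example_reschedule(q);
}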

