RE: [PATCH net] tipc: eliminate possible recursive locking detected by LOCKDEP

2018-10-11 Thread Jon Maloy
Acked-by: Jon Maloy 

///jon


> -----Original Message-----
> From: Ying Xue 
> Sent: October 11, 2018 7:58 AM
> To: Jon Maloy ; dvyu...@google.com
> Cc: da...@davemloft.net; parthasarathy.bhuvara...@ericsson.com;
> netdev@vger.kernel.org; linux-ker...@vger.kernel.org; tipc-
> discuss...@lists.sourceforge.net
> Subject: [PATCH net] tipc: eliminate possible recursive locking detected by
> LOCKDEP
> 
> When booting a kernel with LOCKDEP enabled, the following warning was reported:
> 
> WARNING: possible recursive locking detected 4.19.0-rc7+ #14 Not tainted
> 
> swapper/0/1 is trying to acquire lock:
> dcfc0fc8 (&(&list->lock)->rlock#4){+...}, at: spin_lock_bh
> include/linux/spinlock.h:334 [inline]
> dcfc0fc8 (&(&list->lock)->rlock#4){+...}, at:
> tipc_link_reset+0x125/0xdf0 net/tipc/link.c:850
> 
> but task is already holding lock:
> cbb9b036 (&(&list->lock)->rlock#4){+...}, at: spin_lock_bh
> include/linux/spinlock.h:334 [inline]
> cbb9b036 (&(&list->lock)->rlock#4){+...}, at:
> tipc_link_reset+0xfa/0xdf0 net/tipc/link.c:849
> 
> other info that might help us debug this:
>  Possible unsafe locking scenario:
> 
>        CPU0
>        ----
>   lock(&(&list->lock)->rlock#4);
>   lock(&(&list->lock)->rlock#4);
> 
>  *** DEADLOCK ***
> 
>  May be due to missing lock nesting notation
> 
> 2 locks held by swapper/0/1:
>  #0: f7539d34 (pernet_ops_rwsem){+.+.}, at:
> register_pernet_subsys+0x19/0x40 net/core/net_namespace.c:1051
>  #1: cbb9b036 (&(&list->lock)->rlock#4){+...}, at:
> spin_lock_bh include/linux/spinlock.h:334 [inline]
>  #1: cbb9b036 (&(&list->lock)->rlock#4){+...}, at:
> tipc_link_reset+0xfa/0xdf0 net/tipc/link.c:849
> 
> stack backtrace:
> CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.19.0-rc7+ #14 Hardware name:
> QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014 Call Trace:
>  __dump_stack lib/dump_stack.c:77 [inline]
>  dump_stack+0x1af/0x295 lib/dump_stack.c:113  print_deadlock_bug
> kernel/locking/lockdep.c:1759 [inline]  check_deadlock
> kernel/locking/lockdep.c:1803 [inline]  validate_chain
> kernel/locking/lockdep.c:2399 [inline]
>  __lock_acquire+0xf1e/0x3c60 kernel/locking/lockdep.c:3411
>  lock_acquire+0x1db/0x520 kernel/locking/lockdep.c:3900
> __raw_spin_lock_bh include/linux/spinlock_api_smp.h:135 [inline]
>  _raw_spin_lock_bh+0x31/0x40 kernel/locking/spinlock.c:168  spin_lock_bh
> include/linux/spinlock.h:334 [inline]
>  tipc_link_reset+0x125/0xdf0 net/tipc/link.c:850
>  tipc_link_bc_create+0xb5/0x1f0 net/tipc/link.c:526
>  tipc_bcast_init+0x59b/0xab0 net/tipc/bcast.c:521
>  tipc_init_net+0x472/0x610 net/tipc/core.c:82
>  ops_init+0xf7/0x520 net/core/net_namespace.c:129
> __register_pernet_operations net/core/net_namespace.c:940 [inline]
>  register_pernet_operations+0x453/0xac0 net/core/net_namespace.c:1011
>  register_pernet_subsys+0x28/0x40 net/core/net_namespace.c:1052
>  tipc_init+0x83/0x104 net/tipc/core.c:140  do_one_initcall+0x109/0x70a
> init/main.c:885  do_initcall_level init/main.c:953 [inline]  do_initcalls
> init/main.c:961 [inline]  do_basic_setup init/main.c:979 [inline]
> kernel_init_freeable+0x4bd/0x57f init/main.c:1144
>  kernel_init+0x13/0x180 init/main.c:1063
>  ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:413
> 
> LOCKDEP complains because tipc_link_reset() takes l->wakeupq.lock and
> l->inputq->lock nested, and both are sk_buff_head locks of the same lock
> class. In fact it is unnecessary to hold both locks at the same time while
> moving skb buffers from the l->wakeupq queue to the l->inputq queue.
> Instead, we can first move the skb buffers from l->wakeupq to a temporary
> list and then move them from the temporary list to l->inputq, which is
> also safe.
> 
> Fixes: 3f32d0be6c16 ("tipc: lock wakeup & inputq at tipc_link_reset()")
> Reported-by: Dmitry Vyukov 
> Signed-off-by: Ying Xue 
> ---
>  net/tipc/link.c | 11 +++++++++--
>  1 file changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/net/tipc/link.c b/net/tipc/link.c
> index fb886b5..1d21ae4 100644
> --- a/net/tipc/link.c
> +++ b/net/tipc/link.c
> @@ -843,14 +843,21 @@ static void link_prepare_wakeup(struct tipc_link *l)
> 
>  void tipc_link_reset(struct tipc_link *l)
>  {
> +	struct sk_buff_head list;
> +
> +	__skb_queue_head_init(&list);
> +
>  	l->in_session = false;
>  	l->session++;
>  	l->mtu = l->advertised_mtu;
> +
>  	spin_lock_bh(&l->wakeupq.lock);
> +	skb_queue_splice_init(&l->wakeupq, &list);
> +	spin_unlock_bh(&l->wakeupq.lock);
> +
>  	spin_lock_bh(&l->inputq->lock);
> -	skb_queue_splice_init(&l->wakeupq, l->inputq);
> +	skb_queue_splice_init(&list, l->inputq);
>  	spin_unlock_bh(&l->inputq->lock);
> -	spin_unlock_bh(&l->wakeupq.lock);
> 
>  	__skb_queue_purge(&l->transmq);
>  	__skb_queue_purge(&l->deferdq);
> --
> 2.7.4
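
The two-step splice described in the patch can be sketched in user space roughly as follows. This is only an illustration under stated assumptions: struct buf, struct bufq and every bufq_* helper are hypothetical stand-ins for sk_buff, sk_buff_head and the skb queue API, and the spinlock is a bare C11 atomic_flag rather than spin_lock_bh(). The point it shows is that the two queue locks are never held at the same time.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Toy stand-ins for sk_buff and sk_buff_head; all names are hypothetical. */
struct buf {
    int id;
    struct buf *next;
};

struct bufq {
    atomic_flag lock;   /* minimal spinlock */
    struct buf *head;
    struct buf *tail;
    size_t len;
};

static void bufq_init(struct bufq *q)
{
    atomic_flag_clear(&q->lock);
    q->head = q->tail = NULL;
    q->len = 0;
}

static void bufq_lock(struct bufq *q)
{
    while (atomic_flag_test_and_set_explicit(&q->lock, memory_order_acquire))
        ;   /* spin until the holder clears the flag */
}

static void bufq_unlock(struct bufq *q)
{
    atomic_flag_clear_explicit(&q->lock, memory_order_release);
}

/* Append one buffer; the caller must hold q->lock or own q exclusively. */
static void bufq_add_tail(struct bufq *q, struct buf *b)
{
    b->next = NULL;
    if (q->tail)
        q->tail->next = b;
    else
        q->head = b;
    q->tail = b;
    q->len++;
}

/* Move everything from src to the front of dst and leave src empty.
 * Takes no locks itself; the caller decides which lock protects what. */
static void bufq_splice_init(struct bufq *src, struct bufq *dst)
{
    assert(src != dst);
    if (!src->head)
        return;
    src->tail->next = dst->head;
    if (!dst->tail)
        dst->tail = src->tail;
    dst->head = src->head;
    dst->len += src->len;
    src->head = src->tail = NULL;
    src->len = 0;
}

/* Drain wakeupq into inputq without ever nesting the two queue locks. */
static void reset_wakeups(struct bufq *wakeupq, struct bufq *inputq)
{
    struct bufq tmp;

    bufq_init(&tmp);    /* private to this call, so it needs no locking */

    bufq_lock(wakeupq);
    bufq_splice_init(wakeupq, &tmp);
    bufq_unlock(wakeupq);

    bufq_lock(inputq);
    bufq_splice_init(&tmp, inputq);
    bufq_unlock(inputq);
}
```

Because the temporary queue lives on the caller's stack, no other thread can reach it, which is why its own lock is never taken in this sketch.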



RE: net/tipc: recursive locking in tipc_link_reset

2018-10-11 Thread Jon Maloy
Hi Dmitry,
Yes, we are aware of this; the kernel test robot warned us about it a few
days ago.
I am looking into it.

///jon

> -----Original Message-----
> From: Dmitry Vyukov 
> Sent: October 11, 2018 3:55 AM
> To: parthasarathy.bhuvara...@ericsson.com; Jon Maloy
> ; David Miller ; Ying Xue
> ; netdev ; tipc-discuss...@lists.sourceforge.net; LKML
> Subject: net/tipc: recursive locking in tipc_link_reset
> 
> Hi,
> 
> I am getting the following error while booting the latest kernel on
> bb2d8f2f61047cbde08b78ec03e4ebdb01ee5434 (Oct 10). Config is attached.
> 
> Since this happens during boot, it makes LOCKDEP completely unusable: it
> prevents discovery of any other locking issues and masks all new bugs
> being introduced into the kernel.
> Please fix asap.
> Thanks
> 
> 
> WARNING: possible recursive locking detected 4.19.0-rc7+ #14 Not tainted
> 
> swapper/0/1 is trying to acquire lock:
> dcfc0fc8 (&(&list->lock)->rlock#4){+...}, at: spin_lock_bh
> include/linux/spinlock.h:334 [inline]
> dcfc0fc8 (&(&list->lock)->rlock#4){+...}, at:
> tipc_link_reset+0x125/0xdf0 net/tipc/link.c:850
> 
> but task is already holding lock:
> cbb9b036 (&(&list->lock)->rlock#4){+...}, at: spin_lock_bh
> include/linux/spinlock.h:334 [inline]
> cbb9b036 (&(&list->lock)->rlock#4){+...}, at:
> tipc_link_reset+0xfa/0xdf0 net/tipc/link.c:849
> 
> other info that might help us debug this:
>  Possible unsafe locking scenario:
> 
>        CPU0
>        ----
>   lock(&(&list->lock)->rlock#4);
>   lock(&(&list->lock)->rlock#4);
> 
>  *** DEADLOCK ***
> 
>  May be due to missing lock nesting notation
> 
> 2 locks held by swapper/0/1:
>  #0: f7539d34 (pernet_ops_rwsem){+.+.}, at:
> register_pernet_subsys+0x19/0x40 net/core/net_namespace.c:1051
>  #1: cbb9b036 (&(&list->lock)->rlock#4){+...}, at:
> spin_lock_bh include/linux/spinlock.h:334 [inline]
>  #1: cbb9b036 (&(&list->lock)->rlock#4){+...}, at:
> tipc_link_reset+0xfa/0xdf0 net/tipc/link.c:849
> 
> stack backtrace:
> CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.19.0-rc7+ #14 Hardware name:
> QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014 Call Trace:
>  __dump_stack lib/dump_stack.c:77 [inline]
>  dump_stack+0x1af/0x295 lib/dump_stack.c:113  print_deadlock_bug
> kernel/locking/lockdep.c:1759 [inline]  check_deadlock
> kernel/locking/lockdep.c:1803 [inline]  validate_chain
> kernel/locking/lockdep.c:2399 [inline]
>  __lock_acquire+0xf1e/0x3c60 kernel/locking/lockdep.c:3411
>  lock_acquire+0x1db/0x520 kernel/locking/lockdep.c:3900
> __raw_spin_lock_bh include/linux/spinlock_api_smp.h:135 [inline]
>  _raw_spin_lock_bh+0x31/0x40 kernel/locking/spinlock.c:168  spin_lock_bh
> include/linux/spinlock.h:334 [inline]
>  tipc_link_reset+0x125/0xdf0 net/tipc/link.c:850
>  tipc_link_bc_create+0xb5/0x1f0 net/tipc/link.c:526
>  tipc_bcast_init+0x59b/0xab0 net/tipc/bcast.c:521
>  tipc_init_net+0x472/0x610 net/tipc/core.c:82
>  ops_init+0xf7/0x520 net/core/net_namespace.c:129
> __register_pernet_operations net/core/net_namespace.c:940 [inline]
>  register_pernet_operations+0x453/0xac0 net/core/net_namespace.c:1011
>  register_pernet_subsys+0x28/0x40 net/core/net_namespace.c:1052
>  tipc_init+0x83/0x104 net/tipc/core.c:140  do_one_initcall+0x109/0x70a
> init/main.c:885  do_initcall_level init/main.c:953 [inline]  do_initcalls
> init/main.c:961 [inline]  do_basic_setup init/main.c:979 [inline]
> kernel_init_freeable+0x4bd/0x57f init/main.c:1144
>  kernel_init+0x13/0x180 init/main.c:1063
>  ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:413


Re: [PATCH net-next] virtio_net: force_napi_tx module param.

2018-07-24 Thread Jon Olson
On Tue, Jul 24, 2018 at 3:46 PM Michael S. Tsirkin  wrote:
>
> On Tue, Jul 24, 2018 at 06:31:54PM -0400, Willem de Bruijn wrote:
> > On Tue, Jul 24, 2018 at 6:23 PM Michael S. Tsirkin  wrote:
> > >
> > > On Tue, Jul 24, 2018 at 04:52:53PM -0400, Willem de Bruijn wrote:
> > > > > From the above linked patch, I understand that there are yet
> > > > other special cases in production, such as a hard cap on #tx queues to
> > > > 32 regardless of number of vcpus.
> > >
> > > I don't think upstream kernels have this limit - we can
> > > now use vmalloc for higher number of queues.
> >
> > Yes, that patch* mentioned it as a limit imposed by Google Compute
> > Engine. It is exactly such cloud-provider-imposed rules that I'm
> > concerned about working around in upstream drivers.
> >
> > * for reference, I mean https://patchwork.ozlabs.org/patch/725249/
>
> Yea. Why does GCE do it btw?

There are a few reasons for the limit, some historical, some current.

Historically we did this because of a kernel limit on the number of
TAP queues (in Montreal I thought this limit was 32). To my chagrin,
the limit upstream at the time we did it was actually eight. We had
increased the limit from eight to 32 internally, and it appears that
upstream has subsequently increased it to 256. We no longer
use TAP for networking, so that constraint no longer applies for us,
but when looking at removing/raising the limit we discovered no
workloads that clearly benefited from lifting it, and it also placed
more pressure on our virtual networking stack particularly on the Tx
side. We left it as-is.

In terms of current reasons there are really two. One is memory usage.
As you know, virtio-net uses rx/tx pairs, so there's an expectation
that the guest will have an Rx queue for every Tx queue. We run our
individual virtqueues fairly deep (4096 entries) to give guests a wide
time window for re-posting Rx buffers and avoiding starvation on
packet delivery. Filling an Rx vring with max-sized mergeable buffers
(4096 bytes) is 16MB of GFP_ATOMIC allocations. At 32 queues this can
be up to 512MB of memory posted for network buffers. Scaling this to
the largest VM GCE offers today (160 VCPUs -- n1-ultramem-160) keeping
all of the Rx rings full would (in the large average Rx packet size
case) consume up to 2.5 GB(!) of guest RAM. Now, those VMs have 3.8T
of RAM available, but I don't believe we've observed a situation where
they would have benefited from having 2.5 gigs of buffers posted for
incoming network traffic :)
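
As a rough check of the arithmetic above, here is a minimal sketch. It assumes one Rx ring per queue pair, a queue count equal to the vCPU count in the 160-VCPU case, and binary units; the macro and function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define RING_ENTRIES      4096u   /* virtqueue depth quoted above */
#define MERGEABLE_BUF_LEN 4096u   /* max-sized mergeable buffer, in bytes */
#define MIB               (1024u * 1024u)

/* Compile-time check of the per-ring figure: 4096 * 4096 bytes = 16 MiB. */
static_assert((uint64_t)RING_ENTRIES * MERGEABLE_BUF_LEN == 16u * MIB,
              "one full Rx ring should be 16 MiB");

/* Bytes posted when the Rx ring of every one of `queues` pairs is kept full. */
static uint64_t rx_posted_bytes(unsigned int queues)
{
    return (uint64_t)queues * RING_ENTRIES * MERGEABLE_BUF_LEN;
}
```

With these assumptions, rx_posted_bytes(32) is 512 MiB and rx_posted_bytes(160) is 2560 MiB, which matches the 512MB and 2.5 GB figures quoted.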

The second reason is interrupt related -- as I mentioned above, we
have found no workloads that clearly benefit from so many queues, but
we have found workloads that degrade. In particular workloads that do
a lot of small packet processing but which aren't extremely latency
sensitive can achieve higher PPS by taking fewer interrupts across
fewer VCPUs due to better batching (this also incurs higher latency,
but at the limit the "busy" cores end up suppressing most interrupts
and spending most of their cycles farming out work). Memcache is a
good example here, particularly if the latency targets for request
completion are in the ~milliseconds range (rather than the
microseconds we typically strive for with TCP_RR-style workloads).

All of that said, we haven't been forthcoming with data (and
unfortunately I don't have it handy in a useful form, otherwise I'd
simply post it here), so I understand the hesitation to simply run
with napi_tx across the board. As Willem said, this patch seemed like
the least disruptive way to allow us to continue down the road of
"universal" NAPI Tx and to hopefully get data across enough workloads
(with VMs small, large, and absurdly large :) to present a compelling
argument in one direction or another. As far as I know there aren't
currently any NAPI related ethtool commands (based on a quick perusal
of ethtool.h) -- it seems like it would be fairly involved/heavyweight
to plumb one solely for this unless NAPI Tx is something many users
will want to tune (and for which other drivers would support tuning).

--
Jon Olson


[PATCH net-next] ifb: fix packets checksum

2018-05-24 Thread Jon Maxwell
Fixup the checksum for CHECKSUM_COMPLETE when pulling skbs on RX path. 
Otherwise we get splats when tc mirred is used to redirect packets to ifb.

Before fix:

nic: hw csum failure

Signed-off-by: Jon Maxwell <jmaxwel...@gmail.com>
---
 drivers/net/ifb.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ifb.c b/drivers/net/ifb.c
index 5f2897ec0edc..d345c61d476c 100644
--- a/drivers/net/ifb.c
+++ b/drivers/net/ifb.c
@@ -102,7 +102,7 @@ static void ifb_ri_tasklet(unsigned long _txp)
 		if (!skb->tc_from_ingress) {
 			dev_queue_xmit(skb);
 		} else {
-			skb_pull(skb, skb->mac_len);
+			skb_pull_rcsum(skb, skb->mac_len);
 			netif_receive_skb(skb);
 		}
 	}
-- 
2.13.6
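
Why the plain pull leaves a stale checksum can be illustrated in user space. The sketch below is not the kernel implementation: struct toy_skb and all helpers are hypothetical, the checksum is a simple 16-bit one's complement sum over big-endian words, and only even pull lengths (such as a 14-byte Ethernet header) are handled. It assumes the stored checksum covers the data from the current data pointer onward, as with CHECKSUM_COMPLETE.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fold a 32-bit accumulator into a 16-bit one's complement sum. */
static uint16_t fold16(uint32_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffffu) + (sum >> 16);
    return (uint16_t)sum;
}

/* One's complement sum of len bytes taken as big-endian 16-bit words;
 * an odd trailing byte is padded with zero. Meant for short buffers. */
static uint16_t csum16(const uint8_t *p, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)p[i] << 8) | p[i + 1];
    if (i < len)
        sum += (uint32_t)p[i] << 8;
    return fold16(sum);
}

/* One's complement subtraction: a - b is a + ~b. */
static uint16_t csum_sub16(uint16_t a, uint16_t b)
{
    return fold16((uint32_t)a + (uint16_t)~b);
}

/* Hypothetical miniature of an skb carrying a full-packet checksum. */
struct toy_skb {
    const uint8_t *data;
    size_t len;
    uint16_t csum;   /* sum over data[0..len) */
};

/* Plain pull: advances data but leaves csum covering the old range. */
static void toy_pull(struct toy_skb *skb, size_t n)
{
    assert(n <= skb->len);
    skb->data += n;
    skb->len -= n;
}

/* Pull and subtract the checksum of the bytes being removed. */
static void toy_pull_rcsum(struct toy_skb *skb, size_t n)
{
    assert(n <= skb->len && n % 2 == 0);
    skb->csum = csum_sub16(skb->csum, csum16(skb->data, n));
    skb->data += n;
    skb->len -= n;
}
```

After toy_pull() the stored sum still includes the header bytes, so a later verification against the remaining data fails; after toy_pull_rcsum() it matches. Note that 0x0000 and 0xffff denote the same value in one's complement arithmetic, so a comparison on an all-zero payload would need to treat them as equal.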



RE: [PATCH v2] packet: track ring entry use using a shadow ring to prevent RX ring overrun

2018-05-23 Thread Jon Rosen (jrosen)


> -----Original Message-----
> From: Willem de Bruijn [mailto:willemdebruijn.ker...@gmail.com]
> Sent: Wednesday, May 23, 2018 9:37 AM
> To: Jon Rosen (jrosen) <jro...@cisco.com>
> Cc: David S. Miller <da...@davemloft.net>; Willem de Bruijn 
> <will...@google.com>; Eric Dumazet
> <eduma...@google.com>; Kees Cook <keesc...@chromium.org>; David Windsor 
> <dwind...@gmail.com>; Rosen,
> Rami <rami.ro...@intel.com>; Reshetova, Elena <elena.reshet...@intel.com>; 
> Mike Maloney
> <malo...@google.com>; Benjamin Poirier <bpoir...@suse.com>; Thomas Gleixner 
> <t...@linutronix.de>; Greg
> Kroah-Hartman <gre...@linuxfoundation.org>; open list:NETWORKING [GENERAL] 
> <netdev@vger.kernel.org>;
> open list <linux-ker...@vger.kernel.org>
> Subject: Re: [PATCH v2] packet: track ring entry use using a shadow ring to 
> prevent RX ring overrun
> 
> On Wed, May 23, 2018 at 7:54 AM, Jon Rosen (jrosen) <jro...@cisco.com> wrote:
> >> > For the ring, there is no requirement to allocate exactly the amount
> >> > specified by the user request. Safer than relying on shared memory
> >> > and simpler than the extra allocation in this patch would be to allocate
> >> > extra shadow memory at the end of the ring (and not mmap that).
> >> >
> >> > That still leaves an extra cold cacheline vs using tp_padding.
> >>
> >> Given my lack of experience and knowledge in writing kernel code
> >> it was easier for me to allocate the shadow ring as a separate
> >> structure.  Of course it's not about me and my skills so if it's
> >> more appropriate to allocate at the tail of the existing ring
> >> then certainly I can look at doing that.
> >
> > The memory for the ring is not one contiguous block, it's an array of
> > blocks of pages (or 'order' sized blocks of pages). I don't think
> > increasing the size of each of the blocks to provide storage would be
> > such a good idea as it will risk spilling over into the next order and
> > wasting lots of memory. I suspect it's also more complex than a single
> > shadow ring to do both the allocation and the access.
> >
> > It could be tacked onto the end of the pg_vec[] used to store the
> > pointers to the blocks. The challenge with that is that a pg_vec[] is
> > created for each of RX and TX rings so either it would have to
> > allocate unnecessary storage for TX or the caller will have to say if
> > extra space should be allocated or not.  E.g.:
> >
> > static struct pgv *alloc_pg_vec(struct tpacket_req *req, int order,
> >                                 int scratch, void **scratch_p)
> >
> > I'm not sure avoiding the extra allocation and moving it to the
> > pg_vec[] for the RX ring is going to get the simplification you were
> > hoping for.  Is there another way of storing the shadow ring which
> > I should consider?
> 
> I did indeed mean attaching extra pages to pg_vec[]. It should be
> simpler than a separate structure, but I may be wrong.

I don't think it would be too bad, it may actually turn out to be
convenient to implement.

> 
> Either way, I still would prefer to avoid the shadow buffer completely.
> It incurs complexity and cycle cost on all users because of only the
> rare (non-existent?) consumer that overwrites the padding bytes.

I prefer that as well.  I'm just not sure there is a bulletproof
solution without the shadow state.  I also wish it were only a
theoretical issue but unfortunately it is actually something our
customers have seen.
> 
> Perhaps we can use padding yet avoid deadlock by writing a
> timed value. The simplest would be jiffies >> N. Then only a
> process that writes this exact value would be subject to drops and
> then still only for a limited period.
> 
> Instead of depending on wall clock time, like jiffies, another option
> would be to keep a percpu array of values. Each cpu has a zero
> entry if it is not writing, nonzero if it is. If a writer encounters a
> number in padding that is > num_cpus, then the state is garbage
> from userspace. If <= num_cpus, it is adhered to only until that cpu
> clears its entry, which is guaranteed to happen eventually.
> 
> Just a quick thought. This might not fly at all upon closer scrutiny.

I'm not sure I understand the suggestion, but I'll think on it
some more.

Some other options maybe worth considering (in no specific order):
- test the application to see if it will consume entries if tp_status
  is set to anything other than TP_STATUS_USER, only use shadow if
  it doesn't strictly honor the TP_STATUS_USER bit.

- skip shadow if we see new TP_STATUS_USER_TO_KERNEL is used

- use tp_len == -1 to indicate inuse





RE: [PATCH v2] packet: track ring entry use using a shadow ring to prevent RX ring overrun

2018-05-23 Thread Jon Rosen (jrosen)
> > For the ring, there is no requirement to allocate exactly the amount
> > specified by the user request. Safer than relying on shared memory
> > and simpler than the extra allocation in this patch would be to allocate
> > extra shadow memory at the end of the ring (and not mmap that).
> >
> > That still leaves an extra cold cacheline vs using tp_padding.
> 
> Given my lack of experience and knowledge in writing kernel code
> it was easier for me to allocate the shadow ring as a separate
> structure.  Of course it's not about me and my skills so if it's
> more appropriate to allocate at the tail of the existing ring
> then certainly I can look at doing that.

The memory for the ring is not one contiguous block, it's an array of
blocks of pages (or 'order' sized blocks of pages). I don't think
increasing the size of each of the blocks to provide storage would be
such a good idea as it will risk spilling over into the next order and
wasting lots of memory. I suspect it's also more complex than a single
shadow ring to do both the allocation and the access.

It could be tacked onto the end of the pg_vec[] used to store the
pointers to the blocks. The challenge with that is that a pg_vec[] is
created for each of RX and TX rings so either it would have to
allocate unnecessary storage for TX or the caller will have to say if
extra space should be allocated or not.  E.g.:

static struct pgv *alloc_pg_vec(struct tpacket_req *req, int order,
                                int scratch, void **scratch_p)

I'm not sure avoiding the extra allocation and moving it to the
pg_vec[] for the RX ring is going to get the simplification you were
hoping for.  Is there another way of storing the shadow ring which
I should consider?


RE: [PATCH v2] packet: track ring entry use using a shadow ring to prevent RX ring overrun

2018-05-23 Thread Jon Rosen (jrosen)
> >>> I think the bigger issues as you've pointed out are the cost of
> >>> the additional spin lock and should the additional state be
> >>> stored in-band (fewer cache lines) or out-of band (less risk of
> >>> breaking due to unpredictable application behavior).
> >>
> >> We don't need the spinlock if clearing the shadow byte after
> >> setting the status to user.
> >>
> >> Worst case, user will set it back to kernel while the shadow
> >> byte is not cleared yet and the next producer will drop a packet.
> >> But next producers will make progress, so there is no deadlock
> >> or corruption.
> >
> > I thought so too for a while but after spending more time than I
> > care to admit I realized the following sequence was occurring:
> >
> >    Core A                       Core B
> >    ------                       ------
> >    - Enter spin_lock
> >    -   Get tp_status of head (X)
> >        tp_status == 0
> >    -   Check inuse
> >        inuse == 0
> >    -   Allocate entry X
> >        advance head (X+1)
> >        set inuse=1
> >    - Exit spin_lock
> >
> >      <...>
> >
> >      <... where N = size of ring>
> >
> >                                 - Enter spin_lock
> >                                 -   get tp_status of head (X+N)
> >                                     tp_status == 0 (but slot
> >                                     in use for X on core A)
> >
> >    - write tp_status of         <--- trouble!
> >      X = TP_STATUS_USER         <--- trouble!
> >    - write inuse=0              <--- trouble!
> >
> >                                 -   Check inuse
> >                                     inuse == 0
> >                                 -   Allocate entry X+N
> >                                     advance head (X+N+1)
> >                                     set inuse=1
> >                                 - Exit spin_lock
> >
> >
> > At this point Core A just passed slot X to userspace with a
> > packet and Core B has just been assigned slot X+N (same slot as
> > X) for its new packet. Both cores A and B end up filling in that
> > slot.  Tracking this down was one of the reasons it took me a
> > while to produce these updated diffs.
> 
> Is this not just an ordering issue? Since inuse is set after tp_status,
> it has to be tested first (and barriers are needed to avoid reordering).

I changed the code as you suggest to do the inuse check first and
removed the extra added spin_lock/unlock and it seems to be working.
I was able to run through the night without an issue (normally I would
hit the ring corruption in 1 to 2 hours).

Thanks for pointing that out, I should have caught that myself.  Next
I'll look at your suggestion for where to put the shadow ring.


RE: [PATCH v2] packet: track ring entry use using a shadow ring to prevent RX ring overrun

2018-05-22 Thread Jon Rosen (jrosen)
On Monday, May 21, 2018 2:17 PM, Jon Rosen (jrosen) <jro...@cisco.com> wrote:
> On Monday, May 21, 2018 1:07 PM, Willem de Bruijn
> <willemdebruijn.ker...@gmail.com> wrote:
>> On Mon, May 21, 2018 at 8:57 AM, Jon Rosen (jrosen) <jro...@cisco.com> wrote:

...snip...

>>
>> A setsockopt for userspace to signal a stricter interpretation of
>> tp_status to elide the shadow hack could then be considered.
>> It's not pretty. Either way, no full new version is required.
>>
>>> As much as I would like to find a solution that doesn't require
>>> the spin lock I have yet to do so. Maybe the answer is that
>>> existing applications will need to suffer the performance impact
>>> but a new version or option for TPACKET_V1/V2 could be added to
>>> indicate strict adherence of the TP_STATUS_USER bit and then the
>>> original diffs could be used.

It looks like adding new socket options is pretty rare so I
wonder if a better option might be to define a new TP_STATUS_XXX
bit which would signal from a userspace application to the kernel
that it strictly interprets the TP_STATUS_USER bit to determine
ownership.

Today's applications set tp_status = TP_STATUS_KERNEL(0) for the
kernel to pick up the entry.  We could define a new value to pass
ownership as well as one to indicate to other kernel threads that
an entry is inuse:

#define TP_STATUS_USER_TO_KERNEL  (1 << 8)
#define TP_STATUS_INUSE           (1 << 9)

If the kernel sees tp_status == TP_STATUS_KERNEL then it should
use the shadow method for tracking ownership. If it sees tp_status
== TP_STATUS_USER_TO_KERNEL then it can use the TP_STATUS_INUSE
method.
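
A minimal sketch of that decision, purely illustrative: TP_STATUS_USER_TO_KERNEL and TP_STATUS_INUSE are the proposed values from this mail and do not exist in the kernel headers, the constants are redefined locally instead of including linux/if_packet.h, and enum track_method and pick_tracking() are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define TP_STATUS_KERNEL          0u
#define TP_STATUS_USER            (1u << 0)
#define TP_STATUS_USER_TO_KERNEL  (1u << 8)   /* proposed, not upstream */
#define TP_STATUS_INUSE           (1u << 9)   /* proposed, not upstream */

/* The proposed bits must not collide with each other or with the user bit. */
static_assert((TP_STATUS_USER_TO_KERNEL & TP_STATUS_INUSE) == 0,
              "proposed bits must be distinct");
static_assert(((TP_STATUS_USER_TO_KERNEL | TP_STATUS_INUSE) & TP_STATUS_USER) == 0,
              "proposed bits must not overlap TP_STATUS_USER");

enum track_method {
    TRACK_NONE,        /* slot is not available to the kernel */
    TRACK_SHADOW,      /* legacy application: use the shadow ring */
    TRACK_INUSE_BIT    /* strict application: mark ownership in tp_status */
};

/* Choose how to track ownership from the status a slot was returned with. */
static enum track_method pick_tracking(uint32_t tp_status)
{
    if (tp_status == TP_STATUS_KERNEL)
        return TRACK_SHADOW;
    if (tp_status == TP_STATUS_USER_TO_KERNEL)
        return TRACK_INUSE_BIT;
    return TRACK_NONE;
}
```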

>>>
>>> There is another option I was considering but have yet to try
>>> which would avoid needing a shadow ring by using counter(s) to
>>> track maximum sequence number queued to userspace vs. the next
>>> sequence number to be allocated in the ring.  If the difference
>>> is greater than the size of the ring then the ring can be
>>> considered full and the allocation would fail. Of course this may
>>> create an additional hotspot between cores, not sure if that
>>> would be significant or not.
>>
>> Please do have a look, but I don't think that this will work in this
>> case in practice. It requires tracking the producer tail. Updating
>> the slowest writer requires probing each subsequent slot's status
>> byte to find the new tail, which is a lot of (by then cold) cacheline
>> reads.
>
> I've thought about it a little more and am not convinced it's
> workable but I'll spend a little more time on it before giving
> up.

I've given up on this method.  Just don't see how to make it work.



RE: [PATCH v2] packet: track ring entry use using a shadow ring to prevent RX ring overrun

2018-05-21 Thread Jon Rosen (jrosen)
On Monday, May 21, 2018 1:07 PM, Willem de Bruijn
<willemdebruijn.ker...@gmail.com> wrote:
>On Mon, May 21, 2018 at 8:57 AM, Jon Rosen (jrosen) <jro...@cisco.com> wrote:
>> On Sunday, May 20, 2018 7:22 PM, Willem de Bruijn
>> <willemdebruijn.ker...@gmail.com> wrote:
>>> On Sun, May 20, 2018 at 6:51 PM, Willem de Bruijn
>>> <willemdebruijn.ker...@gmail.com> wrote:
>>>> On Sat, May 19, 2018 at 8:07 AM, Jon Rosen <jro...@cisco.com> wrote:
>>>>> Fix PACKET_RX_RING bug for versions TPACKET_V1 and TPACKET_V2 which
>>>>> causes the ring to get corrupted by allowing multiple kernel threads
>>>>> to claim ownership of the same ring entry. Track ownership in a shadow
>>>>> ring structure to prevent other kernel threads from reusing the same
>>>>> entry before it's fully filled in, passed to user space, and then
>>>>> eventually passed back to the kernel for use with a new packet.
>>>>>
>>>>> Signed-off-by: Jon Rosen <jro...@cisco.com>
>>>>> ---
>>>>>
>>>>> There is a bug in net/packet/af_packet.c:tpacket_rcv in how it manages
>>>>> the PACKET_RX_RING for versions TPACKET_V1 and TPACKET_V2.  This bug makes
>>>>> it possible for multiple kernel threads to claim ownership of the same
>>>>> ring entry, corrupting the ring and the corresponding packet(s).
>>>>>
>>>>> These diffs are the second proposed solution; the previous proposal was described
>>>>> in https://www.mail-archive.com/netdev@vger.kernel.org/msg227468.html
>>>>> subject [RFC PATCH] packet: mark ring entry as in-use inside spin_lock
>>>>> to prevent RX ring overrun
>>>>>
>>>>> Those diffs would have changed the binary interface and have broken certain
>>>>> applications. Consensus was that such a change would be inappropriate.
>>>>>
>>>>> These new diffs use a shadow ring in kernel space for tracking intermediate
>>>>> state of an entry and prevent more than one kernel thread from simultaneously
>>>>> allocating a ring entry. This avoids any impact to the binary interface
>>>>> between kernel and userspace but comes at the additional cost of requiring a
>>>>> second spin_lock when passing ownership of a ring entry to userspace.
>>>>>
>>>>> Jon Rosen (1):
>>>>>   packet: track ring entry use using a shadow ring to prevent RX ring
>>>>> overrun
>>>>>
>>>>>  net/packet/af_packet.c | 64 
>>>>> ++
>>>>>  net/packet/internal.h  | 14 +++
>>>>>  2 files changed, 78 insertions(+)
>>>>>
>>>>
>>>>> @@ -2383,7 +2412,11 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
>>>>>  #endif
>>>>>
>>>>>         if (po->tp_version <= TPACKET_V2) {
>>>>> +               spin_lock(&sk->sk_receive_queue.lock);
>>>>>                 __packet_set_status(po, h.raw, status);
>>>>> +               packet_rx_shadow_release(rx_shadow_ring_entry);
>>>>> +               spin_unlock(&sk->sk_receive_queue.lock);
>>>>> +
>>>>>                 sk->sk_data_ready(sk);
>>>>
>>>> Thanks for continuing to look at this. I spent some time on it last time
>>>> around but got stuck, too.
>>>>
>>>> This version takes an extra spinlock in the hot path. That will be very
>>>> expensive. Once we need to accept that, we could opt for a simpler
>>>> implementation akin to the one discussed in the previous thread:
>>>>
>>>> stash a value in tp_padding or similar while tp_status remains
>>>> TP_STATUS_KERNEL to signal ownership to concurrent kernel
>>>> threads. The issue previously was that that field could not atomically
>>>> be cleared together with __packet_set_status. This is no longer
>>>> an issue when holding the queue lock.
>>>>
>>>> With a field like tp_padding, unlike tp_len, it is arguably also safe to
>>>> clear it after flipping status (userspace should treat it as undefined).
>>>>
>>>> With v1 tpacket_hdr, no explicit padding field is defined but due to
>>>>

RE: [PATCH v2] packet: track ring entry use using a shadow ring to prevent RX ring overrun

2018-05-21 Thread Jon Rosen (jrosen)
On Sunday, May 20, 2018 7:22 PM, Willem de Bruijn
<willemdebruijn.ker...@gmail.com> wrote:
> On Sun, May 20, 2018 at 6:51 PM, Willem de Bruijn
> <willemdebruijn.ker...@gmail.com> wrote:
>> On Sat, May 19, 2018 at 8:07 AM, Jon Rosen <jro...@cisco.com> wrote:
>>> Fix PACKET_RX_RING bug for versions TPACKET_V1 and TPACKET_V2 which
>>> causes the ring to get corrupted by allowing multiple kernel threads
>>> to claim ownership of the same ring entry. Track ownership in a shadow
>>> ring structure to prevent other kernel threads from reusing the same
>>> entry before it's fully filled in, passed to user space, and then
>>> eventually passed back to the kernel for use with a new packet.
>>>
>>> Signed-off-by: Jon Rosen <jro...@cisco.com>
>>> ---
>>>
>>> There is a bug in net/packet/af_packet.c:tpacket_rcv in how it manages
>>> the PACKET_RX_RING for versions TPACKET_V1 and TPACKET_V2.  This bug makes
>>> it possible for multiple kernel threads to claim ownership of the same
>>> ring entry, corrupting the ring and the corresponding packet(s).
>>>
>>> These diffs are the second proposed solution; the previous proposal was described
>>> in https://www.mail-archive.com/netdev@vger.kernel.org/msg227468.html
>>> subject [RFC PATCH] packet: mark ring entry as in-use inside spin_lock
>>> to prevent RX ring overrun
>>>
>>> Those diffs would have changed the binary interface and have broken certain
>>> applications. Consensus was that such a change would be inappropriate.
>>>
>>> These new diffs use a shadow ring in kernel space for tracking intermediate
>>> state of an entry and prevent more than one kernel thread from simultaneously
>>> allocating a ring entry. This avoids any impact to the binary interface
>>> between kernel and userspace but comes at the additional cost of requiring a
>>> second spin_lock when passing ownership of a ring entry to userspace.
>>>
>>> Jon Rosen (1):
>>>   packet: track ring entry use using a shadow ring to prevent RX ring
>>> overrun
>>>
>>>  net/packet/af_packet.c | 64 
>>> ++
>>>  net/packet/internal.h  | 14 +++
>>>  2 files changed, 78 insertions(+)
>>>
>>
>>> @@ -2383,7 +2412,11 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
>>>  #endif
>>>
>>>         if (po->tp_version <= TPACKET_V2) {
>>> +               spin_lock(&sk->sk_receive_queue.lock);
>>>                 __packet_set_status(po, h.raw, status);
>>> +               packet_rx_shadow_release(rx_shadow_ring_entry);
>>> +               spin_unlock(&sk->sk_receive_queue.lock);
>>> +
>>>                 sk->sk_data_ready(sk);
>>
>> Thanks for continuing to look at this. I spent some time on it last time
>> around but got stuck, too.
>>
>> This version takes an extra spinlock in the hot path. That will be very
>> expensive. Once we need to accept that, we could opt for a simpler
>> implementation akin to the one discussed in the previous thread:
>>
>> stash a value in tp_padding or similar while tp_status remains
>> TP_STATUS_KERNEL to signal ownership to concurrent kernel
>> threads. The issue previously was that that field could not atomically
>> be cleared together with __packet_set_status. This is no longer
>> an issue when holding the queue lock.
>>
>> With a field like tp_padding, unlike tp_len, it is arguably also safe to
>> clear it after flipping status (userspace should treat it as undefined).
>>
>> With v1 tpacket_hdr, no explicit padding field is defined but due to
>> TPACKET_HDRLEN alignment it exists on both 32 and 64 bit
>> platforms.
>>
>> The danger with using padding is that a process may write to it
>> and cause deadlock, of course. There is no logical reason for doing
>> so.
>
> For the ring, there is no requirement to allocate exactly the amount
> specified by the user request. Safer than relying on shared memory
> and simpler than the extra allocation in this patch would be to allocate
> extra shadow memory at the end of the ring (and not mmap that).
>
> That still leaves an extra cold cacheline vs using tp_padding.

Given my lack of experience and knowledge in writing kernel code
it was easier for me to allocate the shadow ring as a separate
structure.  Of course it's not about me and my skills so if it's
more appropriate to allocate at the tai

RE: [PATCH net-next] tipc: eliminate complaint of KMSAN uninit-value in tipc_conn_rcv_sub

2018-05-21 Thread Jon Maloy


> -Original Message-
> From: netdev-ow...@vger.kernel.org <netdev-ow...@vger.kernel.org>
> On Behalf Of David Miller
> Sent: Saturday, May 19, 2018 23:00
> To: ying@windriver.com
> Cc: netdev@vger.kernel.org; Jon Maloy <jon.ma...@ericsson.com>;
> syzkaller-b...@googlegroups.com; tipc-discuss...@lists.sourceforge.net
> Subject: Re: [PATCH net-next] tipc: eliminate complaint of KMSAN uninit-
> value in tipc_conn_rcv_sub
> 
> From: Ying Xue <ying@windriver.com>
> Date: Fri, 18 May 2018 19:50:55 +0800
> 
> > As variable s of struct tipc_subscr type is not initialized in
> > tipc_conn_rcv_from_sock() before it is used in tipc_conn_rcv_sub(),
> > KMSAN reported the following uninit-value type complaint:
> 
> I agree with others that the short read is the bug.
> 
> You need to decide what should happen if not a full tipc_subscr object is
> obtained from the sock_recvmsg() call.
> 
> Proceeding to pass it on to tipc_conn_rcv_sub() cannot possibly be correct.
> 
> You're not getting what you are expecting from the peer, the memset() you
> are adding doesn't change that.
> 
> And once you get this badly sized read, what does that do to the stream of
> subsequent recvmsg calls here?

This socket/connection is of type SOCK_SEQPACKET, so if anything like this
happens, it is an error, and the connection should be aborted.
///jon
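
The rule stated above (on a SOCK_SEQPACKET connection, anything other than
one complete record is an error) could be sketched in user space roughly as
follows. This is only an illustration under assumptions: recv_whole_record()
is a hypothetical helper, not a TIPC or kernel function, and the caller is
expected to close the socket when it returns -1.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/*
 * Hypothetical helper (not part of TIPC): read one record of exactly
 * 'len' bytes from a SOCK_SEQPACKET socket.
 *
 * Returns 0 if a record of exactly 'len' bytes was received.
 * Returns -1 on a receive error, an orderly shutdown, or a short record;
 * in the last two cases errno is set to EPROTO. The caller should then
 * abort the connection instead of parsing the buffer.
 *
 * Limitation of this sketch: a record longer than 'len' is silently
 * truncated by recv() and is not detected here.
 */
int recv_whole_record(int fd, void *buf, size_t len)
{
	ssize_t n;

	assert(buf != NULL && len > 0);

	n = recv(fd, buf, len, 0);
	if (n == (ssize_t)len)
		return 0;
	if (n >= 0)
		errno = EPROTO;	/* short record or peer shutdown */
	return -1;
}
```

A caller would pass the size of the expected message structure as 'len' and
tear the connection down on -1 rather than handing a partially filled buffer
to the next processing stage.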



RE: [RFC PATCH] packet: mark ring entry as in-use inside spin_lock to prevent RX ring overrun

2018-05-19 Thread Jon Rosen (jrosen)
Forward link to a new proposed patch at:
https://www.mail-archive.com/netdev@vger.kernel.org/msg236629.html



[PATCH v2] packet: track ring entry use using a shadow ring to prevent RX ring overrun

2018-05-19 Thread Jon Rosen
Fix PACKET_RX_RING bug for versions TPACKET_V1 and TPACKET_V2 which
causes the ring to get corrupted by allowing multiple kernel threads
to claim ownership of the same ring entry. Track ownership in a shadow
ring structure to prevent other kernel threads from reusing the same
entry before it's fully filled in, passed to user space, and then
eventually passed back to the kernel for use with a new packet.

Signed-off-by: Jon Rosen <jro...@cisco.com>
---

There is a bug in net/packet/af_packet.c:tpacket_rcv in how it manages
the PACKET_RX_RING for versions TPACKET_V1 and TPACKET_V2.  This bug makes
it possible for multiple kernel threads to claim ownership of the same
ring entry, corrupting the ring and the corresponding packet(s).

These diffs are the second proposed solution; the previous proposal was described
in https://www.mail-archive.com/netdev@vger.kernel.org/msg227468.html
subject [RFC PATCH] packet: mark ring entry as in-use inside spin_lock
to prevent RX ring overrun

Those diffs would have changed the binary interface and have broken certain
applications. Consensus was that such a change would be inappropriate.

These new diffs use a shadow ring in kernel space for tracking intermediate
state of an entry and prevent more than one kernel thread from simultaneously
allocating a ring entry. This avoids any impact to the binary interface
between kernel and userspace but comes at the additional cost of requiring a
second spin_lock when passing ownership of a ring entry to userspace.
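
As a rough illustration of the shadow-ring idea described above, here is a
single-threaded user-space model. It is a sketch under assumptions, not the
code in the patch: all model_* names are hypothetical, calloc() stands in for
the kernel allocation, and the locking that the patch relies on is omitted
(in the kernel, acquire and release both run with the receive queue lock held).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* One shadow entry per ring frame; only tracks kernel-side ownership. */
struct model_shadow_entry {
	bool inuse;
};

struct model_shadow_ring {
	struct model_shadow_entry *ring;
	unsigned int frame_nr;	/* number of frames in the real ring */
	unsigned int head;	/* next frame to hand out */
};

/* Allocate a zeroed shadow ring. Returns 0 on success, -1 on failure. */
int model_shadow_init(struct model_shadow_ring *r, unsigned int frame_nr)
{
	if (frame_nr == 0)
		return -1;
	r->ring = calloc(frame_nr, sizeof(*r->ring));
	if (!r->ring)
		return -1;
	r->frame_nr = frame_nr;
	r->head = 0;
	return 0;
}

/*
 * Claim the entry at head. Returns NULL if it is still owned by another
 * writer, which this model treats as "ring is full". On success the head
 * advances, so the next caller gets a different entry.
 */
struct model_shadow_entry *model_shadow_acquire_head(struct model_shadow_ring *r)
{
	struct model_shadow_entry *e;

	assert(r != NULL && r->ring != NULL);

	e = &r->ring[r->head];
	if (e->inuse)
		return NULL;
	e->inuse = true;
	r->head = (r->head + 1) % r->frame_nr;
	return e;
}

/* Give the entry back once the frame has been handed to user space. */
void model_shadow_release(struct model_shadow_entry *e)
{
	e->inuse = false;
}

void model_shadow_free(struct model_shadow_ring *r)
{
	free(r->ring);
	r->ring = NULL;
}
```

With a two-frame ring, two acquires succeed, a third returns NULL because the
head has wrapped onto an entry that is still in use, and an acquire after
releasing that entry succeeds again.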

Jon Rosen (1):
  packet: track ring entry use using a shadow ring to prevent RX ring
overrun

 net/packet/af_packet.c | 64 ++
 net/packet/internal.h  | 14 +++
 2 files changed, 78 insertions(+)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index e0f3f4a..4d08c8e 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -2165,6 +2165,26 @@ static int packet_rcv(struct sk_buff *skb, struct 
net_device *dev,
return 0;
 }
 
+static inline void *packet_rx_shadow_aquire_head(struct packet_sock *po)
+{
+   struct packet_ring_shadow_entry *entry;
+
+   entry = &po->rx_shadow.ring[po->rx_ring.head];
+   if (unlikely(entry->inuse))
+   return NULL;
+
+   entry->inuse = 1;
+   return (void *)entry;
+}
+
+static inline void packet_rx_shadow_release(void *_entry)
+{
+   struct packet_ring_shadow_entry *entry;
+
+   entry = (struct packet_ring_shadow_entry *)_entry;
+   entry->inuse = 0;
+}
+
 static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
   struct packet_type *pt, struct net_device *orig_dev)
 {
@@ -2182,6 +2202,7 @@ static int tpacket_rcv(struct sk_buff *skb, struct 
net_device *dev,
__u32 ts_status;
bool is_drop_n_account = false;
bool do_vnet = false;
+   void *rx_shadow_ring_entry = NULL;
 
/* struct tpacket{2,3}_hdr is aligned to a multiple of TPACKET_ALIGNMENT.
 * We may add members to them until current aligned size without forcing
@@ -2277,7 +2298,15 @@ static int tpacket_rcv(struct sk_buff *skb, struct 
net_device *dev,
if (!h.raw)
goto drop_n_account;
if (po->tp_version <= TPACKET_V2) {
+   /* Attempt to allocate shadow ring entry.
+* If already inuse then the ring is full.
+*/
+   rx_shadow_ring_entry = packet_rx_shadow_aquire_head(po);
+   if (unlikely(!rx_shadow_ring_entry))
+   goto ring_is_full;
+
packet_increment_rx_head(po, &po->rx_ring);
+
/*
 * LOSING will be reported till you read the stats,
 * because it's COR - Clear On Read.
@@ -2383,7 +2412,11 @@ static int tpacket_rcv(struct sk_buff *skb, struct 
net_device *dev,
 #endif
 
if (po->tp_version <= TPACKET_V2) {
+   spin_lock(&sk->sk_receive_queue.lock);
__packet_set_status(po, h.raw, status);
+   packet_rx_shadow_release(rx_shadow_ring_entry);
+   spin_unlock(&sk->sk_receive_queue.lock);
+
sk->sk_data_ready(sk);
} else {
prb_clear_blk_fill_status(&po->rx_ring);
@@ -4197,6 +4230,25 @@ static struct pgv *alloc_pg_vec(struct tpacket_req *req, 
int order)
goto out;
 }
 
+static struct packet_ring_shadow_entry *
+   packet_rx_shadow_alloc(unsigned int tp_frame_nr)
+{
+   struct packet_ring_shadow_entry *rx_shadow_ring;
+   int ring_size;
+   int i;
+
+   ring_size = tp_frame_nr * sizeof(*rx_shadow_ring);
+   rx_shadow_ring = kmalloc(ring_size, GFP_KERNEL);
+
+   if (!rx_shadow_ring)
+   return NULL;
+
+   for (i = 0; i < tp_frame_nr; i++)
+   rx_shadow_ring[i].inuse = 0;
+
+   return rx_shadow_ring;
+}
+
 static int packet_set_ring(struct sock *sk, union tpacket_req_u *req_u,

[iproute2-next v3 1/1] tipc: fixed node and name table listings

2018-05-17 Thread Jon Maloy
We make it easier for users to correlate between 128-bit node
identities and 32-bit node hash number by extending the 'node list'
command to also show the hash number.

We also improve the 'nametable show' command to show the node identity
instead of the node hash number. Since the former potentially is much
longer than the latter, we make room for it by eliminating the (to the
user) irrelevant publication key. We also reorder some of the columns so
that the node id comes last, since this looks nicer and is more logical.

---
v2: Fixed compiler warning as per comment from David Ahern
v3: Fixed leaking socket as per comment from David Ahern

Signed-off-by: Jon Maloy <jon.ma...@ericsson.com>
---
 tipc/misc.c  | 20 
 tipc/misc.h  |  1 +
 tipc/nametable.c | 18 ++
 tipc/node.c  | 19 ---
 tipc/peer.c  |  4 
 5 files changed, 43 insertions(+), 19 deletions(-)

diff --git a/tipc/misc.c b/tipc/misc.c
index 16849f1..e4b1cd0 100644
--- a/tipc/misc.c
+++ b/tipc/misc.c
@@ -13,6 +13,10 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
+#include 
 #include "misc.h"
 
 #define IN_RANGE(val, low, high) ((val) <= (high) && (val) >= (low))
@@ -109,3 +113,19 @@ void nodeid2str(uint8_t *id, char *str)
for (i = 31; str[i] == '0'; i--)
str[i] = 0;
 }
+
+void hash2nodestr(uint32_t hash, char *str)
+{
+   struct tipc_sioc_nodeid_req nr = {};
+   int sd;
+
+   sd = socket(AF_TIPC, SOCK_RDM, 0);
+   if (sd < 0) {
+   fprintf(stderr, "opening TIPC socket: %s\n", strerror(errno));
+   return;
+   }
+   nr.peer = hash;
+   if (!ioctl(sd, SIOCGETNODEID, &nr))
+   nodeid2str((uint8_t *)nr.node_id, str);
+   close(sd);
+}
diff --git a/tipc/misc.h b/tipc/misc.h
index 6e8afdd..ff2f31f 100644
--- a/tipc/misc.h
+++ b/tipc/misc.h
@@ -17,5 +17,6 @@
 uint32_t str2addr(char *str);
 int str2nodeid(char *str, uint8_t *id);
 void nodeid2str(uint8_t *id, char *str);
+void hash2nodestr(uint32_t hash, char *str);
 
 #endif
diff --git a/tipc/nametable.c b/tipc/nametable.c
index 2578940..ae73dfa 100644
--- a/tipc/nametable.c
+++ b/tipc/nametable.c
@@ -20,6 +20,7 @@
 #include "cmdl.h"
 #include "msg.h"
 #include "nametable.h"
+#include "misc.h"
 
 #define PORTID_STR_LEN 45 /* Four u32 and five delimiter chars */
 
@@ -31,6 +32,7 @@ static int nametable_show_cb(const struct nlmsghdr *nlh, void 
*data)
struct nlattr *attrs[TIPC_NLA_NAME_TABLE_MAX + 1] = {};
struct nlattr *publ[TIPC_NLA_PUBL_MAX + 1] = {};
const char *scope[] = { "", "zone", "cluster", "node" };
+   char str[33] = {0,};
 
mnl_attr_parse(nlh, sizeof(*genl), parse_attrs, info);
if (!info[TIPC_NLA_NAME_TABLE])
@@ -45,20 +47,20 @@ static int nametable_show_cb(const struct nlmsghdr *nlh, 
void *data)
return MNL_CB_ERROR;
 
if (!*iteration)
-   printf("%-10s %-10s %-10s %-10s %-10s %-10s\n",
-  "Type", "Lower", "Upper", "Node", "Port",
-  "Publication Scope");
+   printf("%-10s %-10s %-10s %-8s %-10s %-33s\n",
+  "Type", "Lower", "Upper", "Scope", "Port",
+  "Node");
(*iteration)++;
 
-   printf("%-10u %-10u %-10u %-10x %-10u %-12u",
+   hash2nodestr(mnl_attr_get_u32(publ[TIPC_NLA_PUBL_NODE]), str);
+
+   printf("%-10u %-10u %-10u %-8s %-10u %s\n",
   mnl_attr_get_u32(publ[TIPC_NLA_PUBL_TYPE]),
   mnl_attr_get_u32(publ[TIPC_NLA_PUBL_LOWER]),
   mnl_attr_get_u32(publ[TIPC_NLA_PUBL_UPPER]),
-  mnl_attr_get_u32(publ[TIPC_NLA_PUBL_NODE]),
+  scope[mnl_attr_get_u32(publ[TIPC_NLA_PUBL_SCOPE])],
   mnl_attr_get_u32(publ[TIPC_NLA_PUBL_REF]),
-  mnl_attr_get_u32(publ[TIPC_NLA_PUBL_KEY]));
-
-   printf("%s\n", scope[mnl_attr_get_u32(publ[TIPC_NLA_PUBL_SCOPE])]);
+  str);
 
return MNL_CB_OK;
 }
diff --git a/tipc/node.c b/tipc/node.c
index b73b644..0fa1064 100644
--- a/tipc/node.c
+++ b/tipc/node.c
@@ -26,10 +26,11 @@
 
 static int node_list_cb(const struct nlmsghdr *nlh, void *data)
 {
-   uint32_t addr;
struct genlmsghdr *genl = mnl_nlmsg_get_payload(nlh);
struct nlattr *info[TIPC_NLA_MAX + 1] = {};
struct nlattr *attrs[TIPC_NLA_NODE_MAX + 1] = {};
+   char str[33] = {};
+   uint32_t addr;
 
mnl_attr_parse(nlh, sizeof(*genl), parse_attrs, info);
if (!info[TIPC_NLA_NODE])
@@ -40,13 +41,12 @@ static int node_list_cb(const struct nlmsghdr *nlh, void 
*data)
return MN

[iproute2-next v2 1/1] tipc: fixed node and name table listings

2018-05-15 Thread Jon Maloy
We make it easier for users to correlate between 128-bit node
identities and 32-bit node hash number by extending the 'node list'
command to also show the hash number.

We also improve the 'nametable show' command to show the node identity
instead of the node hash number. Since the former potentially is much
longer than the latter, we make room for it by eliminating the (to the
user) irrelevant publication key. We also reorder some of the columns so
that the node id comes last, since this looks nicer and is more logical.

---
v2: Fixed compiler warning as per comment from David Ahern

Signed-off-by: Jon Maloy <jon.ma...@ericsson.com>
---
 tipc/misc.c  | 18 ++
 tipc/misc.h  |  1 +
 tipc/nametable.c | 18 ++
 tipc/node.c  | 19 ---
 tipc/peer.c  |  4 
 5 files changed, 41 insertions(+), 19 deletions(-)

diff --git a/tipc/misc.c b/tipc/misc.c
index 16849f1..e8b726f 100644
--- a/tipc/misc.c
+++ b/tipc/misc.c
@@ -13,6 +13,9 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
 #include "misc.h"
 
 #define IN_RANGE(val, low, high) ((val) <= (high) && (val) >= (low))
@@ -109,3 +112,18 @@ void nodeid2str(uint8_t *id, char *str)
for (i = 31; str[i] == '0'; i--)
str[i] = 0;
 }
+
+void hash2nodestr(uint32_t hash, char *str)
+{
+   struct tipc_sioc_nodeid_req nr = {};
+   int sd;
+
+   sd = socket(AF_TIPC, SOCK_RDM, 0);
+   if (sd < 0) {
+   fprintf(stderr, "opening TIPC socket: %s\n", strerror(errno));
+   return;
+   }
+   nr.peer = hash;
+   if (!ioctl(sd, SIOCGETNODEID, &nr))
+   nodeid2str((uint8_t *)nr.node_id, str);
+}
diff --git a/tipc/misc.h b/tipc/misc.h
index 6e8afdd..ff2f31f 100644
--- a/tipc/misc.h
+++ b/tipc/misc.h
@@ -17,5 +17,6 @@
 uint32_t str2addr(char *str);
 int str2nodeid(char *str, uint8_t *id);
 void nodeid2str(uint8_t *id, char *str);
+void hash2nodestr(uint32_t hash, char *str);
 
 #endif
diff --git a/tipc/nametable.c b/tipc/nametable.c
index 2578940..ae73dfa 100644
--- a/tipc/nametable.c
+++ b/tipc/nametable.c
@@ -20,6 +20,7 @@
 #include "cmdl.h"
 #include "msg.h"
 #include "nametable.h"
+#include "misc.h"
 
 #define PORTID_STR_LEN 45 /* Four u32 and five delimiter chars */
 
@@ -31,6 +32,7 @@ static int nametable_show_cb(const struct nlmsghdr *nlh, void 
*data)
struct nlattr *attrs[TIPC_NLA_NAME_TABLE_MAX + 1] = {};
struct nlattr *publ[TIPC_NLA_PUBL_MAX + 1] = {};
const char *scope[] = { "", "zone", "cluster", "node" };
+   char str[33] = {0,};
 
mnl_attr_parse(nlh, sizeof(*genl), parse_attrs, info);
if (!info[TIPC_NLA_NAME_TABLE])
@@ -45,20 +47,20 @@ static int nametable_show_cb(const struct nlmsghdr *nlh, 
void *data)
return MNL_CB_ERROR;
 
if (!*iteration)
-   printf("%-10s %-10s %-10s %-10s %-10s %-10s\n",
-  "Type", "Lower", "Upper", "Node", "Port",
-  "Publication Scope");
+   printf("%-10s %-10s %-10s %-8s %-10s %-33s\n",
+  "Type", "Lower", "Upper", "Scope", "Port",
+  "Node");
(*iteration)++;
 
-   printf("%-10u %-10u %-10u %-10x %-10u %-12u",
+   hash2nodestr(mnl_attr_get_u32(publ[TIPC_NLA_PUBL_NODE]), str);
+
+   printf("%-10u %-10u %-10u %-8s %-10u %s\n",
   mnl_attr_get_u32(publ[TIPC_NLA_PUBL_TYPE]),
   mnl_attr_get_u32(publ[TIPC_NLA_PUBL_LOWER]),
   mnl_attr_get_u32(publ[TIPC_NLA_PUBL_UPPER]),
-  mnl_attr_get_u32(publ[TIPC_NLA_PUBL_NODE]),
+  scope[mnl_attr_get_u32(publ[TIPC_NLA_PUBL_SCOPE])],
   mnl_attr_get_u32(publ[TIPC_NLA_PUBL_REF]),
-  mnl_attr_get_u32(publ[TIPC_NLA_PUBL_KEY]));
-
-   printf("%s\n", scope[mnl_attr_get_u32(publ[TIPC_NLA_PUBL_SCOPE])]);
+  str);
 
return MNL_CB_OK;
 }
diff --git a/tipc/node.c b/tipc/node.c
index b73b644..0fa1064 100644
--- a/tipc/node.c
+++ b/tipc/node.c
@@ -26,10 +26,11 @@
 
 static int node_list_cb(const struct nlmsghdr *nlh, void *data)
 {
-   uint32_t addr;
struct genlmsghdr *genl = mnl_nlmsg_get_payload(nlh);
struct nlattr *info[TIPC_NLA_MAX + 1] = {};
struct nlattr *attrs[TIPC_NLA_NODE_MAX + 1] = {};
+   char str[33] = {};
+   uint32_t addr;
 
mnl_attr_parse(nlh, sizeof(*genl), parse_attrs, info);
if (!info[TIPC_NLA_NODE])
@@ -40,13 +41,12 @@ static int node_list_cb(const struct nlmsghdr *nlh, void 
*data)
return MNL_CB_ERROR;
 
addr = mnl_attr_get_u32(attrs[TIPC_NLA_NODE_ADDR]);
-   printf("%x: ", 

[PATCH net-next v2] tcp: Add mark for TIMEWAIT sockets

2018-05-10 Thread Jon Maxwell
This version has some suggestions by Eric Dumazet:

- Use a local variable for the mark in IPv6 instead of ctl_sk to avoid SMP 
races. 
- Use the more elegant "IP4_REPLY_MARK(net, skb->mark) ?: sk->sk_mark"
statement. 
- Factorize code as sk_fullsock() check is not necessary.

Aidan McGurn from Openwave Mobility systems reported the following bug:

"Marked routing is broken on customer deployment. Its effects are large 
increase in Uplink retransmissions caused by the client never receiving 
the final ACK to their FINACK - this ACK misses the mark and routes out 
of the incorrect route."

Currently marks are added to sk_buffs for replies when the "fwmark_reflect"
sysctl is enabled, but not for TW sockets that had sk->sk_mark set via
setsockopt(SO_MARK..).

Fix this in IPv4/v6 by adding tw->tw_mark for TIME_WAIT sockets. Copy the
original sk->sk_mark in __inet_twsk_hashdance() to the new tw->tw_mark
location. Then propagate this so that the skb gets sent with the correct
mark. Do the same for resets. Give the "fwmark_reflect" sysctl precedence
over sk->sk_mark so that netfilter rules are still honored.
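
The precedence described above can be modelled in plain C as follows. This is
a sketch under assumptions and not kernel code: the model_* types and
functions are hypothetical, a boolean stands in for the "fwmark_reflect"
sysctl, and the portable conditional operator replaces the GNU "?:" shorthand
used in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum model_state {
	MODEL_FULL,		/* ordinary socket, mark lives in sk_mark */
	MODEL_TIME_WAIT		/* timewait socket, mark saved in tw_mark */
};

struct model_sock {
	enum model_state state;
	uint32_t sk_mark;
	uint32_t tw_mark;
};

/* Mark carried by the socket itself; 0 when there is no socket. */
uint32_t model_socket_mark(const struct model_sock *sk)
{
	if (!sk)
		return 0;

	switch (sk->state) {
	case MODEL_TIME_WAIT:
		return sk->tw_mark;
	case MODEL_FULL:
		return sk->sk_mark;
	}
	assert(!"unknown socket state");
	return 0;
}

/*
 * Mark for a reply (ACK or RST): a reflected, non-zero packet mark wins;
 * otherwise fall back to the mark stored with the socket.
 */
uint32_t model_reply_mark(const struct model_sock *sk, bool fwmark_reflect,
			  uint32_t skb_mark)
{
	uint32_t reflected = fwmark_reflect ? skb_mark : 0;

	return reflected ? reflected : model_socket_mark(sk);
}
```

For a TIME_WAIT socket with tw_mark 7, the reply mark is 7 when reflection is
off, and it is the incoming packet's mark when reflection is on and that mark
is non-zero.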

Signed-off-by: Jon Maxwell <jmaxwel...@gmail.com>
---
 include/net/inet_timewait_sock.h |  1 +
 net/ipv4/ip_output.c |  2 +-
 net/ipv4/tcp_ipv4.c  | 16 ++--
 net/ipv4/tcp_minisocks.c |  1 +
 net/ipv6/tcp_ipv6.c  |  6 +-
 5 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/include/net/inet_timewait_sock.h b/include/net/inet_timewait_sock.h
index c7be1ca8e562..659d8ed5a3bc 100644
--- a/include/net/inet_timewait_sock.h
+++ b/include/net/inet_timewait_sock.h
@@ -62,6 +62,7 @@ struct inet_timewait_sock {
 #define tw_dr  __tw_common.skc_tw_dr
 
int tw_timeout;
+   __u32   tw_mark;
volatile unsigned char  tw_substate;
unsigned char   tw_rcv_wscale;
 
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 95adb171f852..b5e21eb198d8 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1561,7 +1561,7 @@ void ip_send_unicast_reply(struct sock *sk, struct 
sk_buff *skb,
oif = skb->skb_iif;
 
flowi4_init_output(&fl4, oif,
-  IP4_REPLY_MARK(net, skb->mark),
+  IP4_REPLY_MARK(net, skb->mark) ?: sk->sk_mark,
   RT_TOS(arg->tos),
   RT_SCOPE_UNIVERSE, ip_hdr(skb)->protocol,
   ip_reply_arg_flowi_flags(arg),
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index f70586b50838..caf23de88f8a 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -621,6 +621,7 @@ static void tcp_v4_send_reset(const struct sock *sk, struct 
sk_buff *skb)
struct sock *sk1 = NULL;
 #endif
struct net *net;
+   struct sock *ctl_sk;
 
/* Never send a reset in response to a reset. */
if (th->rst)
@@ -723,11 +724,16 @@ static void tcp_v4_send_reset(const struct sock *sk, 
struct sk_buff *skb)
arg.tos = ip_hdr(skb)->tos;
arg.uid = sock_net_uid(net, sk && sk_fullsock(sk) ? sk : NULL);
local_bh_disable();
-   ip_send_unicast_reply(*this_cpu_ptr(net->ipv4.tcp_sk),
+   ctl_sk = *this_cpu_ptr(net->ipv4.tcp_sk);
+   if (sk)
+   ctl_sk->sk_mark = (sk->sk_state == TCP_TIME_WAIT) ?
+  inet_twsk(sk)->tw_mark : sk->sk_mark;
+   ip_send_unicast_reply(ctl_sk,
  skb, &TCP_SKB_CB(skb)->header.h4.opt,
  ip_hdr(skb)->saddr, ip_hdr(skb)->daddr,
  &arg, arg.iov[0].iov_len);
 
+   ctl_sk->sk_mark = 0;
__TCP_INC_STATS(net, TCP_MIB_OUTSEGS);
__TCP_INC_STATS(net, TCP_MIB_OUTRSTS);
local_bh_enable();
@@ -759,6 +765,7 @@ static void tcp_v4_send_ack(const struct sock *sk,
} rep;
struct net *net = sock_net(sk);
struct ip_reply_arg arg;
+   struct sock *ctl_sk;
 
memset(&rep.th, 0, sizeof(struct tcphdr));
memset(&arg, 0, sizeof(arg));
@@ -809,11 +816,16 @@ static void tcp_v4_send_ack(const struct sock *sk,
arg.tos = tos;
arg.uid = sock_net_uid(net, sk_fullsock(sk) ? sk : NULL);
local_bh_disable();
-   ip_send_unicast_reply(*this_cpu_ptr(net->ipv4.tcp_sk),
+   ctl_sk = *this_cpu_ptr(net->ipv4.tcp_sk);
+   if (sk)
+   ctl_sk->sk_mark = (sk->sk_state == TCP_TIME_WAIT) ?
+  inet_twsk(sk)->tw_mark : sk->sk_mark;
+   ip_send_unicast_reply(ctl_sk,
  skb, &TCP_SKB_CB(skb)->header.h4.opt,
  ip_hdr(skb)->saddr, ip_hdr(skb)->daddr,
  &arg, arg.iov[0].iov_len);
 
+   ctl_sk->s

[PATCH net-next v1] tcp: Add mark for TIMEWAIT sockets

2018-05-09 Thread Jon Maxwell
This version has some suggestions by Eric Dumazet:

- Use a local variable for the mark in IPv6 instead of ctl_sk to avoid SMP 
races. 
- Use the more elegant "IP4_REPLY_MARK(net, skb->mark) ?: sk->sk_mark"
statement. 

Aidan McGurn from Openwave Mobility systems reported the following bug:

"Marked routing is broken on customer deployment. Its effects are large 
increase in Uplink retransmissions caused by the client never receiving 
the final ACK to their FINACK - this ACK misses the mark and routes out 
of the incorrect route."

Currently marks are added to sk_buffs for replies when the "fwmark_reflect"
sysctl is enabled, but not for TW sockets that had sk->sk_mark set via
setsockopt(SO_MARK..).

Fix this in IPv4/v6 by adding tw->tw_mark for TIME_WAIT sockets. Copy the
original sk->sk_mark in __inet_twsk_hashdance() to the new tw->tw_mark
location. Then propagate this so that the skb gets sent with the correct
mark. Do the same for resets. Give the "fwmark_reflect" sysctl precedence
over sk->sk_mark so that netfilter rules are still honored.

Signed-off-by: Jon Maxwell <jmaxwel...@gmail.com>
---
 include/net/inet_timewait_sock.h |  1 +
 net/ipv4/ip_output.c |  2 +-
 net/ipv4/tcp_ipv4.c  | 18 --
 net/ipv4/tcp_minisocks.c |  1 +
 net/ipv6/tcp_ipv6.c  |  7 ++-
 5 files changed, 25 insertions(+), 4 deletions(-)

diff --git a/include/net/inet_timewait_sock.h b/include/net/inet_timewait_sock.h
index c7be1ca8e562..659d8ed5a3bc 100644
--- a/include/net/inet_timewait_sock.h
+++ b/include/net/inet_timewait_sock.h
@@ -62,6 +62,7 @@ struct inet_timewait_sock {
 #define tw_dr  __tw_common.skc_tw_dr
 
int tw_timeout;
+   __u32   tw_mark;
volatile unsigned char  tw_substate;
unsigned char   tw_rcv_wscale;
 
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 95adb171f852..b5e21eb198d8 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1561,7 +1561,7 @@ void ip_send_unicast_reply(struct sock *sk, struct 
sk_buff *skb,
oif = skb->skb_iif;
 
flowi4_init_output(&fl4, oif,
-  IP4_REPLY_MARK(net, skb->mark),
+  IP4_REPLY_MARK(net, skb->mark) ?: sk->sk_mark,
   RT_TOS(arg->tos),
   RT_SCOPE_UNIVERSE, ip_hdr(skb)->protocol,
   ip_reply_arg_flowi_flags(arg),
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index f70586b50838..fbee36579c83 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -621,6 +621,7 @@ static void tcp_v4_send_reset(const struct sock *sk, struct 
sk_buff *skb)
struct sock *sk1 = NULL;
 #endif
struct net *net;
+   struct sock *ctl_sk;
 
/* Never send a reset in response to a reset. */
if (th->rst)
@@ -723,11 +724,17 @@ static void tcp_v4_send_reset(const struct sock *sk, 
struct sk_buff *skb)
arg.tos = ip_hdr(skb)->tos;
arg.uid = sock_net_uid(net, sk && sk_fullsock(sk) ? sk : NULL);
local_bh_disable();
-   ip_send_unicast_reply(*this_cpu_ptr(net->ipv4.tcp_sk),
+   ctl_sk = *this_cpu_ptr(net->ipv4.tcp_sk);
+   if (sk && sk->sk_state == TCP_TIME_WAIT)
+   ctl_sk->sk_mark = inet_twsk(sk)->tw_mark;
+   else if (sk && sk_fullsock(sk))
+   ctl_sk->sk_mark = sk->sk_mark;
+   ip_send_unicast_reply(ctl_sk,
  skb, &TCP_SKB_CB(skb)->header.h4.opt,
  ip_hdr(skb)->saddr, ip_hdr(skb)->daddr,
  &arg, arg.iov[0].iov_len);
 
+   ctl_sk->sk_mark = 0;
__TCP_INC_STATS(net, TCP_MIB_OUTSEGS);
__TCP_INC_STATS(net, TCP_MIB_OUTRSTS);
local_bh_enable();
@@ -759,6 +766,7 @@ static void tcp_v4_send_ack(const struct sock *sk,
} rep;
struct net *net = sock_net(sk);
struct ip_reply_arg arg;
+   struct sock *ctl_sk;
 
memset(&rep.th, 0, sizeof(struct tcphdr));
memset(&arg, 0, sizeof(arg));
@@ -809,11 +817,17 @@ static void tcp_v4_send_ack(const struct sock *sk,
arg.tos = tos;
arg.uid = sock_net_uid(net, sk_fullsock(sk) ? sk : NULL);
local_bh_disable();
-   ip_send_unicast_reply(*this_cpu_ptr(net->ipv4.tcp_sk),
+   ctl_sk = *this_cpu_ptr(net->ipv4.tcp_sk);
+   if (sk && sk->sk_state == TCP_TIME_WAIT)
+   ctl_sk->sk_mark = inet_twsk(sk)->tw_mark;
+   else if (sk && sk_fullsock(sk))
+   ctl_sk->sk_mark = sk->sk_mark;
+   ip_send_unicast_reply(ctl_sk,
  skb, &TCP_SKB_CB(skb)->header.h4.opt,
  ip_hdr(skb)->saddr, ip_hdr(skb)->daddr,
   

[PATCH net-next] tcp: Add mark for TIMEWAIT sockets

2018-05-09 Thread Jon Maxwell
Aidan McGurn from Openwave Mobility systems reported the following bug:

"Marked routing is broken on customer deployment. Its effects are large 
increase in Uplink retransmissions caused by the client never receiving 
the final ACK to their FINACK - this ACK misses the mark and routes out 
of the incorrect route."

Currently marks are added to sk_buffs for replies when the "fwmark_reflect"
sysctl is enabled, but not for TIME_WAIT sockets where the original socket
had sk->sk_mark set via setsockopt(SO_MARK..).

Fix this in IPv4/v6 by adding tw->tw_mark for TIME_WAIT sockets. Copy the
original sk->sk_mark in __inet_twsk_hashdance() to the new tw->tw_mark
location. Then copy this into ctl_sk->sk_mark so that the skb gets sent with
the correct mark. Do the same for resets. Give the "fwmark_reflect" sysctl
precedence over sk->sk_mark so that netfilter rules are still honored.

Signed-off-by: Jon Maxwell <jmaxwel...@gmail.com>
---
 include/net/inet_timewait_sock.h |  1 +
 net/ipv4/ip_output.c |  3 ++-
 net/ipv4/tcp_ipv4.c  | 18 --
 net/ipv4/tcp_minisocks.c |  1 +
 net/ipv6/tcp_ipv6.c  |  8 +++-
 5 files changed, 27 insertions(+), 4 deletions(-)

diff --git a/include/net/inet_timewait_sock.h b/include/net/inet_timewait_sock.h
index c7be1ca8e562..659d8ed5a3bc 100644
--- a/include/net/inet_timewait_sock.h
+++ b/include/net/inet_timewait_sock.h
@@ -62,6 +62,7 @@ struct inet_timewait_sock {
 #define tw_dr  __tw_common.skc_tw_dr
 
int tw_timeout;
+   __u32   tw_mark;
volatile unsigned char  tw_substate;
unsigned char   tw_rcv_wscale;
 
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 95adb171f852..cca4412dc4cb 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1539,6 +1539,7 @@ void ip_send_unicast_reply(struct sock *sk, struct 
sk_buff *skb,
struct sk_buff *nskb;
int err;
int oif;
+   __u32 mark = IP4_REPLY_MARK(net, skb->mark);
 
if (__ip_options_echo(net, &replyopts.opt.opt, skb, sopt))
return;
@@ -1561,7 +1562,7 @@ void ip_send_unicast_reply(struct sock *sk, struct 
sk_buff *skb,
oif = skb->skb_iif;
 
flowi4_init_output(&fl4, oif,
-  IP4_REPLY_MARK(net, skb->mark),
+  mark ? (mark) : sk->sk_mark,
   RT_TOS(arg->tos),
   RT_SCOPE_UNIVERSE, ip_hdr(skb)->protocol,
   ip_reply_arg_flowi_flags(arg),
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index f70586b50838..fbee36579c83 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -621,6 +621,7 @@ static void tcp_v4_send_reset(const struct sock *sk, struct 
sk_buff *skb)
struct sock *sk1 = NULL;
 #endif
struct net *net;
+   struct sock *ctl_sk;
 
/* Never send a reset in response to a reset. */
if (th->rst)
@@ -723,11 +724,17 @@ static void tcp_v4_send_reset(const struct sock *sk, 
struct sk_buff *skb)
arg.tos = ip_hdr(skb)->tos;
arg.uid = sock_net_uid(net, sk && sk_fullsock(sk) ? sk : NULL);
local_bh_disable();
-   ip_send_unicast_reply(*this_cpu_ptr(net->ipv4.tcp_sk),
+   ctl_sk = *this_cpu_ptr(net->ipv4.tcp_sk);
+   if (sk && sk->sk_state == TCP_TIME_WAIT)
+   ctl_sk->sk_mark = inet_twsk(sk)->tw_mark;
+   else if (sk && sk_fullsock(sk))
+   ctl_sk->sk_mark = sk->sk_mark;
+   ip_send_unicast_reply(ctl_sk,
  skb, &TCP_SKB_CB(skb)->header.h4.opt,
  ip_hdr(skb)->saddr, ip_hdr(skb)->daddr,
  &arg, arg.iov[0].iov_len);
 
+   ctl_sk->sk_mark = 0;
__TCP_INC_STATS(net, TCP_MIB_OUTSEGS);
__TCP_INC_STATS(net, TCP_MIB_OUTRSTS);
local_bh_enable();
@@ -759,6 +766,7 @@ static void tcp_v4_send_ack(const struct sock *sk,
} rep;
struct net *net = sock_net(sk);
struct ip_reply_arg arg;
+   struct sock *ctl_sk;
 
memset(&rep.th, 0, sizeof(struct tcphdr));
memset(&arg, 0, sizeof(arg));
@@ -809,11 +817,17 @@ static void tcp_v4_send_ack(const struct sock *sk,
arg.tos = tos;
arg.uid = sock_net_uid(net, sk_fullsock(sk) ? sk : NULL);
local_bh_disable();
-   ip_send_unicast_reply(*this_cpu_ptr(net->ipv4.tcp_sk),
+   ctl_sk = *this_cpu_ptr(net->ipv4.tcp_sk);
+   if (sk && sk->sk_state == TCP_TIME_WAIT)
+   ctl_sk->sk_mark = inet_twsk(sk)->tw_mark;
+   else if (sk && sk_fullsock(sk))
+   ctl_sk->sk_mark = sk->sk_mark;
+   ip_send_unicast_reply(ctl_sk,
  skb, &TCP_SKB_CB(skb)->header.h4.opt,
   

RE: [PATCH net] tipc: fix one byte leak in tipc_sk_set_orig_addr()

2018-05-09 Thread Jon Maloy
Acked-by: Jon Maloy <jon.ma...@ericsson.com>

Thank you Eric.

> -Original Message-
> From: Eric Dumazet [mailto:eduma...@google.com]
> Sent: Wednesday, May 09, 2018 09:50
> To: David S . Miller <da...@davemloft.net>
> Cc: netdev <netdev@vger.kernel.org>; Eric Dumazet
> <eduma...@google.com>; Eric Dumazet <eric.duma...@gmail.com>; Jon
> Maloy <jon.ma...@ericsson.com>; Ying Xue <ying@windriver.com>
> Subject: [PATCH net] tipc: fix one byte leak in tipc_sk_set_orig_addr()
> 
> syzbot/KMSAN reported an uninit-value in recvmsg() that I tracked down to
> tipc_sk_set_orig_addr(), which was missing the srcaddr->member.scope
> initialization.
> 
> This patch moves the srcaddr->sock.scope init to follow field order and ease
> future verification.
> 
> BUG: KMSAN: uninit-value in copy_to_user include/linux/uaccess.h:184 [inline]
> BUG: KMSAN: uninit-value in move_addr_to_user+0x32e/0x530 net/socket.c:226
> CPU: 0 PID: 4549 Comm: syz-executor287 Not tainted 4.17.0-rc3+ #88
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:77 [inline]
>  dump_stack+0x185/0x1d0 lib/dump_stack.c:113
>  kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067
>  kmsan_internal_check_memory+0x135/0x1e0 mm/kmsan/kmsan.c:1157
>  kmsan_copy_to_user+0x69/0x160 mm/kmsan/kmsan.c:1199
>  copy_to_user include/linux/uaccess.h:184 [inline]
>  move_addr_to_user+0x32e/0x530 net/socket.c:226
>  ___sys_recvmsg+0x4e2/0x810 net/socket.c:2285
>  __sys_recvmsg net/socket.c:2328 [inline]
>  __do_sys_recvmsg net/socket.c:2338 [inline]
>  __se_sys_recvmsg net/socket.c:2335 [inline]
>  __x64_sys_recvmsg+0x325/0x460 net/socket.c:2335
>  do_syscall_64+0x154/0x220 arch/x86/entry/common.c:287
>  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> RIP: 0033:0x4455e9
> RSP: 002b:7fe3bd36ddb8 EFLAGS: 0246 ORIG_RAX:
> 002f
> RAX: ffda RBX: 006dac24 RCX: 004455e9
> RDX: 2002 RSI: 2400 RDI: 0003
> RBP: 006dac20 R08:  R09: 
> R10:  R11: 0246 R12: 
> R13: 7fff98ce4b6f R14: 7fe3bd36e9c0 R15: 0003
> 
> Local variable description: addr@___sys_recvmsg
> Variable was created at:
>  ___sys_recvmsg+0xd5/0x810 net/socket.c:2246
>  __sys_recvmsg net/socket.c:2328 [inline]
>  __do_sys_recvmsg net/socket.c:2338 [inline]
>  __se_sys_recvmsg net/socket.c:2335 [inline]
>  __x64_sys_recvmsg+0x325/0x460 net/socket.c:2335
> 
> Byte 19 of 32 is uninitialized
> 
> Fixes: 31c82a2d9d51 ("tipc: add second source address to recvmsg()/recvfrom()")
> Signed-off-by: Eric Dumazet <eduma...@google.com>
> Reported-by: syzbot <syzkal...@googlegroups.com>
> Cc: Jon Maloy <jon.ma...@ericsson.com>
> Cc: Ying Xue <ying@windriver.com>
> ---
>  net/tipc/socket.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/net/tipc/socket.c b/net/tipc/socket.c index
> 252a52ae0893261fc6f146ad8c59f375fdce..6be21575503aa532014e7aa141
> 5b2bf294757308 100644
> --- a/net/tipc/socket.c
> +++ b/net/tipc/socket.c
> @@ -1516,10 +1516,10 @@ static void tipc_sk_set_orig_addr(struct msghdr
> *m, struct sk_buff *skb)
> 
>   srcaddr->sock.family = AF_TIPC;
>   srcaddr->sock.addrtype = TIPC_ADDR_ID;
> + srcaddr->sock.scope = 0;
>   srcaddr->sock.addr.id.ref = msg_origport(hdr);
>   srcaddr->sock.addr.id.node = msg_orignode(hdr);
>   srcaddr->sock.addr.name.domain = 0;
> - srcaddr->sock.scope = 0;
>   m->msg_namelen = sizeof(struct sockaddr_tipc);
> 
>   if (!msg_in_group(hdr))
> @@ -1528,6 +1528,7 @@ static void tipc_sk_set_orig_addr(struct msghdr
> *m, struct sk_buff *skb)
>   /* Group message users may also want to know sending member's id */
>   srcaddr->member.family = AF_TIPC;
>   srcaddr->member.addrtype = TIPC_ADDR_NAME;
> + srcaddr->member.scope = 0;
>   srcaddr->member.addr.name.name.type = msg_nametype(hdr);
>   srcaddr->member.addr.name.name.instance = TIPC_SKB_CB(skb)-
> >orig_member;
>   srcaddr->member.addr.name.domain = 0;
> --
> 2.17.0.441.gb46fe60e1d-goog
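
A note on the "Byte 19 of 32" line in the report above: a small sketch,
assuming the reply buffer is a pair of 16-byte TIPC socket addresses
(sock followed by member), shows why byte 19 lands on member.scope. The
struct names below are hypothetical stand-ins that mirror the fields used
in the patch; they are not the kernel definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct sockaddr_tipc (assumed 16 bytes). */
struct sketch_sockaddr_tipc {
	uint16_t family;
	uint8_t  addrtype;
	int8_t   scope;
	union {
		struct { uint32_t ref, node; } id;
		struct { uint32_t type, lower, upper; } nameseq;
		struct {
			struct { uint32_t type, instance; } name;
			uint32_t domain;
		} name;
	} addr;
};

/* Hypothetical stand-in for the sock/member pair written by the patch. */
struct sketch_srcaddr {
	struct sketch_sockaddr_tipc sock;
	struct sketch_sockaddr_tipc member;
};

/* Byte offset of member.scope inside the pair. */
size_t member_scope_offset(void)
{
	/* The report speaks of 32 bytes in total. */
	assert(sizeof(struct sketch_srcaddr) == 32);
	return offsetof(struct sketch_srcaddr, member) +
	       offsetof(struct sketch_sockaddr_tipc, scope);
}
```

Under these assumptions the offset is 16 + 3, which matches the byte
KMSAN complained about and the srcaddr->member.scope assignment added by
the patch.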



[net-next 1/1] tipc: clean up removal of binding table items

2018-05-08 Thread Jon Maloy
In commit be47e41d77fb ("tipc: fix use-after-free in tipc_nametbl_stop")
we fixed a problem caused by premature release of service range items.

That fix is correct, and solved the problem. However, it doesn't address
the root of the problem, which is that we don't look up the tipc_service
 -> service_range -> publication items in the correct hierarchical
order.

In this commit we try to make this right, and as a side effect obtain
some code simplification.

Acked-by: Ying Xue <ying@windriver.com>
Signed-off-by: Jon Maloy <jon.ma...@ericsson.com>
---
 net/tipc/name_table.c | 103 ++
 1 file changed, 53 insertions(+), 50 deletions(-)

diff --git a/net/tipc/name_table.c b/net/tipc/name_table.c
index dd1c4fa..bebe88c 100644
--- a/net/tipc/name_table.c
+++ b/net/tipc/name_table.c
@@ -136,12 +136,12 @@ static struct tipc_service *tipc_service_create(u32 type, 
struct hlist_head *hd)
 }
 
 /**
- * tipc_service_find_range - find service range matching a service instance
+ * tipc_service_first_range - find first service range in tree matching 
instance
  *
  * Very time-critical, so binary search through range rb tree
  */
-static struct service_range *tipc_service_find_range(struct tipc_service *sc,
-u32 instance)
+static struct service_range *tipc_service_first_range(struct tipc_service *sc,
+ u32 instance)
 {
struct rb_node *n = sc->ranges.rb_node;
struct service_range *sr;
@@ -158,6 +158,30 @@ static struct service_range 
*tipc_service_find_range(struct tipc_service *sc,
return NULL;
 }
 
+/*  tipc_service_find_range - find service range matching publication 
parameters
+ */
+static struct service_range *tipc_service_find_range(struct tipc_service *sc,
+u32 lower, u32 upper)
+{
+   struct rb_node *n = sc->ranges.rb_node;
+   struct service_range *sr;
+
+   sr = tipc_service_first_range(sc, lower);
+   if (!sr)
+   return NULL;
+
+   /* Look for exact match */
+   for (n = &sr->tree_node; n; n = rb_next(n)) {
+   sr = container_of(n, struct service_range, tree_node);
+   if (sr->upper == upper)
+   break;
+   }
+   if (!n || sr->lower != lower || sr->upper != upper)
+   return NULL;
+
+   return sr;
+}
+
 static struct service_range *tipc_service_create_range(struct tipc_service *sc,
   u32 lower, u32 upper)
 {
@@ -238,54 +262,19 @@ static struct publication 
*tipc_service_insert_publ(struct net *net,
 /**
  * tipc_service_remove_publ - remove a publication from a service
  */
-static struct publication *tipc_service_remove_publ(struct net *net,
-   struct tipc_service *sc,
-   u32 lower, u32 upper,
-   u32 node, u32 key,
-   struct service_range **rng)
+static struct publication *tipc_service_remove_publ(struct service_range *sr,
+   u32 node, u32 key)
 {
-   struct tipc_subscription *sub, *tmp;
-   struct service_range *sr;
struct publication *p;
-   bool found = false;
-   bool last = false;
-   struct rb_node *n;
-
-   sr = tipc_service_find_range(sc, lower);
-   if (!sr)
-   return NULL;
 
-   /* Find exact matching service range */
-   for (n = >tree_node; n; n = rb_next(n)) {
-   sr = container_of(n, struct service_range, tree_node);
-   if (sr->upper == upper)
-   break;
-   }
-   if (!n || sr->lower != lower || sr->upper != upper)
-   return NULL;
-
-   /* Find publication, if it exists */
list_for_each_entry(p, >all_publ, all_publ) {
if (p->key != key || (node && node != p->node))
continue;
-   found = true;
-   break;
+   list_del(>all_publ);
+   list_del(>local_publ);
+   return p;
}
-   if (!found)
-   return NULL;
-
-   list_del(>all_publ);
-   list_del(>local_publ);
-   if (list_empty(>all_publ))
-   last = true;
-
-   /* Notify any waiting subscriptions */
-   list_for_each_entry_safe(sub, tmp, >subscriptions, service_list) {
-   tipc_sub_report_overlap(sub, p->lower, p->upper, TIPC_WITHDRAWN,
-   p->port, p->node, p->scope, last);
-   }
-   *rng = sr;
-   return p;
+   return NULL;
 }
 
 /**
@@ -376,17 +365,31 @@ struct publication
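
The two-step range lookup introduced above (find the first range that
covers 'lower', then walk forward to an exact lower/upper match) can be
sketched in user space as follows. This is only an illustration: a sorted
array and a linear scan stand in for the rb tree and its binary search,
and the names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct sketch_range {
	uint32_t lower;
	uint32_t upper;
};

/*
 * Return the entry whose bounds equal [lower, upper], or NULL.
 * The array is assumed to be sorted by lower, then by upper.
 */
const struct sketch_range *sketch_find_range(const struct sketch_range *r,
					     size_t n, uint32_t lower,
					     uint32_t upper)
{
	size_t i = 0;

	assert(r != NULL || n == 0);

	/* Step 1: first range that covers 'lower' */
	while (i < n && !(r[i].lower <= lower && lower <= r[i].upper))
		i++;

	/* Step 2: walk forward until the upper bound matches */
	for (; i < n; i++) {
		if (r[i].upper == upper)
			break;
	}

	/* Step 3: accept only an exact match on both bounds */
	if (i == n || r[i].lower != lower || r[i].upper != upper)
		return NULL;

	return &r[i];
}
```

With the range resolved first, removing a publication only needs the
range itself plus node and key, which is the simplification the commit
message refers to.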

[iproute2-next 1/1] tipc: fixed node and name table listings

2018-05-07 Thread Jon Maloy
We make it easier for users to correlate between 128-bit node
identities and 32-bit node hash by extending the 'node list'
command to also show the hash value.

We also improve the 'nametable show' command to show the node identity
instead of the node hash value. Since the former potentially is much
longer than the latter, we make room for it by eliminating the (to the
user) irrelevant publication key. We also reorder some of the columns
so that the node id comes last, since this looks nicer and more logical.

Signed-off-by: Jon Maloy <jon.ma...@ericsson.com>
---
 tipc/misc.c  | 18 ++
 tipc/misc.h  |  1 +
 tipc/nametable.c | 18 ++
 tipc/node.c  | 19 ---
 tipc/peer.c  |  4 
 5 files changed, 41 insertions(+), 19 deletions(-)

diff --git a/tipc/misc.c b/tipc/misc.c
index 16849f1..13dbaad 100644
--- a/tipc/misc.c
+++ b/tipc/misc.c
@@ -13,6 +13,9 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
 #include "misc.h"
 
 #define IN_RANGE(val, low, high) ((val) <= (high) && (val) >= (low))
@@ -109,3 +112,18 @@ void nodeid2str(uint8_t *id, char *str)
for (i = 31; str[i] == '0'; i--)
str[i] = 0;
 }
+
+void hash2nodestr(uint32_t hash, char *str)
+{
+   struct tipc_sioc_nodeid_req nr = {};
+   int sd;
+
+   sd = socket(AF_TIPC, SOCK_RDM, 0);
+   if (sd < 0) {
+   fprintf(stderr, "opening TIPC socket: %s\n", strerror(errno));
+   return;
+   }
+   nr.peer = hash;
+   if (!ioctl(sd, SIOCGETNODEID, &nr))
+   nodeid2str(nr.node_id, str);
+}
diff --git a/tipc/misc.h b/tipc/misc.h
index 6e8afdd..ff2f31f 100644
--- a/tipc/misc.h
+++ b/tipc/misc.h
@@ -17,5 +17,6 @@
 uint32_t str2addr(char *str);
 int str2nodeid(char *str, uint8_t *id);
 void nodeid2str(uint8_t *id, char *str);
+void hash2nodestr(uint32_t hash, char *str);
 
 #endif
diff --git a/tipc/nametable.c b/tipc/nametable.c
index 2578940..ae73dfa 100644
--- a/tipc/nametable.c
+++ b/tipc/nametable.c
@@ -20,6 +20,7 @@
 #include "cmdl.h"
 #include "msg.h"
 #include "nametable.h"
+#include "misc.h"
 
 #define PORTID_STR_LEN 45 /* Four u32 and five delimiter chars */
 
@@ -31,6 +32,7 @@ static int nametable_show_cb(const struct nlmsghdr *nlh, void 
*data)
struct nlattr *attrs[TIPC_NLA_NAME_TABLE_MAX + 1] = {};
struct nlattr *publ[TIPC_NLA_PUBL_MAX + 1] = {};
const char *scope[] = { "", "zone", "cluster", "node" };
+   char str[33] = {0,};
 
mnl_attr_parse(nlh, sizeof(*genl), parse_attrs, info);
if (!info[TIPC_NLA_NAME_TABLE])
@@ -45,20 +47,20 @@ static int nametable_show_cb(const struct nlmsghdr *nlh, 
void *data)
return MNL_CB_ERROR;
 
if (!*iteration)
-   printf("%-10s %-10s %-10s %-10s %-10s %-10s\n",
-  "Type", "Lower", "Upper", "Node", "Port",
-  "Publication Scope");
+   printf("%-10s %-10s %-10s %-8s %-10s %-33s\n",
+  "Type", "Lower", "Upper", "Scope", "Port",
+  "Node");
(*iteration)++;
 
-   printf("%-10u %-10u %-10u %-10x %-10u %-12u",
+   hash2nodestr(mnl_attr_get_u32(publ[TIPC_NLA_PUBL_NODE]), str);
+
+   printf("%-10u %-10u %-10u %-8s %-10u %s\n",
   mnl_attr_get_u32(publ[TIPC_NLA_PUBL_TYPE]),
   mnl_attr_get_u32(publ[TIPC_NLA_PUBL_LOWER]),
   mnl_attr_get_u32(publ[TIPC_NLA_PUBL_UPPER]),
-  mnl_attr_get_u32(publ[TIPC_NLA_PUBL_NODE]),
+  scope[mnl_attr_get_u32(publ[TIPC_NLA_PUBL_SCOPE])],
   mnl_attr_get_u32(publ[TIPC_NLA_PUBL_REF]),
-  mnl_attr_get_u32(publ[TIPC_NLA_PUBL_KEY]));
-
-   printf("%s\n", scope[mnl_attr_get_u32(publ[TIPC_NLA_PUBL_SCOPE])]);
+  str);
 
return MNL_CB_OK;
 }
diff --git a/tipc/node.c b/tipc/node.c
index b73b644..0fa1064 100644
--- a/tipc/node.c
+++ b/tipc/node.c
@@ -26,10 +26,11 @@
 
 static int node_list_cb(const struct nlmsghdr *nlh, void *data)
 {
-   uint32_t addr;
struct genlmsghdr *genl = mnl_nlmsg_get_payload(nlh);
struct nlattr *info[TIPC_NLA_MAX + 1] = {};
struct nlattr *attrs[TIPC_NLA_NODE_MAX + 1] = {};
+   char str[33] = {};
+   uint32_t addr;
 
mnl_attr_parse(nlh, sizeof(*genl), parse_attrs, info);
if (!info[TIPC_NLA_NODE])
@@ -40,13 +41,12 @@ static int node_list_cb(const struct nlmsghdr *nlh, void 
*data)
return MNL_CB_ERROR;
 
addr = mnl_attr_get_u32(attrs[TIPC_NLA_NODE_ADDR]);
-   printf("%x: ", addr);
-
+   hash2nodestr(addr, str);
+   printf("%-32s %08x ", str, a
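
For reference, the hex formatting that the 33-byte 'str' buffers above are
sized for can be sketched like this. It is modelled on the nodeid2str()
context lines in the misc.c hunk, with one assumption added: the trimming
loop is bounded so that an all-zero identity yields an empty string. The
function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define SKETCH_NODE_ID_LEN 16

/* Format a 16-byte node identity as hex and trim trailing '0' characters. */
void sketch_nodeid_to_str(const uint8_t *id, char *str)
{
	int i;

	assert(id != NULL && str != NULL);

	/* Two hex digits per byte; str must hold 2 * 16 + 1 bytes */
	for (i = 0; i < SKETCH_NODE_ID_LEN; i++)
		sprintf(&str[2 * i], "%02x", id[i]);

	/* Drop trailing zeros so that short identities print compactly */
	for (i = 2 * SKETCH_NODE_ID_LEN - 1; i >= 0 && str[i] == '0'; i--)
		str[i] = '\0';
}
```

Note that the trimming works per character, not per byte, so an identity
ending in byte 0x10 is printed with a trailing "1".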

RE: [PATCH net-next] flow_dissector: do not rely on implicit casts

2018-05-07 Thread Jon Maloy
Acked-by: Jon Maloy <jon.ma...@ericsson.com>


> -Original Message-
> From: netdev-ow...@vger.kernel.org [mailto:netdev-
> ow...@vger.kernel.org] On Behalf Of Paolo Abeni
> Sent: Monday, May 07, 2018 06:06
> To: netdev@vger.kernel.org
> Cc: David S. Miller <da...@davemloft.net>
> Subject: [PATCH net-next] flow_dissector: do not rely on implicit casts
> 
> This change fixes a couple of type mismatches reported by the sparse tool,
> explicitly using the requested type for the offending arguments.
> 
> Signed-off-by: Paolo Abeni <pab...@redhat.com>
> ---
>  include/net/tipc.h| 4 ++--
>  net/core/flow_dissector.c | 2 +-
>  2 files changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/include/net/tipc.h b/include/net/tipc.h index
> 07670ec022a7..f0e7e6bc1bef 100644
> --- a/include/net/tipc.h
> +++ b/include/net/tipc.h
> @@ -44,11 +44,11 @@ struct tipc_basic_hdr {
>   __be32 w[4];
>  };
> 
> -static inline u32 tipc_hdr_rps_key(struct tipc_basic_hdr *hdr)
> +static inline __be32 tipc_hdr_rps_key(struct tipc_basic_hdr *hdr)
>  {
>   u32 w0 = ntohl(hdr->w[0]);
>   bool keepalive_msg = (w0 & KEEPALIVE_MSG_MASK) ==
> KEEPALIVE_MSG_MASK;
> - int key;
> + __be32 key;
> 
>   /* Return source node identity as key */
>   if (likely(!keepalive_msg))
> diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c index
> 030d4ca177fb..4fc1e84d77ec 100644
> --- a/net/core/flow_dissector.c
> +++ b/net/core/flow_dissector.c
> @@ -1316,7 +1316,7 @@ u32 skb_get_poff(const struct sk_buff *skb)  {
>   struct flow_keys_basic keys;
> 
> - if (!skb_flow_dissect_flow_keys_basic(skb, , 0, 0, 0, 0, 0))
> + if (!skb_flow_dissect_flow_keys_basic(skb, , NULL, 0, 0, 0, 0))
>   return 0;
> 
>   return __skb_get_poff(skb, skb->data, , skb_headlen(skb));
> --
> 2.14.3



RE: KMSAN: uninit-value in strcmp

2018-05-03 Thread Jon Maloy


> -Original Message-
> From: David Miller [mailto:da...@davemloft.net]
> Sent: Thursday, May 03, 2018 15:22
> To: syzbot+df0257c92ffd4fcc5...@syzkaller.appspotmail.com
> Cc: Jon Maloy <jon.ma...@ericsson.com>; linux-ker...@vger.kernel.org;
> netdev@vger.kernel.org; syzkaller-b...@googlegroups.com; tipc-
> discuss...@lists.sourceforge.net; ying@windriver.com
> Subject: Re: KMSAN: uninit-value in strcmp
> 
> From: syzbot <syzbot+df0257c92ffd4fcc5...@syzkaller.appspotmail.com>
> Date: Thu, 03 May 2018 11:44:02 -0700
> 
> > Call Trace:
> >  __dump_stack lib/dump_stack.c:17 [inline]
> >  dump_stack+0x185/0x1d0 lib/dump_stack.c:53
> >  kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067
> >  __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:683
> >  strcmp+0xf7/0x160 lib/string.c:329
> >  tipc_nl_node_get_link+0x220/0x6f0 net/tipc/node.c:1881
> > genl_family_rcv_msg net/netlink/genetlink.c:599 [inline]
> 
> Hmmm, TIPC_NL_LINK_GET uses tipc_nl_policy, which has a proper nesting
> entry for TIPC_NLA_LINK.  I wonder how the code goes about validating
> TIPC_NLA_LINK_NAME in such a case?  Does it?

I assume that a strncmp() instead of a strcmp() would avert this particular 
crash, but it doesn't sound like that is what you are after here?
To be honest, I will need to study this code a little myself to understand if 
there is more that has to be done.

///jon

> 
> This may be the problem.


RE: [PATCH] tipc: fix a potential missing-check bug

2018-05-01 Thread Jon Maloy


> -Original Message-
> From: Wenwen Wang [mailto:wang6...@umn.edu]
> Sent: Tuesday, May 01, 2018 00:26
> To: Wenwen Wang <wang6...@umn.edu>
> Cc: Kangjie Lu <k...@umn.edu>; Jon Maloy <jon.ma...@ericsson.com>; Ying
> Xue <ying@windriver.com>; David S. Miller <da...@davemloft.net>;
> open list:TIPC NETWORK LAYER <netdev@vger.kernel.org>; open list:TIPC
> NETWORK LAYER <tipc-discuss...@lists.sourceforge.net>; open list  ker...@vger.kernel.org>
> Subject: [PATCH] tipc: fix a potential missing-check bug
> 
> In tipc_link_xmit(), the member field "len" of l->backlog[imp] must be less
> than the member field "limit" of l->backlog[imp] when imp is equal to
> TIPC_SYSTEM_IMPORTANCE. Otherwise, an error code, i.e., -ENOBUFS, is
> returned. This is enforced by the security check. However, at the end of
> tipc_link_xmit(), the length of "list" is added to l->backlog[imp].len without
> any further check. This can potentially cause unexpected values for
> l->backlog[imp].len. If imp is equal to TIPC_SYSTEM_IMPORTANCE and the
> original value of l->backlog[imp].len is less than l->backlog[imp].limit, 
> after
> this addition, l->backlog[imp] could be larger than
> l->backlog[imp].limit. 

It can, but only once. That is the intention behind allowing oversubscription; 
it is expected and permitted.
At the next send attempt, if the send queue has not been reduced in the 
meantime, the link will be reset, as intended.

> That means the security check can potentially be
> bypassed,  especially when an adversary can control the length of "list".

The length of 'list' is entirely controlled by TIPC itself, either by the 
socket layer (where the length is always 1 for this type of message) or by
name_dist. In the latter case the length is also 1, except at first link 
setup, when there is guaranteed to be no congestion anyway.

I appreciate your interest, but this patch is not needed.

BR
///jon

> 
> This patch performs such a check after the modification to
> l->backlog[imp].len (if imp is TIPC_SYSTEM_IMPORTANCE) to avoid such
> security issues. An error code will be returned if an unexpected value of
> l->backlog[imp].len is generated.
> 
> Signed-off-by: Wenwen Wang <wang6...@umn.edu>
> ---
>  net/tipc/link.c | 5 +
>  1 file changed, 5 insertions(+)
> 
> diff --git a/net/tipc/link.c b/net/tipc/link.c index 695acb7..62972fa 100644
> --- a/net/tipc/link.c
> +++ b/net/tipc/link.c
> @@ -948,6 +948,11 @@ int tipc_link_xmit(struct tipc_link *l, struct
> sk_buff_head *list,
>   continue;
>   }
>   l->backlog[imp].len += skb_queue_len(list);
> + if (imp == TIPC_SYSTEM_IMPORTANCE &&
> + l->backlog[imp].len >= l->backlog[imp].limit) {
> + pr_warn("%s<%s>, link overflow", link_rst_msg, l-
> >name);
> + return -ENOBUFS;
> + }
>   skb_queue_splice_tail_init(list, backlogq);
>   }
>   l->snd_nxt = seqno;
> --
> 2.7.4



[net-next 1/1] tipc: introduce ioctl for fetching node identity

2018-04-25 Thread Jon Maloy
After the introduction of a 128-bit node identity it may be difficult
for a user to correlate between this identity and the generated node
hash address.

We now try to make this easier by introducing a new ioctl() call for
fetching a node identity by using the hash value as key. This will
be particularly useful when we extend some of the commands in the
'tipc' tool, but we also expect regular user applications to need
this feature.

Acked-by: Ying Xue <ying@windriver.com>
Signed-off-by: Jon Maloy <jon.ma...@ericsson.com>
---
 include/uapi/linux/tipc.h | 12 
 net/tipc/node.c   | 21 +
 net/tipc/node.h   |  1 +
 net/tipc/socket.c | 13 +++--
 4 files changed, 41 insertions(+), 6 deletions(-)

diff --git a/include/uapi/linux/tipc.h b/include/uapi/linux/tipc.h
index bf6d286..6b2fd4d 100644
--- a/include/uapi/linux/tipc.h
+++ b/include/uapi/linux/tipc.h
@@ -209,16 +209,16 @@ struct tipc_group_req {
  * The string formatting for each name element is:
  * media: media
  * interface: media:interface name
- * link: Z.C.N:interface-Z.C.N:interface
- *
+ * link: node:interface-node:interface
  */
-
+#define TIPC_NODEID_LEN 16
 #define TIPC_MAX_MEDIA_NAME16
 #define TIPC_MAX_IF_NAME   16
 #define TIPC_MAX_BEARER_NAME   32
 #define TIPC_MAX_LINK_NAME 68
 
-#define SIOCGETLINKNAMESIOCPROTOPRIVATE
+#define SIOCGETLINKNAMESIOCPROTOPRIVATE
+#define SIOCGETNODEID  (SIOCPROTOPRIVATE + 1)
 
 struct tipc_sioc_ln_req {
__u32 peer;
@@ -226,6 +226,10 @@ struct tipc_sioc_ln_req {
char linkname[TIPC_MAX_LINK_NAME];
 };
 
+struct tipc_sioc_nodeid_req {
+   __u32 peer;
+   char node_id[TIPC_NODEID_LEN];
+};
 
 /* The macros and functions below are deprecated:
  */
diff --git a/net/tipc/node.c b/net/tipc/node.c
index e9c52e14..81e6dd0 100644
--- a/net/tipc/node.c
+++ b/net/tipc/node.c
@@ -195,6 +195,27 @@ int tipc_node_get_mtu(struct net *net, u32 addr, u32 sel)
return mtu;
 }
 
+bool tipc_node_get_id(struct net *net, u32 addr, u8 *id)
+{
+   u8 *own_id = tipc_own_id(net);
+   struct tipc_node *n;
+
+   if (!own_id)
+   return true;
+
+   if (addr == tipc_own_addr(net)) {
+   memcpy(id, own_id, TIPC_NODEID_LEN);
+   return true;
+   }
+   n = tipc_node_find(net, addr);
+   if (!n)
+   return false;
+
+   memcpy(id, >peer_id, TIPC_NODEID_LEN);
+   tipc_node_put(n);
+   return true;
+}
+
 u16 tipc_node_get_capabilities(struct net *net, u32 addr)
 {
struct tipc_node *n;
diff --git a/net/tipc/node.h b/net/tipc/node.h
index bb271a3..846c8f2 100644
--- a/net/tipc/node.h
+++ b/net/tipc/node.h
@@ -60,6 +60,7 @@ enum {
 #define INVALID_BEARER_ID -1
 
 void tipc_node_stop(struct net *net);
+bool tipc_node_get_id(struct net *net, u32 addr, u8 *id);
 u32 tipc_node_try_addr(struct net *net, u8 *id, u32 addr);
 void tipc_node_check_dest(struct net *net, u32 onode, u8 *peer_id128,
  struct tipc_bearer *bearer,
diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index 252a52ae..c499200 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -2973,7 +2973,8 @@ static int tipc_getsockopt(struct socket *sock, int lvl, 
int opt,
 
 static int tipc_ioctl(struct socket *sock, unsigned int cmd, unsigned long arg)
 {
-   struct sock *sk = sock->sk;
+   struct net *net = sock_net(sock->sk);
+   struct tipc_sioc_nodeid_req nr = {0};
struct tipc_sioc_ln_req lnr;
void __user *argp = (void __user *)arg;
 
@@ -2981,7 +2982,7 @@ static int tipc_ioctl(struct socket *sock, unsigned int 
cmd, unsigned long arg)
case SIOCGETLINKNAME:
if (copy_from_user(&lnr, argp, sizeof(lnr)))
return -EFAULT;
-   if (!tipc_node_get_linkname(sock_net(sk),
+   if (!tipc_node_get_linkname(net,
lnr.bearer_id & 0x, lnr.peer,
lnr.linkname, TIPC_MAX_LINK_NAME)) {
if (copy_to_user(argp, &lnr, sizeof(lnr)))
@@ -2989,6 +2990,14 @@ static int tipc_ioctl(struct socket *sock, unsigned int 
cmd, unsigned long arg)
return 0;
}
return -EADDRNOTAVAIL;
+   case SIOCGETNODEID:
+   if (copy_from_user(&nr, argp, sizeof(nr)))
+   return -EFAULT;
+   if (!tipc_node_get_id(net, nr.peer, nr.node_id))
+   return -EADDRNOTAVAIL;
+   if (copy_to_user(argp, &nr, sizeof(nr)))
+   return -EFAULT;
+   return 0;
default:
return -ENOIOCTLCMD;
}
-- 
2.1.4



[net 1/1] tipc: fix bug in function tipc_nl_node_dump_monitor

2018-04-25 Thread Jon Maloy
Commit 36a50a989ee8 ("tipc: fix infinite loop when dumping link monitor
summary") intended to fix a problem where the user tool loops forever
when the maximum number of bearers is enabled.

Unfortunately, the wrong version of the commit was posted, so the
problem was not solved at all.

This commit adds the missing part.

Fixes: 36a50a989ee8 ("tipc: fix infinite loop when dumping link monitor 
summary")
Signed-off-by: Jon Maloy <jon.ma...@ericsson.com>
---
 net/tipc/node.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/tipc/node.c b/net/tipc/node.c
index 6f98b56..baaf93f 100644
--- a/net/tipc/node.c
+++ b/net/tipc/node.c
@@ -2244,7 +2244,7 @@ int tipc_nl_node_dump_monitor(struct sk_buff *skb, struct 
netlink_callback *cb)
 
rtnl_lock();
for (bearer_id = prev_bearer; bearer_id < MAX_BEARERS; bearer_id++) {
-   err = __tipc_nl_add_monitor(net, &msg, prev_bearer);
+   err = __tipc_nl_add_monitor(net, &msg, bearer_id);
if (err)
break;
}
-- 
2.1.4



[net 1/1] tipc: fix infinite loop when dumping link monitor summary

2018-04-17 Thread Jon Maloy
From: Tung Nguyen <tung.q.ngu...@dektech.com.au>

When the number of bearers in use is configured to MAX_BEARERS and the
command "tipc link monitor summary" is issued, the command enters an
infinite loop in user space.

This issue happens because function tipc_nl_node_dump_monitor() returns
the wrong 'prev_bearer' value when all potential monitors have been
scanned.

The correct behavior is to always try to scan all monitors until either
the netlink message is full, in which case we return the bearer identity
of the affected monitor, or we continue through the whole bearer array
until we can return MAX_BEARERS. This solution also caters for the case
where there may be gaps in the bearer array.

Signed-off-by: Tung Nguyen <tung.q.ngu...@dektech.com.au>
Signed-off-by: Jon Maloy <jon.ma...@ericsson.com>
---
 net/tipc/monitor.c |  2 +-
 net/tipc/node.c| 11 ---
 2 files changed, 5 insertions(+), 8 deletions(-)

diff --git a/net/tipc/monitor.c b/net/tipc/monitor.c
index 32dc33a..5453e56 100644
--- a/net/tipc/monitor.c
+++ b/net/tipc/monitor.c
@@ -777,7 +777,7 @@ int __tipc_nl_add_monitor(struct net *net, struct 
tipc_nl_msg *msg,
 
ret = tipc_bearer_get_name(net, bearer_name, bearer_id);
if (ret || !mon)
-   return -EINVAL;
+   return 0;
 
hdr = genlmsg_put(msg->skb, msg->portid, msg->seq, _genl_family,
  NLM_F_MULTI, TIPC_NL_MON_GET);
diff --git a/net/tipc/node.c b/net/tipc/node.c
index c77dd2f..6f98b56 100644
--- a/net/tipc/node.c
+++ b/net/tipc/node.c
@@ -2232,8 +2232,8 @@ int tipc_nl_node_dump_monitor(struct sk_buff *skb, struct 
netlink_callback *cb)
struct net *net = sock_net(skb->sk);
u32 prev_bearer = cb->args[0];
struct tipc_nl_msg msg;
+   int bearer_id;
int err;
-   int i;
 
if (prev_bearer == MAX_BEARERS)
return 0;
@@ -2243,16 +2243,13 @@ int tipc_nl_node_dump_monitor(struct sk_buff *skb, 
struct netlink_callback *cb)
msg.seq = cb->nlh->nlmsg_seq;
 
rtnl_lock();
-   for (i = prev_bearer; i < MAX_BEARERS; i++) {
-   prev_bearer = i;
+   for (bearer_id = prev_bearer; bearer_id < MAX_BEARERS; bearer_id++) {
err = __tipc_nl_add_monitor(net, , prev_bearer);
if (err)
-   goto out;
+   break;
}
-
-out:
rtnl_unlock();
-   cb->args[0] = prev_bearer;
+   cb->args[0] = bearer_id;
 
return skb->len;
 }
-- 
2.1.4



[net 1/1] tipc: fix use-after-free in tipc_nametbl_stop

2018-04-17 Thread Jon Maloy
When we delete a service item in tipc_nametbl_stop() we loop over
all service ranges in the service's RB tree, and for each service
range we loop over its pertaining publications while calling
tipc_service_remove_publ() for each of them.

However, tipc_service_remove_publ() has the side effect that it also
removes the containing service range item when there are no publications
left. This leads to a "use-after-free" access when the inner loop
continues to the next iteration, since the range item holding the list
we are looping over no longer exists.

We fix this by moving the delete of the service range item outside
the said function. Instead, we now let the two functions calling it
test if the list is empty and perform the removal when that is the
case.

Reported-by: syzbot+d64b64afc55660106...@syzkaller.appspotmail.com
Signed-off-by: Jon Maloy <jon.ma...@ericsson.com>
---
 net/tipc/name_table.c | 29 +
 1 file changed, 17 insertions(+), 12 deletions(-)

diff --git a/net/tipc/name_table.c b/net/tipc/name_table.c
index 4068eaa..dd1c4fa 100644
--- a/net/tipc/name_table.c
+++ b/net/tipc/name_table.c
@@ -241,7 +241,8 @@ static struct publication *tipc_service_insert_publ(struct 
net *net,
 static struct publication *tipc_service_remove_publ(struct net *net,
struct tipc_service *sc,
u32 lower, u32 upper,
-   u32 node, u32 key)
+   u32 node, u32 key,
+   struct service_range **rng)
 {
struct tipc_subscription *sub, *tmp;
struct service_range *sr;
@@ -275,19 +276,15 @@ static struct publication 
*tipc_service_remove_publ(struct net *net,
 
list_del(>all_publ);
list_del(>local_publ);
-
-   /* Remove service range item if this was its last publication */
-   if (list_empty(>all_publ)) {
+   if (list_empty(>all_publ))
last = true;
-   rb_erase(>tree_node, >ranges);
-   kfree(sr);
-   }
 
/* Notify any waiting subscriptions */
list_for_each_entry_safe(sub, tmp, >subscriptions, service_list) {
tipc_sub_report_overlap(sub, p->lower, p->upper, TIPC_WITHDRAWN,
p->port, p->node, p->scope, last);
}
+   *rng = sr;
return p;
 }
 
@@ -379,13 +376,20 @@ struct publication *tipc_nametbl_remove_publ(struct net 
*net, u32 type,
 u32 node, u32 key)
 {
struct tipc_service *sc = tipc_service_find(net, type);
+   struct service_range *sr = NULL;
struct publication *p = NULL;
 
if (!sc)
return NULL;
 
spin_lock_bh(>lock);
-   p = tipc_service_remove_publ(net, sc, lower, upper, node, key);
+   p = tipc_service_remove_publ(net, sc, lower, upper, node, key, );
+
+   /* Remove service range item if this was its last publication */
+   if (sr && list_empty(>all_publ)) {
+   rb_erase(>tree_node, >ranges);
+   kfree(sr);
+   }
 
/* Delete service item if this no more publications and subscriptions */
if (RB_EMPTY_ROOT(>ranges) && list_empty(>subscriptions)) {
@@ -747,16 +751,17 @@ int tipc_nametbl_init(struct net *net)
 static void tipc_service_delete(struct net *net, struct tipc_service *sc)
 {
struct service_range *sr, *tmpr;
-   struct publication *p, *tmpb;
+   struct publication *p, *tmp;
 
spin_lock_bh(>lock);
rbtree_postorder_for_each_entry_safe(sr, tmpr, >ranges, tree_node) {
-   list_for_each_entry_safe(p, tmpb,
->all_publ, all_publ) {
+   list_for_each_entry_safe(p, tmp, >all_publ, all_publ) {
tipc_service_remove_publ(net, sc, p->lower, p->upper,
-p->node, p->key);
+p->node, p->key, );
kfree_rcu(p, rcu);
}
+   rb_erase(>tree_node, >ranges);
+   kfree(sr);
}
hlist_del_init_rcu(>service_list);
spin_unlock_bh(>lock);
-- 
2.1.4
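
The ownership rule this fix establishes (the remove function only unlinks
a publication; whoever iterates over the range frees the range afterwards)
can be sketched with a plain singly linked list. This is an illustration
with hypothetical names, not the kernel list or rb-tree API.

```c
#include <assert.h>
#include <stdlib.h>

struct sketch_publ {
	struct sketch_publ *next;
};

struct sketch_range {
	struct sketch_publ *publs;
};

/* Allocate a publication and link it into the range. */
struct sketch_publ *sketch_range_add_publ(struct sketch_range *r)
{
	struct sketch_publ *p = calloc(1, sizeof(*p));

	assert(r != NULL);
	if (!p)
		return NULL;
	p->next = r->publs;
	r->publs = p;
	return p;
}

/* Unlink p from r. Never frees r, even if the list becomes empty. */
void sketch_range_remove_publ(struct sketch_range *r, struct sketch_publ *p)
{
	struct sketch_publ **pp;

	assert(r != NULL && p != NULL);
	for (pp = &r->publs; *pp; pp = &(*pp)->next) {
		if (*pp == p) {
			*pp = p->next;
			return;
		}
	}
}

/* Release all publications, then the range. Returns the number released. */
int sketch_range_delete(struct sketch_range *r)
{
	struct sketch_publ *p, *tmp;
	int n = 0;

	assert(r != NULL);
	for (p = r->publs; p; p = tmp) {
		tmp = p->next;		/* save successor before p goes away */
		sketch_range_remove_publ(r, p);
		free(p);
		n++;
	}
	free(r);			/* safe: the loop is done with r */
	return n;
}
```

Had sketch_range_remove_publ() freed the range when the last entry went
away, the loop in sketch_range_delete() would be operating on freed
memory, which is the situation syzbot reported.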



RE: [PATCH net 0/2] tipc: Better check user provided attributes

2018-04-16 Thread Jon Maloy
Acked-by: Jon Maloy <jon.ma...@ericsson.com>

Thank you, Eric.


> -Original Message-
> From: netdev-ow...@vger.kernel.org [mailto:netdev-
> ow...@vger.kernel.org] On Behalf Of Eric Dumazet
> Sent: Monday, April 16, 2018 11:30
> To: David S . Miller <da...@davemloft.net>
> Cc: netdev <netdev@vger.kernel.org>; Eric Dumazet
> <eduma...@google.com>; Eric Dumazet <eric.duma...@gmail.com>
> Subject: [PATCH net 0/2] tipc: Better check user provided attributes
> 
> syzbot reported a crash in __tipc_nl_net_set()
> 
> While fixing it, I also had to fix an old bug involving TIPC_NLA_NET_ADDR
> 
> Eric Dumazet (2):
>   tipc: add policy for TIPC_NLA_NET_ADDR
>   tipc: fix possible crash in __tipc_nl_net_set()
> 
>  net/tipc/net.c | 2 ++
>  net/tipc/netlink.c | 5 -
>  2 files changed, 6 insertions(+), 1 deletion(-)
> 
> --
> 2.17.0.484.g0c8726318c-goog



[net 1/1] tipc: fix missing initializer in tipc_sendmsg()

2018-04-11 Thread Jon Maloy
The stack variable 'dnode' in __tipc_sendmsg() may theoretically
end up in tipc_node_get_mtu() as an uninitialized variable.

We fix this by initializing the variable at declaration. We also add
a default else clause to the two conditional ones already there, so
that we never end up in the named function if the given address
type is illegal.

Reported-by: syzbot+b0975ce9355b347c1...@syzkaller.appspotmail.com
Signed-off-by: Jon Maloy <jon.ma...@ericsson.com>
---
 net/tipc/socket.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index 1fd1c8b..252a52ae 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -1278,7 +1278,7 @@ static int __tipc_sendmsg(struct socket *sock, struct 
msghdr *m, size_t dlen)
struct tipc_msg *hdr = >phdr;
struct tipc_name_seq *seq;
struct sk_buff_head pkts;
-   u32 dnode, dport;
+   u32 dport, dnode = 0;
u32 type, inst;
int mtu, rc;
 
@@ -1348,6 +1348,8 @@ static int __tipc_sendmsg(struct socket *sock, struct 
msghdr *m, size_t dlen)
msg_set_destnode(hdr, dnode);
msg_set_destport(hdr, dest->addr.id.ref);
msg_set_hdr_sz(hdr, BASIC_H_SIZE);
+   } else {
+   return -EINVAL;
}
 
/* Block or return if destination link is congested */
-- 
2.1.4



[net 1/1] tipc: fix unbalanced reference counter

2018-04-11 Thread Jon Maloy
When a topology subscription is created, we may encounter (or KASAN
may provoke) a failure to create a corresponding service instance in
the binding table. Instead of reporting the failure back to the caller,
tipc_nametbl_subscribe() just prints a warning and returns, without
incrementing the subscription reference counter as expected by the
caller.

This makes the caller believe that the subscription was successful, so
it will at a later moment try to unsubscribe the item. This involves
a sub_put() call. Since the reference counter never was incremented
in the first place, we get a premature delete of the subscription item,
followed by a "use-after-free" warning.

We fix this by adding a return value to tipc_nametbl_subscribe() and
make the caller aware of the failure to subscribe.

This bug seems to always have been around, but this fix only applies
back to the commit shown below. Given the low risk of this happening
we believe this to be sufficient.

Fixes: commit 218527fe27ad ("tipc: replace name table service range
array with rb tree")
Reported-by: syzbot+aa245f26d42b8305d...@syzkaller.appspotmail.com

Signed-off-by: Jon Maloy <jon.ma...@ericsson.com>
---
 net/tipc/name_table.c | 5 -
 net/tipc/name_table.h | 2 +-
 net/tipc/subscr.c | 5 -
 3 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/net/tipc/name_table.c b/net/tipc/name_table.c
index b1fe209..4068eaa 100644
--- a/net/tipc/name_table.c
+++ b/net/tipc/name_table.c
@@ -665,13 +665,14 @@ int tipc_nametbl_withdraw(struct net *net, u32 type, u32 
lower,
 /**
  * tipc_nametbl_subscribe - add a subscription object to the name table
  */
-void tipc_nametbl_subscribe(struct tipc_subscription *sub)
+bool tipc_nametbl_subscribe(struct tipc_subscription *sub)
 {
struct name_table *nt = tipc_name_table(sub->net);
struct tipc_net *tn = tipc_net(sub->net);
struct tipc_subscr *s = >evt.s;
u32 type = tipc_sub_read(s, seq.type);
struct tipc_service *sc;
+   bool res = true;
 
spin_lock_bh(>nametbl_lock);
sc = tipc_service_find(sub->net, type);
@@ -685,8 +686,10 @@ void tipc_nametbl_subscribe(struct tipc_subscription *sub)
pr_warn("Failed to subscribe for {%u,%u,%u}\n", type,
tipc_sub_read(s, seq.lower),
tipc_sub_read(s, seq.upper));
+   res = false;
}
spin_unlock_bh(>nametbl_lock);
+   return res;
 }
 
 /**
diff --git a/net/tipc/name_table.h b/net/tipc/name_table.h
index 4b14fc2..0febba4 100644
--- a/net/tipc/name_table.h
+++ b/net/tipc/name_table.h
@@ -126,7 +126,7 @@ struct publication *tipc_nametbl_insert_publ(struct net 
*net, u32 type,
 struct publication *tipc_nametbl_remove_publ(struct net *net, u32 type,
 u32 lower, u32 upper,
 u32 node, u32 key);
-void tipc_nametbl_subscribe(struct tipc_subscription *s);
+bool tipc_nametbl_subscribe(struct tipc_subscription *s);
 void tipc_nametbl_unsubscribe(struct tipc_subscription *s);
 int tipc_nametbl_init(struct net *net);
 void tipc_nametbl_stop(struct net *net);
diff --git a/net/tipc/subscr.c b/net/tipc/subscr.c
index b7d80bc..f340e53 100644
--- a/net/tipc/subscr.c
+++ b/net/tipc/subscr.c
@@ -153,7 +153,10 @@ struct tipc_subscription *tipc_sub_subscribe(struct net 
*net,
memcpy(>evt.s, s, sizeof(*s));
spin_lock_init(>lock);
kref_init(>kref);
-   tipc_nametbl_subscribe(sub);
+   if (!tipc_nametbl_subscribe(sub)) {
+   kfree(sub);
+   return NULL;
+   }
timer_setup(>timer, tipc_sub_timeout, 0);
timeout = tipc_sub_read(>evt.s, timeout);
if (timeout != TIPC_WAIT_FOREVER)
-- 
2.1.4
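
The contract this patch introduces can be modelled in a few lines: the
table takes its reference only on success, reports failure to the caller,
and the caller releases the object instead of later dropping a reference
that was never taken. The sketch uses a plain integer counter and
hypothetical names; it is not the kref API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

struct sketch_sub {
	int refs;
};

/* Take a reference on behalf of the table, but only if we can store it. */
bool sketch_table_subscribe(struct sketch_sub *s, bool service_available)
{
	assert(s != NULL);

	if (!service_available)
		return false;	/* no reference taken, and the caller is told */

	s->refs++;
	return true;
}

/* Create a subscription; returns NULL if the table refused it. */
struct sketch_sub *sketch_sub_create(bool service_available)
{
	struct sketch_sub *s = calloc(1, sizeof(*s));

	if (!s)
		return NULL;
	s->refs = 1;		/* the creator's own reference */

	if (!sketch_table_subscribe(s, service_available)) {
		free(s);	/* nothing else can be holding it yet */
		return NULL;
	}
	return s;
}
```

With the old void return, the caller would have kept the object at one
reference and later dropped two, which is the premature delete described
above.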



RE: [PATCH v3] net: tipc: Replace GFP_ATOMIC with GFP_KERNEL in tipc_mon_create

2018-04-11 Thread Jon Maloy


> -Original Message-
> From: Ying Xue [mailto:ying@windriver.com]
> Sent: Wednesday, April 11, 2018 06:27
> To: Jia-Ju Bai <baijiaju1...@gmail.com>; Jon Maloy
> <jon.ma...@ericsson.com>; da...@davemloft.net
> Cc: netdev@vger.kernel.org; tipc-discuss...@lists.sourceforge.net; linux-
> ker...@vger.kernel.org
> Subject: Re: [PATCH v3] net: tipc: Replace GFP_ATOMIC with GFP_KERNEL in
> tipc_mon_create
> 
> On 04/11/2018 06:24 PM, Jia-Ju Bai wrote:
> > tipc_mon_create() is never called in atomic context.
> >
> > The call chain ending up at tipc_mon_create() is:
> > [1] tipc_mon_create() <- tipc_enable_bearer() <-
> > tipc_nl_bearer_enable()
> > tipc_nl_bearer_enable() calls rtnl_lock(), which indicates this
> > function is not called in atomic context.
> >
> > Despite never getting called from atomic context,
> > tipc_mon_create() calls kzalloc() with GFP_ATOMIC, which does not
> > sleep for allocation.
> > GFP_ATOMIC is not necessary and can be replaced with GFP_KERNEL, which
> > can sleep and improve the possibility of successful allocation.
> >
> > This was found by a static analysis tool named DCNS, written by myself,
> > and I also checked it manually.
> >
> > Signed-off-by: Jia-Ju Bai <baijiaju1...@gmail.com>
> 
> Acked-by: Ying Xue <ying@windriver.com>
Acked-by: Jon Maloy <jon.ma...@ericsson.com>
> 
> > ---
> > v2:
> > * Modify the description of GFP_ATOMIC in v1.
> >   Thank Eric for good advice.
> > v3:
> > * Modify wrong text in description in v2.
> >   Thank Ying for good advice.
> > ---
> >  net/tipc/monitor.c | 6 +++---
> >  1 file changed, 3 insertions(+), 3 deletions(-)
> >
> > diff --git a/net/tipc/monitor.c b/net/tipc/monitor.c
> > index 9e109bb..9714d80 100644
> > --- a/net/tipc/monitor.c
> > +++ b/net/tipc/monitor.c
> > @@ -604,9 +604,9 @@ int tipc_mon_create(struct net *net, int bearer_id)
> > if (tn->monitors[bearer_id])
> > return 0;
> >
> > -   mon = kzalloc(sizeof(*mon), GFP_ATOMIC);
> > -   self = kzalloc(sizeof(*self), GFP_ATOMIC);
> > -   dom = kzalloc(sizeof(*dom), GFP_ATOMIC);
> > +   mon = kzalloc(sizeof(*mon), GFP_KERNEL);
> > +   self = kzalloc(sizeof(*self), GFP_KERNEL);
> > +   dom = kzalloc(sizeof(*dom), GFP_KERNEL);
> > if (!mon || !self || !dom) {
> > kfree(mon);
> > kfree(self);
> >


RE: [RFC PATCH] packet: mark ring entry as in-use inside spin_lock to prevent RX ring overrun

2018-04-04 Thread Jon Rosen (jrosen)
> >> >One issue with the above proposed change to use TP_STATUS_IN_PROGRESS
> >> >is that the documentation of the tp_status field is somewhat
> >> >inconsistent.  In some places it's described as TP_STATUS_KERNEL(0)
> >> >meaning the entry is owned by the kernel and !TP_STATUS_KERNEL(0)
> >> >meaning the entry is owned by user space.  In other places ownership
> >> >by user space is defined by the TP_STATUS_USER(1) bit being set.
> >>
> >> But indeed this example in packet_mmap.txt is problematic
> >>
> >> if (status == TP_STATUS_KERNEL)
> >> retval = poll(&pfd, 1, timeout);
> >>
> >> It does not really matter whether the docs are possibly inconsistent and
> >> which one is authoritative. Examples like the above make it likely that
> >> some user code expects such code to work.
> >
> > Yes, that's exactly my concern.  Yet another troubling example seems to be
> > libpcap, which also looks specifically for status to be anything other than
> > TP_STATUS_KERNEL(0) to indicate a frame is available in user space.
> 
> Good catch. If pcap-linux.c relies on this then the status field
> cannot be changed. Other fields can be modified freely while tp_status
> remains 0, perhaps that's an option.

Possibly. Someone else suggested something similar but in at least the
one example we thought through it still seemed like it didn't address the
problem.

For example, let's say we used tp_len == -1 to indicate to other kernel threads
that the entry was already in progress.  This would require that user space
never set tp_len = -1 before returning the entry back to the kernel.  If it did
then no kernel thread would ever claim ownership and the ring would hang.

Now, it seems pretty unlikely that user space would do such a thing so maybe we
could look past that, but then we run into the issue that there is still a
window of opportunity for other kernel threads to come in and wrap the ring.

The reason is we can't set tp_len to the correct length after setting tp_status
because user space could grab the entry and see tp_len == -1, so we have to set
tp_len before we set tp_status. This means that there is still a window where
other kernel threads could come in and see tp_len as something other than -1 and
a tp_status of TP_STATUS_KERNEL and think it's ok to allocate the entry.
This puts us back to where we are today (arguably with a smaller window,
but a window nonetheless).

Alternatively we could reacquire the spin_lock to then set tp_len followed by
tp_status.  This would give the necessary indivisibility in the kernel while 
preserving proper order as made visible to user space, but it comes at the cost
of another spin_lock.

Thanks for the suggestion.  If you can think of a way around this I'm all ears.
I'll think on this some more but so far I'm stuck on how to get past having to
broaden the scope of the spin_lock, reacquire the spin_lock, or use some sort
of atomic construct along with a parallel shadow ring structure (still thinking
through that one as well).
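
To make the last option a bit more concrete, here is a rough user-space
sketch of what a parallel shadow ring with an atomic claim could look like.
It is only an assumption about one possible shape, not a proposed patch: the
names demo_claimed, demo_claim() and demo_release() and the ring size are
made up, and C11 atomics stand in for whatever the kernel would actually use.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define DEMO_RING_SIZE 4	/* hypothetical ring size */

/*
 * Hypothetical kernel-private shadow ring: one flag per ring entry.
 * User space never sees it, so tp_status semantics stay untouched.
 */
static atomic_bool demo_claimed[DEMO_RING_SIZE];

/* Try to claim an entry; exactly one caller can win until it is released. */
static bool demo_claim(unsigned int idx)
{
	bool expected = false;

	return atomic_compare_exchange_strong(&demo_claimed[idx % DEMO_RING_SIZE],
					      &expected, true);
}

/* Drop the claim once the entry has been handed over to user space. */
static void demo_release(unsigned int idx)
{
	atomic_store(&demo_claimed[idx % DEMO_RING_SIZE], false);
}
```

In this sketch a producer that wraps around onto a still-claimed entry simply
fails the claim, which could be treated the same way as a full ring.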



RE: [RFC PATCH] packet: mark ring entry as in-use inside spin_lock to prevent RX ring overrun

2018-04-04 Thread Jon Rosen (jrosen)
On Wednesday, April 04, 2018 9:49 AM, Willem de Bruijn <will...@google.com> wrote:
> 
> On Tue, Apr 3, 2018 at 11:55 PM, Jon Rosen <jro...@cisco.com> wrote:
> > Fix PACKET_RX_RING bug for versions TPACKET_V1 and TPACKET_V2 which
> > causes the ring to get corrupted by allowing multiple kernel threads
> > to claim ownership of the same ring entry. Mark the ring entry as
> > already being used within the spin_lock to prevent other kernel
> > threads from reusing the same entry before it's fully filled in,
> > passed to user space, and then eventually passed back to the kernel
> > for use with a new packet.
> >
> > Note that the proposed change may modify the semantics of the
> > interface between kernel space and user space in a way which may cause
> > some applications to no longer work properly.
> 
> As long as TP_STATUS_USER (1) is not set, userspace should ignore
> the slot.
> 
> >One issue with the above proposed change to use TP_STATUS_IN_PROGRESS
> >is that the documentation of the tp_status field is somewhat
> >inconsistent.  In some places it's described as TP_STATUS_KERNEL(0)
> >meaning the entry is owned by the kernel and !TP_STATUS_KERNEL(0)
> >meaning the entry is owned by user space.  In other places ownership
> >by user space is defined by the TP_STATUS_USER(1) bit being set.
> 
> But indeed this example in packet_mmap.txt is problematic
> 
> if (status == TP_STATUS_KERNEL)
> retval = poll(&pfd, 1, timeout);
> 
> It does not really matter whether the docs are possibly inconsistent and
> which one is authoritative. Examples like the above make it likely that
> some user code expects such code to work.

Yes, that's exactly my concern.  Yet another troubling example seems to be
libpcap, which also looks specifically for status to be anything other than
TP_STATUS_KERNEL(0) to indicate a frame is available in user space.

Either way things are broken. They are broken as they stand now because the
ring can get overrun and the kernel and user space tracking of the ring can
get out of sync.  And they are broken with the below change because some user
space applications will be looking for anything other than TP_STATUS_KERNEL,
so again the ring will get out of sync.

The difference is that the failure as it stands today is, on average (across
all environments and all user space apps), less likely to occur, while with the
change below it is much more likely to occur.

Maybe the right answer here is to implement a fix that is compatible with
existing applications and accept any potential performance impacts, and then
add yet another version (TPACKET_V4?) which more strictly requires the
TP_STATUS_USER bit for passing ownership.
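
To spell out the compatibility problem, the two user-space ownership tests
discussed above can be written side by side. This is just an illustrative
sketch: the DEMO_ constants are local copies meant to mirror the
TP_STATUS_KERNEL, TP_STATUS_USER and TP_STATUS_LOSING values, and the two
function names are made up.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_TP_STATUS_KERNEL		0u
#define DEMO_TP_STATUS_USER		(1u << 0)
#define DEMO_TP_STATUS_LOSING		(1u << 2)
/* The RFC patch reuses the LOSING bit as the in-progress marker. */
#define DEMO_TP_STATUS_IN_PROGRESS	DEMO_TP_STATUS_LOSING

/* "Loose" test: anything other than TP_STATUS_KERNEL belongs to user space. */
static bool demo_user_owns_loose(uint32_t status)
{
	return status != DEMO_TP_STATUS_KERNEL;
}

/* "Strict" test: user space owns the entry only if the USER bit is set. */
static bool demo_user_owns_strict(uint32_t status)
{
	return (status & DEMO_TP_STATUS_USER) != 0;
}
```

For an entry marked in progress the loose test says user space owns it while
the strict test says it does not, which is where an application using the
loose test would get out of sync with the kernel.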

> 
> > +++ b/net/packet/af_packet.c
> > @@ -2287,6 +2287,15 @@ static int tpacket_rcv(struct sk_buff *skb, struct 
> > net_device *dev,
> > if (po->stats.stats1.tp_drops)
> > status |= TP_STATUS_LOSING;
> > }
> > +
> > +/*
> > + * Mark this entry as TP_STATUS_IN_PROGRESS to prevent other
> > + * kernel threads from re-using this same entry.
> > + */
> > +#define TP_STATUS_IN_PROGRESS TP_STATUS_LOSING
> 
> No need to reinterpret existing flags. tp_status is a u32 with
> sufficient undefined bits.

Agreed.

> 
> > +   if (po->tp_version <= TPACKET_V2)
> > +__packet_set_status(po, h.raw, TP_STATUS_IN_PROGRESS);
> > +
> > po->stats.stats1.tp_packets++;
> > if (copy_skb) {
> > status |= TP_STATUS_COPY;
> > --
> > 2.10.3.dirty
> >

Thanks for the feedback!
Jon.


RE: [RFC PATCH] packet: mark ring entry as in-use inside spin_lock to prevent RX ring overrun

2018-04-04 Thread Jon Rosen (jrosen)


> > diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
> > index e0f3f4a..264d7b2 100644
> > --- a/net/packet/af_packet.c
> > +++ b/net/packet/af_packet.c
> > @@ -2287,6 +2287,15 @@ static int tpacket_rcv(struct sk_buff *skb, struct 
> > net_device *dev,
> > if (po->stats.stats1.tp_drops)
> > status |= TP_STATUS_LOSING;
> > }
> > +
> > +/*
> > + * Mark this entry as TP_STATUS_IN_PROGRESS to prevent other
> > + * kernel threads from re-using this same entry.
> > + */
> > +#define TP_STATUS_IN_PROGRESS TP_STATUS_LOSING
> > +   if (po->tp_version <= TPACKET_V2)
> > +__packet_set_status(po, h.raw, TP_STATUS_IN_PROGRESS);
> > +
> > po->stats.stats1.tp_packets++;
> > if (copy_skb) {
> > status |= TP_STATUS_COPY;
> 
> This patch looks correct. Please resend it with proper signed-off-by
> and with a kernel code indenting style (tabs).  Is this bug present
> since the beginning of af_packet and multiqueue devices or did it get
> introduced in some previous kernel?

Sorry about the tabs, I'll fix that and try to figure out what I did wrong with
the signed-off-by.

I've looked back as far as I could find online (2.6.11) and it would appear that
this bug has always been there.

Thanks, jon.



[RFC PATCH] packet: mark ring entry as in-use inside spin_lock to prevent RX ring overrun

2018-04-03 Thread Jon Rosen
Fix PACKET_RX_RING bug for versions TPACKET_V1 and TPACKET_V2 which
causes the ring to get corrupted by allowing multiple kernel threads
to claim ownership of the same ring entry. Mark the ring entry as
already being used within the spin_lock to prevent other kernel
threads from reusing the same entry before it's fully filled in,
passed to user space, and then eventually passed back to the kernel
for use with a new packet.

Note that the proposed change may modify the semantics of the
interface between kernel space and user space in a way which may cause
some applications to no longer work properly. More discussion on this
change can be found in the additional comments section titled
"3. Discussion on packet_mmap ownership semantics:".

Signed-off-by: Jon Rosen <jro...@cisco.com>
---

Additional Comments Section
---

1. Description of the diffs:


   TPACKET_V1 and TPACKET_V2 format rings:
   ---
   Mark each entry as TP_STATUS_IN_PROGRESS after allocating to
   prevent other kernel threads from re-using the same entry.

   This is necessary because there may be a delay from the time the
   spin_lock is released to the time that the packet is completed and
   the corresponding ring entry is marked as owned by user space.  If
   during this time other kernel threads enqueue more packets to the
   ring than the size of the ring then it will cause multiple kernel
   threads to operate on the same entry at the same time, corrupting
   packets and the ring state.

   By marking the entry as allocated (IN_PROGRESS) we prevent other
   kernel threads from incorrectly re-using an entry that is still in
   the process of being filled in before it is passed to user space.

   This forces each entry through the following states:

   +-> 1. (tp_status == TP_STATUS_KERNEL)
   |  Free: For use by any kernel thread to store a new packet
   |
   |   2. !(tp_status == TP_STATUS_KERNEL) && !(tp_status & TP_STATUS_USER)
   |  Allocated: In use by a *specific* kernel thread
   |
   |   3. (tp_status & TP_STATUS_USER)
   |  Available: Packet available for user space to process
   |
   +-- Loop back to #1 when user space writes entry as TP_STATUS_KERNEL
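
   The three states above can be sketched as a small classifier plus the
   step 1 -> 2 transition. This is a user-space illustration only, under
   the assumption that the in-progress marker is the reused LOSING bit;
   the DEMO_ constants and demo_ functions are made-up names, not kernel
   symbols.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_TP_STATUS_KERNEL		0u
#define DEMO_TP_STATUS_USER		(1u << 0)
#define DEMO_TP_STATUS_IN_PROGRESS	(1u << 2)	/* reused LOSING bit */

enum demo_state {
	DEMO_FREE,	/* 1. usable by any kernel thread */
	DEMO_ALLOCATED,	/* 2. in use by one specific kernel thread */
	DEMO_AVAILABLE,	/* 3. packet ready for user space */
};

static enum demo_state demo_classify(uint32_t tp_status)
{
	if (tp_status == DEMO_TP_STATUS_KERNEL)
		return DEMO_FREE;
	if (tp_status & DEMO_TP_STATUS_USER)
		return DEMO_AVAILABLE;
	return DEMO_ALLOCATED;
}

/*
 * Step 1 -> 2. In the kernel this would run under the receive queue
 * spin lock; a false return is treated like a full ring.
 */
static bool demo_alloc(uint32_t *tp_status)
{
	if (demo_classify(*tp_status) != DEMO_FREE)
		return false;
	*tp_status = DEMO_TP_STATUS_IN_PROGRESS;
	return true;
}
```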


   No impact on TPACKET_V3 format rings:
   -
   Packet entry ownership is already protected from other kernel
   threads potentially re-using the same entry. This is done inside
   packet_current_rx_frame() where storage is allocated for the
   current packet. Since this is done within the spin_lock no
   additional status updates for this entry are required.


   Defining TP_STATUS_IN_PROGRESS:
   ---
   Rather than defining a new bit we re-use an existing bit for this
   intermediate state.  Status will eventually be overwritten with the
   actual true status when passed to user space.  Any bit used to pass
   information to user space other than the one that passes ownership
   is suitable (can't use TP_STATUS_USER).  Alternatively a new bit
   could be defined.


2. More detailed discussion:

   Ring entries basically have 2 states, owned by the kernel or owned by
   user space. For single producer/single consumer this works fine. For
   multiple producers there is a window between the call to spin_unlock
   [F] and the call to __packet_set_status [J] where if there are enough
   packets added to the ring by other kernel threads then the ring can
   wrap and multiple threads will end up using the same ring entries.

   This occurs because the ring entry allocated at [C] did not modify the
   state of the entry so it continues to appear as owned by the kernel
   and available for use for new packets even though it has already been
   allocated.

   A simple fix is to temporarily mark the ring entries within the spin
   lock such that user space will still think it's owned by the kernel
   and other kernel threads will not see it as available to be used for
   new packets. If a kernel thread gets delayed between [F] and [J] for
   an extended period of time and the ring wraps back to the same point
   then subsequent kernel threads' attempts to allocate will fail and be
   treated as the ring being full.

   The change below at [D] uses a newly defined TP_STATUS_IN_PROGRESS bit
   to prevent other kernel threads from re-using the same entry. Note that
   any existing bit other than TP_STATUS_USER could have been used.

   af_packet.c:tpacket_rcv()
  ... code removed for brevity ...

  // Acquire spin lock
A:spin_lock(&sk->sk_receive_queue.lock);

// Preemption is disabled

// Get current ring entry
B:  h.raw = packet_current_rx_frame(
po, skb, TP_STATUS_KERNEL, (macoff+snaplen));

// Get out if ring is full
// Code not shown but it will also release lock
   

RE: general protection fault in tipc_nametbl_unsubscribe

2018-04-03 Thread Jon Maloy
#syz dup: general protection fault in __list_del_entry_valid (3)

> -Original Message-
> From: syzbot
> [mailto:syzbot+4859fe19555ea87c4...@syzkaller.appspotmail.com]
> Sent: Monday, April 02, 2018 02:01
> To: da...@davemloft.net; Jon Maloy <jon.ma...@ericsson.com>; linux-
> ker...@vger.kernel.org; netdev@vger.kernel.org; syzkaller-
> b...@googlegroups.com; tipc-discuss...@lists.sourceforge.net;
> ying@windriver.com
> Subject: general protection fault in tipc_nametbl_unsubscribe
> 
> Hello,
> 
> syzbot hit the following crash on upstream commit
> 10b84daddbec72c6b440216a69de9a9605127f7a (Sat Mar 31 17:59:00 2018
> +) Merge branch 'perf-urgent-for-linus' of
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
> syzbot dashboard link:
> https://syzkaller.appspot.com/bug?extid=4859fe19555ea87c42f3
> 
> So far this crash happened 3 times on upstream.
> C reproducer:
> https://syzkaller.appspot.com/x/repro.c?id=4775372465897472
> syzkaller reproducer:
> https://syzkaller.appspot.com/x/repro.syz?id=4868734988582912
> Raw console output:
> https://syzkaller.appspot.com/x/log.txt?id=507380209544
> Kernel config:
> https://syzkaller.appspot.com/x/.config?id=-2760467897697295172
> compiler: gcc (GCC) 7.1.1 20170620
> 
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+4859fe19555ea87c4...@syzkaller.appspotmail.com
> It will help syzbot understand when the bug is fixed. See footer for details.
> If you forward the report, please keep this part and the footer.
> 
> R13:  R14:  R15:  Name
> sequence creation failed, no memory Failed to create subscription for
> {24576,0,4294967295}
> kasan: CONFIG_KASAN_INLINE enabled
> kasan: GPF could be caused by NULL-ptr deref or user memory access
> general protection fault:  [#1] SMP KASAN Dumping ftrace buffer:
> (ftrace buffer empty)
> Modules linked in:
> CPU: 1 PID: 4447 Comm: syzkaller851181 Not tainted 4.16.0-rc7+ #374
> Hardware name: Google Google Compute Engine/Google Compute Engine,
> BIOS Google 01/01/2011
> RIP: 0010:__list_del_entry_valid+0x7e/0x150 lib/list_debug.c:51
> RSP: 0018:8801ae1aef48 EFLAGS: 00010246
> RAX: dc00 RBX:  RCX: 
> RDX:  RSI: 8801cf54c760 RDI: 8801cf54c768
> RBP: 8801ae1aef60 R08: 110035c35cff R09: 89956150
> R10: 8801ae1aee28 R11: 168a R12: 87745ea0
> R13: 8801ae1af100 R14: 8801cf54c760 R15: 8801cf4c8cc0
> FS:  () GS:8801db10()
> knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 55dce15c3090 CR3: 0846a002 CR4: 001606e0
> DR0:  DR1:  DR2: 
> DR3:  DR6: fffe0ff0 DR7: 0400 Call
> Trace:
>   __list_del_entry include/linux/list.h:117 [inline]
>   list_del_init include/linux/list.h:159 [inline]
>   tipc_nametbl_unsubscribe+0x318/0x990 net/tipc/name_table.c:848
>   tipc_subscrb_subscrp_delete+0x1e9/0x460 net/tipc/subscr.c:212
>   tipc_subscrb_delete net/tipc/subscr.c:242 [inline]
>   tipc_subscrb_release_cb+0x17/0x30 net/tipc/subscr.c:321
>   tipc_topsrv_kern_unsubscr+0x2c3/0x430 net/tipc/server.c:535
>   tipc_group_delete+0x2c0/0x3d0 net/tipc/group.c:231
>   tipc_sk_leave+0x10b/0x200 net/tipc/socket.c:2795
>   tipc_release+0x154/0xff0 net/tipc/socket.c:577
>   sock_release+0x8d/0x1e0 net/socket.c:595
>   sock_close+0x16/0x20 net/socket.c:1149
>   __fput+0x327/0x7e0 fs/file_table.c:209
>   fput+0x15/0x20 fs/file_table.c:243
>   task_work_run+0x199/0x270 kernel/task_work.c:113
>   exit_task_work include/linux/task_work.h:22 [inline]
>   do_exit+0x9bb/0x1ad0 kernel/exit.c:865
>   do_group_exit+0x149/0x400 kernel/exit.c:968
>   SYSC_exit_group kernel/exit.c:979 [inline]
>   SyS_exit_group+0x1d/0x20 kernel/exit.c:977
>   do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
>   entry_SYSCALL_64_after_hwframe+0x42/0xb7
> RIP: 0033:0x43f228
> RSP: 002b:7ffde31217e8 EFLAGS: 0246 ORIG_RAX:
> 00e7
> RAX: ffda RBX:  RCX: 0043f228
> RDX:  RSI: 003c RDI: 
> RBP: 004bf308 R08: 00e7 R09: ffd0
> R10: 204ee000 R11: 0246 R12: 0001
> R13: 006d1180 R14:  R15: 
> Code: 00 00 00 00 ad de 49 39 c4 74 66 48 b8 00 02 00 00 00 00 ad de 48 89 da 
> 48
> 39 c3 74 65 48 c1 ea 03 48 b8 00 00 00 00 00 fc ff df <80> 3c 02 00
> 75 7b 48 8b 13 48 39 f2 75 57 49 8d 7c 24 08 48 

[net-next 1/1] tipc: Fix missing list initializations in struct tipc_subscription

2018-04-03 Thread Jon Maloy
When an item of struct tipc_subscription is created, we fail to
initialize the two lists aggregated into the struct. This has so far
never been a problem, since the items are just added to a root
object by list_add(), which does not require the addee list to be
pre-initialized. However, syzbot is provoking situations where this
addition fails, whereupon the attempted removal of the item from
the list causes a crash.

This problem seems to always have been around, even though the code
for creating this object was rewritten in commit 242e82cc95f6 ("tipc:
collapse subscription creation functions"), which is still in net-next.

We fix this for that commit by initializing the two lists properly.

Fixes: 242e82cc95f6 ("tipc: collapse subscription creation functions")
Reported-by: syzbot+0bb443b74ce09197e...@syzkaller.appspotmail.com
Signed-off-by: Jon Maloy <jon.ma...@ericsson.com>
---
 net/tipc/subscr.c | 2 ++
 1 file changed, 2 insertions(+)
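
For readers less familiar with the list API, the failure mode described
above can be reproduced in miniature. The following is a simplified
user-space sketch that only mimics the semantics of the kernel list helpers;
the demo_ names are hypothetical and this is not the code from
include/linux/list.h.

```c
#include <stdbool.h>

struct demo_list_head {
	struct demo_list_head *next, *prev;
};

/* Make the node point at itself, i.e. an empty, safely removable node. */
static void demo_init_list_head(struct demo_list_head *h)
{
	h->next = h;
	h->prev = h;
}

/* Insert entry after head; overwrites entry's links, so no pre-init needed. */
static void demo_list_add(struct demo_list_head *entry,
			  struct demo_list_head *head)
{
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

/*
 * Unlink and re-initialize. This dereferences entry->next and
 * entry->prev, so the entry must be either linked or self-linked;
 * with uninitialized pointers this is where a crash would happen.
 */
static void demo_list_del_init(struct demo_list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	demo_init_list_head(entry);
}

static bool demo_list_empty(const struct demo_list_head *h)
{
	return h->next == h;
}
```

With the node initialized up front, removing it is harmless even if the
earlier add never took place, which is the situation syzbot provokes.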

diff --git a/net/tipc/subscr.c b/net/tipc/subscr.c
index 6925a98..b7d80bc 100644
--- a/net/tipc/subscr.c
+++ b/net/tipc/subscr.c
@@ -145,6 +145,8 @@ struct tipc_subscription *tipc_sub_subscribe(struct net 
*net,
pr_warn("Subscription rejected, no memory\n");
return NULL;
}
+   INIT_LIST_HEAD(&sub->service_list);
+   INIT_LIST_HEAD(&sub->sub_list);
sub->net = net;
sub->conid = conid;
sub->inactive = false;
-- 
2.1.4



RE: [iproute2-next 0/2] tipc: changes to addressing structure

2018-03-29 Thread Jon Maloy

> -Original Message-
> From: netdev-ow...@vger.kernel.org [mailto:netdev-
> ow...@vger.kernel.org] On Behalf Of David Ahern
> Sent: Thursday, March 29, 2018 13:59
> To: Jon Maloy <jon.ma...@ericsson.com>; da...@davemloft.net;
> netdev@vger.kernel.org
> Cc: Mohan Krishna Ghanta Krishnamurthy
[..]
bit node addresses as an integer in hex format,
> >>i.e., we remove the assumption about an internal structure.
> >>
> >
> > Applied to iproute2-next. Thanks,
> >
> 
> BTW, please consider adding json support to tipc. It will make tipc command
> more robust to changes in output format.

Yes, we will do that.

///jon



[net-next v2 2/5] tipc: refactor name table translate function

2018-03-29 Thread Jon Maloy
The tipc_nametbl_translate() function is ugly and hard to
follow. This can be improved somewhat by introducing a stack variable
for holding the publication list to be used and re-ordering the if-
clauses for selection of algorithm.

Signed-off-by: Jon Maloy <jon.ma...@ericsson.com>
---
 net/tipc/name_table.c | 61 +--
 1 file changed, 25 insertions(+), 36 deletions(-)
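
The re-ordered selection logic boils down to a three-way decision. As a
reading aid, here it is as a stand-alone sketch; the enum and the function
demo_select() are hypothetical and only model which publication list is
consulted, not the list rotation itself.

```c
#include <stdbool.h>
#include <stdint.h>

enum demo_lookup {
	DEMO_NO_MATCH,		/* caller asked for own node, nothing local */
	DEMO_LOCAL_LIST,	/* pick from the local publication list */
	DEMO_ALL_LIST,		/* round-robin over all publications */
};

/*
 * Mirrors the order of the if-clauses in the patch:
 * 1. explicit lookup on own node,
 * 2. legacy closest-first (dnode == 0 and something is published locally),
 * 3. otherwise round-robin over all publications.
 */
static enum demo_lookup demo_select(bool legacy, uint32_t dnode,
				    uint32_t self, bool local_empty)
{
	if (dnode == self)
		return local_empty ? DEMO_NO_MATCH : DEMO_LOCAL_LIST;
	if (legacy && dnode == 0 && !local_empty)
		return DEMO_LOCAL_LIST;
	return DEMO_ALL_LIST;
}
```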

diff --git a/net/tipc/name_table.c b/net/tipc/name_table.c
index e06c7a8..4bdc580 100644
--- a/net/tipc/name_table.c
+++ b/net/tipc/name_table.c
@@ -399,29 +399,32 @@ struct publication *tipc_nametbl_remove_publ(struct net 
*net, u32 type,
 /**
  * tipc_nametbl_translate - perform service instance to socket translation
  *
- * On entry, 'destnode' is the search domain used during translation.
+ * On entry, 'dnode' is the search domain used during translation.
  *
  * On exit:
- * - if name translation is deferred to another node/cluster/zone,
- *   leaves 'destnode' unchanged (will be non-zero) and returns 0
- * - if name translation is attempted and succeeds, sets 'destnode'
- *   to publication node and returns port reference (will be non-zero)
- * - if name translation is attempted and fails, sets 'destnode' to 0
- *   and returns 0
+ * - if translation is deferred to another node, leave 'dnode' unchanged and
+ *   return 0
+ * - if translation is attempted and succeeds, set 'dnode' to the publishing
+ *   node and return the published (non-zero) port number
+ * - if translation is attempted and fails, set 'dnode' to 0 and return 0
+ *
+ * Note that for legacy users (node configured with Z.C.N address format) the
+ * 'closest-first' lookup algorithm must be maintained, i.e., if dnode is 0
+ * we must look in the local binding list first
  */
-u32 tipc_nametbl_translate(struct net *net, u32 type, u32 instance,
-  u32 *destnode)
+u32 tipc_nametbl_translate(struct net *net, u32 type, u32 instance, u32 *dnode)
 {
struct tipc_net *tn = tipc_net(net);
bool legacy = tn->legacy_addr_format;
u32 self = tipc_own_addr(net);
struct service_range *sr;
struct tipc_service *sc;
+   struct list_head *list;
struct publication *p;
u32 port = 0;
u32 node = 0;
 
-   if (!tipc_in_scope(legacy, *destnode, self))
+   if (!tipc_in_scope(legacy, *dnode, self))
return 0;
 
rcu_read_lock();
@@ -434,43 +437,29 @@ u32 tipc_nametbl_translate(struct net *net, u32 type, u32 
instance,
if (unlikely(!sr))
goto no_match;
 
-   /* Closest-First Algorithm */
-   if (legacy && !*destnode) {
-   if (!list_empty(&sr->local_publ)) {
-   p = list_first_entry(&sr->local_publ,
-struct publication,
-local_publ);
-   list_move_tail(&p->local_publ,
-  &sr->local_publ);
-   } else {
-   p = list_first_entry(&sr->all_publ,
-struct publication,
-all_publ);
-   list_move_tail(&p->all_publ,
-  &sr->all_publ);
-   }
-   }
-
-   /* Round-Robin Algorithm */
-   else if (*destnode == self) {
-   if (list_empty(&sr->local_publ))
+   /* Select lookup algorithm: local, closest-first or round-robin */
+   if (*dnode == self) {
+   list = &sr->local_publ;
+   if (list_empty(list))
goto no_match;
-   p = list_first_entry(&sr->local_publ, struct publication,
-local_publ);
+   p = list_first_entry(list, struct publication, local_publ);
+   list_move_tail(&p->local_publ, &sr->local_publ);
+   } else if (legacy && !*dnode && !list_empty(&sr->local_publ)) {
+   list = &sr->local_publ;
+   p = list_first_entry(list, struct publication, local_publ);
list_move_tail(&p->local_publ, &sr->local_publ);
} else {
-   p = list_first_entry(&sr->all_publ, struct publication,
-all_publ);
+   list = &sr->all_publ;
+   p = list_first_entry(list, struct publication, all_publ);
list_move_tail(&p->all_publ, &sr->all_publ);
}
-
port = p->port;
node = p->node;
 no_match:
spin_unlock_bh(&sc->lock);
 not_found:
rcu_read_unlock();
-   *destnode = node;
+   *dnode = node;
return port;
 }
 
-- 
2.1.4



[net-next v2 5/5] tipc: avoid possible string overflow

2018-03-29 Thread Jon Maloy
gcc points out that the combined length of the fixed-length inputs to
l->name is larger than the destination buffer size:

net/tipc/link.c: In function 'tipc_link_create':
net/tipc/link.c:465:26: error: '%s' directive writing up to 32 bytes
into a region of size between 26 and 58 [-Werror=format-overflow=]
sprintf(l->name, "%s:%s-%s:unknown", self_str, if_name, peer_str);

net/tipc/link.c:465:2: note: 'sprintf' output 11 or more bytes
(assuming 75) into a destination of size 60
sprintf(l->name, "%s:%s-%s:unknown", self_str, if_name, peer_str);

A detailed analysis reveals that the theoretical maximum length of
a link name is:
max self_str + 1 + max if_name + 1 + max peer_str + 1 + max if_name =
16 + 1 + 15 + 1 + 16 + 1 + 15 = 65
Since we also need space for a trailing zero we now set
TIPC_MAX_LINK_NAME to 68.

Just to be on the safe side we also replace the sprintf() call with
snprintf().

Fixes: 25b0b9c4e835 ("tipc: handle collisions of 32-bit node address
hash values")
Reported-by: Arnd Bergmann <a...@arndb.de>

Signed-off-by: Jon Maloy <jon.ma...@ericsson.com>
---
 include/uapi/linux/tipc.h | 2 +-
 net/tipc/link.c   | 3 ++-
 2 files changed, 3 insertions(+), 2 deletions(-)
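
The length arithmetic in the commit message can be checked with a small
user-space program. This is only a sketch under the assumptions stated
there (16-character node strings, 15-character interface names);
demo_format_link_name() and DEMO_MAX_LINK_NAME are made-up names, and the
final link name format with the peer interface filled in is assumed.

```c
#include <stdio.h>

#define DEMO_MAX_LINK_NAME 68	/* mirrors the new TIPC_MAX_LINK_NAME */

/*
 * Build "self:if-peer:peer_if" with a bounded write. Like snprintf(),
 * this returns the length the complete name would need, so a return
 * value >= dst_size tells the caller that the name was truncated.
 */
static int demo_format_link_name(char *dst, size_t dst_size,
				 const char *self_str, const char *if_name,
				 const char *peer_str, const char *peer_if)
{
	return snprintf(dst, dst_size, "%s:%s-%s:%s",
			self_str, if_name, peer_str, peer_if);
}
```

With worst-case inputs the function reports 65 characters, which fits in
a 68-byte buffer including the terminating zero but would have been cut
short in the old 60-byte one.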

diff --git a/include/uapi/linux/tipc.h b/include/uapi/linux/tipc.h
index 156224a..bf6d286 100644
--- a/include/uapi/linux/tipc.h
+++ b/include/uapi/linux/tipc.h
@@ -216,7 +216,7 @@ struct tipc_group_req {
 #define TIPC_MAX_MEDIA_NAME16
 #define TIPC_MAX_IF_NAME   16
 #define TIPC_MAX_BEARER_NAME   32
-#define TIPC_MAX_LINK_NAME 60
+#define TIPC_MAX_LINK_NAME 68
 
 #define SIOCGETLINKNAMESIOCPROTOPRIVATE
 
diff --git a/net/tipc/link.c b/net/tipc/link.c
index 8f2a949..695acb7 100644
--- a/net/tipc/link.c
+++ b/net/tipc/link.c
@@ -462,7 +462,8 @@ bool tipc_link_create(struct net *net, char *if_name, int 
bearer_id,
sprintf(peer_str, "%x", peer);
}
/* Peer i/f name will be completed by reset/activate message */
-   sprintf(l->name, "%s:%s-%s:unknown", self_str, if_name, peer_str);
+   snprintf(l->name, sizeof(l->name), "%s:%s-%s:unknown",
+self_str, if_name, peer_str);
 
strcpy(l->if_name, if_name);
l->addr = peer;
-- 
2.1.4



[net-next v2 1/5] tipc: replace name table service range array with rb tree

2018-03-29 Thread Jon Maloy
The current design of the binding table has an unnecessarily memory
consuming and complex data structure. It aggregates the service range
items into an array, which is expanded by a factor of two every time it
becomes too small to hold a new item. Furthermore, the arrays never
shrink when the number of ranges diminishes.

We now replace this array with an RB tree that is holding the range
items as tree nodes, each range directly holding a list of bindings.

This, along with a few name changes, improves both readability and
volume of the code, as well as reducing memory consumption and hopefully
improving cache hit rate.

Signed-off-by: Jon Maloy <jon.ma...@ericsson.com>
---
 net/tipc/core.h   |1 +
 net/tipc/link.c   |2 +-
 net/tipc/name_table.c | 1032 ++---
 net/tipc/name_table.h |2 +-
 net/tipc/node.c   |4 +-
 net/tipc/subscr.h |4 +-
 6 files changed, 477 insertions(+), 568 deletions(-)

diff --git a/net/tipc/core.h b/net/tipc/core.h
index d0f64ca..8020a6c 100644
--- a/net/tipc/core.h
+++ b/net/tipc/core.h
@@ -58,6 +58,7 @@
 #include 
 #include 
 #include 
+#include 
 
 struct tipc_node;
 struct tipc_bearer;
diff --git a/net/tipc/link.c b/net/tipc/link.c
index 1289b4b..8f2a949 100644
--- a/net/tipc/link.c
+++ b/net/tipc/link.c
@@ -1810,7 +1810,7 @@ int tipc_link_bc_nack_rcv(struct tipc_link *l, struct 
sk_buff *skb,
 
 void tipc_link_set_queue_limits(struct tipc_link *l, u32 win)
 {
-   int max_bulk = TIPC_MAX_PUBLICATIONS / (l->mtu / ITEM_SIZE);
+   int max_bulk = TIPC_MAX_PUBL / (l->mtu / ITEM_SIZE);
 
l->window = win;
l->backlog[TIPC_LOW_IMPORTANCE].limit  = max_t(u16, 50, win);
diff --git a/net/tipc/name_table.c b/net/tipc/name_table.c
index 4359605..e06c7a8 100644
--- a/net/tipc/name_table.c
+++ b/net/tipc/name_table.c
@@ -44,52 +44,40 @@
 #include "addr.h"
 #include "node.h"
 #include "group.h"
-#include 
-
-#define TIPC_NAMETBL_SIZE 1024 /* must be a power of 2 */
 
 /**
- * struct name_info - name sequence publication info
- * @node_list: list of publications on own node of this <type,lower,upper>
- * @all_publ: list of all publications of this <type,lower,upper>
+ * struct service_range - container for all bindings of a service range
+ * @lower: service range lower bound
+ * @upper: service range upper bound
+ * @tree_node: member of service range RB tree
+ * @local_publ: list of identical publications made from this node
+ *   Used by closest_first lookup and multicast lookup algorithm
+ * @all_publ: all publications identical to this one, whatever node and scope
+ *   Used by round-robin lookup algorithm
  */
-struct name_info {
-   struct list_head local_publ;
-   struct list_head all_publ;
-};
-
-/**
- * struct sub_seq - container for all published instances of a name sequence
- * @lower: name sequence lower bound
- * @upper: name sequence upper bound
- * @info: pointer to name sequence publication info
- */
-struct sub_seq {
+struct service_range {
u32 lower;
u32 upper;
-   struct name_info *info;
+   struct rb_node tree_node;
+   struct list_head local_publ;
+   struct list_head all_publ;
 };
 
 /**
- * struct name_seq - container for all published instances of a name type
- * @type: 32 bit 'type' value for name sequence
- * @sseq: pointer to dynamically-sized array of sub-sequences of this 'type';
- *sub-sequences are sorted in ascending order
- * @alloc: number of sub-sequences currently in array
- * @first_free: array index of first unused sub-sequence entry
- * @ns_list: links to adjacent name sequences in hash chain
- * @subscriptions: list of subscriptions for this 'type'
- * @lock: spinlock controlling access to publication lists of all sub-sequences
+ * struct tipc_service - container for all published instances of a service 
type
+ * @type: 32 bit 'type' value for service
+ * @ranges: rb tree containing all service ranges for this service
+ * @service_list: links to adjacent name ranges in hash chain
+ * @subscriptions: list of subscriptions for this service type
+ * @lock: spinlock controlling access to pertaining service ranges/publications
  * @rcu: RCU callback head used for deferred freeing
  */
-struct name_seq {
+struct tipc_service {
u32 type;
-   struct sub_seq *sseqs;
-   u32 alloc;
-   u32 first_free;
-   struct hlist_node ns_list;
+   struct rb_root ranges;
+   struct hlist_node service_list;
struct list_head subscriptions;
-   spinlock_t lock;
+   spinlock_t lock; /* Covers service range list */
struct rcu_head rcu;
 };
 
@@ -99,17 +87,16 @@ static int hash(int x)
 }
 
 /**
- * publ_create - create a publication structure
+ * tipc_publ_create - create a publication structure
  */
-static struct publication *publ_create(u32 type, u32 lower, u32 upper,
-   

[net-next v2 3/5] tipc: permit overlapping service ranges in name table

2018-03-29 Thread Jon Maloy
With the new RB tree structure for service ranges it becomes possible to
solve an old problem: we can now allow overlapping service ranges in
the table.

When inserting a new service range into the tree, we use 'lower' as the
primary key and, when necessary, 'upper' as the secondary key.

Since there may now be multiple service ranges matching an indicated
'lower' value, we must also add the 'upper' value to the functions
used for removing publications, so that the correct, corresponding
range item can be found.
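
As a rough illustration of the ordering described above, the key
comparison could be sketched as follows. This is a user-space sketch and
not the code in net/tipc/name_table.c; 'struct range_key',
range_key_cmp() and range_key_overlaps() are hypothetical names that
exist only in this example.

```c
#include <stdint.h>

/* Hypothetical stand-in for a service range key; not the kernel struct. */
struct range_key {
	uint32_t lower;
	uint32_t upper;
};

/*
 * Order two ranges by 'lower' first and by 'upper' only on a tie.
 * Returns a negative value, zero or a positive value, like memcmp().
 */
static int range_key_cmp(const struct range_key *a, const struct range_key *b)
{
	if (a->lower != b->lower)
		return a->lower < b->lower ? -1 : 1;
	if (a->upper != b->upper)
		return a->upper < b->upper ? -1 : 1;
	return 0;
}

/* Two ranges overlap when neither one ends before the other starts. */
static int range_key_overlaps(const struct range_key *a,
			      const struct range_key *b)
{
	return a->lower <= b->upper && b->lower <= a->upper;
}
```

With such an ordering, {10,20} and {10,30} are distinct keys even though
they share 'lower', which is why a removal needs 'upper' as well to
identify the intended range.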

These changes guarantee that a well-formed publication/withdrawal item
from a peer node will never be rejected, and make it possible to
eliminate the problematic backlog functionality we currently have for
handling such cases.

Signed-off-by: Jon Maloy <jon.ma...@ericsson.com>
---
 net/tipc/name_distr.c | 90 +--
 net/tipc/name_distr.h |  1 -
 net/tipc/name_table.c | 64 +---
 net/tipc/name_table.h |  8 ++---
 net/tipc/net.c|  2 +-
 net/tipc/node.c   |  2 +-
 net/tipc/socket.c |  4 +--
 7 files changed, 60 insertions(+), 111 deletions(-)

diff --git a/net/tipc/name_distr.c b/net/tipc/name_distr.c
index 8240a85..51b4b96 100644
--- a/net/tipc/name_distr.c
+++ b/net/tipc/name_distr.c
@@ -204,12 +204,12 @@ void tipc_named_node_up(struct net *net, u32 dnode)
  */
 static void tipc_publ_purge(struct net *net, struct publication *publ, u32 addr)
 {
-   struct tipc_net *tn = net_generic(net, tipc_net_id);
+   struct tipc_net *tn = tipc_net(net);
struct publication *p;
 
spin_lock_bh(&tn->nametbl_lock);
-   p = tipc_nametbl_remove_publ(net, publ->type, publ->lower,
-publ->node, publ->port, publ->key);
+   p = tipc_nametbl_remove_publ(net, publ->type, publ->lower, publ->upper,
+publ->node, publ->key);
if (p)
tipc_node_unsubscribe(net, &p->binding_node, addr);
spin_unlock_bh(&tn->nametbl_lock);
@@ -261,28 +261,31 @@ void tipc_publ_notify(struct net *net, struct list_head 
*nsub_list, u32 addr)
 static bool tipc_update_nametbl(struct net *net, struct distr_item *i,
u32 node, u32 dtype)
 {
-   struct publication *publ = NULL;
+   struct publication *p = NULL;
+   u32 lower = ntohl(i->lower);
+   u32 upper = ntohl(i->upper);
+   u32 type = ntohl(i->type);
+   u32 port = ntohl(i->port);
+   u32 key = ntohl(i->key);
 
if (dtype == PUBLICATION) {
-   publ = tipc_nametbl_insert_publ(net, ntohl(i->type),
-   ntohl(i->lower),
-   ntohl(i->upper),
-   TIPC_CLUSTER_SCOPE, node,
-   ntohl(i->port), ntohl(i->key));
-   if (publ) {
-   tipc_node_subscribe(net, &publ->binding_node, node);
+   p = tipc_nametbl_insert_publ(net, type, lower, upper,
+TIPC_CLUSTER_SCOPE, node,
+port, key);
+   if (p) {
+   tipc_node_subscribe(net, &p->binding_node, node);
return true;
}
} else if (dtype == WITHDRAWAL) {
-   publ = tipc_nametbl_remove_publ(net, ntohl(i->type),
-   ntohl(i->lower),
-   node, ntohl(i->port),
-   ntohl(i->key));
-   if (publ) {
-   tipc_node_unsubscribe(net, &publ->binding_node, node);
-   kfree_rcu(publ, rcu);
+   p = tipc_nametbl_remove_publ(net, type, lower,
+upper, node, key);
+   if (p) {
+   tipc_node_unsubscribe(net, &p->binding_node, node);
+   kfree_rcu(p, rcu);
return true;
}
+   pr_warn_ratelimited("Failed to remove binding %u,%u from %x\n",
+   type, lower, node);
} else {
pr_warn("Unrecognized name table message received\n");
}
@@ -290,53 +293,6 @@ static bool tipc_update_nametbl(struct net *net, struct 
distr_item *i,
 }
 
 /**
- * tipc_named_add_backlog - add a failed name table update to the backlog
- *
- */
-static void tipc_named_add_backlog(struct net *net, struct distr_item *i,
-  u32 type, u32 node)
-{
-   struct distr_queue_item *e;
-   struct tipc_net *tn = net_generic(net, tipc_net_id);
-   unsigned long now = get_jiffies_64();
-
-   e = kzalloc(sizeof(*e), GFP_ATO

[net-next v2 4/5] tipc: tipc: rename address types in user api

2018-03-29 Thread Jon Maloy
The three address type structs in the user API have names that in
reality reflect the specific, non-Linux environment where they were
originally created.

We now give them more intuitive names, in accordance with how TIPC is
described in the current documentation.

struct tipc_portid   -> struct tipc_socket_addr
struct tipc_name -> struct tipc_service_addr
struct tipc_name_seq -> struct tipc_service_range

To avoid confusion, we also update some comments and macro names to
match the new terminology.

For compatibility, we add macros that map all old names to the new ones.
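
A minimal sketch of how such a compatibility macro keeps legacy code
compiling. The struct below mirrors the renamed type but uses uint32_t
instead of __u32 so that it builds outside the kernel headers, and
legacy_range_contains() is a hypothetical helper written only for this
example.

```c
#include <stdint.h>

/* New name, as introduced by the patch (field types simplified). */
struct tipc_service_range {
	uint32_t type;
	uint32_t lower;
	uint32_t upper;
};

/* Compatibility mapping: the old identifier expands to the new one. */
#define tipc_name_seq tipc_service_range

/* Legacy-style code that still spells the type with its old name. */
static int legacy_range_contains(const struct tipc_name_seq *seq,
				 uint32_t instance)
{
	return instance >= seq->lower && instance <= seq->upper;
}
```

Because the macro is a plain textual substitution, 'struct tipc_name_seq'
and 'struct tipc_service_range' denote the same type, so pointers to one
can be passed where the other is expected.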

Signed-off-by: Jon Maloy <jon.ma...@ericsson.com>
---
 include/uapi/linux/tipc.h | 57 +++
 1 file changed, 33 insertions(+), 24 deletions(-)

diff --git a/include/uapi/linux/tipc.h b/include/uapi/linux/tipc.h
index 4ac9f1f..156224a 100644
--- a/include/uapi/linux/tipc.h
+++ b/include/uapi/linux/tipc.h
@@ -45,33 +45,33 @@
  * TIPC addressing primitives
  */
 
-struct tipc_portid {
+struct tipc_socket_addr {
__u32 ref;
__u32 node;
 };
 
-struct tipc_name {
+struct tipc_service_addr {
__u32 type;
__u32 instance;
 };
 
-struct tipc_name_seq {
+struct tipc_service_range {
__u32 type;
__u32 lower;
__u32 upper;
 };
 
 /*
- * Application-accessible port name types
+ * Application-accessible service types
  */
 
-#define TIPC_CFG_SRV   0   /* configuration service name type */
-#define TIPC_TOP_SRV   1   /* topology service name type */
-#define TIPC_LINK_STATE2   /* link state name type */
-#define TIPC_RESERVED_TYPES64  /* lowest user-publishable name type */
+#define TIPC_NODE_STATE0   /* node state service type */
+#define TIPC_TOP_SRV   1   /* topology server service type */
+#define TIPC_LINK_STATE2   /* link state service type */
+#define TIPC_RESERVED_TYPES64  /* lowest user-allowed service type */
 
 /*
- * Publication scopes when binding port names and port name sequences
+ * Publication scopes when binding service / service range
  */
 enum tipc_scope {
TIPC_CLUSTER_SCOPE = 2, /* 0 can also be used */
@@ -108,28 +108,28 @@ enum tipc_scope {
  * TIPC topology subscription service definitions
  */
 
-#define TIPC_SUB_PORTS 0x01/* filter for port availability */
-#define TIPC_SUB_SERVICE   0x02/* filter for service availability */
-#define TIPC_SUB_CANCEL0x04/* cancel a subscription */
+#define TIPC_SUB_PORTS  0x01/* filter: evt at each match */
+#define TIPC_SUB_SERVICE0x02/* filter: evt at first up/last down */
+#define TIPC_SUB_CANCEL 0x04/* filter: cancel a subscription */
 
 #define TIPC_WAIT_FOREVER  (~0)/* timeout for permanent subscription */
 
 struct tipc_subscr {
-   struct tipc_name_seq seq;   /* name sequence of interest */
+   struct tipc_service_range seq;  /* range of interest */
__u32 timeout;  /* subscription duration (in ms) */
__u32 filter;   /* bitmask of filter options */
char usr_handle[8]; /* available for subscriber use */
 };
 
 #define TIPC_PUBLISHED 1   /* publication event */
-#define TIPC_WITHDRAWN 2   /* withdraw event */
+#define TIPC_WITHDRAWN 2   /* withdrawal event */
 #define TIPC_SUBSCR_TIMEOUT3   /* subscription timeout event */
 
 struct tipc_event {
__u32 event;/* event type */
-   __u32 found_lower;  /* matching name seq instances */
-   __u32 found_upper;  /*"  "" "  */
-   struct tipc_portid port;/* associated port */
+   __u32 found_lower;  /* matching range */
+   __u32 found_upper;  /*"  "*/
+   struct tipc_socket_addr port;   /* associated socket */
struct tipc_subscr s;   /* associated subscription */
 };
 
@@ -149,20 +149,20 @@ struct tipc_event {
 #define SOL_TIPC   271
 #endif
 
-#define TIPC_ADDR_NAMESEQ  1
-#define TIPC_ADDR_MCAST1
-#define TIPC_ADDR_NAME 2
-#define TIPC_ADDR_ID   3
+#define TIPC_ADDR_MCAST 1
+#define TIPC_SERVICE_RANGE  1
+#define TIPC_SERVICE_ADDR   2
+#define TIPC_SOCKET_ADDR3
 
 struct sockaddr_tipc {
unsigned short family;
unsigned char  addrtype;
signed   char  scope;
union {
-   struct tipc_portid id;
-   struct tipc_name_seq nameseq;
+   struct tipc_socket_addr id;
+   struct tipc_service_range nameseq;
struct {
-   struct tipc_name name;
+   struct tipc_service_addr name;
__u32 domain;
} name

[net-next v2 0/5] tipc: slim down name table

2018-03-29 Thread Jon Maloy
We clean up and improve the name binding table:

 - Replace the memory consuming 'sub_sequence/service range' array with
   an RB tree.
 - Introduce support for overlapping service sequences/ranges

 v2: #1: Fixed a missing initialization reported by David Miller
 #4: Obsoleted and replaced a few more macros to get a consistent
 terminology in the API.
 #5: Added new commit to fix a potential string overflow bug (it
 is still only in net-next) reported by Arnd Bergmann

Jon Maloy (5):
  tipc: replace name table service range array with rb tree
  tipc: refactor name table translate function
  tipc: permit overlapping service ranges in name table
  tipc: tipc: rename address types in user api
  tipc: avoid possible string overflow

 include/uapi/linux/tipc.h |   59 +--
 net/tipc/core.h   |1 +
 net/tipc/link.c   |5 +-
 net/tipc/name_distr.c |   90 +---
 net/tipc/name_distr.h |1 -
 net/tipc/name_table.c | 1075 -
 net/tipc/name_table.h |   10 +-
 net/tipc/net.c|2 +-
 net/tipc/node.c   |4 +-
 net/tipc/socket.c |4 +-
 net/tipc/subscr.h |4 +-
 11 files changed, 556 insertions(+), 699 deletions(-)

-- 
2.1.4



[iproute2-next 1/2] tipc: introduce command for handling a new 128-bit node identity

2018-03-28 Thread Jon Maloy
We add the possibility to set and get a 128-bit node identifier, as
an alternative to the legacy 32-bit node address we are using now.

We also add an option to set and get 'clusterid' in the node. This
is the same as what we have so far called 'netid' and performs the
same operations. For compatibility the old 'netid' commands are
retained; we just remove them from the help texts.
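
To illustrate what setting a 128-bit identity from a hex string
involves, here is a simplified sketch. It is not the str2nodeid()
function from this patch: it handles only the hex form, and hex_val()
and hex_to_nodeid() are hypothetical names used for this example.

```c
#include <stdint.h>
#include <string.h>

/* Convert one hex digit to its value, or return -1 if it is not hex. */
static int hex_val(char c)
{
	if (c >= '0' && c <= '9')
		return c - '0';
	if (c >= 'a' && c <= 'f')
		return c - 'a' + 10;
	if (c >= 'A' && c <= 'F')
		return c - 'A' + 10;
	return -1;
}

/*
 * Parse up to 32 hex digits into a 16-byte identity. Digits that are
 * not supplied are treated as zero. Returns 0 on success, -1 on error.
 */
static int hex_to_nodeid(const char *str, uint8_t id[16])
{
	size_t len = strlen(str);
	size_t i;

	if (len == 0 || len > 32)
		return -1;

	memset(id, 0, 16);
	for (i = 0; i < len; i++) {
		int v = hex_val(str[i]);

		if (v < 0)
			return -1;
		/* Even positions fill the high nibble, odd ones the low. */
		id[i / 2] |= (uint8_t)(i % 2 ? v : v << 4);
	}
	return 0;
}
```

Padding on the right means a short string such as "abc" ends up as the
bytes ab c0 00 ..., which is an assumption of this sketch rather than a
statement about the tool's behaviour.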

Acked-by: GhantaKrishnamurthy MohanKrishna 
<mohan.krishna.ghanta.krishnamur...@ericsson.com>
Signed-off-by: Jon Maloy <jon.ma...@ericsson.com>
---
 include/uapi/linux/tipc_netlink.h |  2 +
 tipc/misc.c   | 78 ++-
 tipc/misc.h   |  2 +
 tipc/node.c   | 98 +--
 4 files changed, 174 insertions(+), 6 deletions(-)

diff --git a/include/uapi/linux/tipc_netlink.h 
b/include/uapi/linux/tipc_netlink.h
index 469aa67..6bf8ec6 100644
--- a/include/uapi/linux/tipc_netlink.h
+++ b/include/uapi/linux/tipc_netlink.h
@@ -162,6 +162,8 @@ enum {
TIPC_NLA_NET_UNSPEC,
TIPC_NLA_NET_ID,/* u32 */
TIPC_NLA_NET_ADDR,  /* u32 */
+   TIPC_NLA_NET_NODEID,/* u64 */
+   TIPC_NLA_NET_NODEID_W1, /* u64 */
 
__TIPC_NLA_NET_MAX,
TIPC_NLA_NET_MAX = __TIPC_NLA_NET_MAX - 1
diff --git a/tipc/misc.c b/tipc/misc.c
index 8091222..16849f1 100644
--- a/tipc/misc.c
+++ b/tipc/misc.c
@@ -12,7 +12,7 @@
 #include 
 #include 
 #include 
-
+#include 
 #include "misc.h"
 
 #define IN_RANGE(val, low, high) ((val) <= (high) && (val) >= (low))
@@ -33,3 +33,79 @@ uint32_t str2addr(char *str)
fprintf(stderr, "invalid network address \"%s\"\n", str);
return 0;
 }
+
+static int is_hex(char *arr, int last)
+{
+   int i;
+
+   while (!arr[last])
+   last--;
+
+   for (i = 0; i <= last; i++) {
+   if (!IN_RANGE(arr[i], '0', '9') &&
+   !IN_RANGE(arr[i], 'a', 'f') &&
+   !IN_RANGE(arr[i], 'A', 'F'))
+   return 0;
+   }
+   return 1;
+}
+
+static int is_name(char *arr, int last)
+{
+   int i;
+   char c;
+
+   while (!arr[last])
+   last--;
+
+   if (last > 15)
+   return 0;
+
+   for (i = 0; i <= last; i++) {
+   c = arr[i];
+   if (!IN_RANGE(c, '0', '9') && !IN_RANGE(c, 'a', 'z') &&
+   !IN_RANGE(c, 'A', 'Z') && c != '-' && c != '_' &&
+   c != '.' && c != ':' && c != '@')
+   return 0;
+   }
+   return 1;
+}
+
+int str2nodeid(char *str, uint8_t *id)
+{
+   int len = strlen(str);
+   int i;
+
+   if (len > 32)
+   return -1;
+
+   if (is_name(str, len - 1)) {
+   memcpy(id, str, len);
+   return 0;
+   }
+   if (!is_hex(str, len - 1))
+   return -1;
+
+   str[len] = '0';
+   for (i = 0; i < 16; i++) {
+   if (sscanf(&str[2 * i], "%2hhx", &id[i]) != 1)
+   break;
+   }
+   return 0;
+}
+
+void nodeid2str(uint8_t *id, char *str)
+{
+   int i;
+
+   if (is_name((char *)id, 15)) {
+   memcpy(str, id, 16);
+   return;
+   }
+
+   for (i = 0; i < 16; i++)
+   sprintf(&str[2 * i], "%02x", id[i]);
+
+   for (i = 31; str[i] == '0'; i--)
+   str[i] = 0;
+}
diff --git a/tipc/misc.h b/tipc/misc.h
index 585df74..6e8afdd 100644
--- a/tipc/misc.h
+++ b/tipc/misc.h
@@ -15,5 +15,7 @@
 #include 
 
 uint32_t str2addr(char *str);
+int str2nodeid(char *str, uint8_t *id);
+void nodeid2str(uint8_t *id, char *str);
 
 #endif
diff --git a/tipc/node.c b/tipc/node.c
index fe085ae..3ebbe0b 100644
--- a/tipc/node.c
+++ b/tipc/node.c
@@ -131,6 +131,90 @@ static int cmd_node_get_addr(struct nlmsghdr *nlh, const 
struct cmd *cmd,
return 0;
 }
 
+static int cmd_node_set_nodeid(struct nlmsghdr *nlh, const struct cmd *cmd,
+  struct cmdl *cmdl, void *data)
+{
+   char buf[MNL_SOCKET_BUFFER_SIZE];
+   uint8_t id[16] = {0,};
+   uint64_t *w0 = (uint64_t *) &id[0];
+   uint64_t *w1 = (uint64_t *) &id[8];
+   struct nlattr *nest;
+   char *str;
+
+   if (cmdl->argc != cmdl->optind + 1) {
+   fprintf(stderr, "Usage: %s node set nodeid NODE_ID\n",
+   cmdl->argv[0]);
+   return -EINVAL;
+   }
+
+   str = shift_cmdl(cmdl);
+   if (str2nodeid(str, id)) {
+   fprintf(stderr, "Invalid node identity\n");
+   return -EINVAL;
+   }
+
+   nlh = msg_init(buf, TIPC_NL_NET_SET);
+   if (!nlh) {
+   fprintf(stderr, "error, message init

[iproute2-next 0/2] tipc: changes to addressing structure

2018-03-28 Thread Jon Maloy
1: We introduce ability to set/get 128-bit node identities
2: We rename 'net id' to 'cluster id' in the command API, 
   of course in a compatible way.
3: We print out all 32-bit node addresses as an integer in hex format,
   i.e., we remove the assumption about an internal structure.

Jon Maloy (2):
  tipc: introduce command for handling a new 128-bit node identity
  tipc: change node address printout formats

 include/uapi/linux/tipc_netlink.h |   2 +
 tipc/link.c   |   3 +-
 tipc/misc.c   |  78 ++-
 tipc/misc.h   |   2 +
 tipc/nametable.c  |  16 ++
 tipc/node.c   | 109 +-
 tipc/socket.c |   3 +-
 7 files changed, 183 insertions(+), 30 deletions(-)

-- 
2.1.4



[iproute2-next 2/2] tipc: change node address printout formats

2018-03-28 Thread Jon Maloy
Since a node address is now by definition only an unstructured 32-bit
integer, it makes no sense to print it out as a structured string.

In this commit, we replace all occurrences of "<Z.C.N>" printouts with
just an "%x".
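
The difference between the two formats can be sketched as below. The
helpers format_legacy() and format_hex() are hypothetical and exist only
in this example. The 8/12/12 bit split for zone/cluster/node is the
legacy address layout as I understand it from the uapi header, so treat
it as an assumption here.

```c
#include <stdint.h>
#include <stdio.h>

/* Legacy view: 8-bit zone, 12-bit cluster, 12-bit node. */
static void format_legacy(uint32_t addr, char *buf, size_t len)
{
	snprintf(buf, len, "<%u.%u.%u>",
		 (unsigned int)(addr >> 24),
		 (unsigned int)((addr >> 12) & 0xfff),
		 (unsigned int)(addr & 0xfff));
}

/* New view: the address is an opaque 32-bit number printed in hex. */
static void format_hex(uint32_t addr, char *buf, size_t len)
{
	snprintf(buf, len, "%x", (unsigned int)addr);
}
```

For the address 0x01001001, the first helper produces "<1.1.1>" and the
second produces "1001001".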

Acked-by: GhantaKrishnamurthy MohanKrishna 
<mohan.krishna.ghanta.krishnamur...@ericsson.com>
Signed-off-by: Jon Maloy <jon.ma...@ericsson.com>
---
 tipc/link.c  |  3 +--
 tipc/nametable.c | 16 +---
 tipc/node.c  | 11 ++-
 tipc/socket.c|  3 +--
 4 files changed, 9 insertions(+), 24 deletions(-)

diff --git a/tipc/link.c b/tipc/link.c
index 4ae1c91..02f14aa 100644
--- a/tipc/link.c
+++ b/tipc/link.c
@@ -616,8 +616,7 @@ static void link_mon_print_non_applied(uint16_t applied, 
uint16_t member_cnt,
if (i != applied)
printf(",");
 
-   sprintf(addr_str, "%u.%u.%u:", tipc_zone(members[i]),
-   tipc_cluster(members[i]), tipc_node(members[i]));
+   sprintf(addr_str, "%x:", members[i]);
state = map_get(up_map, i) ? 'U' : 'D';
printf("%s%c", addr_str, state);
}
diff --git a/tipc/nametable.c b/tipc/nametable.c
index 770a644..2578940 100644
--- a/tipc/nametable.c
+++ b/tipc/nametable.c
@@ -26,7 +26,6 @@
 static int nametable_show_cb(const struct nlmsghdr *nlh, void *data)
 {
int *iteration = data;
-   char port_id[PORTID_STR_LEN];
struct genlmsghdr *genl = mnl_nlmsg_get_payload(nlh);
struct nlattr *info[TIPC_NLA_MAX + 1] = {};
struct nlattr *attrs[TIPC_NLA_NAME_TABLE_MAX + 1] = {};
@@ -46,22 +45,17 @@ static int nametable_show_cb(const struct nlmsghdr *nlh, 
void *data)
return MNL_CB_ERROR;
 
if (!*iteration)
-   printf("%-10s %-10s %-10s %-26s %-10s\n",
-  "Type", "Lower", "Upper", "Port Identity",
+   printf("%-10s %-10s %-10s %-10s %-10s %-10s\n",
+  "Type", "Lower", "Upper", "Node", "Port",
   "Publication Scope");
(*iteration)++;
 
-   snprintf(port_id, sizeof(port_id), "<%u.%u.%u:%u>",
-tipc_zone(mnl_attr_get_u32(publ[TIPC_NLA_PUBL_NODE])),
-tipc_cluster(mnl_attr_get_u32(publ[TIPC_NLA_PUBL_NODE])),
-tipc_node(mnl_attr_get_u32(publ[TIPC_NLA_PUBL_NODE])),
-mnl_attr_get_u32(publ[TIPC_NLA_PUBL_REF]));
-
-   printf("%-10u %-10u %-10u %-26s %-12u",
+   printf("%-10u %-10u %-10u %-10x %-10u %-12u",
   mnl_attr_get_u32(publ[TIPC_NLA_PUBL_TYPE]),
   mnl_attr_get_u32(publ[TIPC_NLA_PUBL_LOWER]),
   mnl_attr_get_u32(publ[TIPC_NLA_PUBL_UPPER]),
-  port_id,
+  mnl_attr_get_u32(publ[TIPC_NLA_PUBL_NODE]),
+  mnl_attr_get_u32(publ[TIPC_NLA_PUBL_REF]),
   mnl_attr_get_u32(publ[TIPC_NLA_PUBL_KEY]));
 
printf("%s\n", scope[mnl_attr_get_u32(publ[TIPC_NLA_PUBL_SCOPE])]);
diff --git a/tipc/node.c b/tipc/node.c
index 3ebbe0b..b73b644 100644
--- a/tipc/node.c
+++ b/tipc/node.c
@@ -40,10 +40,7 @@ static int node_list_cb(const struct nlmsghdr *nlh, void 
*data)
return MNL_CB_ERROR;
 
addr = mnl_attr_get_u32(attrs[TIPC_NLA_NODE_ADDR]);
-   printf("<%u.%u.%u>: ",
-   tipc_zone(addr),
-   tipc_cluster(addr),
-   tipc_node(addr));
+   printf("%x: ", addr);
 
if (attrs[TIPC_NLA_NODE_UP])
printf("up\n");
@@ -123,11 +120,7 @@ static int cmd_node_get_addr(struct nlmsghdr *nlh, const 
struct cmd *cmd,
}
close(sk);
 
-   printf("<%u.%u.%u>\n",
-   tipc_zone(addr.addr.id.node),
-   tipc_cluster(addr.addr.id.node),
-   tipc_node(addr.addr.id.node));
-
+   printf("%x\n", addr.addr.id.node);
return 0;
 }
 
diff --git a/tipc/socket.c b/tipc/socket.c
index 48ba821..852984e 100644
--- a/tipc/socket.c
+++ b/tipc/socket.c
@@ -84,8 +84,7 @@ static int sock_list_cb(const struct nlmsghdr *nlh, void 
*data)
mnl_attr_parse_nested(attrs[TIPC_NLA_SOCK_CON], parse_attrs, con);
node = mnl_attr_get_u32(con[TIPC_NLA_CON_NODE]);
 
-   printf("  connected to <%u.%u.%u:%u>", tipc_zone(node),
-   tipc_cluster(node), tipc_node(node),
+   printf("  connected to %x:%u", node,
mnl_attr_get_u32(con[TIPC_NLA_CON_SOCK]));
 
if (con[TIPC_NLA_CON_FLAG])
-- 
2.1.4



[net-next 4/4] tipc: tipc: rename address types in user api

2018-03-28 Thread Jon Maloy
The three address type structs in the user API have names that in
reality reflect the specific, non-Linux environment where they were
originally created.

We now give them more intuitive names, in accordance with how TIPC is
described in the current documentation.

struct tipc_portid   -> struct tipc_socket_addr
struct tipc_name -> struct tipc_service_addr
struct tipc_name_seq -> struct tipc_service_range

For compatibility, we add macros that map the old names to the new ones.

Signed-off-by: Jon Maloy <jon.ma...@ericsson.com>
---
 include/uapi/linux/tipc.h | 32 ++--
 1 file changed, 18 insertions(+), 14 deletions(-)

diff --git a/include/uapi/linux/tipc.h b/include/uapi/linux/tipc.h
index 4ac9f1f..254f6e3 100644
--- a/include/uapi/linux/tipc.h
+++ b/include/uapi/linux/tipc.h
@@ -45,17 +45,17 @@
  * TIPC addressing primitives
  */
 
-struct tipc_portid {
+struct tipc_socket_addr {
__u32 ref;
__u32 node;
 };
 
-struct tipc_name {
+struct tipc_service_addr {
__u32 type;
__u32 instance;
 };
 
-struct tipc_name_seq {
+struct tipc_service_range {
__u32 type;
__u32 lower;
__u32 upper;
@@ -108,28 +108,28 @@ enum tipc_scope {
  * TIPC topology subscription service definitions
  */
 
-#define TIPC_SUB_PORTS 0x01/* filter for port availability */
-#define TIPC_SUB_SERVICE   0x02/* filter for service availability */
-#define TIPC_SUB_CANCEL0x04/* cancel a subscription */
+#define TIPC_SUB_PORTS  0x01/* filter: evt at each match */
+#define TIPC_SUB_SERVICE0x02/* filter: evt at first up/last down */
+#define TIPC_SUB_CANCEL 0x04/* filter: cancel a subscription */
 
 #define TIPC_WAIT_FOREVER  (~0)/* timeout for permanent subscription */
 
 struct tipc_subscr {
-   struct tipc_name_seq seq;   /* name sequence of interest */
+   struct tipc_service_range seq;  /* range of interest */
__u32 timeout;  /* subscription duration (in ms) */
__u32 filter;   /* bitmask of filter options */
char usr_handle[8]; /* available for subscriber use */
 };
 
 #define TIPC_PUBLISHED 1   /* publication event */
-#define TIPC_WITHDRAWN 2   /* withdraw event */
+#define TIPC_WITHDRAWN 2   /* withdrawal event */
 #define TIPC_SUBSCR_TIMEOUT3   /* subscription timeout event */
 
 struct tipc_event {
__u32 event;/* event type */
-   __u32 found_lower;  /* matching name seq instances */
-   __u32 found_upper;  /*"  "" "  */
-   struct tipc_portid port;/* associated port */
+   __u32 found_lower;  /* matching range */
+   __u32 found_upper;  /*"  "*/
+   struct tipc_socket_addr port;   /* associated socket */
struct tipc_subscr s;   /* associated subscription */
 };
 
@@ -159,10 +159,10 @@ struct sockaddr_tipc {
unsigned char  addrtype;
signed   char  scope;
union {
-   struct tipc_portid id;
-   struct tipc_name_seq nameseq;
+   struct tipc_socket_addr id;
+   struct tipc_service_range nameseq;
struct {
-   struct tipc_name name;
+   struct tipc_service_addr name;
__u32 domain;
} name;
} addr;
@@ -250,6 +250,10 @@ struct tipc_sioc_ln_req {
 
 #define TIPC_ZONE_CLUSTER_MASK (TIPC_ZONE_MASK | TIPC_CLUSTER_MASK)
 
+#define tipc_portid tipc_socket_addr
+#define tipc_name tipc_service_addr
+#define tipc_name_seq tipc_service_range
+
 static inline __u32 tipc_addr(unsigned int zone,
  unsigned int cluster,
  unsigned int node)
-- 
2.1.4



[net-next 3/4] tipc: permit overlapping service ranges in name table

2018-03-28 Thread Jon Maloy
With the new RB tree structure for service ranges it becomes possible to
solve an old problem: we can now allow overlapping service ranges in
the table.

When inserting a new service range into the tree, we use 'lower' as the
primary key and, when necessary, 'upper' as the secondary key.

Since there may now be multiple service ranges matching an indicated
'lower' value, we must also add the 'upper' value to the functions
used for removing publications, so that the correct, corresponding
range item can be found.

These changes guarantee that a well-formed publication/withdrawal item
from a peer node will never be rejected, and make it possible to
eliminate the problematic backlog functionality we currently have for
handling such cases.

Signed-off-by: Jon Maloy <jon.ma...@ericsson.com>
---
 net/tipc/name_distr.c | 90 +--
 net/tipc/name_distr.h |  1 -
 net/tipc/name_table.c | 64 +---
 net/tipc/name_table.h |  8 ++---
 net/tipc/net.c|  2 +-
 net/tipc/node.c   |  2 +-
 net/tipc/socket.c |  4 +--
 7 files changed, 60 insertions(+), 111 deletions(-)

diff --git a/net/tipc/name_distr.c b/net/tipc/name_distr.c
index 8240a85..51b4b96 100644
--- a/net/tipc/name_distr.c
+++ b/net/tipc/name_distr.c
@@ -204,12 +204,12 @@ void tipc_named_node_up(struct net *net, u32 dnode)
  */
 static void tipc_publ_purge(struct net *net, struct publication *publ, u32 addr)
 {
-   struct tipc_net *tn = net_generic(net, tipc_net_id);
+   struct tipc_net *tn = tipc_net(net);
struct publication *p;
 
spin_lock_bh(&tn->nametbl_lock);
-   p = tipc_nametbl_remove_publ(net, publ->type, publ->lower,
-publ->node, publ->port, publ->key);
+   p = tipc_nametbl_remove_publ(net, publ->type, publ->lower, publ->upper,
+publ->node, publ->key);
if (p)
tipc_node_unsubscribe(net, &p->binding_node, addr);
spin_unlock_bh(&tn->nametbl_lock);
@@ -261,28 +261,31 @@ void tipc_publ_notify(struct net *net, struct list_head 
*nsub_list, u32 addr)
 static bool tipc_update_nametbl(struct net *net, struct distr_item *i,
u32 node, u32 dtype)
 {
-   struct publication *publ = NULL;
+   struct publication *p = NULL;
+   u32 lower = ntohl(i->lower);
+   u32 upper = ntohl(i->upper);
+   u32 type = ntohl(i->type);
+   u32 port = ntohl(i->port);
+   u32 key = ntohl(i->key);
 
if (dtype == PUBLICATION) {
-   publ = tipc_nametbl_insert_publ(net, ntohl(i->type),
-   ntohl(i->lower),
-   ntohl(i->upper),
-   TIPC_CLUSTER_SCOPE, node,
-   ntohl(i->port), ntohl(i->key));
-   if (publ) {
-   tipc_node_subscribe(net, &publ->binding_node, node);
+   p = tipc_nametbl_insert_publ(net, type, lower, upper,
+TIPC_CLUSTER_SCOPE, node,
+port, key);
+   if (p) {
+   tipc_node_subscribe(net, &p->binding_node, node);
return true;
}
} else if (dtype == WITHDRAWAL) {
-   publ = tipc_nametbl_remove_publ(net, ntohl(i->type),
-   ntohl(i->lower),
-   node, ntohl(i->port),
-   ntohl(i->key));
-   if (publ) {
-   tipc_node_unsubscribe(net, &publ->binding_node, node);
-   kfree_rcu(publ, rcu);
+   p = tipc_nametbl_remove_publ(net, type, lower,
+upper, node, key);
+   if (p) {
+   tipc_node_unsubscribe(net, &p->binding_node, node);
+   kfree_rcu(p, rcu);
return true;
}
+   pr_warn_ratelimited("Failed to remove binding %u,%u from %x\n",
+   type, lower, node);
} else {
pr_warn("Unrecognized name table message received\n");
}
@@ -290,53 +293,6 @@ static bool tipc_update_nametbl(struct net *net, struct 
distr_item *i,
 }
 
 /**
- * tipc_named_add_backlog - add a failed name table update to the backlog
- *
- */
-static void tipc_named_add_backlog(struct net *net, struct distr_item *i,
-  u32 type, u32 node)
-{
-   struct distr_queue_item *e;
-   struct tipc_net *tn = net_generic(net, tipc_net_id);
-   unsigned long now = get_jiffies_64();
-
-   e = kzalloc(sizeof(*e), GFP_ATO

[net-next 2/4] tipc: refactor name table translate function

2018-03-28 Thread Jon Maloy
The tipc_nametbl_translate() function is ugly and hard to follow. This
can be improved somewhat by introducing a stack variable for holding
the publication list to be used and re-ordering the if-clauses for
selection of algorithm.
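
The re-ordered selection can be summarised by the following user-space
sketch. It models only the choice of list, not the locking or the list
rotation, and it assumes the scope check at the top of the function has
already passed. 'enum lookup_list' and select_list() are hypothetical
names used for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Which publication list a lookup would use (illustrative only). */
enum lookup_list { LOOKUP_NONE, LOOKUP_LOCAL, LOOKUP_ALL };

/*
 * Pick the list in the same order as the re-ordered if-clauses:
 * local lookup, then legacy closest-first, then round-robin over all.
 */
static enum lookup_list select_list(uint32_t dnode, uint32_t self,
				    bool legacy, bool local_empty)
{
	/* Explicit lookup on own node: local list or no match at all. */
	if (dnode == self)
		return local_empty ? LOOKUP_NONE : LOOKUP_LOCAL;

	/* Legacy closest-first: prefer a local binding when one exists. */
	if (legacy && !dnode && !local_empty)
		return LOOKUP_LOCAL;

	/* Otherwise round-robin over all publications. */
	return LOOKUP_ALL;
}
```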

Signed-off-by: Jon Maloy <jon.ma...@ericsson.com>
---
 net/tipc/name_table.c | 61 +--
 1 file changed, 25 insertions(+), 36 deletions(-)

diff --git a/net/tipc/name_table.c b/net/tipc/name_table.c
index c309402..9915be0 100644
--- a/net/tipc/name_table.c
+++ b/net/tipc/name_table.c
@@ -399,29 +399,32 @@ struct publication *tipc_nametbl_remove_publ(struct net 
*net, u32 type,
 /**
  * tipc_nametbl_translate - perform service instance to socket translation
  *
- * On entry, 'destnode' is the search domain used during translation.
+ * On entry, 'dnode' is the search domain used during translation.
  *
  * On exit:
- * - if name translation is deferred to another node/cluster/zone,
- *   leaves 'destnode' unchanged (will be non-zero) and returns 0
- * - if name translation is attempted and succeeds, sets 'destnode'
- *   to publication node and returns port reference (will be non-zero)
- * - if name translation is attempted and fails, sets 'destnode' to 0
- *   and returns 0
+ * - if translation is deferred to another node, leave 'dnode' unchanged and
+ *   return 0
+ * - if translation is attempted and succeeds, set 'dnode' to the publishing
+ *   node and return the published (non-zero) port number
+ * - if translation is attempted and fails, set 'dnode' to 0 and return 0
+ *
+ * Note that for legacy users (node configured with Z.C.N address format) the
+ * 'closest-first' lookup algorithm must be maintained, i.e., if dnode is 0
+ * we must look in the local binding list first
  */
-u32 tipc_nametbl_translate(struct net *net, u32 type, u32 instance,
-  u32 *destnode)
+u32 tipc_nametbl_translate(struct net *net, u32 type, u32 instance, u32 *dnode)
 {
struct tipc_net *tn = tipc_net(net);
bool legacy = tn->legacy_addr_format;
u32 self = tipc_own_addr(net);
struct service_range *sr;
struct tipc_service *sc;
+   struct list_head *list;
struct publication *p;
u32 port = 0;
u32 node = 0;
 
-   if (!tipc_in_scope(legacy, *destnode, self))
+   if (!tipc_in_scope(legacy, *dnode, self))
return 0;
 
rcu_read_lock();
@@ -434,43 +437,29 @@ u32 tipc_nametbl_translate(struct net *net, u32 type, u32 
instance,
if (unlikely(!sr))
goto no_match;
 
-   /* Closest-First Algorithm */
-   if (legacy && !*destnode) {
-   if (!list_empty(&sr->local_publ)) {
-   p = list_first_entry(&sr->local_publ,
-struct publication,
-local_publ);
-   list_move_tail(&p->local_publ,
-  &sr->local_publ);
-   } else {
-   p = list_first_entry(&sr->all_publ,
-struct publication,
-all_publ);
-   list_move_tail(&p->all_publ,
-  &sr->all_publ);
-   }
-   }
-
-   /* Round-Robin Algorithm */
-   else if (*destnode == self) {
-   if (list_empty(&sr->local_publ))
+   /* Select lookup algorithm: local, closest-first or round-robin */
+   if (*dnode == self) {
+   list = &sr->local_publ;
+   if (list_empty(list))
goto no_match;
-   p = list_first_entry(&sr->local_publ, struct publication,
-local_publ);
+   p = list_first_entry(list, struct publication, local_publ);
+   list_move_tail(&p->local_publ, &sr->local_publ);
+   } else if (legacy && !*dnode && !list_empty(&sr->local_publ)) {
+   list = &sr->local_publ;
+   p = list_first_entry(list, struct publication, local_publ);
list_move_tail(&p->local_publ, &sr->local_publ);
} else {
-   p = list_first_entry(&sr->all_publ, struct publication,
-all_publ);
+   list = &sr->all_publ;
+   p = list_first_entry(list, struct publication, all_publ);
list_move_tail(&p->all_publ, &sr->all_publ);
}
-
port = p->port;
node = p->node;
 no_match:
spin_unlock_bh(&sc->lock);
 not_found:
rcu_read_unlock();
-   *destnode = node;
+   *dnode = node;
return port;
 }
 
-- 
2.1.4



[net-next 1/4] tipc: replace name table service range array with rb tree

2018-03-28 Thread Jon Maloy
The current design of the binding table has an unnecessarily
memory-consuming and complex data structure. It aggregates the service
range items into an array, which is expanded by a factor of two every
time it becomes too small to hold a new item. Furthermore, the arrays
never shrink when the number of ranges diminishes.

We now replace this array with an RB tree that is holding the range
items as tree nodes, each range directly holding a list of bindings.

This, along with a few name changes, improves the readability and
reduces the volume of the code, while also reducing memory consumption
and hopefully improving the cache hit rate.
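
The memory behaviour of the old array can be modelled with a few lines
of bookkeeping. This sketch tracks only the counters, performs no real
allocation, and assumes an initial capacity of one entry. 'struct
range_array', range_array_add() and range_array_remove() are
hypothetical names that are not taken from the kernel source.

```c
#include <stddef.h>

/* Counters of a grow-only array (illustrative model, no storage). */
struct range_array {
	size_t used;	/* entries currently in use */
	size_t alloc;	/* entries currently allocated */
};

/* Add one entry, doubling the capacity whenever the array is full. */
static void range_array_add(struct range_array *a)
{
	if (a->used == a->alloc)
		a->alloc = a->alloc ? a->alloc * 2 : 1;
	a->used++;
}

/* Remove one entry; the capacity is deliberately never reduced. */
static void range_array_remove(struct range_array *a)
{
	if (a->used)
		a->used--;
}
```

In this model, adding nine entries grows the capacity to 16, and
removing all nine leaves it at 16. That is the kind of waste a tree with
one node per range avoids.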

Signed-off-by: Jon Maloy <jon.ma...@ericsson.com>
---
 net/tipc/core.h   |1 +
 net/tipc/link.c   |2 +-
 net/tipc/name_table.c | 1032 ++---
 net/tipc/name_table.h |2 +-
 net/tipc/node.c   |4 +-
 net/tipc/subscr.h |4 +-
 6 files changed, 477 insertions(+), 568 deletions(-)

diff --git a/net/tipc/core.h b/net/tipc/core.h
index d0f64ca..8020a6c 100644
--- a/net/tipc/core.h
+++ b/net/tipc/core.h
@@ -58,6 +58,7 @@
 #include 
 #include 
 #include 
+#include 
 
 struct tipc_node;
 struct tipc_bearer;
diff --git a/net/tipc/link.c b/net/tipc/link.c
index 1289b4b..8f2a949 100644
--- a/net/tipc/link.c
+++ b/net/tipc/link.c
@@ -1810,7 +1810,7 @@ int tipc_link_bc_nack_rcv(struct tipc_link *l, struct 
sk_buff *skb,
 
 void tipc_link_set_queue_limits(struct tipc_link *l, u32 win)
 {
-   int max_bulk = TIPC_MAX_PUBLICATIONS / (l->mtu / ITEM_SIZE);
+   int max_bulk = TIPC_MAX_PUBL / (l->mtu / ITEM_SIZE);
 
l->window = win;
l->backlog[TIPC_LOW_IMPORTANCE].limit  = max_t(u16, 50, win);
diff --git a/net/tipc/name_table.c b/net/tipc/name_table.c
index 4359605..c309402 100644
--- a/net/tipc/name_table.c
+++ b/net/tipc/name_table.c
@@ -44,52 +44,40 @@
 #include "addr.h"
 #include "node.h"
 #include "group.h"
-#include 
-
-#define TIPC_NAMETBL_SIZE 1024 /* must be a power of 2 */
 
 /**
- * struct name_info - name sequence publication info
- * @node_list: list of publications on own node of this <type,lower,upper>
- * @all_publ: list of all publications of this <type,lower,upper>
+ * struct service_range - container for all bindings of a service range
+ * @lower: service range lower bound
+ * @upper: service range upper bound
+ * @tree_node: member of service range RB tree
+ * @local_publ: list of identical publications made from this node
+ *   Used by closest_first lookup and multicast lookup algorithm
+ * @all_publ: all publications identical to this one, whatever node and scope
+ *   Used by round-robin lookup algorithm
  */
-struct name_info {
-   struct list_head local_publ;
-   struct list_head all_publ;
-};
-
-/**
- * struct sub_seq - container for all published instances of a name sequence
- * @lower: name sequence lower bound
- * @upper: name sequence upper bound
- * @info: pointer to name sequence publication info
- */
-struct sub_seq {
+struct service_range {
u32 lower;
u32 upper;
-   struct name_info *info;
+   struct rb_node tree_node;
+   struct list_head local_publ;
+   struct list_head all_publ;
 };
 
 /**
- * struct name_seq - container for all published instances of a name type
- * @type: 32 bit 'type' value for name sequence
- * @sseq: pointer to dynamically-sized array of sub-sequences of this 'type';
- *sub-sequences are sorted in ascending order
- * @alloc: number of sub-sequences currently in array
- * @first_free: array index of first unused sub-sequence entry
- * @ns_list: links to adjacent name sequences in hash chain
- * @subscriptions: list of subscriptions for this 'type'
- * @lock: spinlock controlling access to publication lists of all sub-sequences
+ * struct tipc_service - container for all published instances of a service 
type
+ * @type: 32 bit 'type' value for service
+ * @ranges: rb tree containing all service ranges for this service
+ * @service_list: links to adjacent name ranges in hash chain
+ * @subscriptions: list of subscriptions for this service type
+ * @lock: spinlock controlling access to pertaining service ranges/publications
  * @rcu: RCU callback head used for deferred freeing
  */
-struct name_seq {
+struct tipc_service {
u32 type;
-   struct sub_seq *sseqs;
-   u32 alloc;
-   u32 first_free;
-   struct hlist_node ns_list;
+   struct rb_root ranges;
+   struct hlist_node service_list;
struct list_head subscriptions;
-   spinlock_t lock;
+   spinlock_t lock; /* Covers service range list */
struct rcu_head rcu;
 };
 
@@ -99,17 +87,16 @@ static int hash(int x)
 }
 
 /**
- * publ_create - create a publication structure
+ * tipc_publ_create - create a publication structure
  */
-static struct publication *publ_create(u32 type, u32 lower, u32 upper,
-   

[net-next 0/4] tipc: slim down name table

2018-03-28 Thread Jon Maloy
We clean up and improve the name binding table:

 - Replace the memory consuming 'sub_sequence/service range' array with
   an RB tree.
 - Introduce support for overlapping service sequences/ranges

Jon Maloy (4):
  tipc: replace name table service range array with rb tree
  tipc: refactor name table translate function
  tipc: permit overlapping service ranges in name table
  tipc: tipc: rename address types in user api

 include/uapi/linux/tipc.h |   32 +-
 net/tipc/core.h   |1 +
 net/tipc/link.c   |2 +-
 net/tipc/name_distr.c |   90 +---
 net/tipc/name_distr.h |1 -
 net/tipc/name_table.c | 1075 -
 net/tipc/name_table.h |   10 +-
 net/tipc/net.c|2 +-
 net/tipc/node.c   |4 +-
 net/tipc/socket.c |4 +-
 net/tipc/subscr.h |4 +-
 11 files changed, 538 insertions(+), 687 deletions(-)

-- 
2.1.4



RE: [PATCH] tipc: avoid possible string overflow

2018-03-28 Thread Jon Maloy


> -Original Message-
> From: Arnd Bergmann [mailto:a...@arndb.de]
> Sent: Wednesday, March 28, 2018 10:02
> To: Jon Maloy <jon.ma...@ericsson.com>; Ying Xue
> <ying@windriver.com>; David S. Miller <da...@davemloft.net>
> Cc: Arnd Bergmann <a...@arndb.de>; Parthasarathy Bhuvaragan
> <parthasarathy.bhuvara...@ericsson.com>; netdev@vger.kernel.org; tipc-
> discuss...@lists.sourceforge.net; linux-ker...@vger.kernel.org
> Subject: [PATCH] tipc: avoid possible string overflow
> 
> gcc points out that the combined length of the fixed-length inputs to
> l->name is larger than the destination buffer size:
> 
> net/tipc/link.c: In function 'tipc_link_create':
> net/tipc/link.c:465:26: error: '%s' directive writing up to 32 bytes into a 
> region
> of size between 26 and 58 [-Werror=format-overflow=]
>   sprintf(l->name, "%s:%s-%s:unknown", self_str, if_name, peer_str);
>   ^~  
> net/tipc/link.c:465:2: note: 'sprintf' output 11 or more bytes (assuming 75)
> into a destination of size 60
>   sprintf(l->name, "%s:%s-%s:unknown", self_str, if_name, peer_str);
> 
> Using snprintf() ensures that the destination is still a nul-terminated
> string in all cases. It's still theoretically possible that the string gets
> truncated though, so this patch should be carefully reviewed to ensure that
> either truncation is impossible in practice, or that we're ok with the
> truncation.

Theoretically, the maximum length of if_name is MAX_BEARER_NAME - 3 = 29
(because if_name is only the part after the ":" in a bearer name, and is
zero-terminated).
The lines just above in the code reveal that the maximum length of self_str
and peer_str is 16.
Taken together, this means that the theoretical maximum length of a link name
becomes:
16 + 1 + 29 + 1 + 16 + 1 + 29 = 93. Since we also need room for a terminating
zero, we need to extend the tipc_link::name array to 96 bytes.
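
The arithmetic above can be sketched in plain user-space C. This is only an
illustration under stated assumptions: the enum names and format_link_name()
are hypothetical and do not exist in net/tipc; the sizes are the ones quoted
in this mail, not values read from the TIPC headers.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Sizes taken from the discussion above; names are hypothetical. */
enum {
	ADDR_STR_MAX  = 16,	/* max strlen of self_str / peer_str */
	IF_NAME_MAX   = 29,	/* max strlen of an interface name */
	/* "<self>:<if>-<peer>:<peer if>" = 16 + 1 + 29 + 1 + 16 + 1 + 29 */
	LINK_NAME_MAX = ADDR_STR_MAX + 1 + IF_NAME_MAX + 1 +
			ADDR_STR_MAX + 1 + IF_NAME_MAX,
	LINK_NAME_BUF = 96	/* 93 characters + NUL, rounded up */
};

/*
 * Format the initial link name. Returns 0 on success, -1 if the result
 * did not fit and was truncated (dst is nul-terminated in both cases).
 */
static int format_link_name(char *dst, size_t dst_len, const char *self_str,
			    const char *if_name, const char *peer_str)
{
	int n;

	assert(dst && dst_len > 0);
	n = snprintf(dst, dst_len, "%s:%s-%s:unknown",
		     self_str, if_name, peer_str);
	return (n < 0 || (size_t)n >= dst_len) ? -1 : 0;
}
```

Checking the snprintf() return value is what turns a silent truncation into
something the caller can detect; whether the kernel code should do that is a
separate question from enlarging the array.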

I'll fix that.

Thank you for reporting this.
///jon

> 
> Fixes: 25b0b9c4e835 ("tipc: handle collisions of 32-bit node address hash
> values")
> Signed-off-by: Arnd Bergmann <a...@arndb.de>
> ---
>  net/tipc/link.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/net/tipc/link.c b/net/tipc/link.c index 
> 1289b4ba404f..c195ba036035
> 100644
> --- a/net/tipc/link.c
> +++ b/net/tipc/link.c
> @@ -462,7 +462,8 @@ bool tipc_link_create(struct net *net, char *if_name,
> int bearer_id,
>   sprintf(peer_str, "%x", peer);
>   }
>   /* Peer i/f name will be completed by reset/activate message */
> - sprintf(l->name, "%s:%s-%s:unknown", self_str, if_name, peer_str);
> + snprintf(l->name, sizeof(l->name), "%s:%s-%s:unknown",
> +  self_str, if_name, peer_str);
> 
>   strcpy(l->if_name, if_name);
>   l->addr = peer;
> --
> 2.9.0



RE: [PATCH net-next] tipc: tipc_node_create() can be static

2018-03-26 Thread Jon Maloy
Acked-by: Jon Maloy <jon.ma...@ericsson.com>
Thanks
///jon

> -Original Message-
> From: Wei Yongjun [mailto:weiyongj...@huawei.com]
> Sent: Monday, March 26, 2018 10:33
> To: Jon Maloy <jon.ma...@ericsson.com>; Ying Xue
> <ying@windriver.com>
> Cc: Wei Yongjun <weiyongj...@huawei.com>; netdev@vger.kernel.org;
> tipc-discuss...@lists.sourceforge.net; kernel-janit...@vger.kernel.org
> Subject: [PATCH net-next] tipc: tipc_node_create() can be static
> 
> Fixes the following sparse warning:
> 
> net/tipc/node.c:336:18: warning:
>  symbol 'tipc_node_create' was not declared. Should it be static?
> 
> Signed-off-by: Wei Yongjun <weiyongj...@huawei.com>
> ---
>  net/tipc/node.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/net/tipc/node.c b/net/tipc/node.c index 4a95c8c..4fb4327 100644
> --- a/net/tipc/node.c
> +++ b/net/tipc/node.c
> @@ -333,8 +333,8 @@ static void tipc_node_write_unlock(struct tipc_node
> *n)
>   }
>  }
> 
> -struct tipc_node *tipc_node_create(struct net *net, u32 addr,
> -u8 *peer_id, u16 capabilities)
> +static struct tipc_node *tipc_node_create(struct net *net, u32 addr,
> +   u8 *peer_id, u16 capabilities)
>  {
>   struct tipc_net *tn = net_generic(net, tipc_net_id);
>   struct tipc_node *n, *temp_node;



RE: [PATCH net-next] tipc: fix error handling in tipc_udp_enable()

2018-03-26 Thread Jon Maloy
Acked-by: Jon Maloy <jon.ma...@ericsson.com>
Thank you, Wei.

> -Original Message-
> From: Wei Yongjun [mailto:weiyongj...@huawei.com]
> Sent: Monday, March 26, 2018 10:33
> To: Jon Maloy <jon.ma...@ericsson.com>; Ying Xue
> <ying@windriver.com>
> Cc: Wei Yongjun <weiyongj...@huawei.com>; netdev@vger.kernel.org;
> tipc-discuss...@lists.sourceforge.net; kernel-janit...@vger.kernel.org
> Subject: [PATCH net-next] tipc: fix error handling in tipc_udp_enable()
> 
> Release the allocated resources before returning from the error handling path
> in tipc_udp_enable(); otherwise memory is leaked.
> 
> Fixes: 52dfae5c85a4 ("tipc: obtain node identity from interface by default")
> Signed-off-by: Wei Yongjun <weiyongj...@huawei.com>
> ---
>  net/tipc/udp_media.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/net/tipc/udp_media.c b/net/tipc/udp_media.c index
> 2c13b18..e7d91f5 100644
> --- a/net/tipc/udp_media.c
> +++ b/net/tipc/udp_media.c
> @@ -687,7 +687,8 @@ static int tipc_udp_enable(struct net *net, struct
> tipc_bearer *b,
>   }
>   if (!tipc_own_id(net)) {
>   pr_warn("Failed to set node id, please configure
> manually\n");
> - return -EINVAL;
> + err = -EINVAL;
> + goto err;
>   }
> 
>   b->bcast_addr.media_id = TIPC_MEDIA_TYPE_UDP;
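
The pattern used by the fix quoted above (set the error code, then jump to a
common cleanup label instead of returning directly) can be sketched in
user-space C as below. This is a minimal sketch, not the kernel function:
bearer_enable(), struct udp_bearer and the live_allocs counter are
hypothetical stand-ins, with calloc()/free() in place of kzalloc()/kfree().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

struct udp_bearer {
	int placeholder;	/* stand-in for the real bearer state */
};

static int live_allocs;	/* counts allocations not yet released by us */

/* Hypothetical enable function with a single cleanup path. */
static int bearer_enable(bool have_node_id, struct udp_bearer **out)
{
	struct udp_bearer *ub;
	int err;

	assert(out != NULL);
	ub = calloc(1, sizeof(*ub));
	if (!ub)
		return -ENOMEM;
	live_allocs++;

	if (!have_node_id) {
		/* A bare "return -EINVAL" here would leak ub. */
		err = -EINVAL;
		goto err;
	}

	*out = ub;	/* ownership passes to the caller */
	return 0;
err:
	free(ub);
	live_allocs--;
	return err;
}
```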



RE: [RFC PATCH net-next] tipc: tipc_disc_addr_trial_msg() can be static

2018-03-24 Thread Jon Maloy
Acked-by: Jon Maloy <jon.ma...@ericsson.com>

Thanks, Fengguang

> -Original Message-
> From: netdev-ow...@vger.kernel.org [mailto:netdev-
> ow...@vger.kernel.org] On Behalf Of kbuild test robot
> Sent: Friday, March 23, 2018 15:48
> To: Jon Maloy <jon.ma...@ericsson.com>
> Cc: kbuild-...@01.org; netdev@vger.kernel.org; Ying Xue
> <ying@windriver.com>; tipc-discuss...@lists.sourceforge.net; linux-
> ker...@vger.kernel.org
> Subject: [RFC PATCH net-next] tipc: tipc_disc_addr_trial_msg() can be static
> 
> 
> Fixes: 25b0b9c4e835 ("tipc: handle collisions of 32-bit node address hash
> values")
> Signed-off-by: Fengguang Wu <fengguang...@intel.com>
> ---
>  discover.c |   14 +++---
>  1 file changed, 7 insertions(+), 7 deletions(-)
> 
> diff --git a/net/tipc/discover.c b/net/tipc/discover.c index e765573..9f666e0
> 100644
> --- a/net/tipc/discover.c
> +++ b/net/tipc/discover.c
> @@ -134,13 +134,13 @@ static void disc_dupl_alert(struct tipc_bearer *b,
> u32 node_addr,
> 
>  /* tipc_disc_addr_trial(): - handle an address uniqueness trial from peer
>   */
> -bool tipc_disc_addr_trial_msg(struct tipc_discoverer *d,
> -   struct tipc_media_addr *maddr,
> -   struct tipc_bearer *b,
> -   u32 dst, u32 src,
> -   u32 sugg_addr,
> -   u8 *peer_id,
> -   int mtyp)
> +static bool tipc_disc_addr_trial_msg(struct tipc_discoverer *d,
> +  struct tipc_media_addr *maddr,
> +  struct tipc_bearer *b,
> +  u32 dst, u32 src,
> +  u32 sugg_addr,
> +  u8 *peer_id,
> +  int mtyp)
>  {
>   struct net *net = d->net;
>   struct tipc_net *tn = tipc_net(net);


RE: [iproute2 1/1] ss: Add support for TIPC socket diag in ss tool

2018-03-23 Thread Jon Maloy
Hi Mohan,
I remember you mentioned the possibility of adding this functionality to the
tipc tool, too.
Would it be easy to add a 'tipc ss' command that just calls 'ss' with the same
parameters, i.e., with no duplication of functionality?

///jon


> -Original Message-
> From: Mohan Krishna Ghanta Krishnamurthy
> Sent: Friday, March 23, 2018 10:01
> To: tipc-discuss...@lists.sourceforge.net; Jon Maloy
> <jon.ma...@ericsson.com>; ma...@donjonn.com;
> ying@windriver.com; Mohan Krishna Ghanta Krishnamurthy
> <mohan.krishna.ghanta.krishnamur...@ericsson.com>;
> netdev@vger.kernel.org; step...@networkplumber.org
> Cc: Parthasarathy Bhuvaragan <parthasarathy.bhuvara...@gmail.com>
> Subject: [iproute2 1/1] ss: Add support for TIPC socket diag in ss tool
> 
> For iproute 4.x
> Allow TIPC socket statistics to be dumped with --tipc and tipc specific info
> with --tipcinfo.
> 
> Acked-by: Jon Maloy <jon.ma...@ericsson.com>
> Signed-off-by: GhantaKrishnamurthy MohanKrishna
> <mohan.krishna.ghanta.krishnamur...@ericsson.com>
> Signed-off-by: Parthasarathy Bhuvaragan
> <parthasarathy.bhuvara...@gmail.com>
> ---
>  misc/ss.c | 166
> ++
> +++-
>  1 file changed, 164 insertions(+), 2 deletions(-)
> 
> diff --git a/misc/ss.c b/misc/ss.c
> index e047f9c04582..812f45717af9 100644
> --- a/misc/ss.c
> +++ b/misc/ss.c
> @@ -45,6 +45,10 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
> +#include 
> +#include 
> 
>  #define MAGIC_SEQ 123456
>  #define BUF_CHUNK (1024 * 1024)
> @@ -104,6 +108,7 @@ int show_sock_ctx;
>  int show_header = 1;
>  int follow_events;
>  int sctp_ino;
> +int show_tipcinfo;
> 
>  enum col_id {
>   COL_NETID,
> @@ -191,6 +196,7 @@ enum {
>   SCTP_DB,
>   VSOCK_ST_DB,
>   VSOCK_DG_DB,
> + TIPC_DB,
>   MAX_DB
>  };
> 
> @@ -230,6 +236,7 @@ enum {
> 
>  #define SS_ALL ((1 << SS_MAX) - 1)
>  #define SS_CONN (SS_ALL &
> ~((1<<SS_LISTEN)|(1<<SS_CLOSE)|(1<<SS_TIME_WAIT)|(1<<SS_SYN_RECV)
> ))
> +#define TIPC_SS_CONN
> ((1<<SS_ESTABLISHED)|(1<<SS_LISTEN)|(1<<SS_CLOSE))
> 
>  #include "ssfilter.h"
> 
> @@ -297,6 +304,10 @@ static const struct filter default_dbs[MAX_DB] = {
>   .states   = SS_CONN,
>   .families = FAMILY_MASK(AF_VSOCK),
>   },
> + [TIPC_DB] = {
> + .states   = TIPC_SS_CONN,
> + .families = FAMILY_MASK(AF_TIPC),
> + },
>  };
> 
>  static const struct filter default_afs[AF_MAX] = { @@ -324,6 +335,10 @@
> static const struct filter default_afs[AF_MAX] = {
>   .dbs= VSOCK_DBM,
>   .states = SS_CONN,
>   },
> + [AF_TIPC] = {
> + .dbs= (1 << TIPC_DB),
> + .states = TIPC_SS_CONN,
> + },
>  };
> 
>  static int do_default = 1;
> @@ -364,6 +379,7 @@ static void filter_default_dbs(struct filter *f)
>   filter_db_set(f, SCTP_DB);
>   filter_db_set(f, VSOCK_ST_DB);
>   filter_db_set(f, VSOCK_DG_DB);
> + filter_db_set(f, TIPC_DB);
>  }
> 
>  static void filter_states_set(struct filter *f, int states) @@ -748,6 +764,14
> @@ static const char *sctp_sstate_name[] = {
>   [SCTP_STATE_SHUTDOWN_ACK_SENT] = "ACK_SENT",  };
> 
> +static const char * const stype_nameg[] = {
> + "UNKNOWN",
> + [SOCK_STREAM] = "STREAM",
> + [SOCK_DGRAM] = "DGRAM",
> + [SOCK_RDM] = "RDM",
> + [SOCK_SEQPACKET] = "SEQPACKET",
> +};
> +
>  struct sockstat {
>   struct sockstat*next;
>   unsigned inttype;
> @@ -888,6 +912,22 @@ static const char *vsock_netid_name(int type)
>   }
>  }
> 
> +static const char *tipc_netid_name(int type) {
> + switch (type) {
> + case SOCK_STREAM:
> + return "ti_st";
> + case SOCK_DGRAM:
> + return "ti_dg";
> + case SOCK_RDM:
> + return "ti_rd";
> + case SOCK_SEQPACKET:
> + return "ti_sq";
> + default:
> + return "???";
> + }
> +}
> +
>  /* Allocate and initialize a new buffer chunk */  static struct buf_chunk
> *buf_chunk_new(void)  { @@ -1274,6 +1314,9 @@ static void
> sock_state_print(struct sockstat *s)
>   case AF_NETLINK:
>   sock_name = "nl";
>   break;
> + case AF_TIPC:
> + sock_name = tipc_netid_name(s->type);
> + break;
> 

[net-next 1/8] tipc: refactor function tipc_enable_bearer()

2018-03-22 Thread Jon Maloy
As a preparation for the next commits we try to reduce the footprint of
the function tipc_enable_bearer(), while hopefully making it simpler to
follow.

Acked-by: Ying Xue <ying@windriver.com>
Signed-off-by: Jon Maloy <jon.ma...@ericsson.com>
---
 net/tipc/bearer.c | 136 --
 1 file changed, 70 insertions(+), 66 deletions(-)

diff --git a/net/tipc/bearer.c b/net/tipc/bearer.c
index f3d2e83..e18cb27 100644
--- a/net/tipc/bearer.c
+++ b/net/tipc/bearer.c
@@ -230,88 +230,90 @@ void tipc_bearer_remove_dest(struct net *net, u32 
bearer_id, u32 dest)
  * tipc_enable_bearer - enable bearer with the given name
  */
 static int tipc_enable_bearer(struct net *net, const char *name,
- u32 disc_domain, u32 priority,
+ u32 disc_domain, u32 prio,
  struct nlattr *attr[])
 {
-   struct tipc_net *tn = net_generic(net, tipc_net_id);
+   struct tipc_net *tn = tipc_net(net);
+   struct tipc_bearer_names b_names;
+   u32 self = tipc_own_addr(net);
+   int with_this_prio = 1;
struct tipc_bearer *b;
struct tipc_media *m;
-   struct tipc_bearer_names b_names;
struct sk_buff *skb;
char addr_string[16];
-   u32 bearer_id;
-   u32 with_this_prio;
-   u32 i;
+   int bearer_id = 0;
int res = -EINVAL;
+   char *errstr = "";
 
-   if (!tn->own_addr) {
-   pr_warn("Bearer <%s> rejected, not supported in standalone 
mode\n",
-   name);
-   return -ENOPROTOOPT;
+   if (!self) {
+   errstr = "not supported in standalone mode";
+   res = -ENOPROTOOPT;
+   goto rejected;
}
+
if (!bearer_name_validate(name, &b_names)) {
-   pr_warn("Bearer <%s> rejected, illegal name\n", name);
-   return -EINVAL;
+   errstr = "illegal name";
+   goto rejected;
}
-   if (tipc_addr_domain_valid(disc_domain) &&
-   (disc_domain != tn->own_addr)) {
-   if (tipc_in_scope(disc_domain, tn->own_addr)) {
-   disc_domain = tn->own_addr & TIPC_ZONE_CLUSTER_MASK;
-   res = 0;   /* accept any node in own cluster */
-   } else if (in_own_cluster_exact(net, disc_domain))
-   res = 0;   /* accept specified node in own cluster */
+
+   if (tipc_addr_domain_valid(disc_domain) && disc_domain != self) {
+   if (tipc_in_scope(disc_domain, self)) {
+   /* Accept any node in own cluster */
+   disc_domain = self & TIPC_ZONE_CLUSTER_MASK;
+   res = 0;
+   } else if (in_own_cluster_exact(net, disc_domain)) {
+   /* Accept specified node in own cluster */
+   res = 0;
+   }
}
if (res) {
-   pr_warn("Bearer <%s> rejected, illegal discovery domain\n",
-   name);
-   return -EINVAL;
+   errstr = "illegal discovery domain";
+   goto rejected;
}
-   if ((priority > TIPC_MAX_LINK_PRI) &&
-   (priority != TIPC_MEDIA_LINK_PRI)) {
-   pr_warn("Bearer <%s> rejected, illegal priority\n", name);
-   return -EINVAL;
+
+   if (prio > TIPC_MAX_LINK_PRI && prio != TIPC_MEDIA_LINK_PRI) {
+   errstr = "illegal priority";
+   goto rejected;
}
 
m = tipc_media_find(b_names.media_name);
if (!m) {
-   pr_warn("Bearer <%s> rejected, media <%s> not registered\n",
-   name, b_names.media_name);
-   return -EINVAL;
+   errstr = "media not registered";
+   goto rejected;
}
 
-   if (priority == TIPC_MEDIA_LINK_PRI)
-   priority = m->priority;
+   if (prio == TIPC_MEDIA_LINK_PRI)
+   prio = m->priority;
 
-restart:
-   bearer_id = MAX_BEARERS;
-   with_this_prio = 1;
-   for (i = MAX_BEARERS; i-- != 0; ) {
-   b = rtnl_dereference(tn->bearer_list[i]);
-   if (!b) {
-   bearer_id = i;
-   continue;
-   }
+   /* Check new bearer vs existing ones and find free bearer id if any */
+   while (bearer_id < MAX_BEARERS) {
+   b = rtnl_dereference(tn->bearer_list[bearer_id]);
+   if (!b)
+   break;
if (!strcmp(name, b->name)) {
-   pr_warn("Bearer <%s> rejected, already enabled\n",
-   

[net-next 7/8] tipc: handle collisions of 32-bit node address hash values

2018-03-22 Thread Jon Maloy
When a 32-bit node address is generated from a 128-bit identifier,
there is a risk of collisions which must be discovered and handled.
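
To illustrate why such collisions can happen: the addr.c hunk further down
derives the 32-bit value by XOR-ing the four 32-bit words of the identity.
The sketch below mimics that fold in user-space C; fold_node_id() and
ids_collide() are hypothetical helper names used only for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NODE_ID_LEN 16

/* Fold a 128-bit id into 32 bits by XOR-ing its four 32-bit words. */
static uint32_t fold_node_id(const uint8_t id[NODE_ID_LEN])
{
	uint32_t w[4];

	assert(sizeof(w) == NODE_ID_LEN);
	memcpy(w, id, sizeof(w));	/* avoids unaligned access */
	return w[0] ^ w[1] ^ w[2] ^ w[3];
}

/* Returns 1 if two different ids fold to the same 32-bit value. */
static int ids_collide(const uint8_t a[NODE_ID_LEN],
		       const uint8_t b[NODE_ID_LEN])
{
	return memcmp(a, b, NODE_ID_LEN) != 0 &&
	       fold_node_id(a) == fold_node_id(b);
}
```

For example, two ids that hold the same non-zero byte pattern in different
32-bit words fold to the same address, which is exactly the case the trial
period has to catch.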

We do this as follows:
- We don't apply the generated address immediately to the node, but do
  instead initiate a 1 sec trial period to allow other cluster members
  to discover and handle such collisions.

- During the trial period the node periodically sends out a new type
  of message, DSC_TRIAL_MSG, using broadcast or emulated broadcast,
  to all the other nodes in the cluster.

- When a node receives such a message, it must check that the
  presented 32-bit identifier either is unused, or was used by the very
  same peer in a previous session. In both cases it accepts the request
  by not responding to it.

- If it finds that the same node has been up before using a different
  address, it responds with a DSC_TRIAL_FAIL_MSG containing that
  address.

- If it finds that the address has already been taken by some other
  node, it generates a new, unused address and returns it to the
  requester.

- During the trial period the requesting node must always be prepared
  to accept a failure message, i.e., a message where a peer suggests a
  different (or equal) address to the one tried. In those cases it
  must apply the suggested value as trial address and restart the trial
  period.

This algorithm ensures that in the vast majority of cases a node will
have the same address before and after a reboot. If a legacy user
configures the address explicitly, there will be no trial period and no
trial messages, so this protocol addition is completely backwards
compatible.
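
The receiving side's decision rules listed above can be sketched roughly as
follows. This is a simplified user-space illustration, not the code in this
patch: struct peer, check_trial(), the TRIAL_* verdicts and the linear probe
used to find a free address are all assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NODE_ID_LEN 16

struct peer {
	uint8_t id[NODE_ID_LEN];	/* 128-bit node identity */
	uint32_t addr;			/* 32-bit address it was last seen with */
};

enum trial_verdict {
	TRIAL_ACCEPT,	/* stay silent, i.e. accept the tried address */
	TRIAL_SUGGEST	/* answer with a failure message carrying *suggested */
};

static bool addr_in_use(const struct peer *tbl, size_t n, uint32_t addr)
{
	for (size_t i = 0; i < n; i++)
		if (tbl[i].addr == addr)
			return true;
	return false;
}

static enum trial_verdict check_trial(const struct peer *tbl, size_t n,
				      const uint8_t *id, uint32_t addr,
				      uint32_t *suggested)
{
	assert(suggested != NULL);

	/* Known peer: it should keep the address it used before. */
	for (size_t i = 0; i < n; i++) {
		if (memcmp(tbl[i].id, id, NODE_ID_LEN))
			continue;
		if (tbl[i].addr == addr)
			return TRIAL_ACCEPT;
		*suggested = tbl[i].addr;
		return TRIAL_SUGGEST;
	}

	/* Unknown peer and unused address: nothing to object to. */
	if (!addr_in_use(tbl, n, addr))
		return TRIAL_ACCEPT;

	/* Address taken by another node: propose a free one (linear probe). */
	do {
		addr++;
	} while (addr == 0 || addr_in_use(tbl, n, addr));
	*suggested = addr;
	return TRIAL_SUGGEST;
}
```

The requesting node would then adopt any suggested value as its new trial
address and restart the trial period, as described in the last bullet.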

Acked-by: Ying Xue <ying@windriver.com>
Signed-off-by: Jon Maloy <jon.ma...@ericsson.com>
---
 net/tipc/addr.c |   3 +-
 net/tipc/bearer.c   |   3 +-
 net/tipc/core.c |   2 +
 net/tipc/core.h |   2 +
 net/tipc/discover.c | 126 
 net/tipc/link.c |  26 +++
 net/tipc/link.h |   4 +-
 net/tipc/msg.h  |  23 +-
 net/tipc/net.c  |   4 +-
 net/tipc/node.c |  85 ---
 net/tipc/node.h |   3 +-
 11 files changed, 236 insertions(+), 45 deletions(-)

diff --git a/net/tipc/addr.c b/net/tipc/addr.c
index 4841e98..b88d48d 100644
--- a/net/tipc/addr.c
+++ b/net/tipc/addr.c
@@ -59,7 +59,7 @@ void tipc_set_node_id(struct net *net, u8 *id)
 
memcpy(tn->node_id, id, NODE_ID_LEN);
tipc_nodeid2string(tn->node_id_string, id);
-   tn->node_addr = tmp[0] ^ tmp[1] ^ tmp[2] ^ tmp[3];
+   tn->trial_addr = tmp[0] ^ tmp[1] ^ tmp[2] ^ tmp[3];
pr_info("Own node identity %s, cluster identity %u\n",
tipc_own_id_string(net), tn->net_id);
 }
@@ -74,6 +74,7 @@ void tipc_set_node_addr(struct net *net, u32 addr)
sprintf(node_id, "%x", addr);
tipc_set_node_id(net, node_id);
}
+   tn->trial_addr = addr;
pr_info("32-bit node address hash set to %x\n", addr);
 }
 
diff --git a/net/tipc/bearer.c b/net/tipc/bearer.c
index a71f318..ae5b44c 100644
--- a/net/tipc/bearer.c
+++ b/net/tipc/bearer.c
@@ -235,7 +235,6 @@ static int tipc_enable_bearer(struct net *net, const char 
*name,
 {
struct tipc_net *tn = tipc_net(net);
struct tipc_bearer_names b_names;
-   u32 self = tipc_own_addr(net);
int with_this_prio = 1;
struct tipc_bearer *b;
struct tipc_media *m;
@@ -244,7 +243,7 @@ static int tipc_enable_bearer(struct net *net, const char 
*name,
int res = -EINVAL;
char *errstr = "";
 
-   if (!self) {
+   if (!tipc_own_id(net)) {
errstr = "not supported in standalone mode";
res = -ENOPROTOOPT;
goto rejected;
diff --git a/net/tipc/core.c b/net/tipc/core.c
index e92fed4..52dfc51 100644
--- a/net/tipc/core.c
+++ b/net/tipc/core.c
@@ -57,6 +57,8 @@ static int __net_init tipc_init_net(struct net *net)
 
tn->net_id = 4711;
tn->node_addr = 0;
+   tn->trial_addr = 0;
+   tn->addr_trial_end = 0;
memset(tn->node_id, 0, sizeof(tn->node_id));
memset(tn->node_id_string, 0, sizeof(tn->node_id_string));
tn->mon_threshold = TIPC_DEF_MON_THRESHOLD;
diff --git a/net/tipc/core.h b/net/tipc/core.h
index eabad41..d0f64ca 100644
--- a/net/tipc/core.h
+++ b/net/tipc/core.h
@@ -82,6 +82,8 @@ extern int sysctl_tipc_named_timeout __read_mostly;
 struct tipc_net {
u8  node_id[NODE_ID_LEN];
u32 node_addr;
+   u32 trial_addr;
+   unsigned long addr_trial_end;
char node_id_string[NODE_ID_STR_LEN];
int net_id;
int random;
diff --git a/net/tipc/discover.c b/net/tipc/discover.c
index b4c4cd1..e765573 100644
--- a/net/tipc/discover.c
+++ b/net/tipc/discover.c
@@ -1,7 +1,7 @@
 /*
  * net/tipc/discover.c
  *
- * Copyright (c) 2003-2006, 2014-2015, Ericsso

[net-next 8/8] tipc: obtain node identity from interface by default

2018-03-22 Thread Jon Maloy
Selecting and explicitly configuring a TIPC node identity may be
unwanted in some cases.

In this commit we introduce a default setting if the identity has not
been set at the moment the first bearer is enabled. We do this by
using a raw copy of a unique identifier from the used interface: MAC
address in the case of an L2 bearer, IPv4/IPv6 address in the case
of a UDP bearer.
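
The "raw copy" for the L2 case can be pictured as below: the hardware address
is copied into the start of a zero-filled 16-byte identity. This is a
user-space sketch only; node_id_from_hwaddr() is a hypothetical helper, and
NODE_ID_LEN is assumed to be 16 as in the rest of this series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NODE_ID_LEN 16

/*
 * Build a node identity from an interface hardware address.
 * Returns 0 on success, -1 if the address does not fit in the identity.
 */
static int node_id_from_hwaddr(uint8_t node_id[NODE_ID_LEN],
			       const uint8_t *hwaddr, size_t hwaddr_len)
{
	assert(node_id && hwaddr);
	if (hwaddr_len > NODE_ID_LEN)
		return -1;
	memset(node_id, 0, NODE_ID_LEN);	/* unused tail stays zero */
	memcpy(node_id, hwaddr, hwaddr_len);
	return 0;
}
```

A 6-byte Ethernet MAC thus fills the first six bytes and leaves the remaining
ten at zero, while a 16-byte IPv6 address fills the identity completely.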

Acked-by: Ying Xue <ying@windriver.com>
Signed-off-by: Jon Maloy <jon.ma...@ericsson.com>
---
 net/tipc/bearer.c| 24 +++-
 net/tipc/net.h   |  1 +
 net/tipc/udp_media.c | 13 +
 3 files changed, 29 insertions(+), 9 deletions(-)

diff --git a/net/tipc/bearer.c b/net/tipc/bearer.c
index ae5b44c..f7d47c8 100644
--- a/net/tipc/bearer.c
+++ b/net/tipc/bearer.c
@@ -243,12 +243,6 @@ static int tipc_enable_bearer(struct net *net, const char 
*name,
int res = -EINVAL;
char *errstr = "";
 
-   if (!tipc_own_id(net)) {
-   errstr = "not supported in standalone mode";
-   res = -ENOPROTOOPT;
-   goto rejected;
-   }
-
if (!bearer_name_validate(name, &b_names)) {
errstr = "illegal name";
goto rejected;
@@ -381,11 +375,13 @@ static void bearer_disable(struct net *net, struct 
tipc_bearer *b)
 int tipc_enable_l2_media(struct net *net, struct tipc_bearer *b,
 struct nlattr *attr[])
 {
+   char *dev_name = strchr((const char *)b->name, ':') + 1;
+   int hwaddr_len = b->media->hwaddr_len;
+   u8 node_id[NODE_ID_LEN] = {0,};
struct net_device *dev;
-   char *driver_name = strchr((const char *)b->name, ':') + 1;
 
/* Find device with specified name */
-   dev = dev_get_by_name(net, driver_name);
+   dev = dev_get_by_name(net, dev_name);
if (!dev)
return -ENODEV;
if (tipc_mtu_bad(dev, 0)) {
@@ -393,6 +389,16 @@ int tipc_enable_l2_media(struct net *net, struct 
tipc_bearer *b,
return -EINVAL;
}
 
+   /* Autoconfigure own node identity if needed */
+   if (!tipc_own_id(net) && hwaddr_len <= NODE_ID_LEN) {
+   memcpy(node_id, dev->dev_addr, hwaddr_len);
+   tipc_net_init(net, node_id, 0);
+   }
+   if (!tipc_own_id(net)) {
+   pr_warn("Failed to obtain node identity\n");
+   return -EINVAL;
+   }
+
/* Associate TIPC bearer with L2 bearer */
rcu_assign_pointer(b->media_ptr, dev);
b->pt.dev = dev;
@@ -400,7 +406,7 @@ int tipc_enable_l2_media(struct net *net, struct 
tipc_bearer *b,
b->pt.func = tipc_l2_rcv_msg;
dev_add_pack(&b->pt);
memset(&b->bcast_addr, 0, sizeof(b->bcast_addr));
-   memcpy(b->bcast_addr.value, dev->broadcast, b->media->hwaddr_len);
+   memcpy(b->bcast_addr.value, dev->broadcast, hwaddr_len);
b->bcast_addr.media_id = b->media->type_id;
b->bcast_addr.broadcast = TIPC_BROADCAST_SUPPORT;
b->mtu = dev->mtu;
diff --git a/net/tipc/net.h b/net/tipc/net.h
index 08efa60..09ad02b 100644
--- a/net/tipc/net.h
+++ b/net/tipc/net.h
@@ -41,6 +41,7 @@
 
 extern const struct nla_policy tipc_nl_net_policy[];
 
+int tipc_net_init(struct net *net, u8 *node_id, u32 addr);
 void tipc_net_finalize(struct net *net, u32 addr);
 void tipc_net_stop(struct net *net);
 int tipc_nl_net_dump(struct sk_buff *skb, struct netlink_callback *cb);
diff --git a/net/tipc/udp_media.c b/net/tipc/udp_media.c
index 3deabca..2c13b18 100644
--- a/net/tipc/udp_media.c
+++ b/net/tipc/udp_media.c
@@ -47,6 +47,8 @@
 #include 
 #include 
 #include "core.h"
+#include "addr.h"
+#include "net.h"
 #include "bearer.h"
 #include "netlink.h"
 #include "msg.h"
@@ -647,6 +649,7 @@ static int tipc_udp_enable(struct net *net, struct 
tipc_bearer *b,
struct udp_port_cfg udp_conf = {0};
struct udp_tunnel_sock_cfg tuncfg = {NULL};
struct nlattr *opts[TIPC_NLA_UDP_MAX + 1];
+   u8 node_id[NODE_ID_LEN] = {0,};
 
ub = kzalloc(sizeof(*ub), GFP_ATOMIC);
if (!ub)
@@ -677,6 +680,16 @@ static int tipc_udp_enable(struct net *net, struct 
tipc_bearer *b,
if (err)
goto err;
 
+   /* Autoconfigure own node identity if needed */
+   if (!tipc_own_id(net)) {
+   memcpy(node_id, local.ipv6.in6_u.u6_addr8, 16);
+   tipc_net_init(net, node_id, 0);
+   }
+   if (!tipc_own_id(net)) {
+   pr_warn("Failed to set node id, please configure manually\n");
+   return -EINVAL;
+   }
+
b->bcast_addr.media_id = TIPC_MEDIA_TYPE_UDP;
b->bcast_addr.broadcast = TIPC_BROADCAST_SUPPORT;
rcu_assign_pointer(b->media_ptr, ub);
-- 
2.1.4



[net-next 5/8] tipc: remove direct accesses to own_addr field in struct tipc_net

2018-03-22 Thread Jon Maloy
In preparation for changing the addressing structure of TIPC we replace
all direct accesses to the tipc_net::own_addr field with the function
dedicated for this, tipc_own_addr().

There are no changes to program logic in this commit.

Acked-by: Ying Xue <ying@windriver.com>
Signed-off-by: Jon Maloy <jon.ma...@ericsson.com>
---
 net/tipc/addr.c   |  6 +++---
 net/tipc/addr.h   |  2 +-
 net/tipc/discover.c   |  3 ++-
 net/tipc/link.c   |  9 -
 net/tipc/name_distr.c | 11 ++-
 net/tipc/name_table.c |  6 +++---
 net/tipc/net.c| 31 +--
 net/tipc/socket.c | 23 ++-
 8 files changed, 42 insertions(+), 49 deletions(-)

diff --git a/net/tipc/addr.c b/net/tipc/addr.c
index 1998799..6e06b4d 100644
--- a/net/tipc/addr.c
+++ b/net/tipc/addr.c
@@ -43,9 +43,7 @@
  */
 int in_own_node(struct net *net, u32 addr)
 {
-   struct tipc_net *tn = net_generic(net, tipc_net_id);
-
-   return (addr == tn->own_addr) || !addr;
+   return addr == tipc_own_addr(net) || !addr;
 }
 
 bool tipc_in_scope(bool legacy_format, u32 domain, u32 addr)
@@ -56,6 +54,8 @@ bool tipc_in_scope(bool legacy_format, u32 domain, u32 addr)
return false;
if (domain == tipc_cluster_mask(addr)) /* domain  */
return true;
+   if (domain == (addr & TIPC_ZONE_CLUSTER_MASK)) /* domain  */
+   return true;
if (domain == (addr & TIPC_ZONE_MASK)) /* domain  */
return true;
return false;
diff --git a/net/tipc/addr.h b/net/tipc/addr.h
index 97bdc0e..6b48f0d 100644
--- a/net/tipc/addr.h
+++ b/net/tipc/addr.h
@@ -45,7 +45,7 @@
 
 static inline u32 tipc_own_addr(struct net *net)
 {
-   struct tipc_net *tn = net_generic(net, tipc_net_id);
+   struct tipc_net *tn = tipc_net(net);
 
return tn->own_addr;
 }
diff --git a/net/tipc/discover.c b/net/tipc/discover.c
index 82556e1..94d5240 100644
--- a/net/tipc/discover.c
+++ b/net/tipc/discover.c
@@ -81,11 +81,12 @@ static void tipc_disc_init_msg(struct net *net, struct 
sk_buff *skb,
   u32 mtyp, struct tipc_bearer *b)
 {
struct tipc_net *tn = tipc_net(net);
+   u32 self = tipc_own_addr(net);
u32 dest_domain = b->domain;
struct tipc_msg *hdr;
 
hdr = buf_msg(skb);
-   tipc_msg_init(tn->own_addr, hdr, LINK_CONFIG, mtyp,
+   tipc_msg_init(self, hdr, LINK_CONFIG, mtyp,
  MAX_H_SIZE, dest_domain);
msg_set_non_seq(hdr, 1);
msg_set_node_sig(hdr, tn->random);
diff --git a/net/tipc/link.c b/net/tipc/link.c
index 86fde00..4aa56e3 100644
--- a/net/tipc/link.c
+++ b/net/tipc/link.c
@@ -1936,11 +1936,11 @@ static int __tipc_nl_add_stats(struct sk_buff *skb, 
struct tipc_stats *s)
 int __tipc_nl_add_link(struct net *net, struct tipc_nl_msg *msg,
   struct tipc_link *link, int nlflags)
 {
-   int err;
-   void *hdr;
+   u32 self = tipc_own_addr(net);
struct nlattr *attrs;
struct nlattr *prop;
-   struct tipc_net *tn = net_generic(net, tipc_net_id);
+   void *hdr;
+   int err;
 
hdr = genlmsg_put(msg->skb, msg->portid, msg->seq, &tipc_genl_family,
  nlflags, TIPC_NL_LINK_GET);
@@ -1953,8 +1953,7 @@ int __tipc_nl_add_link(struct net *net, struct 
tipc_nl_msg *msg,
 
if (nla_put_string(msg->skb, TIPC_NLA_LINK_NAME, link->name))
goto attr_msg_full;
-   if (nla_put_u32(msg->skb, TIPC_NLA_LINK_DEST,
-   tipc_cluster_mask(tn->own_addr)))
+   if (nla_put_u32(msg->skb, TIPC_NLA_LINK_DEST, tipc_cluster_mask(self)))
goto attr_msg_full;
if (nla_put_u32(msg->skb, TIPC_NLA_LINK_MTU, link->mtu))
goto attr_msg_full;
diff --git a/net/tipc/name_distr.c b/net/tipc/name_distr.c
index 28d095a..7e571f4 100644
--- a/net/tipc/name_distr.c
+++ b/net/tipc/name_distr.c
@@ -68,14 +68,14 @@ static void publ_to_item(struct distr_item *i, struct 
publication *p)
 static struct sk_buff *named_prepare_buf(struct net *net, u32 type, u32 size,
 u32 dest)
 {
-   struct tipc_net *tn = net_generic(net, tipc_net_id);
struct sk_buff *buf = tipc_buf_acquire(INT_H_SIZE + size, GFP_ATOMIC);
+   u32 self = tipc_own_addr(net);
struct tipc_msg *msg;
 
if (buf != NULL) {
msg = buf_msg(buf);
-   tipc_msg_init(tn->own_addr, msg, NAME_DISTRIBUTOR, type,
- INT_H_SIZE, dest);
+   tipc_msg_init(self, msg, NAME_DISTRIBUTOR,
+ type, INT_H_SIZE, dest);
msg_set_size(msg, INT_H_SIZE + size);
}
return buf;
@@ -382,13 +382,14 @@ void tipc_named_reinit(struct net *net)
struct name_table *nt = tipc_name_table(net);
struct tipc_net *tn

[net-next 2/8] tipc: some cleanups in the file discover.c

2018-03-22 Thread Jon Maloy
To facilitate the coming changes in the neighbor discovery functionality
we rename and refactor parts of that code. The functional changes
in this commit are trivial, e.g., that we move the message sending call in
tipc_disc_timeout() outside the spinlock-protected region.

Acked-by: Ying Xue <ying@windriver.com>
Signed-off-by: Jon Maloy <jon.ma...@ericsson.com>
---
 net/tipc/bearer.c   |   8 +-
 net/tipc/bearer.h   |   2 +-
 net/tipc/discover.c | 303 +---
 net/tipc/discover.h |   8 +-
 4 files changed, 155 insertions(+), 166 deletions(-)

diff --git a/net/tipc/bearer.c b/net/tipc/bearer.c
index e18cb27..76340b9 100644
--- a/net/tipc/bearer.c
+++ b/net/tipc/bearer.c
@@ -210,7 +210,7 @@ void tipc_bearer_add_dest(struct net *net, u32 bearer_id, 
u32 dest)
rcu_read_lock();
b = rcu_dereference_rtnl(tn->bearer_list[bearer_id]);
if (b)
-   tipc_disc_add_dest(b->link_req);
+   tipc_disc_add_dest(b->disc);
rcu_read_unlock();
 }
 
@@ -222,7 +222,7 @@ void tipc_bearer_remove_dest(struct net *net, u32 
bearer_id, u32 dest)
rcu_read_lock();
b = rcu_dereference_rtnl(tn->bearer_list[bearer_id]);
if (b)
-   tipc_disc_remove_dest(b->link_req);
+   tipc_disc_remove_dest(b->disc);
rcu_read_unlock();
 }
 
@@ -389,8 +389,8 @@ static void bearer_disable(struct net *net, struct 
tipc_bearer *b)
tipc_node_delete_links(net, bearer_id);
b->media->disable_media(b);
RCU_INIT_POINTER(b->media_ptr, NULL);
-   if (b->link_req)
-   tipc_disc_delete(b->link_req);
+   if (b->disc)
+   tipc_disc_delete(b->disc);
RCU_INIT_POINTER(tn->bearer_list[bearer_id], NULL);
kfree_rcu(b, rcu);
tipc_mon_delete(net, bearer_id);
diff --git a/net/tipc/bearer.h b/net/tipc/bearer.h
index a53613d..6efcee6 100644
--- a/net/tipc/bearer.h
+++ b/net/tipc/bearer.h
@@ -159,7 +159,7 @@ struct tipc_bearer {
u32 tolerance;
u32 domain;
u32 identity;
-   struct tipc_link_req *link_req;
+   struct tipc_discoverer *disc;
char net_plane;
unsigned long up;
 };
diff --git a/net/tipc/discover.c b/net/tipc/discover.c
index 92e4828..09f7555 100644
--- a/net/tipc/discover.c
+++ b/net/tipc/discover.c
@@ -39,34 +39,34 @@
 #include "discover.h"
 
 /* min delay during bearer start up */
-#define TIPC_LINK_REQ_INIT msecs_to_jiffies(125)
+#define TIPC_DISC_INIT msecs_to_jiffies(125)
 /* max delay if bearer has no links */
-#define TIPC_LINK_REQ_FAST msecs_to_jiffies(1000)
+#define TIPC_DISC_FAST msecs_to_jiffies(1000)
 /* max delay if bearer has links */
-#define TIPC_LINK_REQ_SLOW msecs_to_jiffies(6)
+#define TIPC_DISC_SLOW msecs_to_jiffies(6)
 /* indicates no timer in use */
-#define TIPC_LINK_REQ_INACTIVE 0x
+#define TIPC_DISC_INACTIVE 0x
 
 /**
- * struct tipc_link_req - information about an ongoing link setup request
+ * struct tipc_discoverer - information about an ongoing link setup request
  * @bearer_id: identity of bearer issuing requests
  * @net: network namespace instance
  * @dest: destination address for request messages
  * @domain: network domain to which links can be established
  * @num_nodes: number of nodes currently discovered (i.e. with an active link)
  * @lock: spinlock for controlling access to requests
- * @buf: request message to be (repeatedly) sent
+ * @skb: request message to be (repeatedly) sent
  * @timer: timer governing period between requests
  * @timer_intv: current interval between requests (in ms)
  */
-struct tipc_link_req {
+struct tipc_discoverer {
u32 bearer_id;
struct tipc_media_addr dest;
struct net *net;
u32 domain;
int num_nodes;
spinlock_t lock;
-   struct sk_buff *buf;
+   struct sk_buff *skb;
struct timer_list timer;
unsigned long timer_intv;
 };
@@ -77,22 +77,35 @@ struct tipc_link_req {
  * @type: message type (request or response)
  * @b: ptr to bearer issuing message
  */
-static void tipc_disc_init_msg(struct net *net, struct sk_buff *buf, u32 type,
-  struct tipc_bearer *b)
+static void tipc_disc_init_msg(struct net *net, struct sk_buff *skb,
+  u32 mtyp, struct tipc_bearer *b)
 {
-   struct tipc_net *tn = net_generic(net, tipc_net_id);
-   struct tipc_msg *msg;
+   struct tipc_net *tn = tipc_net(net);
u32 dest_domain = b->domain;
+   struct tipc_msg *hdr;
 
-   msg = buf_msg(buf);
-   tipc_msg_init(tn->own_addr, msg, LINK_CONFIG, type,
+   hdr = buf_msg(skb);
+   tipc_msg_init(tn->own_addr, hdr, LINK_CONFIG, mtyp,
  MAX_H_SIZE, dest_domain);
-   msg_set_non_seq(msg, 1);
-   msg_set_node_sig(msg, tn->random);
-   msg_set

[net-next 4/8] tipc: allow closest-first lookup algorithm when legacy address is configured

2018-03-22 Thread Jon Maloy
The removal of an internal structure of the node address has an unwanted
side effect.
- Currently, if a user is sending an anycast message with destination
  domain 0, the tipc_nametbl_translate() function will use the
  'closest-first' algorithm to first look for a node local destination,
  and only when none is found will it resort to the cluster global
  'round-robin' lookup algorithm.
- Current users can get around this, and enforce unconditional use of
  global round-robin by indicating a destination as Z.0.0 or Z.C.0.
- This option disappears when we make the node address flat, since the
  lookup algorithm has no way of recognizing this case. So, as long as
  there are node local destinations, the algorithm will always select
  one of those, and there is nothing the sender can do to change this.

We solve this by eliminating the 'closest-first' option, which was never
a good idea anyway, for non-legacy users, but only for those. To
distinguish between legacy users and non-legacy users we introduce a new
flag 'legacy_addr_format' in struct tipc_net, to be set when the user
configures a legacy-style Z.C.N node address. Hence, when a legacy user
indicates a zero lookup domain, 'closest-first' is selected, and in all
other cases we use 'round-robin'.
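
The selection rule described above can be sketched as a small standalone
C function. This is only an illustration of the rule as stated in the
commit message; the enum and function names are hypothetical and are not
the identifiers used in net/tipc.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for illustration; not the kernel's identifiers. */
enum lookup_policy {
	LOOKUP_ROUND_ROBIN,   /* cluster global round-robin */
	LOOKUP_CLOSEST_FIRST  /* prefer a node local destination */
};

/*
 * Pick the lookup policy as described in the commit message:
 * closest-first only for a legacy-address user asking for domain 0,
 * round-robin in every other case.
 */
static enum lookup_policy select_lookup_policy(bool legacy_addr_format,
					       uint32_t domain)
{
	if (legacy_addr_format && domain == 0)
		return LOOKUP_CLOSEST_FIRST;
	return LOOKUP_ROUND_ROBIN;
}
```

Keeping the decision in one predicate means a non-legacy sender can never
end up with 'closest-first', whatever domain value it passes.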

Acked-by: Ying Xue <ying@windriver.com>
Signed-off-by: Jon Maloy <jon.ma...@ericsson.com>
---
 net/tipc/addr.c   | 12 +++-
 net/tipc/addr.h   |  2 +-
 net/tipc/core.h   |  3 ++-
 net/tipc/discover.c   | 13 ++---
 net/tipc/name_table.c |  8 +---
 net/tipc/net.c|  2 +-
 6 files changed, 22 insertions(+), 18 deletions(-)

diff --git a/net/tipc/addr.c b/net/tipc/addr.c
index dfc31a7..1998799 100644
--- a/net/tipc/addr.c
+++ b/net/tipc/addr.c
@@ -48,15 +48,17 @@ int in_own_node(struct net *net, u32 addr)
return (addr == tn->own_addr) || !addr;
 }
 
-int tipc_in_scope(u32 domain, u32 addr)
+bool tipc_in_scope(bool legacy_format, u32 domain, u32 addr)
 {
if (!domain || (domain == addr))
-   return 1;
+   return true;
+   if (!legacy_format)
+   return false;
if (domain == tipc_cluster_mask(addr)) /* domain  */
-   return 1;
+   return true;
if (domain == (addr & TIPC_ZONE_MASK)) /* domain  */
-   return 1;
-   return 0;
+   return true;
+   return false;
 }
 
 char *tipc_addr_string_fill(char *string, u32 addr)
diff --git a/net/tipc/addr.h b/net/tipc/addr.h
index 5ffde51..97bdc0e 100644
--- a/net/tipc/addr.h
+++ b/net/tipc/addr.h
@@ -67,7 +67,7 @@ static inline int tipc_scope2node(struct net *net, int sc)
 
 u32 tipc_own_addr(struct net *net);
 int in_own_node(struct net *net, u32 addr);
-int tipc_in_scope(u32 domain, u32 addr);
+bool tipc_in_scope(bool legacy_format, u32 domain, u32 addr);
 char *tipc_addr_string_fill(char *string, u32 addr);
 
 #endif
diff --git a/net/tipc/core.h b/net/tipc/core.h
index 347f850..bd2b112 100644
--- a/net/tipc/core.h
+++ b/net/tipc/core.h
@@ -1,7 +1,7 @@
 /*
  * net/tipc/core.h: Include file for TIPC global declarations
  *
- * Copyright (c) 2005-2006, 2013 Ericsson AB
+ * Copyright (c) 2005-2006, 2013-2018 Ericsson AB
  * Copyright (c) 2005-2007, 2010-2013, Wind River Systems
  * All rights reserved.
  *
@@ -81,6 +81,7 @@ struct tipc_net {
u32 own_addr;
int net_id;
int random;
+   bool legacy_addr_format;
 
/* Node table and node list */
spinlock_t node_list_lock;
diff --git a/net/tipc/discover.c b/net/tipc/discover.c
index 669af12..82556e1 100644
--- a/net/tipc/discover.c
+++ b/net/tipc/discover.c
@@ -139,6 +139,7 @@ void tipc_disc_rcv(struct net *net, struct sk_buff *skb,
struct tipc_net *tn = tipc_net(net);
struct tipc_msg *hdr = buf_msg(skb);
u16 caps = msg_node_capabilities(hdr);
+   bool legacy = tn->legacy_addr_format;
u32 signature = msg_node_sig(hdr);
u32 dst = msg_dest_domain(hdr);
u32 net_id = msg_bc_netid(hdr);
@@ -165,13 +166,11 @@ void tipc_disc_rcv(struct net *net, struct sk_buff *skb,
disc_dupl_alert(b, self, );
return;
}
-   /* Domain filter only works if both peers use legacy address format */
-   if (b->domain) {
-   if (!tipc_in_scope(dst, self))
-   return;
-   if (!tipc_in_scope(b->domain, src))
-   return;
-   }
+   if (!tipc_in_scope(legacy, dst, self))
+   return;
+   if (!tipc_in_scope(legacy, b->domain, src))
+   return;
+
tipc_node_check_dest(net, src, b, caps, signature,
 , , _addr);
if (dupl_addr)
diff --git a/net/tipc/name_table.c b/net/tipc/name_table.c
index bbbfc07..7478acb 100644
--- a/net/tipc/name_table.c
+++ b/net/tipc/name_table.c
@@ -499,7 +499,9 @@ struct publication *tipc_nametbl_remove_publ(s

[net-next 3/8] tipc: remove restrictions on node address values

2018-03-22 Thread Jon Maloy
Nominally, TIPC organizes network nodes into a three-level network
hierarchy consisting of the levels 'zone', 'cluster' and 'node'. This
hierarchy is reflected in the node address format: it is sub-divided
into an 8-bit zone id, a 12-bit cluster id, and a 12-bit node id.

However, the 'zone' and 'cluster' levels have in reality never been
fully implemented, and never will be. The result of this has been
that the first 20 bits of the node identity structure have been wasted,
and the usable node identity range within a cluster has been limited
to 12 bits. This is starting to become a problem.
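
For readers unfamiliar with the legacy layout, the 8/12/12 split can be
sketched in standalone C as follows. The bit widths come from the text
above; the macro and function names are hypothetical stand-ins chosen for
this sketch, not the uapi definitions themselves.

```c
#include <stdint.h>

/* Bit widths of the legacy Z.C.N address, as given in the text. */
#define ZCN_ZONE_BITS     8
#define ZCN_CLUSTER_BITS  12
#define ZCN_NODE_BITS     12

#define ZCN_NODE_OFFSET     0
#define ZCN_CLUSTER_OFFSET  ZCN_NODE_BITS
#define ZCN_ZONE_OFFSET     (ZCN_CLUSTER_OFFSET + ZCN_CLUSTER_BITS)

#define ZCN_NODE_MASK \
	(((UINT32_C(1) << ZCN_NODE_BITS) - 1) << ZCN_NODE_OFFSET)
#define ZCN_CLUSTER_MASK \
	(((UINT32_C(1) << ZCN_CLUSTER_BITS) - 1) << ZCN_CLUSTER_OFFSET)

/* Pack zone, cluster and node ids into one 32-bit address. */
static uint32_t zcn_addr(uint32_t zone, uint32_t cluster, uint32_t node)
{
	return (zone << ZCN_ZONE_OFFSET) |
	       (cluster << ZCN_CLUSTER_OFFSET) |
	       node;
}

/* Extract the individual fields again. */
static uint32_t zcn_zone(uint32_t addr)
{
	return addr >> ZCN_ZONE_OFFSET;
}

static uint32_t zcn_cluster(uint32_t addr)
{
	return (addr & ZCN_CLUSTER_MASK) >> ZCN_CLUSTER_OFFSET;
}

static uint32_t zcn_node(uint32_t addr)
{
	return addr & ZCN_NODE_MASK;
}
```

With this layout only the low 12 bits identify a node, so at most 4095
non-zero node ids are available within one cluster.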

In the following commits, we will need to be able to connect between
nodes that use the whole 32-bit value space of the node address.
We therefore remove the restrictions on which values can be assigned
to the node identity: from now on it is only a 32-bit integer with no
assumed internal structure.

Isolation between clusters is now achieved only by setting different
values for the 'network id' field used during neighbor discovery, in
practice leading to the latter becoming the new cluster identity.

The rules for accepting discovery requests/responses from neighboring
nodes now become:

- If the user is using legacy address format on both peers, reception
  of discovery messages is subject to the legacy lookup domain check
  in addition to the cluster id check.

- Otherwise, the discovery request/response is always accepted, provided
  both peers have the same network id.

This secures backwards compatibility for users who have been using zone
or cluster identities as cluster separators, instead of the intended
'network id'.
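
The acceptance rules listed above can be sketched as a standalone C
predicate. This is an assumption-laden illustration, not the code in
net/tipc/discover.c: the function names, the 'both_legacy' parameter and
the way the masks are spelled are all hypothetical, and how a node learns
whether its peer uses the legacy format is outside the sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Legacy Z.C.N masks: 8-bit zone, 12-bit cluster, 12-bit node. */
#define SKETCH_ZONE_MASK          UINT32_C(0xff000000)
#define SKETCH_ZONE_CLUSTER_MASK  UINT32_C(0xfffff000)

/* Legacy lookup domain check: domain 0 and exact matches always pass. */
static bool in_legacy_scope(uint32_t domain, uint32_t addr)
{
	if (!domain || domain == addr)
		return true;
	if (domain == (addr & SKETCH_ZONE_CLUSTER_MASK))	/* Z.C.0 */
		return true;
	if (domain == (addr & SKETCH_ZONE_MASK))		/* Z.0.0 */
		return true;
	return false;
}

/*
 * Accept a discovery message only if the network ids match; apply the
 * legacy domain filter on top of that only when both peers are legacy.
 */
static bool accept_discovery(uint32_t own_net_id, uint32_t peer_net_id,
			     bool both_legacy, uint32_t domain,
			     uint32_t peer_addr)
{
	if (own_net_id != peer_net_id)
		return false;
	if (!both_legacy)
		return true;
	return in_legacy_scope(domain, peer_addr);
}
```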

Acked-by: Ying Xue <ying@windriver.com>
Signed-off-by: Jon Maloy <jon.ma...@ericsson.com>
---
 net/tipc/addr.c | 50 +-
 net/tipc/addr.h | 11 ---
 net/tipc/bearer.c   | 27 ---
 net/tipc/discover.c | 15 +++
 net/tipc/link.c |  6 ++
 net/tipc/net.c  |  4 ++--
 net/tipc/node.c |  8 ++--
 net/tipc/node.h |  5 +++--
 8 files changed, 21 insertions(+), 105 deletions(-)

diff --git a/net/tipc/addr.c b/net/tipc/addr.c
index 97cd857..dfc31a7 100644
--- a/net/tipc/addr.c
+++ b/net/tipc/addr.c
@@ -39,21 +39,6 @@
 #include "core.h"
 
 /**
- * in_own_cluster - test for cluster inclusion; <0.0.0> always matches
- */
-int in_own_cluster(struct net *net, u32 addr)
-{
-   return in_own_cluster_exact(net, addr) || !addr;
-}
-
-int in_own_cluster_exact(struct net *net, u32 addr)
-{
-   struct tipc_net *tn = net_generic(net, tipc_net_id);
-
-   return !((addr ^ tn->own_addr) >> 12);
-}
-
-/**
  * in_own_node - test for node inclusion; <0.0.0> always matches
  */
 int in_own_node(struct net *net, u32 addr)
@@ -63,46 +48,13 @@ int in_own_node(struct net *net, u32 addr)
return (addr == tn->own_addr) || !addr;
 }
 
-/**
- * tipc_addr_domain_valid - validates a network domain address
- *
- * Accepts , , , and <0.0.0>,
- * where Z, C, and N are non-zero.
- *
- * Returns 1 if domain address is valid, otherwise 0
- */
-int tipc_addr_domain_valid(u32 addr)
-{
-   u32 n = tipc_node(addr);
-   u32 c = tipc_cluster(addr);
-   u32 z = tipc_zone(addr);
-
-   if (n && (!z || !c))
-   return 0;
-   if (c && !z)
-   return 0;
-   return 1;
-}
-
-/**
- * tipc_addr_node_valid - validates a proposed network address for this node
- *
- * Accepts , where Z, C, and N are non-zero.
- *
- * Returns 1 if address can be used, otherwise 0
- */
-int tipc_addr_node_valid(u32 addr)
-{
-   return tipc_addr_domain_valid(addr) && tipc_node(addr);
-}
-
 int tipc_in_scope(u32 domain, u32 addr)
 {
if (!domain || (domain == addr))
return 1;
if (domain == tipc_cluster_mask(addr)) /* domain  */
return 1;
-   if (domain == tipc_zone_mask(addr)) /* domain  */
+   if (domain == (addr & TIPC_ZONE_MASK)) /* domain  */
return 1;
return 0;
 }
diff --git a/net/tipc/addr.h b/net/tipc/addr.h
index 2ecf5a5..5ffde51 100644
--- a/net/tipc/addr.h
+++ b/net/tipc/addr.h
@@ -50,11 +50,6 @@ static inline u32 tipc_own_addr(struct net *net)
return tn->own_addr;
 }
 
-static inline u32 tipc_zone_mask(u32 addr)
-{
-   return addr & TIPC_ZONE_MASK;
-}
-
 static inline u32 tipc_cluster_mask(u32 addr)
 {
return addr & TIPC_ZONE_CLUSTER_MASK;
@@ -71,14 +66,8 @@ static inline int tipc_scope2node(struct net *net, int sc)
 }
 
 u32 tipc_own_addr(struct net *net);
-int in_own_cluster(struct net *net, u32 addr);
-int in_own_cluster_exact(struct net *net, u32 addr);
 int in_own_node(struct net *net, u32 addr);
-u32 addr_domain(struct net *net, u32 sc);
-int tipc_addr_domain_valid(u32);
-int tipc_addr_node_valid(u32 addr);
 int tipc_in_scope(u32 domain, u32 addr);
-int tipc_addr_scope(u32 dom

[net-next 6/8] tipc: add 128-bit node identifier

2018-03-22 Thread Jon Maloy
We add a 128-bit node identity, as an alternative to the currently used
32-bit node address.

For the sake of compatibility and to minimize message header changes
we retain the existing 32-bit address field. When not set explicitly by
the user, this field will be filled with a hash value generated from the
much longer node identity, and be used as a shorthand value for the
latter.

We permit either the address or the identity to be set by configuration,
but not both, so when the address value is set by a legacy user the
corresponding 128-bit node identity is generated based on that value.
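
As a rough illustration of how a 128-bit identity can be reduced to a
32-bit shorthand, the sketch below XOR-folds the four 32-bit words of the
identity, which is the approach visible in tipc_set_node_id() in the diff
that follows. The function name is hypothetical, NODE_ID_LEN is assumed to
be 16 (128 bits), and memcpy() is used here instead of a pointer cast to
keep the standalone example free of alignment and aliasing concerns.

```c
#include <stdint.h>
#include <string.h>

#define NODE_ID_LEN 16	/* assumed: 128 bits = 16 bytes */

/* Fold a 128-bit node identity into a 32-bit value by XOR. */
static uint32_t node_id_to_addr(const uint8_t id[NODE_ID_LEN])
{
	uint32_t w[4];

	memcpy(w, id, sizeof(w));	/* host byte order, as in the diff */
	return w[0] ^ w[1] ^ w[2] ^ w[3];
}
```

An XOR fold is cheap but not collision-free: two different identities can
map to the same 32-bit value, which is why a later patch in this series
deals with hash collisions.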

Acked-by: Ying Xue <ying@windriver.com>
Signed-off-by: Jon Maloy <jon.ma...@ericsson.com>
---
 include/uapi/linux/tipc_netlink.h |  2 +
 net/tipc/addr.c   | 81 ---
 net/tipc/addr.h   | 28 +++---
 net/tipc/core.c   |  4 +-
 net/tipc/core.h   |  6 ++-
 net/tipc/discover.c   |  4 +-
 net/tipc/link.c   |  6 ++-
 net/tipc/name_distr.c |  6 +--
 net/tipc/net.c| 51 
 net/tipc/net.h|  4 +-
 net/tipc/node.c   |  8 +---
 net/tipc/node.h   |  4 +-
 12 files changed, 148 insertions(+), 56 deletions(-)

diff --git a/include/uapi/linux/tipc_netlink.h 
b/include/uapi/linux/tipc_netlink.h
index d896ded..0affb68 100644
--- a/include/uapi/linux/tipc_netlink.h
+++ b/include/uapi/linux/tipc_netlink.h
@@ -169,6 +169,8 @@ enum {
TIPC_NLA_NET_UNSPEC,
TIPC_NLA_NET_ID,/* u32 */
TIPC_NLA_NET_ADDR,  /* u32 */
+   TIPC_NLA_NET_NODEID,/* u64 */
+   TIPC_NLA_NET_NODEID_W1, /* u64 */
 
__TIPC_NLA_NET_MAX,
TIPC_NLA_NET_MAX = __TIPC_NLA_NET_MAX - 1
diff --git a/net/tipc/addr.c b/net/tipc/addr.c
index 6e06b4d..4841e98 100644
--- a/net/tipc/addr.c
+++ b/net/tipc/addr.c
@@ -1,7 +1,7 @@
 /*
  * net/tipc/addr.c: TIPC address utility routines
  *
- * Copyright (c) 2000-2006, Ericsson AB
+ * Copyright (c) 2000-2006, 2018, Ericsson AB
  * Copyright (c) 2004-2005, 2010-2011, Wind River Systems
  * All rights reserved.
  *
@@ -34,18 +34,9 @@
  * POSSIBILITY OF SUCH DAMAGE.
  */
 
-#include 
 #include "addr.h"
 #include "core.h"
 
-/**
- * in_own_node - test for node inclusion; <0.0.0> always matches
- */
-int in_own_node(struct net *net, u32 addr)
-{
-   return addr == tipc_own_addr(net) || !addr;
-}
-
 bool tipc_in_scope(bool legacy_format, u32 domain, u32 addr)
 {
if (!domain || (domain == addr))
@@ -61,9 +52,71 @@ bool tipc_in_scope(bool legacy_format, u32 domain, u32 addr)
return false;
 }
 
-char *tipc_addr_string_fill(char *string, u32 addr)
+void tipc_set_node_id(struct net *net, u8 *id)
+{
+   struct tipc_net *tn = tipc_net(net);
+   u32 *tmp = (u32 *)id;
+
+   memcpy(tn->node_id, id, NODE_ID_LEN);
+   tipc_nodeid2string(tn->node_id_string, id);
+   tn->node_addr = tmp[0] ^ tmp[1] ^ tmp[2] ^ tmp[3];
+   pr_info("Own node identity %s, cluster identity %u\n",
+   tipc_own_id_string(net), tn->net_id);
+}
+
+void tipc_set_node_addr(struct net *net, u32 addr)
 {
-   snprintf(string, 16, "<%u.%u.%u>",
-tipc_zone(addr), tipc_cluster(addr), tipc_node(addr));
-   return string;
+   struct tipc_net *tn = tipc_net(net);
+   u8 node_id[NODE_ID_LEN] = {0,};
+
+   tn->node_addr = addr;
+   if (!tipc_own_id(net)) {
+   sprintf(node_id, "%x", addr);
+   tipc_set_node_id(net, node_id);
+   }
+   pr_info("32-bit node address hash set to %x\n", addr);
+}
+
+char *tipc_nodeid2string(char *str, u8 *id)
+{
+   int i;
+   u8 c;
+
+   /* Already a string ? */
+   for (i = 0; i < NODE_ID_LEN; i++) {
+   c = id[i];
+   if (c >= '0' && c <= '9')
+   continue;
+   if (c >= 'A' && c <= 'Z')
+   continue;
+   if (c >= 'a' && c <= 'z')
+   continue;
+   if (c == '.')
+   continue;
+   if (c == ':')
+   continue;
+   if (c == '_')
+   continue;
+   if (c == '-')
+   continue;
+   if (c == '@')
+   continue;
+   if (c != 0)
+   break;
+   }
+   if (i == NODE_ID_LEN) {
+   memcpy(str, id, NODE_ID_LEN);
+   str[NODE_ID_LEN] = 0;
+   return str;
+   }
+
+   /* Translate to hex string */
+   for (i = 0; i < NODE_ID_LEN; i++)
+   sprintf([2 * i], "%02x", id[i]);
+
+   /* Strip off trailin

[net-next 0/8] tipc: introduce 128-bit auto-configurable node id

2018-03-22 Thread Jon Maloy
We introduce a 128-bit free-format node identity as an alternative to
the legacy structured 32-bit node address.

We also make configuration of this identity optional; if a bearer is
enabled without a pre-configured node id it will be set automatically
based on the used interface's MAC or IP address.

Jon Maloy (8):
  tipc: refactor function tipc_enable_bearer()
  tipc: some cleanups in the file discover.c
  tipc: remove restrictions on node address values
  tipc: allow closest-first lookup algorithm when legacy address is
configured
  tipc: remove direct accesses to own_addr field in struct tipc_net
  tipc: add 128-bit node identifier
  tipc: handle collisions of 32-bit node address hash values
  tipc: obtain node identity from interface by default

 include/uapi/linux/tipc_netlink.h |   2 +
 net/tipc/addr.c   | 128 +++--
 net/tipc/addr.h   |  37 ++--
 net/tipc/bearer.c | 152 +++
 net/tipc/bearer.h |   2 +-
 net/tipc/core.c   |   6 +-
 net/tipc/core.h   |  11 +-
 net/tipc/discover.c   | 392 ++
 net/tipc/discover.h   |   8 +-
 net/tipc/link.c   |  33 ++--
 net/tipc/link.h   |   4 +-
 net/tipc/msg.h|  23 ++-
 net/tipc/name_distr.c |  17 +-
 net/tipc/name_table.c |  14 +-
 net/tipc/net.c|  80 
 net/tipc/net.h|   5 +-
 net/tipc/node.c   | 101 --
 net/tipc/node.h   |   8 +-
 net/tipc/socket.c |  23 +--
 net/tipc/udp_media.c  |  13 ++
 20 files changed, 634 insertions(+), 425 deletions(-)

-- 
2.1.4



RE: [net-next 1/5] tipc: obsolete TIPC_ZONE_SCOPE

2018-03-15 Thread Jon Maloy
No, it won't. I just moved those functions and #defines to the bottom of the 
same file, and marked them as 'deprecated'.

BR
///jon

> -Original Message-
> From: Jiri Pirko [mailto:j...@resnulli.us]
> Sent: Thursday, March 15, 2018 12:11
> To: Jon Maloy <jon.ma...@ericsson.com>
> Cc: da...@davemloft.net; netdev@vger.kernel.org; Mohan Krishna Ghanta
> Krishnamurthy <mohan.krishna.ghanta.krishnamur...@ericsson.com>; Tung
> Quang Nguyen <tung.q.ngu...@dektech.com.au>; Hoang Huu Le
> <hoang.h...@dektech.com.au>; Canh Duc Luu
> <canh.d@dektech.com.au>; Ying Xue <ying@windriver.com>; tipc-
> discuss...@lists.sourceforge.net
> Subject: Re: [net-next 1/5] tipc: obsolete TIPC_ZONE_SCOPE
> 
> Thu, Mar 15, 2018 at 04:48:51PM CET, jon.ma...@ericsson.com wrote:
> >Publications for TIPC_CLUSTER_SCOPE and TIPC_ZONE_SCOPE are in all
> >aspects handled the same way, both on the publishing node and on the
> >receiving nodes.
> >
> >Despite previous ambitions to the contrary, this is never going to
> >change, so we take the consequence of this and obsolete
> TIPC_ZONE_SCOPE
> >and related macros/functions. Whenever a user is doing a bind() or a
> >sendmsg() attempt using ZONE_SCOPE we translate this internally to
> >CLUSTER_SCOPE, while we remain compatible with users and remote nodes
> still using ZONE_SCOPE.
> >
> >Furthermore, the non-formalized scope value 0 has always been permitted
> >for use during lookup, with the same meaning as
> ZONE_SCOPE/CLUSTER_SCOPE.
> >We now permit it even as binding scope, but for compatibility reasons
> >we choose to not change the value of TIPC_CLUSTER_SCOPE.
> >
> >Acked-by: Ying Xue <ying@windriver.com>
> >Signed-off-by: Jon Maloy <jon.ma...@ericsson.com>
> 
> [...]
> 
> 
> >diff --git a/include/uapi/linux/tipc.h b/include/uapi/linux/tipc.h
> >index 14bacc7..4ac9f1f 100644
> >--- a/include/uapi/linux/tipc.h
> >+++ b/include/uapi/linux/tipc.h
> >@@ -61,50 +61,6 @@ struct tipc_name_seq {
> > __u32 upper;
> > };
> >
> >-/* TIPC Address Size, Offset, Mask specification for Z.C.N
> >- */
> >-#define TIPC_NODE_BITS  12
> >-#define TIPC_CLUSTER_BITS   12
> >-#define TIPC_ZONE_BITS  8
> >-
> >-#define TIPC_NODE_OFFSET0
> >-#define TIPC_CLUSTER_OFFSET TIPC_NODE_BITS
> >-#define TIPC_ZONE_OFFSET(TIPC_CLUSTER_OFFSET +
> TIPC_CLUSTER_BITS)
> >-
> >-#define TIPC_NODE_SIZE  ((1UL << TIPC_NODE_BITS) - 1)
> >-#define TIPC_CLUSTER_SIZE   ((1UL << TIPC_CLUSTER_BITS) - 1)
> >-#define TIPC_ZONE_SIZE  ((1UL << TIPC_ZONE_BITS) - 1)
> >-
> >-#define TIPC_NODE_MASK  (TIPC_NODE_SIZE <<
> TIPC_NODE_OFFSET)
> >-#define TIPC_CLUSTER_MASK   (TIPC_CLUSTER_SIZE <<
> TIPC_CLUSTER_OFFSET)
> >-#define TIPC_ZONE_MASK  (TIPC_ZONE_SIZE <<
> TIPC_ZONE_OFFSET)
> >-
> >-#define TIPC_ZONE_CLUSTER_MASK (TIPC_ZONE_MASK |
> TIPC_CLUSTER_MASK)
> >-
> >-static inline __u32 tipc_addr(unsigned int zone,
> >-  unsigned int cluster,
> >-  unsigned int node)
> >-{
> >-return (zone << TIPC_ZONE_OFFSET) |
> >-(cluster << TIPC_CLUSTER_OFFSET) |
> >-node;
> >-}
> >-
> >-static inline unsigned int tipc_zone(__u32 addr) -{
> >-return addr >> TIPC_ZONE_OFFSET;
> >-}
> >-
> >-static inline unsigned int tipc_cluster(__u32 addr) -{
> >-return (addr & TIPC_CLUSTER_MASK) >> TIPC_CLUSTER_OFFSET;
> >-}
> >-
> >-static inline unsigned int tipc_node(__u32 addr) -{
> >-return addr & TIPC_NODE_MASK;
> >-}
> 
> If someone includes tipc.h and uses any of this, your patch is going to break
> his compilation. Would anyone have good reason to use any of this?


[net-next 2/5] tipc: remove zone publication list in name table

2018-03-15 Thread Jon Maloy
As a consequence of the previous commit we can now eliminate zone scope
related lists in the name table. We start with name_table::publ_list[3],
which can now be replaced with two lists, one for node scope publications
and one for cluster scope publications.
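
A minimal standalone sketch of the two-list arrangement is shown below.
It is not the kernel code: the struct and function names are hypothetical,
the list head is a bare stand-in for the kernel's struct list_head, and
the scope constant is an assumption that mirrors the uapi value of
TIPC_NODE_SCOPE.

```c
/* Bare stand-in for the kernel's doubly linked list head. */
struct sketch_list_head {
	struct sketch_list_head *next, *prev;
};

/* Assumed to mirror the uapi value of TIPC_NODE_SCOPE. */
#define SKETCH_NODE_SCOPE  3

/* Hypothetical reduced name table: one list per remaining scope. */
struct sketch_name_table {
	struct sketch_list_head node_scope;
	struct sketch_list_head cluster_scope;
};

/*
 * Choose the list a local publication belongs to. Everything that is
 * not node scope, including the obsoleted zone scope, is treated as
 * cluster scope.
 */
static struct sketch_list_head *
sketch_scope_list(struct sketch_name_table *nt, unsigned int scope)
{
	if (scope == SKETCH_NODE_SCOPE)
		return &nt->node_scope;
	return &nt->cluster_scope;
}
```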

Acked-by: Ying Xue <ying@windriver.com>
Signed-off-by: Jon Maloy <jon.ma...@ericsson.com>
---
 net/tipc/core.h   |  5 +
 net/tipc/name_distr.c | 39 ++-
 net/tipc/name_table.c |  5 ++---
 net/tipc/name_table.h |  6 --
 4 files changed, 29 insertions(+), 26 deletions(-)

diff --git a/net/tipc/core.h b/net/tipc/core.h
index ff8b071..347f850 100644
--- a/net/tipc/core.h
+++ b/net/tipc/core.h
@@ -131,6 +131,11 @@ static inline struct list_head *tipc_nodes(struct net *net)
return _net(net)->node_list;
 }
 
+static inline struct name_table *tipc_name_table(struct net *net)
+{
+   return tipc_net(net)->nametbl;
+}
+
 static inline struct tipc_topsrv *tipc_topsrv(struct net *net)
 {
return tipc_net(net)->topsrv;
diff --git a/net/tipc/name_distr.c b/net/tipc/name_distr.c
index 23f8899..11ce205 100644
--- a/net/tipc/name_distr.c
+++ b/net/tipc/name_distr.c
@@ -86,25 +86,25 @@ static struct sk_buff *named_prepare_buf(struct net *net, 
u32 type, u32 size,
  */
 struct sk_buff *tipc_named_publish(struct net *net, struct publication *publ)
 {
-   struct tipc_net *tn = net_generic(net, tipc_net_id);
-   struct sk_buff *buf;
+   struct name_table *nt = tipc_name_table(net);
struct distr_item *item;
+   struct sk_buff *skb;
 
-   list_add_tail_rcu(>local_list,
- >nametbl->publ_list[publ->scope]);
-
-   if (publ->scope == TIPC_NODE_SCOPE)
+   if (publ->scope == TIPC_NODE_SCOPE) {
+   list_add_tail_rcu(>local_list, >node_scope);
return NULL;
+   }
+   list_add_tail_rcu(>local_list, >cluster_scope);
 
-   buf = named_prepare_buf(net, PUBLICATION, ITEM_SIZE, 0);
-   if (!buf) {
+   skb = named_prepare_buf(net, PUBLICATION, ITEM_SIZE, 0);
+   if (!skb) {
pr_warn("Publication distribution failure\n");
return NULL;
}
 
-   item = (struct distr_item *)msg_data(buf_msg(buf));
+   item = (struct distr_item *)msg_data(buf_msg(skb));
publ_to_item(item, publ);
-   return buf;
+   return skb;
 }
 
 /**
@@ -184,16 +184,13 @@ static void named_distribute(struct net *net, struct 
sk_buff_head *list,
  */
 void tipc_named_node_up(struct net *net, u32 dnode)
 {
-   struct tipc_net *tn = net_generic(net, tipc_net_id);
+   struct name_table *nt = tipc_name_table(net);
struct sk_buff_head head;
 
__skb_queue_head_init();
 
rcu_read_lock();
-   named_distribute(net, , dnode,
->nametbl->publ_list[TIPC_CLUSTER_SCOPE]);
-   named_distribute(net, , dnode,
->nametbl->publ_list[TIPC_ZONE_SCOPE]);
+   named_distribute(net, , dnode, >cluster_scope);
rcu_read_unlock();
 
tipc_node_xmit(net, , dnode, 0);
@@ -382,16 +379,16 @@ void tipc_named_rcv(struct net *net, struct sk_buff_head 
*inputq)
  */
 void tipc_named_reinit(struct net *net)
 {
-   struct tipc_net *tn = net_generic(net, tipc_net_id);
+   struct name_table *nt = tipc_name_table(net);
+   struct tipc_net *tn = tipc_net(net);
struct publication *publ;
-   int scope;
 
spin_lock_bh(>nametbl_lock);
 
-   for (scope = TIPC_ZONE_SCOPE; scope <= TIPC_NODE_SCOPE; scope++)
-   list_for_each_entry_rcu(publ, >nametbl->publ_list[scope],
-   local_list)
-   publ->node = tn->own_addr;
+   list_for_each_entry_rcu(publ, >node_scope, local_list)
+   publ->node = tn->own_addr;
+   list_for_each_entry_rcu(publ, >cluster_scope, local_list)
+   publ->node = tn->own_addr;
 
spin_unlock_bh(>nametbl_lock);
 }
diff --git a/net/tipc/name_table.c b/net/tipc/name_table.c
index 6772390..1a3a327 100644
--- a/net/tipc/name_table.c
+++ b/net/tipc/name_table.c
@@ -878,9 +878,8 @@ int tipc_nametbl_init(struct net *net)
for (i = 0; i < TIPC_NAMETBL_SIZE; i++)
INIT_HLIST_HEAD(_nametbl->seq_hlist[i]);
 
-   INIT_LIST_HEAD(_nametbl->publ_list[TIPC_ZONE_SCOPE]);
-   INIT_LIST_HEAD(_nametbl->publ_list[TIPC_CLUSTER_SCOPE]);
-   INIT_LIST_HEAD(_nametbl->publ_list[TIPC_NODE_SCOPE]);
+   INIT_LIST_HEAD(_nametbl->node_scope);
+   INIT_LIST_HEAD(_nametbl->cluster_scope);
tn->nametbl = tipc_nametbl;
spin_lock_init(>nametbl_lock);
return 0;
diff --git a/net/tipc/name_table.h b/net/tipc/name_table.h
index 1765260..47f72cd 100644
--- a/net/tipc/name_table.h
+++ b/net/tipc/name_table.h
@

[net-next 5/5] tipc: some name changes

2018-03-15 Thread Jon Maloy
We rename some lists and fields in struct publication both to make
the naming more consistent and to better reflect their roles. We
also update the descriptions of those lists.

node_list -> local_publ
cluster_list -> all_publ
pport_list -> binding_sock
ref -> port

There are no functional changes in this commit.

Acked-by: Ying Xue <ying@windriver.com>
Signed-off-by: Jon Maloy <jon.ma...@ericsson.com>
---
 net/tipc/name_distr.c |  12 ++---
 net/tipc/name_distr.h |   2 +-
 net/tipc/name_table.c | 143 --
 net/tipc/name_table.h |  38 --
 net/tipc/socket.c |  14 ++---
 5 files changed, 106 insertions(+), 103 deletions(-)

diff --git a/net/tipc/name_distr.c b/net/tipc/name_distr.c
index 4c54fb3..28d095a 100644
--- a/net/tipc/name_distr.c
+++ b/net/tipc/name_distr.c
@@ -56,7 +56,7 @@ static void publ_to_item(struct distr_item *i, struct 
publication *p)
i->type = htonl(p->type);
i->lower = htonl(p->lower);
i->upper = htonl(p->upper);
-   i->ref = htonl(p->ref);
+   i->port = htonl(p->port);
i->key = htonl(p->key);
 }
 
@@ -209,15 +209,15 @@ static void tipc_publ_purge(struct net *net, struct 
publication *publ, u32 addr)
 
spin_lock_bh(>nametbl_lock);
p = tipc_nametbl_remove_publ(net, publ->type, publ->lower,
-publ->node, publ->ref, publ->key);
+publ->node, publ->port, publ->key);
if (p)
tipc_node_unsubscribe(net, >binding_node, addr);
spin_unlock_bh(>nametbl_lock);
 
if (p != publ) {
pr_err("Unable to remove publication from failed node\n"
-  " (type=%u, lower=%u, node=0x%x, ref=%u, key=%u)\n",
-  publ->type, publ->lower, publ->node, publ->ref,
+  " (type=%u, lower=%u, node=0x%x, port=%u, key=%u)\n",
+  publ->type, publ->lower, publ->node, publ->port,
   publ->key);
}
 
@@ -268,7 +268,7 @@ static bool tipc_update_nametbl(struct net *net, struct 
distr_item *i,
ntohl(i->lower),
ntohl(i->upper),
TIPC_CLUSTER_SCOPE, node,
-   ntohl(i->ref), ntohl(i->key));
+   ntohl(i->port), ntohl(i->key));
if (publ) {
tipc_node_subscribe(net, >binding_node, node);
return true;
@@ -276,7 +276,7 @@ static bool tipc_update_nametbl(struct net *net, struct 
distr_item *i,
} else if (dtype == WITHDRAWAL) {
publ = tipc_nametbl_remove_publ(net, ntohl(i->type),
ntohl(i->lower),
-   node, ntohl(i->ref),
+   node, ntohl(i->port),
ntohl(i->key));
if (publ) {
tipc_node_unsubscribe(net, >binding_node, node);
diff --git a/net/tipc/name_distr.h b/net/tipc/name_distr.h
index 1264ba0..4753e62 100644
--- a/net/tipc/name_distr.h
+++ b/net/tipc/name_distr.h
@@ -63,7 +63,7 @@ struct distr_item {
__be32 type;
__be32 lower;
__be32 upper;
-   __be32 ref;
+   __be32 port;
__be32 key;
 };
 
diff --git a/net/tipc/name_table.c b/net/tipc/name_table.c
index 6d7b4c7..bbbfc07 100644
--- a/net/tipc/name_table.c
+++ b/net/tipc/name_table.c
@@ -1,7 +1,7 @@
 /*
  * net/tipc/name_table.c: TIPC name table code
  *
- * Copyright (c) 2000-2006, 2014-2015, Ericsson AB
+ * Copyright (c) 2000-2006, 2014-2018, Ericsson AB
  * Copyright (c) 2004-2008, 2010-2014, Wind River Systems
  * All rights reserved.
  *
@@ -51,11 +51,11 @@
 /**
  * struct name_info - name sequence publication info
  * @node_list: list of publications on own node of this <type,lower,upper>
- * @cluster_list: list of all publications of this <type,lower,upper>
+ * @all_publ: list of all publications of this <type,lower,upper>
  */
 struct name_info {
-   struct list_head node_list;
-   struct list_head cluster_list;
+   struct list_head local_publ;
+   struct list_head all_publ;
 };
 
 /**
@@ -102,7 +102,7 @@ static int hash(int x)
  * publ_create - create a publication structure
  */
 static struct publication *publ_create(u32 type, u32 lower, u32 upper,
-  u32 scope, u32 node, u32 port_ref,
+  u32 scope, u32 node, u32 port,
   u32 key)
 {

[net-next 4/5] tipc: merge two lists in struct publication

2018-03-15 Thread Jon Maloy
The size of struct publication can be reduced further. Membership in
lists 'nodesub_list' and 'local_list' is mutually exclusive, in that
remote publications use the former and local publications the latter.
We replace the two lists with a single one, named 'binding_node', which
reflects what it really is.

Acked-by: Ying Xue <ying@windriver.com>
Signed-off-by: Jon Maloy <jon.ma...@ericsson.com>
---
 net/tipc/name_distr.c | 20 ++--
 net/tipc/name_table.h |  5 ++---
 2 files changed, 12 insertions(+), 13 deletions(-)

diff --git a/net/tipc/name_distr.c b/net/tipc/name_distr.c
index 11ce205..4c54fb3 100644
--- a/net/tipc/name_distr.c
+++ b/net/tipc/name_distr.c
@@ -91,10 +91,10 @@ struct sk_buff *tipc_named_publish(struct net *net, struct 
publication *publ)
struct sk_buff *skb;
 
if (publ->scope == TIPC_NODE_SCOPE) {
-   list_add_tail_rcu(>local_list, >node_scope);
+   list_add_tail_rcu(>binding_node, >node_scope);
return NULL;
}
-   list_add_tail_rcu(>local_list, >cluster_scope);
+   list_add_tail_rcu(>binding_node, >cluster_scope);
 
skb = named_prepare_buf(net, PUBLICATION, ITEM_SIZE, 0);
if (!skb) {
@@ -115,7 +115,7 @@ struct sk_buff *tipc_named_withdraw(struct net *net, struct 
publication *publ)
struct sk_buff *buf;
struct distr_item *item;
 
-   list_del(>local_list);
+   list_del(>binding_node);
 
if (publ->scope == TIPC_NODE_SCOPE)
return NULL;
@@ -147,7 +147,7 @@ static void named_distribute(struct net *net, struct 
sk_buff_head *list,
ITEM_SIZE) * ITEM_SIZE;
u32 msg_rem = msg_dsz;
 
-   list_for_each_entry(publ, pls, local_list) {
+   list_for_each_entry(publ, pls, binding_node) {
/* Prepare next buffer: */
if (!skb) {
skb = named_prepare_buf(net, PUBLICATION, msg_rem,
@@ -211,7 +211,7 @@ static void tipc_publ_purge(struct net *net, struct 
publication *publ, u32 addr)
p = tipc_nametbl_remove_publ(net, publ->type, publ->lower,
 publ->node, publ->ref, publ->key);
if (p)
-   tipc_node_unsubscribe(net, >nodesub_list, addr);
+   tipc_node_unsubscribe(net, >binding_node, addr);
spin_unlock_bh(>nametbl_lock);
 
if (p != publ) {
@@ -246,7 +246,7 @@ void tipc_publ_notify(struct net *net, struct list_head 
*nsub_list, u32 addr)
 {
struct publication *publ, *tmp;
 
-   list_for_each_entry_safe(publ, tmp, nsub_list, nodesub_list)
+   list_for_each_entry_safe(publ, tmp, nsub_list, binding_node)
tipc_publ_purge(net, publ, addr);
tipc_dist_queue_purge(net, addr);
 }
@@ -270,7 +270,7 @@ static bool tipc_update_nametbl(struct net *net, struct 
distr_item *i,
TIPC_CLUSTER_SCOPE, node,
ntohl(i->ref), ntohl(i->key));
if (publ) {
-   tipc_node_subscribe(net, >nodesub_list, node);
+   tipc_node_subscribe(net, >binding_node, node);
return true;
}
} else if (dtype == WITHDRAWAL) {
@@ -279,7 +279,7 @@ static bool tipc_update_nametbl(struct net *net, struct 
distr_item *i,
node, ntohl(i->ref),
ntohl(i->key));
if (publ) {
-   tipc_node_unsubscribe(net, >nodesub_list, node);
+   tipc_node_unsubscribe(net, >binding_node, node);
kfree_rcu(publ, rcu);
return true;
}
@@ -385,9 +385,9 @@ void tipc_named_reinit(struct net *net)
 
spin_lock_bh(>nametbl_lock);
 
-   list_for_each_entry_rcu(publ, >node_scope, local_list)
+   list_for_each_entry_rcu(publ, >node_scope, binding_node)
publ->node = tn->own_addr;
-   list_for_each_entry_rcu(publ, >cluster_scope, local_list)
+   list_for_each_entry_rcu(publ, >cluster_scope, binding_node)
publ->node = tn->own_addr;
 
spin_unlock_bh(>nametbl_lock);
diff --git a/net/tipc/name_table.h b/net/tipc/name_table.h
index a9063e2..cb16bd8 100644
--- a/net/tipc/name_table.h
+++ b/net/tipc/name_table.h
@@ -1,7 +1,7 @@
 /*
  * net/tipc/name_table.h: Include file for TIPC name table code
  *
- * Copyright (c) 2000-2006, 2014-2015, Ericsson AB
+ * Copyright (c) 2000-2006, 2014-2018, Ericsson AB
  * Copyright (c) 2004-2005, 2010-2011, Wind River Systems
  * All rights reserved.
  *
@@ -76,8 +76,7 @@ struct publication {
u32 node;
u32 ref;
u32 key;
-   struct list_head nodesub_list;
-   struct list_h

[net-next 3/5] tipc: remove zone_list member in struct publication

2018-03-15 Thread Jon Maloy
As a further consequence of the previous commits, we can also remove
the member 'zone_list' in struct name_info and struct publication.
Instead, we now let the member cluster_list take over the role of a
container of all publications of a given <type,lower,upper>.
We also remove the counters for the size of those lists, since
they don't serve any purpose.

Acked-by: Ying Xue <ying@windriver.com>
Signed-off-by: Jon Maloy <jon.ma...@ericsson.com>
---
 net/tipc/name_table.c | 101 ++
 net/tipc/name_table.h |   5 +--
 2 files changed, 30 insertions(+), 76 deletions(-)

diff --git a/net/tipc/name_table.c b/net/tipc/name_table.c
index 1a3a327..6d7b4c7 100644
--- a/net/tipc/name_table.c
+++ b/net/tipc/name_table.c
@@ -50,24 +50,12 @@
 
 /**
  * struct name_info - name sequence publication info
- * @node_list: circular list of publications made by own node
- * @cluster_list: circular list of publications made by own cluster
- * @zone_list: circular list of publications made by own zone
- * @node_list_size: number of entries in "node_list"
- * @cluster_list_size: number of entries in "cluster_list"
- * @zone_list_size: number of entries in "zone_list"
- *
- * Note: The zone list always contains at least one entry, since all
- *   publications of the associated name sequence belong to it.
- *   (The cluster and node lists may be empty.)
+ * @node_list: list of publications on own node of this <type,lower,upper>
+ * @cluster_list: list of all publications of this <type,lower,upper>
  */
 struct name_info {
struct list_head node_list;
struct list_head cluster_list;
-   struct list_head zone_list;
-   u32 node_list_size;
-   u32 cluster_list_size;
-   u32 zone_list_size;
 };
 
 /**
@@ -249,7 +237,7 @@ static struct publication *tipc_nameseq_insert_publ(struct 
net *net,
info = sseq->info;
 
/* Check if an identical publication already exists */
-   list_for_each_entry(publ, >zone_list, zone_list) {
+   list_for_each_entry(publ, >cluster_list, cluster_list) {
if ((publ->ref == port) && (publ->key == key) &&
(!publ->node || (publ->node == node)))
return NULL;
@@ -292,7 +280,6 @@ static struct publication *tipc_nameseq_insert_publ(struct 
net *net,
 
INIT_LIST_HEAD(>node_list);
INIT_LIST_HEAD(>cluster_list);
-   INIT_LIST_HEAD(>zone_list);
 
/* Insert new sub-sequence */
sseq = >sseqs[inspos];
@@ -311,18 +298,10 @@ static struct publication 
*tipc_nameseq_insert_publ(struct net *net,
if (!publ)
return NULL;
 
-   list_add(>zone_list, >zone_list);
-   info->zone_list_size++;
-
-   if (in_own_cluster(net, node)) {
-   list_add(>cluster_list, >cluster_list);
-   info->cluster_list_size++;
-   }
+   list_add(>cluster_list, >cluster_list);
 
-   if (in_own_node(net, node)) {
+   if (in_own_node(net, node))
list_add(>node_list, >node_list);
-   info->node_list_size++;
-   }
 
/* Any subscriptions waiting for notification?  */
list_for_each_entry_safe(s, st, >subscriptions, nameseq_list) {
@@ -363,7 +342,7 @@ static struct publication *tipc_nameseq_remove_publ(struct 
net *net,
info = sseq->info;
 
/* Locate publication, if it exists */
-   list_for_each_entry(publ, >zone_list, zone_list) {
+   list_for_each_entry(publ, >cluster_list, cluster_list) {
if ((publ->key == key) && (publ->ref == ref) &&
(!publ->node || (publ->node == node)))
goto found;
@@ -371,24 +350,12 @@ static struct publication 
*tipc_nameseq_remove_publ(struct net *net,
return NULL;
 
 found:
-   /* Remove publication from zone scope list */
-   list_del(>zone_list);
-   info->zone_list_size--;
-
-   /* Remove publication from cluster scope list, if present */
-   if (in_own_cluster(net, node)) {
-   list_del(>cluster_list);
-   info->cluster_list_size--;
-   }
-
-   /* Remove publication from node scope list, if present */
-   if (in_own_node(net, node)) {
+   list_del(>cluster_list);
+   if (in_own_node(net, node))
list_del(>node_list);
-   info->node_list_size--;
-   }
 
/* Contract subseq list if no more publications for that subseq */
-   if (list_empty(>zone_list)) {
+   if (list_empty(>cluster_list)) {
kfree(info);
free = >sseqs[nseq->first_free--];
memmove(sseq, sseq + 1, (

[net-next 0/5] tipc: obsolete zone concept

2018-03-15 Thread Jon Maloy
Functionality related to the 'zone' concept was never implemented in
TIPC. In this series we eliminate the remaining traces of it in the 
code, and can hence take a first step in reducing the footprint and
complexity of the binding table.

Jon Maloy (5):
  tipc: obsolete TIPC_ZONE_SCOPE
  tipc: remove zone publication list in name table
  tipc: remove zone_list member in struct publication
  tipc: merge two lists in struct publication
  tipc: some name changes

 include/uapi/linux/tipc.h | 102 ---
 net/tipc/addr.c   |  31 ---
 net/tipc/addr.h   |  10 +++
 net/tipc/core.h   |   5 ++
 net/tipc/msg.c|   2 +-
 net/tipc/name_distr.c |  63 +++---
 net/tipc/name_distr.h |   2 +-
 net/tipc/name_table.c | 206 ++
 net/tipc/name_table.h |  54 ++--
 net/tipc/net.c|   2 +-
 net/tipc/socket.c |  29 ---
 11 files changed, 227 insertions(+), 279 deletions(-)

-- 
2.1.4



[net-next 1/5] tipc: obsolete TIPC_ZONE_SCOPE

2018-03-15 Thread Jon Maloy
Publications for TIPC_CLUSTER_SCOPE and TIPC_ZONE_SCOPE are in all
aspects handled the same way, both on the publishing node and on the
receiving nodes.

Despite previous ambitions to the contrary, this is never going to change,
so we take the consequence of this and obsolete TIPC_ZONE_SCOPE and related
macros/functions. Whenever a user is doing a bind() or a sendmsg() attempt
using ZONE_SCOPE we translate this internally to CLUSTER_SCOPE, while we
remain compatible with users and remote nodes still using ZONE_SCOPE.

Furthermore, the non-formalized scope value 0 has always been permitted
for use during lookup, with the same meaning as ZONE_SCOPE/CLUSTER_SCOPE.
We now permit it even as binding scope, but for compatibility reasons we
choose to not change the value of TIPC_CLUSTER_SCOPE.
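
The scope handling described above can be illustrated in user space. This
is a minimal sketch, not the kernel code: normalize_scope() and the EX_
prefixed names are hypothetical, and only the numeric values are taken
from the uapi header touched by this patch.

```c
#include <assert.h>

/* Numeric values as they appear in include/uapi/linux/tipc.h in this patch.
 * The EX_ names are illustrative stand-ins, not kernel identifiers.
 */
#define EX_TIPC_ZONE_SCOPE     1	/* deprecated */
#define EX_TIPC_CLUSTER_SCOPE  2
#define EX_TIPC_NODE_SCOPE     3

/* Hypothetical helper: map a user-supplied scope to the internal one.
 * Scope 0 and the deprecated zone scope are both treated as cluster
 * scope; anything unknown yields -1.
 */
static int normalize_scope(int scope)
{
	switch (scope) {
	case 0:
	case EX_TIPC_ZONE_SCOPE:
	case EX_TIPC_CLUSTER_SCOPE:
		return EX_TIPC_CLUSTER_SCOPE;
	case EX_TIPC_NODE_SCOPE:
		return EX_TIPC_NODE_SCOPE;
	default:
		return -1;
	}
}
```

Keeping TIPC_CLUSTER_SCOPE at 2 means that binaries built against the old
header keep working, while new code may pass 0 for the same effect.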

Acked-by: Ying Xue <ying@windriver.com>
Signed-off-by: Jon Maloy <jon.ma...@ericsson.com>
---
 include/uapi/linux/tipc.h | 102 --
 net/tipc/addr.c   |  31 --
 net/tipc/addr.h   |  10 +
 net/tipc/msg.c|   2 +-
 net/tipc/name_table.c |   3 +-
 net/tipc/net.c|   2 +-
 net/tipc/socket.c |  15 ---
 7 files changed, 77 insertions(+), 88 deletions(-)

diff --git a/include/uapi/linux/tipc.h b/include/uapi/linux/tipc.h
index 14bacc7..4ac9f1f 100644
--- a/include/uapi/linux/tipc.h
+++ b/include/uapi/linux/tipc.h
@@ -61,50 +61,6 @@ struct tipc_name_seq {
__u32 upper;
 };
 
-/* TIPC Address Size, Offset, Mask specification for Z.C.N
- */
-#define TIPC_NODE_BITS  12
-#define TIPC_CLUSTER_BITS   12
-#define TIPC_ZONE_BITS  8
-
-#define TIPC_NODE_OFFSET0
-#define TIPC_CLUSTER_OFFSET TIPC_NODE_BITS
-#define TIPC_ZONE_OFFSET(TIPC_CLUSTER_OFFSET + TIPC_CLUSTER_BITS)
-
-#define TIPC_NODE_SIZE  ((1UL << TIPC_NODE_BITS) - 1)
-#define TIPC_CLUSTER_SIZE   ((1UL << TIPC_CLUSTER_BITS) - 1)
-#define TIPC_ZONE_SIZE  ((1UL << TIPC_ZONE_BITS) - 1)
-
-#define TIPC_NODE_MASK (TIPC_NODE_SIZE << TIPC_NODE_OFFSET)
-#define TIPC_CLUSTER_MASK  (TIPC_CLUSTER_SIZE << TIPC_CLUSTER_OFFSET)
-#define TIPC_ZONE_MASK (TIPC_ZONE_SIZE << TIPC_ZONE_OFFSET)
-
-#define TIPC_ZONE_CLUSTER_MASK (TIPC_ZONE_MASK | TIPC_CLUSTER_MASK)
-
-static inline __u32 tipc_addr(unsigned int zone,
- unsigned int cluster,
- unsigned int node)
-{
-   return (zone << TIPC_ZONE_OFFSET) |
-   (cluster << TIPC_CLUSTER_OFFSET) |
-   node;
-}
-
-static inline unsigned int tipc_zone(__u32 addr)
-{
-   return addr >> TIPC_ZONE_OFFSET;
-}
-
-static inline unsigned int tipc_cluster(__u32 addr)
-{
-   return (addr & TIPC_CLUSTER_MASK) >> TIPC_CLUSTER_OFFSET;
-}
-
-static inline unsigned int tipc_node(__u32 addr)
-{
-   return addr & TIPC_NODE_MASK;
-}
-
 /*
  * Application-accessible port name types
  */
@@ -117,9 +73,10 @@ static inline unsigned int tipc_node(__u32 addr)
 /*
  * Publication scopes when binding port names and port name sequences
  */
-#define TIPC_ZONE_SCOPE 1
-#define TIPC_CLUSTER_SCOPE  2
-#define TIPC_NODE_SCOPE 3
+enum tipc_scope {
+   TIPC_CLUSTER_SCOPE = 2, /* 0 can also be used */
+   TIPC_NODE_SCOPE= 3
+};
 
 /*
  * Limiting values for messages
@@ -243,7 +200,7 @@ struct sockaddr_tipc {
 struct tipc_group_req {
__u32 type;  /* group id */
__u32 instance;  /* member id */
-   __u32 scope; /* zone/cluster/node */
+   __u32 scope; /* cluster/node */
__u32 flags;
 };
 
@@ -268,4 +225,53 @@ struct tipc_sioc_ln_req {
__u32 bearer_id;
char linkname[TIPC_MAX_LINK_NAME];
 };
+
+
+/* The macros and functions below are deprecated:
+ */
+
+#define TIPC_ZONE_SCOPE 1
+
+#define TIPC_NODE_BITS  12
+#define TIPC_CLUSTER_BITS   12
+#define TIPC_ZONE_BITS  8
+
+#define TIPC_NODE_OFFSET0
+#define TIPC_CLUSTER_OFFSET TIPC_NODE_BITS
+#define TIPC_ZONE_OFFSET(TIPC_CLUSTER_OFFSET + TIPC_CLUSTER_BITS)
+
+#define TIPC_NODE_SIZE  ((1UL << TIPC_NODE_BITS) - 1)
+#define TIPC_CLUSTER_SIZE   ((1UL << TIPC_CLUSTER_BITS) - 1)
+#define TIPC_ZONE_SIZE  ((1UL << TIPC_ZONE_BITS) - 1)
+
+#define TIPC_NODE_MASK (TIPC_NODE_SIZE << TIPC_NODE_OFFSET)
+#define TIPC_CLUSTER_MASK  (TIPC_CLUSTER_SIZE << TIPC_CLUSTER_OFFSET)
+#define TIPC_ZONE_MASK (TIPC_ZONE_SIZE << TIPC_ZONE_OFFSET)
+
+#define TIPC_ZONE_CLUSTER_MASK (TIPC_ZONE_MASK | TIPC_CLUSTER_MASK)
+
+static inline __u32 tipc_addr(unsigned int zone,
+ unsigned int cluster,
+ unsigned int node)
+{
+   return (zone << TIPC_ZONE_OFFSET) |
+   (cluster << TIPC_CLUSTER_OFFSET) |
+   node;

[net-next 1/1] tipc: sockopt(TIPC_SO_RCVBUF) for setting receive buffer

2018-02-27 Thread Jon Maloy
From: Hoang Le <hoang.h...@dektech.com.au>

We introduce a set/getsockopt for setting socket receive buffer per
individual socket. This has turned out to sometimes be necessary for
anycast and multicast receivers when used without flow control.

Signed-off-by: Hoang Le <hoang.h...@dektech.com.au>
Signed-off-by: Jon Maloy <jon.ma...@ericsson.com>
---
 include/uapi/linux/tipc.h | 1 +
 net/tipc/socket.c | 8 
 2 files changed, 9 insertions(+)

diff --git a/include/uapi/linux/tipc.h b/include/uapi/linux/tipc.h
index 14bacc7..8ce4a69 100644
--- a/include/uapi/linux/tipc.h
+++ b/include/uapi/linux/tipc.h
@@ -233,6 +233,7 @@ struct sockaddr_tipc {
 #define TIPC_MCAST_REPLICAST134 /* Default: TIPC selects. No arg */
 #define TIPC_GROUP_JOIN 135 /* Takes struct tipc_group_req* */
 #define TIPC_GROUP_LEAVE136 /* No argument */
+#define TIPC_SO_RCVBUF  137 /* Range tipc_rmem_min:tipc_rmem_max */
 
 /*
  * Flag values
diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index 8b04e60..cfd519b 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -2829,6 +2829,7 @@ static int tipc_setsockopt(struct socket *sock, int lvl, 
int opt,
case TIPC_SRC_DROPPABLE:
case TIPC_DEST_DROPPABLE:
case TIPC_CONN_TIMEOUT:
+   case TIPC_SO_RCVBUF:
if (ol < sizeof(value))
return -EINVAL;
if (get_user(value, (u32 __user *)ov))
@@ -2877,6 +2878,10 @@ static int tipc_setsockopt(struct socket *sock, int lvl, 
int opt,
case TIPC_GROUP_LEAVE:
res = tipc_sk_leave(tsk);
break;
+   case TIPC_SO_RCVBUF:
+   value = max_t(int, value, sysctl_tipc_rmem[0]);
+   sk->sk_rcvbuf = min_t(int, value, sysctl_tipc_rmem[2]);
+   break;
default:
res = -EINVAL;
}
@@ -2945,6 +2950,9 @@ static int tipc_getsockopt(struct socket *sock, int lvl, 
int opt,
tipc_group_self(tsk->group, , );
value = seq.type;
break;
+   case TIPC_SO_RCVBUF:
+   value = sk->sk_rcvbuf;
+   break;
default:
res = -EINVAL;
}
-- 
2.1.4
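
The clamping done in the TIPC_SO_RCVBUF case of tipc_setsockopt() above can
be tried out in user space. This is a minimal sketch: clamp_rcvbuf() and
the two example limits are assumptions standing in for sysctl_tipc_rmem[0]
and sysctl_tipc_rmem[2], whose real values depend on the running system.

```c
#include <assert.h>

/* Hypothetical limits standing in for sysctl_tipc_rmem[0] (minimum) and
 * sysctl_tipc_rmem[2] (maximum); the real values are system dependent.
 */
#define RMEM_MIN_EXAMPLE  4096
#define RMEM_MAX_EXAMPLE  (4 * 1024 * 1024)

/* Same order as in the patch: raise the request to the minimum first,
 * then cap it at the maximum.
 */
static int clamp_rcvbuf(int value, int rmem_min, int rmem_max)
{
	assert(rmem_min <= rmem_max);
	if (value < rmem_min)
		value = rmem_min;
	if (value > rmem_max)
		value = rmem_max;
	return value;
}
```

A request outside the permitted range is thus silently adjusted rather than
rejected, so a caller that cares about the effective size would read it
back with getsockopt().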



[net 1/1] tipc: correct initial value for group congestion flag

2018-02-26 Thread Jon Maloy
In commit 60c253069632 ("tipc: fix race between poll() and
setsockopt()") we introduced a pointer from struct tipc_group to the
'group_is_connected' flag in struct tipc_sock, so that this field can
be checked without dereferencing the group pointer of the latter struct.

The initial value for this flag is correctly set to 'false' when a
group is created, but we miss the case when no group is created at
all, in which case the initial value should be 'true'. This has the
effect that SOCK_RDM/DGRAM sockets sending datagrams never receive
POLLOUT if they request so.

This commit corrects this bug.

Reported-by: Hoang Le <hoang.h...@dektek.com.au>
Signed-off-by: Jon Maloy <jon.ma...@ericsson.com>
---
 net/tipc/group.c  | 1 +
 net/tipc/socket.c | 1 +
 2 files changed, 2 insertions(+)

diff --git a/net/tipc/group.c b/net/tipc/group.c
index 03086cc..d7a7bef 100644
--- a/net/tipc/group.c
+++ b/net/tipc/group.c
@@ -189,6 +189,7 @@ struct tipc_group *tipc_group_create(struct net *net, u32 
portid,
grp->loopback = mreq->flags & TIPC_GROUP_LOOPBACK;
grp->events = mreq->flags & TIPC_GROUP_MEMBER_EVTS;
grp->open = group_is_open;
+   *grp->open = false;
filter |= global ? TIPC_SUB_CLUSTER_SCOPE : TIPC_SUB_NODE_SCOPE;
if (tipc_topsrv_kern_subscr(net, portid, type, 0, ~0,
filter, >subid))
diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index f934771..8b04e60 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -473,6 +473,7 @@ static int tipc_sk_create(struct net *net, struct socket 
*sock,
sk->sk_write_space = tipc_write_space;
sk->sk_destruct = tipc_sock_destruct;
tsk->conn_timeout = CONN_TIMEOUT_DEFAULT;
+   tsk->group_is_open = true;
atomic_set(>dupl_rcvcnt, 0);
 
/* Start out with safe limits until we receive an advertised window */
-- 
2.1.4
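
The two initializations added by this patch can be modelled in user space.
This is a minimal sketch with hypothetical ex_ names; it assumes that the
poll path tests the flag before reporting writability, which the commit
message implies but this excerpt does not show.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-in for the relevant part of struct tipc_sock. */
struct ex_sock {
	bool group_is_open;	/* assumed to gate POLLOUT on the poll path */
};

/* Illustrative stand-in for the relevant part of struct tipc_group. */
struct ex_group {
	bool *open;		/* points at the owning socket's flag */
};

/* Socket creation: no group yet, so sending must not be blocked. */
static void ex_sock_init(struct ex_sock *sk)
{
	sk->group_is_open = true;
}

/* Group creation: the flag starts out false until the group opens. */
static void ex_group_create(struct ex_group *grp, struct ex_sock *sk)
{
	grp->open = &sk->group_is_open;
	*grp->open = false;
}

/* Assumed poll-side check: writable only while the flag is set. */
static bool ex_writable(const struct ex_sock *sk)
{
	return sk->group_is_open;
}
```

Without the initialization in ex_sock_init(), a zero-initialized socket
that never joins a group would report itself as not writable forever.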



RE: [PATCH net-next] tipc: don't call sock_release() in atomic context

2018-02-19 Thread Jon Maloy


> -Original Message-
> From: netdev-ow...@vger.kernel.org [mailto:netdev-
> ow...@vger.kernel.org] On Behalf Of Paolo Abeni
> Sent: Monday, February 19, 2018 19:02
> To: netdev@vger.kernel.org
> Cc: Jon Maloy <jon.ma...@ericsson.com>; Ying Xue
> <ying@windriver.com>; David S. Miller <da...@davemloft.net>
> Subject: [PATCH net-next] tipc: don't call sock_release() in atomic context
> 
> syzbot reported a scheduling while atomic issue at netns destruction time:
> 
> BUG: sleeping function called from invalid context at net/core/sock.c:2769
> in_atomic(): 1, irqs_disabled(): 0, pid: 85, name: kworker/u4:3
> 5 locks held by kworker/u4:3/85:
>   #0:  ((wq_completion)"%s""netns"){+.+.}, at: [<c9792deb>]
> process_one_work+0xaaf/0x1af0 kernel/workqueue.c:2084
>   #1:  (net_cleanup_work){+.+.}, at: [<adc12e2a>]
> process_one_work+0xb01/0x1af0 kernel/workqueue.c:2088
>   #2:  (net_sem){}, at: [<9ccb5669>] cleanup_net+0x23f/0xd20
> net/core/net_namespace.c:494
>   #3:  (net_mutex){+.+.}, at: [<a92767d9>]
> cleanup_net+0xa7d/0xd20
> net/core/net_namespace.c:496
>   #4:  (&(>idr_lock)->rlock){+...}, at: [<1343e568>]
> spin_lock_bh include/linux/spinlock.h:315 [inline]
>   #4:  (&(>idr_lock)->rlock){+...}, at: [<1343e568>]
> tipc_topsrv_stop+0x231/0x610 net/tipc/topsrv.c:685
> CPU: 0 PID: 85 Comm: kworker/u4:3 Not tainted 4.16.0-rc1+ #230 Hardware
> name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> Workqueue: netns cleanup_net
> Call Trace:
>   __dump_stack lib/dump_stack.c:17 [inline]
>   dump_stack+0x194/0x257 lib/dump_stack.c:53
>   ___might_sleep+0x2b2/0x470 kernel/sched/core.c:6128
>   __might_sleep+0x95/0x190 kernel/sched/core.c:6081
>   lock_sock_nested+0x37/0x110 net/core/sock.c:2769
>   lock_sock include/net/sock.h:1463 [inline]
>   tipc_release+0x103/0xff0 net/tipc/socket.c:572
>   sock_release+0x8d/0x1e0 net/socket.c:594
>   tipc_topsrv_stop+0x3c0/0x610 net/tipc/topsrv.c:696
>   tipc_exit_net+0x15/0x40 net/tipc/core.c:96
>   ops_exit_list.isra.6+0xae/0x150 net/core/net_namespace.c:148
>   cleanup_net+0x6ba/0xd20 net/core/net_namespace.c:529
>   process_one_work+0xbbf/0x1af0 kernel/workqueue.c:2113
>   worker_thread+0x223/0x1990 kernel/workqueue.c:2247
>   kthread+0x33c/0x400 kernel/kthread.c:238
>   ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:429
> 
> This is caused by tipc_topsrv_stop() releasing the listener socket with the
> idr lock held. This changeset addresses the issue by moving the release
> operation outside such lock.

Thank you Paolo. This was too obvious for me to catch ☹
Acked-by:  ///jon

> 
> Reported-and-tested-by:
> syzbot+749d9d87c294c00ca...@syzkaller.appspotmail.com
> Fixes: 0ef897be12b8 ("tipc: separate topology server listener socket from
> subcsriber sockets")
> Signed-off-by: Paolo Abeni <pab...@redhat.com>
> ---
>  net/tipc/topsrv.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/tipc/topsrv.c b/net/tipc/topsrv.c index
> 02013e00f287..63f35eae7236 100644
> --- a/net/tipc/topsrv.c
> +++ b/net/tipc/topsrv.c
> @@ -693,9 +693,9 @@ void tipc_topsrv_stop(struct net *net)
>   }
>   __module_get(lsock->ops->owner);
>   __module_get(lsock->sk->sk_prot_creator->owner);
> - sock_release(lsock);
>   srv->listener = NULL;
>   spin_unlock_bh(>idr_lock);
> + sock_release(lsock);
>   tipc_topsrv_work_stop(srv);
>   idr_destroy(>conn_idr);
>   kfree(srv);
> --
> 2.14.3
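
The fix quoted above follows a common pattern: detach the pointer while
holding the lock, then call the function that may sleep after the lock has
been dropped. This is a minimal user-space sketch; all fake_ names are
hypothetical, and the lock is only a flag that lets the example detect a
release made in the wrong context.

```c
#include <assert.h>
#include <stddef.h>

struct fake_sock {
	int released;
};

struct fake_srv {
	int lock_held;			/* stands in for idr_lock */
	struct fake_sock *listener;
};

static void fake_lock(struct fake_srv *s)
{
	assert(!s->lock_held);
	s->lock_held = 1;
}

static void fake_unlock(struct fake_srv *s)
{
	assert(s->lock_held);
	s->lock_held = 0;
}

/* Stands in for sock_release(), which may sleep in the kernel.
 * Returns -1 if called while the lock is held, 0 otherwise.
 */
static int fake_sock_release(struct fake_srv *s, struct fake_sock *sk)
{
	if (s->lock_held)
		return -1;
	sk->released = 1;
	return 0;
}

/* Detach the listener under the lock, release it after unlocking. */
static int fake_srv_stop(struct fake_srv *s)
{
	struct fake_sock *lsock;

	fake_lock(s);
	lsock = s->listener;
	s->listener = NULL;
	fake_unlock(s);

	return lsock ? fake_sock_release(s, lsock) : 0;
}
```

Other users of the lock see the listener as gone as soon as the lock is
dropped, while the potentially sleeping release runs in a context where
sleeping is allowed.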



RE: [net-next v2 1/1] tipc: avoid unnecessary copying of bundled messages

2018-02-19 Thread Jon Maloy


> -Original Message-
> From: David Laight [mailto:david.lai...@aculab.com]
> Sent: Monday, February 19, 2018 14:30
> To: Jon Maloy <jon.ma...@ericsson.com>
> Cc: netdev@vger.kernel.org; Mohan Krishna Ghanta Krishnamurthy
> <mohan.krishna.ghanta.krishnamur...@ericsson.com>; Tung Quang Nguyen
> <tung.q.ngu...@dektech.com.au>; Hoang Huu Le
> <hoang.h...@dektech.com.au>; Canh Duc Luu
> <canh.d@dektech.com.au>; Ying Xue <ying@windriver.com>; tipc-
> discuss...@lists.sourceforge.net
> Subject: RE: [net-next v2 1/1] tipc: avoid unnecessary copying of bundled
> messages
> 
> From: Jon Maloy <jon.ma...@ericsson.com>
> Date: Thu, 15 Feb 2018 14:14:37 +0100
> 
> > A received sk buffer may contain dozens of smaller 'bundled' messages
> > which after extraction go each in their own direction.
> >
> > Unfortunately, when we extract those messages using skb_clone() each
> > of the extracted buffers inherit the truesize value of the original
> > buffer. Apart from causing massive overaccounting of the base buffer's
> > memory, this often causes tipc_msg_validate() to come to the false
> > conclusion that the ratio truesize/datasize > 4, and perform an
> > unnecessary copying of the extracted buffer.
> >
> > We now fix this problem by explicitly correcting the truesize value of
> > the buffer clones to be the truesize of the clone itself plus a
> > calculated fraction of the base buffer's overhead. This change
> > eliminates the overaccounting and at least mitigates the occurrence of
> > unnecessary buffer copying.
> 
> Have you actually checked that copying the data when you extract the
> messages isn't faster than cloning and trying to avoid the copy?
> Copying at the point is probably cheaper because it leads to a simpler
> message structure.

Yes, that is probably what I'll end up doing, if copying is unavoidable anyway.

///jon

> 
>   David



RE: BUG: sleeping function called from invalid context at net/core/sock.c:LINE (3)

2018-02-19 Thread Jon Maloy
I don't understand this one. tipc_topsrv_stop() can only be triggered by a user
doing rmmod(), and I double checked that this is running in user mode.
How does the call chain you are reporting occur?

///jon


> -Original Message-
> From: Kirill Tkhai [mailto:ktk...@virtuozzo.com]
> Sent: Saturday, February 17, 2018 23:23
> To: Dmitry Vyukov <dvyu...@google.com>; syzbot
> <syzbot+749d9d87c294c00ca...@syzkaller.appspotmail.com>; Jon Maloy
> <jon.ma...@ericsson.com>; Ying Xue <ying@windriver.com>
> Cc: Andrei Vagin <ava...@virtuozzo.com>; David Miller
> <da...@davemloft.net>; Eric W. Biederman <ebied...@xmission.com>;
> Florian Westphal <f...@strlen.de>; LKML <linux-ker...@vger.kernel.org>;
> netdev <netdev@vger.kernel.org>; Nicolas Dichtel
> <nicolas.dich...@6wind.com>; roman.k...@sysgo.com; syzkaller-
> b...@googlegroups.com; tipc-discuss...@lists.sourceforge.net
> Subject: Re: BUG: sleeping function called from invalid context at
> net/core/sock.c:LINE (3)
> 
> On 17.02.2018 11:15, Dmitry Vyukov wrote:
> > On Sat, Feb 17, 2018 at 4:00 AM, syzbot
> > <syzbot+749d9d87c294c00ca...@syzkaller.appspotmail.com> wrote:
> >> Hello,
> >>
> >> syzbot hit the following crash on net-next commit
> >> 65bd449c32c2745df61913ab54087e77f9d9b70d (Fri Feb 16 20:26:35 2018
> >> +) Merge branch 'tipc-de-generealize-topology-server'
> >
> > +tipc maintainers
> 
> This looks to be caused by commit 0ef897be12b8
> "tipc: separate topology server listener socket from subcsriber sockets"
> 
> Thanks,
> Kirill


[net-next 1/1] tipc: fix bug on error path in tipc_topsrv_kern_subscr()

2018-02-19 Thread Jon Maloy
In commit cc1ea9ffadf7 ("tipc: eliminate struct tipc_subscriber") we
re-introduced an old bug on the error path in the function
tipc_topsrv_kern_subscr(). We now re-introduce the correction too.

Reported-by: syzbot+f62e0f2a0ef578703...@syzkaller.appspotmail.com
Signed-off-by: Jon Maloy <jon.ma...@ericsson.com>
---
 net/tipc/topsrv.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/net/tipc/topsrv.c b/net/tipc/topsrv.c
index 02013e0..25925be 100644
--- a/net/tipc/topsrv.c
+++ b/net/tipc/topsrv.c
@@ -580,9 +580,10 @@ bool tipc_topsrv_kern_subscr(struct net *net, u32 port, 
u32 type, u32 lower,
*conid = con->conid;
con->sock = NULL;
rc = tipc_conn_rcv_sub(tipc_topsrv(net), con, );
-   if (rc < 0)
-   tipc_conn_close(con);
-   return !rc;
+   if (rc >= 0)
+   return true;
+   conn_put(con);
+   return false;
 }
 
 void tipc_topsrv_kern_unsubscr(struct net *net, int conid)
-- 
2.1.4



RE: [net-next v2 1/1] tipc: avoid unnecessary copying of bundled messages

2018-02-17 Thread Jon Maloy


> -Original Message-
> From: David Miller [mailto:da...@davemloft.net]
> Sent: Friday, February 16, 2018 21:33
> To: Jon Maloy <jon.ma...@ericsson.com>
> Cc: netdev@vger.kernel.org; Mohan Krishna Ghanta Krishnamurthy
> <mohan.krishna.ghanta.krishnamur...@ericsson.com>; Tung Quang Nguyen
> <tung.q.ngu...@dektech.com.au>; Hoang Huu Le
> <hoang.h...@dektech.com.au>; Canh Duc Luu
> <canh.d@dektech.com.au>; Ying Xue <ying@windriver.com>; tipc-
> discuss...@lists.sourceforge.net
> Subject: Re: [net-next v2 1/1] tipc: avoid unnecessary copying of bundled
> messages
> 
> From: Jon Maloy <jon.ma...@ericsson.com>
> Date: Thu, 15 Feb 2018 14:14:37 +0100
> 
> > A received sk buffer may contain dozens of smaller 'bundled' messages
> > which after extraction go each in their own direction.
> >
> > Unfortunately, when we extract those messages using skb_clone() each
> > of the extracted buffers inherit the truesize value of the original
> > buffer. Apart from causing massive overaccounting of the base buffer's
> > memory, this often causes tipc_msg_validate() to come to the false
> > conclusion that the ratio truesize/datasize > 4, and perform an
> > unnecessary copying of the extracted buffer.
> >
> > We now fix this problem by explicitly correcting the truesize value of
> > the buffer clones to be the truesize of the clone itself plus a
> > calculated fraction of the base buffer's overhead. This change
> > eliminates the overaccounting and at least mitigates the occurrence of
> > unnecessary buffer copying.
> >
> > Reported-by: Hoang Le <hoang.h...@dektek.com.au>
> > Acked-by: Ying Xue <ying@windriver.com>
> > Signed-off-by: Jon Maloy <jon.ma...@ericsson.com>
> 
> As I explained in my previous two emails, I don't think this method of
> accounting is appropriate.
> 
> All of your clones must use the same skb->truesize as the original SKB
> because each and every one of them keeps the full buffer from being
> liberated until they are released.

I understand what you are saying, although I am not happy with its consequences 
in this case. I guess I will just leave it the way it is until I can come up 
with something smarter.
///jon


[net-next v2 1/1] tipc: avoid unnecessary copying of bundled messages

2018-02-15 Thread Jon Maloy
A received sk buffer may contain dozens of smaller 'bundled' messages
which after extraction go each in their own direction.

Unfortunately, when we extract those messages using skb_clone() each
of the extracted buffers inherit the truesize value of the original
buffer. Apart from causing massive overaccounting of the base buffer's
memory, this often causes tipc_msg_validate() to come to the false
conclusion that the ratio truesize/datasize > 4, and perform an
unnecessary copying of the extracted buffer.

We now fix this problem by explicitly correcting the truesize value of
the buffer clones to be the truesize of the clone itself plus a
calculated fraction of the base buffer's overhead. This change
eliminates the overaccounting and at least mitigates the occurrence
of unnecessary buffer copying.

Reported-by: Hoang Le <hoang.h...@dektek.com.au>
Acked-by: Ying Xue <ying@windriver.com>
Signed-off-by: Jon Maloy <jon.ma...@ericsson.com>
---
 net/tipc/msg.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/net/tipc/msg.c b/net/tipc/msg.c
index 4e1c6f6..ce0bfc4 100644
--- a/net/tipc/msg.c
+++ b/net/tipc/msg.c
@@ -416,8 +416,8 @@ bool tipc_msg_bundle(struct sk_buff *skb, struct tipc_msg 
*msg, u32 mtu)
  */
 bool tipc_msg_extract(struct sk_buff *skb, struct sk_buff **iskb, int *pos)
 {
+   int imsz, offset, clone_cnt, skb_overhead;
struct tipc_msg *msg;
-   int imsz, offset;
 
*iskb = NULL;
if (unlikely(skb_linearize(skb)))
@@ -434,6 +434,11 @@ bool tipc_msg_extract(struct sk_buff *skb, struct sk_buff 
**iskb, int *pos)
skb_pull(*iskb, offset);
imsz = msg_size(buf_msg(*iskb));
skb_trim(*iskb, imsz);
+
+   /* Scale extracted buffer's truesize to avoid double accounting */
+   clone_cnt = max_t(u32, 1, msg_msgcnt(msg));
+   skb_overhead = skb->truesize - skb->len;
+   (*iskb)->truesize = SKB_TRUESIZE(imsz) + skb_overhead / clone_cnt;
if (unlikely(!tipc_msg_validate(iskb)))
goto none;
*pos += align(imsz);
-- 
2.1.4
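
The arithmetic proposed in this patch can be worked through in user space.
This is a minimal sketch of the calculation only; note that the replies
earlier in this archive record an objection to this method of accounting.
scaled_truesize() and SKB_FIXED_COST_EXAMPLE are hypothetical: the kernel's
SKB_TRUESIZE() adds the aligned sizes of struct sk_buff and struct
skb_shared_info, for which a made-up constant is used here.

```c
#include <assert.h>

/* Made-up fixed per-buffer cost, standing in for what SKB_TRUESIZE()
 * adds on top of the payload size in the kernel.
 */
#define SKB_FIXED_COST_EXAMPLE 512u

/* Truesize of one extracted clone as proposed in the patch: its own
 * payload plus fixed cost, plus an equal share of the base buffer's
 * overhead (truesize minus data length).
 */
static unsigned int scaled_truesize(unsigned int base_truesize,
				    unsigned int base_len,
				    unsigned int msgcnt,
				    unsigned int imsz)
{
	unsigned int clone_cnt = msgcnt ? msgcnt : 1;	/* max(1, msgcnt) */
	unsigned int overhead;

	assert(base_truesize >= base_len);
	overhead = base_truesize - base_len;
	return imsz + SKB_FIXED_COST_EXAMPLE + overhead / clone_cnt;
}
```

With a base buffer of truesize 2048 carrying 1400 bytes in 10 bundled
messages, each 100-byte clone is charged 100 + 512 + 648 / 10 = 676 bytes
instead of inheriting the full 2048.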



[net-next 09/10] tipc: separate topology server listener socket from subcsriber sockets

2018-02-15 Thread Jon Maloy
We move the listener socket to struct tipc_server and give it its own
work item. This makes it easier to follow the code, and entails some
simplifications in the reception code in subscriber sockets.

Acked-by: Ying Xue <ying@windriver.com>
Signed-off-by: Jon Maloy <jon.ma...@ericsson.com>
---
 net/tipc/server.c | 328 --
 1 file changed, 147 insertions(+), 181 deletions(-)

diff --git a/net/tipc/server.c b/net/tipc/server.c
index 0abbdd6..0e351e8 100644
--- a/net/tipc/server.c
+++ b/net/tipc/server.c
@@ -64,7 +64,6 @@
  * @tipc_conn_new: callback will be called when new connection is incoming
  * @tipc_conn_release: callback will be called before releasing the connection
  * @tipc_conn_recvmsg: callback will be called when message arrives
- * @saddr: TIPC server address
  * @name: server name
  * @imp: message importance
  * @type: socket type
@@ -74,10 +73,11 @@ struct tipc_server {
spinlock_t idr_lock; /* for idr list */
int idr_in_use;
struct net *net;
+   struct work_struct awork;
struct workqueue_struct *rcv_wq;
struct workqueue_struct *send_wq;
int max_rcvbuf_size;
-   struct sockaddr_tipc *saddr;
+   struct socket *listener;
char name[TIPC_SERVER_NAME_LEN];
 };
 
@@ -106,7 +106,6 @@ struct tipc_conn {
struct list_head sub_list;
spinlock_t sub_lock; /* for subscription list */
struct work_struct rwork;
-   int (*rx_action) (struct tipc_conn *con);
struct list_head outqueue;
spinlock_t outqueue_lock;
struct work_struct swork;
@@ -121,12 +120,6 @@ struct outqueue_entry {
 
 static void tipc_recv_work(struct work_struct *work);
 static void tipc_send_work(struct work_struct *work);
-static void tipc_clean_outqueues(struct tipc_conn *con);
-
-static struct tipc_conn *sock2con(struct sock *sk)
-{
-   return sk->sk_user_data;
-}
 
 static bool connected(struct tipc_conn *con)
 {
@@ -137,21 +130,21 @@ static void tipc_conn_kref_release(struct kref *kref)
 {
struct tipc_conn *con = container_of(kref, struct tipc_conn, kref);
struct tipc_server *s = con->server;
-   struct socket *sock = con->sock;
+   struct outqueue_entry *e, *safe;
 
-   if (sock) {
-   if (test_bit(CF_SERVER, >flags)) {
-   __module_get(sock->ops->owner);
-   __module_get(sock->sk->sk_prot_creator->owner);
-   }
-   sock_release(sock);
-   con->sock = NULL;
-   }
spin_lock_bh(>idr_lock);
idr_remove(>conn_idr, con->conid);
s->idr_in_use--;
spin_unlock_bh(>idr_lock);
-   tipc_clean_outqueues(con);
+   if (con->sock)
+   sock_release(con->sock);
+
+   spin_lock_bh(>outqueue_lock);
+   list_for_each_entry_safe(e, safe, >outqueue, list) {
+   list_del(>list);
+   kfree(e);
+   }
+   spin_unlock_bh(>outqueue_lock);
kfree(con);
 }
 
@@ -178,14 +171,14 @@ static struct tipc_conn *tipc_conn_lookup(struct 
tipc_server *s, int conid)
 }
 
 /* sock_data_ready - interrupt callback indicating the socket has data to read
- * The queued job is launched in tipc_recv_from_sock()
+ * The queued work is launched into tipc_recv_work()->tipc_recv_from_sock()
  */
 static void sock_data_ready(struct sock *sk)
 {
struct tipc_conn *con;
 
read_lock_bh(>sk_callback_lock);
-   con = sock2con(sk);
+   con = sk->sk_user_data;
if (connected(con)) {
conn_get(con);
if (!queue_work(con->server->rcv_wq, >rwork))
@@ -195,15 +188,15 @@ static void sock_data_ready(struct sock *sk)
 }
 
 /* sock_write_space - interrupt callback after a sendmsg EAGAIN
- * Indicates that there now is more is space in the send buffer
- * The queued job is launched in tipc_send_to_sock()
+ * Indicates that there now is more space in the send buffer
+ * The queued work is launched into tipc_send_work()->tipc_send_to_sock()
  */
 static void sock_write_space(struct sock *sk)
 {
struct tipc_conn *con;
 
read_lock_bh(&sk->sk_callback_lock);
-   con = sock2con(sk);
+   con = sk->sk_user_data;
if (connected(con)) {
conn_get(con);
if (!queue_work(con->server->send_wq, &con->swork))
@@ -212,23 +205,8 @@ static void sock_write_space(struct sock *sk)
read_unlock_bh(&sk->sk_callback_lock);
 }
 
-static void tipc_register_callbacks(struct socket *sock, struct tipc_conn *con)
-{
-   struct sock *sk = sock->sk;
-
-   write_lock_bh(&sk->sk_callback_lock);
-
-   sk->sk_data_ready = sock_data_ready;
-   sk->sk_write_space = sock_write_space;
-   sk->sk_user_data = con;
-
-   con->sock = sock;
-
-   write_unlock_bh(&sk->sk_callback_lock);
-}
-
 /

[net-next 10/10] tipc: rename tipc_server to tipc_topsrv

2018-02-15 Thread Jon Maloy
We rename struct tipc_server to struct tipc_topsrv. This reflects its now
specialized role as topology server. Accordingly, we change or add function
prefixes to make it clearer which functionality those belong to.

There are no functional changes in this commit.

Acked-by: Ying.Xue <ying@windriver.com>
Signed-off-by: Jon Maloy <jon.ma...@ericsson.com>
---
 net/tipc/Makefile |   2 +-
 net/tipc/core.h   |   6 +-
 net/tipc/group.c  |   2 +-
 net/tipc/server.c | 700 -
 net/tipc/server.h |  57 -
 net/tipc/subscr.c |   2 +-
 net/tipc/subscr.h |   2 +-
 net/tipc/topsrv.c | 702 ++
 net/tipc/topsrv.h |  54 +
 9 files changed, 763 insertions(+), 764 deletions(-)
 delete mode 100644 net/tipc/server.c
 delete mode 100644 net/tipc/server.h
 create mode 100644 net/tipc/topsrv.c
 create mode 100644 net/tipc/topsrv.h

diff --git a/net/tipc/Makefile b/net/tipc/Makefile
index 37bb0bf..1edb719 100644
--- a/net/tipc/Makefile
+++ b/net/tipc/Makefile
@@ -9,7 +9,7 @@ tipc-y  += addr.o bcast.o bearer.o \
   core.o link.o discover.o msg.o  \
   name_distr.o  subscr.o monitor.o name_table.o net.o  \
   netlink.o netlink_compat.o node.o socket.o eth_media.o \
-  server.o socket.o group.o
+  topsrv.o socket.o group.o
 
 tipc-$(CONFIG_TIPC_MEDIA_UDP)  += udp_media.o
 tipc-$(CONFIG_TIPC_MEDIA_IB)   += ib_media.o
diff --git a/net/tipc/core.h b/net/tipc/core.h
index 20b21af..ff8b071 100644
--- a/net/tipc/core.h
+++ b/net/tipc/core.h
@@ -64,7 +64,7 @@ struct tipc_bearer;
 struct tipc_bc_base;
 struct tipc_link;
 struct tipc_name_table;
-struct tipc_server;
+struct tipc_topsrv;
 struct tipc_monitor;
 
 #define TIPC_MOD_VER "2.0.0"
@@ -112,7 +112,7 @@ struct tipc_net {
struct list_head dist_queue;
 
/* Topology subscription server */
-   struct tipc_server *topsrv;
+   struct tipc_topsrv *topsrv;
atomic_t subscription_count;
 };
 
@@ -131,7 +131,7 @@ static inline struct list_head *tipc_nodes(struct net *net)
return &tipc_net(net)->node_list;
 }
 
-static inline struct tipc_server *tipc_topsrv(struct net *net)
+static inline struct tipc_topsrv *tipc_topsrv(struct net *net)
 {
return tipc_net(net)->topsrv;
 }
diff --git a/net/tipc/group.c b/net/tipc/group.c
index 122162a..03086cc 100644
--- a/net/tipc/group.c
+++ b/net/tipc/group.c
@@ -37,7 +37,7 @@
 #include "addr.h"
 #include "group.h"
 #include "bcast.h"
-#include "server.h"
+#include "topsrv.h"
 #include "msg.h"
 #include "socket.h"
 #include "node.h"
diff --git a/net/tipc/server.c b/net/tipc/server.c
deleted file mode 100644
index 0e351e8..000
--- a/net/tipc/server.c
+++ /dev/null
@@ -1,700 +0,0 @@
-/*
- * net/tipc/server.c: TIPC server infrastructure
- *
- * Copyright (c) 2012-2013, Wind River Systems
- * Copyright (c) 2017, Ericsson AB
- * All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions are met:
- *
- * 1. Redistributions of source code must retain the above copyright
- *notice, this list of conditions and the following disclaimer.
- * 2. Redistributions in binary form must reproduce the above copyright
- *notice, this list of conditions and the following disclaimer in the
- *documentation and/or other materials provided with the distribution.
- * 3. Neither the names of the copyright holders nor the names of its
- *contributors may be used to endorse or promote products derived from
- *this software without specific prior written permission.
- *
- * Alternatively, this software may be distributed under the terms of the
- * GNU General Public License ("GPL") version 2 as published by the Free
- * Software Foundation.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
- * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
- * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
- * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
- * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
- * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
- * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
- * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
- * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
- * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
- * POSSIBILITY OF SUCH DAMAGE.
- */
-
-#include "subscr.h"
-#include "server.h"
-#include "core.h"
-#include "socket.h"
-#include "addr.h"
-#include "msg.h"
-#include 
-#include 
-
-/*

[net-next 08/10] tipc: make struct tipc_server private for server.c

2018-02-15 Thread Jon Maloy
In order to narrow the interface and dependencies between the topology
server and the subscription/binding table functionality we move struct
tipc_server inside the file server.c. This requires some code
adaptations in other files, but those are mostly minor.

The most important change is that we have to move the start/stop
functions for the topology server to server.c, where they logically
belong anyway.
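
The narrowing described above follows the usual C opaque-type pattern.
A minimal userspace sketch, using hypothetical names (struct srv,
srv_start(), srv_name(), srv_stop()) that are not taken from the patch:
other files see only a forward declaration and a few functions, while
the layout stays in one translation unit.

```c
#include <stdlib.h>
#include <string.h>

/* What a header would expose: a forward declaration and functions only. */
struct srv;
struct srv *srv_start(const char *name);
const char *srv_name(const struct srv *s);
void srv_stop(struct srv *s);

/* What the single .c file keeps private: the actual layout. */
#define SRV_NAME_LEN 32

struct srv {
	char name[SRV_NAME_LEN];
	int conns_in_use;
};

struct srv *srv_start(const char *name)
{
	struct srv *s = calloc(1, sizeof(*s));

	if (!s)
		return NULL;
	/* calloc() zeroed the buffer, so the copy stays NUL-terminated */
	strncpy(s->name, name, SRV_NAME_LEN - 1);
	return s;
}

const char *srv_name(const struct srv *s)
{
	return s->name;
}

void srv_stop(struct srv *s)
{
	free(s);
}
```

Callers can hold a struct srv pointer but cannot touch its members, so
the start/stop functions have to live next to the definition, which is
the same constraint the commit message mentions.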

Acked-by: Ying Xue <ying@windriver.com>
Signed-off-by: Jon Maloy <jon.ma...@ericsson.com>
---
 net/tipc/name_table.c |  10 ++--
 net/tipc/server.c | 124 --
 net/tipc/server.h |  36 +--
 net/tipc/subscr.c |  64 ++
 net/tipc/subscr.h |   4 +-
 5 files changed, 110 insertions(+), 128 deletions(-)

diff --git a/net/tipc/name_table.c b/net/tipc/name_table.c
index b234b7e..e01c9c6 100644
--- a/net/tipc/name_table.c
+++ b/net/tipc/name_table.c
@@ -813,8 +813,7 @@ int tipc_nametbl_withdraw(struct net *net, u32 type, u32 
lower, u32 ref,
  */
 void tipc_nametbl_subscribe(struct tipc_subscription *sub)
 {
-   struct tipc_server *srv = sub->server;
-   struct tipc_net *tn = tipc_net(srv->net);
+   struct tipc_net *tn = tipc_net(sub->net);
struct tipc_subscr *s = &sub->evt.s;
u32 type = tipc_sub_read(s, seq.type);
int index = hash(type);
@@ -822,7 +821,7 @@ void tipc_nametbl_subscribe(struct tipc_subscription *sub)
struct tipc_name_seq ns;
 
spin_lock_bh(&tn->nametbl_lock);
-   seq = nametbl_find_seq(srv->net, type);
+   seq = nametbl_find_seq(sub->net, type);
if (!seq)
seq = tipc_nameseq_create(type, &tn->nametbl->seq_hlist[index]);
if (seq) {
@@ -844,14 +843,13 @@ void tipc_nametbl_subscribe(struct tipc_subscription *sub)
  */
 void tipc_nametbl_unsubscribe(struct tipc_subscription *sub)
 {
-   struct tipc_server *srv = sub->server;
struct tipc_subscr *s = &sub->evt.s;
-   struct tipc_net *tn = tipc_net(srv->net);
+   struct tipc_net *tn = tipc_net(sub->net);
struct name_seq *seq;
u32 type = tipc_sub_read(s, seq.type);
 
spin_lock_bh(&tn->nametbl_lock);
-   seq = nametbl_find_seq(srv->net, type);
+   seq = nametbl_find_seq(sub->net, type);
if (seq != NULL) {
spin_lock_bh(&seq->lock);
list_del_init(&sub->nameseq_list);
diff --git a/net/tipc/server.c b/net/tipc/server.c
index a5c112e..0abbdd6 100644
--- a/net/tipc/server.c
+++ b/net/tipc/server.c
@@ -49,7 +49,37 @@
 #define CF_CONNECTED   1
 #define CF_SERVER  2
 
-#define sock2con(x) ((struct tipc_conn *)(x)->sk_user_data)
+#define TIPC_SERVER_NAME_LEN   32
+
+/**
+ * struct tipc_server - TIPC server structure
+ * @conn_idr: identifier set of connection
+ * @idr_lock: protect the connection identifier set
+ * @idr_in_use: amount of allocated identifier entry
+ * @net: network namspace instance
+ * @rcvbuf_cache: memory cache of server receive buffer
+ * @rcv_wq: receive workqueue
+ * @send_wq: send workqueue
+ * @max_rcvbuf_size: maximum permitted receive message length
+ * @tipc_conn_new: callback will be called when new connection is incoming
+ * @tipc_conn_release: callback will be called before releasing the connection
+ * @tipc_conn_recvmsg: callback will be called when message arrives
+ * @saddr: TIPC server address
+ * @name: server name
+ * @imp: message importance
+ * @type: socket type
+ */
+struct tipc_server {
+   struct idr conn_idr;
+   spinlock_t idr_lock; /* for idr list */
+   int idr_in_use;
+   struct net *net;
+   struct workqueue_struct *rcv_wq;
+   struct workqueue_struct *send_wq;
+   int max_rcvbuf_size;
+   struct sockaddr_tipc *saddr;
+   char name[TIPC_SERVER_NAME_LEN];
+};
 
 /**
  * struct tipc_conn - TIPC connection structure
@@ -93,6 +123,11 @@ static void tipc_recv_work(struct work_struct *work);
 static void tipc_send_work(struct work_struct *work);
 static void tipc_clean_outqueues(struct tipc_conn *con);
 
+static struct tipc_conn *sock2con(struct sock *sk)
+{
+   return sk->sk_user_data;
+}
+
 static bool connected(struct tipc_conn *con)
 {
return con && test_bit(CF_CONNECTED, &con->flags);
@@ -198,14 +233,17 @@ static void tipc_register_callbacks(struct socket *sock, 
struct tipc_conn *con)
 static void tipc_con_delete_sub(struct tipc_conn *con, struct tipc_subscr *s)
 {
struct list_head *sub_list = &con->sub_list;
+   struct tipc_net *tn = tipc_net(con->server->net);
struct tipc_subscription *sub, *tmp;
 
spin_lock_bh(&con->sub_lock);
list_for_each_entry_safe(sub, tmp, sub_list, sub_list) {
-   if (!s || !memcmp(s, &sub->evt.s, sizeof(*s)))
+   if (!s || !memcmp(s, &sub->evt.s, sizeof(*s))) {
tipc_sub_unsubscribe(sub);
-   

[net-next 07/10] tipc: some prefix changes

2018-02-15 Thread Jon Maloy
Since we have now removed struct tipc_subscriber from the code, and
only struct tipc_subscription remains, there is no longer any need for long
and awkward prefixes to distinguish between their pertaining functions.

We now change all tipc_subscrp_* prefixes to tipc_sub_*. This is
a purely cosmetic change.

Acked-by: Ying Xue <ying@windriver.com>
Signed-off-by: Jon Maloy <jon.ma...@ericsson.com>
---
 net/tipc/name_table.c | 33 
 net/tipc/server.c |  5 ++---
 net/tipc/subscr.c | 52 +--
 net/tipc/subscr.h | 20 ++--
 4 files changed, 54 insertions(+), 56 deletions(-)

diff --git a/net/tipc/name_table.c b/net/tipc/name_table.c
index 2fbd0a2..b234b7e 100644
--- a/net/tipc/name_table.c
+++ b/net/tipc/name_table.c
@@ -326,10 +326,10 @@ static struct publication 
*tipc_nameseq_insert_publ(struct net *net,
 
/* Any subscriptions waiting for notification?  */
list_for_each_entry_safe(s, st, &nseq->subscriptions, nameseq_list) {
-   tipc_subscrp_report_overlap(s, publ->lower, publ->upper,
-   TIPC_PUBLISHED, publ->ref,
-   publ->node, publ->scope,
-   created_subseq);
+   tipc_sub_report_overlap(s, publ->lower, publ->upper,
+   TIPC_PUBLISHED, publ->ref,
+   publ->node, publ->scope,
+   created_subseq);
}
return publ;
 }
@@ -397,10 +397,9 @@ static struct publication *tipc_nameseq_remove_publ(struct 
net *net,
 
/* Notify any waiting subscriptions */
list_for_each_entry_safe(s, st, &nseq->subscriptions, nameseq_list) {
-   tipc_subscrp_report_overlap(s, publ->lower, publ->upper,
-   TIPC_WITHDRAWN, publ->ref,
-   publ->node, publ->scope,
-   removed_subseq);
+   tipc_sub_report_overlap(s, publ->lower, publ->upper,
+   TIPC_WITHDRAWN, publ->ref, publ->node,
+   publ->scope, removed_subseq);
}
 
return publ;
@@ -424,25 +423,25 @@ static void tipc_nameseq_subscribe(struct name_seq *nseq,
ns.upper = tipc_sub_read(s, seq.upper);
no_status = tipc_sub_read(s, filter) & TIPC_SUB_NO_STATUS;
 
-   tipc_subscrp_get(sub);
+   tipc_sub_get(sub);
list_add(&sub->nameseq_list, &nseq->subscriptions);
 
if (no_status || !sseq)
return;
 
while (sseq != &nseq->sseqs[nseq->first_free]) {
-   if (tipc_subscrp_check_overlap(&ns, sseq->lower, sseq->upper)) {
+   if (tipc_sub_check_overlap(&ns, sseq->lower, sseq->upper)) {
struct publication *crs;
struct name_info *info = sseq->info;
int must_report = 1;
 
list_for_each_entry(crs, &info->zone_list, zone_list) {
-   tipc_subscrp_report_overlap(sub, sseq->lower,
-   sseq->upper,
-   TIPC_PUBLISHED,
-   crs->ref, crs->node,
-   crs->scope,
-   must_report);
+   tipc_sub_report_overlap(sub, sseq->lower,
+   sseq->upper,
+   TIPC_PUBLISHED,
+   crs->ref, crs->node,
+   crs->scope,
+   must_report);
must_report = 0;
}
}
@@ -856,7 +855,7 @@ void tipc_nametbl_unsubscribe(struct tipc_subscription *sub)
if (seq != NULL) {
spin_lock_bh(&seq->lock);
list_del_init(&sub->nameseq_list);
-   tipc_subscrp_put(sub);
+   tipc_sub_put(sub);
if (!seq->first_free && list_empty(&seq->subscriptions)) {
hlist_del_init_rcu(&seq->ns_list);
kfree(seq->sseqs);
diff --git a/net/tipc/server.c b/net/tipc/server.c
index 6a18b10..a5c112e 100644
--- a/net/tipc/server.c
+++ b/net/tipc/server.c
@@ -201,7 +201,7 @@ static void tipc_con_delete_sub(struct tipc_conn *con, 
struct tipc_subscr *s)
struct tipc_subscription *sub, *tm

[net-next 06/10] tipc: collapse subscription creation functions

2018-02-15 Thread Jon Maloy
After the previous changes it becomes logical to collapse the two-level
creation of subscription instances into one. We do that here.

We also rename the creation and deletion functions for more consistency.

Acked-by: Ying Xue <ying@windriver.com>
Signed-off-by: Jon Maloy <jon.ma...@ericsson.com>
---
 net/tipc/server.c |  4 ++--
 net/tipc/server.h |  1 +
 net/tipc/subscr.c | 46 --
 net/tipc/subscr.h | 14 +++---
 4 files changed, 22 insertions(+), 43 deletions(-)

diff --git a/net/tipc/server.c b/net/tipc/server.c
index 5d231fa..6a18b10 100644
--- a/net/tipc/server.c
+++ b/net/tipc/server.c
@@ -203,7 +203,7 @@ static void tipc_con_delete_sub(struct tipc_conn *con, 
struct tipc_subscr *s)
spin_lock_bh(&con->sub_lock);
list_for_each_entry_safe(sub, tmp, sub_list, subscrp_list) {
if (!s || !memcmp(s, &sub->evt.s, sizeof(*s)))
-   tipc_sub_delete(sub);
+   tipc_sub_unsubscribe(sub);
else if (s)
break;
}
@@ -278,7 +278,7 @@ static int tipc_con_rcv_sub(struct tipc_server *srv,
tipc_con_delete_sub(con, s);
return 0;
}
-   sub = tipc_subscrp_subscribe(srv, s, con->conid);
+   sub = tipc_sub_subscribe(srv, s, con->conid);
if (!sub)
return -1;
 
diff --git a/net/tipc/server.h b/net/tipc/server.h
index 2de8709..995b795 100644
--- a/net/tipc/server.h
+++ b/net/tipc/server.h
@@ -2,6 +2,7 @@
  * net/tipc/server.h: Include file for TIPC server code
  *
  * Copyright (c) 2012-2013, Wind River Systems
+ * Copyright (c) 2017, Ericsson AB
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
diff --git a/net/tipc/subscr.c b/net/tipc/subscr.c
index 406b09f..8d37b61 100644
--- a/net/tipc/subscr.c
+++ b/net/tipc/subscr.c
@@ -134,33 +134,29 @@ void tipc_subscrp_get(struct tipc_subscription 
*subscription)
kref_get(&subscription->kref);
 }
 
-static struct tipc_subscription *tipc_subscrp_create(struct tipc_server *srv,
-struct tipc_subscr *s,
-int conid)
+struct tipc_subscription *tipc_sub_subscribe(struct tipc_server *srv,
+struct tipc_subscr *s,
+int conid)
 {
struct tipc_net *tn = tipc_net(srv->net);
struct tipc_subscription *sub;
u32 filter = tipc_sub_read(s, filter);
+   u32 timeout;
 
-   /* Refuse subscription if global limit exceeded */
-   if (atomic_read(&tn->subscription_count) >= TIPC_MAX_SUBSCRIPTIONS) {
-   pr_warn("Subscription rejected, limit reached (%u)\n",
-   TIPC_MAX_SUBSCRIPTIONS);
+   if (atomic_read(&tn->subscription_count) >= TIPC_MAX_SUBSCR) {
+   pr_warn("Subscription rejected, max (%u)\n", TIPC_MAX_SUBSCR);
+   return NULL;
+   }
+   if ((filter & TIPC_SUB_PORTS && filter & TIPC_SUB_SERVICE) ||
+   (tipc_sub_read(s, seq.lower) > tipc_sub_read(s, seq.upper))) {
+   pr_warn("Subscription rejected, illegal request\n");
return NULL;
}
-
-   /* Allocate subscription object */
sub = kmalloc(sizeof(*sub), GFP_ATOMIC);
if (!sub) {
pr_warn("Subscription rejected, no memory\n");
return NULL;
}
-
-   /* Initialize subscription object */
-   if (filter & TIPC_SUB_PORTS && filter & TIPC_SUB_SERVICE)
-   goto err;
-   if (tipc_sub_read(s, seq.lower) > tipc_sub_read(s, seq.upper))
-   goto err;
sub->server = srv;
sub->conid = conid;
sub->inactive = false;
@@ -168,24 +164,6 @@ static struct tipc_subscription 
*tipc_subscrp_create(struct tipc_server *srv,
spin_lock_init(&sub->lock);
atomic_inc(&tn->subscription_count);
kref_init(&sub->kref);
-   return sub;
-err:
-   pr_warn("Subscription rejected, illegal request\n");
-   kfree(sub);
-   return NULL;
-}
-
-struct tipc_subscription *tipc_subscrp_subscribe(struct tipc_server *srv,
-struct tipc_subscr *s,
-int conid)
-{
-   struct tipc_subscription *sub = NULL;
-   u32 timeout;
-
-   sub = tipc_subscrp_create(srv, s, conid);
-   if (!sub)
-   return NULL;
-
tipc_nametbl_subscribe(sub);
timer_setup(&sub->timer, tipc_subscrp_timeout, 0);
timeout = tipc_sub_read(&sub->evt.s, timeout);
@@ -194,7 +172,7 @@ struct tipc_subscription *tipc_subscrp_subscribe(struct 
tipc_server *srv,
return sub;
 }
 
-void tipc_sub_delete(struct 

[net-next 05/10] tipc: simplify endianness handling in topology subscriber

2018-02-15 Thread Jon Maloy
Because of the requirement for total distribution transparency, users
send subscriptions and receive topology events in their own host format.
It is up to the topology server to determine this format and do the
correct conversions to and from its own host format when needed.

Until now, this has been handled in a rather non-transparent way inside
the topology server and subscriber code, leading to unnecessary
complexity when creating subscriptions and issuing events.

We now improve this situation by adding two new macros, tipc_sub_read()
and tipc_evt_write(). Both macros calculate the need for
conversion internally before performing their respective operations.
Hence, all handling of such conversions becomes transparent to the rest
of the code.
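
To illustrate the idea of a read accessor that hides the conversion,
here is a minimal userspace sketch. It is not the macro from this
patch: struct subscr, the SUB_* bit values, sub_needs_swap() and
sub_read() are hypothetical stand-ins, and the detection rule (a valid
request always sets at least one known low filter bit, so a filter with
none of them set must be in the opposite byte order) is an assumption
made for the example.

```c
#include <stdint.h>

/* Hypothetical filter bits; a valid request sets at least one of them. */
#define SUB_PORTS   0x01u
#define SUB_SERVICE 0x02u
#define SUB_CANCEL  0x04u
#define FILTER_MASK (SUB_PORTS | SUB_SERVICE | SUB_CANCEL)

/* Stand-in for the subscription request as received from the user. */
struct subscr {
	uint32_t lower;
	uint32_t upper;
	uint32_t filter;
};

/* Reverse the byte order of a 32-bit value. */
static uint32_t swab32(uint32_t v)
{
	return (v >> 24) | ((v >> 8) & 0x0000ff00u) |
	       ((v << 8) & 0x00ff0000u) | (v << 24);
}

/* Nonzero when the sender's byte order differs from the host's. */
static int sub_needs_swap(const struct subscr *s)
{
	return !(s->filter & FILTER_MASK);
}

/* Read one 32-bit field in host order, converting only when needed. */
#define sub_read(s, field) \
	(sub_needs_swap(s) ? swab32((s)->field) : (s)->field)
```

With an accessor of this shape, callers write sub_read(s, lower) and
never carry a separate swap flag around, which is the simplification
the commit message describes.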

Acked-by: Ying Xue <ying@windriver.com>
Signed-off-by: Jon Maloy <jon.ma...@ericsson.com>
---
 net/tipc/name_table.c | 42 +++---
 net/tipc/name_table.h |  2 +-
 net/tipc/server.c | 25 ++--
 net/tipc/subscr.c | 83 +++
 net/tipc/subscr.h | 36 --
 5 files changed, 86 insertions(+), 102 deletions(-)

diff --git a/net/tipc/name_table.c b/net/tipc/name_table.c
index c0ca7be..2fbd0a2 100644
--- a/net/tipc/name_table.c
+++ b/net/tipc/name_table.c
@@ -412,18 +412,22 @@ static struct publication 
*tipc_nameseq_remove_publ(struct net *net,
  * sequence overlapping with the requested sequence
  */
 static void tipc_nameseq_subscribe(struct name_seq *nseq,
-  struct tipc_subscription *s,
-  bool status)
+  struct tipc_subscription *sub)
 {
struct sub_seq *sseq = nseq->sseqs;
struct tipc_name_seq ns;
+   struct tipc_subscr *s = &sub->evt.s;
+   bool no_status;
 
-   tipc_subscrp_convert_seq(&s->evt.s.seq, s->swap, &ns);
+   ns.type = tipc_sub_read(s, seq.type);
+   ns.lower = tipc_sub_read(s, seq.lower);
+   ns.upper = tipc_sub_read(s, seq.upper);
+   no_status = tipc_sub_read(s, filter) & TIPC_SUB_NO_STATUS;
 
-   tipc_subscrp_get(s);
-   list_add(&s->nameseq_list, &nseq->subscriptions);
+   tipc_subscrp_get(sub);
+   list_add(&sub->nameseq_list, &nseq->subscriptions);
 
-   if (!status || !sseq)
+   if (no_status || !sseq)
return;
 
while (sseq != &nseq->sseqs[nseq->first_free]) {
@@ -433,7 +437,7 @@ static void tipc_nameseq_subscribe(struct name_seq *nseq,
int must_report = 1;
 
list_for_each_entry(crs, &info->zone_list, zone_list) {
-   tipc_subscrp_report_overlap(s, sseq->lower,
+   tipc_subscrp_report_overlap(sub, sseq->lower,
sseq->upper,
TIPC_PUBLISHED,
crs->ref, crs->node,
@@ -808,11 +812,12 @@ int tipc_nametbl_withdraw(struct net *net, u32 type, u32 
lower, u32 ref,
 /**
  * tipc_nametbl_subscribe - add a subscription object to the name table
  */
-void tipc_nametbl_subscribe(struct tipc_subscription *s, bool status)
+void tipc_nametbl_subscribe(struct tipc_subscription *sub)
 {
-   struct tipc_server *srv = s->server;
+   struct tipc_server *srv = sub->server;
struct tipc_net *tn = tipc_net(srv->net);
-   u32 type = tipc_subscrp_convert_seq_type(s->evt.s.seq.type, s->swap);
+   struct tipc_subscr *s = &sub->evt.s;
+   u32 type = tipc_sub_read(s, seq.type);
int index = hash(type);
struct name_seq *seq;
struct tipc_name_seq ns;
@@ -823,10 +828,12 @@ void tipc_nametbl_subscribe(struct tipc_subscription *s, 
bool status)
seq = tipc_nameseq_create(type, &tn->nametbl->seq_hlist[index]);
if (seq) {
spin_lock_bh(&seq->lock);
-   tipc_nameseq_subscribe(seq, s, status);
+   tipc_nameseq_subscribe(seq, sub);
spin_unlock_bh(&seq->lock);
} else {
-   tipc_subscrp_convert_seq(&s->evt.s.seq, s->swap, &ns);
+   ns.type = tipc_sub_read(s, seq.type);
+   ns.lower = tipc_sub_read(s, seq.lower);
+   ns.upper = tipc_sub_read(s, seq.upper);
pr_warn("Failed to create subscription for {%u,%u,%u}\n",
ns.type, ns.lower, ns.upper);
}
@@ -836,19 +843,20 @@ void tipc_nametbl_subscribe(struct tipc_subscription *s, 
bool status)
 /**
  * tipc_nametbl_unsubscribe - remove a subscription object from name table
  */
-void tipc_nametbl_unsubscribe(struct tipc_subscription *s)
+void tipc_nametbl_unsubscribe(struct tipc_subscription *sub)
 {
-   struct tipc_server *srv = s->server;
+   struct tipc_server *srv = sub->server;
+ 

[net-next 04/10] tipc: simplify interaction between subscription and topology connection

2018-02-15 Thread Jon Maloy
The message transmission and reception in the topology server is more
generic than is currently necessary. By basing the functionality on the
fact that we only send items of type struct tipc_event and always
receive items of struct tipc_subscr, we can make several simplifications,
and also get rid of some unnecessary dynamic memory allocations.
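
As a rough illustration of why a fixed item type removes allocations,
the sketch below stores the event inline in the queue entry, so each
queued message costs one allocation and no separate payload buffer.
The names struct event, struct out_entry, out_queue_push() and
out_queue_pop() are hypothetical and only loosely modelled on the
outqueue in this patch.

```c
#include <stdlib.h>

/* Stand-in for the fixed-size event that is the only thing ever sent. */
struct event {
	unsigned int type;
	unsigned int lower;
	unsigned int upper;
};

/* Queue entry with the payload embedded, not pointed to. */
struct out_entry {
	struct event evt;
	struct out_entry *next;
};

struct out_queue {
	struct out_entry *head;
	struct out_entry *tail;
};

/* Copy the event into a new entry and append it; one malloc per event. */
int out_queue_push(struct out_queue *q, const struct event *evt)
{
	struct out_entry *e = malloc(sizeof(*e));

	if (!e)
		return -1;
	e->evt = *evt;
	e->next = NULL;
	if (q->tail)
		q->tail->next = e;
	else
		q->head = e;
	q->tail = e;
	return 0;
}

/* Remove the oldest entry and hand its event back to the caller. */
int out_queue_pop(struct out_queue *q, struct event *evt)
{
	struct out_entry *e = q->head;

	if (!e)
		return -1;
	*evt = e->evt;
	q->head = e->next;
	if (!q->head)
		q->tail = NULL;
	free(e);
	return 0;
}
```

Freeing an entry also frees its payload, so cleanup is a single pass
over the list with no per-entry buffer to release.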

Acked-by: Ying Xue <ying@windriver.com>
Signed-off-by: Jon Maloy <jon.ma...@ericsson.com>
---
 net/tipc/name_table.c |  10 +--
 net/tipc/server.c | 170 +-
 net/tipc/server.h |  12 +---
 net/tipc/subscr.c |  40 ++--
 net/tipc/subscr.h |   5 +-
 5 files changed, 88 insertions(+), 149 deletions(-)

diff --git a/net/tipc/name_table.c b/net/tipc/name_table.c
index ed0457c..c0ca7be 100644
--- a/net/tipc/name_table.c
+++ b/net/tipc/name_table.c
@@ -810,14 +810,15 @@ int tipc_nametbl_withdraw(struct net *net, u32 type, u32 
lower, u32 ref,
  */
 void tipc_nametbl_subscribe(struct tipc_subscription *s, bool status)
 {
-   struct tipc_net *tn = net_generic(s->net, tipc_net_id);
+   struct tipc_server *srv = s->server;
+   struct tipc_net *tn = tipc_net(srv->net);
u32 type = tipc_subscrp_convert_seq_type(s->evt.s.seq.type, s->swap);
int index = hash(type);
struct name_seq *seq;
struct tipc_name_seq ns;
 
spin_lock_bh(&tn->nametbl_lock);
-   seq = nametbl_find_seq(s->net, type);
+   seq = nametbl_find_seq(srv->net, type);
if (!seq)
seq = tipc_nameseq_create(type, &tn->nametbl->seq_hlist[index]);
if (seq) {
@@ -837,12 +838,13 @@ void tipc_nametbl_subscribe(struct tipc_subscription *s, 
bool status)
  */
 void tipc_nametbl_unsubscribe(struct tipc_subscription *s)
 {
-   struct tipc_net *tn = net_generic(s->net, tipc_net_id);
+   struct tipc_server *srv = s->server;
+   struct tipc_net *tn = tipc_net(srv->net);
struct name_seq *seq;
u32 type = tipc_subscrp_convert_seq_type(s->evt.s.seq.type, s->swap);
 
spin_lock_bh(&tn->nametbl_lock);
-   seq = nametbl_find_seq(s->net, type);
+   seq = nametbl_find_seq(srv->net, type);
if (seq != NULL) {
spin_lock_bh(&seq->lock);
list_del_init(&s->nameseq_list);
diff --git a/net/tipc/server.c b/net/tipc/server.c
index b8268c0..7933fb9 100644
--- a/net/tipc/server.c
+++ b/net/tipc/server.c
@@ -84,9 +84,9 @@ struct tipc_conn {
 
 /* An entry waiting to be sent */
 struct outqueue_entry {
-   u32 evt;
+   bool inactive;
+   struct tipc_event evt;
struct list_head list;
-   struct kvec iov;
 };
 
 static void tipc_recv_work(struct work_struct *work);
@@ -154,6 +154,9 @@ static struct tipc_conn *tipc_conn_lookup(struct 
tipc_server *s, int conid)
return con;
 }
 
+/* sock_data_ready - interrupt callback indicating the socket has data to read
+ * The queued job is launched in tipc_recv_from_sock()
+ */
 static void sock_data_ready(struct sock *sk)
 {
struct tipc_conn *con;
@@ -168,6 +171,10 @@ static void sock_data_ready(struct sock *sk)
read_unlock_bh(&sk->sk_callback_lock);
 }
 
+/* sock_write_space - interrupt callback after a sendmsg EAGAIN
+ * Indicates that there now is more is space in the send buffer
+ * The queued job is launched in tipc_send_to_sock()
+ */
 static void sock_write_space(struct sock *sk)
 {
struct tipc_conn *con;
@@ -273,10 +280,10 @@ static struct tipc_conn *tipc_alloc_conn(struct 
tipc_server *s)
return con;
 }
 
-int tipc_con_rcv_sub(struct net *net, int conid, struct tipc_conn *con,
-void *buf, size_t len)
+static int tipc_con_rcv_sub(struct tipc_server *srv,
+   struct tipc_conn *con,
+   struct tipc_subscr *s)
 {
-   struct tipc_subscr *s = (struct tipc_subscr *)buf;
struct tipc_subscription *sub;
bool status;
int swap;
@@ -292,7 +299,7 @@ int tipc_con_rcv_sub(struct net *net, int conid, struct 
tipc_conn *con,
return 0;
}
status = !(s->filter & htohl(TIPC_SUB_NO_STATUS, swap));
-   sub = tipc_subscrp_subscribe(net, s, conid, swap, status);
+   sub = tipc_subscrp_subscribe(srv, s, con->conid, swap, status);
if (!sub)
return -1;
 
@@ -304,43 +311,27 @@ int tipc_con_rcv_sub(struct net *net, int conid, struct 
tipc_conn *con,
 
 static int tipc_receive_from_sock(struct tipc_conn *con)
 {
-   struct tipc_server *s = con->server;
+   struct tipc_server *srv = con->server;
struct sock *sk = con->sock->sk;
struct msghdr msg = {};
+   struct tipc_subscr s;
struct kvec iov;
-   void *buf;
int ret;
 
-   buf = kmem_cache_alloc(s->rcvbuf_cache, GFP_ATOMIC);
-   if (!buf) {
-   ret = -ENOMEM;
-   goto out_close;
- 

[net-next 02/10] tipc: remove unnecessary function pointers

2018-02-15 Thread Jon Maloy
Interaction between the functionality in server.c and subscr.c is
done via function pointers installed in struct tipc_server. This makes
the code harder to follow, and doesn't serve any obvious purpose.

Here, we replace the function pointers with direct function calls.
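
The difference can be shown with a small before/after sketch. The
names struct server_ops, subscriber_create(), accept_indirect() and
accept_direct() are hypothetical and chosen only for illustration.

```c
/* Before: the caller goes through a pointer installed at setup time. */
struct server_ops {
	int (*on_new)(int conid);
};

/* The single implementation that was ever installed in the pointer. */
int subscriber_create(int conid)
{
	return conid + 1000;
}

int accept_indirect(const struct server_ops *ops, int conid)
{
	/* A reader has to find out elsewhere what on_new points to. */
	return ops->on_new(conid);
}

/* After: the only implementation is called by name. */
int accept_direct(int conid)
{
	return subscriber_create(conid);
}
```

Both paths do the same work; with a single implementation, the direct
call is simply easier to follow with grep or a cross-reference tool.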

Acked-by: Ying Xue <ying@windriver.com>
Signed-off-by: Jon Maloy <jon.ma...@ericsson.com>
---
 net/tipc/server.c | 21 ++---
 net/tipc/server.h |  5 -
 net/tipc/subscr.c | 27 ++-
 net/tipc/subscr.h |  4 
 4 files changed, 20 insertions(+), 37 deletions(-)

diff --git a/net/tipc/server.c b/net/tipc/server.c
index 04a6dd9..8aa2a33 100644
--- a/net/tipc/server.c
+++ b/net/tipc/server.c
@@ -33,6 +33,7 @@
  * POSSIBILITY OF SUCH DAMAGE.
  */
 
+#include "subscr.h"
 #include "server.h"
 #include "core.h"
 #include "socket.h"
@@ -182,7 +183,6 @@ static void tipc_register_callbacks(struct socket *sock, 
struct tipc_conn *con)
 
 static void tipc_close_conn(struct tipc_conn *con)
 {
-   struct tipc_server *s = con->server;
struct sock *sk = con->sock->sk;
bool disconnect = false;
 
@@ -191,7 +191,7 @@ static void tipc_close_conn(struct tipc_conn *con)
if (disconnect) {
sk->sk_user_data = NULL;
if (con->conid)
-   s->tipc_conn_release(con->conid, con->usr_data);
+   tipc_subscrb_delete(con->usr_data);
}
write_unlock_bh(&sk->sk_callback_lock);
 
@@ -240,7 +240,6 @@ static int tipc_receive_from_sock(struct tipc_conn *con)
 {
struct tipc_server *s = con->server;
struct sock *sk = con->sock->sk;
-   struct sockaddr_tipc addr;
struct msghdr msg = {};
struct kvec iov;
void *buf;
@@ -254,7 +253,7 @@ static int tipc_receive_from_sock(struct tipc_conn *con)
 
iov.iov_base = buf;
iov.iov_len = s->max_rcvbuf_size;
-   msg.msg_name = &addr;
+   msg.msg_name = NULL;
iov_iter_kvec(&msg.msg_iter, READ | ITER_KVEC, &iov, 1, iov.iov_len);
ret = sock_recvmsg(con->sock, &msg, MSG_DONTWAIT);
if (ret <= 0) {
@@ -264,8 +263,8 @@ static int tipc_receive_from_sock(struct tipc_conn *con)
 
read_lock_bh(&sk->sk_callback_lock);
if (test_bit(CF_CONNECTED, &con->flags))
-   ret = s->tipc_conn_recvmsg(sock_net(con->sock->sk), con->conid,
-  &addr, con->usr_data, buf, ret);
+   ret = tipc_subscrb_rcv(sock_net(con->sock->sk), con->conid,
+  con->usr_data, buf, ret);
read_unlock_bh(&sk->sk_callback_lock);
kmem_cache_free(s->rcvbuf_cache, buf);
if (ret < 0)
@@ -284,7 +283,6 @@ static int tipc_receive_from_sock(struct tipc_conn *con)
 
 static int tipc_accept_from_sock(struct tipc_conn *con)
 {
-   struct tipc_server *s = con->server;
struct socket *sock = con->sock;
struct socket *newsock;
struct tipc_conn *newcon;
@@ -305,7 +303,8 @@ static int tipc_accept_from_sock(struct tipc_conn *con)
tipc_register_callbacks(newsock, newcon);
 
/* Notify that new connection is incoming */
-   newcon->usr_data = s->tipc_conn_new(newcon->conid);
+   newcon->usr_data = tipc_subscrb_create(newcon->conid);
+
if (!newcon->usr_data) {
sock_release(newsock);
conn_put(newcon);
@@ -489,7 +488,7 @@ bool tipc_topsrv_kern_subscr(struct net *net, u32 port, u32 
type, u32 lower,
 
*conid = con->conid;
s = con->server;
-   scbr = s->tipc_conn_new(*conid);
+   scbr = tipc_subscrb_create(*conid);
if (!scbr) {
conn_put(con);
return false;
@@ -497,7 +496,7 @@ bool tipc_topsrv_kern_subscr(struct net *net, u32 port, u32 
type, u32 lower,
 
con->usr_data = scbr;
con->sock = NULL;
-   s->tipc_conn_recvmsg(net, *conid, NULL, scbr, &sub, sizeof(sub));
+   tipc_subscrb_rcv(net, *conid, scbr, &sub, sizeof(sub));
return true;
 }
 
@@ -513,7 +512,7 @@ void tipc_topsrv_kern_unsubscr(struct net *net, int conid)
test_and_clear_bit(CF_CONNECTED, &con->flags);
srv = con->server;
if (con->conid)
-   srv->tipc_conn_release(con->conid, con->usr_data);
+   tipc_subscrb_delete(con->usr_data);
conn_put(con);
conn_put(con);
 }
diff --git a/net/tipc/server.h b/net/tipc/server.h
index 434736d..b4b83bd 100644
--- a/net/tipc/server.h
+++ b/net/tipc/server.h
@@ -72,11 +72,6 @@ struct tipc_server {
struct workqueue_struct *rcv_wq;
struct workqueue_struct *send_wq;
int max_rcvbuf_size;
-   void *(*tipc_conn_new)(int conid);
-   void (*tipc_conn_release)(int conid, void *usr_data);
-   int (*tipc_conn_recvmsg)(struct net 

[net-next 03/10] tipc: eliminate struct tipc_subscriber

2018-02-15 Thread Jon Maloy
It is unnecessary to keep two structures, struct tipc_conn and struct
tipc_subscriber, with a one-to-one relationship and still with different
life cycles. The fact that the two often run in different contexts, and
still may access each other via direct pointers constitutes an additional
hazard, something we have experienced on several occasions, and still
see happening.

We have identified at least two remaining problems that are easier to
fix if we simplify the topology server data structure somewhat.

- When there is a race between a subscription up/down event and a
  timeout event, it is fully possible that the former might be delivered
  after the latter, leading to confusion for the receiver.

- The function tipc_subscrp_timeout() is executing in interrupt context,
  while the following call chain is at least theoretically possible:
  tipc_subscrp_timeout()
tipc_subscrp_send_event()
  tipc_conn_sendmsg()
conn_put()
  tipc_conn_kref_release()
sock_release(sock)

I.e., we end up calling a function that might try to sleep in
interrupt context. To eliminate this, we need to ensure that the
tipc_conn structure and the socket, as well as the subscription
instances, are only deleted in work queue context, i.e., after the
timeout event really has been sent out.

We now remove this unnecessary complexity, by merging data and
functionality of the subscriber structure into struct tipc_conn
and the associated file server.c. We thereafter add a spinlock and
a new 'inactive' state to the subscription structure. Using those,
both problems described above can be easily solved.
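
A single-threaded userspace sketch of the two ideas follows. It is not
the code in this patch: struct sub, sub_report(), sub_timeout() and
cleanup_work() are hypothetical names, and the places where the kernel
code would hold the new spinlock are only marked by comments.

```c
#include <stdbool.h>
#include <stdlib.h>

struct sub {
	bool inactive;         /* set once the timeout has fired */
	int events_queued;     /* events accepted for transmission */
	struct sub *next_dead; /* link in the deferred-release list */
};

struct sub *dead_list;  /* objects waiting for the worker */
int freed_count;        /* how many objects the worker has released */

/* Up/down event: dropped if the timeout event was already issued. */
bool sub_report(struct sub *s)
{
	/* kernel code would take the subscription spinlock here */
	if (s->inactive)
		return false;
	s->events_queued++;
	return true;
}

/* "Interrupt context": may queue an event, must not free anything. */
void sub_timeout(struct sub *s)
{
	/* kernel code would take the subscription spinlock here */
	if (s->inactive)
		return;
	s->inactive = true;
	s->events_queued++;        /* the timeout event itself */
	s->next_dead = dead_list;  /* hand the object to the worker */
	dead_list = s;
}

/* "Work queue context": releasing resources is allowed here. */
void cleanup_work(void)
{
	while (dead_list) {
		struct sub *s = dead_list;

		dead_list = s->next_dead;
		free(s);
		freed_count++;
	}
}
```

The 'inactive' flag addresses the ordering problem, since nothing is
reported after the timeout event, and the deferred list keeps the
actual release out of the timer path.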

Acked-by: Ying Xue <ying@windriver.com>
Signed-off-by: Jon Maloy <jon.ma...@ericsson.com>
---
 net/tipc/server.c | 161 --
 net/tipc/server.h |   2 +-
 net/tipc/subscr.c | 173 ++
 net/tipc/subscr.h |  17 +++---
 4 files changed, 146 insertions(+), 207 deletions(-)

diff --git a/net/tipc/server.c b/net/tipc/server.c
index 8aa2a33..b8268c0 100644
--- a/net/tipc/server.c
+++ b/net/tipc/server.c
@@ -2,6 +2,7 @@
  * net/tipc/server.c: TIPC server infrastructure
  *
  * Copyright (c) 2012-2013, Wind River Systems
+ * Copyright (c) 2017, Ericsson AB
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
@@ -57,12 +58,13 @@
  * @sock: socket handler associated with connection
  * @flags: indicates connection state
  * @server: pointer to connected server
+ * @sub_list: list of all pertaining subscriptions
+ * @sub_lock: lock protecting the subscription list
+ * @outqueue_lock: control access to the outqueue
  * @rwork: receive work item
- * @usr_data: user-specified field
  * @rx_action: what to do when connection socket is active
  * @outqueue: pointer to first outbound message in queue
  * @outqueue_lock: control access to the outqueue
- * @outqueue: list of connection objects for its server
  * @swork: send work item
  */
 struct tipc_conn {
@@ -71,9 +73,10 @@ struct tipc_conn {
struct socket *sock;
unsigned long flags;
struct tipc_server *server;
+   struct list_head sub_list;
+   spinlock_t sub_lock; /* for subscription list */
struct work_struct rwork;
int (*rx_action) (struct tipc_conn *con);
-   void *usr_data;
struct list_head outqueue;
spinlock_t outqueue_lock;
struct work_struct swork;
@@ -81,6 +84,7 @@ struct tipc_conn {
 
 /* An entry waiting to be sent */
 struct outqueue_entry {
+   u32 evt;
struct list_head list;
struct kvec iov;
 };
@@ -89,18 +93,33 @@ static void tipc_recv_work(struct work_struct *work);
 static void tipc_send_work(struct work_struct *work);
 static void tipc_clean_outqueues(struct tipc_conn *con);
 
+static bool connected(struct tipc_conn *con)
+{
+   return con && test_bit(CF_CONNECTED, &con->flags);
+}
+
+/**
+ * htohl - convert value to endianness used by destination
+ * @in: value to convert
+ * @swap: non-zero if endianness must be reversed
+ *
+ * Returns converted value
+ */
+static u32 htohl(u32 in, int swap)
+{
+   return swap ? swab32(in) : in;
+}
+
 static void tipc_conn_kref_release(struct kref *kref)
 {
struct tipc_conn *con = container_of(kref, struct tipc_conn, kref);
struct tipc_server *s = con->server;
struct socket *sock = con->sock;
-   struct sock *sk;
 
if (sock) {
-   sk = sock->sk;
if (test_bit(CF_SERVER, &con->flags)) {
__module_get(sock->ops->owner);
-   __module_get(sk->sk_prot_creator->owner);
+   __module_get(sock->sk->sk_prot_creator->owner);
}
sock_release(sock);
con->sock = NULL;
@@ -129,11 +148,8 @@ static struct tipc_conn *tipc_conn_lookup(struct 
tipc_server *s, int conid

[net-next 01/10] tipc: remove redundant code in topology server

2018-02-15 Thread Jon Maloy
The socket handling in the topology server is unnecessarily generic.
It is prepared to handle both SOCK_RDM, SOCK_DGRAM and SOCK_STREAM
type sockets, as well as the only socket type which is really used,
SOCK_SEQPACKET.

We now remove this redundant code to make the code more readable.

Acked-by: Ying Xue <ying@windriver.com>
Signed-off-by: Jon Maloy <jon.ma...@ericsson.com>
---
 net/tipc/server.c | 36 +++-
 net/tipc/server.h |  4 +---
 net/tipc/subscr.c |  4 +---
 3 files changed, 9 insertions(+), 35 deletions(-)

diff --git a/net/tipc/server.c b/net/tipc/server.c
index df0c563..04a6dd9 100644
--- a/net/tipc/server.c
+++ b/net/tipc/server.c
@@ -82,7 +82,6 @@ struct tipc_conn {
 struct outqueue_entry {
struct list_head list;
struct kvec iov;
-   struct sockaddr_tipc dest;
 };
 
 static void tipc_recv_work(struct work_struct *work);
@@ -93,7 +92,6 @@ static void tipc_conn_kref_release(struct kref *kref)
 {
struct tipc_conn *con = container_of(kref, struct tipc_conn, kref);
struct tipc_server *s = con->server;
-   struct sockaddr_tipc *saddr = s->saddr;
struct socket *sock = con->sock;
struct sock *sk;
 
@@ -103,8 +101,6 @@ static void tipc_conn_kref_release(struct kref *kref)
__module_get(sock->ops->owner);
__module_get(sk->sk_prot_creator->owner);
}
-   saddr->scope = -TIPC_NODE_SCOPE;
-   kernel_bind(sock, (struct sockaddr *)saddr, sizeof(*saddr));
sock_release(sock);
con->sock = NULL;
}
@@ -325,36 +321,24 @@ static struct socket *tipc_create_listen_sock(struct 
tipc_conn *con)
 {
struct tipc_server *s = con->server;
struct socket *sock = NULL;
+   int imp = TIPC_CRITICAL_IMPORTANCE;
int ret;
 
ret = sock_create_kern(s->net, AF_TIPC, SOCK_SEQPACKET, 0, &sock);
if (ret < 0)
return NULL;
ret = kernel_setsockopt(sock, SOL_TIPC, TIPC_IMPORTANCE,
-   (char *)&s->imp, sizeof(s->imp));
+   (char *)&imp, sizeof(imp));
if (ret < 0)
goto create_err;
ret = kernel_bind(sock, (struct sockaddr *)s->saddr, sizeof(*s->saddr));
if (ret < 0)
goto create_err;
 
-   switch (s->type) {
-   case SOCK_STREAM:
-   case SOCK_SEQPACKET:
-   con->rx_action = tipc_accept_from_sock;
-
-   ret = kernel_listen(sock, 0);
-   if (ret < 0)
-   goto create_err;
-   break;
-   case SOCK_DGRAM:
-   case SOCK_RDM:
-   con->rx_action = tipc_receive_from_sock;
-   break;
-   default:
-   pr_err("Unknown socket type %d\n", s->type);
+   con->rx_action = tipc_accept_from_sock;
+   ret = kernel_listen(sock, 0);
+   if (ret < 0)
goto create_err;
-   }
 
/* As server's listening socket owner and creator is the same module,
 * we have to decrease TIPC module reference count to guarantee that
@@ -444,7 +428,7 @@ static void tipc_clean_outqueues(struct tipc_conn *con)
 }
 
 int tipc_conn_sendmsg(struct tipc_server *s, int conid,
- struct sockaddr_tipc *addr, void *data, size_t len)
+ void *data, size_t len)
 {
struct outqueue_entry *e;
struct tipc_conn *con;
@@ -464,9 +448,6 @@ int tipc_conn_sendmsg(struct tipc_server *s, int conid,
return -ENOMEM;
}
 
-   if (addr)
-   memcpy(&e->dest, addr, sizeof(struct sockaddr_tipc));
-
spin_lock_bh(&con->outqueue_lock);
list_add_tail(&e->list, &con->outqueue);
spin_unlock_bh(&con->outqueue_lock);
@@ -575,10 +556,6 @@ static void tipc_send_to_sock(struct tipc_conn *con)
if (con->sock) {
memset(&msg, 0, sizeof(msg));
msg.msg_flags = MSG_DONTWAIT;
-   if (s->type == SOCK_DGRAM || s->type == SOCK_RDM) {
-   msg.msg_name = &e->dest;
-   msg.msg_namelen = sizeof(struct sockaddr_tipc);
-   }
ret = kernel_sendmsg(con->sock, &msg, &e->iov, 1,
 e->iov.iov_len);
if (ret == -EWOULDBLOCK || ret == 0) {
@@ -591,6 +568,7 @@ static void tipc_send_to_sock(struct tipc_conn *con)
evt = e->iov.iov_base;
tipc_send_kern_top_evt(s->net, evt);
}
+
/* Don't starve users filling buffers */
if (++count >= MAX_SEND_MSG_COUNT) {
cond_resched();
diff --git a/net/tipc/ser

[net-next 00/10] tipc: de-generealize topology server

2018-02-15 Thread Jon Maloy
The topology server is partially based on a template that is much
more generic than what we need. This results in code that is
unnecessarily hard to follow and to keep bug free.

We now take the consequence of the fact that we only have one such
server in TIPC, with no prospects for introducing any more, and
adapt the code to the specialized task it really is doing.


Jon Maloy (10):
  tipc: remove redundant code in topology server
  tipc: remove unnecessary function pointers
  tipc: eliminate struct tipc_subscriber
  tipc: simplify interaction between subscription and topology
connection
  tipc: simplify endianness handling in topology subscriber
  tipc: collapse subscription creation functions
  tipc: some prefix changes
  tipc: make struct tipc_server private for server.c
  tipc: separate topology server listener socket from subcsriber sockets
  tipc: rename tipc_server to tipc_topsrv

 net/tipc/Makefile |   2 +-
 net/tipc/core.h   |   6 +-
 net/tipc/group.c  |   2 +-
 net/tipc/name_table.c |  73 +++---
 net/tipc/name_table.h |   2 +-
 net/tipc/server.c | 710 --
 net/tipc/server.h | 103 
 net/tipc/subscr.c | 361 +
 net/tipc/subscr.h |  66 +++--
 net/tipc/topsrv.c | 702 +
 net/tipc/topsrv.h |  54 
 11 files changed, 912 insertions(+), 1169 deletions(-)
 delete mode 100644 net/tipc/server.c
 delete mode 100644 net/tipc/server.h
 create mode 100644 net/tipc/topsrv.c
 create mode 100644 net/tipc/topsrv.h

-- 
2.1.4



RE: [net-next 1/1] tipc: avoid unnecessary copying of bundled messages

2018-02-15 Thread Jon Maloy


> -----Original Message-----
> From: netdev-ow...@vger.kernel.org [mailto:netdev-
> ow...@vger.kernel.org] On Behalf Of David Miller
> Sent: Wednesday, February 14, 2018 21:27
> To: Jon Maloy <jon.ma...@ericsson.com>
> Cc: netdev@vger.kernel.org; Mohan Krishna Ghanta Krishnamurthy
> <mohan.krishna.ghanta.krishnamur...@ericsson.com>; Tung Quang Nguyen
> <tung.q.ngu...@dektech.com.au>; Hoang Huu Le
> <hoang.h...@dektech.com.au>; Canh Duc Luu
> <canh.d@dektech.com.au>; Ying Xue <ying@windriver.com>; tipc-
> discuss...@lists.sourceforge.net
> Subject: Re: [net-next 1/1] tipc: avoid unnecessary copying of bundled
> messages
> 
> From: Jon Maloy <jon.ma...@ericsson.com>
> Date: Wed, 14 Feb 2018 13:50:31 +0100
> 
> > diff --git a/net/tipc/msg.c b/net/tipc/msg.c index 4e1c6f6..a368fa8
> > 100644
> > --- a/net/tipc/msg.c
> > +++ b/net/tipc/msg.c
> > @@ -434,6 +434,9 @@ bool tipc_msg_extract(struct sk_buff *skb, struct
> sk_buff **iskb, int *pos)
> > skb_pull(*iskb, offset);
> > imsz = msg_size(buf_msg(*iskb));
> > skb_trim(*iskb, imsz);
> > +
> > +   /* Scale extracted buffer's truesize to avoid double accounting */
> > +   (*iskb)->truesize = SKB_TRUESIZE(imsz);
> > if (unlikely(!tipc_msg_validate(iskb)))
> > goto none;
> > *pos += align(imsz);
> 
> As Eric said, you have to be really careful here.
> 
> If you clone a 10K SKB 10 times, you really have to account for the full
> truesize 10 times.
> 
> That is unless you explicitly trim off frags in the new clone, then adjust the
> truesize by explicitly decreasing it by the amount of memory backing the
> frags you trimmed off completely (not partially).

The buffers we are cloning are linearized 1 MTU incoming buffers. There are no 
fragments. 
Each clone normally points to only a tiny fraction of the data area of the base 
buffer.
I don't claim that copying is always bad, but in this case it happens in the
majority of cases and, as I see it, completely unnecessarily.

There is actually some underaccounting, however, since we now only count the
data area of the base buffer (== the sum of the data areas of the clones) plus
the overhead of the clones.
A more accurate calculation, which also takes the overhead of the base
buffer into account, would look like this:
(*iskb)->truesize = SKB_TRUESIZE(imsz) + (skb->truesize - skb->len) /
msg_msgcnt(msg);

I.e., we calculate the overhead of the base buffer and divide it equally among
the clones.
Now I really can't see that we are missing anything.
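
As a rough user-space illustration of that calculation: the sketch below
models SKB_TRUESIZE() as payload plus a fixed per-buffer overhead that the
caller passes in. That overhead value, and the function names
model_truesize() and clone_truesize(), are assumptions made for the example
only; the real macro derives the overhead from the aligned sizes of the
kernel's buffer structures.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for SKB_TRUESIZE(): payload plus a fixed overhead.
 * The overhead is an input here, not the kernel's actual value.
 */
static size_t model_truesize(size_t payload, size_t fixed_overhead)
{
	return payload + fixed_overhead;
}

/* Truesize charged to one extracted clone: its own size plus an equal
 * share of the base buffer's overhead (base truesize minus base length).
 */
static size_t clone_truesize(size_t imsz, size_t base_truesize,
			     size_t base_len, unsigned int msgcnt,
			     size_t fixed_overhead)
{
	size_t base_overhead;

	assert(base_truesize >= base_len);
	assert(msgcnt > 0);

	base_overhead = base_truesize - base_len;
	return model_truesize(imsz, fixed_overhead) + base_overhead / msgcnt;
}
```

For example, a 100 byte message extracted from a base buffer with truesize
2304, length 1500 and 10 bundled messages, under an assumed overhead of 512,
is charged 100 + 512 + 804 / 10 = 692 bytes.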

BR
///jon

> 
> Finally, you can only do this on an SKB that has never entered a socket SKB
> queue, otherwise you corrupt memory accounting.


[net-next 1/1] tipc: avoid unnecessary copying of bundled messages

2018-02-14 Thread Jon Maloy
A received sk buffer may contain dozens of smaller 'bundled' messages
which after extraction go each in their own direction.

Unfortunately, when we extract those messages using skb_clone() each
of the extracted buffers inherits the truesize value of the original
buffer. Apart from causing massive overaccounting of the base buffer's
memory, this often causes tipc_msg_validate() to come to the false
conclusion that the ratio truesize/datasize > 4, and perform an
unnecessary copying of the extracted buffer.

We now fix this problem by explicitly correcting the truesize value of
the buffer clones to be the truesize of the clone itself. This change
eliminates both the overaccounting and the unnecessary buffer copying.

Reported-by: Hoang Le <hoang.h...@dektek.com.au>
Acked-by: Ying Xue <ying@windriver.com>
Signed-off-by: Jon Maloy <jon.ma...@ericsson.com>
---
 net/tipc/msg.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/tipc/msg.c b/net/tipc/msg.c
index 4e1c6f6..a368fa8 100644
--- a/net/tipc/msg.c
+++ b/net/tipc/msg.c
@@ -434,6 +434,9 @@ bool tipc_msg_extract(struct sk_buff *skb, struct sk_buff 
**iskb, int *pos)
skb_pull(*iskb, offset);
imsz = msg_size(buf_msg(*iskb));
skb_trim(*iskb, imsz);
+
+   /* Scale extracted buffer's truesize to avoid double accounting */
+   (*iskb)->truesize = SKB_TRUESIZE(imsz);
if (unlikely(!tipc_msg_validate(iskb)))
goto none;
*pos += align(imsz);
-- 
2.1.4



[net-next 1/1] tipc: apply bearer link tolerance on running links

2018-02-14 Thread Jon Maloy
Currently, the default link tolerance set in struct tipc_bearer only
has effect on links going up after that moment. I.e., a user has to
reset all the node's links across that bearer to have the new value
applied. This is too limiting and disturbing on a running cluster to
be useful.

We now change this so that also already existing links are updated
dynamically, without any need for a reset, when the bearer value is
changed. We leverage the already existing per-link functionality
for this to achieve the wanted effect.
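
The intended behaviour can be sketched in user-space C as follows. The
structures and names here (struct link, struct node, MAX_BEARERS,
apply_tolerance()) are simplified, hypothetical stand-ins for the TIPC ones,
and locking, RCU and the actual transmission are left out. Every existing
link on the bearer gets the new value, but a state message is only counted
for links that are up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_BEARERS 3   /* illustrative value only */

/* Hypothetical, reduced versions of the link and node objects. */
struct link {
	unsigned int tolerance;
	bool up;
};

struct node {
	struct link *links[MAX_BEARERS];   /* one link slot per bearer */
};

/* Store the new tolerance; report whether a state message should be
 * sent to the peer, which only makes sense for a link that is up.
 */
static bool link_set_tolerance(struct link *l, unsigned int tol)
{
	l->tolerance = tol;
	return l->up;
}

/* Walk all nodes and update the link on the given bearer, if any.
 * Returns the number of state messages that would be queued.
 */
static size_t apply_tolerance(struct node *nodes, size_t n_nodes,
			      int bearer_id, unsigned int tol)
{
	size_t msgs = 0;

	assert(bearer_id >= 0 && bearer_id < MAX_BEARERS);

	for (size_t i = 0; i < n_nodes; i++) {
		struct link *l = nodes[i].links[bearer_id];

		if (l && link_set_tolerance(l, tol))
			msgs++;
	}
	return msgs;
}
```

A link that is down still records the new tolerance, so the value is in
place when the link later comes up.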

Acked-by: Ying Xue <ying@windriver.com>
Signed-off-by: Jon Maloy <jon.ma...@ericsson.com>
---
 net/tipc/bearer.c |  8 +---
 net/tipc/link.c   |  3 ++-
 net/tipc/node.c   | 24 
 net/tipc/node.h   |  1 +
 4 files changed, 32 insertions(+), 4 deletions(-)

diff --git a/net/tipc/bearer.c b/net/tipc/bearer.c
index c800147..83d284f 100644
--- a/net/tipc/bearer.c
+++ b/net/tipc/bearer.c
@@ -946,11 +946,11 @@ int tipc_nl_bearer_add(struct sk_buff *skb, struct 
genl_info *info)
 
 int tipc_nl_bearer_set(struct sk_buff *skb, struct genl_info *info)
 {
-   int err;
-   char *name;
struct tipc_bearer *b;
struct nlattr *attrs[TIPC_NLA_BEARER_MAX + 1];
struct net *net = sock_net(skb->sk);
+   char *name;
+   int err;
 
if (!info->attrs[TIPC_NLA_BEARER])
return -EINVAL;
@@ -982,8 +982,10 @@ int tipc_nl_bearer_set(struct sk_buff *skb, struct 
genl_info *info)
return err;
}
 
-   if (props[TIPC_NLA_PROP_TOL])
+   if (props[TIPC_NLA_PROP_TOL]) {
b->tolerance = nla_get_u32(props[TIPC_NLA_PROP_TOL]);
+   tipc_node_apply_tolerance(net, b);
+   }
if (props[TIPC_NLA_PROP_PRIO])
b->priority = nla_get_u32(props[TIPC_NLA_PROP_PRIO]);
if (props[TIPC_NLA_PROP_WIN])
diff --git a/net/tipc/link.c b/net/tipc/link.c
index 2d6b2ae..3c23046 100644
--- a/net/tipc/link.c
+++ b/net/tipc/link.c
@@ -2126,7 +2126,8 @@ void tipc_link_set_tolerance(struct tipc_link *l, u32 tol,
 struct sk_buff_head *xmitq)
 {
l->tolerance = tol;
-   tipc_link_build_proto_msg(l, STATE_MSG, 0, 0, 0, tol, 0, xmitq);
+   if (link_is_up(l))
+   tipc_link_build_proto_msg(l, STATE_MSG, 0, 0, 0, tol, 0, xmitq);
 }
 
 void tipc_link_set_prio(struct tipc_link *l, u32 prio,
diff --git a/net/tipc/node.c b/net/tipc/node.c
index 9036d87..389193d 100644
--- a/net/tipc/node.c
+++ b/net/tipc/node.c
@@ -1618,6 +1618,30 @@ void tipc_rcv(struct net *net, struct sk_buff *skb, 
struct tipc_bearer *b)
kfree_skb(skb);
 }
 
+void tipc_node_apply_tolerance(struct net *net, struct tipc_bearer *b)
+{
+   struct tipc_net *tn = tipc_net(net);
+   int bearer_id = b->identity;
+   struct sk_buff_head xmitq;
+   struct tipc_link_entry *e;
+   struct tipc_node *n;
+
+   __skb_queue_head_init(&xmitq);
+
+   rcu_read_lock();
+
+   list_for_each_entry_rcu(n, &tn->node_list, list) {
+   tipc_node_write_lock(n);
+   e = &n->links[bearer_id];
+   if (e->link)
+   tipc_link_set_tolerance(e->link, b->tolerance, &xmitq);
+   tipc_node_write_unlock(n);
+   tipc_bearer_xmit(net, bearer_id, &xmitq, &e->maddr);
+   }
+
+   rcu_read_unlock();
+}
+
 int tipc_nl_peer_rm(struct sk_buff *skb, struct genl_info *info)
 {
struct net *net = sock_net(skb->sk);
diff --git a/net/tipc/node.h b/net/tipc/node.h
index acd58d2..4ce5e3a 100644
--- a/net/tipc/node.h
+++ b/net/tipc/node.h
@@ -65,6 +65,7 @@ void tipc_node_check_dest(struct net *net, u32 onode,
  struct tipc_media_addr *maddr,
  bool *respond, bool *dupl_addr);
 void tipc_node_delete_links(struct net *net, int bearer_id);
+void tipc_node_apply_tolerance(struct net *net, struct tipc_bearer *b);
 int tipc_node_get_linkname(struct net *net, u32 bearer_id, u32 node,
   char *linkname, size_t len);
 int tipc_node_xmit(struct net *net, struct sk_buff_head *list, u32 dnode,
-- 
2.1.4



RE: Serious performance degradation in Linux 4.15

2018-02-13 Thread Jon Maloy
The person who reported this is on vacation right now. I will be back with more 
detailed info in two weeks.

///jon

> -----Original Message-----
> From: netdev-ow...@vger.kernel.org [mailto:netdev-
> ow...@vger.kernel.org] On Behalf Of Peter Zijlstra
> Sent: Monday, February 12, 2018 16:17
> To: Jon Maloy <jon.ma...@ericsson.com>
> Cc: netdev@vger.kernel.org; mi...@kernel.org; David Miller
> (da...@davemloft.net) <da...@davemloft.net>; Mike Galbraith
> <umgwanakikb...@gmail.com>; Matt Fleming <m...@codeblueprint.co.uk>
> Subject: Re: Serious performance degradation in Linux 4.15
> 
> On Fri, Feb 09, 2018 at 05:59:12PM +, Jon Maloy wrote:
> > Command for TCP:
> > "netperf TCP_STREAM  (netperf -n 4 -f m -c 4 -C 4 -P 1 -H 10.0.0.1 -t
> TCP_STREAM -l 10 -- -O THROUGHPUT)"
> > Command for TIPC:
> > "netperf TIPC_STREAM (netperf -n 4 -f m -c 4 -C 4 -P 1 -H 10.0.0.1 -t
> TCP_STREAM -l 10 -- -O THROUGHPUT)"
> 
> That looks like identical tests to me. And my netperf (debian testing) doesn't
> appear to have -t TIPC_STREAM.
> 
> Please try a coherent report and I'll have another look. Don't (again) forget 
> to
> mention what kind of setup you're running this on.
> 
> 
> On my IVB-EP (2 sockets, 10 cores, 2 threads), performance cpufreq, PTI=n
> RETPOLINE=n, I get:
> 
> 
> CPUS=`grep -c ^processor /proc/cpuinfo`
> 
> for test in TCP_STREAM
> do
> for i in 1 $((CPUS/4)) $((CPUS/2)) $((CPUS)) $((CPUS*2))
> do
> echo -n $test-$i ": "
> 
> (
>   for ((j=0; j<i; j++))
>   do
> netperf -t $test -4 -c -C -l 60 -P0 | head -1 &
>   done
> 
>   wait
> ) | awk '{ n++; v+=$5; } END { print "Avg: " v/n }'
> done
> done
> 
> 
> 
> NO_WA_OLD WA_IDLE WA_WEIGHT:
> 
> TCP_STREAM-1 : Avg: 44139.8
> TCP_STREAM-10 : Avg: 27301.6
> TCP_STREAM-20 : Avg: 12701.5
> TCP_STREAM-40 : Avg: 5711.62
> TCP_STREAM-80 : Avg: 2870.16
> 
> 
> WA_OLD NO_WA_IDLE NO_WA_WEIGHT:
> 
> TCP_STREAM-1 : Avg: 25293.1
> TCP_STREAM-10 : Avg: 28196.3
> TCP_STREAM-20 : Avg: 12463.7
> TCP_STREAM-40 : Avg: 5566.83
> TCP_STREAM-80 : Avg: 2630.03
> 
> ---
>  include/linux/sched/topology.h |  4 ++
>  kernel/sched/fair.c| 99
> +-
>  kernel/sched/features.h|  2 +
>  3 files changed, 93 insertions(+), 12 deletions(-)
> 
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index 26347741ba50..2cb74343c252 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -72,6 +72,10 @@ struct sched_domain_shared {
>   atomic_t ref;
>   atomic_t nr_busy_cpus;
>   int has_idle_cores;
> +
> + unsigned long   nr_running;
> + unsigned long   load;
> + unsigned long   capacity;
>  };
> 
>  struct sched_domain {
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index
> 5eb3ffc9be84..4a561311241a 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5680,6 +5680,68 @@ static int wake_wide(struct task_struct *p)
>   return 1;
>  }
> 
> +struct llc_stats {
> + unsigned long nr_running;
> + unsigned long load;
> + unsigned long capacity;
> + int has_capacity;
> +};
> +
> +static bool get_llc_stats(struct llc_stats *stats, int cpu) {
> + struct sched_domain_shared *sds =
> +rcu_dereference(per_cpu(sd_llc_shared, cpu));
> +
> + if (!sds)
> + return false;
> +
> + stats->nr_running = READ_ONCE(sds->nr_running);
> + stats->load   = READ_ONCE(sds->load);
> + stats->capacity   = READ_ONCE(sds->capacity);
> + stats->has_capacity = stats->nr_running < per_cpu(sd_llc_size, cpu);
> +
> + return true;
> +}
> +
> +static int
> +wake_affine_old(struct sched_domain *sd, struct task_struct *p,
> + int this_cpu, int prev_cpu, int sync) {
> + struct llc_stats prev_stats, this_stats;
> + s64 this_eff_load, prev_eff_load;
> + unsigned long task_load;
> +
> + if (!get_llc_stats(&prev_stats, prev_cpu) ||
> + !get_llc_stats(&this_stats, this_cpu))
> + return nr_cpumask_bits;
> +
> + if (sync) {
> + unsigned long current_load = task_h_load(current);
> + if (current_load > this_stats.load)
> + return this_cpu;
> +
> + this_stats.load -= current_load;
> + }
> +
&

Serious performance degradation in Linux 4.15

2018-02-09 Thread Jon Maloy
The two commits 
d153b153446f7 ("sched/core: Fix wake_affine() performance regression") and
f2cdd9cc6c97 ("sched/core: Address more wake_affine() regressions")
are causing a serious performance degradation in Linux 4.15.

The effect is worst on TIPC, but even TCP is affected, as the figures below 
show. 


Command for TCP:
"netperf TCP_STREAM (netperf -n 4 -f m -c 4 -C 4 -P 1 -H 10.0.0.1 -t TCP_STREAM 
-l 10 -- -O THROUGHPUT)"

v4.15-rc1 without f2cdd9cc6c97e, d153b153446f7:   1293.67
V4.15 with the two commits:   1104.58 

i.e., a degradation of about 15 % for TCP

Command for TIPC:
"netperf TIPC_STREAM (netperf -n 4 -f m -c 4 -C 4 -P 1 -H 10.0.0.1 -t 
TCP_STREAM -l 10 -- -O THROUGHPUT)"

v4.15-rc1 without f2cdd9cc6c97e, d153b153446f7:   786.22
V4.15 with the two commits:   223.18

i.e., a degradation of 71 % 
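
The relative degradation follows from the throughput figures as
(before - after) / before. A minimal sketch in C, where the function name
degradation_pct() is only illustrative:

```c
#include <assert.h>

/* Relative throughput loss in percent, measured against the old value. */
static double degradation_pct(double before, double after)
{
	assert(before > 0.0);
	return (before - after) / before * 100.0;
}
```

Applied to 1293.67 -> 1104.58 this gives roughly 14.6 %, and to
786.22 -> 223.18 roughly 71.6 %, which matches the "diff (%)" column of the
table at the end of this message.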

This is really bad, and I hope you have a plan for reintroducing this in some 
form.

BR
Jon Maloy







Test                                       v4.15-rc1   latest    diff (%)
netperf TCP_STREAM                           1293.67  1104.58     -14.62%
  (netperf -n 4 -f m -c 4 -C 4 -P 1 -H 10.0.0.1 -t TCP_STREAM -l 10 -- -O THROUGHPUT)
benchmark TIPC                                786.67   215.67     -72.58%
  (client_bench -c 1 -m 65000 -t)
netperf TIPC_STREAM                           786.22   223.18     -71.61%
  (netperf -n 4 -f m -c 4 -C 4 -P 1 -H 10.0.0.1 -t TIPC_STREAM -l 10 -- -O THROUGHPUT)


[net 1/1] tipc: fix skb truesize/datasize ratio control

2018-02-08 Thread Jon Maloy
From: Hoang Le <hoang.h...@dektek.com.au>

In commit d618d09a68e4 ("tipc: enforce valid ratio between skb truesize
and contents") we introduced a test for ensuring that the condition
truesize/datasize <= 4 is true for a received buffer. Unfortunately this
test has two problems.

- Because of the integer arithmetic, the test
  if (skb->truesize / buf_roundup_len(skb) > 4) will miss all
  ratios [4 < ratio < 5], which was not the intention.
- The buffer returned by skb_copy() inherits skb->truesize of the
  original buffer, which doesn't help the situation at all.

In this commit, we change the ratio condition and replace skb_copy()
with a call to skb_copy_expand() to finally get this right.
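
The effect of the truncating division can be reproduced in plain C. The
three helpers below are hypothetical illustrations, not code from
net/tipc/msg.c: the first mirrors the old comparison, the second the
comparison used in this patch, and the third is a division-free variant
added here only for contrast.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Old test: integer division truncates, so 4 < ratio < 5 yields 4
 * and the comparison "> 4" does not trigger.
 */
static bool ratio_exceeded_old(size_t truesize, size_t datasize)
{
	assert(datasize > 0);
	return truesize / datasize > 4;
}

/* Patched test: triggers for every quotient of 4 or more, which also
 * includes an exact ratio of 4.
 */
static bool ratio_exceeded_new(size_t truesize, size_t datasize)
{
	assert(datasize > 0);
	return truesize / datasize >= 4;
}

/* Division-free form of "ratio strictly greater than 4", shown only
 * for comparison; it is not what the patch uses.
 */
static bool ratio_exceeded_exact(size_t truesize, size_t datasize)
{
	assert(datasize > 0);
	return truesize > 4 * datasize;
}
```

With truesize 450 and datasize 100 (a ratio of 4.5), the old test returns
false while the other two return true. With truesize 400 and datasize 100,
the patched test triggers although the ratio is exactly 4, which is the
conservative side of the trade-off.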

Acked-by: Jon Maloy <jon.ma...@ericsson.com>
Signed-off-by: Jon Maloy <jon.ma...@ericsson.com>
---
 net/tipc/msg.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/tipc/msg.c b/net/tipc/msg.c
index 55d8ba9..4e1c6f6 100644
--- a/net/tipc/msg.c
+++ b/net/tipc/msg.c
@@ -208,8 +208,8 @@ bool tipc_msg_validate(struct sk_buff **_skb)
int msz, hsz;
 
/* Ensure that flow control ratio condition is satisfied */
-   if (unlikely(skb->truesize / buf_roundup_len(skb) > 4)) {
-   skb = skb_copy(skb, GFP_ATOMIC);
+   if (unlikely(skb->truesize / buf_roundup_len(skb) >= 4)) {
+   skb = skb_copy_expand(skb, BUF_HEADROOM, 0, GFP_ATOMIC);
if (!skb)
return false;
kfree_skb(*_skb);
-- 
2.1.4


