from:"Trond Myklebust"

Re: [PATCH net-next] SUNRPC: drop pointless static qualifier in xdr_get_next_encode_buffer()

2018-11-07 Thread Trond Myklebust

On Thu, 2018-11-08 at 02:04 +, YueHaibing wrote:
> There is no need to have the '__be32 *p' variable static since new
> value
> always be assigned before use it.
> 
> Signed-off-by: YueHaibing 
> ---
>  net/sunrpc/xdr.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/sunrpc/xdr.c b/net/sunrpc/xdr.c
> index 2bbb8d3..d80b156 100644
> --- a/net/sunrpc/xdr.c
> +++ b/net/sunrpc/xdr.c
> @@ -546,7 +546,7 @@ void xdr_commit_encode(struct xdr_stream *xdr)
>  static __be32 *xdr_get_next_encode_buffer(struct xdr_stream *xdr,
>   size_t nbytes)
>  {
> - static __be32 *p;
> + __be32 *p;
>   int space_left;
>   int frag1bytes, frag2bytes;
> 

Ouch, that's a really nasty bug that could definitely cause corruption
if you have 2 threads simultaneously calling this function! This really
deserves to be a stable patch.

Thank you, YueHaibing!

Bruce, do you want to shepherd this one in?

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.mykleb...@hammerspace.com

Re: general protection fault in encode_rpcb_string

2018-04-17 Thread Trond Myklebust

On Tue, 2018-04-17 at 17:33 -0400, J. Bruce Fields wrote:
> On Mon, Apr 16, 2018 at 09:02:01PM -0700, syzbot wrote:
> > syzbot hit the following crash on bpf-next commit
> > 5d1365940a68dd57b031b6e3c07d7d451cd69daf (Thu Apr 12 18:09:05 2018
> > +)
> > Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
> > syzbot dashboard link:
> > https://syzkaller.appspot.com/bug?extid=4b98281f2401ab849f4b
> > 
> > So far this crash happened 2 times on bpf-next.
> > C reproducer: https://syzkaller.appspot.com/x/repro.c?id=6433835633
> > 868800
> > syzkaller reproducer:
> > https://syzkaller.appspot.com/x/repro.syz?id=6407311794896896
> > Raw console output:
> > https://syzkaller.appspot.com/x/log.txt?id=5861511176126464
> 
> Based on that, looks like it's attempting an nfs mount while causing
> kmalloc failures?
> 
> Probably one of rpcb->r_netid, r_addr, or r_owner was bad in
> rpcb_enc_getaddr.
> 
> Hm, and previous log makes it look like it was an
> rpc_sockaddr2uaddr()
> in rpcb_getport_async() that was made to fail.  Do we need to check
> for
> failure of:
> 
>   map->r_addr = rpc_sockaddr2uaddr(sap, GFP_ATOMIC);
> 
> ?

Yes, and we can probably convert it, and the other GFP_ATOMIC
allocations in the rpcbind client to use GFP_NOFS in order to improve
reliability.

Cheers
  Trond
-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammer.space

Re: sunrpc: infinite unkillable console spam in xs_tcp_setup_socket

2017-11-24 Thread Trond Myklebust

On Mon, 2017-11-20 at 14:02 +0100, Dmitry Vyukov wrote:
> Hello,
> 
> The following program triggers infinite stream of the following
> output
> on console. The program is unkillable and this effectively brings the
> machine down:
> 
> 
> ** 16 printk messages dropped ** [12875.022917] xs_tcp_setup_socket:
> connect returned unhandled error -113
>

Does the following fix the issue?

8<-
From f48d3f01df45f50f0145060f5272ccf1aea855ac Mon Sep 17 00:00:00 2001
From: Trond Myklebust <trond.mykleb...@primarydata.com>
Date: Fri, 24 Nov 2017 12:00:24 -0500
Subject: [PATCH] SUNRPC: Allow connect to return EHOSTUNREACH

Reported-by: Dmitry Vyukov <dvyu...@google.com>
Signed-off-by: Trond Myklebust <trond.mykleb...@primarydata.com>
---
 net/sunrpc/xprtsock.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 4dad5da388d6..8cb40f8ffa5b 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -2437,6 +2437,7 @@ static void xs_tcp_setup_socket(struct work_struct *work)
case -ECONNREFUSED:
case -ECONNRESET:
case -ENETUNREACH:
+   case -EHOSTUNREACH:
case -EADDRINUSE:
case -ENOBUFS:
/*
-- 
2.14.3

-- 
Trond Myklebust
Linux NFS client maintainer, PrimaryData
trond.mykleb...@primarydata.com

Re: [PATCH 06/21] nfs client: exit_net cleanup check added

2017-11-05 Thread Trond Myklebust

On Sun, 2017-11-05 at 19:48 +0300, Vasily Averin wrote:
> On 2017-11-05 19:02, Trond Myklebust wrote:
> > On Sun, 2017-11-05 at 13:00 +0300, Vasily Averin wrote:
> > > Be sure that nfs_client_list and nfs_volume_list lists
> > > initialized
> > > in net_init hook were return to initial state in net_exit hook.
> > > 
> > > Signed-off-by: Vasily Averin <v...@virtuozzo.com>
> > > ---
> > >  fs/nfs/client.c | 4 
> > >  1 file changed, 4 insertions(+)
> > > 
> > > diff --git a/fs/nfs/client.c b/fs/nfs/client.c
> > > index 22880ef..7c0691c 100644
> > > --- a/fs/nfs/client.c
> > > +++ b/fs/nfs/client.c
> > > @@ -204,6 +204,10 @@ void nfs_cleanup_cb_ident_idr(struct net
> > > *net)
> > >   struct nfs_net *nn = net_generic(net, nfs_net_id);
> > >  
> > >   idr_destroy(>cb_ident_idr);
> > > + WARN(!list_empty(>nfs_client_list),
> > > +  "net %p exit: nfs_client_list is not empty\n",
> > > net);
> > > + WARN(!list_empty(>nfs_volume_list),
> > > +  "net %p exit: nfs_volume_list is not empty\n",
> > > net);
> > >  }
> > >  
> > 
> > Why do we need these? Is there a specific bug that you are trying
> > to
> > track down?
> 
> I hope such checks allows to detect leaked per-netns objects.
> Also I hope that all new pernet_operations will inherit such checks
> too.
> 
> I assume that elements added into per-net lists should not live
> longer than net namespace,
> and should be deleted from the list. I think exit_net hook is good
> place for such check.
> 
> Recently I've found lost list_entry and enabled timer on stop of net
> namespace.
> Then I've reviewed all existing pernet_operations and found that many
> drivers
> have such checks already. So I decided to complete this task and add
> such checks
> into all affected subsystems.
> 

Unless there is a known problem that has specific debugging needs, this
kind of assert should take the form of a WARN_ONCE() so that it doesn't
fill user syslogs with redundant warnings if triggered.

-- 
Trond Myklebust
Linux NFS client maintainer, PrimaryData
trond.mykleb...@primarydata.com

Re: [PATCH 06/21] nfs client: exit_net cleanup check added

2017-11-05 Thread Trond Myklebust

On Sun, 2017-11-05 at 13:00 +0300, Vasily Averin wrote:
> Be sure that nfs_client_list and nfs_volume_list lists initialized
> in net_init hook were return to initial state in net_exit hook.
> 
> Signed-off-by: Vasily Averin <v...@virtuozzo.com>
> ---
>  fs/nfs/client.c | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/fs/nfs/client.c b/fs/nfs/client.c
> index 22880ef..7c0691c 100644
> --- a/fs/nfs/client.c
> +++ b/fs/nfs/client.c
> @@ -204,6 +204,10 @@ void nfs_cleanup_cb_ident_idr(struct net *net)
>   struct nfs_net *nn = net_generic(net, nfs_net_id);
>  
>   idr_destroy(>cb_ident_idr);
> + WARN(!list_empty(>nfs_client_list),
> +  "net %p exit: nfs_client_list is not empty\n", net);
> + WARN(!list_empty(>nfs_volume_list),
> +  "net %p exit: nfs_volume_list is not empty\n", net);
>  }
>  

Why do we need these? Is there a specific bug that you are trying to
track down?

-- 
Trond Myklebust
Linux NFS client maintainer, PrimaryData
trond.mykleb...@primarydata.com

Re: possible deadlock in flush_work (2)

2017-11-05 Thread Trond Myklebust

dep.c:1258
> >  check_prev_add kernel/locking/lockdep.c:1901 [inline]
> >  check_prevs_add kernel/locking/lockdep.c:2018 [inline]
> >  validate_chain kernel/locking/lockdep.c:2460 [inline]
> >  __lock_acquire+0x2f55/0x3d50 kernel/locking/lockdep.c:3487
> >  lock_acquire+0x1d5/0x580 kernel/locking/lockdep.c:3991
> >  start_flush_work kernel/workqueue.c:2851 [inline]
> >  flush_work+0x57f/0x8a0 kernel/workqueue.c:2882
> >  __cancel_work_timer+0x30a/0x7e0 kernel/workqueue.c:2954
> >  cancel_work_sync+0x17/0x20 kernel/workqueue.c:2990
> >  xprt_destroy+0xa1/0x130 net/sunrpc/xprt.c:1467
> >  xprt_destroy_kref net/sunrpc/xprt.c:1477 [inline]
> >  kref_put include/linux/kref.h:70 [inline]
> >  xprt_put+0x38/0x40 net/sunrpc/xprt.c:1501
> >  rpc_task_release_client+0x299/0x430 net/sunrpc/clnt.c:986
> >  rpc_release_resources_task+0x7f/0xa0 net/sunrpc/sched.c:1020
> >  rpc_release_task net/sunrpc/sched.c:1059 [inline]
> >  __rpc_execute+0x4d9/0xe70 net/sunrpc/sched.c:824
> >  rpc_async_schedule+0x16/0x20 net/sunrpc/sched.c:848
> >  process_one_work+0xbf0/0x1bc0 kernel/workqueue.c:2112
> >  worker_thread+0x223/0x1990 kernel/workqueue.c:2246
> >  kthread+0x38b/0x470 kernel/kthread.c:242
> >  ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:431
> 
> 
> +sunrpc maintainers

A fix for this has already been merged. Please retest with an up to
date kernel.

-- 
Trond Myklebust
Linux NFS client maintainer, PrimaryData
trond.mykleb...@primarydata.com

Re: [PATCH 01/23] net, sunrpc: convert rpc_cred.cr_count from atomic_t to refcount_t

2017-03-17 Thread Trond Myklebust

On Fri, 2017-03-17 at 09:02 -0400, Jeff Layton wrote:
> On Fri, 2017-03-17 at 12:50 +0000, Trond Myklebust wrote:
> > On Fri, 2017-03-17 at 14:10 +0200, Elena Reshetova wrote:
> > > refcount_t type and corresponding API should be
> > > used instead of atomic_t when the variable is used as
> > > a reference counter. This allows to avoid accidental
> > > refcounter overflows that might lead to use-after-free
> > > situations.
> > > 
> > > Signed-off-by: Elena Reshetova <elena.reshet...@intel.com>
> > > Signed-off-by: Hans Liljestrand <ishkam...@gmail.com>
> > > Signed-off-by: Kees Cook <keesc...@chromium.org>
> > > Signed-off-by: David Windsor <dwind...@gmail.com>
> > > ---
> > >  include/linux/sunrpc/auth.h |  8 
> > >  net/sunrpc/auth.c   | 12 ++--
> > >  2 files changed, 10 insertions(+), 10 deletions(-)
> > > 
> > > diff --git a/include/linux/sunrpc/auth.h
> > > b/include/linux/sunrpc/auth.h
> > > index b1bc62b..bd36e0b 100644
> > > --- a/include/linux/sunrpc/auth.h
> > > +++ b/include/linux/sunrpc/auth.h
> > > @@ -15,7 +15,7 @@
> > >  #include 
> > >  #include 
> > >  
> > > -#include 
> > > +#include 
> > >  #include 
> > >  #include 
> > >  #include 
> > > @@ -68,7 +68,7 @@ struct rpc_cred {
> > >  #endif
> > >   unsigned long   cr_expire;  /* when
> > > to gc
> > > */
> > >   unsigned long   cr_flags;   /* various
> > > flags */
> > > - atomic_tcr_count;   /* ref count */
> > > + refcount_t  cr_count;   /* ref count
> > > */
> > > 
> > 
> > NACK. That's going to be hitting
> > WARN_ONCE(!refcount_inc_not_zero(r),
> > "refcount_t: increment on 0; use-after-free.\n") like there's no
> > tomorrow...
> > 
> > Please stop with these automated conversions. They are going to
> > cause a
> > lot more bugs than they fix.
> > 
> 
> Agreed. These patchsets are touching places where we've already
> banged
> out most of the refcounting bugs. I'm against doing large scale
> conversions like this without a damned good reason.
> 
> I think it may be best to do this sort of thing in a more piecemeal
> fashion. Pick a subsystem or two and do the conversions there to
> prove
> that they're better than what we have. If the subsystem already has
> problems with its refcounting, then so much the better. Point to bugs
> that this new infrastructure helped find.
> 
> Encourage people to adopt your new infrastructure as new refcounted
> objects are introduced into the kernel. You might even consider a LWN
> article about this.
> 
> Eventually we'll get around to changing existing code to use it, once
> there is a sufficient advantage to doing so. Most likely when we're
> reworking the code for other reasons, or when we're chasing some
> horrid
> refcounting bug and think that this might help find it.

The main issue is that this "refcount_t" implementation appears to be
assuming that there is one and only one model for refcounts (the one
where a value of "0" means "free me immediately").

The kernel has a plethora of object caching implementations where this
is simply not the case; the dcache is a prime example, and this cache
is another. In both these implementation, the atomic_t variable is
being used more as a semaphore-style lock that prevents freeing of the
object while it is in active use as opposed to being freeable, but
cached. This is why these automated conversions are a nuisance and a
source of bugs.

-- 
Trond Myklebust
Linux NFS client maintainer, PrimaryData
trond.mykleb...@primarydata.com

Re: [PATCH 01/23] net, sunrpc: convert rpc_cred.cr_count from atomic_t to refcount_t

2017-03-17 Thread Trond Myklebust

On Fri, 2017-03-17 at 14:10 +0200, Elena Reshetova wrote:
> refcount_t type and corresponding API should be
> used instead of atomic_t when the variable is used as
> a reference counter. This allows to avoid accidental
> refcounter overflows that might lead to use-after-free
> situations.
> 
> Signed-off-by: Elena Reshetova <elena.reshet...@intel.com>
> Signed-off-by: Hans Liljestrand <ishkam...@gmail.com>
> Signed-off-by: Kees Cook <keesc...@chromium.org>
> Signed-off-by: David Windsor <dwind...@gmail.com>
> ---
>  include/linux/sunrpc/auth.h |  8 
>  net/sunrpc/auth.c   | 12 ++--
>  2 files changed, 10 insertions(+), 10 deletions(-)
> 
> diff --git a/include/linux/sunrpc/auth.h
> b/include/linux/sunrpc/auth.h
> index b1bc62b..bd36e0b 100644
> --- a/include/linux/sunrpc/auth.h
> +++ b/include/linux/sunrpc/auth.h
> @@ -15,7 +15,7 @@
>  #include 
>  #include 
>  
> -#include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -68,7 +68,7 @@ struct rpc_cred {
>  #endif
>   unsigned long   cr_expire;  /* when to gc
> */
>   unsigned long   cr_flags;   /* various
> flags */
> - atomic_tcr_count;   /* ref count */
> + refcount_t  cr_count;   /* ref count */
> 

NACK. That's going to be hitting WARN_ONCE(!refcount_inc_not_zero(r),
"refcount_t: increment on 0; use-after-free.\n") like there's no
tomorrow...

Please stop with these automated conversions. They are going to cause a
lot more bugs than they fix.

-- 
Trond Myklebust
Linux NFS client maintainer, PrimaryData
trond.mykleb...@primarydata.com

Re: [PATCHv1] sunrpc: fix write space race causing stalls

2016-09-16 Thread Trond Myklebust


> On Sep 16, 2016, at 13:29, David Vrabel <david.vra...@citrix.com> wrote:
> 
> On 16/09/16 18:06, Trond Myklebust wrote:
>> 
>>> On Sep 16, 2016, at 12:41, David Vrabel <david.vra...@citrix.com> wrote:
>>> 
>>> On 16/09/16 17:01, Trond Myklebust wrote:
>>>> 
>>>>> On Sep 16, 2016, at 08:28, David Vrabel <david.vra...@citrix.com> wrote:
>>>>> 
>>>>> Write space becoming available may race with putting the task to sleep
>>>>> in xprt_wait_for_buffer_space().  The existing mechanism to avoid the
>>>>> race does not work.
>>>>> 
>>>>> This (edited) partial trace illustrates the problem:
>>>>> 
>>>>> [1] rpc_task_run_action: task:43546@5 ... action=call_transmit
>>>>> [2] xs_write_space <-xs_tcp_write_space
>>>>> [3] xprt_write_space <-xs_write_space
>>>>> [4] rpc_task_sleep: task:43546@5 ...
>>>>> [5] xs_write_space <-xs_tcp_write_space
>>>>> 
>>>>> [1] Task 43546 runs but is out of write space.
>>>>> 
>>>>> [2] Space becomes available, xs_write_space() clears the
>>>>>  SOCKWQ_ASYNC_NOSPACE bit.
>>>>> 
>>>>> [3] xprt_write_space() attemts to wake xprt->snd_task (== 43546), but
>>>>>  this has not yet been queued and the wake up is lost.
>>>>> 
>>>>> [4] xs_nospace() is called which calls xprt_wait_for_buffer_space()
>>>>>  which queues task 43546.
>>>>> 
>>>>> [5] The call to sk->sk_write_space() at the end of xs_nospace() (which
>>>>>  is supposed to handle the above race) does not call
>>>>>  xprt_write_space() as the SOCKWQ_ASYNC_NOSPACE bit is clear and
>>>>>  thus the task is not woken.
>>>>> 
>>>>> Fix the race by have xprt_wait_for_buffer_space() check for write
>>>>> space after putting the task to sleep.
>>>>> 
>>>>> Signed-off-by: David Vrabel <david.vra...@citrix.com>
>>>>> ---
>>>>> include/linux/sunrpc/xprt.h |  1 +
>>>>> net/sunrpc/xprt.c   |  4 
>>>>> net/sunrpc/xprtsock.c   | 21 +++--
>>>>> 3 files changed, 24 insertions(+), 2 deletions(-)
>>>>> 
>>>>> diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
>>>>> index a16070d..621e74b 100644
>>>>> --- a/include/linux/sunrpc/xprt.h
>>>>> +++ b/include/linux/sunrpc/xprt.h
>>>>> @@ -129,6 +129,7 @@ struct rpc_xprt_ops {
>>>>>   void(*connect)(struct rpc_xprt *xprt, struct rpc_task 
>>>>> *task);
>>>>>   void *  (*buf_alloc)(struct rpc_task *task, size_t size);
>>>>>   void(*buf_free)(void *buffer);
>>>>> + bool(*have_write_space)(struct rpc_xprt *task);
>>>>>   int (*send_request)(struct rpc_task *task);
>>>>>   void(*set_retrans_timeout)(struct rpc_task *task);
>>>>>   void(*timer)(struct rpc_xprt *xprt, struct rpc_task *task);
>>>>> diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
>>>>> index ea244b2..d3c1b1e 100644
>>>>> --- a/net/sunrpc/xprt.c
>>>>> +++ b/net/sunrpc/xprt.c
>>>>> @@ -502,6 +502,10 @@ void xprt_wait_for_buffer_space(struct rpc_task 
>>>>> *task, rpc_action action)
>>>>> 
>>>>>   task->tk_timeout = RPC_IS_SOFT(task) ? req->rq_timeout : 0;
>>>>>   rpc_sleep_on(>pending, task, action);
>>>>> +
>>>>> + /* Write space notification may race with putting task to sleep. */
>>>>> + if (xprt->ops->have_write_space(xprt))
>>>>> + rpc_wake_up_queued_task(>pending, task);
>>>>> }
>>>>> EXPORT_SYMBOL_GPL(xprt_wait_for_buffer_space);
>>>>> 
>>>>> diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
>>>>> index bf16883..211de5b 100644
>>>>> --- a/net/sunrpc/xprtsock.c
>>>>> +++ b/net/sunrpc/xprtsock.c
>>>>> @@ -472,8 +472,6 @@ static int xs_nospace(struct rpc_task *task)
>>>>> 
>>>>>   spin_unlock_bh(>transport_lock);
>>>>> 
>>>>> - /* Race breaker in case memory is freed before above code is called */
>>>>> - sk->sk_write_space(sk);
>>>>>   return ret;
>>>>> }
>>>> 
>>>> Instead of these callbacks, why not just add a call to
>>>> sk_set_bit(SOCKWQ_ASYNC_WAITDATA, sk) after queueing the task in
>>>> xs_nospace()? Won’t that fix the existing race breaker?
>>> 
>>> I don't see how that would help.  If sk->sk_write_space was already
>>> called, SOCKWQ_ASYNC_NOSPACE will still be clear and the next call to
>>> sk->sk_write_space will still be a nop.
>> 
>> Sorry. Copy+paste error. I meant SOCKWQ_ASYNC_NOSPACE.
>> 
>>> 
>>> Or did you mean SOCKWQ_ASYNC_NOSPACE here?  It doesn't seem right to set
>>> this bit when we don't know if there's space or not.
>> 
>> Why not?
> 
> I prefer my solution because:
> 
> a) It obviously fixes the race (games with bits are less understandable).
> 
> b) It requires fewer atomic ops.
> 
> c) It doesn't require me to understand what the behaviour of the
> socket-internal SOCKWQ_ASYNC_NOSPACE bit is or should be.
> 
> d) I'm not sure I understand the objection to the additional
> have_write_space method -- it has simple, clear behaviour.
> 

I don’t see the point of adding 24 lines of code over 3 different files if the 
problem can be solved with 1 line of code.

Re: [PATCHv1] sunrpc: fix write space race causing stalls

2016-09-16 Thread Trond Myklebust


> On Sep 16, 2016, at 12:41, David Vrabel <david.vra...@citrix.com> wrote:
> 
> On 16/09/16 17:01, Trond Myklebust wrote:
>> 
>>> On Sep 16, 2016, at 08:28, David Vrabel <david.vra...@citrix.com> wrote:
>>> 
>>> Write space becoming available may race with putting the task to sleep
>>> in xprt_wait_for_buffer_space().  The existing mechanism to avoid the
>>> race does not work.
>>> 
>>> This (edited) partial trace illustrates the problem:
>>> 
>>>  [1] rpc_task_run_action: task:43546@5 ... action=call_transmit
>>>  [2] xs_write_space <-xs_tcp_write_space
>>>  [3] xprt_write_space <-xs_write_space
>>>  [4] rpc_task_sleep: task:43546@5 ...
>>>  [5] xs_write_space <-xs_tcp_write_space
>>> 
>>> [1] Task 43546 runs but is out of write space.
>>> 
>>> [2] Space becomes available, xs_write_space() clears the
>>>   SOCKWQ_ASYNC_NOSPACE bit.
>>> 
>>> [3] xprt_write_space() attemts to wake xprt->snd_task (== 43546), but
>>>   this has not yet been queued and the wake up is lost.
>>> 
>>> [4] xs_nospace() is called which calls xprt_wait_for_buffer_space()
>>>   which queues task 43546.
>>> 
>>> [5] The call to sk->sk_write_space() at the end of xs_nospace() (which
>>>   is supposed to handle the above race) does not call
>>>   xprt_write_space() as the SOCKWQ_ASYNC_NOSPACE bit is clear and
>>>   thus the task is not woken.
>>> 
>>> Fix the race by have xprt_wait_for_buffer_space() check for write
>>> space after putting the task to sleep.
>>> 
>>> Signed-off-by: David Vrabel <david.vra...@citrix.com>
>>> ---
>>> include/linux/sunrpc/xprt.h |  1 +
>>> net/sunrpc/xprt.c   |  4 
>>> net/sunrpc/xprtsock.c   | 21 +++--
>>> 3 files changed, 24 insertions(+), 2 deletions(-)
>>> 
>>> diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
>>> index a16070d..621e74b 100644
>>> --- a/include/linux/sunrpc/xprt.h
>>> +++ b/include/linux/sunrpc/xprt.h
>>> @@ -129,6 +129,7 @@ struct rpc_xprt_ops {
>>> void(*connect)(struct rpc_xprt *xprt, struct rpc_task 
>>> *task);
>>> void *  (*buf_alloc)(struct rpc_task *task, size_t size);
>>> void(*buf_free)(void *buffer);
>>> +   bool(*have_write_space)(struct rpc_xprt *task);
>>> int (*send_request)(struct rpc_task *task);
>>> void(*set_retrans_timeout)(struct rpc_task *task);
>>> void(*timer)(struct rpc_xprt *xprt, struct rpc_task *task);
>>> diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
>>> index ea244b2..d3c1b1e 100644
>>> --- a/net/sunrpc/xprt.c
>>> +++ b/net/sunrpc/xprt.c
>>> @@ -502,6 +502,10 @@ void xprt_wait_for_buffer_space(struct rpc_task *task, 
>>> rpc_action action)
>>> 
>>> task->tk_timeout = RPC_IS_SOFT(task) ? req->rq_timeout : 0;
>>> rpc_sleep_on(>pending, task, action);
>>> +
>>> +   /* Write space notification may race with putting task to sleep. */
>>> +   if (xprt->ops->have_write_space(xprt))
>>> +   rpc_wake_up_queued_task(>pending, task);
>>> }
>>> EXPORT_SYMBOL_GPL(xprt_wait_for_buffer_space);
>>> 
>>> diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
>>> index bf16883..211de5b 100644
>>> --- a/net/sunrpc/xprtsock.c
>>> +++ b/net/sunrpc/xprtsock.c
>>> @@ -472,8 +472,6 @@ static int xs_nospace(struct rpc_task *task)
>>> 
>>> spin_unlock_bh(>transport_lock);
>>> 
>>> -   /* Race breaker in case memory is freed before above code is called */
>>> -   sk->sk_write_space(sk);
>>> return ret;
>>> }
>> 
>> Instead of these callbacks, why not just add a call to
>> sk_set_bit(SOCKWQ_ASYNC_WAITDATA, sk) after queueing the task in
>> xs_nospace()? Won’t that fix the existing race breaker?
> 
> I don't see how that would help.  If sk->sk_write_space was already
> called, SOCKWQ_ASYNC_NOSPACE will still be clear and the next call to
> sk->sk_write_space will still be a nop.

Sorry. Copy+paste error. I meant SOCKWQ_ASYNC_NOSPACE.

> 
> Or did you mean SOCKWQ_ASYNC_NOSPACE here?  It doesn't seem right to set
> this bit when we don't know if there's space or not.

Why not?

Re: [PATCHv1] sunrpc: fix write space race causing stalls

2016-09-16 Thread Trond Myklebust


> On Sep 16, 2016, at 08:28, David Vrabel  wrote:
> 
> Write space becoming available may race with putting the task to sleep
> in xprt_wait_for_buffer_space().  The existing mechanism to avoid the
> race does not work.
> 
> This (edited) partial trace illustrates the problem:
> 
>   [1] rpc_task_run_action: task:43546@5 ... action=call_transmit
>   [2] xs_write_space <-xs_tcp_write_space
>   [3] xprt_write_space <-xs_write_space
>   [4] rpc_task_sleep: task:43546@5 ...
>   [5] xs_write_space <-xs_tcp_write_space
> 
> [1] Task 43546 runs but is out of write space.
> 
> [2] Space becomes available, xs_write_space() clears the
>SOCKWQ_ASYNC_NOSPACE bit.
> 
> [3] xprt_write_space() attemts to wake xprt->snd_task (== 43546), but
>this has not yet been queued and the wake up is lost.
> 
> [4] xs_nospace() is called which calls xprt_wait_for_buffer_space()
>which queues task 43546.
> 
> [5] The call to sk->sk_write_space() at the end of xs_nospace() (which
>is supposed to handle the above race) does not call
>xprt_write_space() as the SOCKWQ_ASYNC_NOSPACE bit is clear and
>thus the task is not woken.
> 
> Fix the race by have xprt_wait_for_buffer_space() check for write
> space after putting the task to sleep.
> 
> Signed-off-by: David Vrabel 
> ---
> include/linux/sunrpc/xprt.h |  1 +
> net/sunrpc/xprt.c   |  4 
> net/sunrpc/xprtsock.c   | 21 +++--
> 3 files changed, 24 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
> index a16070d..621e74b 100644
> --- a/include/linux/sunrpc/xprt.h
> +++ b/include/linux/sunrpc/xprt.h
> @@ -129,6 +129,7 @@ struct rpc_xprt_ops {
>   void(*connect)(struct rpc_xprt *xprt, struct rpc_task 
> *task);
>   void *  (*buf_alloc)(struct rpc_task *task, size_t size);
>   void(*buf_free)(void *buffer);
> + bool(*have_write_space)(struct rpc_xprt *task);
>   int (*send_request)(struct rpc_task *task);
>   void(*set_retrans_timeout)(struct rpc_task *task);
>   void(*timer)(struct rpc_xprt *xprt, struct rpc_task *task);
> diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
> index ea244b2..d3c1b1e 100644
> --- a/net/sunrpc/xprt.c
> +++ b/net/sunrpc/xprt.c
> @@ -502,6 +502,10 @@ void xprt_wait_for_buffer_space(struct rpc_task *task, 
> rpc_action action)
> 
>   task->tk_timeout = RPC_IS_SOFT(task) ? req->rq_timeout : 0;
>   rpc_sleep_on(>pending, task, action);
> +
> + /* Write space notification may race with putting task to sleep. */
> + if (xprt->ops->have_write_space(xprt))
> + rpc_wake_up_queued_task(>pending, task);
> }
> EXPORT_SYMBOL_GPL(xprt_wait_for_buffer_space);
> 
> diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
> index bf16883..211de5b 100644
> --- a/net/sunrpc/xprtsock.c
> +++ b/net/sunrpc/xprtsock.c
> @@ -472,8 +472,6 @@ static int xs_nospace(struct rpc_task *task)
> 
>   spin_unlock_bh(>transport_lock);
> 
> - /* Race breaker in case memory is freed before above code is called */
> - sk->sk_write_space(sk);
>   return ret;
> }

Instead of these callbacks, why not just add a call to 
sk_set_bit(SOCKWQ_ASYNC_WAITDATA, sk) after queueing the task in xs_nospace()? 
Won’t that fix the existing race breaker?

Cheers
  Trond

Re: [PATCH 1/2] NFSv4.1: work around -Wmaybe-uninitialized warning

2016-08-31 Thread Trond Myklebust

> On Aug 31, 2016, at 09:37, Arnd Bergmann <a...@arndb.de> wrote:
> 
> On Wednesday, August 31, 2016 1:17:48 PM CEST Trond Myklebust wrote:
>> What version of gcc are you using? I’m unable to reproduce with gcc 6.1.1..
> 
> This is also on 6.1.1 for ARM. Note that 6e8d666e9253 ("Disable
> "maybe-uninitialized" warning globally") turned off those warnings, so
> unless you explicitly pass -Wmaybe-uninitialized (e.g. by building with
> "make W=1"), you won't get it.
> 

I’m not getting that error on gcc 6.1.1 for x86_64 with either “make W=1” or 
“make W=2”.
“make W=3” does gives rise to one warning in nfs4_slot_get_seqid:

/home/trondmy/devel/kernel/linux/fs/nfs/nfs4session.c: In function 
‘nfs4_slot_get_seqid’:
/home/trondmy/devel/kernel/linux/fs/nfs/nfs4session.c:184:10: warning: 
conversion to ‘int’ from ‘long int’ may alter its value [-Wconversion]
   return PTR_ERR(slot);
  ^

(which is another false positive) but that’s all...

> The reason I'm still sending the patches for this warning is that
> we do get a number of valid ones (this was the only false positive
> out of the seven such warnings since last week).

There is a Zen-like quality to IS_ERR() when it casts a const pointer to an 
unsigned long, back to a non-const pointer, and then back to an unsigned long 
before comparing it to another unsigned long cast constant negative integer. 
However, I’m not sure the C99 standard would agree that a positive test result 
implies we can assume that a simple cast of the same pointer to a signed long 
will result in a negative, non-zero valued errno.

I suspect that if we really want to fix these false negatives, we should 
probably address that issue.

Cheers
  Trond

Re: [PATCH 1/2] NFSv4.1: work around -Wmaybe-uninitialized warning

2016-08-31 Thread Trond Myklebust


> On Aug 31, 2016, at 08:39, Arnd Bergmann  wrote:
> 
> A bugfix introduced a harmless gcc warning in nfs4_slot_seqid_in_use:
> 
> fs/nfs/nfs4session.c:203:54: error: 'cur_seq' may be used uninitialized in 
> this function [-Werror=maybe-uninitialized]
> 
> gcc is not smart enough to conclude that the IS_ERR/PTR_ERR pair
> results in a nonzero return value here. Using PTR_ERR_OR_ZERO()
> instead makes this clear to the compiler.
> 
> Signed-off-by: Arnd Bergmann 
> Fixes: e09c978aae5b ("NFSv4.1: Fix Oopsable condition in server callback 
> races")
> ---
> fs/nfs/nfs4session.c | 10 ++
> 1 file changed, 6 insertions(+), 4 deletions(-)
> 
> The patch that caused this just came in for v4.8-rc5. As the warning
> is now disabled by default and this is harmless, this can probably
> get queued for v4.9 instead.
> 
> I mentioned earlier that I got the new warning for net-next, but
> failed to notice that it had come from mainline instead.
> 
> diff --git a/fs/nfs/nfs4session.c b/fs/nfs/nfs4session.c
> index b62973045a3e..150c5a1879bf 100644
> --- a/fs/nfs/nfs4session.c
> +++ b/fs/nfs/nfs4session.c
> @@ -178,12 +178,14 @@ static int nfs4_slot_get_seqid(struct nfs4_slot_table  
> *tbl, u32 slotid,
>   __must_hold(>slot_tbl_lock)
> {
>   struct nfs4_slot *slot;
> + int ret;
> 
>   slot = nfs4_lookup_slot(tbl, slotid);
> - if (IS_ERR(slot))
> - return PTR_ERR(slot);
> - *seq_nr = slot->seq_nr;
> - return 0;
> + ret = PTR_ERR_OR_ZERO(slot);
> + if (!ret)
> + *seq_nr = slot->seq_nr;
> +
> + return ret;
> }
> 

What version of gcc are you using? I’m unable to reproduce with gcc 6.1.1..

Re: [PATCH 1/2] SUNRPC: accept() may return sockets that are still in SYN_RECV

2016-07-27 Thread Trond Myklebust

Hi Eric,

> On Jul 27, 2016, at 14:59, Eric Dumazet <eric.duma...@gmail.com> wrote:
> 
> On Wed, 2016-07-27 at 14:48 -0400, Fields Bruce James wrote:
>> On Tue, Jul 26, 2016 at 04:08:29PM +, Trond Myklebust wrote:
>>> 
>>>> On Jul 26, 2016, at 11:43, J. Bruce Fields <bfie...@fieldses.org> wrote:
>>>> 
>>>> On Tue, Jul 26, 2016 at 09:51:19AM -0400, Trond Myklebust wrote:
>>>>> We're seeing traces of the following form:
>>>>> 
>>>>> [10952.396347] svc: transport 88042ba4a 000 dequeued, inuse=2
>>>>> [10952.396351] svc: tcp_accept 88042ba4 a000 sock 88042a6e4c80
>>>>> [10952.396362] nfsd: connect from 10.2.6.1, port=187
>>>>> [10952.396364] svc: svc_setup_socket 8800b99bcf00
>>>>> [10952.396368] setting up TCP socket for reading
>>>>> [10952.396370] svc: svc_setup_socket created 8803eb10a000 (inet 
>>>>> 88042b75b800)
>>>>> [10952.396373] svc: transport 8803eb10a000 put into queue
>>>>> [10952.396375] svc: transport 88042ba4a000 put into queue
>>>>> [10952.396377] svc: server 8800bb0ec000 waiting for data (to = 
>>>>> 360)
>>>>> [10952.396380] svc: transport 8803eb10a000 dequeued, inuse=2
>>>>> [10952.396381] svc_recv: found XPT_CLOSE
>>>>> [10952.396397] svc: svc_delete_xprt(8803eb10a000)
>>>>> [10952.396398] svc: svc_tcp_sock_detach(8803eb10a000)
>>>>> [10952.396399] svc: svc_sock_detach(8803eb10a000)
>>>>> [10952.396412] svc: svc_sock_free(8803eb10a000)
>>>>> 
>>>>> i.e. an immediate close of the socket after initialisation.
>>>> 
>>>> Interesting, thanks!
>>>> 
>>>> So the one thing I don't understand is why this is correct behavior for
>>>> accept--I thought it wasn't supposed to return a socket until it was
>>>> fully established.
>>> 
>>> inet_accept() appears to allow SYN_RECV:
>> 
>> OK.  Cc'ing netdev just to make sure we didn't overlook anything.
>> 
> 
> SYN_RECV after accept() is a TCP Fast Open property I think.
> 
> Maybe you are playing with some global TCP Fast Open settings ?
> 

The Linux kernel client should not be using TCP fast open, but it is possible 
that some of the other NFSv3 clients we’re using are.
Would a standard knfsd listener respond to a TCP fast open request, or would 
the default behaviour be to ignore it?

If the default behaviour for the server is to allow fast open, then we do need 
these patches, IMO.

> 
>> (Also: what were user-visible symptoms?  Mounts failing, or unexpected
>> delays?)
>> 

Connection retry storms on the server.

>> --b.
>> 
>>> 
>>> int inet_accept(struct socket *sock, struct socket *newsock, int flags)
>>> {
>>>struct sock *sk1 = sock->sk;
>>>int err = -EINVAL;
>>>struct sock *sk2 = sk1->sk_prot->accept(sk1, flags, );
>>> 
>>>if (!sk2)
>>>goto do_err;
>>> 
>>>lock_sock(sk2);
>>> 
>>>sock_rps_record_flow(sk2);
>>>WARN_ON(!((1 << sk2->sk_state) &
>>>  (TCPF_ESTABLISHED | TCPF_SYN_RECV |
>>>  TCPF_CLOSE_WAIT | TCPF_CLOSE)));
>>> 
>>>sock_graft(sk2, newsock);
>>> 
>>>newsock->state = SS_CONNECTED;
>>>err = 0;
>>>release_sock(sk2);
>>> do_err:
>>>return err;
>>> }
> 
>

Re: It's back! (Re: [REGRESSION] NFS is creating a hidden port (left over from xs_bind() ))

2016-06-30 Thread Trond Myklebust


> On Jun 30, 2016, at 08:59, Steven Rostedt  wrote:
> 
> [ resending as a new email, as I'm assuming people do not sort their
>  INBOX via last email on thread, thus my last email is sitting in the
>  bottom of everyone's INBOX ]
> 
> I've hit this again. Not sure when it started, but I applied my old
> debug trace_printk() patch (attached) and rebooted (4.5.7). I just
> tested the latest kernel from Linus's tree (from last nights pull), and
> it still gives me the problem.
> 
> Here's the trace I have:
> 
>kworker/3:1H-134   [003] ..s.61.036129: inet_csk_get_port: snum 805
>kworker/3:1H-134   [003] ..s.61.036135: 
> => sched_clock
> => inet_addr_type_table
> => security_capable
> => inet_bind
> => xs_bind
> => release_sock
> => sock_setsockopt
> => __sock_create
> => xs_create_sock.isra.19
> => xs_tcp_setup_socket
> => process_one_work
> => worker_thread
> => worker_thread
> => kthread
> => ret_from_fork
> => kthread  
>kworker/3:1H-134   [003] ..s.61.036136: inet_bind_hash: add 805
>kworker/3:1H-134   [003] ..s.61.036138: 
> => inet_csk_get_port
> => sched_clock
> => inet_addr_type_table
> => security_capable
> => inet_bind
> => xs_bind
> => release_sock
> => sock_setsockopt
> => __sock_create
> => xs_create_sock.isra.19
> => xs_tcp_setup_socket
> => process_one_work
> => worker_thread
> => worker_thread
> => kthread
> => ret_from_fork
> => kthread  
>kworker/3:1H-134   [003] 61.036139: xs_bind: RPC:   xs_bind 
> 4.136.255.255:805: ok (0)
>kworker/3:1H-134   [003] 61.036140: xs_tcp_setup_socket: RPC:  
>  worker connecting xprt 880407eca800 via tcp to 192.168.23.22 (port 43651)
>kworker/3:1H-134   [003] 61.036162: xs_tcp_setup_socket: RPC:  
>  880407eca800 connect status 115 connected 0 sock state 2
>  -0 [001] ..s.61.036450: xs_tcp_state_change: RPC:  
>  xs_tcp_state_change client 880407eca800...
>  -0 [001] ..s.61.036452: xs_tcp_state_change: RPC:  
>  state 1 conn 0 dead 0 zapped 1 sk_shutdown 0
>kworker/1:1H-136   [001] 61.036476: xprt_connect_status: RPC:
> 43 xprt_connect_status: retrying
>kworker/1:1H-136   [001] 61.036478: xprt_prepare_transmit: RPC:
> 43 xprt_prepare_transmit
>kworker/1:1H-136   [001] 61.036479: xprt_transmit: RPC:43 
> xprt_transmit(72)
>kworker/1:1H-136   [001] 61.036486: xs_tcp_send_request: RPC:  
>  xs_tcp_send_request(72) = 0
>kworker/1:1H-136   [001] 61.036487: xprt_transmit: RPC:43 xmit 
> complete
>  -0 [001] ..s.61.036789: xs_tcp_data_ready: RPC:   
> xs_tcp_data_ready...
>kworker/1:1H-136   [001] 61.036798: xs_tcp_data_recv: RPC:   
> xs_tcp_data_recv started
>kworker/1:1H-136   [001] 61.036799: xs_tcp_data_recv: RPC:   
> reading TCP record fragment of length 24
>kworker/1:1H-136   [001] 61.036799: xs_tcp_data_recv: RPC:   
> reading XID (4 bytes)
>kworker/1:1H-136   [001] 61.036800: xs_tcp_data_recv: RPC:   
> reading request with XID 2f4c3f88
>kworker/1:1H-136   [001] 61.036800: xs_tcp_data_recv: RPC:   
> reading CALL/REPLY flag (4 bytes)
>kworker/1:1H-136   [001] 61.036801: xs_tcp_data_recv: RPC:   
> read reply XID 2f4c3f88
>kworker/1:1H-136   [001] ..s.61.036801: xs_tcp_data_recv: RPC:   
> XID 2f4c3f88 read 16 bytes
>kworker/1:1H-136   [001] ..s.61.036802: xs_tcp_data_recv: RPC:   
> xprt = 880407eca800, tcp_copied = 24, tcp_offset = 24, tcp_reclen = 24
>kworker/1:1H-136   [001] ..s.61.036802: xprt_complete_rqst: RPC:43 
> xid 2f4c3f88 complete (24 bytes received)
>kworker/1:1H-136   [001] 61.036803: xs_tcp_data_recv: RPC:   
> xs_tcp_data_recv done
>kworker/1:1H-136   [001] 61.036812: xprt_release: RPC:43 
> release request 88040b270800
> 
> 
> # unhide-tcp 
> Unhide-tcp 20130526
> Copyright © 2013 Yago Jesus & Patrick Gouin
> License GPLv3+ : GNU GPL version 3 or later
> http://www.unhide-forensics.info
> Used options: 
> [*]Starting TCP checking
> 
> Found Hidden port that not appears in ss: 805
> 

What is a “Hidden port that not appears in ss: 805”, and what does this report 
mean? Are we failing to close a socket?

Cheers
  Trond

Re: [PATCH 8/8] net: sunrpc: Replace CURRENT_TIME by current_fs_time()

2016-02-22 Thread Trond Myklebust

On Mon, Feb 22, 2016 at 10:17 AM, Deepa Dinamani <deepa.ker...@gmail.com> wrote:
>
> CURRENT_TIME macro is not appropriate for filesystems as it
> doesn't use the right granularity for filesystem timestamps.
> Use current_fs_time() instead.
>
> Signed-off-by: Deepa Dinamani <deepa.ker...@gmail.com>
> Cc: "J. Bruce Fields" <bfie...@fieldses.org>
> Cc: Jeff Layton <jlay...@poochiereds.net>
> Cc: Trond Myklebust <trond.mykleb...@primarydata.com>
> Cc: Anna Schumaker <anna.schuma...@netapp.com>
> Cc: "David S. Miller" <da...@davemloft.net>
> Cc: linux-...@vger.kernel.org
> Cc: netdev@vger.kernel.org
> ---
>  net/sunrpc/rpc_pipe.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/net/sunrpc/rpc_pipe.c b/net/sunrpc/rpc_pipe.c
> index 31789ef..bab3187 100644
> --- a/net/sunrpc/rpc_pipe.c
> +++ b/net/sunrpc/rpc_pipe.c
> @@ -477,7 +477,9 @@ rpc_get_inode(struct super_block *sb, umode_t mode)
> return NULL;
> inode->i_ino = get_next_ino();
> inode->i_mode = mode;
> -   inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
> +   inode->i_atime = current_fs_time(sb);
> +   inode->i_mtime = inode->i_atime;
> +   inode->i_ctime = inode->i_atime;
> switch (mode & S_IFMT) {
> case S_IFDIR:
> inode->i_fop = _dir_operations;

Why would we care? This is a pseudo-fs. There is no expectation w.r.t.
timestamp accuracy or resolution.

Cheers,
  Trond

Re: [PATCH v2] sunrpc: fix waitqueue_active without memory barrier in sunrpc

2015-10-09 Thread Trond Myklebust

On Fri, Oct 9, 2015 at 5:18 PM, J. Bruce Fields  wrote:
>
> On Fri, Oct 09, 2015 at 06:29:44AM +, Kosuke Tatsukawa wrote:
> > Neil Brown wrote:
> > > Kosuke Tatsukawa  writes:
> > >
> > >> There are several places in net/sunrpc/svcsock.c which calls
> > >> waitqueue_active() without calling a memory barrier.  Add a memory
> > >> barrier just as in wq_has_sleeper().
> > >>
> > >> I found this issue when I was looking through the linux source code
> > >> for places calling waitqueue_active() before wake_up*(), but without
> > >> preceding memory barriers, after sending a patch to fix a similar
> > >> issue in drivers/tty/n_tty.c  (Details about the original issue can be
> > >> found here: https://lkml.org/lkml/2015/9/28/849).
> > >
> > > hi,
> > > this feels like the wrong approach to the problem.  It requires extra
> > > 'smb_mb's to be spread around which are hard to understand as easy to
> > > forget.
> > >
> > > A quick look seems to suggest that (nearly) every waitqueue_active()
> > > will need an smb_mb.  Could we just put the smb_mb() inside
> > > waitqueue_active()??
> > 
> >
> > There are around 200 occurrences of waitqueue_active() in the kernel
> > source, and most of the places which use it before wake_up are either
> > protected by some spin lock, or already has a memory barrier or some
> > kind of atomic operation before it.
> >
> > Simply adding smp_mb() to waitqueue_active() would incur extra cost in
> > many cases and won't be a good idea.
> >
> > Another way to solve this problem is to remove the waitqueue_active(),
> > making the code look like this;
> >   if (wq)
> >   wake_up_interruptible(wq);
> > This also fixes the problem because the spinlock in the wake_up*() acts
> > as a memory barrier and prevents the code from being reordered by the
> > CPU (and it also makes the resulting code is much simpler).
>
> I might not care which we did, except I don't have the means to test
> this quickly, and I guess this is some of our most frequently called
> code.
>
> I suppose your patch is the most conservative approach, as the
> alternative is a spinlock/unlock in wake_up_interruptible, which I
> assume is necessarily more expensive than an smp_mb().
>
> As far as I can tell it's been this way since forever.  (Well, since a
> 2002 patch "NFSD: TCP: rationalise locking in RPC server routines" which
> removed some spinlocks from the data_ready routines.)
>
> I don't understand what the actual race is yet (which code exactly is
> missing the wakeup in this case?  nfsd threads seem to instead get
> woken up by the wake_up_process() in svc_xprt_do_enqueue().)
>

Those threads still use blocking calls for sendpage() and sendmsg(),
so presumably they may be affected.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Race with ip=dhcp bootparameter in ip_rcv_finish on am335x

2015-09-23 Thread Trond Myklebust

+linux-nfs mailing list

On Wed, Sep 23, 2015 at 8:27 AM, Alexander Aring <alex.ar...@gmail.com> wrote:
> Hi,
>
> On Wed, Sep 23, 2015 at 01:57:57PM +0200, Alexander Aring wrote:
> ...
>> >
>>
>> Ok, I think I have two issues with two different races the first one was
>> fixed by bde6f9ded1bd ("net: Initialize table in fib result"), but the
>> second one is still there:
>>
>> [8.615806] [ cut here ]
>> [8.620678] Kernel BUG at c016c3d0 [verbose debug info unavailable]
>> [8.627229] Internal error: Oops - BUG: 0 [#1] SMP ARM
>> [8.632611] Modules linked in:
>> [8.635836] CPU: 0 PID: 766 Comm: kworker/0:1H Tainted: GW   
>> 4.2.0-11248-gfbd0351 #140
>> [8.645208] Hardware name: Generic AM33XX (Flattened Device Tree)
>> [8.651616] Workqueue: rpciod xprt_autoclose
>> [8.656091] task: ce3c52c0 ti: ce642000 task.ti: ce642000
>> [8.661744] PC is at iput+0x1a8/0x1f0
>> [8.665579] LR is at xprt_autoclose+0x2c/0x54
>> [8.670136] pc : []lr : []psr: 2113
>> [8.670136] sp : ce643e80  ip :   fp : c0b56688
>> [8.682133] r10: 0001  r9 : ce643ec8  r8 : 
>> [8.687599] r7 : feff3000  r6 : ce615800  r5 : ce615bc0  r4 : ce615b54
>> [8.694421] r3 : 0060  r2 : 000f  r1 : 0f10e000  r0 : cdbed720
>> [8.701254] Flags: nzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment 
>> none
>> [8.708718] Control: 10c5387d  Table: 80004019  DAC: 0051
>> [8.714732] Process kworker/0:1H (pid: 766, stack limit = 0xce642218)
>> [8.721464] Stack: (0xce643e80 to 0xce644000)
>> [8.726033] 3e80: c066f828 ce615b54 ce615bc0 ce615800 feff3000  
>> ce643ec8 c066c884
>> [8.734596] 3ea0: ce615b54 ce5ff440 cfb9e340 c0057928 0001  
>> c00578b4 cfb9e340
>> [8.743152] 3ec0: c0057cc8  c137972c c0cc1960  c09979f4 
>> cfb9e340 cfb9e340
>> [8.751714] 3ee0: ce5ff458 cfb9e370 ce642000 0008 c0b55ba0 ce5ff440 
>> cfb9e340 c0057c54
>> [8.760274] 3f00: ce659940 ce5ff440 c0057c18  ce659940 ce5ff440 
>> c0057c18 
>> [8.768834] 3f20:    c005d918 c0b5697c  
>>  ce5ff440
>> [8.777390] 3f40:   dead4ead   c0b65d60 
>>  
>> [8.785951] 3f60: c0922088 ce643f64 ce643f64   dead4ead 
>>  
>> [8.794513] 3f80: c0b65d60   c0922088 ce643f90 ce643f90 
>> ce643fac ce659940
>> [8.803069] 3fa0: c005d844   c000f770   
>>  
>> [8.811628] 3fc0:       
>>  
>> [8.820185] 3fe0:     0013  
>> 8fdf6861 8fdf6c61
>> [8.828741] [] (iput) from [] 
>> (xprt_autoclose+0x2c/0x54)
>> [8.836133] [] (xprt_autoclose) from [] 
>> (process_one_work+0x19c/0x48c)
>> [8.844784] [] (process_one_work) from [] 
>> (worker_thread+0x3c/0x4a0)
>> [8.853256] [] (worker_thread) from [] 
>> (kthread+0xd4/0xf0)
>> [8.860827] [] (kthread) from [] 
>> (ret_from_fork+0x14/0x24)
>> [8.868387] Code: e59f0044 e59f1044 ebfb467a eac1 (e7f001f2)
>
> Additional missing information is that I am booting via nfsroot and
> xprt_autoclose is something from sunrpc.
>
> Finally I figured out that commit
> 4876cc779ff525b9c2376d8076edf47815e71f2c ("SUNRPC: Ensure we release the
> TCP socket once it has been closed") occur this races. After reverting
> this commit everything works fine.
>
> I added now:
>
> Steven Rostedt <rost...@goodmis.org>
> Trond Myklebust <trond.mykleb...@primarydata.com>
>
> to cc to report about this issue.
>

Is that happening when the transport is being torn down? If so, is it
fixed by 
http://git.linux-nfs.org/?p=trondmy/linux-nfs.git;a=commitdiff;h=79234c3db6842a3de03817211d891e0c2878f756
?
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: NFS/TCP/IPv6 acting strangely in 4.2

2015-09-17 Thread Trond Myklebust

Hi Russell,

On Thu, 2015-09-17 at 14:57 +0100, Russell King - ARM Linux wrote:
> On Fri, Sep 11, 2015 at 05:49:38PM +0100, Russell King - ARM Linux
> wrote:
> > Following that idea, I just tried the patch below, and it seems to
> > work.
> > I don't know whether it handles all cases after a call to
> > kernel_connect(),
> > but it stops the multiple connection attempts:
> > 
> >   1   0.00 armada388 -> n2100 TCP 1009→nfs [SYN] Seq=3794066539
> > Win=28560 Len=0 MSS=1440 SACK_PERM=1 TSval=15712 TSecr=870317691
> > WS=128
> >   2   0.000414 n2100 -> armada388 TCP nfs→1009 [SYN, ACK]
> > Seq=1884476522 Ack=3794066540 Win=28560 Len=0 MSS=1440 SACK_PERM=1
> > TSval=870318939 TSecr=15712 WS=64
> >   3   0.000787 armada388 -> n2100 TCP 1009→nfs [ACK] Seq=3794066540
> > Ack=1884476523 Win=28672 Len=0 TSval=15712 TSecr=870318939
> >   4   0.001304 armada388 -> n2100 NFS V3 ACCESS Call, FH:
> > 0x905379cc, [Check: RD LU MD XT DL]
> >   5   0.001566 n2100 -> armada388 TCP nfs→1009 [ACK] Seq=1884476523
> > Ack=379400 Win=28608 Len=0 TSval=870318939 TSecr=15712
> >   6   0.001640 armada388 -> n2100 NFS V3 ACCESS Call, FH:
> > 0x905379cc, [Check: RD LU MD XT DL]
> >   7   0.001866 n2100 -> armada388 TCP nfs→1009 [ACK] Seq=1884476523
> > Ack=3794066780 Win=28608 Len=0 TSval=870318939 TSecr=15712
> >   8   0.003070 n2100 -> armada388 NFS V3 ACCESS Reply (Call In 4),
> > [Allowed: RD LU MD XT DL]
> >   9   0.003415 armada388 -> n2100 TCP 1009→nfs [ACK] Seq=3794066780
> > Ack=1884476647 Win=28672 Len=0 TSval=15712 TSecr=870318939
> >  10   0.003592 armada388 -> n2100 NFS V3 ACCESS Call, FH:
> > 0xe15fc9c9, [Check: RD LU MD XT DL]
> >  11   0.004354 n2100 -> armada388 NFS V3 ACCESS Reply (Call In 6),
> > [Allowed: RD LU MD XT DL]
> >  12   0.004682 armada388 -> n2100 NFS V3 ACCESS Call, FH:
> > 0xe15fc9c9, [Check: RD LU MD XT DL]
> >  13   0.005365 n2100 -> armada388 NFS V3 ACCESS Reply (Call In 10),
> > [Allowed: RD LU MD XT DL]
> >  14   0.005701 armada388 -> n2100 NFS V3 GETATTR Call, FH:
> > 0xe15fc9c9
> > ...
> 
> NFS people - any comments on this patch?  Is it the correct way to
> solve
> this problem (please see the first message in this thread for the
> problem.)
> Without this patch, NFS is unusable as it tries to launch multiple
> new
> connections from the same port to the NFS server without giving the
> NFS
> server time to respond and establish the TCP connection.

I agree that it addresses a real problem here, however there are a
couple of issues with the patch itself:

AFAICS, the 2 possible next states for SYN_SENT are TCP_ESTABLISHED and
TCP_CLOSE, so if the connection attempt fails, this patch leaves the
XPRT_CONNECTING flag set.
There is also the issue that clearing XPRT_CONNECTING in TCP_FIN_WAIT1,
TCP_CLOSE_WAIT and TCP_CLOSING could interfere with another connection
attempt by canceling the XPRT_CONNECTING state.

How about the following? It is based on your patch, but adds a check to
ensure that xs_tcp_state_change() doesn't clear the 'connecting' state
more than once (which could otherwise still happen in the TCP_CLOSE
case).

8<---
>From 4dbfdebbc09982a9248866f8256549456e2b2efd Mon Sep 17 00:00:00 2001
From: Trond Myklebust <trond.mykleb...@primarydata.com>
Date: Wed, 16 Sep 2015 23:43:17 -0400
Subject: [PATCH] SUNRPC: Ensure that we wait for connections to complete
 before retrying

Commit 718ba5b87343, moved the responsibility for unlocking the socket to
xs_tcp_setup_socket, meaning that the socket will be unlocked before we
know that it has finished trying to connect. The following patch is based on
an initial patch by Russell King to ensure that we delay clearing the
XPRT_SOCK_CONNECTING flag until we either know that we failed to initiate
a connection attempt, or the connection attempt itself failed.

Fixes: 718ba5b87343 ("SUNRPC: Add helpers to prevent socket create from racing")
Reported-by: Russell King <li...@arm.linux.org.uk>
Signed-off-by: Trond Myklebust <trond.mykleb...@primarydata.com>
---
 include/linux/sunrpc/xprtsock.h |  3 +++
 net/sunrpc/xprtsock.c   | 11 ---
 2 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/include/linux/sunrpc/xprtsock.h b/include/linux/sunrpc/xprtsock.h
index 7591788e9fbf..357e44c1a46b 100644
--- a/include/linux/sunrpc/xprtsock.h
+++ b/include/linux/sunrpc/xprtsock.h
@@ -42,6 +42,7 @@ struct sock_xprt {
/*
 * Connection of transports
 */
+   unsigned long   sock_state;
struct delayed_work connect_worker;
struct sockaddr_storage srcaddr;
unsigned short  srcport;

Re: NFS/TCP/IPv6 acting strangely in 4.2

2015-09-17 Thread Trond Myklebust

On Thu, Sep 17, 2015 at 5:47 PM, Russell King - ARM Linux
<li...@arm.linux.org.uk> wrote:
> On Thu, Sep 17, 2015 at 10:18:29AM -0400, Trond Myklebust wrote:
>> Hi Russell,
>>
>> On Thu, 2015-09-17 at 14:57 +0100, Russell King - ARM Linux wrote:
>> > On Fri, Sep 11, 2015 at 05:49:38PM +0100, Russell King - ARM Linux
>> > wrote:
>> > > Following that idea, I just tried the patch below, and it seems to
>> > > work.
>> > > I don't know whether it handles all cases after a call to
>> > > kernel_connect(),
>> > > but it stops the multiple connection attempts:
>> > >
>> > >   1   0.00 armada388 -> n2100 TCP 1009→nfs [SYN] Seq=3794066539
>> > > Win=28560 Len=0 MSS=1440 SACK_PERM=1 TSval=15712 TSecr=870317691
>> > > WS=128
>> > >   2   0.000414 n2100 -> armada388 TCP nfs→1009 [SYN, ACK]
>> > > Seq=1884476522 Ack=3794066540 Win=28560 Len=0 MSS=1440 SACK_PERM=1
>> > > TSval=870318939 TSecr=15712 WS=64
>> > >   3   0.000787 armada388 -> n2100 TCP 1009→nfs [ACK] Seq=3794066540
>> > > Ack=1884476523 Win=28672 Len=0 TSval=15712 TSecr=870318939
>> > >   4   0.001304 armada388 -> n2100 NFS V3 ACCESS Call, FH:
>> > > 0x905379cc, [Check: RD LU MD XT DL]
>> > >   5   0.001566 n2100 -> armada388 TCP nfs→1009 [ACK] Seq=1884476523
>> > > Ack=379400 Win=28608 Len=0 TSval=870318939 TSecr=15712
>> > >   6   0.001640 armada388 -> n2100 NFS V3 ACCESS Call, FH:
>> > > 0x905379cc, [Check: RD LU MD XT DL]
>> > >   7   0.001866 n2100 -> armada388 TCP nfs→1009 [ACK] Seq=1884476523
>> > > Ack=3794066780 Win=28608 Len=0 TSval=870318939 TSecr=15712
>> > >   8   0.003070 n2100 -> armada388 NFS V3 ACCESS Reply (Call In 4),
>> > > [Allowed: RD LU MD XT DL]
>> > >   9   0.003415 armada388 -> n2100 TCP 1009→nfs [ACK] Seq=3794066780
>> > > Ack=1884476647 Win=28672 Len=0 TSval=15712 TSecr=870318939
>> > >  10   0.003592 armada388 -> n2100 NFS V3 ACCESS Call, FH:
>> > > 0xe15fc9c9, [Check: RD LU MD XT DL]
>> > >  11   0.004354 n2100 -> armada388 NFS V3 ACCESS Reply (Call In 6),
>> > > [Allowed: RD LU MD XT DL]
>> > >  12   0.004682 armada388 -> n2100 NFS V3 ACCESS Call, FH:
>> > > 0xe15fc9c9, [Check: RD LU MD XT DL]
>> > >  13   0.005365 n2100 -> armada388 NFS V3 ACCESS Reply (Call In 10),
>> > > [Allowed: RD LU MD XT DL]
>> > >  14   0.005701 armada388 -> n2100 NFS V3 GETATTR Call, FH:
>> > > 0xe15fc9c9
>> > > ...
>> >
>> > NFS people - any comments on this patch?  Is it the correct way to
>> > solve
>> > this problem (please see the first message in this thread for the
>> > problem.)
>> > Without this patch, NFS is unusable as it tries to launch multiple
>> > new
>> > connections from the same port to the NFS server without giving the
>> > NFS
>> > server time to respond and establish the TCP connection.
>>
>> I agree that it addresses a real problem here, however there are a
>> couple of issues with the patch itself:
>>
>> AFAICS, the 2 possible next states for SYN_SENT are TCP_ESTABLISHED and
>> TCP_CLOSE, so if the connection attempt fails, this patch leaves the
>> XPRT_CONNECTING flag set.
>> There is also the issue that clearing XPRT_CONNECTING in TCP_FIN_WAIT1,
>> TCP_CLOSE_WAIT and TCP_CLOSING could interfere with another connection
>> attempt by canceling the XPRT_CONNECTING state.
>>
>> How about the following? It is based on your patch, but adds a check to
>> ensure that xs_tcp_state_change() doesn't clear the 'connecting' state
>> more than once (which could otherwise still happen in the TCP_CLOSE
>> case).
>
> This patch also seems to fix the problem I've been seeing.
>
> Yes, I wasn't sure about my patch - I didn't spend much time properly
> reading and understanding the sunrpc code, beyond analysing what was
> going on to cause the problem and deciding on a way to stop it happening.
> I really wasn't sure that clearing the connecting flag everywhere I did
> was the right thing, which is why I didn't send the patch properly
> dressed up.
>
>> 8<---
>> >From 4dbfdebbc09982a9248866f8256549456e2b2efd Mon Sep 17 00:00:00 2001
>> From: Trond Myklebust <trond.mykleb...@primarydata.com>
>> Date: Wed, 16 Sep 2015 23:43:17 -0400
>> Subject: [PATCH] SUNRPC: Ensure that we wait for connections to complete

Re: NFS/TCP/IPv6 acting strangely in 4.2

2015-09-17 Thread Trond Myklebust

On Thu, Sep 17, 2015 at 12:27 PM, Benjamin Coddington
<bcodd...@redhat.com> wrote:
> On Thu, 17 Sep 2015, Trond Myklebust wrote:
>
>> Hi Russell,
>>
>> On Thu, 2015-09-17 at 14:57 +0100, Russell King - ARM Linux wrote:
>> > On Fri, Sep 11, 2015 at 05:49:38PM +0100, Russell King - ARM Linux
>> > wrote:
>> > > Following that idea, I just tried the patch below, and it seems to
>> > > work.
>> > > I don't know whether it handles all cases after a call to
>> > > kernel_connect(),
>> > > but it stops the multiple connection attempts:
>> > >
>> > >   1   0.00 armada388 -> n2100 TCP 1009→nfs [SYN] Seq=3794066539
>> > > Win=28560 Len=0 MSS=1440 SACK_PERM=1 TSval=15712 TSecr=870317691
>> > > WS=128
>> > >   2   0.000414 n2100 -> armada388 TCP nfs→1009 [SYN, ACK]
>> > > Seq=1884476522 Ack=3794066540 Win=28560 Len=0 MSS=1440 SACK_PERM=1
>> > > TSval=870318939 TSecr=15712 WS=64
>> > >   3   0.000787 armada388 -> n2100 TCP 1009→nfs [ACK] Seq=3794066540
>> > > Ack=1884476523 Win=28672 Len=0 TSval=15712 TSecr=870318939
>> > >   4   0.001304 armada388 -> n2100 NFS V3 ACCESS Call, FH:
>> > > 0x905379cc, [Check: RD LU MD XT DL]
>> > >   5   0.001566 n2100 -> armada388 TCP nfs→1009 [ACK] Seq=1884476523
>> > > Ack=379400 Win=28608 Len=0 TSval=870318939 TSecr=15712
>> > >   6   0.001640 armada388 -> n2100 NFS V3 ACCESS Call, FH:
>> > > 0x905379cc, [Check: RD LU MD XT DL]
>> > >   7   0.001866 n2100 -> armada388 TCP nfs→1009 [ACK] Seq=1884476523
>> > > Ack=3794066780 Win=28608 Len=0 TSval=870318939 TSecr=15712
>> > >   8   0.003070 n2100 -> armada388 NFS V3 ACCESS Reply (Call In 4),
>> > > [Allowed: RD LU MD XT DL]
>> > >   9   0.003415 armada388 -> n2100 TCP 1009→nfs [ACK] Seq=3794066780
>> > > Ack=1884476647 Win=28672 Len=0 TSval=15712 TSecr=870318939
>> > >  10   0.003592 armada388 -> n2100 NFS V3 ACCESS Call, FH:
>> > > 0xe15fc9c9, [Check: RD LU MD XT DL]
>> > >  11   0.004354 n2100 -> armada388 NFS V3 ACCESS Reply (Call In 6),
>> > > [Allowed: RD LU MD XT DL]
>> > >  12   0.004682 armada388 -> n2100 NFS V3 ACCESS Call, FH:
>> > > 0xe15fc9c9, [Check: RD LU MD XT DL]
>> > >  13   0.005365 n2100 -> armada388 NFS V3 ACCESS Reply (Call In 10),
>> > > [Allowed: RD LU MD XT DL]
>> > >  14   0.005701 armada388 -> n2100 NFS V3 GETATTR Call, FH:
>> > > 0xe15fc9c9
>> > > ...
>> >
>> > NFS people - any comments on this patch?  Is it the correct way to
>> > solve
>> > this problem (please see the first message in this thread for the
>> > problem.)
>> > Without this patch, NFS is unusable as it tries to launch multiple
>> > new
>> > connections from the same port to the NFS server without giving the
>> > NFS
>> > server time to respond and establish the TCP connection.
>>
>> I agree that it addresses a real problem here, however there are a
>> couple of issues with the patch itself:
>>
>> AFAICS, the 2 possible next states for SYN_SENT are TCP_ESTABLISHED and
>> TCP_CLOSE, so if the connection attempt fails, this patch leaves the
>> XPRT_CONNECTING flag set.
>> There is also the issue that clearing XPRT_CONNECTING in TCP_FIN_WAIT1,
>> TCP_CLOSE_WAIT and TCP_CLOSING could interfere with another connection
>> attempt by canceling the XPRT_CONNECTING state.
>>
>> How about the following? It is based on your patch, but adds a check to
>> ensure that xs_tcp_state_change() doesn't clear the 'connecting' state
>> more than once (which could otherwise still happen in the TCP_CLOSE
>> case).
>>
>> 8<---
>> From 4dbfdebbc09982a9248866f8256549456e2b2efd Mon Sep 17 00:00:00 2001
>> From: Trond Myklebust <trond.mykleb...@primarydata.com>
>> Date: Wed, 16 Sep 2015 23:43:17 -0400
>> Subject: [PATCH] SUNRPC: Ensure that we wait for connections to complete
>>  before retrying
>>
>> Commit 718ba5b87343, moved the responsibility for unlocking the socket to
>> xs_tcp_setup_socket, meaning that the socket will be unlocked before we
>> know that it has finished trying to connect. The following patch is based on
>> an initial patch by Russell King to ensure that we delay clearing the
>> XPRT_SOCK_CONNECTING flag until we either know that we failed to initiate
>> a connection attempt, or the connection attempt itself failed.
>>
>> Fixes: 718ba5b87343 ("SUNRPC: Add helpers to prevent socket create from 
>> racing")
>> Reported-by: Russell King <li...@arm.linux.org.uk>
>> Signed-off-by: Trond Myklebust <trond.mykleb...@primarydata.com>
>
> This fixes up my network segmentation problem, tested on top of your "Fix
> races between socket connection and destroy code".
>
> Tested-by: Benjamin Coddington <bcodd...@redhat.com>
>

Thanks Ben!
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 14/39] SUNRPC: drop null test before destroy functions

2015-09-14 Thread Trond Myklebust

On Mon, Sep 14, 2015 at 12:07 PM, J. Bruce Fields  wrote:
> ACK, but assuming Trond takes this one.--b.

No problem. I'll pick it up...

Cheers
  Trond

> On Sun, Sep 13, 2015 at 02:15:07PM +0200, Julia Lawall wrote:
>> Remove unneeded NULL test.
>>
>> The semantic patch that makes this change is as follows:
>> (http://coccinelle.lip6.fr/)
>>
>> // 
>> @@ expression x; @@
>> -if (x != NULL)
>>   \(kmem_cache_destroy\|mempool_destroy\|dma_pool_destroy\)(x);
>> // 
>>
>> Signed-off-by: Julia Lawall 
>>
>> ---
>>  net/sunrpc/sched.c |   12 
>>  1 file changed, 4 insertions(+), 8 deletions(-)
>>
>> diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
>> index b140c09..425ca2f 100644
>> --- a/net/sunrpc/sched.c
>> +++ b/net/sunrpc/sched.c
>> @@ -1092,14 +1092,10 @@ void
>>  rpc_destroy_mempool(void)
>>  {
>>   rpciod_stop();
>> - if (rpc_buffer_mempool)
>> - mempool_destroy(rpc_buffer_mempool);
>> - if (rpc_task_mempool)
>> - mempool_destroy(rpc_task_mempool);
>> - if (rpc_task_slabp)
>> - kmem_cache_destroy(rpc_task_slabp);
>> - if (rpc_buffer_slabp)
>> - kmem_cache_destroy(rpc_buffer_slabp);
>> + mempool_destroy(rpc_buffer_mempool);
>> + mempool_destroy(rpc_task_mempool);
>> + kmem_cache_destroy(rpc_task_slabp);
>> + kmem_cache_destroy(rpc_buffer_slabp);
>>   rpc_destroy_wait_queue(_queue);
>>  }
>>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Global protection fault in netfilter code

2015-08-14 Thread Trond Myklebust

Hi,

When doing NFS stress tests in a VM with a recent kernel (yesterday's
commit 7ddab73346a1 Merge branch 'fixes' of git
://ftp.arm.linux.org.uk/~rmk/linux-arm), I've been seeing the
following General Protection Fault code apparently in the nf_conntrack
code:


PID: 358TASK: 88003630cb80  CPU: 0   COMMAND: kworker/0:1H
 #0 [88013a603680] die at 81007608
 #1 [88013a6036b0] do_general_protection at 8100407a
 #2 [88013a6036e0] general_protection at 817888c8
[exception RIP: detach_if_pending+103]
RIP: 81101b37  RSP: 88013a603798  RFLAGS: 00010086
RAX: dead00200200  RBX: 8800b9771bd8  RCX: 000f
RDX: 88013a60e818  RSI: 88013a60d980  RDI: 0046
RBP: 88013a6037b8   R8:    R9: 0001
R10: 88013a60d998  R11: 0001  R12: 8800b9771bd8
R13: 88013a60d980  R14:   R15: 0001
ORIG_RAX:   CS: 0010  SS: 0018
 #3 [88013a6037c0] mod_timer_pending at 81101fd2
 #4 [88013a603820] __nf_ct_refresh_acct at a055891b [nf_conntrack]
 #5 [88013a603850] tcp_packet at a056232e [nf_conntrack]
 #6 [88013a603970] nf_conntrack_in at a055b70a [nf_conntrack]
 #7 [88013a603a40] ipv4_conntrack_in at a0576326 [nf_conntrack_ipv4]
 #8 [88013a603a50] nf_iterate at 81688dad
 #9 [88013a603aa0] nf_hook_slow at 81688e42
#10 [88013a603af0] ip_rcv at 81695ca3
#11 [88013a603b60] __netif_receive_skb_core at 8164d688
#12 [88013a603c00] __netif_receive_skb at 8164e108
#13 [88013a603c20] netif_receive_skb_internal at 8164f7f6
#14 [88013a603c60] napi_gro_complete at 8164fbf7
#15 [88013a603cb0] dev_gro_receive at 81650508
#16 [88013a603d20] napi_gro_receive at 81650a6b
#17 [88013a603d50] e1000_clean_rx_irq at a00308db [e1000]
#18 [88013a603e00] e1000_clean at a0030f3d [e1000]
#19 [88013a603ec0] net_rx_action at 8164ffda
#20 [88013a603f40] __do_softirq at 81087a18
#21 [88013a603fb0] do_softirq_own_stack at 8178875c
--- IRQ stack ---
#22 [880035fabaf0] do_softirq_own_stack at 8178875c
[exception RIP: unknown or invalid address]
RIP: 88007d7a6108  RSP: 88007d7a60d0  RFLAGS: a02827e5
RAX: 810868a9  RBX: 880035fabb38  RCX: 88007d7a6108
RDX: 880136e56a00  RSI: 880035fabb78  RDI: 81786209
RBP: 810d662d   R8: 880035fabb58   R9: fe00
R10: 0046  R11: 810867e5  R12: 0046
R13: 810ce565  R14: 880035fabb18  R15: 
ORIG_RAX: 8800a4408840  CS: 880035fabbc8  SS: 0001
WARNING: possibly bogus exception frame
#23 [880035fabbd0] nfs41_wake_and_assign_slot at a06c7a9d [nfsv4]
#24 [880035fabbe0] nfs41_sequence_done at a069c7c0 [nfsv4]
#25 [880035fabc30] nfs4_sequence_done at a069caaf [nfsv4]
#26 [880035fabc40] nfs4_read_done at a06a2c0e [nfsv4]
#27 [880035fabc60] nfs_readpage_done at a0654736 [nfs]
#28 [880035fabc90] nfs_pgio_result at a0653414 [nfs]
#29 [880035fabcc0] rpc_exit_task at a027f10c [sunrpc]
#30 [880035fabce0] __rpc_execute at a02820dd [sunrpc]
#31 [880035fabd60] rpc_async_schedule at a0282725 [sunrpc]
#32 [880035fabd70] process_one_work at 8109fe89
#33 [880035fabdf0] worker_thread at 810a04ae
#34 [880035fabe60] kthread at 810a6aef
#35 [880035fabf50] ret_from_fork at 81786f5f

I do not see that in vanilla Linux-4.1, so it seems to be a 4.2 cycle
thing.

Is anyone else seeing this, and is it being looked at by the netfilter
folks?

-- 
Trond Myklebust
Linux NFS client maintainer, PrimaryData
trond.mykleb...@primarydata.com


--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [REGRESSION] NFS is creating a hidden port (left over from xs_bind() )

2015-06-19 Thread Trond Myklebust

On Fri, 2015-06-19 at 15:52 -0400, Jeff Layton wrote:
 On Fri, 19 Jun 2015 13:39:08 -0400
 Trond Myklebust trond.mykleb...@primarydata.com wrote:
 
  On Fri, Jun 19, 2015 at 1:17 PM, Steven Rostedt 
  rost...@goodmis.org wrote:
   On Fri, 19 Jun 2015 12:25:53 -0400
   Steven Rostedt rost...@goodmis.org wrote:
   
   
I don't see that 55201 anywhere. But then again, I didn't look 
for it
before the port disappeared. I could reboot and look for it 
again. I
should have saved the full netstat -tapn as well :-/
   
   Of course I didn't find it anywhere, that's the port on my wife's 
   box
   that port 947 was connected to.
   
   Now I even went over to my wife's box and ran
   
# rpcinfo -p localhost
  program vers proto   port  service
   104   tcp111  portmapper
   103   tcp111  portmapper
   102   tcp111  portmapper
   104   udp111  portmapper
   103   udp111  portmapper
   102   udp111  portmapper
   1000241   udp  34243  status
   1000241   tcp  34498  status
   
   which doesn't show anything.
   
   but something is listening to that port...
   
# netstat -ntap |grep 55201
   tcp0  0 0.0.0.0:55201   0.0.0.0:*
  LISTEN
  
  
  Hang on. This is on the client box while there is an active NFSv4
  mount? Then that's probably the NFSv4 callback channel listening 
  for
  delegation callbacks.
  
  Can you please try:
  
  echo options nfs callback_tcpport=4048  /etc/modprobe.d/nfs
  -local.conf
  
  and then either reboot the client or unload and then reload the nfs
  modules before reattempting the mount. If this is indeed the 
  callback
  channel, then that will move your phantom listener to port 4048...
  
 
 Right, it was a little unclear to me before, but it now seems clear
 that the callback socket that the server is opening to the client is
 the one squatting on the port.
 
 ...and that sort of makes sense, doesn't it? That rpc_clnt will stick
 around for the life of the client's lease, and the rpc_clnt binds to 
 a
 particular port so that it can reconnect using the same one.
 
 Given that Stephen has done the legwork and figured out that 
 reverting
 those commits fixes the issue, then I suspect that the real culprit 
 is
 caf4ccd4e88cf2.
 
 The client is likely closing down the other end of the callback
 socket when it goes idle. Before that commit, we probably did an
 xs_close on it, but now we're doing a xs_tcp_shutdown and that leaves
 the port bound.
 

Agreed. I've been looking into whether or not there is a simple fix.
Reverting those patches is not an option, because the whole point was
to ensure that the socket is in the TCP_CLOSED state before we release
the socket.

Steven, how about something like the following patch?

8-
From 9a0bcfdbdbc793eae1ed6d901a6396b6c66f9513 Mon Sep 17 00:00:00 2001
From: Trond Myklebust trond.mykleb...@primarydata.com
Date: Fri, 19 Jun 2015 16:17:57 -0400
Subject: [PATCH] SUNRPC: Ensure we release the TCP socket once it has been
 closed

This fixes a regression introduced by commit caf4ccd4e88cf2 (SUNRPC:
Make xs_tcp_close() do a socket shutdown rather than a sock_release).
Prior to that commit, the autoclose feature would ensure that an
idle connection would result in the socket being both disconnected and
released, whereas now only gets disconnected.

While the current behaviour is harmless, it does leave the port bound
until either RPC traffic resumes or the RPC client is shut down.

Reported-by: Steven Rostedt rost...@goodmis.org
Signed-off-by: Trond Myklebust trond.mykleb...@primarydata.com
---
 net/sunrpc/xprt.c | 2 +-
 net/sunrpc/xprtsock.c | 8 ++--
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
index 3ca31f20b97c..ab5dd621ae0c 100644
--- a/net/sunrpc/xprt.c
+++ b/net/sunrpc/xprt.c
@@ -611,8 +611,8 @@ static void xprt_autoclose(struct work_struct *work)
struct rpc_xprt *xprt =
container_of(work, struct rpc_xprt, task_cleanup);
 
-   xprt-ops-close(xprt);
clear_bit(XPRT_CLOSE_WAIT, xprt-state);
+   xprt-ops-close(xprt);
xprt_release_write(xprt, NULL);
 }
 
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index fda8ec8c74c0..75dcdadf0269 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -634,10 +634,13 @@ static void xs_tcp_shutdown(struct rpc_xprt *xprt)
struct sock_xprt *transport = container_of(xprt, struct sock_xprt, 
xprt);
struct socket *sock = transport-sock;
 
-   if (sock != NULL) {
+   if (sock == NULL)
+   return;
+   if (xprt_connected(xprt)) {
kernel_sock_shutdown(sock, SHUT_RDWR);
trace_rpc_socket_shutdown(xprt, sock);
-   }
+   } else
+   xs_reset_transport(transport

Re: [REGRESSION] NFS is creating a hidden port (left over from xs_bind() )

2015-06-19 Thread Trond Myklebust

On Fri, 2015-06-19 at 18:14 -0400, Steven Rostedt wrote:
 On Fri, 19 Jun 2015 16:30:18 -0400
 Trond Myklebust trond.mykleb...@primarydata.com wrote:
 
  Steven, how about something like the following patch?
  
 
 OK, the box I'm running this on is using v4.0.5, can you make a patch
 based on that, as whatever you make needs to go to stable as well.

Is it causing any other damage than the rkhunter warning you reported?

 distcc[31554] ERROR: compile /home/rostedt/work/git/nobackup/linux
 -build.git/net/sunrpc/xprtsock.c on fedora/8 failed
 distcc[31554] (dcc_build_somewhere) Warning: remote compilation of 
 '/home/rostedt/work/git/nobackup/linux
 -build.git/net/sunrpc/xprtsock.c' failed, retrying locally
 distcc[31554] Warning: failed to distribute 
 /home/rostedt/work/git/nobackup/linux-build.git/net/sunrpc/xprtsock.c 
 to fedora/8, running locally instead
 /home/rostedt/work/git/nobackup/linux
 -build.git/net/sunrpc/xprtsock.c: In function 'xs_tcp_shutdown':
 /home/rostedt/work/git/nobackup/linux
 -build.git/net/sunrpc/xprtsock.c:643:3: error: implicit declaration 
 of function 'xs_reset_transport' [-Werror=implicit-function
 -declaration]
 /home/rostedt/work/git/nobackup/linux
 -build.git/net/sunrpc/xprtsock.c: At top level:
 /home/rostedt/work/git/nobackup/linux
 -build.git/net/sunrpc/xprtsock.c:825:13: warning: conflicting types 
 for 'xs_reset_transport' [enabled by default]
 /home/rostedt/work/git/nobackup/linux
 -build.git/net/sunrpc/xprtsock.c:825:13: error: static declaration of 
 'xs_reset_transport' follows non-static declaration
 /home/rostedt/work/git/nobackup/linux
 -build.git/net/sunrpc/xprtsock.c:643:3: note: previous implicit 
 declaration of 'xs_reset_transport' was here
 cc1: some warnings being treated as errors
 distcc[31554] ERROR: compile /home/rostedt/work/git/nobackup/linux
 -build.git/net/sunrpc/xprtsock.c on localhost failed
 /home/rostedt/work/git/nobackup/linux
 -build.git/scripts/Makefile.build:258: recipe for target 
 'net/sunrpc/xprtsock.o' failed
 make[3]: *** [net/sunrpc/xprtsock.o] Error 1

Sorry. I sent that one off too quickly. Try the following.

8--
From 4876cc779ff525b9c2376d8076edf47815e71f2c Mon Sep 17 00:00:00 2001
From: Trond Myklebust trond.mykleb...@primarydata.com
Date: Fri, 19 Jun 2015 16:17:57 -0400
Subject: [PATCH v2] SUNRPC: Ensure we release the TCP socket once it has been
 closed

This fixes a regression introduced by commit caf4ccd4e88cf2 (SUNRPC:
Make xs_tcp_close() do a socket shutdown rather than a sock_release).
Prior to that commit, the autoclose feature would ensure that an
idle connection would result in the socket being both disconnected and
released, whereas now only gets disconnected.

While the current behaviour is harmless, it does leave the port bound
until either RPC traffic resumes or the RPC client is shut down.

Reported-by: Steven Rostedt rost...@goodmis.org
Signed-off-by: Trond Myklebust trond.mykleb...@primarydata.com
---
 net/sunrpc/xprt.c |  2 +-
 net/sunrpc/xprtsock.c | 40 ++--
 2 files changed, 23 insertions(+), 19 deletions(-)

diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
index 3ca31f20b97c..ab5dd621ae0c 100644
--- a/net/sunrpc/xprt.c
+++ b/net/sunrpc/xprt.c
@@ -611,8 +611,8 @@ static void xprt_autoclose(struct work_struct *work)
struct rpc_xprt *xprt =
container_of(work, struct rpc_xprt, task_cleanup);
 
-   xprt-ops-close(xprt);
clear_bit(XPRT_CLOSE_WAIT, xprt-state);
+   xprt-ops-close(xprt);
xprt_release_write(xprt, NULL);
 }
 
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index fda8ec8c74c0..ee0715dfc3c7 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -623,24 +623,6 @@ process_status:
 }
 
 /**
- * xs_tcp_shutdown - gracefully shut down a TCP socket
- * @xprt: transport
- *
- * Initiates a graceful shutdown of the TCP socket by calling the
- * equivalent of shutdown(SHUT_RDWR);
- */
-static void xs_tcp_shutdown(struct rpc_xprt *xprt)
-{
-   struct sock_xprt *transport = container_of(xprt, struct sock_xprt, 
xprt);
-   struct socket *sock = transport-sock;
-
-   if (sock != NULL) {
-   kernel_sock_shutdown(sock, SHUT_RDWR);
-   trace_rpc_socket_shutdown(xprt, sock);
-   }
-}
-
-/**
  * xs_tcp_send_request - write an RPC request to a TCP socket
  * @task: address of RPC task that manages the state of an RPC request
  *
@@ -786,6 +768,7 @@ static void xs_sock_mark_closed(struct rpc_xprt *xprt)
xs_sock_reset_connection_flags(xprt);
/* Mark transport as closed and wake up all pending tasks */
xprt_disconnect_done(xprt);
+   xprt_force_disconnect(xprt);
 }
 
 /**
@@ -2103,6 +2086,27 @@ out:
xprt_wake_pending_tasks(xprt, status);
 }
 
+/**
+ * xs_tcp_shutdown - gracefully shut down a TCP socket
+ * @xprt: transport
+ *
+ * Initiates a graceful shutdown of the TCP socket

Re: [REGRESSION] NFS is creating a hidden port (left over from xs_bind() )

2015-06-19 Thread Trond Myklebust

On Fri, Jun 19, 2015 at 9:27 PM, Steven Rostedt rost...@goodmis.org wrote:
 On Fri, 19 Jun 2015 19:25:59 -0400
 Trond Myklebust trond.mykleb...@primarydata.com wrote:

 8--
 From 4876cc779ff525b9c2376d8076edf47815e71f2c Mon Sep 17 00:00:00 2001
 From: Trond Myklebust trond.mykleb...@primarydata.com
 Date: Fri, 19 Jun 2015 16:17:57 -0400
 Subject: [PATCH v2] SUNRPC: Ensure we release the TCP socket once it has been
  closed

 This fixes a regression introduced by commit caf4ccd4e88cf2 (SUNRPC:
 Make xs_tcp_close() do a socket shutdown rather than a sock_release).
 Prior to that commit, the autoclose feature would ensure that an
 idle connection would result in the socket being both disconnected and
 released, whereas now only gets disconnected.

 While the current behaviour is harmless, it does leave the port bound
 until either RPC traffic resumes or the RPC client is shut down.

 Is there a way to test RPC traffic resuming? I'd like to try that before
 declaring this bug harmless.

You should be seeing the same issue if you mount an NFSv3 partition.
After about 5 minutes of inactivity, the client will close down the
connection to the server, and rkhunter should again see the phantom
socket. If you then try to access the partition, the RPC layer should
immediately release the socket and establish a new connection on the
same port.

Cheers
  Trond
--
To unsubscribe from this list: send the line unsubscribe netdev in

Re: [REGRESSION] NFS is creating a hidden port (left over from xs_bind() )

2015-06-19 Thread Trond Myklebust

On Fri, Jun 19, 2015 at 1:17 PM, Steven Rostedt rost...@goodmis.org wrote:
 On Fri, 19 Jun 2015 12:25:53 -0400
 Steven Rostedt rost...@goodmis.org wrote:


 I don't see that 55201 anywhere. But then again, I didn't look for it
 before the port disappeared. I could reboot and look for it again. I
 should have saved the full netstat -tapn as well :-/

 Of course I didn't find it anywhere, that's the port on my wife's box
 that port 947 was connected to.

 Now I even went over to my wife's box and ran

  # rpcinfo -p localhost
program vers proto   port  service
 104   tcp111  portmapper
 103   tcp111  portmapper
 102   tcp111  portmapper
 104   udp111  portmapper
 103   udp111  portmapper
 102   udp111  portmapper
 1000241   udp  34243  status
 1000241   tcp  34498  status

 which doesn't show anything.

 but something is listening to that port...

  # netstat -ntap |grep 55201
 tcp0  0 0.0.0.0:55201   0.0.0.0:*   LISTEN


Hang on. This is on the client box while there is an active NFSv4
mount? Then that's probably the NFSv4 callback channel listening for
delegation callbacks.

Can you please try:

echo options nfs callback_tcpport=4048  /etc/modprobe.d/nfs-local.conf

and then either reboot the client or unload and then reload the nfs
modules before reattempting the mount. If this is indeed the callback
channel, then that will move your phantom listener to port 4048...

Cheers
   Trond
--
To unsubscribe from this list: send the line unsubscribe netdev in

Re: [REGRESSION] NFS is creating a hidden port (left over from xs_bind() )

2015-06-18 Thread Trond Myklebust

On Wed, Jun 17, 2015 at 11:08 PM, Steven Rostedt rost...@goodmis.org wrote:
 On Fri, 12 Jun 2015 11:50:38 -0400
 Steven Rostedt rost...@goodmis.org wrote:

 I reverted the following commits:

 c627d31ba0696cbd829437af2be2f2dee3546b1e
 9e2b9f37760e129cee053cc7b6e7288acc2a7134
 caf4ccd4e88cf2795c927834bc488c8321437586

 And the issue goes away. That is, I watched the port go from
 ESTABLISHED to TIME_WAIT, and then gone, and theirs no hidden port.

 In fact, I watched the port with my portlist.c module, and it
 disappeared there too when it entered the TIME_WAIT state.


I've scanned those commits again and again, and I'm not seeing how we
could be introducing a socket leak there. The only suspect I can see
would be the NFS swap bugs that Jeff fixed a few weeks ago. Are you
using NFS swap?

 I've been running v4.0.5 with the above commits reverted for 5 days
 now, and there's still no hidden port appearing.

 What's the status on this? Should those commits be reverted or is there
 another solution to this bug?


I'm trying to reproduce, but I've had no luck yet.

Cheers
  Trond
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [REGRESSION] NFS is creating a hidden port (left over from xs_bind() )

2015-06-12 Thread Trond Myklebust

On Fri, Jun 12, 2015 at 10:40 AM, Eric Dumazet eric.duma...@gmail.com wrote:
 On Fri, 2015-06-12 at 10:10 -0400, Trond Myklebust wrote:
 On Thu, Jun 11, 2015 at 11:49 PM, Steven Rostedt rost...@goodmis.org wrote:
 
  I recently upgraded my main server to 4.0.4 from 3.19.5 and rkhunter
  started reporting a hidden port on my box.
 
  Running unhide-tcp I see this:
 
  # unhide-tcp
  Unhide-tcp 20121229
  Copyright © 2012 Yago Jesus  Patrick Gouin
  License GPLv3+ : GNU GPL version 3 or later
  http://www.unhide-forensics.info
  Used options:
  [*]Starting TCP checking
 
  Found Hidden port that not appears in ss: 946
  [*]Starting UDP checking
 
  This scared the hell out of me as I'm thinking that I have got some kind
  of NSA backdoor hooked into my server and it is monitoring my plans to
  smuggle Kinder Überraschung into the USA from Germany. I panicked!
 
  Well, I wasted the day writing modules to first look at all the sockets
  opened by all processes (via their file descriptors) and posted their
  port numbers.
 
http://rostedt.homelinux.com/private/tasklist.c
 
  But this port wasn't there either.
 
  Then I decided to look at the ports in tcp_hashinfo.
 
http://rostedt.homelinux.com/private/portlist.c
 
  This found the port but no file was connected to it, and worse yet,
  when I first ran it without using probe_kernel_read(), it crashed my
  kernel, because sk-sk_socket pointed to a freed socket!
 
  Note, each boot, the hidden port is different.
 
  Finally, I decided to bring in the big guns, and inserted a
  trace_printk() into the bind logic, to see if I could find the culprit.
  After fiddling with it a few times, I found a suspect:
 
 kworker/3:1H-123   [003] ..s.96.696213: inet_bind_hash: add 946
 
  Bah, it's a kernel thread doing it, via a work queue. I then added a
  trace_dump_stack() to find what was calling this, and here it is:
 
  kworker/3:1H-123   [003] ..s.96.696222: stack trace
   = inet_csk_get_port
   = inet_addr_type
   = inet_bind
   = xs_bind
   = sock_setsockopt
   = __sock_create
   = xs_create_sock.isra.18
   = xs_tcp_setup_socket
   = process_one_work
   = worker_thread
   = worker_thread
   = kthread
   = kthread
   = ret_from_fork
   = kthread
 
  I rebooted, and examined what happens. I see the kworker binding that
  port, and all seems well:
 
  # netstat -tapn |grep 946
  tcp0  0 192.168.23.9:946192.168.23.22:55201 
  ESTABLISHED -
 
  But waiting for a bit, the connection goes into a TIME_WAIT, and then
  it just disappears. But the bind to the port does not get released, and
  that port is from then on, taken.
 
  This never happened with my 3.19 kernels. I would bisect it but this is
  happening on my main server box which I usually only reboot every other
  month doing upgrades. It causes too much disturbance for myself (and my
  family) as when this box is offline, basically the rest of my machines
  are too.
 
  I figured this may be enough information to see if you can fix it.
  Otherwise I can try to do the bisect, but that's not going to happen
  any time soon. I may just go back to 3.19 for now, such that rkhunter
  stops complaining about the hidden port.
 

 The only new thing that we're doing with 4.0 is to set SO_REUSEPORT on
 the socket before binding the port (commit 4dda9c8a5e34: SUNRPC: Set
 SO_REUSEPORT socket option for TCP connections). Perhaps there is an
 issue with that?

 Strange, because the usual way to not have time-wait is to use SO_LINGER
 with linger=0

 And apparently xs_tcp_finish_connecting() has this :

 sock_reset_flag(sk, SOCK_LINGER);
 tcp_sk(sk)-linger2 = 0;

Are you sure? I thought that SO_LINGER is more about controlling how
the socket behaves w.r.t. waiting for the TCP_CLOSE state to be
achieved (i.e. about aborting the FIN state negotiation early). I've
never observed an effect on the TCP time-wait states.

 Are you sure SO_REUSEADDR was not the thing you wanted ?

Yes. SO_REUSEADDR has the problem that it requires you bind to
something other than 0.0.0.0, so it is less appropriate for outgoing
connections; the RPC code really should not have to worry about
routing and routability of a particular source address.

Cheers
  Trond
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [REGRESSION] NFS is creating a hidden port (left over from xs_bind() )

2015-06-12 Thread Trond Myklebust

On Thu, Jun 11, 2015 at 11:49 PM, Steven Rostedt rost...@goodmis.org wrote:

 I recently upgraded my main server to 4.0.4 from 3.19.5 and rkhunter
 started reporting a hidden port on my box.

 Running unhide-tcp I see this:

 # unhide-tcp
 Unhide-tcp 20121229
 Copyright © 2012 Yago Jesus  Patrick Gouin
 License GPLv3+ : GNU GPL version 3 or later
 http://www.unhide-forensics.info
 Used options:
 [*]Starting TCP checking

 Found Hidden port that not appears in ss: 946
 [*]Starting UDP checking

 This scared the hell out of me as I'm thinking that I have got some kind
 of NSA backdoor hooked into my server and it is monitoring my plans to
 smuggle Kinder Überraschung into the USA from Germany. I panicked!

 Well, I wasted the day writing modules to first look at all the sockets
 opened by all processes (via their file descriptors) and posted their
 port numbers.

   http://rostedt.homelinux.com/private/tasklist.c

 But this port wasn't there either.

 Then I decided to look at the ports in tcp_hashinfo.

   http://rostedt.homelinux.com/private/portlist.c

 This found the port but no file was connected to it, and worse yet,
 when I first ran it without using probe_kernel_read(), it crashed my
 kernel, because sk-sk_socket pointed to a freed socket!

 Note, each boot, the hidden port is different.

 Finally, I decided to bring in the big guns, and inserted a
 trace_printk() into the bind logic, to see if I could find the culprit.
 After fiddling with it a few times, I found a suspect:

kworker/3:1H-123   [003] ..s.96.696213: inet_bind_hash: add 946

 Bah, it's a kernel thread doing it, via a work queue. I then added a
 trace_dump_stack() to find what was calling this, and here it is:

 kworker/3:1H-123   [003] ..s.96.696222: stack trace
  = inet_csk_get_port
  = inet_addr_type
  = inet_bind
  = xs_bind
  = sock_setsockopt
  = __sock_create
  = xs_create_sock.isra.18
  = xs_tcp_setup_socket
  = process_one_work
  = worker_thread
  = worker_thread
  = kthread
  = kthread
  = ret_from_fork
  = kthread

 I rebooted, and examined what happens. I see the kworker binding that
 port, and all seems well:

 # netstat -tapn |grep 946
 tcp0  0 192.168.23.9:946192.168.23.22:55201 
 ESTABLISHED -

 But waiting for a bit, the connection goes into a TIME_WAIT, and then
 it just disappears. But the bind to the port does not get released, and
 that port is from then on, taken.

 This never happened with my 3.19 kernels. I would bisect it but this is
 happening on my main server box which I usually only reboot every other
 month doing upgrades. It causes too much disturbance for myself (and my
 family) as when this box is offline, basically the rest of my machines
 are too.

 I figured this may be enough information to see if you can fix it.
 Otherwise I can try to do the bisect, but that's not going to happen
 any time soon. I may just go back to 3.19 for now, such that rkhunter
 stops complaining about the hidden port.


The only new thing that we're doing with 4.0 is to set SO_REUSEPORT on
the socket before binding the port (commit 4dda9c8a5e34: SUNRPC: Set
SO_REUSEPORT socket option for TCP connections). Perhaps there is an
issue with that?

Cheers
  Trond
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] SUNRPC: Mark buffer used for debug printks with __maybe_unused

2008-02-20 Thread Trond Myklebust


On Wed, 2008-02-20 at 19:27 +0300, Pavel Emelyanov wrote:
 Patrick McHardy wrote:
  Joe Perches wrote:
  On Wed, 2008-02-20 at 07:29 -0800, Joe Perches wrote:

  fs/nfsd/nfsproc.c:  char buf[RPC_MAX_ADDRBUFLEN];
  Perhaps there should be a DECLARE_RPC_BUF(buf) macro?
  #define DECLARE_RPC_BUF(var) char var[MAC_BUF_SIZE] __maybe_unused
  
  Make that:
 
  #define DECLARE_RPC_BUF(var) char var[RPC_MAX_ADDRBUFLEN] __maybe_unuse
 
 OK, I'll send the patch in a moment.

1) Please always Cc [EMAIL PROTECTED] when changing the sunrpc
code

2) Please don't use the name RPC_BUF. These are debugging strings, not
buffers. Something like DECLARE_RPC_DEBUG_STR() would be more
appropriate.

3) If you're going to use a macro, why not just use the existing
RPC_IFDEBUG()?

Trond

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] SUNRPC: Mark buffer used for debug printks with __maybe_unused

2008-02-20 Thread Trond Myklebust


On Wed, 2008-02-20 at 08:36 -0800, Joe Perches wrote:
 On Wed, 2008-02-20 at 16:35 +0100, Patrick McHardy wrote:
  Alternatively change the dprintk macro to behave similar like
  pr_debug() and mark things like svc_print_addr() __pure, which
  has the advantage that is still performs format checking even
  if debugging is disabled.
 
 I think it's better to change the dprintk style macros
 to what Philip Craig suggested.
 
 http://marc.info/?l=linux-wirelessm=120338413108120w=2
 
 This makes clear to the compiler that the called function is not
 going to be used so it can be optimized away, keeps any argument
 verification in place, and doesn't require __pure attributes on
 arbitrary functions that may be called during the dprintk
 
 #ifdef DEBUG
 #define some_print_wrapper(fmt, arg...) \
 do { if (0) printk(KERN_DEBUG fmt, ##arg); } while (0)
 #else
 #define some_print_wrapper(fmt, arg...) \
   printk(KERN_DEBUG fmt, ##arg)
 #endif

Have you actually read include/linux/sunrpc/debug.h?

Trond

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] SUNRPC: Mark buffer used for debug printks with __maybe_unused

2008-02-20 Thread Trond Myklebust


On Wed, 2008-02-20 at 09:23 -0800, Joe Perches wrote:
 On Wed, 2008-02-20 at 12:02 -0500, Trond Myklebust wrote:
   #ifdef DEBUG
   #define some_print_wrapper(fmt, arg...) \
   do { if (0) printk(KERN_DEBUG fmt, ##arg); } while (0)
   #else
   #define some_print_wrapper(fmt, arg...) \
 printk(KERN_DEBUG fmt, ##arg)
   #endif
  Have you actually read include/linux/sunrpc/debug.h?
 
 Yes, I have.
 
 What's there:
 
 #define dfprintk(fac, args...) do ; while (0)
 
 vs what's suggested:
 
 #define dfprintk(fac, args...) \
   do  { if (0) printk(##args); } while (0);
 
 No argument verification is done to args
 
 There has been code that fails to compile with -DDEBUG when
 the code use two different #ifdef DEBUG #else macros.
 I think some of the USB code was reworked because of that.
 
 The extra verification is just a guard against bad arguments
 when compiled normally.  It's similar to what's done in kernel.h
 pr_debug without the __attribute__((format(printf,x,y))) so
 that calls made as arguments to functions aren't called
 unnecessarily.

RPC_DEBUG is on by default if SYSCTL is compiled in, so it is not as if
we're having any trouble typechecking the arguments. In fact, most of
the major distros also tend to enable RPC_DEBUG, since it does come in
handy when their customers report rpc/nfs problems.

As you can see, there is already a macro RPC_IFDEBUG to deal with simple
code that depends on whether or not RPC_DEBUG is set. Using that will
make it obvious to a reader exactly when the declaration will be
optimised away, and why, without compromising gcc's ability to warn us
if it were to be unused due to some future code change.


--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] NET: Add the helper kernel_sock_shutdown()

2007-11-12 Thread Trond Myklebust


On Mon, 2007-11-12 at 12:22 +, David Howells wrote:
 Trond Myklebust [EMAIL PROTECTED] wrote:
 
  take a SHUT_RD/SHUT_WR/SHUT_RDWR argument instead of the
 
 Hmmm...  Why SHUT_*?  Why not SHUTDOWN_*?

SHUT_RD/SHUT_WR/SHUT_RDWR are the traditional names for these constants
(see 'man 3 shutdown') and so should be easier to remember. I didn't
however feel comfortable naming the function kernel_shutdown() or
shutdown(), since those have other connotations, hence
kernel_sock_shutdown().

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH][RESEND] NET: Add the helper kernel_sock_shutdown()

2007-11-12 Thread Trond Myklebust

...and fix a couple of bugs in the NBD, CIFS and OCFS2 socket handlers.

Looking at the sock-op-shutdown() handlers, it looks as if all of them
take a SHUT_RD/SHUT_WR/SHUT_RDWR argument instead of the
RCV_SHUTDOWN/SEND_SHUTDOWN arguments.
Add a helper, and then define the SHUT_* enum to ensure that kernel users
of shutdown() don't get confused.

Signed-off-by: Trond Myklebust [EMAIL PROTECTED]
Cc: Paul Clements [EMAIL PROTECTED]
Acked-by: Mark Fasheh [EMAIL PROTECTED]
Cc: Steve French [EMAIL PROTECTED]
Acked-by: David Howells [EMAIL PROTECTED]
Cc: David S. Miller [EMAIL PROTECTED]
---

 drivers/block/nbd.c|3 ++-
 fs/cifs/connect.c  |2 +-
 fs/ocfs2/cluster/tcp.c |4 ++--
 include/linux/net.h|8 
 net/rxrpc/ar-local.c   |4 ++--
 net/socket.c   |6 ++
 6 files changed, 21 insertions(+), 6 deletions(-)

diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index 6332aca..b4c0888 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -28,6 +28,7 @@
 #include linux/err.h
 #include linux/kernel.h
 #include net/sock.h
+#include linux/net.h
 
 #include asm/uaccess.h
 #include asm/system.h
@@ -126,7 +127,7 @@ static void sock_shutdown(struct nbd_device *lo, int lock)
if (lo-sock) {
printk(KERN_WARNING %s: shutting down socket\n,
lo-disk-disk_name);
-   lo-sock-ops-shutdown(lo-sock, SEND_SHUTDOWN|RCV_SHUTDOWN);
+   kernel_sock_shutdown(lo-sock, SHUT_RDWR);
lo-sock = NULL;
}
if (lock)
diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c
index 1102160..c52a76f 100644
--- a/fs/cifs/connect.c
+++ b/fs/cifs/connect.c
@@ -160,7 +160,7 @@ cifs_reconnect(struct TCP_Server_Info *server)
if (server-ssocket) {
cFYI(1, (State: 0x%x Flags: 0x%lx, server-ssocket-state,
server-ssocket-flags));
-   server-ssocket-ops-shutdown(server-ssocket, SEND_SHUTDOWN);
+   kernel_sock_shutdown(server-ssocket, SHUT_WR);
cFYI(1, (Post shutdown state: 0x%x Flags: 0x%lx,
server-ssocket-state,
server-ssocket-flags));
diff --git a/fs/ocfs2/cluster/tcp.c b/fs/ocfs2/cluster/tcp.c
index 685c180..d84bd15 100644
--- a/fs/ocfs2/cluster/tcp.c
+++ b/fs/ocfs2/cluster/tcp.c
@@ -58,6 +58,7 @@
 #include linux/slab.h
 #include linux/idr.h
 #include linux/kref.h
+#include linux/net.h
 #include net/tcp.h
 
 #include asm/uaccess.h
@@ -616,8 +617,7 @@ static void o2net_shutdown_sc(struct work_struct *work)
del_timer_sync(sc-sc_idle_timeout);
o2net_sc_cancel_delayed_work(sc, sc-sc_keepalive_work);
sc_put(sc);
-   sc-sc_sock-ops-shutdown(sc-sc_sock,
-  RCV_SHUTDOWN|SEND_SHUTDOWN);
+   kernel_sock_shutdown(sc-sc_sock, SHUT_RDWR);
}
 
/* not fatal so failed connects before the other guy has our
diff --git a/include/linux/net.h b/include/linux/net.h
index dd79cdb..596131e 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -95,6 +95,12 @@ enum sock_type {
 
 #endif /* ARCH_HAS_SOCKET_TYPES */
 
+enum sock_shutdown_cmd {
+   SHUT_RD = 0,
+   SHUT_WR = 1,
+   SHUT_RDWR   = 2,
+};
+
 /**
  *  struct socket - general BSD socket
  *  @state: socket state (%SS_CONNECTED, etc)
@@ -223,6 +229,8 @@ extern int kernel_setsockopt(struct socket *sock, int 
level, int optname,
 extern int kernel_sendpage(struct socket *sock, struct page *page, int offset,
   size_t size, int flags);
 extern int kernel_sock_ioctl(struct socket *sock, int cmd, unsigned long arg);
+extern int kernel_sock_shutdown(struct socket *sock,
+   enum sock_shutdown_cmd how);
 
 #ifndef CONFIG_SMP
 #define SOCKOPS_WRAPPED(name) name
diff --git a/net/rxrpc/ar-local.c b/net/rxrpc/ar-local.c
index fe03f71..f3a2bd7 100644
--- a/net/rxrpc/ar-local.c
+++ b/net/rxrpc/ar-local.c
@@ -114,7 +114,7 @@ static int rxrpc_create_local(struct rxrpc_local *local)
return 0;
 
 error:
-   local-socket-ops-shutdown(local-socket, 2);
+   kernel_sock_shutdown(local-socket, SHUT_RDWR);
local-socket-sk-sk_user_data = NULL;
sock_release(local-socket);
local-socket = NULL;
@@ -267,7 +267,7 @@ static void rxrpc_destroy_local(struct work_struct *work)
/* finish cleaning up the local descriptor */
rxrpc_purge_queue(local-accept_queue);
rxrpc_purge_queue(local-reject_queue);
-   local-socket-ops-shutdown(local-socket, 2);
+   kernel_sock_shutdown(local-socket, SHUT_RDWR);
sock_release(local-socket);
 
up_read(rxrpc_local_sem);
diff --git a/net/socket.c b/net/socket.c
index 5d879fd..74784df 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -2319,6 +2319,11 @@ int kernel_sock_ioctl(struct socket *sock, int cmd, 
unsigned long arg)
return err

Re: [PATCH] NET: Add the helper kernel_sock_shutdown()

2007-11-11 Thread Trond Myklebust


On Sun, 2007-11-11 at 10:03 -0800, Mark Fasheh wrote:
 On Thu, Nov 08, 2007 at 07:01:36PM -0500, Trond Myklebust wrote:
  From: Trond Myklebust [EMAIL PROTECTED]
  
  ...and fix a couple of bugs in the NBD, CIFS and OCFS2 socket handlers.
  
  Looking at the sock-op-shutdown() handlers, it looks as if all of them
  take a SHUT_RD/SHUT_WR/SHUT_RDWR argument instead of the
  RCV_SHUTDOWN/SEND_SHUTDOWN arguments.
  Add a helper, and then define the SHUT_* enum to ensure that kernel users
  of shutdown() don't get confused.
 
 That looks pretty good - any objection to naming the enum and using that
 name in the prototype for kernel_sock_shutdown() so it's even more obvious
 what type of shutdown argument this expects?
   --Mark
 
 --
 Mark Fasheh
 Senior Software Developer, Oracle
 [EMAIL PROTECTED]

That would be fine by me. How about the attached patch?

Cheers
  Trond

---BeginMessage---
...and fix a couple of bugs in the NBD, CIFS and OCFS2 socket handlers.

Looking at the sock-op-shutdown() handlers, it looks as if all of them
take a SHUT_RD/SHUT_WR/SHUT_RDWR argument instead of the
RCV_SHUTDOWN/SEND_SHUTDOWN arguments.
Add a helper, and then define the SHUT_* enum to ensure that kernel users
of shutdown() don't get confused.

Signed-off-by: Trond Myklebust [EMAIL PROTECTED]
Cc: Paul Clements [EMAIL PROTECTED]
Cc: Mark Fasheh [EMAIL PROTECTED]
Cc: Steve French [EMAIL PROTECTED]
Cc: David Howells [EMAIL PROTECTED]
Cc: David S Miller [EMAIL PROTECTED]
---

 drivers/block/nbd.c|3 ++-
 fs/cifs/connect.c  |2 +-
 fs/ocfs2/cluster/tcp.c |4 ++--
 include/linux/net.h|8 
 net/rxrpc/ar-local.c   |4 ++--
 net/socket.c   |6 ++
 6 files changed, 21 insertions(+), 6 deletions(-)

diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index 6332aca..b4c0888 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -28,6 +28,7 @@
 #include linux/err.h
 #include linux/kernel.h
 #include net/sock.h
+#include linux/net.h
 
 #include asm/uaccess.h
 #include asm/system.h
@@ -126,7 +127,7 @@ static void sock_shutdown(struct nbd_device *lo, int lock)
if (lo-sock) {
printk(KERN_WARNING %s: shutting down socket\n,
lo-disk-disk_name);
-   lo-sock-ops-shutdown(lo-sock, SEND_SHUTDOWN|RCV_SHUTDOWN);
+   kernel_sock_shutdown(lo-sock, SHUT_RDWR);
lo-sock = NULL;
}
if (lock)
diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c
index 19ee11f..bea0d2e 100644
--- a/fs/cifs/connect.c
+++ b/fs/cifs/connect.c
@@ -160,7 +160,7 @@ cifs_reconnect(struct TCP_Server_Info *server)
if (server-ssocket) {
cFYI(1, (State: 0x%x Flags: 0x%lx, server-ssocket-state,
server-ssocket-flags));
-   server-ssocket-ops-shutdown(server-ssocket, SEND_SHUTDOWN);
+   kernel_sock_shutdown(server-ssocket, SHUT_WR);
cFYI(1, (Post shutdown state: 0x%x Flags: 0x%lx,
server-ssocket-state,
server-ssocket-flags));
diff --git a/fs/ocfs2/cluster/tcp.c b/fs/ocfs2/cluster/tcp.c
index 685c180..d84bd15 100644
--- a/fs/ocfs2/cluster/tcp.c
+++ b/fs/ocfs2/cluster/tcp.c
@@ -58,6 +58,7 @@
 #include linux/slab.h
 #include linux/idr.h
 #include linux/kref.h
+#include linux/net.h
 #include net/tcp.h
 
 #include asm/uaccess.h
@@ -616,8 +617,7 @@ static void o2net_shutdown_sc(struct work_struct *work)
del_timer_sync(sc-sc_idle_timeout);
o2net_sc_cancel_delayed_work(sc, sc-sc_keepalive_work);
sc_put(sc);
-   sc-sc_sock-ops-shutdown(sc-sc_sock,
-  RCV_SHUTDOWN|SEND_SHUTDOWN);
+   kernel_sock_shutdown(sc-sc_sock, SHUT_RDWR);
}
 
/* not fatal so failed connects before the other guy has our
diff --git a/include/linux/net.h b/include/linux/net.h
index dd79cdb..596131e 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -95,6 +95,12 @@ enum sock_type {
 
 #endif /* ARCH_HAS_SOCKET_TYPES */
 
+enum sock_shutdown_cmd {
+   SHUT_RD = 0,
+   SHUT_WR = 1,
+   SHUT_RDWR   = 2,
+};
+
 /**
  *  struct socket - general BSD socket
  *  @state: socket state (%SS_CONNECTED, etc)
@@ -223,6 +229,8 @@ extern int kernel_setsockopt(struct socket *sock, int 
level, int optname,
 extern int kernel_sendpage(struct socket *sock, struct page *page, int offset,
   size_t size, int flags);
 extern int kernel_sock_ioctl(struct socket *sock, int cmd, unsigned long arg);
+extern int kernel_sock_shutdown(struct socket *sock,
+   enum sock_shutdown_cmd how);
 
 #ifndef CONFIG_SMP
 #define SOCKOPS_WRAPPED(name) name
diff --git a/net/rxrpc/ar-local.c b/net/rxrpc/ar-local.c
index fe03f71..f3a2bd7 100644
--- a/net/rxrpc/ar-local.c
+++ b/net/rxrpc/ar-local.c
@@ -114,7 +114,7 @@ static int rxrpc_create_local(struct

[PATCH] NET: Add the helper kernel_sock_shutdown()

2007-11-08 Thread Trond Myklebust

From: Trond Myklebust [EMAIL PROTECTED]

...and fix a couple of bugs in the NBD, CIFS and OCFS2 socket handlers.

Looking at the sock-op-shutdown() handlers, it looks as if all of them
take a SHUT_RD/SHUT_WR/SHUT_RDWR argument instead of the
RCV_SHUTDOWN/SEND_SHUTDOWN arguments.
Add a helper, and then define the SHUT_* enum to ensure that kernel users
of shutdown() don't get confused.

Signed-off-by: Trond Myklebust [EMAIL PROTECTED]
Cc: Paul Clements [EMAIL PROTECTED]
Cc: Mark Fasheh [EMAIL PROTECTED]
Cc: Steve French [EMAIL PROTECTED]
Cc: David Howells [EMAIL PROTECTED]
Cc: David S Miller [EMAIL PROTECTED]
---

 drivers/block/nbd.c|3 ++-
 fs/cifs/connect.c  |2 +-
 fs/ocfs2/cluster/tcp.c |4 ++--
 include/linux/net.h|7 +++
 net/rxrpc/ar-local.c   |4 ++--
 net/socket.c   |6 ++
 6 files changed, 20 insertions(+), 6 deletions(-)

diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index 6332aca..b4c0888 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -28,6 +28,7 @@
 #include linux/err.h
 #include linux/kernel.h
 #include net/sock.h
+#include linux/net.h
 
 #include asm/uaccess.h
 #include asm/system.h
@@ -126,7 +127,7 @@ static void sock_shutdown(struct nbd_device *lo, int lock)
if (lo-sock) {
printk(KERN_WARNING %s: shutting down socket\n,
lo-disk-disk_name);
-   lo-sock-ops-shutdown(lo-sock, SEND_SHUTDOWN|RCV_SHUTDOWN);
+   kernel_sock_shutdown(lo-sock, SHUT_RDWR);
lo-sock = NULL;
}
if (lock)
diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c
index 19ee11f..bea0d2e 100644
--- a/fs/cifs/connect.c
+++ b/fs/cifs/connect.c
@@ -160,7 +160,7 @@ cifs_reconnect(struct TCP_Server_Info *server)
if (server-ssocket) {
cFYI(1, (State: 0x%x Flags: 0x%lx, server-ssocket-state,
server-ssocket-flags));
-   server-ssocket-ops-shutdown(server-ssocket, SEND_SHUTDOWN);
+   kernel_sock_shutdown(server-ssocket, SHUT_WR);
cFYI(1, (Post shutdown state: 0x%x Flags: 0x%lx,
server-ssocket-state,
server-ssocket-flags));
diff --git a/fs/ocfs2/cluster/tcp.c b/fs/ocfs2/cluster/tcp.c
index 685c180..d84bd15 100644
--- a/fs/ocfs2/cluster/tcp.c
+++ b/fs/ocfs2/cluster/tcp.c
@@ -58,6 +58,7 @@
 #include linux/slab.h
 #include linux/idr.h
 #include linux/kref.h
+#include linux/net.h
 #include net/tcp.h
 
 #include asm/uaccess.h
@@ -616,8 +617,7 @@ static void o2net_shutdown_sc(struct work_struct *work)
del_timer_sync(sc-sc_idle_timeout);
o2net_sc_cancel_delayed_work(sc, sc-sc_keepalive_work);
sc_put(sc);
-   sc-sc_sock-ops-shutdown(sc-sc_sock,
-  RCV_SHUTDOWN|SEND_SHUTDOWN);
+   kernel_sock_shutdown(sc-sc_sock, SHUT_RDWR);
}
 
/* not fatal so failed connects before the other guy has our
diff --git a/include/linux/net.h b/include/linux/net.h
index dd79cdb..c804d81 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -95,6 +95,12 @@ enum sock_type {
 
 #endif /* ARCH_HAS_SOCKET_TYPES */
 
+enum {
+   SHUT_RD = 0,
+   SHUT_WR = 1,
+   SHUT_RDWR   = 2,
+};
+
 /**
  *  struct socket - general BSD socket
  *  @state: socket state (%SS_CONNECTED, etc)
@@ -223,6 +229,7 @@ extern int kernel_setsockopt(struct socket *sock, int 
level, int optname,
 extern int kernel_sendpage(struct socket *sock, struct page *page, int offset,
   size_t size, int flags);
 extern int kernel_sock_ioctl(struct socket *sock, int cmd, unsigned long arg);
+extern int kernel_sock_shutdown(struct socket *sock, int how);
 
 #ifndef CONFIG_SMP
 #define SOCKOPS_WRAPPED(name) name
diff --git a/net/rxrpc/ar-local.c b/net/rxrpc/ar-local.c
index fe03f71..f3a2bd7 100644
--- a/net/rxrpc/ar-local.c
+++ b/net/rxrpc/ar-local.c
@@ -114,7 +114,7 @@ static int rxrpc_create_local(struct rxrpc_local *local)
return 0;
 
 error:
-   local-socket-ops-shutdown(local-socket, 2);
+   kernel_sock_shutdown(local-socket, SHUT_RDWR);
local-socket-sk-sk_user_data = NULL;
sock_release(local-socket);
local-socket = NULL;
@@ -267,7 +267,7 @@ static void rxrpc_destroy_local(struct work_struct *work)
/* finish cleaning up the local descriptor */
rxrpc_purge_queue(local-accept_queue);
rxrpc_purge_queue(local-reject_queue);
-   local-socket-ops-shutdown(local-socket, 2);
+   kernel_sock_shutdown(local-socket, SHUT_RDWR);
sock_release(local-socket);
 
up_read(rxrpc_local_sem);
diff --git a/net/socket.c b/net/socket.c
index 5d879fd..1e41b15 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -2319,6 +2319,11 @@ int kernel_sock_ioctl(struct socket *sock, int cmd, 
unsigned long arg)
return err;
 }
 
+int kernel_sock_shutdown(struct

Re: NFS mount gives ENETDOWN in -git15

2007-07-22 Thread Trond Myklebust

On Sat, 2007-07-21 at 15:31 +0200, Andi Kleen wrote:
 I tried to mount another nfs mount on a system running with nfsroot.
 But I get
 
 # mount basil:/home /basil/home/
 mount: Network is down
 
 The network is not down of course, the system is happily running with nfs 
 root from that
 server. Userland is older SUSE 10.0

Does Al's patch help in any way?

Cheers
  Trond

---BeginMessage---
Obviously broken on little-endian; fortunately, the option is
not frequently used...

Signed-off-by: Al Viro [EMAIL PROTECTED]
---
diff --git a/fs/nfs/super.c b/fs/nfs/super.c
index b34b7a7..b2a851c 100644
--- a/fs/nfs/super.c
+++ b/fs/nfs/super.c
@@ -732,7 +732,7 @@ static int nfs_parse_mount_options(char *raw,
return 0;
if (option  0 || option  65535)
return 0;
-   mnt-nfs_server.address.sin_port = htonl(option);
+   mnt-nfs_server.address.sin_port = htons(option);
break;
case Opt_rsize:
if (match_int(args, mnt-rsize))
---End Message---

Re: networking busted in current -git ???

2007-06-09 Thread Trond Myklebust

On Fri, 2007-06-08 at 19:06 -0700, David Miller wrote:
 From: Trond Myklebust [EMAIL PROTECTED]
 Date: Fri, 08 Jun 2007 17:43:27 -0400

  It is not dhcp. I'm seeing the same bug with bog-standard ifup with a
  static address on an FC-6 machine.

  It appears to be something in the latest dump from davem to Linus, but I
  haven't yet had time to identify what.

 Linus's current tree should have this fixed.

 Let us know if this is not the case.

It appears to be working for me again.

Trond

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: networking busted in current -git ???

2007-06-08 Thread Trond Myklebust

On Fri, 2007-06-08 at 23:07 +0200, Arkadiusz Miskiewicz wrote:
 On Friday 08 of June 2007, you wrote:
  Hello,
 
  I am using the current git tree: 85f6038f2170e3335dda09c3dfb0f83110e87019 .
  Git tree from two days ago (with the same config) works fine.
 
  Attempting to acquire an IP address via DHCP fails with:
 
  SIOCSIFADDR: No buffer space available
  Listening on LPF/eth0/00:19:b9:0c:9a:43
  Sending on   LPF/eth0/00:19:b9:0c:9a:43
  Sending on   Socket/fallback
  DHCPREQUEST on eth0 to 255.255.255.255 port 67
  DHCPACK from xxx.xxx.xxx.xxx
  SIOCSIFADDR: No buffer space available
  SIOCSIFNETMASK: Cannot assign requested address
  SIOCSIFBRDADDR: Cannot assign requested address
  SIOCADDRT: Network is unreachable
  bound to xxx.xxx.xxx.xxx -- renewal in 98610 seconds.
 
  This is on a Dell 490 with tg3 network driver running Ubuntu 7.04 .
  .config and dmesg are appended.
 
  florin
 
 Here it requires few retries (stop dhcpcd, start again) to get the IP. git 
 tree from few hours ago. tg3 driver. I also saw SIOCSIFADDR: No buffer space 
 available once.

(added netdev to the Cc list)

It is not dhcp. I'm seeing the same bug with bog-standard ifup with a
static address on an FC-6 machine.

It appears to be something in the latest dump from davem to Linus, but I
haven't yet had time to identify what.

Cheers
  Trond

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [-mm patch] fix fs/nfs/nfsroot.c compile error

2007-05-05 Thread Trond Myklebust

On Sat, 2007-05-05 at 18:44 +0200, Adrian Bunk wrote:
 On Sat, May 05, 2007 at 01:49:55AM -0700, Andrew Morton wrote:
 ...
  Changes since 2.6.21-rc7-mm2:
 ...
   git-net.patch
 ...
   git trees
 ...
 
 match_table_t was made const and gcc doesn't like const __initdata:

Then please revert the fix to match_table_t...

There is no reason for something like the nfsroot parser to be kept in
memory after the system has booted. That would be code bloat.

Trond

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [-mm patch] fix fs/nfs/nfsroot.c compile error

2007-05-05 Thread Trond Myklebust

On Sat, 2007-05-05 at 13:20 -0400, Trond Myklebust wrote:
 On Sat, 2007-05-05 at 18:44 +0200, Adrian Bunk wrote:
  On Sat, May 05, 2007 at 01:49:55AM -0700, Andrew Morton wrote:
  ...
   Changes since 2.6.21-rc7-mm2:
  ...
git-net.patch
  ...
git trees
  ...
  
  match_table_t was made const and gcc doesn't like const __initdata:
 
 Then please revert the fix to match_table_t...
 
 There is no reason for something like the nfsroot parser to be kept in
 memory after the system has booted. That would be code bloat.
 
 Trond

Alternatively, please change the nfsroot parser to use

static struct match_token tokens[] __initdata = {

That is in any case cleaner than using a typedef.

Trond

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: 2.6.19-rc1: known regressions (v2)

2006-10-07 Thread Trond Myklebust

Cc: Linus Torvalds [EMAIL PROTECTED], Linux Kernel Mailing List 
linux-kernel@vger.kernel.org, [EMAIL PROTECTED], [EMAIL PROTECTED], [EMAIL 
PROTECTED],  Prakash Punnoor [EMAIL PROTECTED], [EMAIL PROTECTED], [EMAIL 
PROTECTED], Steve Fox [EMAIL PROTECTED],  netdev@vger.kernel.org, Michael S. 
Tsirkin [EMAIL PROTECTED],  [EMAIL PROTECTED], linux-acpi@vger.kernel.org, 
Pavel Machek [EMAIL PROTECTED],  Olaf Hering [EMAIL PROTECTED], Antonino 
Daplas [EMAIL PROTECTED], [EMAIL PROTECTED],  Thierry Vignaud [EMAIL 
PROTECTED], [EMAIL PROTECTED], linux-ide@vger.kernel.org, Ernst Herzberg 
[EMAIL PROTECTED], Matthieu Castet [EMAIL PROTECTED],  
linux-usb-devel@lists.sourceforge.net, Jens Axboe [EMAIL PROTECTED],  
Benjamin Herrenschmidt [EMAIL PROTECTED], [EMAIL PROTECTED], [EMAIL 
PROTECTED], Andreas Schwab [EMAIL PROTECTED],  Mel Gorman [EMAIL 
PROTECTED], Alex Romosan [EMAIL PROTECTED], Sukadev Bhattiprolu [EMAIL 
PROTECTED]
 com, Dave Kleikamp [EMAIL PROTECTED], Torsten Kaiser [EMAIL PROTECTED], 
Magnus Damm [EMAIL PROTECTED], Vivek Goyal [EMAIL PROTECTED], [EMAIL 
PROTECTED], [EMAIL PROTECTED], Alistair John Strachan [EMAIL PROTECTED], 
Stefan Richter [EMAIL PROTECTED],  [EMAIL PROTECTED]
In-Reply-To: [EMAIL PROTECTED]
References: [EMAIL PROTECTED]
 [EMAIL PROTECTED]
Content-Type: text/plain
Content-Transfer-Encoding: 7bit
Organization: Network Appliance Inc
Date: Sun, 08 Oct 2006 00:43:35 -0400
Message-Id: [EMAIL PROTECTED]
Mime-Version: 1.0
X-Mailer: Evolution 2.8.1 

On Sat, 2006-10-07 at 23:46 +0200, Adrian Bunk wrote:

 Subject: NFSv4 fails to mount (timeout)
 References : http://bugzilla.kernel.org/show_bug.cgi?id=7274
 Submitter  : Torsten Kaiser [EMAIL PROTECTED]
 Guilty : Trond Myklebust [EMAIL PROTECTED]
  commit 51b6ded4d9a94a61035deba1d8f51a54e3a3dd86
 Handled-By : Trond Myklebust [EMAIL PROTECTED]
 Patch  : http://bugzilla.kernel.org/show_bug.cgi?id=7274
 Status : patch available

Thanks... Always nice to hear that you have been judged and found
guilty. Now go and reread that fucking bug report...

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 4/4] nfs: deadlock prevention for NFS

2006-08-25 Thread Trond Myklebust

Grumble... If your patches are targetting NFS, could you please at the
very least Cc [EMAIL PROTECTED] and/or myself.


On Fri, 2006-08-25 at 17:40 +0200, Peter Zijlstra wrote:
 Provide a proper a_ops-swapfile() implementation for NFS. This will
 set the NFS socket to SOCK_VMIO and put the socket reconnection under
 PF_MEMALLOC (I hope this is enough, otherwise more work needs to be done).
 
 Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
 ---
  fs/nfs/file.c   |   21 -
  include/linux/sunrpc/xprt.h |4 +++-
  net/sunrpc/xprtsock.c   |   16 
  3 files changed, 39 insertions(+), 2 deletions(-)
 
 Index: linux-2.6/fs/nfs/file.c
 ===
 --- linux-2.6.orig/fs/nfs/file.c
 +++ linux-2.6/fs/nfs/file.c
 @@ -27,6 +27,7 @@
  #include linux/slab.h
  #include linux/pagemap.h
  #include linux/smp_lock.h
 +#include net/sock.h
  
  #include asm/uaccess.h
  #include asm/system.h
 @@ -317,7 +318,25 @@ static int nfs_release_page(struct page 
  
  static int nfs_swapfile(struct address_space *mapping, int enable)
  {
 - return 0;
 + int err = -EINVAL;
 + struct rpc_clnt *client = NFS_CLIENT(mapping-host);
 + struct sock *sk = client-cl_xprt-inet;
 +
 + if (enable) {
 + client-cl_xprt-swapper = 1;
 + /*
 +  * keep one extra sock reference so the reserve won't dip
 +  * when the socket gets reconnected.
 +  */
 + sk_adjust_memalloc(1, 1);
 + err = sk_set_vmio(sk);
 + } else if (client-cl_xprt-swapper) {
 + client-cl_xprt-swapper = 0;
 + sk_adjust_memalloc(-1, -1);
 + err = sk_clear_vmio(sk);
 + }
 +
 + return err;
  }

This all belongs in net/sunrpc/xprtsock.c. The NFS code has no business
screwing around with the internals of the sunrpc transport.

  const struct address_space_operations nfs_file_aops = {
 Index: linux-2.6/net/sunrpc/xprtsock.c
 ===
 --- linux-2.6.orig/net/sunrpc/xprtsock.c
 +++ linux-2.6/net/sunrpc/xprtsock.c
 @@ -1014,6 +1014,7 @@ static void xs_udp_connect_worker(void *
  {
   struct rpc_xprt *xprt = (struct rpc_xprt *) args;
   struct socket *sock = xprt-sock;
 + unsigned long pflags = current-flags;
   int err, status = -EIO;
  
   if (xprt-shutdown || xprt-addr.sin_port == 0)
 @@ -1021,6 +1022,9 @@ static void xs_udp_connect_worker(void *
  
   dprintk(RPC:  xs_udp_connect_worker for xprt %p\n, xprt);
  
 + if (xprt-swapper)
 + current-flags |= PF_MEMALLOC;
 +
   /* Start by resetting any existing state */
   xs_close(xprt);
  
 @@ -1054,6 +1058,9 @@ static void xs_udp_connect_worker(void *
   xprt-sock = sock;
   xprt-inet = sk;
  
 + if (xprt-swapper)
 + sk_set_vmio(sk);
 +
   write_unlock_bh(sk-sk_callback_lock);
   }
   xs_udp_do_set_buffer_size(xprt);
 @@ -1061,6 +1068,7 @@ static void xs_udp_connect_worker(void *
  out:
   xprt_wake_pending_tasks(xprt, status);
   xprt_clear_connecting(xprt);
 + current-flags = pflags;
  }
  
  /*
 @@ -1097,11 +1105,15 @@ static void xs_tcp_connect_worker(void *
  {
   struct rpc_xprt *xprt = (struct rpc_xprt *)args;
   struct socket *sock = xprt-sock;
 + unsigned long pflags = current-flags;
   int err, status = -EIO;
  
   if (xprt-shutdown || xprt-addr.sin_port == 0)
   goto out;
  
 + if (xprt-swapper)
 + current-flags |= PF_MEMALLOC;
 +
   dprintk(RPC:  xs_tcp_connect_worker for xprt %p\n, xprt);
  
   if (!xprt-sock) {
 @@ -1170,10 +1182,14 @@ static void xs_tcp_connect_worker(void *
   break;
   }
   }
 +
 + if (xprt-swapper)
 + sk_set_vmio(xprt-inet);
  out:
   xprt_wake_pending_tasks(xprt, status);
  out_clear:
   xprt_clear_connecting(xprt);
 + current-flags = pflags;
  }

How does this guarantee that the socket reconnection won't fail?

Also, what about the case of rpc_malloc()? Can't that cause rpciod to
deadlock when you add NFS swap into the equation?

Cheers,
  Trond

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Patch] Dead code in net/sunrpc/auth_gss/auth_gss.c

2006-04-18 Thread Trond Myklebust

On Tue, 2006-04-18 at 18:07 +0200, Eric Sesterhenn wrote:
 Hi,
 
 the coverity checker spotted that cred is always NULL
 when we jump to out_err ( there is just one case, when
 we fail to allocate the memory for cred )
 This is Coverity ID #79
 
 Signed-off-by: Eric Sesterhenn [EMAIL PROTECTED]
 

I'll take it, but please send sunrpc patches to the NFS lists, not
netdev.

Cheers,
  Trond

 --- linux-2.6.17-rc1/net/sunrpc/auth_gss/auth_gss.c.orig  2006-04-18 
 13:23:22.0 +0200
 +++ linux-2.6.17-rc1/net/sunrpc/auth_gss/auth_gss.c   2006-04-18 
 13:24:10.0 +0200
 @@ -794,7 +794,6 @@ gss_create_cred(struct rpc_auth *auth, s
  
  out_err:
   dprintk(RPC:  gss_create_cred failed with error %d\n, err);
 - if (cred) gss_destroy_cred(cred-gc_base);
   return ERR_PTR(err);
  }
  
 
 
 -
 To unsubscribe from this list: send the line unsubscribe netdev in
 the body of a message to [EMAIL PROTECTED]
 More majordomo info at  http://vger.kernel.org/majordomo-info.html

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[Fwd: [Bug 5644] New: NFS v3 TCP 3-way handshake incorrect, iptables blocks access]

2005-11-23 Thread Trond Myklebust

Sorry to be cross-posting, but does this bug ring any bells? I'm having
trouble seeing how the sunrpc server code could be at fault.

Cheers,
  Trond
---BeginMessage---
http://bugzilla.kernel.org/show_bug.cgi?id=5644

   Summary: NFS v3 TCP 3-way handshake incorrect, iptables blocks
access
Kernel Version: 2.6.14
Status: NEW
  Severity: blocking
 Owner: [EMAIL PROTECTED]
 Submitter: [EMAIL PROTECTED]


Most recent kernel where this bug did not occur:
Distribution: Can't remember, possibly FC2.
Hardware Environment:
Software Environment:
Problem Description:

Steps to reproduce:
1. Boot NFS v3 TCP client running iptables  mount NFS filesystem
2. Do a normal NFS client reboot  try mounting the same filesystem again
3. Experience intermittent failure to read superblock

The cause of this problem is NFS server's improper response to SYN packet sent
by the client.  This occurs *after* successful client authorization, when the
client tries to open the connection (i.e. sends SYN to the server's nfs port) to
read the superblock.  The server (sometimes) responds with a pure ACK without
the SYN bit set.  This is blocked by iptables -- thus, mount fails with a could
not read superblock message.

Here is an excerpt from ethereal log:

  3 0.021733client   SERVER   TCP  800  nfs [SYN]
Seq=0 Ack=0 Win=5840 Len=0 MSS=1460 TSV=24095 TSER=0 WS=2
  4 0.021846SERVER   client   TCP  nfs  800 [ACK]
Seq=9138391 Ack=3580883479 Win=16022 Len=0 TSV=244936050 TSER=1149400
  5 0.021864client   SERVER   ICMP Destination
unreachable (Host administratively prohibited)

The above problem occurs with a very simple default+ssh iptables configuration.
 Disabling iptables on the client makes the problem go away.  Even with iptables
active, there is no problem when nfsd responds with a proper [SYN,ACK] instead
of just pure ACK (this happens intermittently after the client reboot).

Please fix nfsd so that it reliably responds to SYN packets with proper
[SYN,ACK] packets instead of just ACK packets.  Apparently, nfsd state doesn't
get properly reset on client reboots.  Other people have reported autofs
failures which may be related (e.g. on remounts).

--- You are receiving this mail because: ---
You are the assignee for the bug, or are watching the assignee.
---End Message---

46 matches

Mail list logo