Re: [PATCH v2] sunrpc: fix waitqueue_active without memory barrier in sunrpc

2015-10-15 Thread Neil Brown
"J. Bruce Fields"  writes:

> On Thu, Oct 15, 2015 at 11:44:20AM +, Kosuke Tatsukawa wrote:
>> Tatsukawa Kosuke wrote:
>> > J. Bruce Fields wrote:
>> >> Thanks for the detailed investigation.
>> >> 
>> >> I think it would be worth adding a comment if that might help someone
>> >> having to reinvestigate this again some day.
>> > 
>> > It would be nice, but I find it difficult to write a comment in the
>> > sunrpc layer why a memory barrier isn't necessary, using the knowledge
>> > of how nfsd uses it, and the current implementation of the network code.
>> > 
>> > Personally, I would prefer removing the call to waitqueue_active() which
>> > would make the memory barrier totally unnecessary at the cost of a
>> > spin_lock + spin_unlock by unconditionally calling
>> > wake_up_interruptible.
>> 
>> On second thought, the callbacks will be called frequently from the tcp
>> code, so it wouldn't be a good idea.
>
> So, I was even considering documenting it like this, if it's not
> overkill.
>
> Hmm... but if this is right, then we may as well ask why we're doing the
> wakeups at all.  Might be educational to test the code with them
> removed.
>
> --b.
>
> commit 0882cfeb39e0
> Author: J. Bruce Fields 
> Date:   Thu Oct 15 16:53:41 2015 -0400
>
> svcrpc: document lack of some memory barriers.
> 
> Kosuke Tatsukawa points out an odd lack of memory barriers in some sites
> here.  I think the code's correct, but it's probably worth documenting.
> 
> Reported-by: Kosuke Tatsukawa 
>
> diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
> index 856407fa085e..90480993ec4a 100644
> --- a/net/sunrpc/svcsock.c
> +++ b/net/sunrpc/svcsock.c
> @@ -399,6 +399,25 @@ static int svc_sock_secure_port(struct svc_rqst *rqstp)
>   return svc_port_is_privileged(svc_addr(rqstp));
>  }
>  
> +static void svc_no_smp_mb(void)
> +{
> + /*
> +  * Kosuke Tatsukawa points out there should normally be an
> +  * smp_mb() at the callsites of this function.  (Either that or
> +  * we could just drop the waitqueue_active() checks.)
> +  *
> +  * It appears they aren't currently necessary, though, basically
> +  * because nfsd does non-blocking reads from these sockets, so
> +  * the only places we wait on this waitqueue is in sendpage and
> +  * sendmsg, which won't be waiting for wakeups on newly arrived
> +  * data.
> +  *
> +  * Maybe we should add the memory barriers anyway, but these are
> +  * hot paths so we'd need to be convinced there's no sigificant
> +  * penalty.
> +  */
> +}
> +
>  /*
>   * INET callback when data has been received on the socket.
>   */
> @@ -414,7 +433,7 @@ static void svc_udp_data_ready(struct sock *sk)
>   set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
>   svc_xprt_enqueue(&svsk->sk_xprt);
>   }
> - smp_mb();
> + svc_no_smp_mb();
>   if (wq && waitqueue_active(wq))
>   wake_up_interruptible(wq);
>  }
> @@ -433,7 +452,7 @@ static void svc_write_space(struct sock *sk)
>   svc_xprt_enqueue(&svsk->sk_xprt);
>   }
>  
> - smp_mb();
> + svc_no_smp_mb();
>   if (wq && waitqueue_active(wq)) {
>   dprintk("RPC svc_write_space: someone sleeping on %p\n",
>  svsk);
> @@ -789,7 +808,7 @@ static void svc_tcp_listen_data_ready(struct sock *sk)
>   }
>  
>   wq = sk_sleep(sk);
> - smp_mb();
> + svc_no_smp_mb();
>   if (wq && waitqueue_active(wq))
>   wake_up_interruptible_all(wq);
>  }
> @@ -811,7 +830,7 @@ static void svc_tcp_state_change(struct sock *sk)
>   set_bit(XPT_CLOSE, &svsk->sk_xprt.xpt_flags);
>   svc_xprt_enqueue(&svsk->sk_xprt);
>   }
> - smp_mb();
> + svc_no_smp_mb();
>   if (wq && waitqueue_active(wq))
>   wake_up_interruptible_all(wq);
>  }
> @@ -827,7 +846,7 @@ static void svc_tcp_data_ready(struct sock *sk)
>   set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
>   svc_xprt_enqueue(&svsk->sk_xprt);
>   }
> - smp_mb();
> + svc_no_smp_mb();
>   if (wq && waitqueue_active(wq))
>   wake_up_interruptible(wq);
>  }
> @@ -1599,7 +1618,7 @@ static void svc_sock_detach(struct svc_xprt *xprt)
>   sk->sk_write_space = svsk->sk_owspace;
>  
>   wq = sk_sleep(sk);
> - smp_mb();
> + svc_no_smp_mb();
>   if (wq && waitqueue_active(wq))
>   wake_up_interruptible(wq);
>  }

I would feel a lot more comfortable if you instead created:

static inline bool sunrpc_waitqueue_active(struct wait_queue_head *wq)
{
if (!wq)
return false;
/* long comment abot not needing a memory barrier */
return waitqueue_active(wq);
}

and then replace various "if (wq && waitqueue_active(wq))" calls with
if (sunrpc_waitqueue_active(wq))"

The comment seems readable and seems to make sense.

NeilBrown


signature.asc
Description: PGP signature


Re: [PATCH v2] sunrpc: fix waitqueue_active without memory barrier in sunrpc

2015-10-08 Thread Neil Brown
Kosuke Tatsukawa  writes:

> There are several places in net/sunrpc/svcsock.c which calls
> waitqueue_active() without calling a memory barrier.  Add a memory
> barrier just as in wq_has_sleeper().
>
> I found this issue when I was looking through the linux source code
> for places calling waitqueue_active() before wake_up*(), but without
> preceding memory barriers, after sending a patch to fix a similar
> issue in drivers/tty/n_tty.c  (Details about the original issue can be
> found here: https://lkml.org/lkml/2015/9/28/849).

hi,
this feels like the wrong approach to the problem.  It requires extra
'smb_mb's to be spread around which are hard to understand as easy to
forget.

A quick look seems to suggest that (nearly) every waitqueue_active()
will need an smb_mb.  Could we just put the smb_mb() inside
waitqueue_active()??

Thanks,
NeilBrown


>
> Signed-off-by: Kosuke Tatsukawa 
> ---
> v2:
>   - Fixed compiler warnings caused by type mismatch
> v1:
>   - https://lkml.org/lkml/2015/10/8/993
> ---
>  net/sunrpc/svcsock.c |6 ++
>  1 files changed, 6 insertions(+), 0 deletions(-)
>
> diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
> index 0c81202..ec19444 100644
> --- a/net/sunrpc/svcsock.c
> +++ b/net/sunrpc/svcsock.c
> @@ -414,6 +414,7 @@ static void svc_udp_data_ready(struct sock *sk)
>   set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
>   svc_xprt_enqueue(&svsk->sk_xprt);
>   }
> + smp_mb();
>   if (wq && waitqueue_active(wq))
>   wake_up_interruptible(wq);
>  }


signature.asc
Description: PGP signature


Re: [PATCH 00/28] Swap over NFS -v16

2008-02-25 Thread Neil Brown
On Saturday February 23, [EMAIL PROTECTED] wrote:
> On Wed, 20 Feb 2008 15:46:10 +0100 Peter Zijlstra <[EMAIL PROTECTED]> wrote:
> 
> > Another posting of the full swap over NFS series. 
> 
> Well I looked.  There's rather a lot of it and I wouldn't pretend to
> understand it.

But pretending is fun :-)

> 
> What is the NFS and net people's take on all of this?

Well I'm only vaguely an NFS person, barely a net person, sporadically
an mm person, but I've had a look and it seems to mostly make sense.

We introduce a new "emergency" concept for page allocation.
The size of the emergency pool is set by various reservations by
different potential users.
If the number of free pages is below the "emergency" size, then only
users with a "MEMALLOC" flag get to allocate pages.  Further, those
pages get a "reserve" flag set which propagates into slab/slub so
kmalloc/kmemalloc only return memory from those pages to MEMALLOC
users. 
MEMALLOC users are those that set PF_MEMALLOC.  A socket can get
SOCK_MEMALLOC set which will cause certain pieces of code to
temporarily set PF_MEMALLOC while working on that socket.

The upshot is that providing any MEMALLOC user reserves an appropriate
amount of emergency space, returns the emergency memory promptly, and
sets PF_MEMALLOC whenever allocating memory, it's memory allocations
should never fail.

As memory is requested is small units, but allocated as pages, there
needs to be a conversion from small-units to pages.  One of the
patches does this and appears to err on the side of be over-generous,
which is the right thing to do.


Memory reservations are organised in a tree.  I really don't
understand the tree.  Is it just to make /proc/reserve_info look more
helpful?
Certainly all the individual reservations need to be recorded, and the
cumulative reservation needs also to be recorded (currently in the
root of the tree) but what are all the other levels used for?

Reservations are used for all the transient memory that might be used
by the network stack.  This particularly includes the route cache and
skbs for incoming messages.  I have no idea if there is anything else
that needs to be allowed for.

Filesystems can advertise (via address_space_operations) that files
may be used as swap file.  They then provide swapout/swapin methods
which are like writepage/readpage but may behave differently and have
a different way to get credentials from a 'struct file'.


So in general, the patch set looks to have the right sort of shape.  I
cannot be very authoritative on the details as there are a lot of
them, and they touch code that I'm not very familiar with.

Some specific comments on patches:


reserve-slub.patch

   Please avoid irrelevant reformatting in patches.  It makes them
   harder to read.  e.g.:

-static void setup_object(struct kmem_cache *s, struct page *page,
-   void *object)
+static void setup_object(struct kmem_cache *s, struct page *page, void *object)


mm-kmem_estimate_pages.patch

   This introduces
 kestimate
 kestimate_single
 kmem_estimate_pages

   The last obviously returns a number of pages.  The contrast seems
   to suggest the others don't.   But they do...
   I don't think the names are very good, but I concede that it is
   hard to choose good names here.  Maybe:
  kmalloc_estimate_variable
  kmalloc_estimate_fixed
  kmem_alloc_estimate
   ???

mm-reserve.patch

   I'm confused by __mem_reserve_add.

+   reserve = mem_reserve_root.pages;
+   __calc_reserve(res, pages, 0);
+   reserve = mem_reserve_root.pages - reserve;

   __calc_reserve will always add 'pages' to mem_reserve_root.pages.
   So this is a complex way of doing
reserve = pages;
__calc_reserve(res, pages, 0);

And as you can calculate reserve before calling __calc_reserve
(which seems odd when stated that way), the whole function looks
like it could become:

   ret = adjust_memalloc_reserve(pages);
   if (!ret)
__calc_reserve(res, pages, limit);
   return ret;

What am I missing?

Also, mem_reserve_disconnect really should be a "void" function.
Just put a BUG_ON(ret) and don't return anything.

Finally, I'll just repeat that the purpose of the tree structure
eludes me.

net-sk_allocation.patch

Why are the "GFP_KERNEL" call sites just changed to
"sk->sk_allocation" rather than "sk_allocation(sk, GFP_KERNEL)" ??

I assume there is a good reason, and seeing it in the change log
would educate me and make the patch more obviously correct.

netvm-reserve.patch

Function names again:

 sk_adjust_memalloc
 sk_set_memalloc

sound similar.  Purpose is completely different.

mm-page_file_methods.patch

This makes page_offset and others more expensive by adding a
conditional jump to a function call that is not usually made.

Why do swap pages have a different index to everyon

Re: Top 10 kernel oopses for the week ending January 5th, 2008

2008-01-13 Thread Neil Brown
On Thursday January 10, [EMAIL PROTECTED] wrote:
> On Thu, Jan 10, 2008 at 03:13:48PM +1100, Neil Brown wrote:
> > > What guarantees that it doesn't happen before we get to callback?  AFAICS,
> > > nothing whatsoever...
> > 
> > Yes, that's bad isn't it :-)
> > 
> > I think I should be using sysfs_schedule_callback here.  That makes the 
> > required 'get' and 'put' calls but it can fail with -ENOMEM.  I
> > wonder what I do if -ENOMEM???  Maybe I'll just continue to roll my
> > one :-( 
> 
> How about this instead (completely untested)
> 
>   * split failure exits
>   * switch to kick_rdev_from_array()
>   * fold unbind_rdev_from_array() into it (no other callers anymore)
>   * take export_rdev() into failure case in bind_rdev_to_array()
>   * in kick_rdev_from_array() do what export_rdev() does sans
> kobject_put() and do that before schedule_work().  Take kobject_put() into
> delayed_delete().

While there are probably some good ideas in there, I think fixing this
particular bug is much simpler.  Just take a reference to the object
before scheduling the worker, and drop it when the worker has done
its work.

I have a closer look at the idea of no required export_rdev after a
failed bind_rdev_to_array.  On the surface it does seem to make the
code nicer.

Thanks,
NeilBrown
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Top 10 kernel oopses for the week ending January 5th, 2008

2008-01-09 Thread Neil Brown
On Tuesday January 8, [EMAIL PROTECTED] wrote:
> 
> FWIW, I'm going to go through Arjan's collection and post blow-by-blow
> analysis of some of those suckers.  Tonight, probably...
> 
> Let's take e.g. http://www.kerneloops.org/raw.php?rawid=2618

Thanks for that analysis.
...
> 
> Humm...  So we have kobj->parent containing crap.  What about the caller?
> It's from drivers/md/md.c:
> static void delayed_delete(struct work_struct *ws)

This is a good argument for sticking "md_" at the from of all my
function names, even if they are static.  I'm fairly sure I looked at
that trace:

> Call Trace:
>  [] kobject_put+0x19/0x20
>  [] kobject_del+0x2b/0x40
>  [] delayed_delete+0x0/0xb0
>  [] delayed_delete+0x69/0xb0
>  [] run_workqueue+0x175/0x210
>  [] worker_thread+0x71/0xb0
>  [] autoremove_wake_function+0x0/0x40
>  [] worker_thread+0x0/0xb0
>  [] kthread+0x4d/0x80
>  [] child_rip+0xa/0x12
>  [] restore_args+0x0/0x30
>  [] kthread+0x0/0x80
>  [] child_rip+0x0/0x12

but as it doesn't mention 'md' or 'nfs' I moved on.  My bad.

> 
> What guarantees that it doesn't happen before we get to callback?  AFAICS,
> nothing whatsoever...

Yes, that's bad isn't it :-)

I think I should be using sysfs_schedule_callback here.  That makes the 
required 'get' and 'put' calls but it can fail with -ENOMEM.  I
wonder what I do if -ENOMEM???  Maybe I'll just continue to roll my
one :-( 

Thanks,
NeilBrown

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [BUG] New Kernel Bugs

2007-11-14 Thread Neil Brown
On Tuesday November 13, [EMAIL PROTECTED] wrote:
> On Tuesday 13 November 2007 07:08, Mark Lord wrote:
> > Ingo Molnar wrote:
> > ..
> >
> > > This is all QA-101 that _cannot be argued against on a rational basis_,
> > > it's just that these sorts of things have been largely ignored for
> > > years, in favor of the all-too-easy "open source means many eyeballs and
> > > that is our QA" answer, which is a _good_ answer but by far not the most
> > > intelligent answer! Today "many eyeballs" is simply not good enough and
> > > nature (and other OS projects) will route us around if we dont change.
> >
> > ..
> >
> > QA-101 and "many eyeballs" are not at all in opposition.
> > The latter is how we find out about bugs on uncommon hardware,
> > and the former is what we need to track them and overall quality.
> >
> > A HUGE problem I have with current "efforts", is that once someone
> > reports a bug, the onus seems to be 99% on the *reporter* to find
> > the exact line of code or commit.  Ghad what a repressive method.
> 
> This is the only method that scales.

That sounds overly hash, and the rest of you mail sounds much more
moderate and sensible - I can only assume you were using hyperbole??

Putting the "onus on the reporter" is simply not going to work unless
you have a business relationship.  In the community, we are all
volunteering our time (well, maybe my employer is volunteering my time
to do community support, but the effect is the same).

I would hope that the focus of developers is to empower bug reporters
to provide further information (and as has been said, "git bisect" is
a great empowerer).  Some people will be incredibly help, especially
if you ask politely and say thankyou.  Others won't for any of a
number of reasons - and maybe that means their bug won't get fixed.

To my eyes, the "only method that scales" is investing effort in
encouraging and training bug reporters.  Some of that effort might not
produce results, but when others among those you have encouraged start
answering the newbee questions on the list and save you the time, you
get a distinct feeling that it was all worth while.


I think we are in agreement - I just wanted to take issue with that
one sentence :-)  The rest is great.

NeilBrown

> 
> Developer has only 24 hours in each day, and sometimes he needs to eat,
> sleep, and maybe even pay attention to e.g. his kids.
> 
> But bug reporters are much more numerous and they have more
> hours in one day combined.
> 
> BUT - it means that developers should try to increase user base,
> not scare users away.
> 
> > And if the "developer" who broke the damn thing, or who at least
> > "claims" to be supporting that code, cannot "reproduce" the bug,
> > they drop it completely.
> 
> Developer should let reporter know that reporter needs to help
> a bit here. Sometimes a bit of hand holding is needed, but it
> pays off because you breed more qualified testers/bug reporters.
> 
> > Contrast that flawed approach with how Linus does things..
> > he thinks through the symptoms, matches them to the code,
> > and figures out what the few possibilities might be,
> > and feeds back some trial balloon patches for the bug reporter to try.
> >
> > MUCH better.
> >
> > And remember, *I'm* an old-time Linux kernel developer.. just think about
> > the people reporting bugs who haven't been around here since 1992..
> 
> Yes. Developers should not grow more and more unhelpful
> and arrogant towards their users just because inexperienced
> users send incomplete/poorly written bug reports.
> They need to provide help, not humiliate/ignore.
> 
> I think we agree here.
> --
> vda
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [EMAIL PROTECTED]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [BUG] New Kernel Bugs

2007-11-14 Thread Neil Brown
On Wednesday November 14, [EMAIL PROTECTED] wrote:
> On Wed, Nov 14, 2007 at 09:38:20AM -0800, Randy Dunlap wrote:
> > On Wed, 14 Nov 2007 15:08:47 +0100 Ingo Molnar wrote:
> > > so please stop this "too busy and too noisy" nonsense already. It was 
> > > nonsense 10 years ago and it's nonsense today. In 10 years the kernel 
> > > grew from a 1 million lines codebase to an 8 million lines codebase, so 
> > > what? Deal with it and be intelligent about filtering your information 
> > > influx instead of imposing a hard pre-filtering criteria that restricts 
> > > intelligent processing of information.
> > 
> > So you have a preferred method of handling email.  Please don't
> > force it on the rest of us.
> 
> I'd be curious for any pointers on tools, actually.  I "read" (ok, skim)
> lkml but still overlook relevant bug reports occasionally.
> (Fortunately, between Trond and Andrew and others forwarding things it's
> not actually a problem, but I'm still curious).

Virtual Folders.

I use VM mode in EMACS, but I believe some other mail readers have the
same functionality.
I have a virtual folder called "nfs" which shows me all mail in my
inbox which has the string 'nfs' or 'lockd' in a To, Cc, or Subject
field.  When I visit that folder, I see all mail about nfs, whether it
was sent to me personally, or to a relevant list, or to lkml.

Admittedly if someone doesn't bother to choose a meaningful Subject,
then I might miss that.  I think this mostly happens when Andrew sends
a "-mm" announcement, asked people to change the subject line when
following up, and someone follows up without changing the subject line
and say "NFS doesn't work any more".

I have another virtual folder which matches "md" and "raid" and
"mdadm" in any header (so when the people from coraid.com talk about
ATA over Ethernet, that gets badly filed, but it is a small cost).

Then I have the "bkernel" (boring kernel) folder for all mail from
lkml that doesn't mention nfs or raid or md, and isn't from or to
me.  That folder I skim every week or so and just read the juicy
debates and look for interesting tidbits from interesting people -
then delete the whole folder, mostly unread.

I don't think I could cope with mail without virtual folders.

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] NFS: handle IPv6 addresses in nfs ctl

2007-10-29 Thread Neil Brown
On Monday October 22, [EMAIL PROTECTED] wrote:
> Here is a second missing part of the IPv6 support in NFS server code 
> concerning knfd syscall interface.
> It updates write_getfd and write_getfd to accept IPv6 addresses.

Sorry for not replying to this earlier - I saw the Oct 12 post and
thought about it but didn't actually reply before getting distracted
:-(

I think this patch is unnecessary and hence not wanted.
The getfs / getfs calls should be considered legacy calls.
Adding functionality to them is not appropriate.

IPv6 addresses should be handled only via the
   /proc/net/rpc/auth.unix.ip 
interface.

Thanks,
NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch] sunrpc: make closing of old temporary sockets work (was: problems with lockd in 2.6.22.6)

2007-09-12 Thread Neil Brown
On Wednesday September 12, [EMAIL PROTECTED] wrote:
> Hello,
> 
> as already described old temporary sockets (client is gone) of lockd aren't
> closed after some time. So, with enough clients and some time gone, there
> are 80 open dangling sockets and you start getting messages of the form:
> 
> lockd: too many open TCP sockets, consider increasing the number of nfsd 
> threads.
> 
> If I understand the code then the intention was that the server closes
> temporary sockets after about 6 to 12 minutes:
> 
>   a timer is started which calls svc_age_temp_sockets every 6 minutes.
> 
>   svc_age_temp_sockets:
>   if a socket is marked OLD it gets closed.
>   sockets which are not marked as OLD are marked OLD
> 
>   every time the sockets receives something OLD is cleared.
> 
> But svc_age_temp_sockets never closes any socket though because it only
> closes sockets with svsk->sk_inuse == 0. This seems to be a bug.
> 
> Here is a patch against 2.6.22.6 which changes the test to
> svsk->sk_inuse <= 0 which was probably meant. The patched kernel runs fine
> here. Unused sockets get closed (after 6 to 12 minutes)
> 
> Signed-off-by: Wolfgang Walter <[EMAIL PROTECTED]>
> 
> --- ../linux-2.6.22.6/net/sunrpc/svcsock.c2007-08-27 18:10:14.0 
> +0200
> +++ net/sunrpc/svcsock.c  2007-09-11 11:07:13.0 +0200
> @@ -1572,7 +1575,7 @@
>  
>   if (!test_and_set_bit(SK_OLD, &svsk->sk_flags))
>   continue;
> - if (atomic_read(&svsk->sk_inuse) || test_bit(SK_BUSY, 
> &svsk->sk_flags))
> + if (atomic_read(&svsk->sk_inuse) <= 0 || test_bit(SK_BUSY, 
> &svsk->sk_flags))
>   continue;
>   atomic_inc(&svsk->sk_inuse);
>   list_move(le, &to_be_aged);
> 
> 
> As svc_age_temp_sockets did not do anything before this change may trigger
> hidden bugs.
> 
> To be true I don't see why this check
> 
> (atomic_read(&svsk->sk_inuse) <= 0 || test_bit(SK_BUSY, &svsk->sk_flags))
> 
> is needed at all (it can only be an optimation) as this fields change after
> the check. In svc_tcp_accept there is no such check when a temporary socket
> is closed.

Thanks for looking into this.

I think the correct change is to test
if (atomic_read(&svsk->sk_inuse) > 1 || test_bit(SK_BUSY, 
&svsk->sk_flags))
or even
if (atomic_read(&svsk->sk_inuse) != 1 || test_bit(SK_BUSY, 
&svsk->sk_flags))

sk_inuse contains a bias of '1' until SK_DEAD is set.  So a normal,
active socket will have an inuse count of 1 or more.  If it is exactly
1, then either it is SK_DEAD (in which case there is nothing for this
code to do), or it has no users, in which case it is appropriate to
close the socket if it is old.
Note that this test is for "the socket should not be closed", so we
test if it is *not* 1, or > 1.

The tests are needed because we don't want to close a socket that
might be inuse elsewhere.  The SK_BUSY bit combined with the sk_inuse
count combine to check if the socket is in use at all or not.

You change effectively disabled the test, as sk_inuse is never <= 0
(except when SK_DEAD is set).

This bug has been present since 
commit aaf68cfbf2241d24d46583423f6bff5c47e088b3
Author: NeilBrown <[EMAIL PROTECTED]>
Date:   Thu Feb 8 14:20:30 2007 -0800

(i.e. it is my fault).
So it is in 2.6.21 and later and should probably go to .stable for .21
and .22.

Bruce:  for you :-)
---
Correctly close old nfsd/lockd sockets.

From: NeilBrown <[EMAIL PROTECTED]>

Commit aaf68cfbf2241d24d46583423f6bff5c47e088b3 added a bias
to sk_inuse, so this test for an unused socket now fails.  So no
sockets gets closed because they are old (they might get closed
if the client closed them).

This bug has existed since 2.6.21-rc1.

Thanks to Wolfgang Walter for finding and reporting the bug.


Cc: Wolfgang Walter <[EMAIL PROTECTED]>
Signed-off-by: Neil Brown <[EMAIL PROTECTED]>

### Diffstat output
 ./net/sunrpc/svcsock.c |3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff .prev/net/sunrpc/svcsock.c ./net/sunrpc/svcsock.c
--- .prev/net/sunrpc/svcsock.c  2007-09-12 16:05:23.0 +0200
+++ ./net/sunrpc/svcsock.c  2007-09-12 16:06:01.0 +0200
@@ -1592,7 +1592,8 @@ svc_age_temp_sockets(unsigned long closu
 
if (!test_and_set_bit(SK_OLD, &svsk->sk_flags))
continue;
-   if (atomic_read(&svsk->sk_inuse) || test_bit(SK_BUSY, 
&svsk->sk_flags))
+   if (atomic_read(&svsk->sk_inuse) > 1
+   || test_bit(SK_BUSY, &svsk->sk_flags))
continue;
atomic_inc(&svsk->sk_inuse);
list_move(le, &to_be_aged);
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [OOPS] 2.6.23-rc5 in tcp/net/nfsd

2007-09-11 Thread Neil Brown
On Tuesday September 11, [EMAIL PROTECTED] wrote:
> This oops appeared over night on a box running 2.6.23-rc5 (recent with the
> tcp_input.c fix).
> 
> I can't find a similar one reported.


Okay. this is weird.
> 
> Mark
> 
> 
> BUG: unable to handle kernel NULL pointer dereference at virtual address 
> 007e 
   


That is the bad address,

> EFLAGS: 00010246   (2.6.23-rc5-2-mcyrixiii #1)
> EIP is at ip_fragment+0x7f/0x680
> eax: c3c09c00   ebx:    ecx: b524d006   edx: 007b
  ^

It looks like an offset of 3 from edx.  I got that from decoding:

> Code:  <00> ba 03 00 00 00 bf a6 ff ff

which is 
   0:   00 ba 03 00 00 00 add%bh,0x3(%rdx)

However that instruction doesn't appear in ip_fragment.
The code in ip_fragment reads:
  27:   b9 04 00 00 00mov$0x4,%ecx
^^
  2c:   ba 03 00 00 00mov$0x3,%edx
^^

which contains the bytes of the offending instruction.
Note that $0x4 is ICMP_FRAG_NEEDED and $0x3 is ICMP_DEST_UNREACH:
these are args to icmp_send.  So the latter is the correct disassembly
based on the C code.

So somehow the kernel is jumping to a bad address.  I don't know how
that would happening maybe a single bit error in memory or a
register???

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Oops in 2.6.22.1: skb_copy_and_csum_datagram_iovec()

2007-09-05 Thread Neil Brown
On Wednesday September 5, [EMAIL PROTECTED] wrote:
> On Wednesday August 22, [EMAIL PROTECTED] wrote:
> > Chuck Ebbert <[EMAIL PROTECTED]> wrote:
> > > https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=253290
> > > 
> > > 18:57:54 osama kernel:  [] kernel_recvmsg+0x31/0x40
> > > 18:57:54 osama kernel:  [] svc_udp_recvfrom+0x114/0x368 [sunrpc]
> > 
> > svc_udp_recvfrom is calling kernel_recvmsg with iov == NULL.
> 
> iov == NULL used to work.
> 
> I think it stopped working at 
>commit 759e5d006462d53fb708daa8284b4ad909415da1
> 
> Previously, as len==0, MSG_TRUNC would get set, so copy_only would get
> set, so skb_copy_datagram_iovec would get called, and that handles a
> len of 0.
> 
> Now, skb_copy_and_csum_datagram_iovec gets called unless
> skb_csum_unnecessary(skb), which now kills us.

Actually, the new code is broken for more reasons than that.
In core/datagram.c, the comment for skb_copy_and_csum_datagram_iovec,
it says:
 *  Caller _must_ check that skb will fit to this iovec.

but udp_recvmsg doesn't.

It seems to try:
if (copied < ulen || UDP_SKB_CB(skb)->partial_cov) {
if (udp_lib_checksum_complete(skb))
goto csum_copy_err;
}

if (skb_csum_unnecessary(skb))
err = skb_copy_datagram_iovec(skb, sizeof(struct udphdr),
  msg->msg_iov, copied   );

so it doesn't call skb_copy_datagram_iovec if "copied < ulen".

However earlier there is:
ulen = skb->len - sizeof(struct udphdr);
copied = len;
if (copied > ulen)
copied = ulen;

so if the 'len' (of the iovec) is too small, we end up with "copied == ulen", 

so udp_lib_checksum_complete doesn't get called


> 
> We could 'fix' it by making skb_copy_and_csum_datagram_iovec just
> return if len==0, or don't call it from udp_recvmsg in that case.
> 

So the latter of these is needed.

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Oops in 2.6.22.1: skb_copy_and_csum_datagram_iovec()

2007-09-05 Thread Neil Brown
On Wednesday August 22, [EMAIL PROTECTED] wrote:
> Chuck Ebbert <[EMAIL PROTECTED]> wrote:
> > https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=253290
> > 
> > 18:57:54 osama kernel:  [] kernel_recvmsg+0x31/0x40
> > 18:57:54 osama kernel:  [] svc_udp_recvfrom+0x114/0x368 [sunrpc]
> 
> svc_udp_recvfrom is calling kernel_recvmsg with iov == NULL.

iov == NULL used to work.

I think it stopped working at 
   commit 759e5d006462d53fb708daa8284b4ad909415da1

Previously, as len==0, MSG_TRUNC would get set, so copy_only would get
set, so skb_copy_datagram_iovec would get called, and that handles a
len of 0.

Now, skb_copy_and_csum_datagram_iovec gets called unless
skb_csum_unnecessary(skb), which now kills us.

We could 'fix' it by making skb_copy_and_csum_datagram_iovec just
return if len==0, or don't call it from udp_recvmsg in that case.

The reason that I call kernel_recvmsg with iov == NULL is that I want
to collect the source/dest addresses of the packet, but I don't want
to actually read the data - I want to just get a reference to the skb
(to avoid needless copy where possible) later.
So I call kernel_recvmsg with MSG_PEEK, and don't bother with an
iovec.

I guess a could send an empty iovec rather than a NULL, but it used to
work 

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/1] NFS: change the ip_map cache code to handle IPv6 addresses

2007-08-09 Thread Neil Brown
On Thursday August 9, [EMAIL PROTECTED] wrote:
> Here is a small part of missing pieces of IPv6 support for the server.
> It deals with the ip_map caching code part.
> 
> It changes the ip_map structure to be able to store INET6 addresses.
> It adds also the changes in address hashing, and mapping to test it with 
> INET addresses.

Thanks - this generally seems to make sense.  A few specific comments:

> @@ -1558,6 +1558,7 @@

Please use "diff -p" to include the C function name.  It makes the
patch easier to read.


> +for (i = 0; i < ncp->cl_naddr; i++) {
> +/* Mapping address */
> +addr6.s6_addr32[0] = 0;
> +addr6.s6_addr32[1] = 0;
> +addr6.s6_addr32[2] = htonl(0x);
> +addr6.s6_addr32[3] = (uint32_t)ncp->cl_addrlist[i].s_addr;
> +auth_unix_add_addr(addr6, dom);
> +}

You do this conversion several times, and each time it is indented
strangely...  Maybe create an inline (if there isn't one already) to
do this.

>  extern void auth_domain_put(struct auth_domain *item);
> -extern int auth_unix_add_addr(struct in_addr addr, struct auth_domain 
> *dom);
> +extern int auth_unix_add_addr(struct in6_addr addr, struct auth_domain 
> *dom);

Your mailer is wrapping long lines.  This is bad.

>  static int ip_map_match(struct cache_head *corig, struct cache_head *cnew)
>  {
>  struct ip_map *orig = container_of(corig, struct ip_map, h);
>  struct ip_map *new = container_of(cnew, struct ip_map, h);
>  return strcmp(orig->m_class, new->m_class) == 0
> -&& orig->m_addr.s_addr == new->m_addr.s_addr;
> +&& memcmp(orig->m_addr.s6_addr, 
> new->m_addr.s6_addr,sizeof(struct in6_addr));
>  }

I think you have inverted the sense of the test.  The original tests
for equality.  The new tests for inequality.  For memcmp (and strcmp),
always use   "function(arg1, argc) COMPARE_OP 0"
e.g.
memcmp(orig, new, size) == 0
or use ipv6_addr_equal.


Also please run ./scripts/checkpatch.pl on the patch.  You need a
space after that comma.

   
> @@ -125,7 +129,7 @@
>  struct ip_map *item = container_of(citem, struct ip_map, h);
>  
>  strcpy(new->m_class, item->m_class);
> -new->m_addr.s_addr = item->m_addr.s_addr;
> +memcpy(&new->m_addr.s6_addr, 
> &item->m_addr.s6_addr,sizeof(item->m_addr.s6_addr));

Ditto. did you test this code?


>  }
>  static void update(struct cache_head *cnew, struct cache_head *citem)
>  {
> @@ -151,20 +155,28 @@
>  {
>  char text_addr[20];
>  struct ip_map *im = container_of(h, struct ip_map, h);
> -__be32 addr = im->m_addr.s_addr;
> +   
> +__be32 addr[4];
> +int i;
> +for (i=0;i<4;i++)
> +addr[i] = im->m_addr.s6_addr[i];
   

I think you mean "s6_addr32" ??


>  
> -snprintf(text_addr, 20, "%u.%u.%u.%u",
> - ntohl(addr) >> 24 & 0xff,
> - ntohl(addr) >> 16 & 0xff,
> - ntohl(addr) >>  8 & 0xff,
> - ntohl(addr) >>  0 & 0xff);
> +snprintf(text_addr, 20, "%04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x",
> +ntohl(addr[3]) >> 16 & 0xff,
  
should be 0x ???

Now I know you didn't test the code :-)

> +ntohl(addr[3]) >>  0 & 0xff,
> +ntohl(addr[2]) >> 16 & 0xff,
> +ntohl(addr[2]) >>  0 & 0xff,
> +ntohl(addr[1]) >> 16 & 0xff,
> +ntohl(addr[1]) >>  0 & 0xff,
> +ntohl(addr[0]) >> 16 & 0xff,
> +ntohl(addr[0]) >>  0 & 0xff);

This code would read better if you did
   __be16 addr[8];
   for (i = 0; i < 8 ; i++)
addr[i] = im->m_addr.s6_addr16[i];
   snprintf(
ntohs(addr[7]),
ntohs(addr[6]),


Also, I think that if it is a (converted) IPv4 address, it should be
printed as a dotted-quad.


> +   addr.s6_addr32[0] = htonl((b1<<16)|b2);
> +   addr.s6_addr32[1] = htonl((b3<<16)|b4);
> +   addr.s6_addr32[2] = htonl((b5<<16)|b6);
> +   addr.s6_addr32[3] = htonl((b7<<16)|b8);

Assign to addr.s6_addr16[].

> -seq_printf(m, "%s %d.%d.%d.%d %s\n",
> +seq_printf(m, "%s %04x.%04x.%04x.%04x.%04x.%04x.%04x.%04x %s\n",
> im->m_class,
> -   ntohl(addr.s_addr) >> 24 & 0xff,
> -   ntohl(addr.s_addr) >> 16 & 0xff,
> -   ntohl(addr.s_addr) >>  8 & 0xff,
> -   ntohl(addr.s_addr) >>  0 & 0xff,
> +   ntohl(addr.s6_addr32[3]) >> 16 & 0x,
> +   ntohl(addr.s6_addr32[3]) & 0x,
> +   ntohl(addr.s6_addr32[2]) >> 16 & 0x,
> +   ntohl(addr.s6_addr32[2]) & 0x,
> +   ntohl(addr.s6_addr32[1]) >> 16 & 0x,
> +   ntohl(addr.s6_addr32[1]) & 0x,
> +   ntohl(addr.s6_addr32[0]) >> 16 & 0x,
> +   ntohl(addr.s6_addr32[0]) & 0x,

Again, s6_addr16[], and keep the dotted-quad for IPv4.
In fact ( borrowing Chuck's suggestion)
 if (is_ipv4)
 

Re: [PATCH 1/1] NFS: change the ip_map cache code to handle IPv6 addresses

2007-08-09 Thread Neil Brown
On Thursday August 9, [EMAIL PROTECTED] wrote:
> >>
> >> How have you tested the effectiveness of the new hash function?
> > 
> > I have not tested that point but I can easily imagine there are better 
> > solutions.
> > Perhaps we can keep the same function for an IPv4 address (only taking 
> > the 32 bits of IPv4 addr), and then design one for IPv6 addresses.
> 
> I see that, to generate the hash, you would be xor-ing the FF and 00 
> bytes in the canonicalized IPv4 address.  Yes, perhaps a better function 
> is needed, or as you say, one specifically for IPv6 and one for 
> canonicalized IPv4.
> 

I suspect that the given hash function will be as good as any other.
The values in each byte of an IPv6 are likely to be either evenly
distributed, or constant.
Each the first case you don't need a clever hash function to
distribute them, while in the second, no hash function can improve the
distribution. 

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [NFS] [PATCH 001 of 3] knfsd: Use recv_msg to get peer address for NFSD instead of code-copying

2007-03-05 Thread Neil Brown
On Monday March 5, [EMAIL PROTECTED] wrote:
> On Friday 02 March 2007 05:28, NeilBrown wrote:
> > The sunrpc server code needs to know the source and destination address
> > for UDP packets so it can reply properly.
> > It currently copies code out of the network stack to pick the pieces out
> > of the skb.
> > This is ugly and causes compile problems with the IPv6 stuff.
> 
> ... and this IPv6 code could never have worked anyway:

:-(
It's hard to test the IPv6 server until we have an IPv6 client I
guess, so thanks for the code review, even though we aren't going to
end up using that code...

> 
> But I find using recvmsg just for getting at the addresses
> a little awkward too.

Do you?  It's surely a lot better than code duplication, and it is
exactly how you would get the information from user-space.

>   And I think to be on the safe side, you
> should check that you're really looking at a PKTINFO cmsg
> rather than something else.

Maybe.
But is there really a chance that it might not be PKTINFO?
And what do you do if it isn't?
Log an error and drop the packet I guess.

I'll see what I can do.

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [NFS] [PATCH 001 of 3] knfsd: Use recv_msg to get peer address for NFSD instead of code-copying

2007-03-05 Thread Neil Brown
On Monday March 5, [EMAIL PROTECTED] wrote:
> 
> Hi Neil,
> 
> here's another minor comment:
> 
> On Friday 02 March 2007 05:28, NeilBrown wrote:
> > +static inline void svc_udp_get_dest_address(struct svc_rqst *rqstp,
> > +   struct cmsghdr *cmh)
> >  {
> > switch (rqstp->rq_sock->sk_sk->sk_family) {
> > case AF_INET: {
> > +   struct in_pktinfo *pki = CMSG_DATA(cmh);
> > +   rqstp->rq_daddr.addr.s_addr = pki->ipi_spec_dst.s_addr;
> > break;
> > +   }
> ...
> 
> The daddr that is extracted here will only ever be used to build
> another PKTINFO cmsg when sending the reply. So it would be
> much easier to just store the raw control message in the svc_rqst,
> without looking at its contents, and send it out along with the reply,
> unchanged.

Yes, sounds tempting, doesn't it?
Unfortunately it isn't that simple as I found out when the sunrpc code
in glibc did exactly that.

You see sendmsg will use the interface-number as well as the source
address from the PKTINFO structure.

Suppose my server has two interfaces (A and B) on two subnets that
both are connected to some router which is connected to a third subnet
that my client is on.  Further, suppose my server has only one default
route, out interface A.
The client chooses the IP address of interface B and sends a request.
It arrives on interface B and is processed.
If the PKTINFO received is passed unchanged to sendmsg, the pack will
be sent out interface B.  But interfacve B doesn't have a route to
that client, so the packet is dropped.

This exactly what was happening for me with mountd a few years ago.

So yes, we could just zero the interface field, but I think it is
clearer to extract that wanted data, then re-insert it.  They really
are different structures with different meanings (send verse receive)
which happen to have the same layout.

Thanks,
NeilBrown

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 2.6.20-git8 fails compile -- net/built-in.o __ipv6_addr_type

2007-02-12 Thread Neil Brown
On Monday February 12, [EMAIL PROTECTED] wrote:
> Quoting YOSHIFUJIHideaki/=?iso-2022-jp?B?GyRCNUhGIzFRTEAbKEI=?=
>   > In article <[EMAIL PROTECTED]> (at Mon, 12 Feb 2007 18:12:16 -0800), 
> Andrew Morton <[EMAIL PROTECTED]> says:
>   > 
>   > > > On Mon, 12 Feb 2007 20:10:13 -0500 (EST) Pete Clements <[EMAIL 
> PROTECTED]> wrote:
>   > > > 2.6.20-git8 fails compile:
>   > > > 
>   > > >   CHK include/linux/compile.h
>   > > >   UPD include/linux/compile.h
>   > > >   CC  init/version.o
>   > > >   LD  init/built-in.o
>   > > >   LD  .tmp_vmlinux1
>   > > > net/built-in.o: In function `svc_udp_recvfrom':
>   > > > svcsock.c:(.text+0x61be4): undefined reference to `__ipv6_addr_type'
>   > > > make: *** [.tmp_vmlinux1] Error 1
>   > > > 

Hmmm.  

> CONFIG_IPV6=m
combined with
> CONFIG_SUNRPC=y

is a combination that isn't going to work at the moment

Options?

 - Move __ipv6_addr_type to lib/ ??
Might not work for modules.
 - Disallow that combination in the .config.
.. a bit harsh.
 - Only include IPv6 support in SUNRPC is a module or IPV6 is built-in
A bit of a kludge..

Chuck?  Any ideas?

NeilBrown

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ANNOUNCE] d80211 based driver for Intel PRO/Wireless 3945ABG

2007-02-09 Thread Neil Brown
On Friday February 9, [EMAIL PROTECTED] wrote:
> 
> Ok.  Now... any questions?
> 

Yes.  Does this require a closed user-space helper like the other
3945ABG driver, or is it completely open (maybe excepting firmware)?

Thanks,
NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


TCP stack sometimes loses ACKs ... or something

2006-11-06 Thread Neil Brown

I upgraded my notebook from 2.6.16 to 2.6.18 recently and noticed that
I couldn't talk to my VOIP device (which has a WEB interface).
Watching traffic I see the three-way-handshake working perfectly, and
then the first data packet is sent (a partial HTTP request: 
GET / HTTP/1.1 ) and an ACK comes back from the device.
Then the next data packet (remainder of the HTTP request) is sent, but
tcpdump never sees the ACK, nor does the TCP stack.  So the data gets
recent repeatedly.  No ack. Ever.

With 2.6.16, The ack comes back just fine and the connection proceeds
as you would expect.

As it was a very reproducible problem I decided to try "git bisect"
and found 

 bad: [7b4f4b5ebceab67ce440a61081a69f0265e17c2a] [TCP]: Set default max buffers 
from memory pool size

I double checked as this seemed a fairly unlikely patch to cause the
problem, but this definitely is it.
The net effect of this patch is to change the last of the three
numbers in 
cat /proc/sys/net/ipv4/tcp_[rw]mem 
from well below 2^20 to well above. 2^20 seems to be a significant
number. I set tcp_wmem to that and the ACK was lost.  I set it to
one less and the first ACK (at least) was accepted.
I ended up setting both r and w to 10 and everything is fine.

Exploring more deeply, and comparing:
  - a failing connection (to VIOP box, [rw]mem large)
  - a working connection to VOIP box ([rw]mem small)
  - a working connection to another machine ([rw]mem irrelevant).
I find:

  The VIOP returns MSS=1360 in the SYN/ACK packet.  Other machine
returns MSS=1460

  The ack that is getting lost contains data as well as the
  ACK. i.e. the same packet that ACKs at the TCP level includes the
  HTTP level reply.
  The matching ACK from the other machine (some Linux 2.6.8 I think)
   is a data-less ACK followed very quickly by the HTTP reply in
   a separate packet.

  The 'Timestamps' option coming back from the VOIP box is a little
  odd.  The Timestamp in the SYN/ACK is the same as the timestamp in
  the next ACK (the ack for the first partial HTTP request).
  The Timestamp in the next packet which is the one that gets lost has
  exactly the same TSval as previous packets, and TSecr is one more
  than in the previous packet.

I assume that one (or more) of these differences combined with the
large tcp_[rw]mem value cause the packet loss, but I have no idea
which.

Help?

I can make the tcp traces available if needed, but these are really
the only non-trivial differences.

I'm willing to test patches.

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH 2/9] deadlock prevention core

2006-08-17 Thread Neil Brown
On Thursday August 17, [EMAIL PROTECTED] wrote:
> Andrew Morton wrote:
> > Daniel Phillips <[EMAIL PROTECTED]> wrote:
> >>What happened to the case where we just fill memory full of dirty file
> >>pages backed by a remote disk?
> > 
> > Processes which are dirtying those pages throttle at
> > /proc/sys/vm/dirty_ratio% of memory dirty.  So it is not possible to "fill"
> > memory with dirty pages.  If the amount of physical memory which is dirty
> > exceeds 40%: bug.
> 
> Hi Andrew,
> 
> So we make 400 MB of a 1 GB system unavailable for write caching just to
> get around the network receive starvation issue?

No.  We make it unavailable for write caching so it is available for
other important things like cached clean pages and executables etc.
You have to start throttling some where or you get very bad
behaviour.  40% seems a good number, but it is tunable.
The fact that it helps with receive starvation is just s bonus.

> 
> What happens if some in kernel user grabs 68% of kernel memory to do some
> very important thing, does this starvation avoidance scheme still work?

That is a very good question.  Is memlocked memory throttled against
dirty pages, and does it decrease the space available to the 40%
calculation?  I don't know.  I guess I could look at the code...

get_dirty_limits in page_writeback caps 'dirty_ratio' (40 by default,
hence the 40%) at unmapped_ratio/2.
So yes.  If 68% is mapped and locked (I assume that it the situation
you are referring to) that leaves 32% unlocked, so the 40% above is
reduced to 16% and you should still have 160Meg of breathing space.

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH 2/9] deadlock prevention core

2006-08-14 Thread Neil Brown
On Monday August 14, [EMAIL PROTECTED] wrote:
> On Sun, 2006-08-13 at 22:22 -0700, Andrew Morton wrote:
> > 
> > We could track dirty anonymous memory and throttle.
> > 
> > Also, there must be some value of /proc/sys/vm/min_free_kbytes at which a
> > machine is no longer deadlockable with any of these tricks.  Do we know
> > what level that is?
> 
> Not sure, the theoretical max amount of memory one can 'lose' in socket
> wait queues is well over the amount of physical memory we have in
> machines today (even for SGI); this combined with the fact that we limit
> the memory in some way to avoid DoS attacks, could make for all memory
> to be stuck in wait queues. Of course this becomes rather more unlikely
> for ever larger amounts of memory. But unlikely is never a guarantee.

What is the minimum amount of memory we need to reserve for each
socket?  1K? 1 page?  Call it X
Suppose that whenever a socket is created (or bound or connected or
whatever is right) we first allocate that much to a recv pool.
If any socket has less than X queued, then it is allowed to allocate
up to a total of X from the reserve pool.  After that it can only
receive when memory can be allocated from elsewhere.  Then we will
never block on recv.

Note that X doesn't need to be the biggest possible incoming message.
It only needs to be enough to get an 'ack' over any possible network
storage protocol with any possible layering. I suspect that it well
within one page.

Would it be too much waste to reserve one page for every idle socket? 

Does this have some fatal flaw?

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [2.6 patch] net/sunrpc/xdr.c: remove xdr_decode_string()

2005-11-23 Thread Neil Brown
On Wednesday November 23, [EMAIL PROTECTED] wrote:
> On Wed, Nov 23, 2005 at 04:31:14AM -0800, Lever, Charles wrote:
> > so i don't remember why you are removing xdr_decode_string.  are we sure
> > that no-one will need this functionality in the future?  it is harmless
> > to remove today, but i wonder if someone is just going to add it back
> > sometime.
> 
> It's unused and you said:
>   the only harmless change i see below is removing xdr_decode_string().
> 

As 'xdr_decode_string' (sometimes) modifies the buffer that it is
decoding, I don't think it's usage should be encouraged.  If it is no
longer in use, then I fully support and encourage removing it.

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html