Re: [PATCH v2] sunrpc: fix waitqueue_active without memory barrier in sunrpc
"J. Bruce Fields" writes: > On Thu, Oct 15, 2015 at 11:44:20AM +, Kosuke Tatsukawa wrote: >> Tatsukawa Kosuke wrote: >> > J. Bruce Fields wrote: >> >> Thanks for the detailed investigation. >> >> >> >> I think it would be worth adding a comment if that might help someone >> >> having to reinvestigate this again some day. >> > >> > It would be nice, but I find it difficult to write a comment in the >> > sunrpc layer why a memory barrier isn't necessary, using the knowledge >> > of how nfsd uses it, and the current implementation of the network code. >> > >> > Personally, I would prefer removing the call to waitqueue_active() which >> > would make the memory barrier totally unnecessary at the cost of a >> > spin_lock + spin_unlock by unconditionally calling >> > wake_up_interruptible. >> >> On second thought, the callbacks will be called frequently from the tcp >> code, so it wouldn't be a good idea. > > So, I was even considering documenting it like this, if it's not > overkill. > > Hmm... but if this is right, then we may as well ask why we're doing the > wakeups at all. Might be educational to test the code with them > removed. > > --b. > > commit 0882cfeb39e0 > Author: J. Bruce Fields > Date: Thu Oct 15 16:53:41 2015 -0400 > > svcrpc: document lack of some memory barriers. > > Kosuke Tatsukawa points out an odd lack of memory barriers in some sites > here. I think the code's correct, but it's probably worth documenting. > > Reported-by: Kosuke Tatsukawa > > diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c > index 856407fa085e..90480993ec4a 100644 > --- a/net/sunrpc/svcsock.c > +++ b/net/sunrpc/svcsock.c > @@ -399,6 +399,25 @@ static int svc_sock_secure_port(struct svc_rqst *rqstp) > return svc_port_is_privileged(svc_addr(rqstp)); > } > > +static void svc_no_smp_mb(void) > +{ > + /* > + * Kosuke Tatsukawa points out there should normally be an > + * smp_mb() at the callsites of this function. (Either that or > + * we could just drop the waitqueue_active() checks.) > + * > + * It appears they aren't currently necessary, though, basically > + * because nfsd does non-blocking reads from these sockets, so > + * the only places we wait on this waitqueue is in sendpage and > + * sendmsg, which won't be waiting for wakeups on newly arrived > + * data. > + * > + * Maybe we should add the memory barriers anyway, but these are > + * hot paths so we'd need to be convinced there's no sigificant > + * penalty. > + */ > +} > + > /* > * INET callback when data has been received on the socket. > */ > @@ -414,7 +433,7 @@ static void svc_udp_data_ready(struct sock *sk) > set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags); > svc_xprt_enqueue(&svsk->sk_xprt); > } > - smp_mb(); > + svc_no_smp_mb(); > if (wq && waitqueue_active(wq)) > wake_up_interruptible(wq); > } > @@ -433,7 +452,7 @@ static void svc_write_space(struct sock *sk) > svc_xprt_enqueue(&svsk->sk_xprt); > } > > - smp_mb(); > + svc_no_smp_mb(); > if (wq && waitqueue_active(wq)) { > dprintk("RPC svc_write_space: someone sleeping on %p\n", > svsk); > @@ -789,7 +808,7 @@ static void svc_tcp_listen_data_ready(struct sock *sk) > } > > wq = sk_sleep(sk); > - smp_mb(); > + svc_no_smp_mb(); > if (wq && waitqueue_active(wq)) > wake_up_interruptible_all(wq); > } > @@ -811,7 +830,7 @@ static void svc_tcp_state_change(struct sock *sk) > set_bit(XPT_CLOSE, &svsk->sk_xprt.xpt_flags); > svc_xprt_enqueue(&svsk->sk_xprt); > } > - smp_mb(); > + svc_no_smp_mb(); > if (wq && waitqueue_active(wq)) > wake_up_interruptible_all(wq); > } > @@ -827,7 +846,7 @@ static void svc_tcp_data_ready(struct sock *sk) > set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags); > svc_xprt_enqueue(&svsk->sk_xprt); > } > - smp_mb(); > + svc_no_smp_mb(); > if (wq && waitqueue_active(wq)) > wake_up_interruptible(wq); > } > @@ -1599,7 +1618,7 @@ static void svc_sock_detach(struct svc_xprt *xprt) > sk->sk_write_space = svsk->sk_owspace; > > wq = sk_sleep(sk); > - smp_mb(); > + svc_no_smp_mb(); > if (wq && waitqueue_active(wq)) > wake_up_interruptible(wq); > } I would feel a lot more comfortable if you instead created: static inline bool sunrpc_waitqueue_active(struct wait_queue_head *wq) { if (!wq) return false; /* long comment abot not needing a memory barrier */ return waitqueue_active(wq); } and then replace various "if (wq && waitqueue_active(wq))" calls with if (sunrpc_waitqueue_active(wq))" The comment seems readable and seems to make sense. NeilBrown signature.asc Description: PGP signature
Re: [PATCH v2] sunrpc: fix waitqueue_active without memory barrier in sunrpc
Kosuke Tatsukawa writes: > There are several places in net/sunrpc/svcsock.c which calls > waitqueue_active() without calling a memory barrier. Add a memory > barrier just as in wq_has_sleeper(). > > I found this issue when I was looking through the linux source code > for places calling waitqueue_active() before wake_up*(), but without > preceding memory barriers, after sending a patch to fix a similar > issue in drivers/tty/n_tty.c (Details about the original issue can be > found here: https://lkml.org/lkml/2015/9/28/849). hi, this feels like the wrong approach to the problem. It requires extra 'smb_mb's to be spread around which are hard to understand as easy to forget. A quick look seems to suggest that (nearly) every waitqueue_active() will need an smb_mb. Could we just put the smb_mb() inside waitqueue_active()?? Thanks, NeilBrown > > Signed-off-by: Kosuke Tatsukawa > --- > v2: > - Fixed compiler warnings caused by type mismatch > v1: > - https://lkml.org/lkml/2015/10/8/993 > --- > net/sunrpc/svcsock.c |6 ++ > 1 files changed, 6 insertions(+), 0 deletions(-) > > diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c > index 0c81202..ec19444 100644 > --- a/net/sunrpc/svcsock.c > +++ b/net/sunrpc/svcsock.c > @@ -414,6 +414,7 @@ static void svc_udp_data_ready(struct sock *sk) > set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags); > svc_xprt_enqueue(&svsk->sk_xprt); > } > + smp_mb(); > if (wq && waitqueue_active(wq)) > wake_up_interruptible(wq); > } signature.asc Description: PGP signature
Re: [PATCH 00/28] Swap over NFS -v16
On Saturday February 23, [EMAIL PROTECTED] wrote: > On Wed, 20 Feb 2008 15:46:10 +0100 Peter Zijlstra <[EMAIL PROTECTED]> wrote: > > > Another posting of the full swap over NFS series. > > Well I looked. There's rather a lot of it and I wouldn't pretend to > understand it. But pretending is fun :-) > > What is the NFS and net people's take on all of this? Well I'm only vaguely an NFS person, barely a net person, sporadically an mm person, but I've had a look and it seems to mostly make sense. We introduce a new "emergency" concept for page allocation. The size of the emergency pool is set by various reservations by different potential users. If the number of free pages is below the "emergency" size, then only users with a "MEMALLOC" flag get to allocate pages. Further, those pages get a "reserve" flag set which propagates into slab/slub so kmalloc/kmemalloc only return memory from those pages to MEMALLOC users. MEMALLOC users are those that set PF_MEMALLOC. A socket can get SOCK_MEMALLOC set which will cause certain pieces of code to temporarily set PF_MEMALLOC while working on that socket. The upshot is that providing any MEMALLOC user reserves an appropriate amount of emergency space, returns the emergency memory promptly, and sets PF_MEMALLOC whenever allocating memory, it's memory allocations should never fail. As memory is requested is small units, but allocated as pages, there needs to be a conversion from small-units to pages. One of the patches does this and appears to err on the side of be over-generous, which is the right thing to do. Memory reservations are organised in a tree. I really don't understand the tree. Is it just to make /proc/reserve_info look more helpful? Certainly all the individual reservations need to be recorded, and the cumulative reservation needs also to be recorded (currently in the root of the tree) but what are all the other levels used for? Reservations are used for all the transient memory that might be used by the network stack. This particularly includes the route cache and skbs for incoming messages. I have no idea if there is anything else that needs to be allowed for. Filesystems can advertise (via address_space_operations) that files may be used as swap file. They then provide swapout/swapin methods which are like writepage/readpage but may behave differently and have a different way to get credentials from a 'struct file'. So in general, the patch set looks to have the right sort of shape. I cannot be very authoritative on the details as there are a lot of them, and they touch code that I'm not very familiar with. Some specific comments on patches: reserve-slub.patch Please avoid irrelevant reformatting in patches. It makes them harder to read. e.g.: -static void setup_object(struct kmem_cache *s, struct page *page, - void *object) +static void setup_object(struct kmem_cache *s, struct page *page, void *object) mm-kmem_estimate_pages.patch This introduces kestimate kestimate_single kmem_estimate_pages The last obviously returns a number of pages. The contrast seems to suggest the others don't. But they do... I don't think the names are very good, but I concede that it is hard to choose good names here. Maybe: kmalloc_estimate_variable kmalloc_estimate_fixed kmem_alloc_estimate ??? mm-reserve.patch I'm confused by __mem_reserve_add. + reserve = mem_reserve_root.pages; + __calc_reserve(res, pages, 0); + reserve = mem_reserve_root.pages - reserve; __calc_reserve will always add 'pages' to mem_reserve_root.pages. So this is a complex way of doing reserve = pages; __calc_reserve(res, pages, 0); And as you can calculate reserve before calling __calc_reserve (which seems odd when stated that way), the whole function looks like it could become: ret = adjust_memalloc_reserve(pages); if (!ret) __calc_reserve(res, pages, limit); return ret; What am I missing? Also, mem_reserve_disconnect really should be a "void" function. Just put a BUG_ON(ret) and don't return anything. Finally, I'll just repeat that the purpose of the tree structure eludes me. net-sk_allocation.patch Why are the "GFP_KERNEL" call sites just changed to "sk->sk_allocation" rather than "sk_allocation(sk, GFP_KERNEL)" ?? I assume there is a good reason, and seeing it in the change log would educate me and make the patch more obviously correct. netvm-reserve.patch Function names again: sk_adjust_memalloc sk_set_memalloc sound similar. Purpose is completely different. mm-page_file_methods.patch This makes page_offset and others more expensive by adding a conditional jump to a function call that is not usually made. Why do swap pages have a different index to everyon
Re: Top 10 kernel oopses for the week ending January 5th, 2008
On Thursday January 10, [EMAIL PROTECTED] wrote: > On Thu, Jan 10, 2008 at 03:13:48PM +1100, Neil Brown wrote: > > > What guarantees that it doesn't happen before we get to callback? AFAICS, > > > nothing whatsoever... > > > > Yes, that's bad isn't it :-) > > > > I think I should be using sysfs_schedule_callback here. That makes the > > required 'get' and 'put' calls but it can fail with -ENOMEM. I > > wonder what I do if -ENOMEM??? Maybe I'll just continue to roll my > > one :-( > > How about this instead (completely untested) > > * split failure exits > * switch to kick_rdev_from_array() > * fold unbind_rdev_from_array() into it (no other callers anymore) > * take export_rdev() into failure case in bind_rdev_to_array() > * in kick_rdev_from_array() do what export_rdev() does sans > kobject_put() and do that before schedule_work(). Take kobject_put() into > delayed_delete(). While there are probably some good ideas in there, I think fixing this particular bug is much simpler. Just take a reference to the object before scheduling the worker, and drop it when the worker has done its work. I have a closer look at the idea of no required export_rdev after a failed bind_rdev_to_array. On the surface it does seem to make the code nicer. Thanks, NeilBrown -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Top 10 kernel oopses for the week ending January 5th, 2008
On Tuesday January 8, [EMAIL PROTECTED] wrote: > > FWIW, I'm going to go through Arjan's collection and post blow-by-blow > analysis of some of those suckers. Tonight, probably... > > Let's take e.g. http://www.kerneloops.org/raw.php?rawid=2618 Thanks for that analysis. ... > > Humm... So we have kobj->parent containing crap. What about the caller? > It's from drivers/md/md.c: > static void delayed_delete(struct work_struct *ws) This is a good argument for sticking "md_" at the from of all my function names, even if they are static. I'm fairly sure I looked at that trace: > Call Trace: > [] kobject_put+0x19/0x20 > [] kobject_del+0x2b/0x40 > [] delayed_delete+0x0/0xb0 > [] delayed_delete+0x69/0xb0 > [] run_workqueue+0x175/0x210 > [] worker_thread+0x71/0xb0 > [] autoremove_wake_function+0x0/0x40 > [] worker_thread+0x0/0xb0 > [] kthread+0x4d/0x80 > [] child_rip+0xa/0x12 > [] restore_args+0x0/0x30 > [] kthread+0x0/0x80 > [] child_rip+0x0/0x12 but as it doesn't mention 'md' or 'nfs' I moved on. My bad. > > What guarantees that it doesn't happen before we get to callback? AFAICS, > nothing whatsoever... Yes, that's bad isn't it :-) I think I should be using sysfs_schedule_callback here. That makes the required 'get' and 'put' calls but it can fail with -ENOMEM. I wonder what I do if -ENOMEM??? Maybe I'll just continue to roll my one :-( Thanks, NeilBrown -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [BUG] New Kernel Bugs
On Tuesday November 13, [EMAIL PROTECTED] wrote: > On Tuesday 13 November 2007 07:08, Mark Lord wrote: > > Ingo Molnar wrote: > > .. > > > > > This is all QA-101 that _cannot be argued against on a rational basis_, > > > it's just that these sorts of things have been largely ignored for > > > years, in favor of the all-too-easy "open source means many eyeballs and > > > that is our QA" answer, which is a _good_ answer but by far not the most > > > intelligent answer! Today "many eyeballs" is simply not good enough and > > > nature (and other OS projects) will route us around if we dont change. > > > > .. > > > > QA-101 and "many eyeballs" are not at all in opposition. > > The latter is how we find out about bugs on uncommon hardware, > > and the former is what we need to track them and overall quality. > > > > A HUGE problem I have with current "efforts", is that once someone > > reports a bug, the onus seems to be 99% on the *reporter* to find > > the exact line of code or commit. Ghad what a repressive method. > > This is the only method that scales. That sounds overly hash, and the rest of you mail sounds much more moderate and sensible - I can only assume you were using hyperbole?? Putting the "onus on the reporter" is simply not going to work unless you have a business relationship. In the community, we are all volunteering our time (well, maybe my employer is volunteering my time to do community support, but the effect is the same). I would hope that the focus of developers is to empower bug reporters to provide further information (and as has been said, "git bisect" is a great empowerer). Some people will be incredibly help, especially if you ask politely and say thankyou. Others won't for any of a number of reasons - and maybe that means their bug won't get fixed. To my eyes, the "only method that scales" is investing effort in encouraging and training bug reporters. Some of that effort might not produce results, but when others among those you have encouraged start answering the newbee questions on the list and save you the time, you get a distinct feeling that it was all worth while. I think we are in agreement - I just wanted to take issue with that one sentence :-) The rest is great. NeilBrown > > Developer has only 24 hours in each day, and sometimes he needs to eat, > sleep, and maybe even pay attention to e.g. his kids. > > But bug reporters are much more numerous and they have more > hours in one day combined. > > BUT - it means that developers should try to increase user base, > not scare users away. > > > And if the "developer" who broke the damn thing, or who at least > > "claims" to be supporting that code, cannot "reproduce" the bug, > > they drop it completely. > > Developer should let reporter know that reporter needs to help > a bit here. Sometimes a bit of hand holding is needed, but it > pays off because you breed more qualified testers/bug reporters. > > > Contrast that flawed approach with how Linus does things.. > > he thinks through the symptoms, matches them to the code, > > and figures out what the few possibilities might be, > > and feeds back some trial balloon patches for the bug reporter to try. > > > > MUCH better. > > > > And remember, *I'm* an old-time Linux kernel developer.. just think about > > the people reporting bugs who haven't been around here since 1992.. > > Yes. Developers should not grow more and more unhelpful > and arrogant towards their users just because inexperienced > users send incomplete/poorly written bug reports. > They need to provide help, not humiliate/ignore. > > I think we agree here. > -- > vda > - > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to [EMAIL PROTECTED] > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [BUG] New Kernel Bugs
On Wednesday November 14, [EMAIL PROTECTED] wrote: > On Wed, Nov 14, 2007 at 09:38:20AM -0800, Randy Dunlap wrote: > > On Wed, 14 Nov 2007 15:08:47 +0100 Ingo Molnar wrote: > > > so please stop this "too busy and too noisy" nonsense already. It was > > > nonsense 10 years ago and it's nonsense today. In 10 years the kernel > > > grew from a 1 million lines codebase to an 8 million lines codebase, so > > > what? Deal with it and be intelligent about filtering your information > > > influx instead of imposing a hard pre-filtering criteria that restricts > > > intelligent processing of information. > > > > So you have a preferred method of handling email. Please don't > > force it on the rest of us. > > I'd be curious for any pointers on tools, actually. I "read" (ok, skim) > lkml but still overlook relevant bug reports occasionally. > (Fortunately, between Trond and Andrew and others forwarding things it's > not actually a problem, but I'm still curious). Virtual Folders. I use VM mode in EMACS, but I believe some other mail readers have the same functionality. I have a virtual folder called "nfs" which shows me all mail in my inbox which has the string 'nfs' or 'lockd' in a To, Cc, or Subject field. When I visit that folder, I see all mail about nfs, whether it was sent to me personally, or to a relevant list, or to lkml. Admittedly if someone doesn't bother to choose a meaningful Subject, then I might miss that. I think this mostly happens when Andrew sends a "-mm" announcement, asked people to change the subject line when following up, and someone follows up without changing the subject line and say "NFS doesn't work any more". I have another virtual folder which matches "md" and "raid" and "mdadm" in any header (so when the people from coraid.com talk about ATA over Ethernet, that gets badly filed, but it is a small cost). Then I have the "bkernel" (boring kernel) folder for all mail from lkml that doesn't mention nfs or raid or md, and isn't from or to me. That folder I skim every week or so and just read the juicy debates and look for interesting tidbits from interesting people - then delete the whole folder, mostly unread. I don't think I could cope with mail without virtual folders. NeilBrown - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] NFS: handle IPv6 addresses in nfs ctl
On Monday October 22, [EMAIL PROTECTED] wrote: > Here is a second missing part of the IPv6 support in NFS server code > concerning knfd syscall interface. > It updates write_getfd and write_getfd to accept IPv6 addresses. Sorry for not replying to this earlier - I saw the Oct 12 post and thought about it but didn't actually reply before getting distracted :-( I think this patch is unnecessary and hence not wanted. The getfs / getfs calls should be considered legacy calls. Adding functionality to them is not appropriate. IPv6 addresses should be handled only via the /proc/net/rpc/auth.unix.ip interface. Thanks, NeilBrown - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch] sunrpc: make closing of old temporary sockets work (was: problems with lockd in 2.6.22.6)
On Wednesday September 12, [EMAIL PROTECTED] wrote: > Hello, > > as already described old temporary sockets (client is gone) of lockd aren't > closed after some time. So, with enough clients and some time gone, there > are 80 open dangling sockets and you start getting messages of the form: > > lockd: too many open TCP sockets, consider increasing the number of nfsd > threads. > > If I understand the code then the intention was that the server closes > temporary sockets after about 6 to 12 minutes: > > a timer is started which calls svc_age_temp_sockets every 6 minutes. > > svc_age_temp_sockets: > if a socket is marked OLD it gets closed. > sockets which are not marked as OLD are marked OLD > > every time the sockets receives something OLD is cleared. > > But svc_age_temp_sockets never closes any socket though because it only > closes sockets with svsk->sk_inuse == 0. This seems to be a bug. > > Here is a patch against 2.6.22.6 which changes the test to > svsk->sk_inuse <= 0 which was probably meant. The patched kernel runs fine > here. Unused sockets get closed (after 6 to 12 minutes) > > Signed-off-by: Wolfgang Walter <[EMAIL PROTECTED]> > > --- ../linux-2.6.22.6/net/sunrpc/svcsock.c2007-08-27 18:10:14.0 > +0200 > +++ net/sunrpc/svcsock.c 2007-09-11 11:07:13.0 +0200 > @@ -1572,7 +1575,7 @@ > > if (!test_and_set_bit(SK_OLD, &svsk->sk_flags)) > continue; > - if (atomic_read(&svsk->sk_inuse) || test_bit(SK_BUSY, > &svsk->sk_flags)) > + if (atomic_read(&svsk->sk_inuse) <= 0 || test_bit(SK_BUSY, > &svsk->sk_flags)) > continue; > atomic_inc(&svsk->sk_inuse); > list_move(le, &to_be_aged); > > > As svc_age_temp_sockets did not do anything before this change may trigger > hidden bugs. > > To be true I don't see why this check > > (atomic_read(&svsk->sk_inuse) <= 0 || test_bit(SK_BUSY, &svsk->sk_flags)) > > is needed at all (it can only be an optimation) as this fields change after > the check. In svc_tcp_accept there is no such check when a temporary socket > is closed. Thanks for looking into this. I think the correct change is to test if (atomic_read(&svsk->sk_inuse) > 1 || test_bit(SK_BUSY, &svsk->sk_flags)) or even if (atomic_read(&svsk->sk_inuse) != 1 || test_bit(SK_BUSY, &svsk->sk_flags)) sk_inuse contains a bias of '1' until SK_DEAD is set. So a normal, active socket will have an inuse count of 1 or more. If it is exactly 1, then either it is SK_DEAD (in which case there is nothing for this code to do), or it has no users, in which case it is appropriate to close the socket if it is old. Note that this test is for "the socket should not be closed", so we test if it is *not* 1, or > 1. The tests are needed because we don't want to close a socket that might be inuse elsewhere. The SK_BUSY bit combined with the sk_inuse count combine to check if the socket is in use at all or not. You change effectively disabled the test, as sk_inuse is never <= 0 (except when SK_DEAD is set). This bug has been present since commit aaf68cfbf2241d24d46583423f6bff5c47e088b3 Author: NeilBrown <[EMAIL PROTECTED]> Date: Thu Feb 8 14:20:30 2007 -0800 (i.e. it is my fault). So it is in 2.6.21 and later and should probably go to .stable for .21 and .22. Bruce: for you :-) --- Correctly close old nfsd/lockd sockets. From: NeilBrown <[EMAIL PROTECTED]> Commit aaf68cfbf2241d24d46583423f6bff5c47e088b3 added a bias to sk_inuse, so this test for an unused socket now fails. So no sockets gets closed because they are old (they might get closed if the client closed them). This bug has existed since 2.6.21-rc1. Thanks to Wolfgang Walter for finding and reporting the bug. Cc: Wolfgang Walter <[EMAIL PROTECTED]> Signed-off-by: Neil Brown <[EMAIL PROTECTED]> ### Diffstat output ./net/sunrpc/svcsock.c |3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff .prev/net/sunrpc/svcsock.c ./net/sunrpc/svcsock.c --- .prev/net/sunrpc/svcsock.c 2007-09-12 16:05:23.0 +0200 +++ ./net/sunrpc/svcsock.c 2007-09-12 16:06:01.0 +0200 @@ -1592,7 +1592,8 @@ svc_age_temp_sockets(unsigned long closu if (!test_and_set_bit(SK_OLD, &svsk->sk_flags)) continue; - if (atomic_read(&svsk->sk_inuse) || test_bit(SK_BUSY, &svsk->sk_flags)) + if (atomic_read(&svsk->sk_inuse) > 1 + || test_bit(SK_BUSY, &svsk->sk_flags)) continue; atomic_inc(&svsk->sk_inuse); list_move(le, &to_be_aged); - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [OOPS] 2.6.23-rc5 in tcp/net/nfsd
On Tuesday September 11, [EMAIL PROTECTED] wrote: > This oops appeared over night on a box running 2.6.23-rc5 (recent with the > tcp_input.c fix). > > I can't find a similar one reported. Okay. this is weird. > > Mark > > > BUG: unable to handle kernel NULL pointer dereference at virtual address > 007e That is the bad address, > EFLAGS: 00010246 (2.6.23-rc5-2-mcyrixiii #1) > EIP is at ip_fragment+0x7f/0x680 > eax: c3c09c00 ebx: ecx: b524d006 edx: 007b ^ It looks like an offset of 3 from edx. I got that from decoding: > Code: <00> ba 03 00 00 00 bf a6 ff ff which is 0: 00 ba 03 00 00 00 add%bh,0x3(%rdx) However that instruction doesn't appear in ip_fragment. The code in ip_fragment reads: 27: b9 04 00 00 00mov$0x4,%ecx ^^ 2c: ba 03 00 00 00mov$0x3,%edx ^^ which contains the bytes of the offending instruction. Note that $0x4 is ICMP_FRAG_NEEDED and $0x3 is ICMP_DEST_UNREACH: these are args to icmp_send. So the latter is the correct disassembly based on the C code. So somehow the kernel is jumping to a bad address. I don't know how that would happening maybe a single bit error in memory or a register??? NeilBrown - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Oops in 2.6.22.1: skb_copy_and_csum_datagram_iovec()
On Wednesday September 5, [EMAIL PROTECTED] wrote: > On Wednesday August 22, [EMAIL PROTECTED] wrote: > > Chuck Ebbert <[EMAIL PROTECTED]> wrote: > > > https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=253290 > > > > > > 18:57:54 osama kernel: [] kernel_recvmsg+0x31/0x40 > > > 18:57:54 osama kernel: [] svc_udp_recvfrom+0x114/0x368 [sunrpc] > > > > svc_udp_recvfrom is calling kernel_recvmsg with iov == NULL. > > iov == NULL used to work. > > I think it stopped working at >commit 759e5d006462d53fb708daa8284b4ad909415da1 > > Previously, as len==0, MSG_TRUNC would get set, so copy_only would get > set, so skb_copy_datagram_iovec would get called, and that handles a > len of 0. > > Now, skb_copy_and_csum_datagram_iovec gets called unless > skb_csum_unnecessary(skb), which now kills us. Actually, the new code is broken for more reasons than that. In core/datagram.c, the comment for skb_copy_and_csum_datagram_iovec, it says: * Caller _must_ check that skb will fit to this iovec. but udp_recvmsg doesn't. It seems to try: if (copied < ulen || UDP_SKB_CB(skb)->partial_cov) { if (udp_lib_checksum_complete(skb)) goto csum_copy_err; } if (skb_csum_unnecessary(skb)) err = skb_copy_datagram_iovec(skb, sizeof(struct udphdr), msg->msg_iov, copied ); so it doesn't call skb_copy_datagram_iovec if "copied < ulen". However earlier there is: ulen = skb->len - sizeof(struct udphdr); copied = len; if (copied > ulen) copied = ulen; so if the 'len' (of the iovec) is too small, we end up with "copied == ulen", so udp_lib_checksum_complete doesn't get called > > We could 'fix' it by making skb_copy_and_csum_datagram_iovec just > return if len==0, or don't call it from udp_recvmsg in that case. > So the latter of these is needed. NeilBrown - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Oops in 2.6.22.1: skb_copy_and_csum_datagram_iovec()
On Wednesday August 22, [EMAIL PROTECTED] wrote: > Chuck Ebbert <[EMAIL PROTECTED]> wrote: > > https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=253290 > > > > 18:57:54 osama kernel: [] kernel_recvmsg+0x31/0x40 > > 18:57:54 osama kernel: [] svc_udp_recvfrom+0x114/0x368 [sunrpc] > > svc_udp_recvfrom is calling kernel_recvmsg with iov == NULL. iov == NULL used to work. I think it stopped working at commit 759e5d006462d53fb708daa8284b4ad909415da1 Previously, as len==0, MSG_TRUNC would get set, so copy_only would get set, so skb_copy_datagram_iovec would get called, and that handles a len of 0. Now, skb_copy_and_csum_datagram_iovec gets called unless skb_csum_unnecessary(skb), which now kills us. We could 'fix' it by making skb_copy_and_csum_datagram_iovec just return if len==0, or don't call it from udp_recvmsg in that case. The reason that I call kernel_recvmsg with iov == NULL is that I want to collect the source/dest addresses of the packet, but I don't want to actually read the data - I want to just get a reference to the skb (to avoid needless copy where possible) later. So I call kernel_recvmsg with MSG_PEEK, and don't bother with an iovec. I guess a could send an empty iovec rather than a NULL, but it used to work NeilBrown - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/1] NFS: change the ip_map cache code to handle IPv6 addresses
On Thursday August 9, [EMAIL PROTECTED] wrote: > Here is a small part of missing pieces of IPv6 support for the server. > It deals with the ip_map caching code part. > > It changes the ip_map structure to be able to store INET6 addresses. > It adds also the changes in address hashing, and mapping to test it with > INET addresses. Thanks - this generally seems to make sense. A few specific comments: > @@ -1558,6 +1558,7 @@ Please use "diff -p" to include the C function name. It makes the patch easier to read. > +for (i = 0; i < ncp->cl_naddr; i++) { > +/* Mapping address */ > +addr6.s6_addr32[0] = 0; > +addr6.s6_addr32[1] = 0; > +addr6.s6_addr32[2] = htonl(0x); > +addr6.s6_addr32[3] = (uint32_t)ncp->cl_addrlist[i].s_addr; > +auth_unix_add_addr(addr6, dom); > +} You do this conversion several times, and each time it is indented strangely... Maybe create an inline (if there isn't one already) to do this. > extern void auth_domain_put(struct auth_domain *item); > -extern int auth_unix_add_addr(struct in_addr addr, struct auth_domain > *dom); > +extern int auth_unix_add_addr(struct in6_addr addr, struct auth_domain > *dom); Your mailer is wrapping long lines. This is bad. > static int ip_map_match(struct cache_head *corig, struct cache_head *cnew) > { > struct ip_map *orig = container_of(corig, struct ip_map, h); > struct ip_map *new = container_of(cnew, struct ip_map, h); > return strcmp(orig->m_class, new->m_class) == 0 > -&& orig->m_addr.s_addr == new->m_addr.s_addr; > +&& memcmp(orig->m_addr.s6_addr, > new->m_addr.s6_addr,sizeof(struct in6_addr)); > } I think you have inverted the sense of the test. The original tests for equality. The new tests for inequality. For memcmp (and strcmp), always use "function(arg1, argc) COMPARE_OP 0" e.g. memcmp(orig, new, size) == 0 or use ipv6_addr_equal. Also please run ./scripts/checkpatch.pl on the patch. You need a space after that comma. > @@ -125,7 +129,7 @@ > struct ip_map *item = container_of(citem, struct ip_map, h); > > strcpy(new->m_class, item->m_class); > -new->m_addr.s_addr = item->m_addr.s_addr; > +memcpy(&new->m_addr.s6_addr, > &item->m_addr.s6_addr,sizeof(item->m_addr.s6_addr)); Ditto. did you test this code? > } > static void update(struct cache_head *cnew, struct cache_head *citem) > { > @@ -151,20 +155,28 @@ > { > char text_addr[20]; > struct ip_map *im = container_of(h, struct ip_map, h); > -__be32 addr = im->m_addr.s_addr; > + > +__be32 addr[4]; > +int i; > +for (i=0;i<4;i++) > +addr[i] = im->m_addr.s6_addr[i]; I think you mean "s6_addr32" ?? > > -snprintf(text_addr, 20, "%u.%u.%u.%u", > - ntohl(addr) >> 24 & 0xff, > - ntohl(addr) >> 16 & 0xff, > - ntohl(addr) >> 8 & 0xff, > - ntohl(addr) >> 0 & 0xff); > +snprintf(text_addr, 20, "%04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x", > +ntohl(addr[3]) >> 16 & 0xff, should be 0x ??? Now I know you didn't test the code :-) > +ntohl(addr[3]) >> 0 & 0xff, > +ntohl(addr[2]) >> 16 & 0xff, > +ntohl(addr[2]) >> 0 & 0xff, > +ntohl(addr[1]) >> 16 & 0xff, > +ntohl(addr[1]) >> 0 & 0xff, > +ntohl(addr[0]) >> 16 & 0xff, > +ntohl(addr[0]) >> 0 & 0xff); This code would read better if you did __be16 addr[8]; for (i = 0; i < 8 ; i++) addr[i] = im->m_addr.s6_addr16[i]; snprintf( ntohs(addr[7]), ntohs(addr[6]), Also, I think that if it is a (converted) IPv4 address, it should be printed as a dotted-quad. > + addr.s6_addr32[0] = htonl((b1<<16)|b2); > + addr.s6_addr32[1] = htonl((b3<<16)|b4); > + addr.s6_addr32[2] = htonl((b5<<16)|b6); > + addr.s6_addr32[3] = htonl((b7<<16)|b8); Assign to addr.s6_addr16[]. > -seq_printf(m, "%s %d.%d.%d.%d %s\n", > +seq_printf(m, "%s %04x.%04x.%04x.%04x.%04x.%04x.%04x.%04x %s\n", > im->m_class, > - ntohl(addr.s_addr) >> 24 & 0xff, > - ntohl(addr.s_addr) >> 16 & 0xff, > - ntohl(addr.s_addr) >> 8 & 0xff, > - ntohl(addr.s_addr) >> 0 & 0xff, > + ntohl(addr.s6_addr32[3]) >> 16 & 0x, > + ntohl(addr.s6_addr32[3]) & 0x, > + ntohl(addr.s6_addr32[2]) >> 16 & 0x, > + ntohl(addr.s6_addr32[2]) & 0x, > + ntohl(addr.s6_addr32[1]) >> 16 & 0x, > + ntohl(addr.s6_addr32[1]) & 0x, > + ntohl(addr.s6_addr32[0]) >> 16 & 0x, > + ntohl(addr.s6_addr32[0]) & 0x, Again, s6_addr16[], and keep the dotted-quad for IPv4. In fact ( borrowing Chuck's suggestion) if (is_ipv4)
Re: [PATCH 1/1] NFS: change the ip_map cache code to handle IPv6 addresses
On Thursday August 9, [EMAIL PROTECTED] wrote: > >> > >> How have you tested the effectiveness of the new hash function? > > > > I have not tested that point but I can easily imagine there are better > > solutions. > > Perhaps we can keep the same function for an IPv4 address (only taking > > the 32 bits of IPv4 addr), and then design one for IPv6 addresses. > > I see that, to generate the hash, you would be xor-ing the FF and 00 > bytes in the canonicalized IPv4 address. Yes, perhaps a better function > is needed, or as you say, one specifically for IPv6 and one for > canonicalized IPv4. > I suspect that the given hash function will be as good as any other. The values in each byte of an IPv6 are likely to be either evenly distributed, or constant. Each the first case you don't need a clever hash function to distribute them, while in the second, no hash function can improve the distribution. NeilBrown - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [NFS] [PATCH 001 of 3] knfsd: Use recv_msg to get peer address for NFSD instead of code-copying
On Monday March 5, [EMAIL PROTECTED] wrote: > On Friday 02 March 2007 05:28, NeilBrown wrote: > > The sunrpc server code needs to know the source and destination address > > for UDP packets so it can reply properly. > > It currently copies code out of the network stack to pick the pieces out > > of the skb. > > This is ugly and causes compile problems with the IPv6 stuff. > > ... and this IPv6 code could never have worked anyway: :-( It's hard to test the IPv6 server until we have an IPv6 client I guess, so thanks for the code review, even though we aren't going to end up using that code... > > But I find using recvmsg just for getting at the addresses > a little awkward too. Do you? It's surely a lot better than code duplication, and it is exactly how you would get the information from user-space. > And I think to be on the safe side, you > should check that you're really looking at a PKTINFO cmsg > rather than something else. Maybe. But is there really a chance that it might not be PKTINFO? And what do you do if it isn't? Log an error and drop the packet I guess. I'll see what I can do. NeilBrown - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [NFS] [PATCH 001 of 3] knfsd: Use recv_msg to get peer address for NFSD instead of code-copying
On Monday March 5, [EMAIL PROTECTED] wrote: > > Hi Neil, > > here's another minor comment: > > On Friday 02 March 2007 05:28, NeilBrown wrote: > > +static inline void svc_udp_get_dest_address(struct svc_rqst *rqstp, > > + struct cmsghdr *cmh) > > { > > switch (rqstp->rq_sock->sk_sk->sk_family) { > > case AF_INET: { > > + struct in_pktinfo *pki = CMSG_DATA(cmh); > > + rqstp->rq_daddr.addr.s_addr = pki->ipi_spec_dst.s_addr; > > break; > > + } > ... > > The daddr that is extracted here will only ever be used to build > another PKTINFO cmsg when sending the reply. So it would be > much easier to just store the raw control message in the svc_rqst, > without looking at its contents, and send it out along with the reply, > unchanged. Yes, sounds tempting, doesn't it? Unfortunately it isn't that simple as I found out when the sunrpc code in glibc did exactly that. You see sendmsg will use the interface-number as well as the source address from the PKTINFO structure. Suppose my server has two interfaces (A and B) on two subnets that both are connected to some router which is connected to a third subnet that my client is on. Further, suppose my server has only one default route, out interface A. The client chooses the IP address of interface B and sends a request. It arrives on interface B and is processed. If the PKTINFO received is passed unchanged to sendmsg, the pack will be sent out interface B. But interfacve B doesn't have a route to that client, so the packet is dropped. This exactly what was happening for me with mountd a few years ago. So yes, we could just zero the interface field, but I think it is clearer to extract that wanted data, then re-insert it. They really are different structures with different meanings (send verse receive) which happen to have the same layout. Thanks, NeilBrown - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 2.6.20-git8 fails compile -- net/built-in.o __ipv6_addr_type
On Monday February 12, [EMAIL PROTECTED] wrote: > Quoting YOSHIFUJIHideaki/=?iso-2022-jp?B?GyRCNUhGIzFRTEAbKEI=?= > > In article <[EMAIL PROTECTED]> (at Mon, 12 Feb 2007 18:12:16 -0800), > Andrew Morton <[EMAIL PROTECTED]> says: > > > > > > On Mon, 12 Feb 2007 20:10:13 -0500 (EST) Pete Clements <[EMAIL > PROTECTED]> wrote: > > > > 2.6.20-git8 fails compile: > > > > > > > > CHK include/linux/compile.h > > > > UPD include/linux/compile.h > > > > CC init/version.o > > > > LD init/built-in.o > > > > LD .tmp_vmlinux1 > > > > net/built-in.o: In function `svc_udp_recvfrom': > > > > svcsock.c:(.text+0x61be4): undefined reference to `__ipv6_addr_type' > > > > make: *** [.tmp_vmlinux1] Error 1 > > > > Hmmm. > CONFIG_IPV6=m combined with > CONFIG_SUNRPC=y is a combination that isn't going to work at the moment Options? - Move __ipv6_addr_type to lib/ ?? Might not work for modules. - Disallow that combination in the .config. .. a bit harsh. - Only include IPv6 support in SUNRPC is a module or IPV6 is built-in A bit of a kludge.. Chuck? Any ideas? NeilBrown - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ANNOUNCE] d80211 based driver for Intel PRO/Wireless 3945ABG
On Friday February 9, [EMAIL PROTECTED] wrote: > > Ok. Now... any questions? > Yes. Does this require a closed user-space helper like the other 3945ABG driver, or is it completely open (maybe excepting firmware)? Thanks, NeilBrown - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
TCP stack sometimes loses ACKs ... or something
I upgraded my notebook from 2.6.16 to 2.6.18 recently and noticed that I couldn't talk to my VOIP device (which has a WEB interface). Watching traffic I see the three-way-handshake working perfectly, and then the first data packet is sent (a partial HTTP request: GET / HTTP/1.1 ) and an ACK comes back from the device. Then the next data packet (remainder of the HTTP request) is sent, but tcpdump never sees the ACK, nor does the TCP stack. So the data gets recent repeatedly. No ack. Ever. With 2.6.16, The ack comes back just fine and the connection proceeds as you would expect. As it was a very reproducible problem I decided to try "git bisect" and found bad: [7b4f4b5ebceab67ce440a61081a69f0265e17c2a] [TCP]: Set default max buffers from memory pool size I double checked as this seemed a fairly unlikely patch to cause the problem, but this definitely is it. The net effect of this patch is to change the last of the three numbers in cat /proc/sys/net/ipv4/tcp_[rw]mem from well below 2^20 to well above. 2^20 seems to be a significant number. I set tcp_wmem to that and the ACK was lost. I set it to one less and the first ACK (at least) was accepted. I ended up setting both r and w to 10 and everything is fine. Exploring more deeply, and comparing: - a failing connection (to VIOP box, [rw]mem large) - a working connection to VOIP box ([rw]mem small) - a working connection to another machine ([rw]mem irrelevant). I find: The VIOP returns MSS=1360 in the SYN/ACK packet. Other machine returns MSS=1460 The ack that is getting lost contains data as well as the ACK. i.e. the same packet that ACKs at the TCP level includes the HTTP level reply. The matching ACK from the other machine (some Linux 2.6.8 I think) is a data-less ACK followed very quickly by the HTTP reply in a separate packet. The 'Timestamps' option coming back from the VOIP box is a little odd. The Timestamp in the SYN/ACK is the same as the timestamp in the next ACK (the ack for the first partial HTTP request). The Timestamp in the next packet which is the one that gets lost has exactly the same TSval as previous packets, and TSecr is one more than in the previous packet. I assume that one (or more) of these differences combined with the large tcp_[rw]mem value cause the packet loss, but I have no idea which. Help? I can make the tcp traces available if needed, but these are really the only non-trivial differences. I'm willing to test patches. NeilBrown - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH 2/9] deadlock prevention core
On Thursday August 17, [EMAIL PROTECTED] wrote: > Andrew Morton wrote: > > Daniel Phillips <[EMAIL PROTECTED]> wrote: > >>What happened to the case where we just fill memory full of dirty file > >>pages backed by a remote disk? > > > > Processes which are dirtying those pages throttle at > > /proc/sys/vm/dirty_ratio% of memory dirty. So it is not possible to "fill" > > memory with dirty pages. If the amount of physical memory which is dirty > > exceeds 40%: bug. > > Hi Andrew, > > So we make 400 MB of a 1 GB system unavailable for write caching just to > get around the network receive starvation issue? No. We make it unavailable for write caching so it is available for other important things like cached clean pages and executables etc. You have to start throttling some where or you get very bad behaviour. 40% seems a good number, but it is tunable. The fact that it helps with receive starvation is just s bonus. > > What happens if some in kernel user grabs 68% of kernel memory to do some > very important thing, does this starvation avoidance scheme still work? That is a very good question. Is memlocked memory throttled against dirty pages, and does it decrease the space available to the 40% calculation? I don't know. I guess I could look at the code... get_dirty_limits in page_writeback caps 'dirty_ratio' (40 by default, hence the 40%) at unmapped_ratio/2. So yes. If 68% is mapped and locked (I assume that it the situation you are referring to) that leaves 32% unlocked, so the 40% above is reduced to 16% and you should still have 160Meg of breathing space. NeilBrown - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH 2/9] deadlock prevention core
On Monday August 14, [EMAIL PROTECTED] wrote: > On Sun, 2006-08-13 at 22:22 -0700, Andrew Morton wrote: > > > > We could track dirty anonymous memory and throttle. > > > > Also, there must be some value of /proc/sys/vm/min_free_kbytes at which a > > machine is no longer deadlockable with any of these tricks. Do we know > > what level that is? > > Not sure, the theoretical max amount of memory one can 'lose' in socket > wait queues is well over the amount of physical memory we have in > machines today (even for SGI); this combined with the fact that we limit > the memory in some way to avoid DoS attacks, could make for all memory > to be stuck in wait queues. Of course this becomes rather more unlikely > for ever larger amounts of memory. But unlikely is never a guarantee. What is the minimum amount of memory we need to reserve for each socket? 1K? 1 page? Call it X Suppose that whenever a socket is created (or bound or connected or whatever is right) we first allocate that much to a recv pool. If any socket has less than X queued, then it is allowed to allocate up to a total of X from the reserve pool. After that it can only receive when memory can be allocated from elsewhere. Then we will never block on recv. Note that X doesn't need to be the biggest possible incoming message. It only needs to be enough to get an 'ack' over any possible network storage protocol with any possible layering. I suspect that it well within one page. Would it be too much waste to reserve one page for every idle socket? Does this have some fatal flaw? NeilBrown - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [2.6 patch] net/sunrpc/xdr.c: remove xdr_decode_string()
On Wednesday November 23, [EMAIL PROTECTED] wrote: > On Wed, Nov 23, 2005 at 04:31:14AM -0800, Lever, Charles wrote: > > so i don't remember why you are removing xdr_decode_string. are we sure > > that no-one will need this functionality in the future? it is harmless > > to remove today, but i wonder if someone is just going to add it back > > sometime. > > It's unused and you said: > the only harmless change i see below is removing xdr_decode_string(). > As 'xdr_decode_string' (sometimes) modifies the buffer that it is decoding, I don't think it's usage should be encouraged. If it is no longer in use, then I fully support and encourage removing it. NeilBrown - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html