Re: [PATCH] af_unix: fix garbage collect vs. MSG_PEEK

2016-12-19 Thread Miklos Szeredi
On Tue, Oct 4, 2016 at 3:51 AM, David Miller <da...@davemloft.net> wrote:
> From: Miklos Szeredi <mszer...@redhat.com>
> Date: Thu, 29 Sep 2016 14:09:14 +0200
>
>> @@ -1550,6 +1550,17 @@ static int unix_attach_fds(struct scm_cookie *scm, struct sk_buff *skb)
>>   return max_level;
>>  }
>>
>> +static void unix_peek_fds(struct scm_cookie *scm, struct sk_buff *skb)
>> +{
>> + scm->fp = scm_fp_dup(UNIXCB(skb).fp);
>> + /*
>> +  * During garbage collection it is assumed that in-flight sockets don't
>> +  * get a new external reference.  So we need to wait until current run
>> +  * finishes.
>> +  */
>> + unix_gc_barrier();
>> +}
>  ...
>> @@ -266,6 +266,11 @@ void wait_for_unix_gc(void)
>>   wait_event(unix_gc_wait, gc_in_progress == false);
>>  }
>>
>> +void unix_gc_barrier(void)
>> +{
>> + spin_unlock_wait(&unix_gc_lock);
>> +}
>
> Can you explain why wait_for_unix_gc() isn't appropriate?  I'm a little
> bit uncomfortable with a spinlock wait like this, and would rather see
> something like the existing helper used.

Missed this mail, sorry for the late reply...

AFAICS wait_for_unix_gc() lacks a barrier that the spin lock provides.

The ordering needs to be strictly:

A: duplicate file pointers
B: garbage collect

Also wait_for_unix_gc() is overkill, since the complete garbage
collect (including purging the trash) takes longer than the
collection stage, which is the only part we need ordering with.  This
probably doesn't matter much in practice.

Thanks,
Miklos


[PATCH] af_unix: fix garbage collect vs. MSG_PEEK

2016-09-29 Thread Miklos Szeredi
Gc assumes that in-flight sockets that don't have an external ref can't
gain one while unix_gc_lock is held.  That is true because
unix_notinflight() will be called before detaching fds, which takes
unix_gc_lock.

Only MSG_PEEK was somehow overlooked.  That one also clones the fds, also
keeping them in the skb.  But through MSG_PEEK an external reference can
definitely be gained without ever touching unix_gc_lock.

This patch adds unix_gc_barrier() that waits for a garbage collect run to
finish (if there is one), before actually installing the peeked in-flight
files to file descriptors.  This prevents problems from a pure in-flight
socket having its buffers modified while the garbage collect is taking
place.

Signed-off-by: Miklos Szeredi <mszer...@redhat.com>
Cc: <sta...@vger.kernel.org>
---
 include/net/af_unix.h |  1 +
 net/unix/af_unix.c    | 15 +++++++++++++--
 net/unix/garbage.c    |  5 +++++
 3 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index fd60eccb59a6..f73ce09da2b4 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -10,6 +10,7 @@ void unix_inflight(struct user_struct *user, struct file *fp);
 void unix_notinflight(struct user_struct *user, struct file *fp);
 void unix_gc(void);
 void wait_for_unix_gc(void);
+void unix_gc_barrier(void);
 struct sock *unix_get_socket(struct file *filp);
 struct sock *unix_peer_get(struct sock *);
 
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 8309687a56b0..cece344b41b8 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -1550,6 +1550,17 @@ static int unix_attach_fds(struct scm_cookie *scm, struct sk_buff *skb)
return max_level;
 }
 
+static void unix_peek_fds(struct scm_cookie *scm, struct sk_buff *skb)
+{
+   scm->fp = scm_fp_dup(UNIXCB(skb).fp);
+   /*
+* During garbage collection it is assumed that in-flight sockets don't
+* get a new external reference.  So we need to wait until current run
+* finishes.
+*/
+   unix_gc_barrier();
+}
+
 static int unix_scm_to_skb(struct scm_cookie *scm, struct sk_buff *skb, bool send_fds)
 {
int err = 0;
@@ -2182,7 +2193,7 @@ static int unix_dgram_recvmsg(struct socket *sock, struct msghdr *msg,
sk_peek_offset_fwd(sk, size);
 
if (UNIXCB(skb).fp)
-   scm.fp = scm_fp_dup(UNIXCB(skb).fp);
+   unix_peek_fds(&scm, skb);
}
err = (flags & MSG_TRUNC) ? skb->len - skip : size;
 
@@ -2422,7 +2433,7 @@ unlock:
/* It is questionable, see note in unix_dgram_recvmsg.
 */
if (UNIXCB(skb).fp)
-   scm.fp = scm_fp_dup(UNIXCB(skb).fp);
+   unix_peek_fds(&scm, skb);
 
sk_peek_offset_fwd(sk, chunk);
 
diff --git a/net/unix/garbage.c b/net/unix/garbage.c
index 6a0d48525fcf..d4fb6ff8024d 100644
--- a/net/unix/garbage.c
+++ b/net/unix/garbage.c
@@ -266,6 +266,11 @@ void wait_for_unix_gc(void)
wait_event(unix_gc_wait, gc_in_progress == false);
 }
 
+void unix_gc_barrier(void)
+{
+   spin_unlock_wait(&unix_gc_lock);
+}
+
 /* The external entry point: unix_gc() */
 void unix_gc(void)
 {
-- 
2.5.5



Re: kernel BUG at net/unix/garbage.c:149!"

2016-08-30 Thread Miklos Szeredi
On Tue, Aug 30, 2016 at 11:31 AM, Nikolay Borisov <ker...@kyup.com> wrote:
>
>
> On 08/30/2016 12:18 PM, Miklos Szeredi wrote:
>> On Tue, Aug 30, 2016 at 12:37 AM, Miklos Szeredi <mszer...@redhat.com> wrote:
>>> On Sat, Aug 27, 2016 at 11:55 AM, Miklos Szeredi <mszer...@redhat.com> 
>>> wrote:
>>
>>> crash> list -H gc_inflight_list unix_sock.link -s unix_sock.inflight |
>>> grep counter | cut -d= -f2 | awk '{s+=$1} END {print s}'
>>> 130
>>> crash> p unix_tot_inflight
>>> unix_tot_inflight = $2 = 135
>>>
>>> We've lost track of a total of five inflight sockets, so it's not a
>>> one-off thing.  Really weird...  Now off to sleep, maybe I'll dream of
>>> the solution.
>>
>> Okay, found one bug: gc assumes that in-flight sockets that don't have
>> an external ref can't gain one while unix_gc_lock is held.  That is
>> true because unix_notinflight() will be called before detaching fds,
>> which takes unix_gc_lock.  Only MSG_PEEK was somehow overlooked.  That
>> one also clones the fds, also keeping them in the skb.  But through
>> MSG_PEEK an external reference can definitely be gained without ever
>> touching unix_gc_lock.
>>
>> Not sure whether the reported bug can be explained by this.  Can you
>> confirm the MSG_PEEK was used in the setup?
>>
>> Does someone want to write a stress test for SCM_RIGHTS + MSG_PEEK?
>>
>> Anyway, attaching a fix that works by acquiring unix_gc_lock in case
>> of MSG_PEEK also.  It is trivially correct, but I haven't tested it.
>
> I have no way of being 100% sure, but looking through nginx's source code
> it seems they do utilize MSG_PEEK on several occasions. This issue has
> apparently been very hard to reproduce, since I have 100s of servers
> running a lot of NGINX processes and it has been triggered only once.
>
> On a different note - if I inspect a live node without this patch should
> the discrepancy between the gc_inflight_list and the unix_tot_inflight
> be present VS with this patch applied?

May well be, since in the vmcore 4 in-flight sockets were "lost"
before triggering the bug.  I guess the best way to check is with a
systemtap script that walks the list with the gc lock.

Thanks,
Miklos


Re: kernel BUG at net/unix/garbage.c:149!"

2016-08-30 Thread Miklos Szeredi
On Tue, Aug 30, 2016 at 12:37 AM, Miklos Szeredi <mszer...@redhat.com> wrote:
> On Sat, Aug 27, 2016 at 11:55 AM, Miklos Szeredi <mszer...@redhat.com> wrote:

> crash> list -H gc_inflight_list unix_sock.link -s unix_sock.inflight |
> grep counter | cut -d= -f2 | awk '{s+=$1} END {print s}'
> 130
> crash> p unix_tot_inflight
> unix_tot_inflight = $2 = 135
>
> We've lost track of a total of five inflight sockets, so it's not a
> one-off thing.  Really weird...  Now off to sleep, maybe I'll dream of
> the solution.

Okay, found one bug: gc assumes that in-flight sockets that don't have
an external ref can't gain one while unix_gc_lock is held.  That is
true because unix_notinflight() will be called before detaching fds,
which takes unix_gc_lock.  Only MSG_PEEK was somehow overlooked.  That
one also clones the fds, also keeping them in the skb.  But through
MSG_PEEK an external reference can definitely be gained without ever
touching unix_gc_lock.

Not sure whether the reported bug can be explained by this.  Can you
confirm the MSG_PEEK was used in the setup?

Does someone want to write a stress test for SCM_RIGHTS + MSG_PEEK?

Anyway, attaching a fix that works by acquiring unix_gc_lock in case
of MSG_PEEK also.  It is trivially correct, but I haven't tested it.

Thanks,
Miklos
From: Miklos Szeredi <mszer...@redhat.com>
Subject: af_unix: fix garbage collect vs. MSG_PEEK

Gc assumes that in-flight sockets that don't have an external ref can't
gain one while unix_gc_lock is held.  That is true because
unix_notinflight() will be called before detaching fds, which takes
unix_gc_lock.

Only MSG_PEEK was somehow overlooked.  That one also clones the fds, also
keeping them in the skb.  But through MSG_PEEK an external reference can
definitely be gained without ever touching unix_gc_lock.

Signed-off-by: Miklos Szeredi <mszer...@redhat.com>
Cc: <sta...@vger.kernel.org>
---
 include/net/af_unix.h |    1 +
 net/unix/af_unix.c    |   15 +++++++++++++--
 net/unix/garbage.c    |    6 ++++++
 3 files changed, 20 insertions(+), 2 deletions(-)

--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -10,6 +10,7 @@ void unix_inflight(struct user_struct *u
 void unix_notinflight(struct user_struct *user, struct file *fp);
 void unix_gc(void);
 void wait_for_unix_gc(void);
+void unix_gc_barrier(void);
 struct sock *unix_get_socket(struct file *filp);
 struct sock *unix_peer_get(struct sock *);
 
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -1563,6 +1563,17 @@ static int unix_attach_fds(struct scm_co
 	return max_level;
 }
 
+static void unix_peek_fds(struct scm_cookie *scm, struct sk_buff *skb)
+{
+	scm->fp = scm_fp_dup(UNIXCB(skb).fp);
+	/*
+	 * During garbage collection it is assumed that in-flight sockets don't
+	 * get a new external reference.  So we need to wait until current run
+	 * finishes.
+	 */
+	unix_gc_barrier();
+}
+
 static int unix_scm_to_skb(struct scm_cookie *scm, struct sk_buff *skb, bool send_fds)
 {
 	int err = 0;
@@ -2195,7 +2206,7 @@ static int unix_dgram_recvmsg(struct soc
 		sk_peek_offset_fwd(sk, size);
 
 		if (UNIXCB(skb).fp)
-			scm.fp = scm_fp_dup(UNIXCB(skb).fp);
+			unix_peek_fds(&scm, skb);
 	}
 	err = (flags & MSG_TRUNC) ? skb->len - skip : size;
 
@@ -2435,7 +2446,7 @@ static int unix_stream_read_generic(stru
 			/* It is questionable, see note in unix_dgram_recvmsg.
 			 */
 			if (UNIXCB(skb).fp)
-				scm.fp = scm_fp_dup(UNIXCB(skb).fp);
+				unix_peek_fds(&scm, skb);
 
 			sk_peek_offset_fwd(sk, chunk);
 
--- a/net/unix/garbage.c
+++ b/net/unix/garbage.c
@@ -266,6 +266,12 @@ void wait_for_unix_gc(void)
 	wait_event(unix_gc_wait, gc_in_progress == false);
 }
 
+void unix_gc_barrier(void)
+{
+	spin_lock(&unix_gc_lock);
+	spin_unlock(&unix_gc_lock);
+}
+
 /* The external entry point: unix_gc() */
 void unix_gc(void)
 {


Re: kernel BUG at net/unix/garbage.c:149!"

2016-08-29 Thread Miklos Szeredi
On Sat, Aug 27, 2016 at 11:55 AM, Miklos Szeredi <mszer...@redhat.com> wrote:
>
> On Wed, Aug 24, 2016 at 11:40 PM, Hannes Frederic Sowa
> <han...@stressinduktion.org> wrote:
> > On 24.08.2016 16:24, Nikolay Borisov wrote:
> >> Hello,
> >>
> >> I hit the following BUG:
> >>
> >> [1851513.239831] [ cut here ]
> >> [1851513.240079] kernel BUG at net/unix/garbage.c:149!
> >> [1851513.240313] invalid opcode:  [#1] SMP
> >> [1851513.248320] CPU: 37 PID: 11683 Comm: nginx Tainted: G   O
> >> 4.4.14-clouder3 #26
> >> [1851513.248719] Hardware name: Supermicro X10DRi/X10DRi, BIOS 1.1 
> >> 04/14/2015
> >> [1851513.248966] task: 883b0f6f ti: 880189cf task.ti: 
> >> 880189cf
> >> [1851513.249361] RIP: 0010:[]  [] 
> >> unix_notinflight+0x8d/0x90
> >> [1851513.249846] RSP: 0018:880189cf3cf8  EFLAGS: 00010246
> >> [1851513.250082] RAX: 883b05491968 RBX: 883b05491680 RCX: 
> >> 8807f9967330
> >> [1851513.250476] RDX: 0001 RSI: 882e6d8bae00 RDI: 
> >> 82073f10
> >> [1851513.250886] RBP: 880189cf3d08 R08: 880cbc70e200 R09: 
> >> 00018021
> >> [1851513.251280] R10: 883fff3b9dc0 R11: ea0032f1c380 R12: 
> >> 883fbaf5
> >> [1851513.251674] R13: 815f6354 R14: 881a7c77b140 R15: 
> >> 881a7c7792c0
> >> [1851513.252083] FS:  7f4f19573720() GS:883fff3a() 
> >> knlGS:
> >> [1851513.252481] CS:  0010 DS:  ES:  CR0: 80050033
> >> [1851513.252717] CR2: 013062d8 CR3: 001712f32000 CR4: 
> >> 001406e0
> >> [1851513.253116] Stack:
> >> [1851513.253345]   880189cf3d40 880189cf3d28 
> >> 815f4383
> >> [1851513.254022]  8839ee11a800 8839ee11a800 880189cf3d60 
> >> 815f53b8
> >> [1851513.254685]   883406788de0  
> >> 
> >> [1851513.255360] Call Trace:
> >> [1851513.255594]  [] unix_detach_fds.isra.19+0x43/0x50
> >> [1851513.255851]  [] unix_destruct_scm+0x48/0x80
> >> [1851513.256090]  [] skb_release_head_state+0x4f/0xb0
> >> [1851513.256328]  [] skb_release_all+0x12/0x30
> >> [1851513.256564]  [] kfree_skb+0x32/0xa0
> >> [1851513.256810]  [] unix_release_sock+0x1e4/0x2c0
> >> [1851513.257046]  [] unix_release+0x20/0x30
> >> [1851513.257284]  [] sock_release+0x1f/0x80
> >> [1851513.257521]  [] sock_close+0x12/0x20
> >> [1851513.257769]  [] __fput+0xea/0x1f0
> >> [1851513.258005]  [] fput+0xe/0x10
> >> [1851513.258244]  [] task_work_run+0x7f/0xb0
> >> [1851513.258488]  [] exit_to_usermode_loop+0xc0/0xd0
> >> [1851513.258728]  [] syscall_return_slowpath+0x80/0xf0
> >> [1851513.258983]  [] int_ret_from_sys_call+0x25/0x9f
> >> [1851513.259222] Code: 7e 5b 41 5c 5d c3 48 8b 8b e8 02 00 00 48 8b 93 f0 
> >> 02 00 00 48 89 51 08 48 89 0a 48 89 83 e8 02 00 00 48 89 83 f0 02 00 00 eb 
> >> b8 <0f> 0b 90 0f 1f 44 00 00 55 48 c7 c7 10 3f 07 82 48 89 e5 41 54
> >> [1851513.268473] RIP  [] unix_notinflight+0x8d/0x90
> >> [1851513.268793]  RSP 
> >>
> >> That's essentially BUG_ON(list_empty(&u->link));
> >>
> >> I see that all the code involving the ->link member hasn't really been
> >> touched since it was introduced in 2007. So this must be a latent bug.
> >> This is the first time I've observed it. The state
> >> of the struct unix_sock can be found here http://sprunge.us/WCMW . 
> >> Evidently,
> >> there are no inflight sockets.
>
> Weird.  If it was the BUG_ON(!list_empty(&u->link)) I'd understand,
> because the code in scan_children() looks fishy: what prevents "embryos"
> from fledging to full socket status and going in-flight?
>
> But this one, I cannot imagine any scenario.
>
> Can we have access to the crashdump?

crash> list -H gc_inflight_list unix_sock.link -s unix_sock.inflight |
grep counter | cut -d= -f2 | awk '{s+=$1} END {print s}'
130
crash> p unix_tot_inflight
unix_tot_inflight = $2 = 135

We've lost track of a total of five inflight sockets, so it's not a
one-off thing.  Really weird...  Now off to sleep, maybe I'll dream of
the solution.

Btw. I notice that this is a patched kernel.  Nothing in there that
could be relevant to this bug?

Thanks,
Miklos


Re: kernel BUG at net/unix/garbage.c:149!"

2016-08-27 Thread Miklos Szeredi
On Wed, Aug 24, 2016 at 11:40 PM, Hannes Frederic Sowa
<han...@stressinduktion.org> wrote:
> On 24.08.2016 16:24, Nikolay Borisov wrote:
>> Hello,
>>
>> I hit the following BUG:
>>
>> [1851513.239831] [ cut here ]
>> [1851513.240079] kernel BUG at net/unix/garbage.c:149!
>> [1851513.240313] invalid opcode:  [#1] SMP
>> [1851513.248320] CPU: 37 PID: 11683 Comm: nginx Tainted: G   O
>> 4.4.14-clouder3 #26
>> [1851513.248719] Hardware name: Supermicro X10DRi/X10DRi, BIOS 1.1 04/14/2015
>> [1851513.248966] task: 883b0f6f ti: 880189cf task.ti: 
>> 880189cf
>> [1851513.249361] RIP: 0010:[]  [] 
>> unix_notinflight+0x8d/0x90
>> [1851513.249846] RSP: 0018:880189cf3cf8  EFLAGS: 00010246
>> [1851513.250082] RAX: 883b05491968 RBX: 883b05491680 RCX: 
>> 8807f9967330
>> [1851513.250476] RDX: 0001 RSI: 882e6d8bae00 RDI: 
>> 82073f10
>> [1851513.250886] RBP: 880189cf3d08 R08: 880cbc70e200 R09: 
>> 00018021
>> [1851513.251280] R10: 883fff3b9dc0 R11: ea0032f1c380 R12: 
>> 883fbaf5
>> [1851513.251674] R13: 815f6354 R14: 881a7c77b140 R15: 
>> 881a7c7792c0
>> [1851513.252083] FS:  7f4f19573720() GS:883fff3a() 
>> knlGS:
>> [1851513.252481] CS:  0010 DS:  ES:  CR0: 80050033
>> [1851513.252717] CR2: 013062d8 CR3: 001712f32000 CR4: 
>> 001406e0
>> [1851513.253116] Stack:
>> [1851513.253345]   880189cf3d40 880189cf3d28 
>> 815f4383
>> [1851513.254022]  8839ee11a800 8839ee11a800 880189cf3d60 
>> 815f53b8
>> [1851513.254685]   883406788de0  
>> 
>> [1851513.255360] Call Trace:
>> [1851513.255594]  [] unix_detach_fds.isra.19+0x43/0x50
>> [1851513.255851]  [] unix_destruct_scm+0x48/0x80
>> [1851513.256090]  [] skb_release_head_state+0x4f/0xb0
>> [1851513.256328]  [] skb_release_all+0x12/0x30
>> [1851513.256564]  [] kfree_skb+0x32/0xa0
>> [1851513.256810]  [] unix_release_sock+0x1e4/0x2c0
>> [1851513.257046]  [] unix_release+0x20/0x30
>> [1851513.257284]  [] sock_release+0x1f/0x80
>> [1851513.257521]  [] sock_close+0x12/0x20
>> [1851513.257769]  [] __fput+0xea/0x1f0
>> [1851513.258005]  [] fput+0xe/0x10
>> [1851513.258244]  [] task_work_run+0x7f/0xb0
>> [1851513.258488]  [] exit_to_usermode_loop+0xc0/0xd0
>> [1851513.258728]  [] syscall_return_slowpath+0x80/0xf0
>> [1851513.258983]  [] int_ret_from_sys_call+0x25/0x9f
>> [1851513.259222] Code: 7e 5b 41 5c 5d c3 48 8b 8b e8 02 00 00 48 8b 93 f0 02 
>> 00 00 48 89 51 08 48 89 0a 48 89 83 e8 02 00 00 48 89 83 f0 02 00 00 eb b8 
>> <0f> 0b 90 0f 1f 44 00 00 55 48 c7 c7 10 3f 07 82 48 89 e5 41 54
>> [1851513.268473] RIP  [] unix_notinflight+0x8d/0x90
>> [1851513.268793]  RSP 
>>
>> That's essentially BUG_ON(list_empty(&u->link));
>>
>> I see that all the code involving the ->link member hasn't really been
>> touched since it was introduced in 2007. So this must be a latent bug.
>> This is the first time I've observed it. The state
>> of the struct unix_sock can be found here http://sprunge.us/WCMW . Evidently,
>> there are no inflight sockets.

Weird.  If it was the BUG_ON(!list_empty(&u->link)) I'd understand,
because the code in scan_children() looks fishy: what prevents "embryos"
from fledging to full socket status and going in-flight?

But this one, I cannot imagine any scenario.

Can we have access to the crashdump?

Thanks,
Miklos


Re: [PATCH 22/28] mm: add support for non block device backed swap files

2008-02-26 Thread Miklos Szeredi
Starting review in the middle, because this is the part I'm most
familiar with.

 New address_space_operations methods are added:
   int swapfile(struct address_space *, int);

Separate ->swapon() and ->swapoff() methods would be so much cleaner IMO.

Also is there a reason why 'struct file *' cannot be supplied to these
functions?

[snip]

 +int swap_set_page_dirty(struct page *page)
 +{
 +	struct swap_info_struct *sis = page_swap_info(page);
 +
 +	if (sis->flags & SWP_FILE) {
 +		const struct address_space_operations *a_ops =
 +			sis->swap_file->f_mapping->a_ops;
 +		int (*spd)(struct page *) = a_ops->set_page_dirty;
 +#ifdef CONFIG_BLOCK
 +		if (!spd)
 +			spd = __set_page_dirty_buffers;
 +#endif

This ifdef is not really needed.  Just require ->set_page_dirty() be
filled in by filesystems which want swapfiles (and others too, in the
longer term, the fallback is just historical crud).

Here's an incremental patch addressing these issues and beautifying
the new code.

Signed-off-by: Miklos Szeredi [EMAIL PROTECTED]

Index: linux/mm/page_io.c
===
--- linux.orig/mm/page_io.c 2008-02-26 11:15:58.0 +0100
+++ linux/mm/page_io.c  2008-02-26 13:40:55.0 +0100
@@ -106,8 +106,10 @@ int swap_writepage(struct page *page, st
}
 
	if (sis->flags & SWP_FILE) {
-		ret = sis->swap_file->f_mapping->
-			a_ops->swap_out(sis->swap_file, page, wbc);
+		struct file *swap_file = sis->swap_file;
+		struct address_space *mapping = swap_file->f_mapping;
+
+		ret = mapping->a_ops->swap_out(swap_file, page, wbc);
if (!ret)
count_vm_event(PSWPOUT);
return ret;
@@ -136,12 +138,13 @@ void swap_sync_page(struct page *page)
struct swap_info_struct *sis = page_swap_info(page);
 
	if (sis->flags & SWP_FILE) {
-		const struct address_space_operations *a_ops =
-			sis->swap_file->f_mapping->a_ops;
-		if (a_ops->sync_page)
-			a_ops->sync_page(page);
-	} else
+		struct address_space *mapping = sis->swap_file->f_mapping;
+
+		if (mapping->a_ops->sync_page)
+			mapping->a_ops->sync_page(page);
+   } else {
block_sync_page(page);
+   }
 }
 
 int swap_set_page_dirty(struct page *page)
@@ -149,17 +152,12 @@ int swap_set_page_dirty(struct page *pag
struct swap_info_struct *sis = page_swap_info(page);
 
	if (sis->flags & SWP_FILE) {
-		const struct address_space_operations *a_ops =
-			sis->swap_file->f_mapping->a_ops;
-		int (*spd)(struct page *) = a_ops->set_page_dirty;
-#ifdef CONFIG_BLOCK
-		if (!spd)
-			spd = __set_page_dirty_buffers;
-#endif
-		return (*spd)(page);
-	}
+		struct address_space *mapping = sis->swap_file->f_mapping;

-	return __set_page_dirty_nobuffers(page);
+		return mapping->a_ops->set_page_dirty(page);
+   } else {
+   return __set_page_dirty_nobuffers(page);
+   }
 }
 
 int swap_readpage(struct file *file, struct page *page)
@@ -172,8 +170,10 @@ int swap_readpage(struct file *file, str
BUG_ON(PageUptodate(page));
 
	if (sis->flags & SWP_FILE) {
-		ret = sis->swap_file->f_mapping->
-			a_ops->swap_in(sis->swap_file, page);
+		struct file *swap_file = sis->swap_file;
+		struct address_space *mapping = swap_file->f_mapping;
+
+		ret = mapping->a_ops->swap_in(swap_file, page);
if (!ret)
count_vm_event(PSWPIN);
return ret;
Index: linux/include/linux/fs.h
===
--- linux.orig/include/linux/fs.h   2008-02-26 11:15:58.0 +0100
+++ linux/include/linux/fs.h2008-02-26 13:29:40.0 +0100
@@ -485,7 +485,8 @@ struct address_space_operations {
/*
 * swapfile support
 */
-   int (*swapfile)(struct address_space *, int);
+   int (*swapon)(struct file *file);
+   int (*swapoff)(struct file *file);
int (*swap_out)(struct file *file, struct page *page,
struct writeback_control *wbc);
int (*swap_in)(struct file *file, struct page *page);
Index: linux/mm/swapfile.c
===
--- linux.orig/mm/swapfile.c2008-02-26 12:43:57.0 +0100
+++ linux/mm/swapfile.c 2008-02-26 13:34:57.0 +0100
@@ -1014,9 +1014,11 @@ static void destroy_swap_extents(struct 
}
 
	if (sis->flags & SWP_FILE) {
+		struct file *swap_file = sis->swap_file;
+		struct address_space *mapping = swap_file->f_mapping;

Re: [PATCH 00/28] Swap over NFS -v16

2008-02-26 Thread Miklos Szeredi
  mm-page_file_methods.patch
  
  This makes page_offset and others more expensive by adding a
  conditional jump to a function call that is not usually made.
  
  Why do swap pages have a different index to everyone else?
 
 Because the page->index of an anonymous page is related to its (anon)vma
 so that it satisfies the constraints for vm_normal_page().
 
 The index in the swap file is totally unrelated and quite random. Hence
 the swap-cache uses page->private to store it in.

Yeah, and putting the condition into page_offset() will confuse code
which uses it for finding the offset in the VMA or in a tmpfs file.

So why not just have a separate page_swap_offset() function, used
exclusively by swap_in/out()?

Miklos
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/28] Swap over NFS -v16

2008-02-26 Thread Miklos Szeredi
mm-page_file_methods.patch

This makes page_offset and others more expensive by adding a
conditional jump to a function call that is not usually made.

Why do swap pages have a different index to everyone else?
   
   Because the page->index of an anonymous page is related to its (anon)vma
   so that it satisfies the constraints for vm_normal_page().
   
   The index in the swap file is totally unrelated and quite random. Hence
   the swap-cache uses page->private to store it in.
  
  Yeah, and putting the condition into page_offset() will confuse code
  which uses it for finding the offset in the VMA or in a tmpfs file.
  
  So why not just have a separate page_swap_offset() function, used
  exclusively by swap_in/out()?
 
 Ah, we can do the page_file_offset() to match page_file_index() and
 page_file_mapping(). And convert NFS to use page_file_offset() where
 appropriate, as I already did for these others.
 
 That would sort out the mess, right?

Yes, that sounds perfect.

Miklos


Re: [PATCH resend] rewrite AF_UNIX garbage collector, fixes race

2007-07-12 Thread Miklos Szeredi
 This looks good, patch applied and I'll get this into the
 2.6.23 merge window.
 
 Once this sits around for a while and we feel super-confident
 with it we can consider a backport into -stable.

OK.  Thanks for the review.

Miklos


Re: [PATCH resend] rewrite AF_UNIX garbage collector, fixes race

2007-07-11 Thread Miklos Szeredi
  But myself, nor any other developers, are going to review your work
  any faster if you do things like try to slip things in behind the
  maintainer's back as you attempted to do yesterday by asking Andrew to
  put your AF_UNIX garbage collector rewrite directly into his -mm tree.
 
 mmm...  I didn't see it as a slip-behind-the-back thing.  More a park it
 in -mm for a bit of concurrent testing and so it doesn't get lost and so I
 don't have to worry about it any more thing.  Plenty of that goes on.

Exactly.  I have CC-ed everyone involved, and clearly stated that I do
want it reviewed.  There wasn't any evil intention.

I _am_ slightly pissed off by still not having this bug fixed after
two months, and after spending considerable effort at hunting it down
and finding a solution for it (actually three different solutions by
now).

This may seem low priority and irrelevant, but it _is_ a bug, and it's
affecting a real world application.  So please take that into
consideration.

Thanks,
Miklos


Re: [PATCH resend] rewrite AF_UNIX garbage collector, fixes race

2007-07-11 Thread Miklos Szeredi
  This may seem low priority and irrelevant, but it _is_ a bug, and it's
  affecting a real world application.  So please take that into
  consideration.
 
 I'm not the only person who hasn't reviewed your latest patch yet.
 
 You may want to consider that this might have something to do with
 your attitude and behavior, rather than the importance of your patch.

I think it's more to do with everyone being busy and not considering
this important, which is understandable: this is fringe
functionality, and almost nobody uses it.

Problems with my attitude probably stem from this fact.  This bug is
important to _me_, and I behave accordingly, and that seems
inappropriate.

Dave, I really appreciate the work you do.  Please forgive my
impatience.

Miklos


Re: [PATCH resend] rewrite AF_UNIX garbage collector, fixes race

2007-07-11 Thread Miklos Szeredi
  Dave, I really appreciate the work you do.  Please forgive my
  impatience.
 
 No problem.
 
 Your patch is first in line for me to review tomorrow ok?

Great, thanks.

Miklos


[PATCH re-resend] fix race in AF_UNIX garbage collector

2007-07-10 Thread Miklos Szeredi
Andrew,

Can you please add this to -mm.  Sometimes that has wonderful effect
on the willingness of people to look at a patch.

It fixes a bug in AF_UNIX sockets that affects the ulockmgr library in
the fuse package.  And for some reason it doesn't seem to interest the
netdev guys in the least.

Thanks,
Miklos


From: Miklos Szeredi [EMAIL PROTECTED]

Throw out the old mark & sweep garbage collector and put in a
refcounting cycle detecting one.

The old one had a race with recvmsg, which resulted in false positives
and hence data loss.  The old algorithm operated on all unix sockets
in the system, so any additional locking would have meant performance
problems for all users of these.

The new algorithm instead only operates on in flight sockets, which
are very rare, and the additional locking for these doesn't negatively
impact the vast majority of users.

In fact it's probable that there weren't *any* heavy senders of
sockets over sockets, otherwise the above race would have been
discovered long ago.

The patch works OK with the app that exposed the race with the old
code.  The garbage collection has also been verified to work in a few
simple cases.

See comments inside patch for the description of the algorithm.

Signed-off-by: Miklos Szeredi [EMAIL PROTECTED]
---
 include/net/af_unix.h |3
 net/unix/af_unix.c|6
 net/unix/garbage.c|  329 --
 3 files changed, 192 insertions(+), 146 deletions(-)

Index: linux-2.6.22-rc6/include/net/af_unix.h
===
--- linux-2.6.22-rc6.orig/include/net/af_unix.h 2007-06-27 22:53:53.0 
+0200
+++ linux-2.6.22-rc6/include/net/af_unix.h  2007-06-27 22:53:55.0 
+0200
@@ -79,9 +79,10 @@ struct unix_sock {
 	struct mutex		readlock;
 	struct sock		*peer;
 	struct sock		*other;
-	struct sock		*gc_tree;
+	struct list_head	link;
 	atomic_t		inflight;
 	spinlock_t		lock;
+	unsigned int		gc_candidate : 1;
 	wait_queue_head_t	peer_wait;
 };
 #define unix_sk(__sk) ((struct unix_sock *)__sk)
Index: linux-2.6.22-rc6/net/unix/af_unix.c
===
--- linux-2.6.22-rc6.orig/net/unix/af_unix.c2007-06-27 22:53:53.0 
+0200
+++ linux-2.6.22-rc6/net/unix/af_unix.c 2007-06-27 22:53:55.0 +0200
@@ -592,7 +592,8 @@ static struct sock * unix_create1(struct
 	u->dentry = NULL;
 	u->mnt	  = NULL;
 	spin_lock_init(&u->lock);
-	atomic_set(&u->inflight, sock ? 0 : -1);
+	atomic_set(&u->inflight, 0);
+	INIT_LIST_HEAD(&u->link);
 	mutex_init(&u->readlock); /* single task reading lock */
 	init_waitqueue_head(&u->peer_wait);
 	unix_insert_socket(unix_sockets_unbound, sk);
@@ -1134,9 +1135,6 @@ restart:
 	/* take ten and and send info to listening sock */
 	spin_lock(&other->sk_receive_queue.lock);
 	__skb_queue_tail(&other->sk_receive_queue, skb);
-	/* Undo artificially decreased inflight after embrion
-	 * is installed to listening socket. */
-	atomic_inc(&newu->inflight);
 	spin_unlock(&other->sk_receive_queue.lock);
 	unix_state_unlock(other);
 	other->sk_data_ready(other, 0);
Index: linux-2.6.22-rc6/net/unix/garbage.c
===
--- linux-2.6.22-rc6.orig/net/unix/garbage.c2007-06-27 22:53:53.0 
+0200
+++ linux-2.6.22-rc6/net/unix/garbage.c 2007-06-27 22:53:55.0 +0200
@@ -62,6 +62,10 @@
  * AV  1 Mar 1999
  * Damn. Added missing check for -dead in listen queues scanning.
  *
+ * Miklos Szeredi 25 Jun 2007
+ * Reimplement with a cycle collecting algorithm. This should
+ * solve several problems with the previous code, like being racy
+ * wrt receive and holding up unrelated socket operations.
  */
 
 #include <linux/kernel.h>
@@ -84,10 +88,9 @@
 
 /* Internal data structures and random procedures: */
 
-#define GC_HEAD		((struct sock *)(-1))
-#define GC_ORPHAN	((struct sock *)(-3))
-
-static struct sock *gc_current = GC_HEAD; /* stack of objects to mark */
+static LIST_HEAD(gc_inflight_list);
+static LIST_HEAD(gc_candidates);
+static DEFINE_SPINLOCK(unix_gc_lock);
 
 atomic_t unix_tot_inflight = ATOMIC_INIT(0);
 
@@ -122,8 +125,16 @@ void unix_inflight(struct file *fp)
 {
struct sock *s = unix_get_socket(fp);
if(s) {
-	atomic_inc(&unix_sk(s)->inflight);
+	struct unix_sock *u = unix_sk(s);
+	spin_lock(&unix_gc_lock);
+	if (atomic_inc_return(&u->inflight) == 1) {
+		BUG_ON(!list_empty(&u->link));
+		list_add_tail(&u->link, &gc_inflight_list);
+   } else {
+   BUG_ON

Re: [PATCH re-resend] fix race in AF_UNIX garbage collector

2007-07-10 Thread Miklos Szeredi
  On Tue, 10 Jul 2007 11:50:30 +0200 Miklos Szeredi [EMAIL PROTECTED] wrote:
  
   Can you please add this to -mm.  Sometimes that has wonderful effect
   on the willingness of people to look at a patch.
  
  sure ;)
 
 Andrew please don't, it's in my to review pile.

Yeah, and I can guess where it is sitting in that pile ;)

Dave, if you are busy, just say so.  But if you don't reply anything,
I'm bound to think the patch is being ignored.

I apologize for being annoying.  Just let's get over this bloody bug ;)

Thanks,
Miklos
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH resend] rewrite AF_UNIX garbage collector, fixes race

2007-07-04 Thread Miklos Szeredi
From: Miklos Szeredi [EMAIL PROTECTED]

Throw out the old mark & sweep garbage collector and put in a
refcounting cycle detecting one.

The old one had a race with recvmsg, that resulted in false positives
and hence data loss.  The old algorithm operated on all unix sockets
in the system, so any additional locking would have meant performance
problems for all users of these.

The new algorithm instead only operates on in flight sockets, which
are very rare, and the additional locking for these doesn't negatively
impact the vast majority of users.

In fact it's probable, that there weren't *any* heavy senders of
sockets over sockets, otherwise the above race would have been
discovered long ago.

The patch works OK with the app that exposed the race with the old
code.  The garbage collection has also been verified to work in a few
simple cases.

See comments inside patch for the description of the algorithm.

Signed-off-by: Miklos Szeredi [EMAIL PROTECTED]
---
 include/net/af_unix.h |3
 net/unix/af_unix.c|6
 net/unix/garbage.c|  329 --
 3 files changed, 192 insertions(+), 146 deletions(-)

Index: linux-2.6.22-rc6/include/net/af_unix.h
===================================================================
--- linux-2.6.22-rc6.orig/include/net/af_unix.h	2007-06-27 22:53:53.000000000 +0200
+++ linux-2.6.22-rc6/include/net/af_unix.h	2007-06-27 22:53:55.000000000 +0200
@@ -79,9 +79,10 @@ struct unix_sock {
 	struct mutex		readlock;
 	struct sock		*peer;
 	struct sock		*other;
-	struct sock		*gc_tree;
+	struct list_head	link;
 	atomic_t		inflight;
 	spinlock_t		lock;
+	unsigned int		gc_candidate : 1;
 	wait_queue_head_t	peer_wait;
 #define unix_sk(__sk) ((struct unix_sock *)__sk)
Index: linux-2.6.22-rc6/net/unix/af_unix.c
===================================================================
--- linux-2.6.22-rc6.orig/net/unix/af_unix.c	2007-06-27 22:53:53.000000000 +0200
+++ linux-2.6.22-rc6/net/unix/af_unix.c	2007-06-27 22:53:55.000000000 +0200
@@ -592,7 +592,8 @@ static struct sock * unix_create1(struct
 	u->dentry = NULL;
 	u->mnt	  = NULL;
 	spin_lock_init(&u->lock);
-	atomic_set(&u->inflight, sock ? 0 : -1);
+	atomic_set(&u->inflight, 0);
+	INIT_LIST_HEAD(&u->link);
 	mutex_init(&u->readlock); /* single task reading lock */
 	init_waitqueue_head(&u->peer_wait);
 	unix_insert_socket(unix_sockets_unbound, sk);
@@ -1134,9 +1135,6 @@ restart:
 	/* take ten and and send info to listening sock */
 	spin_lock(&other->sk_receive_queue.lock);
 	__skb_queue_tail(&other->sk_receive_queue, skb);
-	/* Undo artificially decreased inflight after embrion
-	 * is installed to listening socket. */
-	atomic_inc(&newu->inflight);
 	spin_unlock(&other->sk_receive_queue.lock);
 	unix_state_unlock(other);
 	other->sk_data_ready(other, 0);
Index: linux-2.6.22-rc6/net/unix/garbage.c
===================================================================
--- linux-2.6.22-rc6.orig/net/unix/garbage.c	2007-06-27 22:53:53.000000000 +0200
+++ linux-2.6.22-rc6/net/unix/garbage.c	2007-06-27 22:53:55.000000000 +0200
@@ -62,6 +62,10 @@
  * AV  1 Mar 1999
  * Damn. Added missing check for -dead in listen queues scanning.
  *
+ * Miklos Szeredi 25 Jun 2007
+ * Reimplement with a cycle collecting algorithm. This should
+ * solve several problems with the previous code, like being racy
+ * wrt receive and holding up unrelated socket operations.
  */
 
 #include <linux/kernel.h>
@@ -84,10 +88,9 @@
 
 /* Internal data structures and random procedures: */
 
-#define GC_HEAD((struct sock *)(-1))
-#define GC_ORPHAN  ((struct sock *)(-3))
-
-static struct sock *gc_current = GC_HEAD; /* stack of objects to mark */
+static LIST_HEAD(gc_inflight_list);
+static LIST_HEAD(gc_candidates);
+static DEFINE_SPINLOCK(unix_gc_lock);
 
 atomic_t unix_tot_inflight = ATOMIC_INIT(0);
 
@@ -122,8 +125,16 @@ void unix_inflight(struct file *fp)
 {
 	struct sock *s = unix_get_socket(fp);
 	if(s) {
-		atomic_inc(&unix_sk(s)->inflight);
+		struct unix_sock *u = unix_sk(s);
+		spin_lock(&unix_gc_lock);
+		if (atomic_inc_return(&u->inflight) == 1) {
+			BUG_ON(!list_empty(&u->link));
+			list_add_tail(&u->link, &gc_inflight_list);
+		} else {
+			BUG_ON(list_empty(&u->link));
+		}
 		atomic_inc(&unix_tot_inflight);
+		spin_unlock(&unix_gc_lock);
 	}
 }
 
@@ -131,182 +142,218 @@ void unix_notinflight(struct file *fp)
 {
 	struct sock *s = unix_get_socket(fp);
 	if(s) {
-		atomic_dec(&unix_sk(s)->inflight

Re: [PATCH] fix race in AF_UNIX

2007-06-23 Thread Miklos Szeredi
Eric, thanks for looking at this.

   There are races involving the garbage collector, that can throw away
   perfectly good packets with AF_UNIX sockets in them.
   
   The problems arise when a socket goes from installed to in-flight or
   vice versa during garbage collection.  Since gc is done with a
   spinlock held, this only shows up on SMP.
   
   Signed-off-by: Miklos Szeredi [EMAIL PROTECTED]
  
  I'm going to hold off on this one for now.
  
  Holding all of the read locks kind of defeats the purpose of using
  the per-socket lock.
  
  Can't you just lock purely around the receive queue operation?
 
  That's already protected by the receive queue spinlock.  The race
  however happens _between_ pushing the root set and marking of the
  in-flight but reachable sockets.
 
  If in that space any of the AF_UNIX sockets goes from in-flight to
  installed into a file descriptor, the garbage collector can miss it.
  If we want to protect against this using unix_sk(s)->readlock, then we
  have to hold all of them for the duration of the marking.
 
  Al, Alan, you have more experience with this piece of code.  Do you
  have better ideas about how to fix this?
 
 I haven't looked at the code closely enough to be confident of
 changing something in this area.  However the classic solution to this
 kind of gc problem is to mark things that are manipulated during
 garbage collection as dirty (not orphaned).
 
 It should be possible to fix this problem by simply changing gc_tree
 when we perform a problematic manipulation of a passed socket, such
 as installing a passed socket into the file descriptors of a process.
 
 Essentially the idea is moving the current code in the direction of
 an incremental gc algorithm.
 
 
 If I understand the race properly.  What happens is that we dequeue
 a socket (which has packets in it passing sockets) before the
 garbage collector gets to it.  Therefore the garbage collector
 never processes that socket.  So it sounds like we just
 need to call maybe_unmark_and_push or possibly just wait for
 the garbage collector to complete when we do that and the packet
 we have pulled out 

Right.  But the devil is in the details, and (as you correctly point
out later) to implement this, the whole locking scheme needs to be
overhauled.  Problems:

 - Using the queue lock to make the dequeue and the fd detach atomic
   wrt the GC is difficult, if not impossible: they are are far from
   each other with various magic in between.  It would need thorough
   understanding of these functions and _big_ changes to implement.

 - Sleeping on u->readlock in GC is currently not possible, since that
   could deadlock with unix_dgram_recvmsg().  That function could
   probably be modified to release u->readlock, while waiting for
   data, similarly to unix_stream_recvmsg() at the cost of some added
   complexity.

 - Sleeping on u->readlock is also impossible, because GC is holding
   unix_table_lock for the whole operation.  We could release
   unix_table_lock, but then would have to cope with sockets coming
   and going, making the current socket iterator unworkable.

So theoretically it's quite simple, but it needs big changes.  And
this wouldn't even solve all the problems with the GC, like being a
possible DoS vector.

Miklos


Re: [PATCH] fix race in AF_UNIX

2007-06-18 Thread Miklos Szeredi
Ping Dave,

Since there doesn't seem to be any new ideas forthcoming, can we
please decide on either one of my two submitted patches?

Thanks,
Miklos

 [CC'd Al Viro and Alan Cox, restored patch]
 
   There are races involving the garbage collector, that can throw away
   perfectly good packets with AF_UNIX sockets in them.
   
   The problems arise when a socket goes from installed to in-flight or
   vice versa during garbage collection.  Since gc is done with a
   spinlock held, this only shows up on SMP.
   
   Signed-off-by: Miklos Szeredi [EMAIL PROTECTED]
  
  I'm going to hold off on this one for now.
  
  Holding all of the read locks kind of defeats the purpose of using
  the per-socket lock.
  
  Can't you just lock purely around the receive queue operation?
 
 That's already protected by the receive queue spinlock.  The race
 however happens _between_ pushing the root set and marking of the
 in-flight but reachable sockets.
 
 If in that space any of the AF_UNIX sockets goes from in-flight to
 installed into a file descriptor, the garbage collector can miss it.
 If we want to protect against this using unix_sk(s)->readlock, then we
 have to hold all of them for the duration of the marking.
 
 Al, Alan, you have more experience with this piece of code.  Do you
 have better ideas about how to fix this?
 
 Thanks,
 Miklos
 
  Index: linux-2.6.22-rc2/net/unix/garbage.c
  ===================================================================
  --- linux-2.6.22-rc2.orig/net/unix/garbage.c	2007-06-03 23:58:11.000000000 +0200
  +++ linux-2.6.22-rc2/net/unix/garbage.c	2007-06-06 09:48:36.000000000 +0200
  @@ -186,7 +186,21 @@ void unix_gc(void)
   
  	forall_unix_sockets(i, s)
  	{
  -		unix_sk(s)->gc_tree = GC_ORPHAN;
  +		struct unix_sock *u = unix_sk(s);
  +
  +		u->gc_tree = GC_ORPHAN;
  +
  +		/*
  +		 * Hold ->readlock to protect against sockets going from
  +		 * in-flight to installed
  +		 *
  +		 * Can't sleep on this, because
  +		 *   a) we are under spinlock
  +		 *   b) skb_recv_datagram() could be waiting for a packet that
  +		 *      is to be sent by this thread
  +		 */
  +		if (!mutex_trylock(&u->readlock))
  +			goto lock_failed;
  	}
  /*
   *  Everything is now marked
  @@ -207,8 +221,6 @@ void unix_gc(void)
   
  	forall_unix_sockets(i, s)
  	{
  -		int open_count = 0;
  -
  		/*
  		 *	If all instances of the descriptor are not
  		 *	in flight we are in use.
  @@ -218,10 +230,20 @@ void unix_gc(void)
  		 *	In this case (see unix_create1()) we set artificial
  		 *	negative inflight counter to close race window.
  		 *	It is trick of course and dirty one.
  +		 *
  +		 *	Get the inflight counter first, then the open
  +		 *	counter.  This avoids problems if racing with
  +		 *	sendmsg
  +		 *
  +		 *	If just created socket is not yet attached to
  +		 *	a file descriptor, assume open_count of 1
  		 */
  +		int inflight_count = atomic_read(&unix_sk(s)->inflight);
  +		int open_count = 1;
  +
  		if (s->sk_socket && s->sk_socket->file)
  			open_count = file_count(s->sk_socket->file);
  -		if (open_count > atomic_read(&unix_sk(s)->inflight))
  +		if (open_count > inflight_count)
  			maybe_unmark_and_push(s);
  	}
   
  @@ -300,6 +322,7 @@ void unix_gc(void)
  			spin_unlock(&s->sk_receive_queue.lock);
  		}
  		u->gc_tree = GC_ORPHAN;
  +		mutex_unlock(&u->readlock);
  	}
  	spin_unlock(&unix_table_lock);
   
  @@ -309,4 +332,19 @@ void unix_gc(void)
   
  	__skb_queue_purge(&hitlist);
  	mutex_unlock(&unix_gc_sem);
  +	return;
  +
  + lock_failed:
  +	{
  +		struct sock *s1;
  +		forall_unix_sockets(i, s1) {
  +			if (s1 == s) {
  +				spin_unlock(&unix_table_lock);
  +				mutex_unlock(&unix_gc_sem);
  +				return;
  +			}
  +			mutex_unlock(&unix_sk(s1)->readlock);
  +		}
  +		BUG();
  +	}
   }
  


Re: [PATCH] fix race in AF_UNIX

2007-06-18 Thread Miklos Szeredi
 From: Miklos Szeredi [EMAIL PROTECTED]
 Date: Mon, 18 Jun 2007 09:49:32 +0200
 
  Ping Dave,
  
  Since there doesn't seem to be any new ideas forthcoming, can we
   please decide on either one of my two submitted patches?
 
 You just need to be patient, we can't just put a bad patch
 in because a better one has not been suggested yet :-)

And is anyone working on a better patch?

Those patches aren't bad in the correctness sense.  So IMO any one
of them is better, than having that bug in there.

Miklos


Re: [PATCH] fix race in AF_UNIX

2007-06-18 Thread Miklos Szeredi
  And is anyone working on a better patch?
 
 I have no idea.
 
  Those patches aren't bad in the correctness sense.  So IMO any one
  of them is better, than having that bug in there.
 
 You're adding a very serious performance regression, which is
 about as bad as the bug itself.

No, correctness always trumps performance.  Lost packets on an AF_UNIX
socket are _unacceptable_, and this is definitely not a theoretical
problem.

And BTW my second patch does _not_ have the performance problems you
are arguing about, it's just plain ugly.  But hey, if you can't live
with ugly code, go and fix it.

 It can wait for a more appropriate fix.

Now _please_ be a bit more constructive.

Do you want me to send the patch to Andrew instead?  His attitude
towards bugfixes is rather better ;)

Miklos


Re: [PATCH] fix race in AF_UNIX

2007-06-18 Thread Miklos Szeredi
And is anyone working on a better patch?
   
   I have no idea.
   
Those patches aren't bad in the correctness sense.  So IMO any one
of them is better, than having that bug in there.
   
   You're adding a very serious performance regression, which is
   about as bad as the bug itself.
  
  No, correctness always trumps performance.
 
 To a point.  There is no black and white in this world.
 
  Lost packets on an AF_UNIX socket are _unacceptable_, and this is
  definitely not a theoretical problem.
 
 A lot of people will consider having all of their AF_UNIX sockets on
 their 64 cpu system just stop when garbage collection runs to be
 unacceptable as well.

Garbage collection only ever happens, if the app is sending AF_UNIX
sockets over AF_UNIX sockets.  Which is a rather rare case.  And which
is basically why this bug went unnoticed for so long.

So my second patch only affects the performance of _exactly_ those
apps which might well be bitten by the bug itself.

 Secondarily, this bug has been around for years and nobody noticed.
 The world will not explode if this bug takes a few more days or
 even a week to work out.  Let's do it right instead of ramming
 arbitrary turds into the kernel.

Fine, but just wishing a bug to get fixed won't accomplish anything.
I've spent a fair amount of time debugging this thing, and I'm out of
ideas.  Really.  So unless somebody steps up to look at this, it won't
_ever_ get fixed.

Miklos


Re: [PATCH] fix race in AF_UNIX

2007-06-18 Thread Miklos Szeredi
   Secondarily, this bug has been around for years and nobody noticed.
   The world will not explode if this bug takes a few more days or
   even a week to work out.  Let's do it right instead of ramming
   arbitrary turds into the kernel.
  
  Fine, but just wishing a bug to get fixed won't accomplish anything.
  I've spent a fair amount of time debugging this thing, and I'm out of
  ideas.  Really.  So unless somebody steps up to look at this, it won't
  _ever_ get fixed.
 
 Somone just needs to find a way to only lock the socket as it is
 being operated upon.

No, you are still not getting it.  The problem is that the socket
needs to be locked for the _whole_ of the garbage collection.  This is
because the gc is done in two phases: in the first phase, sockets
which are installed into file descriptors are collected as a root set;
then the set is expanded by iterating over everything that is in it
and everything that is later added.  If a socket moves from in-flight
to installed _during_ the gc, then the gc can miss it.  So the socket
must be locked for the duration of _both_ phases.

 The race you are dealing with is rather simple, the queue check
 and the state check need to be done atomically.  The only chore
 is to find a way to make that happen in the context of what the
 garbage allocator is trying to do.
 
 I'm not even convinced that your most recent attempt is deadlock free.
 Locking multiple objects the same way all at once like that is
 something that needs to be seriously audited.

It's doing trylocks and releasing all those locks within the same spin
locked region, how the f**k can that deadlock?

The only problem I've experienced with this patch (other than being
ugly) is that the multitude of locks it acquires makes the lockdep
code give up.  But I think we can live with that.

Miklos


Re: [PATCH] fix race in AF_UNIX

2007-06-18 Thread Miklos Szeredi
 * Miklos Szeredi [EMAIL PROTECTED] 2007-06-18 11:44
  Garbage collection only ever happens, if the app is sending AF_UNIX
  sockets over AF_UNIX sockets.  Which is a rather rare case.  And which
  is basically why this bug went unnoticed for so long.
  
  So my second patch only affects the performance of _exactly_ those
  apps which might well be bitten by the bug itself.
 
 That's not entirely the truth. It affects all applications using
 AF_UNIX sockets while file descriptors are being transfered. I
 agree that the performance impact is not severe on most systems
 but if file descriptors are being transfered continously by just
 a single application it can become rather severe.

You are wrong.  Look in unix_release_sock():

	if (atomic_read(&unix_tot_inflight))
		unix_gc();		/* Garbage collect fds */


unix_tot_inflight is the number of AF_UNIX sockets currently being
transferred over some AF_UNIX sockets.

That means that just sending (non-unix socket) fds over unix sockets
will never invoke the gc.

Miklos


Re: [PATCH] fix race in AF_UNIX

2007-06-18 Thread Miklos Szeredi
 * Thomas Graf [EMAIL PROTECTED] 2007-06-18 12:32
  * Miklos Szeredi [EMAIL PROTECTED] 2007-06-18 11:44
   Garbage collection only ever happens, if the app is sending AF_UNIX
   sockets over AF_UNIX sockets.  Which is a rather rare case.  And which
   is basically why this bug went unnoticed for so long.
   
   So my second patch only affects the performance of _exactly_ those
   apps which might well be bitten by the bug itself.
  
  That's not entirely the truth. It affects all applications using
  AF_UNIX sockets while file descriptors are being transfered. I
  agree that the performance impact is not severe on most systems
  but if file descriptors are being transfered continously by just
  a single application it can become rather severe.
 
 Also think of the scenario where an application, deliberately or not,
 begins a file descriptor tranfser using sendmsg() and the receiving
 part never invokes recvmsg() to decrement the inflight counters
 again. Every unix socket that gets closed would result in a gc call
 locking all sockets.

And if some of the sent files were unix sockets, rightly so, since the
sent sockets might need to be garbage collected.

And BTW the whole gc is done with the unix_table_lock held, so it will
stop some socket operations anyway.  The fact that it needs to stop
some more operations is a necessary thing.  But we are talking about a
_spinlocked_ region, which for zillions of sockets might run for a
long time, but it's not as if it's really going to affect performance
in real cases.

Miklos


Re: [PATCH] fix race in AF_UNIX

2007-06-18 Thread Miklos Szeredi
  but it's not as if it's really going to affect performance
  in real cases.
 
 Since these circumstances are creatable by any user, we have
 to consider the cases caused by malicious entities.

OK.  But then the whole gc thing is already broken, since a user can
DoS socket creation/destruction.

I'm all for fixing this gc mess that we have now.  But please don't
expect me to be the one who's doing it.

Miklos


Re: [PATCH] fix race in AF_UNIX

2007-06-18 Thread Miklos Szeredi
  I'm all for fixing this gc mess that we have now.  But please don't
  expect me to be the one who's doing it.
 
 Don't worry, I only expect you to make the situation worse :-)

That's real nice.  Looks like finding and fixing bugs in not
appreciated in the networking subsystem :-/

Miklos


Re: [PATCH] fix race in AF_UNIX

2007-06-18 Thread Miklos Szeredi
   I'm all for fixing this gc mess that we have now.  But please don't
   expect me to be the one who's doing it.
  
  Don't worry, I only expect you to make the situation worse :-)
 
 In any event, I'll try to find some time to look more at your patch.
 
 But just like you don't want to be expected to rewrite the AF_UNIX GC
 code, you can't expect me to go right to your patch right when you
 want me to.
 
 I am one human being, and I depend upon other developers to help me
 out with patch review and so on.  If nobody else wants to review your
 patch that doesn't work out so well for me.

Sure, no problem.  I just pinged you after a week of inactivity, to
make sure it hasn't been forgotten.  Just say if you are busy and I'll
wait, but trying to forget the issue is not the way to deal with it.

 Your behavior also tends to make me (and others) put your patch near
 the end of the todo list rather than near the top.  Fix your attitude
 and people will enjoy reviewing your patches that much more.  Reviewing
 your work will become something to look forward to rather than a chore.

I'll happily admit I'm a fool when it comes to social interactions.

Miklos


Re: [PATCH] fix race in AF_UNIX

2007-06-18 Thread Miklos Szeredi
  No, correctness always trumps performance.  Lost packets on an AF_UNIX
  socket are _unexceptable_, and this is definitely not a theoretical
  problem.
 
 If it's so unacceptable why has nobody noticed until now - it's a bug
 clearly, it needs fixing clearly, but unless you can produce some kind of
 exploit from it (eg a DoS attack or kernel memory leak exploiter) it
 doesn't appear to be that serious.

It's serious in that it affects the operation of one of my
applications in a way that is rather hard to work around.  Sure, I
could resend the lost sockets, but it's something one doesn't do
unnecessarily for reliable transports.

And it's entirely possible, that it could bite other apps in rare and
mysterious ways.  All you need is

  - an application that sends a unix socket over a unix socket

  - a parallel close() operation on an independent unix socket

The two might happen to be from totally unrelated apps.

It's all the more serious, because it could happen rarely and
unreproducibly, and could well be crashing the app which is not
expecting this behavior.

  And BTW my second patch does _not_ have the performance problems you
  are arguing about, it's just plain ugly.  But hey, if you can't live
  with ugly code, go and fix it.
 
 If you put ugly code into the kernel you pay for maintaining it for years
 to come. If you get it right then you don't
 
  Do you want me to send the patch to Andrew instead?  His attitude
  towards bugfixes is rather better ;)
 
 And it'll get NAKked and binned. DaveM is (as happens sometimes ;)) right
 to insist on the code being clean and efficient.

Right, but treating a bug in your subsystem as if it's entirely the
responsibility of the reporter is not the right attitude either I
think.

Miklos


Re: [PATCH] fix race in AF_UNIX

2007-06-11 Thread Miklos Szeredi
[CC'd Al Viro and Alan Cox, restored patch]

  There are races involving the garbage collector, that can throw away
  perfectly good packets with AF_UNIX sockets in them.
  
  The problems arise when a socket goes from installed to in-flight or
  vice versa during garbage collection.  Since gc is done with a
  spinlock held, this only shows up on SMP.
  
  Signed-off-by: Miklos Szeredi [EMAIL PROTECTED]
 
 I'm going to hold off on this one for now.
 
 Holding all of the read locks kind of defeats the purpose of using
 the per-socket lock.
 
 Can't you just lock purely around the receive queue operation?

That's already protected by the receive queue spinlock.  The race
however happens _between_ pushing the root set and marking of the
in-flight but reachable sockets.

If in that space any of the AF_UNIX sockets goes from in-flight to
installed into a file descriptor, the garbage collector can miss it.
If we want to protect against this using unix_sk(s)->readlock, then we
have to hold all of them for the duration of the marking.

Al, Alan, you have more experience with this piece of code.  Do you
have better ideas about how to fix this?

Thanks,
Miklos

 Index: linux-2.6.22-rc2/net/unix/garbage.c
 ===================================================================
 --- linux-2.6.22-rc2.orig/net/unix/garbage.c	2007-06-03 23:58:11.000000000 +0200
 +++ linux-2.6.22-rc2/net/unix/garbage.c	2007-06-06 09:48:36.000000000 +0200
 @@ -186,7 +186,21 @@ void unix_gc(void)
  
 	forall_unix_sockets(i, s)
 	{
 -		unix_sk(s)->gc_tree = GC_ORPHAN;
 +		struct unix_sock *u = unix_sk(s);
 +
 +		u->gc_tree = GC_ORPHAN;
 +
 +		/*
 +		 * Hold ->readlock to protect against sockets going from
 +		 * in-flight to installed
 +		 *
 +		 * Can't sleep on this, because
 +		 *   a) we are under spinlock
 +		 *   b) skb_recv_datagram() could be waiting for a packet that
 +		 *      is to be sent by this thread
 +		 */
 +		if (!mutex_trylock(&u->readlock))
 +			goto lock_failed;
 	}
   /*
*  Everything is now marked
 @@ -207,8 +221,6 @@ void unix_gc(void)
  
 	forall_unix_sockets(i, s)
 	{
 -		int open_count = 0;
 -
 		/*
 		 *	If all instances of the descriptor are not
 		 *	in flight we are in use.
 @@ -218,10 +230,20 @@ void unix_gc(void)
 		 *	In this case (see unix_create1()) we set artificial
 		 *	negative inflight counter to close race window.
 		 *	It is trick of course and dirty one.
 +		 *
 +		 *	Get the inflight counter first, then the open
 +		 *	counter.  This avoids problems if racing with
 +		 *	sendmsg
 +		 *
 +		 *	If just created socket is not yet attached to
 +		 *	a file descriptor, assume open_count of 1
 		 */
 +		int inflight_count = atomic_read(&unix_sk(s)->inflight);
 +		int open_count = 1;
 +
 		if (s->sk_socket && s->sk_socket->file)
 			open_count = file_count(s->sk_socket->file);
 -		if (open_count > atomic_read(&unix_sk(s)->inflight))
 +		if (open_count > inflight_count)
 			maybe_unmark_and_push(s);
 	}
  
 @@ -300,6 +322,7 @@ void unix_gc(void)
 			spin_unlock(&s->sk_receive_queue.lock);
 		}
 		u->gc_tree = GC_ORPHAN;
 +		mutex_unlock(&u->readlock);
 	}
 	spin_unlock(&unix_table_lock);
  
 @@ -309,4 +332,19 @@ void unix_gc(void)
  
 	__skb_queue_purge(&hitlist);
 	mutex_unlock(&unix_gc_sem);
 +	return;
 +
 + lock_failed:
 +	{
 +		struct sock *s1;
 +		forall_unix_sockets(i, s1) {
 +			if (s1 == s) {
 +				spin_unlock(&unix_table_lock);
 +				mutex_unlock(&unix_gc_sem);
 +				return;
 +			}
 +			mutex_unlock(&unix_sk(s1)->readlock);
 +		}
 +		BUG();
 +	}
  }
 


Re: [PATCH] fix race in AF_UNIX

2007-06-06 Thread Miklos Szeredi
   Holding a global mutex over recvmsg() calls under AF_UNIX is pretty
   much a non-starter, this will kill performance for multi-threaded
   apps.
  
  That's an rwsem held for read.  It's held for write in unix_gc() only
  for a short duration, and unix_gc() should only rarely be called.  So
  I don't think there's any performance problem here.
 
 It pulls a non-local cacheline into the local thread, that's extremely
 expensive on SMP.

OK, here's an updated patch, that uses -readlock, and passes my
testing.


From: Miklos Szeredi [EMAIL PROTECTED]

There are races involving the garbage collector, that can throw away
perfectly good packets with AF_UNIX sockets in them.

The problems arise when a socket goes from installed to in-flight or
vice versa during garbage collection.  Since gc is done with a
spinlock held, this only shows up on SMP.

Signed-off-by: Miklos Szeredi [EMAIL PROTECTED]
---

Index: linux-2.6.22-rc2/net/unix/garbage.c
===================================================================
--- linux-2.6.22-rc2.orig/net/unix/garbage.c	2007-06-03 23:58:11.000000000 +0200
+++ linux-2.6.22-rc2/net/unix/garbage.c	2007-06-06 09:48:36.000000000 +0200
@@ -186,7 +186,21 @@ void unix_gc(void)
 
	forall_unix_sockets(i, s)
	{
-		unix_sk(s)->gc_tree = GC_ORPHAN;
+		struct unix_sock *u = unix_sk(s);
+
+		u->gc_tree = GC_ORPHAN;
+
+		/*
+		 * Hold ->readlock to protect against sockets going from
+		 * in-flight to installed
+		 *
+		 * Can't sleep on this, because
+		 *   a) we are under spinlock
+		 *   b) skb_recv_datagram() could be waiting for a packet that
+		 *      is to be sent by this thread
+		 */
+		if (!mutex_trylock(&u->readlock))
+			goto lock_failed;
	}
/*
 *  Everything is now marked
@@ -207,8 +221,6 @@ void unix_gc(void)
 
	forall_unix_sockets(i, s)
	{
-		int open_count = 0;
-
		/*
		 *	If all instances of the descriptor are not
		 *	in flight we are in use.
@@ -218,10 +230,20 @@ void unix_gc(void)
		 *	In this case (see unix_create1()) we set artificial
		 *	negative inflight counter to close race window.
		 *	It is trick of course and dirty one.
+		 *
+		 *	Get the inflight counter first, then the open
+		 *	counter.  This avoids problems if racing with
+		 *	sendmsg
+		 *
+		 *	If just created socket is not yet attached to
+		 *	a file descriptor, assume open_count of 1
		 */
+		int inflight_count = atomic_read(&unix_sk(s)->inflight);
+		int open_count = 1;
+
		if (s->sk_socket && s->sk_socket->file)
			open_count = file_count(s->sk_socket->file);
-		if (open_count > atomic_read(&unix_sk(s)->inflight))
+		if (open_count > inflight_count)
			maybe_unmark_and_push(s);
	}
 
@@ -300,6 +322,7 @@ void unix_gc(void)
			spin_unlock(&s->sk_receive_queue.lock);
		}
		u->gc_tree = GC_ORPHAN;
+		mutex_unlock(&u->readlock);
	}
	spin_unlock(&unix_table_lock);
 
@@ -309,4 +332,19 @@ void unix_gc(void)
 
	__skb_queue_purge(&hitlist);
	mutex_unlock(&unix_gc_sem);
+	return;
+
+ lock_failed:
+	{
+		struct sock *s1;
+		forall_unix_sockets(i, s1) {
+			if (s1 == s) {
+				spin_unlock(&unix_table_lock);
+				mutex_unlock(&unix_gc_sem);
+				return;
+			}
+			mutex_unlock(&unix_sk(s1)->readlock);
+		}
+		BUG();
+	}
 }

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] fix race in AF_UNIX

2007-06-05 Thread Miklos Szeredi
   A recv() on an AF_UNIX, SOCK_STREAM socket can race with a
   send()+close() on the peer, causing recv() to return zero, even though
   the sent data should be received.
   
   This happens if the send() and the close() are performed between
   skb_dequeue() and checking sk->sk_shutdown in unix_stream_recvmsg():
   
   process A  skb_dequeue() returns NULL, there's no data in the socket queue
   process B  new data is inserted onto the queue by unix_stream_sendmsg()
   process B  sk->sk_shutdown is set to SHUTDOWN_MASK by unix_release_sock()
   process A  sk->sk_shutdown is checked, unix_release_sock() returns zero
  
  This is only part of the story.  It turns out there are other races
  involving the garbage collector that can throw away perfectly good
  packets with AF_UNIX sockets in them.
  
  The problems arise when a socket goes from installed to in-flight or
  vice versa during garbage collection.  Since gc is done with a
  spinlock held, this only shows up on SMP.
  
  The following patch fixes it for me, but it's possibly the wrong
  approach.
  
  Signed-off-by: Miklos Szeredi [EMAIL PROTECTED]
 
 I haven't seen a repost of the first patch, which is necessary because
 that first patch doesn't apply to the current tree.  Please don't
 ignore Arnaldo's feedback like that, or else I'll ignore you just the
 same. :-)

I just want to win the "who's laziest?" league.  It would take me
about 5 minutes to get the netdev tree and test compile the change.
Of which 5 seconds would be actually updating the patch.  I
thought it was OK to pass that 5 seconds worth of hard work to you in
order to save the rest ;)

Anyway here's the updated (but not compile tested) patch.

Thanks,
Miklos

From: Miklos Szeredi [EMAIL PROTECTED]

A recv() on an AF_UNIX, SOCK_STREAM socket can race with a
send()+close() on the peer, causing recv() to return zero, even though
the sent data should be received.

This happens if the send() and the close() are performed between
skb_dequeue() and checking sk->sk_shutdown in unix_stream_recvmsg():

process A  skb_dequeue() returns NULL, there's no data in the socket queue
process B  new data is inserted onto the queue by unix_stream_sendmsg()
process B  sk->sk_shutdown is set to SHUTDOWN_MASK by unix_release_sock()
process A  sk->sk_shutdown is checked, unix_release_sock() returns zero

I'm surprised nobody noticed this, it's not hard to trigger.  Maybe
it's just (un)luck with the timing.

It's possible to work around this bug in userspace, by retrying the
recv() once in case of a zero return value.

Signed-off-by: Miklos Szeredi [EMAIL PROTECTED]
---

Index: linux-2.6.22-rc2/net/unix/af_unix.c
===
--- linux-2.6.22-rc2.orig/net/unix/af_unix.c	2007-06-02 23:45:47.0 +0200
+++ linux-2.6.22-rc2/net/unix/af_unix.c	2007-06-02 23:45:49.0 +0200
@@ -1711,20 +1711,23 @@ static int unix_stream_recvmsg(struct ki
int chunk;
struct sk_buff *skb;
 
+   unix_state_lock(sk);
	skb = skb_dequeue(&sk->sk_receive_queue);
	if (skb==NULL)
	{
		if (copied >= target)
-   break;
+   goto unlock;
 
/*
 *  POSIX 1003.1g mandates this order.
 */
 
if ((err = sock_error(sk)) != 0)
-   break;
+   goto unlock;
		if (sk->sk_shutdown & RCV_SHUTDOWN)
-   break;
+   goto unlock;
+
+   unix_state_unlock(sk);
err = -EAGAIN;
if (!timeo)
break;
@@ -1738,7 +1741,11 @@ static int unix_stream_recvmsg(struct ki
}
	mutex_lock(&u->readlock);
continue;
+ unlock:
+   unix_state_unlock(sk);
+   break;
}
+   unix_state_unlock(sk);
 
if (check_creds) {
/* Never glue messages from different writers */



Re: [PATCH] fix race in AF_UNIX

2007-06-05 Thread Miklos Szeredi
  I just want to win the "who's laziest?" league.  It would take me
  about 5 minutes to get the netdev tree and test compile the change.
  Of which 5 seconds would be actually updating the patch.  I
  thought it was OK to pass that 5 seconds worth of hard work to you in
  order to save the rest ;)
 
 That tradeoff is fine, if, in return you'll do the rest of the
 networking subsystem maintainership work I need to do. :-)

Well, I _did_ save you quite a bit of time by tracking down these
bugs.  That 5 sec of dedication in exchange would have really made me
feel good ;(

Miklos


Re: [PATCH] fix race in AF_UNIX

2007-06-05 Thread Miklos Szeredi
  From: Miklos Szeredi [EMAIL PROTECTED]
  Date: Mon, 04 Jun 2007 11:45:32 +0200
  
A recv() on an AF_UNIX, SOCK_STREAM socket can race with a
send()+close() on the peer, causing recv() to return zero, even though
the sent data should be received.

This happens if the send() and the close() are performed between
skb_dequeue() and checking sk->sk_shutdown in unix_stream_recvmsg():

process A  skb_dequeue() returns NULL, there's no data in the socket queue
process B  new data is inserted onto the queue by unix_stream_sendmsg()
process B  sk->sk_shutdown is set to SHUTDOWN_MASK by unix_release_sock()
process A  sk->sk_shutdown is checked, unix_release_sock() returns zero
   
   This is only part of the story.  It turns out there are other races
   involving the garbage collector that can throw away perfectly good
   packets with AF_UNIX sockets in them.
   
   The problems arise when a socket goes from installed to in-flight or
   vice versa during garbage collection.  Since gc is done with a
   spinlock held, this only shows up on SMP.
   
   The following patch fixes it for me, but it's possibly the wrong
   approach.
   
   Signed-off-by: Miklos Szeredi [EMAIL PROTECTED]
 
 Concerning this specific patch I think we need to rethink it
 a bit.
 
 Holding a global mutex over recvmsg() calls under AF_UNIX is pretty
 much a non-starter; this will kill performance for multi-threaded
 apps.

That's an rwsem held for read.  It's held for write in unix_gc() only
for a short duration, and unix_gc() should only rarely be called.  So
I don't think there's any performance problem here.

 
 One possible solution is for the garbage collection code to hold the
 u->readlock while processing a socket, but be careful about deadlocks.

That would have exactly the same effect.  Only the code would be more
complicated.

Miklos


Re: [PATCH] fix race in AF_UNIX

2007-06-04 Thread Miklos Szeredi
 A recv() on an AF_UNIX, SOCK_STREAM socket can race with a
 send()+close() on the peer, causing recv() to return zero, even though
 the sent data should be received.
 
 This happens if the send() and the close() are performed between
 skb_dequeue() and checking sk->sk_shutdown in unix_stream_recvmsg():
 
 process A  skb_dequeue() returns NULL, there's no data in the socket queue
 process B  new data is inserted onto the queue by unix_stream_sendmsg()
 process B  sk->sk_shutdown is set to SHUTDOWN_MASK by unix_release_sock()
 process A  sk->sk_shutdown is checked, unix_release_sock() returns zero

This is only part of the story.  It turns out there are other races
involving the garbage collector that can throw away perfectly good
packets with AF_UNIX sockets in them.

The problems arise when a socket goes from installed to in-flight or
vice versa during garbage collection.  Since gc is done with a
spinlock held, this only shows up on SMP.

The following patch fixes it for me, but it's possibly the wrong
approach.

Signed-off-by: Miklos Szeredi [EMAIL PROTECTED]
---

Index: linux-2.6.22-rc2/net/unix/garbage.c
===
--- linux-2.6.22-rc2.orig/net/unix/garbage.c	2007-06-03 23:58:11.0 +0200
+++ linux-2.6.22-rc2/net/unix/garbage.c	2007-06-04 11:39:42.0 +0200
@@ -90,6 +90,7 @@
 static struct sock *gc_current = GC_HEAD; /* stack of objects to mark */
 
 atomic_t unix_tot_inflight = ATOMIC_INIT(0);
+DECLARE_RWSEM(unix_gc_sem);
 
 
 static struct sock *unix_get_socket(struct file *filp)
@@ -169,7 +170,7 @@ static void maybe_unmark_and_push(struct
 
 void unix_gc(void)
 {
-   static DEFINE_MUTEX(unix_gc_sem);
+   static DEFINE_MUTEX(unix_gc_local_lock);
int i;
struct sock *s;
struct sk_buff_head hitlist;
@@ -179,9 +180,22 @@ void unix_gc(void)
 *  Avoid a recursive GC.
 */
 
-	if (!mutex_trylock(&unix_gc_sem))
+	if (!mutex_trylock(&unix_gc_local_lock))
return;
 
+
+   /*
+* unix_gc_sem protects against sockets going from in-flight to
+* installed
+*
+* Can't sleep on this, because skb_recv_datagram could be
+* waiting for a packet that is to be sent by the thread which
+* invoked the gc
+*/
+	if (!down_write_trylock(&unix_gc_sem)) {
+		mutex_unlock(&unix_gc_local_lock);
+		return;
+	}
	spin_lock(&unix_table_lock);
 
forall_unix_sockets(i, s)
@@ -207,8 +221,6 @@ void unix_gc(void)
 
forall_unix_sockets(i, s)
{
-   int open_count = 0;
-
/*
 *  If all instances of the descriptor are not
 *  in flight we are in use.
@@ -218,10 +230,20 @@ void unix_gc(void)
 *  In this case (see unix_create1()) we set artificial
 *  negative inflight counter to close race window.
 *  It is trick of course and dirty one.
+*
+*  Get the inflight counter first, then the open
+*  counter.  This avoids problems if racing with
+*  sendmsg
+*
+*  If just created socket is not yet attached to
+*  a file descriptor, assume open_count of 1
 */
+	int inflight_count = atomic_read(&unix_sk(s)->inflight);
+	int open_count = 1;
+
	if (s->sk_socket && s->sk_socket->file)
		open_count = file_count(s->sk_socket->file);
-	if (open_count > atomic_read(&unix_sk(s)->inflight))
+	if (open_count > inflight_count)
maybe_unmark_and_push(s);
}
 
@@ -302,11 +324,12 @@ void unix_gc(void)
	u->gc_tree = GC_ORPHAN;
	}
	spin_unlock(&unix_table_lock);
+	up_write(&unix_gc_sem);
 
/*
 *  Here we are. Hitlist is filled. Die.
 */
 
	__skb_queue_purge(&hitlist);
-	mutex_unlock(&unix_gc_sem);
+	mutex_unlock(&unix_gc_local_lock);
 }
Index: linux-2.6.22-rc2/include/net/af_unix.h
===
--- linux-2.6.22-rc2.orig/include/net/af_unix.h	2007-04-26 05:08:32.0 +0200
+++ linux-2.6.22-rc2/include/net/af_unix.h	2007-06-04 09:13:56.0 +0200
@@ -14,6 +14,7 @@ extern void unix_gc(void);
 
 extern struct hlist_head unix_socket_table[UNIX_HASH_SIZE + 1];
 extern spinlock_t unix_table_lock;
+extern struct rw_semaphore unix_gc_sem;
 
 extern atomic_t unix_tot_inflight;
 
Index: linux-2.6.22-rc2/net/unix/af_unix.c
===
--- linux-2.6.22-rc2.orig/net/unix/af_unix.c	2007-06-03 23:58:11.0 +0200
+++ linux-2.6.22-rc2/net/unix/af_unix.c	2007-06-04 11:04:15.0 +0200
@@ -1572,6 +1572,7 @@ static int

Re: [PATCH] fix race in AF_UNIX

2007-06-04 Thread Miklos Szeredi
(resend, sorry, fscked up the address list)

 A recv() on an AF_UNIX, SOCK_STREAM socket can race with a
 send()+close() on the peer, causing recv() to return zero, even though
 the sent data should be received.
 
 This happens if the send() and the close() are performed between
 skb_dequeue() and checking sk->sk_shutdown in unix_stream_recvmsg():
 
 process A  skb_dequeue() returns NULL, there's no data in the socket queue
 process B  new data is inserted onto the queue by unix_stream_sendmsg()
 process B  sk->sk_shutdown is set to SHUTDOWN_MASK by unix_release_sock()
 process A  sk->sk_shutdown is checked, unix_release_sock() returns zero

This is only part of the story.  It turns out there are other races
involving the garbage collector that can throw away perfectly good
packets with AF_UNIX sockets in them.

The problems arise when a socket goes from installed to in-flight or
vice versa during garbage collection.  Since gc is done with a
spinlock held, this only shows up on SMP.

The following patch fixes it for me, but it's possibly the wrong
approach.

Signed-off-by: Miklos Szeredi [EMAIL PROTECTED]
---

Index: linux-2.6.22-rc2/net/unix/garbage.c
===
--- linux-2.6.22-rc2.orig/net/unix/garbage.c	2007-06-03 23:58:11.0 +0200
+++ linux-2.6.22-rc2/net/unix/garbage.c	2007-06-04 11:39:42.0 +0200
@@ -90,6 +90,7 @@
 static struct sock *gc_current = GC_HEAD; /* stack of objects to mark */
 
 atomic_t unix_tot_inflight = ATOMIC_INIT(0);
+DECLARE_RWSEM(unix_gc_sem);
 
 
 static struct sock *unix_get_socket(struct file *filp)
@@ -169,7 +170,7 @@ static void maybe_unmark_and_push(struct
 
 void unix_gc(void)
 {
-   static DEFINE_MUTEX(unix_gc_sem);
+   static DEFINE_MUTEX(unix_gc_local_lock);
int i;
struct sock *s;
struct sk_buff_head hitlist;
@@ -179,9 +180,22 @@ void unix_gc(void)
 *  Avoid a recursive GC.
 */
 
-	if (!mutex_trylock(&unix_gc_sem))
+	if (!mutex_trylock(&unix_gc_local_lock))
return;
 
+
+   /*
+* unix_gc_sem protects against sockets going from in-flight to
+* installed
+*
+* Can't sleep on this, because skb_recv_datagram could be
+* waiting for a packet that is to be sent by the thread which
+* invoked the gc
+*/
+	if (!down_write_trylock(&unix_gc_sem)) {
+		mutex_unlock(&unix_gc_local_lock);
+		return;
+	}
	spin_lock(&unix_table_lock);
 
forall_unix_sockets(i, s)
@@ -207,8 +221,6 @@ void unix_gc(void)
 
forall_unix_sockets(i, s)
{
-   int open_count = 0;
-
/*
 *  If all instances of the descriptor are not
 *  in flight we are in use.
@@ -218,10 +230,20 @@ void unix_gc(void)
 *  In this case (see unix_create1()) we set artificial
 *  negative inflight counter to close race window.
 *  It is trick of course and dirty one.
+*
+*  Get the inflight counter first, then the open
+*  counter.  This avoids problems if racing with
+*  sendmsg
+*
+*  If just created socket is not yet attached to
+*  a file descriptor, assume open_count of 1
 */
+	int inflight_count = atomic_read(&unix_sk(s)->inflight);
+	int open_count = 1;
+
	if (s->sk_socket && s->sk_socket->file)
		open_count = file_count(s->sk_socket->file);
-	if (open_count > atomic_read(&unix_sk(s)->inflight))
+	if (open_count > inflight_count)
maybe_unmark_and_push(s);
}
 
@@ -302,11 +324,12 @@ void unix_gc(void)
	u->gc_tree = GC_ORPHAN;
	}
	spin_unlock(&unix_table_lock);
+	up_write(&unix_gc_sem);
 
/*
 *  Here we are. Hitlist is filled. Die.
 */
 
	__skb_queue_purge(&hitlist);
-	mutex_unlock(&unix_gc_sem);
+	mutex_unlock(&unix_gc_local_lock);
 }
Index: linux-2.6.22-rc2/include/net/af_unix.h
===
--- linux-2.6.22-rc2.orig/include/net/af_unix.h	2007-04-26 05:08:32.0 +0200
+++ linux-2.6.22-rc2/include/net/af_unix.h	2007-06-04 09:13:56.0 +0200
@@ -14,6 +14,7 @@ extern void unix_gc(void);
 
 extern struct hlist_head unix_socket_table[UNIX_HASH_SIZE + 1];
 extern spinlock_t unix_table_lock;
+extern struct rw_semaphore unix_gc_sem;
 
 extern atomic_t unix_tot_inflight;
 
Index: linux-2.6.22-rc2/net/unix/af_unix.c
===
--- linux-2.6.22-rc2.orig/net/unix/af_unix.c	2007-06-03 23:58:11.0 +0200
+++ linux-2.6.22-rc2/net/unix/af_unix.c	2007-06-04 11:04:15.0

[PATCH] file descriptor loss while receiving SCM_RIGHTS

2006-10-08 Thread Miklos Szeredi
If more than one file descriptor was sent with an SCM_RIGHTS message,
and on the receiving end, after installing a nonzero number (but not
all) of the file descriptors, the process runs out of fds, then the
already installed fds will be lost (userspace will have no way of
knowing about them).

The following patch makes sure that at least the already installed
fds are sent to userspace.  It doesn't solve the issue of losing file
descriptors in case of an EFAULT on the userspace buffer.

Signed-off-by: Miklos Szeredi [EMAIL PROTECTED]

---
Index: linux/net/core/scm.c
===
--- linux.orig/net/core/scm.c   2006-10-08 12:41:14.0 +0200
+++ linux/net/core/scm.c2006-10-08 12:41:46.0 +0200
@@ -245,8 +245,7 @@ void scm_detach_fds(struct msghdr *msg, 
	if (i > 0)
	{
		int cmlen = CMSG_LEN(i*sizeof(int));
-		if (!err)
-			err = put_user(SOL_SOCKET, &cm->cmsg_level);
+		err = put_user(SOL_SOCKET, &cm->cmsg_level);
		if (!err)
			err = put_user(SCM_RIGHTS, &cm->cmsg_type);
		if (!err)
Index: linux/net/compat.c
===
--- linux.orig/net/compat.c 2006-10-08 12:43:41.0 +0200
+++ linux/net/compat.c  2006-10-08 12:43:47.0 +0200
@@ -285,8 +285,7 @@ void scm_detach_fds_compat(struct msghdr
 
	if (i > 0) {
		int cmlen = CMSG_COMPAT_LEN(i * sizeof(int));
-		if (!err)
-			err = put_user(SOL_SOCKET, &cm->cmsg_level);
+		err = put_user(SOL_SOCKET, &cm->cmsg_level);
		if (!err)
			err = put_user(SCM_RIGHTS, &cm->cmsg_type);
		if (!err)
