Re: strange crashes in tcp_poll() via epoll_wait

2013-07-19 Thread Eric Wong
Eric Dumazet  wrote:
> On Fri, 2013-07-19 at 23:50 +, Eric Wong wrote:
> > Eric Dumazet  wrote:
> > > Hi Al
> > > 
> > > I tried to debug strange crashes in tcp_poll() called from
> > > sys_epoll_wait() -> sock_poll()
> > > 
> > > The symptom is that sock->sk is NULL and we therefore dereference a NULL
> > > pointer.
> > > 
> > > These crashes are really rare, but it would still be nice to understand
> > > where the bug is. Presumably the latest kernels would crash in sock_poll()
> > > because of the sk_can_busy_loop(sock->sk) call.
> > > 
> > > We do test sock->sk being NULL in sock_fasync(), but epoll should be
> > > safe because of the existing synchronization (epmutex)?
> > 
> > It should be safe because of ep->mtx, actually, as epmutex is not taken
> > in sys_epoll_wait.
> 
> Hmm, it might be more complex than that for multi-threaded programs:
> 
> eventpoll_release_file()
> 
> The problem might be that a thread closes a socket while an event is
> still queued for it.

But ep->mtx is also held when traversing the ready list with
ep_send_events_proc.

Can sock->sk somehow be NULL before hitting eventpoll_release_file?
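
To make the locking argument concrete, here is a minimal userspace model
of it. It is only a sketch: the struct and function names are
hypothetical, a pthread mutex stands in for ep->mtx, and none of this is
the actual fs/eventpoll.c code. It assumes, as described above, that the
ready-list traversal and the release path both run under the same mutex.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-ins for an epoll item and an eventpoll instance. */
struct item_model {
    int fd;
    struct item_model *next;
};

struct ep_model {
    pthread_mutex_t mtx;        /* plays the role of ep->mtx */
    struct item_model *rdllist; /* plays the role of the ready list */
};

/* Models event delivery: walk the ready list with mtx held. */
int ep_model_send_events(struct ep_model *ep)
{
    int n = 0;

    pthread_mutex_lock(&ep->mtx);
    for (struct item_model *it = ep->rdllist; it != NULL; it = it->next)
        n++;
    pthread_mutex_unlock(&ep->mtx);
    return n;
}

/* Models the release path: unlink every item for fd under the same mtx. */
int ep_model_release_fd(struct ep_model *ep, int fd)
{
    int removed = 0;

    pthread_mutex_lock(&ep->mtx);
    struct item_model **pp = &ep->rdllist;
    while (*pp != NULL) {
        if ((*pp)->fd == fd) {
            *pp = (*pp)->next;  /* unlink; the caller owns the storage */
            removed++;
        } else {
            pp = &(*pp)->next;
        }
    }
    pthread_mutex_unlock(&ep->mtx);
    return removed;
}
```

In this model the two paths can never interleave, so the traversal never
sees a half-removed item. If the real crash is a race, it would have to
involve something that happens outside this mutex.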

> > I took a look at this but have not found anything.  I've yet to see
> > this on my machines.
> > 
> > When did you start noticing this?
> 
> Hard to say, but we have these crashes on a 3.3+ based kernel.

So I don't think any of my epoll changes caused it.  Phew!

> The probability of these crashes is very, very low.

This still worries me since I rely heavily on multi-threaded epoll.  I
don't have a lot of cores/CPUs, though, so maybe it's harder to trigger
any potential race as a result...
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: strange crashes in tcp_poll() via epoll_wait

2013-07-19 Thread Eric Dumazet
On Fri, 2013-07-19 at 23:50 +, Eric Wong wrote:
> Eric Dumazet  wrote:
> > Hi Al
> > 
> > I tried to debug strange crashes in tcp_poll() called from
> > sys_epoll_wait() -> sock_poll()
> > 
> > The symptom is that sock->sk is NULL and we therefore dereference a NULL
> > pointer.
> > 
> > > These crashes are really rare, but it would still be nice to understand
> > > where the bug is. Presumably the latest kernels would crash in sock_poll()
> > > because of the sk_can_busy_loop(sock->sk) call.
> > > 
> > > We do test sock->sk being NULL in sock_fasync(), but epoll should be
> > > safe because of the existing synchronization (epmutex)?
> 
> It should be safe because of ep->mtx, actually, as epmutex is not taken
> in sys_epoll_wait.

Hmm, it might be more complex than that for multi-threaded programs:

eventpoll_release_file()

The problem might be that a thread closes a socket while an event is
still queued for it.
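
A rough userspace sketch of that access pattern follows, in case it
helps anyone trying to trigger it. Everything here is an assumption for
illustration: the helper names are made up, and it uses an AF_UNIX
socketpair to stay short, so it would not reach tcp_poll(); a TCP
loopback connection would be needed for that. Given how rare the crashes
are, it is not expected to reproduce anything reliably.

```c
#include <pthread.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical context shared with the waiter thread. */
struct race_ctx {
    int epfd;
    int rounds;
};

/* Waiter: sits in epoll_wait() while the other thread closes sockets. */
static void *waiter(void *arg)
{
    struct race_ctx *ctx = arg;
    struct epoll_event ev;

    for (int i = 0; i < ctx->rounds; i++)
        epoll_wait(ctx->epfd, &ev, 1, 1); /* 1 ms timeout, result ignored */
    return NULL;
}

/* Returns 0 if every round ran, -1 on a setup or I/O failure. */
int run_close_race(int rounds)
{
    struct race_ctx ctx = { epoll_create1(0), rounds };
    pthread_t tid;
    int rc = 0;

    if (ctx.epfd < 0)
        return -1;
    if (pthread_create(&tid, NULL, waiter, &ctx) != 0) {
        close(ctx.epfd);
        return -1;
    }

    for (int i = 0; i < rounds; i++) {
        int sv[2];
        struct epoll_event ev = {0};

        if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
            rc = -1;
            break;
        }
        ev.events = EPOLLIN;
        ev.data.fd = sv[0];
        if (epoll_ctl(ctx.epfd, EPOLL_CTL_ADD, sv[0], &ev) < 0)
            rc = -1;
        /* Make sv[0] readable so an event is queued for it. */
        else if (write(sv[1], "x", 1) != 1)
            rc = -1;
        /* Close while the event may still be on the ready list. */
        close(sv[0]);
        close(sv[1]);
        if (rc != 0)
            break;
    }

    pthread_join(tid, NULL);
    close(ctx.epfd);
    return rc;
}
```

Calling run_close_race() with a large round count from several
processes at once, on a machine with many cores, is probably the most
realistic way to stress this path from userspace.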


> 
> I took a look at this but have not found anything.  I've yet to see
> this on my machines.
> 
> When did you start noticing this?

Hard to say, but we have these crashes on a 3.3+ based kernel.

The probability of these crashes is very, very low.




Re: strange crashes in tcp_poll() via epoll_wait

2013-07-19 Thread Eric Wong
Eric Dumazet  wrote:
> Hi Al
> 
> I tried to debug strange crashes in tcp_poll() called from
> sys_epoll_wait() -> sock_poll()
> 
> The symptom is that sock->sk is NULL and we therefore dereference a NULL
> pointer.
> 
> These crashes are really rare, but it would still be nice to understand
> where the bug is. Presumably the latest kernels would crash in sock_poll()
> because of the sk_can_busy_loop(sock->sk) call.
> 
> We do test sock->sk being NULL in sock_fasync(), but epoll should be
> safe because of the existing synchronization (epmutex)?

It should be safe because of ep->mtx, actually, as epmutex is not taken
in sys_epoll_wait.

I took a look at this but have not found anything.  I've yet to see
this on my machines.

When did you start noticing this?


strange crashes in tcp_poll() via epoll_wait

2013-07-19 Thread Eric Dumazet
Hi Al

I tried to debug strange crashes in tcp_poll() called from
sys_epoll_wait() -> sock_poll()

The symptom is that sock->sk is NULL and we therefore dereference a NULL
pointer.

These crashes are really rare, but it would still be nice to understand
where the bug is. Presumably the latest kernels would crash in sock_poll()
because of the sk_can_busy_loop(sock->sk) call.

We do test sock->sk being NULL in sock_fasync(), but epoll should be
safe because of the existing synchronization (epmutex)?
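
For readers unfamiliar with that kind of test, here is a simplified
userspace model of such a guard. The struct and function names are
hypothetical stand-ins, not the kernel's types, and this is not a
proposed fix: a check like this in the poll path would only avoid the
oops, it would not explain how sock->sk became NULL in the first place.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct sock and struct socket. */
struct sock_model {
    unsigned int ready_mask;
};

struct socket_model {
    struct sock_model *sk; /* NULL once the socket has been released */
};

/*
 * Guarded poll: refuse to touch sk if it is already gone, instead of
 * dereferencing a NULL pointer.
 */
int poll_guarded(const struct socket_model *sock, unsigned int *mask_out)
{
    const struct sock_model *sk = sock->sk;

    if (sk == NULL)
        return -EINVAL;

    *mask_out = sk->ready_mask;
    return 0;
}
```

With a live socket_model the function fills in the mask and returns 0;
with sk set to NULL it returns -EINVAL and leaves the mask untouched.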

Any idea?

Thanks!

