Re: [Ksummit-discuss] RFC: create mailing list "linux-issues" focussed on issues/bugs and regressions

2021-03-23 Thread Eric Wong
Konstantin Ryabitsev  wrote:
> On Tue, Mar 23, 2021 at 09:30:33AM -0700, James Bottomley wrote:
> > > I think the bulk of user issues are going to be regressions. Although
> > > you may be in a better position to know for sure, but at least for
> > > me, wearing my "user" hat, the thing that gets me the most is
> > > upgrading to a new kernel and suddenly something that use to work no
> > > longer does. And that is the definition of a regression. My test
> > > boxes still run old distros (one is running fedora 13). These are the
> > > boxes that catch the most issues, and if they do, they are pretty
> > > much guaranteed to be a regression.
> > > 
> > > I like the "linux-regressions" mailing list idea.
> > 
> > Can't we use the fancy features of public inbox to get the best of both
> > worlds?  Have the bug list (or even a collection of lists) but make the
> > linux-regressions one a virtual list keying off an imap flag which a
> > group of people control.  That way anything that is flagged as a
> > regression appears in that public inbox.  I assume the search can be
> > quite wide so we could flag a regression on any list indexed by lore?

The lei (local email interface) data model will have "labels"(*)
and developers will be able to publish mail with it via static
HTML/Atom/JSON feed via cronjob and whatnot.

> There's a number of ways we can accomplish this, sure.
> 
> However, this functionality is not in production yet, and I'm not sure which
> upcoming public-inbox features we'll be implementing as a public
> lore.kernel.org service,

> which ones we'll only offer to kernel.org account holders,

lei could offer read-write JMAP support; either as a CGI tied
to Unix user accounts or some virtual user system.  Some fixes
I'm currently making to speed up the test suite will also make
it more suitable for a largish virtual user system.

> and which ones should really be running locally by developers
> themselves.

lei will be MY dream mail/git tool that fills in the gaps
left by other tools; I hope it can make others happy, too :)

> So, I don't want to say either yes or no to this one for the fear of
> over-promising. I guess this is why I'm not in sales. :)

Heh, same here.  Once I start using lei to handle all of my mail
and there's a data-loss bug, I could conceivably never know
about it because the bug reports would be lost... :x


[1] "labels" are "mailboxes" in JMAP-speak; and lei's per-user data
model will be tied to JMAP.  "keywords" are Maildir/IMAP-system
flags (seen/flagged/answered/...).  JMAP doesn't allow arbitrary
keywords, but does allow arbitrary labels.


Re: RFC: create mailing list "linux-issues" focussed on issues/bugs and regressions

2021-03-22 Thread Eric Wong
James Bottomley  wrote:
> On Mon, 2021-03-22 at 13:16 -0400, Konstantin Ryabitsev wrote:
> > On Mon, Mar 22, 2021 at 04:18:14PM +0100, Thorsten Leemhuis wrote:
> > > Note, there is a second reason why ksummit-discuss is CCed: another
> > > reason why I want to create this new list is making it easier to
> > > find and track regressions reported to our various mailing lists
> > > (often without LKML in CC, as for some subsystems it's seems to be
> > > custom to not CC it). 
> > 
> > FYI, there will soon be a unified "search all of lore.kernel.org
> > regardless of the list/feed source" capability that may make it
> > unnecessary to create a separate list for this purpose. There's
> > active ongoing work in the public-inbox project to provide parallel
> > ways to follow aggregate topics, including query-based subscriptions
> > (i.e. "put a thread into my inbox whenever someone mentions my
> > favourite file/function/device name"). This work is not complete yet,
> > but I have great hopes that it will become available in the next
> > little while.

Yes, making progress and learning new tricks to make the WWW UI faster :>

> > Once we have this ability, we should be able to plug in multiple
> > sources beyond just mailing lists, including a feed of all
> > bugzilla.kernel.org changes. This should allow someone an easy way to
> > query specific bugs and may not require the creation of a separate
> > list.
> > 
> > I'm not opposed to the creation of a new list, of course -- just want
> > to make sure it's aligned with the improvements we are working to
> > make available.
> 
> I suspect the problem is that there's no known useful search string to
> find a bug report even given a searchable set of lists, so the main
> purpose of this list would be "if it's on here, it's a bug report" and
> the triage team can cc additional lists as appropriate.  Then we simply
> tell everyone to send kernel bugs to this list and ask maintainers to
> cc it if a bug report shows up on their list?

It seems having "bug" or "regression" in the subject could be sufficient?

"s:Regression" or "s:Bug" can be used to query messages reasonably
quickly:

https://80x24.org/lore/all/?q=s:Bug || https://yhbt.net/lore/all/?q=s:Bug
http://rskvuqcfnfizkjg6h5jvovwb3wkikzcwskf54lfpymus6mxrzw67b5ad.onion/all/?q=s:Bug


Re: Why the edge-triggered mode doesn't work for epoll file descriptor?

2019-08-26 Thread Eric Wong
Heiher  wrote:
> Hello,
> 
> I've added a pipe file descriptor (fd1) to an epoll (fd3) with
> EPOLLOUT in edge-triggered mode, and then added the fd3 to another
> epoll (fd4) with EPOLLIN in edge-triggered too.
> 
> Next, waiting for fd4 without timeout. When fd1 to be writable, i
> think epoll_wait(fd4, ...)  only return once, because all file
> descriptors are added in edge-triggered mode.
> 
> But, the actual result is returns many and many times until do once
> eopll_wait(fd3, ...).

It looks like you can trigger a wakeup loop with printf writing
to the terminal (not a pipe), and that write to the terminal
triggering the EPOLLOUT wakeup over and over again.

I don't know TTY stuff at all, but I assume it's intended
for terminals.

You refer to "pipe file descriptor (fd1)", but I can't reproduce
the error when running your code piped to "tee" and using
strace to check epoll_wait returns.

"strace ./foo | tee /dev/null" only shows one epoll_wait returning.

> e.events = EPOLLIN | EPOLLET;
> e.data.u64 = 1;
> if (epoll_ctl (efd[0], EPOLL_CTL_ADD, efd[1], ) < 0)
> return -3;
> 
> e.events = EPOLLOUT | EPOLLET;
> e.data.u64 = 2;
> if (epoll_ctl (efd[1], EPOLL_CTL_ADD, 1, ) < 0)
> return -4;

Since epfd[1] is waiting for stdout...

> for (;;) {
> struct epoll_event events[16];
> int nfds;
> 
> nfds = epoll_wait (efd[0], events, 16, -1);
> printf ("nfds: %d\n", nfds);

Try outputting your message to stderr instead of stdout:

fprintf(stderr, "nfds: %d\n", nfds);

And then run your program so stdout and stderr point to
different files:

./foo | tee /dev/null

(so stdout becomes a pipe, and stderr remains your terminal)


Re: [PATCH v4 00/14] epoll: support pollable epoll from userspace

2019-07-27 Thread Eric Wong
Andrew Morton  wrote:
> On Tue, 11 Jun 2019 16:54:44 +0200 Roman Penyaev  wrote:
> > In order to measure polling from userspace libevent was modified [1] and
> > bench_http benchmark (client and server) was used:
> > 
> >  o EPOLLET, original epoll:
> > 
> > 2 requests in 0.551306 sec. (36277.49 throughput)
> > Each took about 5.54 msec latency
> > 160bytes read. 0 errors.
> > 
> >  o EPOLLET + polling from userspace:
> > 
> > 2 requests in 0.475585 sec. (42053.47 throughput)
> > Each took about 4.78 msec latency
> > 160bytes read. 0 errors.
> 
> It would be useful to include some words which describe the
> significance of this to real-world userspace.  If a real application is
> sped up 0.1% then this isn't very exciting ;)

Agreed.  I've put my wfcqueue changes from years back on hold
because I couldn't show a meaningful improvement in real-world
cases: https://lore.kernel.org/lkml/20130401183118.ga9...@dcvr.yhbt.net/

Roman's changes have me interested in seeing how my previous
changes would stack up (no UAPI changes required).
I've been looking for time to forward port my wfcqueue work
to the current kernel (but public-inbox takes priority;
not that I have much time for that).

On the userspace side; I'm not sure any widely-used open source
project is really impacted by epoll performance...
Most everybody seems to use level-trigger :<


Re: [PATCH v5 13/14] epoll: implement epoll_create2() syscall

2019-06-25 Thread Eric Wong
Roman Penyaev  wrote:
> epoll_create2() is needed to accept EPOLL_USERPOLL flags
> and size, i.e. this patch wires up polling from userspace.

Instead of adding a new syscall, is setting size (and/or even
the EPOLL_USEREPOLL flag) something that could be done via
ioctl?

There's no race like CLOEXEC to worry about and it's not
going to be in a hot path where the extra syscall matters.

glibc won't need to increase in .so size, either.


Re: [PATCH v5 00/14] epoll: support pollable epoll from userspace

2019-06-24 Thread Eric Wong
Roman Penyaev  wrote:
> Hi all,

+cc Jason Baron
 
> ** Limitations



> 4. No support for EPOLLEXCLUSIVE
>  If device does not pass pollflags to wake_up() there is no way to
>  call poll() from the context under spinlock, thus special work is
>  scheduled to offload polling.  In this specific case we can't
>  support exclusive wakeups, because we do not know actual result
>  of scheduled work and have to wake up every waiter.

Lacking EPOLLEXCLUSIVE support is probably a showstopper for
common applications using per-task epoll combined with
non-blocking accept4() (e.g. nginx).


Fwiw, I'm still a weirdo who prefers a dedicated thread doing
blocking accept4 for distribution between tasks (so epoll never
sees a listen socket).  But, depending on what runtime/language
I'm using, I can't always dedicate a blocking thread, so I
recently started using EPOLLEXCLUSIVE from Perl5 where I
couldn't rely on threads being available.


If I could dedicate time to improving epoll; I'd probably
add writev() support for batching epoll_ctl modifications
to reduce syscall traffic, or pick-up the kevent()-like interface
started long ago:
https://lore.kernel.org/lkml/1393206162-18151-1-git-send-email-n1ght.4nd@gmail.com/
(but I'm not sure I want to increase the size of the syscall table).


Re: [PATCH] signal: remove the wrong signal_pending() check in restore_user_sigmask()

2019-06-04 Thread Eric Wong
Linus Torvalds  wrote:
> On Tue, Jun 4, 2019 at 6:41 AM Oleg Nesterov  wrote:
> >
> > This is the minimal fix for stable, I'll send cleanups later.
> 
> Ugh. I htink this is correct, but I wish we had a better and more
> intuitive interface.

I had the same thoughts, but am not a regular kernel hacker,
so I didn't say anything earlier.

> In particular, since restore_user_sigmask() basically wants to check
> for "signal_pending()" anyway (to decide if the mask should be
> restored by signal handling or by that function), I really get the
> feeling that a lot of these patterns like
> 
> > -   restore_user_sigmask(ksig.sigmask, );
> > -   if (signal_pending(current) && !ret)
> > +
> > +   interrupted = signal_pending(current);
> > +   restore_user_sigmask(ksig.sigmask, , interrupted);
> > +   if (interrupted && !ret)
> > ret = -ERESTARTNOHAND;
> 
> are wrong to begin with, and we really should aim for an interface
> which says "tell me whether you completed the system call, and I'll
> give you an error return if not".
> 
> How about we make restore_user_sigmask() take two return codes: the
> 'ret' we already have, and the return we would get if there is a
> signal pending and w're currently returning zero.
> 
> IOW, I think the above could become
> 
> ret = restore_user_sigmask(ksig.sigmask, , ret, 
> -ERESTARTHAND);
> 
> instead if we just made the right interface decision.

But that falls down if ret were ever expected to match several
similar error codes (not sure if it happens)

When I was considering fixing this on my own a few weeks ago, I
was looking for an inline that could quickly tell if `ret' was
any of the EINTR-like error codes; but couldn't find one...

It'd probably end up being switch/case statement so I'm not sure
if it'd be too big and slow or not...

The caller would just do:

ret = restore_user_sigmask(ksig.sigmask, , ret);

And restore_user_sigmask would call some "was_interrupted(ret)"
inline which could return true if `ret' matched any of the
too-many-to-keep-track-of EINTR-like codes.  But I figured
there's probably a good reason it did not exist, already *shrug*

/me goes back to the wonderful world of userspace...


Re: [PATCH] signal: remove the wrong signal_pending() check in restore_user_sigmask()

2019-06-04 Thread Eric Wong
Oleg Nesterov  wrote:
> This is the minimal fix for stable, I'll send cleanups later.
> 
> The commit 854a6ed56839a40f6b5d02a2962f48841482eec4 ("signal: Add
> restore_user_sigmask()") introduced the visible change which breaks
> user-space: a signal temporary unblocked by set_user_sigmask() can
> be delivered even if the caller returns success or timeout.
> 
> Change restore_user_sigmask() to accept the additional "interrupted"
> argument which should be used instead of signal_pending() check, and
> update the callers.
> 
> Reported-by: Eric Wong 
> Fixes: 854a6ed56839a40f6b5d02a2962f48841482eec4 ("signal: Add 
> restore_user_sigmask()")
> cc: sta...@vger.kernel.org (v5.0+)
> Signed-off-by: Oleg Nesterov 

Thanks, for epoll_pwait on top of Linux v5.1.7 and cmogstored v1.7.0:

Tested-by: Eric Wong 

(cmogstored v1.7.1 already works around this when it sees a 0
return value (but not >0, yet...))

> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 0fbb486..1147c5d 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -2201,11 +2201,12 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, 
> int min_events,
>   }
>  
>   ret = wait_event_interruptible(ctx->wait, io_cqring_events(ring) >= 
> min_events);
> - if (ret == -ERESTARTSYS)
> - ret = -EINTR;
>  
>   if (sig)
> - restore_user_sigmask(sig, );
> + restore_user_sigmask(sig, , ret == -ERESTARTSYS);
> +
> + if (ret == -ERESTARTSYS)
> + ret = -EINTR;
>  
>   return READ_ONCE(ring->r.head) == READ_ONCE(ring->r.tail) ? ret : 0;
>  }

That io_uring bit didn't apply cleanly to stable,
since stable is missing fdb288a679cdf6a71f3c1ae6f348ba4dae742681
("io_uring: use wait_event_interruptible for cq_wait conditional wait")
and related commits.

In any case, I'm not using io_uring anywhere, yet (and probably
won't, since I'll still need threads to deal with open/unlink/rename
on slow JBOD HDDs).


Re: pselect/etc semantics

2019-05-30 Thread Eric Wong
"Eric W. Biederman"  wrote:
> Frankly the only reason this appears to be worth touching is that we
> have a userspace regression.  Otherwise I would call the current
> behavior more correct and desirable than ignoring the signal longer.
> 
> If I am reading sitaution properly I suggest we go back to resoring the
> sigmask by hand in epoll_pwait, and put in a big fat comment about a
> silly userspace program depending on that behavior.
> 
> That will be easier to backport and it will just fix the regression and
> not overfix the problem for reasonable applications.

Fwiw, the cmogstored (userspace program which regressed) only
hit this regression in an obscure code path for tuning; that
code path probably sees no use outside of the test suite.

Add to that, cmogstored is an unofficial and optional component
to the obscure, largely-forgotten MogileFS.

Finally, the main users of cmogstored are on FreeBSD.  They
hit the kqueue+self-pipe code path instead of epoll_pwait.

I only used epoll_pwait on Linux since I figured I could save a
few bytes of memory by skipping eventfd/self-pipe...

Anyways, I updated cmogstored a few weeks back to call `note_run'
(the signal dispatcher) if epoll_pwait (wrapped by `mog_idleq_wait_intr')
returns 0 instead of -1 (EINTR) to workaround this kernel change:

https://bogomips.org/cmogstored-public/20190511075630.17811-...@80x24.org/

I could easily make a change to call `note_run' unconditionally
when `mog_idleq_wait_intr' returns, too.

But that said, there could be other users hitting the same
problem I did.  So maybe cmogstored's primary use on Linux these
days is finding regressions :>


Re: pselect/etc semantics (Was: [PATCH v2] signal: Adjust error codes according to restore_user_sigmask())

2019-05-29 Thread Eric Wong
David Laight  wrote:
> From: Oleg Nesterov
> > Sent: 29 May 2019 17:12
> > Al, Linus, Eric, please help.
> > 
> > The previous discussion was very confusing, we simply can not understand 
> > each
> > other.
> > 
> > To me everything looks very simple and clear, but perhaps I missed something
> > obvious? Please correct me.
> > 
> > I think that the following code is correct
> > 
> > int interrupted = 0;
> > 
> > void sigint_handler(int sig)
> > {
> > interrupted = 1;
> > }
> > 
> > int main(void)
> > {
> > sigset_t sigint, empty;
> > 
> > sigemptyset();
> > sigaddset(, SIGINT);
> > sigprocmask(SIG_BLOCK, , NULL);
> > 
> > signal(SIGINT, sigint_handler);
> > 
> > sigemptyset();// so pselect() unblocks SIGINT
> > 
> > ret = pselect(..., );
> ^ sigint
> > 
> > if (ret >= 0)   // sucess or timeout
> > assert(!interrupted);
> > 
> > if (interrupted)
> > assert(ret == -EINTR);
> > }
> > 
> > IOW, if pselect(sigmask) temporary unblocks SIGINT according to sigmask, 
> > this
> > signal should not be delivered if a ready fd was found or timeout. The 
> > signal
> > handle should only run if ret == -EINTR.
> 
> Personally I think that is wrong.
> Given code like the above that has:
>   while (!interrupted) {
>   pselect(..., );
>   // process available data
>   }
> 
> You want the signal handler to be executed even if one of the fds
> always has available data.
> Otherwise you can't interrupt a process that is always busy.

Agreed...  I believe cmogstored has always had a bug in the way
it uses epoll_pwait because it failed to check interrupts if:

a) an FD is ready + interrupt
b) epoll_pwait returns 0 on interrupt

The bug remains in userspace for a), which I will fix by adding
an interrupt check when an FD is ready.  The window is very
small for a) and difficult to trigger, and also in a rare code
path.

The b) case is the kernel bug introduced in 854a6ed56839a40f
("signal: Add restore_user_sigmask()").

I don't think there's any disagreement that b) is a kernel bug.

So the confusion is for a), and POSIX is not clear w.r.t. how
pselect/poll works when there's both FD readiness and an
interrupt.

Thus I'm inclined to believe *select/*poll/epoll_*wait should
follow POSIX read() semantics:

   If a read() is interrupted by a signal before it reads any data, it shall
   return −1 with errno set to [EINTR].

   If  a  read()  is  interrupted by a signal after it has successfully read
   some data, it shall return the number of bytes read.

> One option is to return -EINTR if a signal is pending when the mask
> is updated - before even looking at anything else.
>
> Signals that happen later on (eg after a timeout) need not be reported
> (until the next time around the loop).

I'm not sure that's necessary and it would cause delays in
signal handling.


Re: [PATCH 1/1] signal: Adjust error codes according to restore_user_sigmask()

2019-05-21 Thread Eric Wong
Andrew Morton  wrote:
> On Tue, 21 May 2019 09:25:51 +0000 Eric Wong  wrote:
> 
> > Deepa Dinamani  wrote:
> > > For all the syscalls that receive a sigmask from the userland,
> > > the user sigmask is to be in effect through the syscall execution.
> > > At the end of syscall, sigmask of the current process is restored
> > > to what it was before the switch over to user sigmask.
> > > But, for this to be true in practice, the sigmask should be restored
> > > only at the the point we change the saved_sigmask. Anything before
> > > that loses signals. And, anything after is just pointless as the
> > > signal is already lost by restoring the sigmask.
> > > 
> > > The inherent issue was detected because of a regression caused by
> > > 854a6ed56839a.
> > > The patch moved the signal_pending() check closer to restoring of the
> > > user sigmask. But, it failed to update the error code accordingly.
> > > 
> > > Detailed issue discussion permalink:
> > > https://lore.kernel.org/linux-fsdevel/20190427093319.sgicqik2oqkez3wk@dcvr/
> > > 
> > > Note that the patch returns interrupted errors (EINTR, ERESTARTNOHAND,
> > > etc) only when there is no other error. If there is a signal and an error
> > > like EINVAL, the syscalls return -EINVAL rather than the interrupted
> > > error codes.
> > > 
> > > The sys_io_uring_enter() seems to be returning success when there is
> > > a signal and the queue is not empty. This seems to be a bug. I will
> > > follow up with a separate patch for that.
> > > 
> > > Reported-by: Eric Wong 
> > > Fixes: 854a6ed56839a40f6b5d02a2962f48841482eec4 ("signal: Add 
> > > restore_user_sigmask()")
> > > Signed-off-by: Deepa Dinamani 
> > > Reviewed-by: Davidlohr Bueso 
> 
> (top-posting fixed).
> 
> > It's been 2 weeks and this fix hasn't appeared in mmots / mmotm.
> > I also noticed it's missing Cc: for stable@ (below)
> 
> Why is a -stable backport needed?  I see some talk above about lost
> signals but it is unclear whether these are being observed after fixing
> the regression caused by 854a6ed56839a.

I guess Deepa's commit messages wasn't clear...
I suggest prepending this as the first paragraph to Deepa's
original message:

  This fixes a bug introduced with 854a6ed56839a which caused
  EINTR to not be reported to userspace on epoll_pwait.  Failure
  to report EINTR to userspace caused problems with user code
  which relies on EINTR to run signal handlers.

> IOW, can we please have a changelog which has a clear and complete
> description of the user-visible effects of the change.
> 
> And please Cc Oleg.


Re: [PATCH 1/1] signal: Adjust error codes according to restore_user_sigmask()

2019-05-21 Thread Eric Wong
It's been 2 weeks and this fix hasn't appeared in mmots / mmotm.
I also noticed it's missing Cc: for stable@ (below)

Deepa Dinamani  wrote:
> For all the syscalls that receive a sigmask from the userland,
> the user sigmask is to be in effect through the syscall execution.
> At the end of syscall, sigmask of the current process is restored
> to what it was before the switch over to user sigmask.
> But, for this to be true in practice, the sigmask should be restored
> only at the the point we change the saved_sigmask. Anything before
> that loses signals. And, anything after is just pointless as the
> signal is already lost by restoring the sigmask.
> 
> The inherent issue was detected because of a regression caused by
> 854a6ed56839a.
> The patch moved the signal_pending() check closer to restoring of the
> user sigmask. But, it failed to update the error code accordingly.
> 
> Detailed issue discussion permalink:
> https://lore.kernel.org/linux-fsdevel/20190427093319.sgicqik2oqkez3wk@dcvr/
> 
> Note that the patch returns interrupted errors (EINTR, ERESTARTNOHAND,
> etc) only when there is no other error. If there is a signal and an error
> like EINVAL, the syscalls return -EINVAL rather than the interrupted
> error codes.
> 
> The sys_io_uring_enter() seems to be returning success when there is
> a signal and the queue is not empty. This seems to be a bug. I will
> follow up with a separate patch for that.
> 
> Reported-by: Eric Wong 
> Fixes: 854a6ed56839a40f6b5d02a2962f48841482eec4 ("signal: Add 
> restore_user_sigmask()")
> Signed-off-by: Deepa Dinamani 
> Reviewed-by: Davidlohr Bueso 

Cc:  # 5.0.x
Cc:  # 5.1.x

> ---
> 
>  fs/aio.c   | 24 
>  fs/eventpoll.c | 14 ++
>  fs/io_uring.c  |  9 ++---
>  fs/select.c| 37 +
>  include/linux/signal.h |  2 +-
>  kernel/signal.c| 13 ++---
>  6 files changed, 60 insertions(+), 39 deletions(-)
> 
> diff --git a/fs/aio.c b/fs/aio.c
> index 3490d1fa0e16..ebd2b1980161 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -2095,7 +2095,7 @@ SYSCALL_DEFINE6(io_pgetevents,
>   struct __aio_sigset ksig = { NULL, };
>   sigset_tksigmask, sigsaved;
>   struct timespec64   ts;
> - int ret;
> + int ret, signal_detected;
>  
>   if (timeout && unlikely(get_timespec64(, timeout)))
>   return -EFAULT;
> @@ -2108,8 +2108,8 @@ SYSCALL_DEFINE6(io_pgetevents,
>   return ret;
>  
>   ret = do_io_getevents(ctx_id, min_nr, nr, events, timeout ?  : NULL);
> - restore_user_sigmask(ksig.sigmask, );
> - if (signal_pending(current) && !ret)
> + signal_detected = restore_user_sigmask(ksig.sigmask, );
> + if (signal_detected && !ret)
>   ret = -ERESTARTNOHAND;
>  
>   return ret;
> @@ -2128,7 +2128,7 @@ SYSCALL_DEFINE6(io_pgetevents_time32,
>   struct __aio_sigset ksig = { NULL, };
>   sigset_tksigmask, sigsaved;
>   struct timespec64   ts;
> - int ret;
> + int ret, signal_detected;
>  
>   if (timeout && unlikely(get_old_timespec32(, timeout)))
>   return -EFAULT;
> @@ -2142,8 +2142,8 @@ SYSCALL_DEFINE6(io_pgetevents_time32,
>   return ret;
>  
>   ret = do_io_getevents(ctx_id, min_nr, nr, events, timeout ?  : NULL);
> - restore_user_sigmask(ksig.sigmask, );
> - if (signal_pending(current) && !ret)
> + signal_detected = restore_user_sigmask(ksig.sigmask, );
> + if (signal_detected && !ret)
>   ret = -ERESTARTNOHAND;
>  
>   return ret;
> @@ -2193,7 +2193,7 @@ COMPAT_SYSCALL_DEFINE6(io_pgetevents,
>   struct __compat_aio_sigset ksig = { NULL, };
>   sigset_t ksigmask, sigsaved;
>   struct timespec64 t;
> - int ret;
> + int ret, signal_detected;
>  
>   if (timeout && get_old_timespec32(, timeout))
>   return -EFAULT;
> @@ -2206,8 +2206,8 @@ COMPAT_SYSCALL_DEFINE6(io_pgetevents,
>   return ret;
>  
>   ret = do_io_getevents(ctx_id, min_nr, nr, events, timeout ?  : NULL);
> - restore_user_sigmask(ksig.sigmask, );
> - if (signal_pending(current) && !ret)
> + signal_detected = restore_user_sigmask(ksig.sigmask, );
> + if (signal_detected && !ret)
>   ret = -ERESTARTNOHAND;
>  
>   return ret;
> @@ -2226,7 +2226,7 @@ COMPAT_SYSCALL_DEFINE6(io_pgetevents_time64,
>   struct __compat_aio_sigset ksig = { NULL, };
>   sigset_t ksigmask, sigsaved;
>   struct timespec64

Re: [PATCH v3 05/13] epoll: offload polling to a work in case of epfd polled from userspace

2019-05-21 Thread Eric Wong
Roman Penyaev  wrote:
> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> index 81da4571f1e0..9d3905c0afbf 100644
> --- a/fs/eventpoll.c
> +++ b/fs/eventpoll.c
> @@ -44,6 +44,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  
>  /*
> @@ -185,6 +186,9 @@ struct epitem {
>  
>   /* The structure that describe the interested events and the source fd 
> */
>   struct epoll_event event;
> +
> + /* Work for offloading event callback */
> + struct work_struct work;
>  };
>  
>  /*

Can we avoid the size regression for existing epoll users?

> @@ -2547,12 +2601,6 @@ static int __init eventpoll_init(void)
>   ep_nested_calls_init(_safewake_ncalls);
>  #endif
>  
> - /*
> -  * We can have many thousands of epitems, so prevent this from
> -  * using an extra cache line on 64-bit (and smaller) CPUs
> -  */
> - BUILD_BUG_ON(sizeof(void *) <= 8 && sizeof(struct epitem) > 128);
> -
>   /* Allocates slab cache used to allocate "struct epitem" items */
>   epi_cache = kmem_cache_create("eventpoll_epi", sizeof(struct epitem),
>   0, SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, NULL);

Perhaps a "struct uepitem" transparent union and separate slab cache.


Re: [PATCH] signal: Adjust error codes according to restore_user_sigmask()

2019-05-03 Thread Eric Wong
Deepa Dinamani  wrote:
> Sorry, I was trying a new setup at work. I should have tested it.
> My bad, I've checked this one.

Thanks.  This is good w.r.t. epoll_pwait and ppoll when applied
to 5.0.11 (no fs/io_uring.c).

Can't think of anything which uses pselect or aio on my system;
but it looks right to me.

> I've removed the questionable reported-by, since we're not sure if
> it is actually the same issue.

Yes, I hope Omar can test this, too.

Thanks again, all!


Re: Strange issues with epoll since 5.0

2019-05-02 Thread Eric Wong
Deepa Dinamani  wrote:
> Eric,
> Can you please help test this?

Nope, that was _really_ badly whitespace-damaged.
(C'mon, it's not like you're new to this)


Re: Strange issues with epoll since 5.0

2019-05-01 Thread Eric Wong
Deepa Dinamani  wrote:
> So here is my analysis:



> So the 854a6ed56839a40f6 seems to be better than the original code in
> that it detects the signal.

OTOH, does matter to anybody that a signal is detected slightly
sooner than it would've been, otherwise?

> But, the problem is that it doesn't
> communicate it to the userspace.

Yup, that's a big problem :)
 
> So a patch like below solves the problem. This is incomplete. I'll
> verify and send you a proper fix you can test soon. This is just for
> the sake of discussion:
> 
> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> index 4a0e98d87fcc..63a387329c3d 100644
> --- a/fs/eventpoll.c
> +++ b/fs/eventpoll.c
> @@ -2317,7 +2317,7 @@ SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct
> epoll_event __user *, events,
> int, maxevents, int, timeout, const sigset_t __user *, 
> sigmask,
> size_t, sigsetsize)
>  {
> -   int error;
> +   int error, signal_detected;
> sigset_t ksigmask, sigsaved;
> 
> /*
> @@ -2330,7 +2330,10 @@ SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct
> epoll_event __user *, events,
> 
> error = do_epoll_wait(epfd, events, maxevents, timeout);
> 
> -   restore_user_sigmask(sigmask, );
> +   signal_detected = restore_user_sigmask(sigmask, );
> +
> +   if (signal_detected && !error)
> +   return -EITNR;
> 
> return error;

Looks like a reasonable API.

> @@ -2862,7 +2862,7 @@ void restore_user_sigmask(const void __user
> *usigmask, sigset_t *sigsaved)
> if (signal_pending(current)) {
> current->saved_sigmask = *sigsaved;
> set_restore_sigmask();
> -   return;
> +   return 0;

Shouldn't that "return 1" if a signal is pending?


Re: Strange issues with epoll since 5.0

2019-05-01 Thread Eric Wong
Eric Wong  wrote:
> (didn't test AIO, but everything else seems good)

"seems" != "is"

Now that I understand the fix for epoll, the fs/select.c changes
would hit the same problem and not return -EINTR when it should.

I'll let you guys decide how to fix this, but there's definitely
a problem when "(errno == EINTR)" comparisons in userspace
stop working.


Re: Strange issues with epoll since 5.0

2019-04-30 Thread Eric Wong
Eric Wong  wrote:
> Deepa Dinamani  wrote:
> > I'm not sure what the hang in the userspace is about. Is it because
> > the syscall did not return an error or the particular signal was
> > blocked etc.
> 
> Uh, ok; that's less comforting.

Nevermind, I think I understand everything, now.  epoll_pwait
never set errno without our patch.

cmogstored does this in notify.c:

/* wait_intr calls epoll_pwait: */
mfd = mog_idleq_wait_intr(mog_notify_queue, timeout);
if (mfd)
notify_queue_step(mfd);
else if (errno == EINTR) /* EINTR fails to be noticed */
note_run();


Re: Strange issues with epoll since 5.0

2019-04-30 Thread Eric Wong
Deepa Dinamani  wrote:
> I was also not able to reproduce this.
> Arnd and I were talking about this today morning. Here is something
> Arnd noticed:
> 
> If there was a signal after do_epoll_wait(), we never were not
> entering the if (err = -EINTR) at all before.

I'm not sure which `if' statement you're talking about, but ok...

> But, now we do.
> We could try with the below patch:

Wasn't close to applying or being buildable, but I put a
working version together (below).

epoll_pwait wakes up as expected, now :>

> If this works that means we know what is busted.

OK, good to know...

> I'm not sure what the hang in the userspace is about. Is it because
> the syscall did not return an error or the particular signal was
> blocked etc.

Uh, ok; that's less comforting.

> There are also a few timing differences also. But, can we try this first?

Anyways, the following patch works and builds cleanly for me
(didn't test AIO, but everything else seems good)

Thanks!

-8<---
Subject: [PATCH] test fix from Deepa

TBD
---
 fs/aio.c   |  8 
 fs/eventpoll.c |  4 ++--
 fs/select.c| 12 ++--
 include/linux/signal.h |  2 +-
 kernel/signal.c|  6 --
 5 files changed, 17 insertions(+), 15 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 3d9669d011b9..d54513ca11b6 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -2146,7 +2146,7 @@ SYSCALL_DEFINE6(io_pgetevents,
return ret;
 
ret = do_io_getevents(ctx_id, min_nr, nr, events, timeout ?  : NULL);
-   restore_user_sigmask(ksig.sigmask, );
+   restore_user_sigmask(ksig.sigmask, , -1);
if (signal_pending(current) && !ret)
ret = -ERESTARTNOHAND;
 
@@ -2180,7 +2180,7 @@ SYSCALL_DEFINE6(io_pgetevents_time32,
return ret;
 
ret = do_io_getevents(ctx_id, min_nr, nr, events, timeout ?  : NULL);
-   restore_user_sigmask(ksig.sigmask, );
+   restore_user_sigmask(ksig.sigmask, , -1);
if (signal_pending(current) && !ret)
ret = -ERESTARTNOHAND;
 
@@ -2244,7 +2244,7 @@ COMPAT_SYSCALL_DEFINE6(io_pgetevents,
return ret;
 
ret = do_io_getevents(ctx_id, min_nr, nr, events, timeout ?  : NULL);
-   restore_user_sigmask(ksig.sigmask, );
+   restore_user_sigmask(ksig.sigmask, , -1);
if (signal_pending(current) && !ret)
ret = -ERESTARTNOHAND;
 
@@ -2277,7 +2277,7 @@ COMPAT_SYSCALL_DEFINE6(io_pgetevents_time64,
return ret;
 
ret = do_io_getevents(ctx_id, min_nr, nr, events, timeout ?  : NULL);
-   restore_user_sigmask(ksig.sigmask, );
+   restore_user_sigmask(ksig.sigmask, , -1);
if (signal_pending(current) && !ret)
ret = -ERESTARTNOHAND;
 
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index a5d219d920e7..bd84ec54a8fb 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -2247,7 +2247,7 @@ SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct 
epoll_event __user *, events,
 
error = do_epoll_wait(epfd, events, maxevents, timeout);
 
-   restore_user_sigmask(sigmask, );
+   restore_user_sigmask(sigmask, , error == -EINTR);
 
return error;
 }
@@ -2272,7 +2272,7 @@ COMPAT_SYSCALL_DEFINE6(epoll_pwait, int, epfd,
 
err = do_epoll_wait(epfd, events, maxevents, timeout);
 
-   restore_user_sigmask(sigmask, );
+   restore_user_sigmask(sigmask, , err == -EINTR);
 
return err;
 }
diff --git a/fs/select.c b/fs/select.c
index d0f35dbc0e8f..04720e5856ed 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -760,7 +760,7 @@ static long do_pselect(int n, fd_set __user *inp, fd_set 
__user *outp,
ret = core_sys_select(n, inp, outp, exp, to);
ret = poll_select_copy_remaining(_time, tsp, type, ret);
 
-   restore_user_sigmask(sigmask, );
+   restore_user_sigmask(sigmask, , -1);
 
return ret;
 }
@@ -1106,7 +1106,7 @@ SYSCALL_DEFINE5(ppoll, struct pollfd __user *, ufds, 
unsigned int, nfds,
 
ret = do_sys_poll(ufds, nfds, to);
 
-   restore_user_sigmask(sigmask, );
+   restore_user_sigmask(sigmask, , -1);
 
/* We can restart this syscall, usually */
if (ret == -EINTR)
@@ -1142,7 +1142,7 @@ SYSCALL_DEFINE5(ppoll_time32, struct pollfd __user *, 
ufds, unsigned int, nfds,
 
ret = do_sys_poll(ufds, nfds, to);
 
-   restore_user_sigmask(sigmask, );
+   restore_user_sigmask(sigmask, , -1);
 
/* We can restart this syscall, usually */
if (ret == -EINTR)
@@ -1352,7 +1352,7 @@ static long do_compat_pselect(int n, compat_ulong_t 
__user *inp,
ret = compat_core_sys_select(n, inp, outp, exp, to);
ret = poll_select_copy_remaining(_time, tsp, type, ret);
 
-   restore_user_sigmask(sigmask, );
+   restore_user_sigmask(sigmask, , -1);
 
return ret;
 }
@@ -1425,7 +1425,7 @@ COMPAT_SYSCALL_DEFINE5(ppoll, struct pollfd __user *, 
ufds,
 
ret = do_sys_poll(ufds, nfds, to);
 
-   

Re: Strange issues with epoll since 5.0

2019-04-29 Thread Eric Wong
Davidlohr Bueso  wrote:
> On Sun, 28 Apr 2019, Eric Wong wrote:
> 
> > Just running one test won't trigger since it needs a busy
> > machine; but:
> > 
> > make test/mgmt_auto_adjust.log
> > (and "rm make test/mgmt_auto_adjust.log" if you want to rerun)
> 
> fyi no luck reproducing on both either a large (280) or small (4 cpu)
> machine, I'm running it along with a kernel build overcommiting cpus x2.

Thanks for taking a look.

> Is there any workload in particular you are using to stress the box?

Just the "make -j$N check" I mentioned up-thread which spawns a
lot of tests in parallel.  For the small 4 CPU machine,
-j32 or -j16 ought to be reproducing the problem.

I'll try to set aside some time this week to get familiar with
kernel internals w.r.t. signal handling.


Re: Strange issues with epoll since 5.0

2019-04-27 Thread Eric Wong
Deepa Dinamani  wrote:
> I tried to replicate the failure on qemu.
> I do not see the failure with N=32.

> Does it work for N < 32?

Depends on number of cores you have; I have 4 cores, 8 threads
with HT; so I needed to have a lot of load on the machine to get
it to fail (it takes about 1 minute).

cmogstored is intended to run on machines that were already
saturated in CPU/memory from other processes, but not HDD I/O
bandwidth.

> Does any other signal work?

SIGCONT does, via:

   perl -i -p -e 's/SIGURG/SIGCONT/g' `git ls-files`

> Are there any other architectures that fail?

I don't have other arches (well, 32-bit x86, but I've never
really tried cmogstored on that, even).

> Could you help me figure out how to run just the one test that is failing?

Just running one test won't trigger since it needs a busy
machine; but:

make test/mgmt_auto_adjust.log
(and "rm make test/mgmt_auto_adjust.log" if you want to rerun)

Thanks for looking into this.  Fwiw, cmogstored uses epoll in
strange and uncommon ways which has led to kernel bugfixes
in the past.


Re: Strange issues with epoll since 5.0

2019-04-27 Thread Eric Wong
Eric Wong  wrote:
> Omar Kilani  wrote:
> > Hi there,
> > 
> > I’m still trying to piece together a reproducible test that triggers
> > this, but I wanted to post in case someone goes “hmmm... change X
> > might have done this”.
> 
> Maybe Davidlohr knows, since he's responsible for most of the
> epoll changes in 5.0.

Well, I am not sure if I am hitting the same problem Omar is
hitting.  But I did find an epoll_pwait regression in 5.0:

epoll_pwait seems unresponsive to SIGURG in my
heavily-parallelized use case[1] on 5.0.9.  I bisected it to
commit 854a6ed56839a40f6b5d02a2962f48841482eec4
("signal: Add restore_user_sigmask()")

Just reverting the fs/eventpoll.c change in 854a6ed56 seems
enough to fix the non-responsive epoll_pwait for me.  I have not
looked deeply into this, but perhaps the signal_pending check in
restore_user_sigmask is racy w.r.t. epoll.  It is been a while
since I have looked at kernel stuff, myself.

Anyways, this revert works; but I'm not 100% sure why...

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index a5d219d920e7..151739d76801 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -2247,7 +2247,20 @@ SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct 
epoll_event __user *, events,
 
error = do_epoll_wait(epfd, events, maxevents, timeout);
 
-   restore_user_sigmask(sigmask, );
+   /*
+* If we changed the signal mask, we need to restore the original one.
+* In case we've got a signal while waiting, we do not restore the
+* signal mask yet, and we allow do_signal() to deliver the signal on
+* the way back to userspace, before the signal mask is restored.
+*/
+   if (sigmask) {
+   if (error == -EINTR) {
+   memcpy(>saved_sigmask, ,
+  sizeof(sigsaved));
+   set_restore_sigmask();
+   } else
+   set_current_blocked();
+   }
 
return error;
 }
@@ -2272,7 +2285,20 @@ COMPAT_SYSCALL_DEFINE6(epoll_pwait, int, epfd,
 
err = do_epoll_wait(epfd, events, maxevents, timeout);
 
-   restore_user_sigmask(sigmask, );
+   /*
+* If we changed the signal mask, we need to restore the original one.
+* In case we've got a signal while waiting, we do not restore the
+* signal mask yet, and we allow do_signal() to deliver the signal on
+* the way back to userspace, before the signal mask is restored.
+*/
+   if (sigmask) {
+   if (err == -EINTR) {
+   memcpy(>saved_sigmask, ,
+  sizeof(sigsaved));
+   set_restore_sigmask();
+   } else
+   set_current_blocked();
+   }
 
return err;
 }

Comments and/or a proper fix would be greatly appreciated.

[1] my test case is running the cmogstored 1.7.0 test suite
in amd64 Debian stable environment.
test/mgmt_auto_adjust would get stuck and time-out after 60s
on vanilla v5.0.9

tgz: https://bogomips.org/cmogstored/files/cmogstored-1.7.0.tar.gz
# Standard autotools install,  N=32 or some high-ish number
./configure
make -j$N
make check -j$N

# OR git clone https://bogomips.org/cmogstored.git

So, requoting the rest of Omar's original report, here; since
I am not sure if his use case involves epoll_pwait like mine does:

> Omar Kilani  wrote:
> > Basically, something’s broken (or at least, has changed enough to
> > cause problems in user space) in epoll since 5.0. It’s still broken in
> > 5.1-rc5.
> > 
> > It doesn’t happen 100% of the time. It’s sort of hard to pin down but
> > I’ve observed the following:
> > 
> > * nginx not accepting connections under load
> > * A java app which uses netty / NIO having strange writability
> > semantics on channels, which confuses netty / java enough to not
> > properly flush written data on the socket.
> > 
> > I went and tested these Linux kernels:
> > 
> > 4.20.17
> > 4.19.32
> > 4.14.111
> > 
> > And the issue(s) do not show up there.
> > 
> > I’m still actively chasing this up, and will report back — I haven’t
> > touched kernel code in 15 years so I’m a little rusty. :)
> > 
> > Regards,
> > Omar


Re: Strange issues with epoll since 5.0

2019-04-24 Thread Eric Wong
Omar Kilani  wrote:
> Hi there,
> 
> I’m still trying to piece together a reproducible test that triggers
> this, but I wanted to post in case someone goes “hmmm... change X
> might have done this”.

Maybe Davidlohr knows, since he's responsible for most of the
epoll changes in 5.0.

> Basically, something’s broken (or at least, has changed enough to
> cause problems in user space) in epoll since 5.0. It’s still broken in
> 5.1-rc5.
> 
> It doesn’t happen 100% of the time. It’s sort of hard to pin down but
> I’ve observed the following:
> 
> * nginx not accepting connections under load
> * A java app which uses netty / NIO having strange writability
> semantics on channels, which confuses netty / java enough to not
> properly flush written data on the socket.
> 
> I went and tested these Linux kernels:
> 
> 4.20.17
> 4.19.32
> 4.14.111
> 
> And the issue(s) do not show up there.
> 
> I’m still actively chasing this up, and will report back — I haven’t
> touched kernel code in 15 years so I’m a little rusty. :)
> 
> Regards,
> Omar


Re: [RFC] LKML Archive in Maildir Format

2019-03-06 Thread Eric Wong
Bjorn Helgaas  wrote:
> On Tue, Mar 5, 2019 at 5:26 PM Eric Wong  wrote:
> > Bjorn Helgaas  wrote:
> 
> > > Any pointers?  I guess there's no mutt backend that can read a
> > > public-inbox archive directly?
> >
> > There's mutt patches to support reading over NNTP, so that
> > works:
> >
> > mutt -f news://$INBOX_HOST/$INBOX_NEWSGROUP
> 
> Neomutt includes NNTP support, so I tried this:
> 
>   neomutt -f news://nntp.lore.kernel.org/org.kernel.vger.linux-kernel
> 
> which worked OK but (1) I only see the most recent 1000 messages and
> (2) obviously isn't reading a *local* archive.  Neomutt took about 45
> seconds to start up over my wimpy ISP.
> 
> I assume I could probably have a local archive and run a local NNTP
> server and point neomutt at that local server.  But I don't know how
> full-archive searching would work there.

Right.  AFAIK there isn't a good solution for search via NNTP.

> > I don't think mutt handles mboxrd 100% correctly, but it's close
> > enough that you can can download the gzipped mboxrd of a search
> > query and open it via "mutt -f /path/to/downloaded/mbox.gz"
> >
> >   curl -XPOST -OJ "$INBOX_URL/?q=$SEARCH_QUERY=m"
> 
> I got nothing at all with -XPOST, but this:

Ah, I guess nginx (or something in AWS) rejects POST without
Content-Length headers.  Adding "-HContent-Length:0"
to the command-line with -XPOST works for lore.

>   curl -OJ "https://lore.kernel.org/linux-pci/?q=d:20190301..=m;
> 
> got me the HTML source.  Nothing that looks like mboxrd.  I assume

Right.  The "x=m" requests an mbox; but it's only available via
POST requests (to prevent search engine spiders from wasting
time on non-HTML content).  With the HTML output in a browser,
the "mbox.gz" button makes the POST request and allows you to
download the mbox.

> this is stupid user error on my part, but even with that resolved, it
> wouldn't have the nice git fetch properties of the git archive, i.e.,
> incremental updates of only new stuff, would it?

You could bump d:MMDD (there's also "dt:" for date-time if
you need more precision).

> I think my ideal solution would be a mutt that could read the git
> archive directly, plus a notmuch index.  But AFAIK, mutt can't do
> that, and notmuch only works with one message per file, not with the
> git archive.
> 
> Something that might work would be to use Konstantin's "git archive to
> maildir" hint but shard into a bunch of smaller maildirs instead of
> one big one, then have notmuch index those, and use mutt or vim with
> notmuch queries instead of having it read in a maildir.

Small Maildirs work great, but large ones fall over.  I don't
think having a bunch of smaller Maildirs would help notmuch
since notmuch still needs to know each file path.

The only way I could see notmuch/Maildir working well is to keep
the overall number of messages relatively small.

One of my longer-term goals is to write a mairix-like tool in
Perl which works with public-inbox archives; but I barely have
enough time for public-inbox these days :<

mairix works with gzipped mboxes, which is great for large
archives; but the indexing falls over since it rewrites the
entire search index every time. SSDs have died as a result :<

> But I feel like I must be missing the solution that's obvious to
> everybody but me.

Nope, you're not alone :)  There's not a lot of mail software
which can handle LKML-sized histories efficiently.


Re: [RFC] LKML Archive in Maildir Format

2019-03-05 Thread Eric Wong
Bjorn Helgaas  wrote:
> OK, so I understand how to clone archives from lore.kernel.org and how
> to convert a git archive to a maildir (thanks, Konstantin!)
> 
> What I *don't* understand is how to effectively read this locally.
> Ideally I'd like to run mutt, possibly with notmuch for indexing.  But
> a maildir with 3M files seems impractical.  I did actually try it
> (without notmuch), but it takes mutt about 5 minutes to start up.  And
> the maildir is about 23G, compared with 7.5G for the git archive.

Right, relying on Maildir for long-term storage of giant archives
is not a usable solution with any general purpose FSes I know about.
git itself had the same problem with loose object scalability in
the old days and packs were invented as a result.

> Any pointers?  I guess there's no mutt backend that can read a
> public-inbox archive directly?

There's mutt patches to support reading over NNTP, so that
works:

mutt -f news://$INBOX_HOST/$INBOX_NEWSGROUP

I don't think mutt handles mboxrd 100% correctly, but it's close
enough that you can can download the gzipped mboxrd of a search
query and open it via "mutt -f /path/to/downloaded/mbox.gz"

  curl -XPOST -OJ "$INBOX_URL/?q=$SEARCH_QUERY=m"

POST is required(*), and -OJ lets it use the
Content-Disposition: header for a meaningful server-generated
name, but you can also redirect the result to whatever you want.

For all messages since March 1, you could use:

SEARCH_QUERY=d:20190301..

All the supported search queries are documented in
$INBOX_URL/_/text/help/ and the search prefixes (e.g. "d:",
"s:", "b:") are modeled after what's in mairix.  You'll need to
escape the queries for URIs (e.g. " " => "+", and so on).
Xapian requires date ranges to be denoted with ".." whereas
mairix uses "-" for ranges.


The main thing public-inbox search misses from mairix is support
for "-t" which grabs non-matching messages from the same thread.
I would like to support that someday, but don't have enough time
(or funding) to make it happen at the moment.


(*) to reliably avoid wasting resources from spiders/prefetchers


Re: [PATCH][RESEND] wistron_btns needs executable BIOS image

2019-01-25 Thread Eric Wong
Jakub Bogusz  wrote:
> Let's try once again... (take 3)
> 
> First time I sent this patch when I prepared it in 2013 for Linux 3.12,
> the second time after update for Linux 4.12 in 2017...
> And it still applies to Linux 4.20.

Adding Dmitry Torokhov and linux-input to Cc:
(that's what ./scripts/get_maintainer.pl says)

> Actual description in forwarded message.
> 
> 
> -- 
> Jakub Boguszhttp://qboosh.pl/

> From: Jakub Bogusz 
> To: linux-kernel@vger.kernel.org, m...@volny.cz
> Subject: [PATCH] wistron_btns needs executable BIOS image
> 
> Hello,
> 
> This patch (originally agains 3.1x, now I updated include to build
> against 4.12.x) fixes winstron_btns module issue with calling BIOS
> functions in non-executable memory.
> 
> Tested (on Linux 3.10.x and few later versions) on F-S Amilo 8210 laptop.
> 
> 
> -- 
> Jakub Boguszhttp://qboosh.pl/

> wistron_btns needs executable BIOS image.
> 
> Signed-off-by: Jakub Bogusz 
> 
> --- linux-4.12/drivers/input/misc/wistron_btns.c.orig 2013-11-16 
> 09:05:55.612742472 +0100
> +++ linux-4.12/drivers/input/misc/wistron_btns.c  2013-11-16 
> 09:24:37.356028732 +0100
> @@ -33,6 +33,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  /* How often we poll keys - msecs */
>  #define POLL_INTERVAL_DEFAULT500 /* when idle */
> @@ -124,6 +125,7 @@
>   if (entry_point >= 0xF) {
>   bios_code_map_base = base;
>   bios_entry_point = bios_code_map_base + (entry_point & 0x);
> + set_memory_x((unsigned long)bios_code_map_base, 0x1 >> 
> PAGE_SHIFT);
>   } else {
>   iounmap(base);
>   bios_code_map_base = ioremap(entry_point & ~0x3FFF, 0x4000);
> @@ -134,6 +136,7 @@
>   goto err;
>   }
>   bios_entry_point = bios_code_map_base + (entry_point & 0x3FFF);
> + set_memory_x((unsigned long)bios_code_map_base, 0x4000 >> 
> PAGE_SHIFT);
>   }
>   /* The Windows driver maps 0x1 bytes, we keep only one page... */
>   bios_data_map_base = ioremap(0x400, 0xc00);


[PATCH] iommu/intel: quirk to disable DMAR for QM57 igfx

2019-01-18 Thread Eric Wong
Joonas Lahtinen  wrote:
> Quoting Eric Wong (2019-01-04 03:06:26)
> > Yeah, so the Debian bpo 4.17(.17) kernel did not set
> > CONFIG_INTEL_IOMMU_DEFAULT_ON, so I didn't encounter problems.
> > My self-built kernels all set CONFIG_INTEL_IOMMU_DEFAULT_ON.
> 
> So it's the case that IOMMU never worked on your machine.
> 
> My recommendation would be to simply use intel_iommu=igfx_off if you
> need IOMMU.
> 
> Old hardware is known to have issues with IOMMU, and retroactively
> enabling IOMMU on those machines just brings them up :/

How about we use a quirk in case distros make IOMMU the default
one day?

8<
Subject: [PATCH] iommu/intel: quirk to disable DMAR for QM57 igfx

Like the GM45, it seems the integrated graphics on QM57 seems
broken and hanging graphics with "intel_iommu=on".  So allow
future users to unconditionally enable DMAR support and not have
to remember or know to specify "intel_iommu=igfx_off"

cf. https://lore.kernel.org/lkml/20181227114948.ev4b3jte3ubsc5us@dcvr/
cf. 
https://lore.kernel.org/lkml/154659116310.4596.13613897418163029...@jlahtine-desk.ger.corp.intel.com/

Signed-off-by: Eric Wong 
---
 drivers/iommu/intel-iommu.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 048b5ab36a02..dc2507a01580 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -5399,7 +5399,7 @@ const struct iommu_ops intel_iommu_ops = {
 
 static void quirk_iommu_g4x_gfx(struct pci_dev *dev)
 {
-   /* G4x/GM45 integrated gfx dmar support is totally busted. */
+   /* G4x/GM45/QM57 integrated gfx dmar support is totally busted. */
pr_info("Disabling IOMMU for graphics on this chipset\n");
dmar_map_gfx = 0;
 }
@@ -5411,6 +5411,7 @@ DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x2e20, 
quirk_iommu_g4x_gfx);
 DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x2e30, quirk_iommu_g4x_gfx);
 DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x2e40, quirk_iommu_g4x_gfx);
 DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x2e90, quirk_iommu_g4x_gfx);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x0044, quirk_iommu_g4x_gfx);
 
 static void quirk_iommu_rwbf(struct pci_dev *dev)
 {
@@ -5457,7 +5458,6 @@ static void quirk_calpella_no_shadow_gtt(struct pci_dev 
*dev)
}
 }
 DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x0040, 
quirk_calpella_no_shadow_gtt);
-DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x0044, 
quirk_calpella_no_shadow_gtt);
 DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x0062, 
quirk_calpella_no_shadow_gtt);
 DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x006a, 
quirk_calpella_no_shadow_gtt);
 
-- 
EW


Re: [PATCH] rtc: cmos: ignore bogus century byte

2019-01-10 Thread Eric Wong
Alexandre Belloni  wrote:
> Does this mean that it is set to an invalid value and that in is not
> getting updated properly in mc146818_set_time?

It seems to be set/updated without trouble while the machine is
on.  It's only on a cold boot (from poweroff or hibernate) that
the value is invalid.

Patrick may be able to fill you in on more details.


[PATCH] rtc: cmos: ignore bogus century byte

2019-01-06 Thread Eric Wong
Older versions of Libreboot and Coreboot had an invalid value
(`3' in my case) in the century byte affecting the GM45 in
the Thinkpad X200.  Not everybody's updated their firmwares,
and Linux <= 4.2 was able to read the RTC without problems,
so workaround this by ignoring invalid values.

Fixes: 3c217e51d8a272b9 ("rtc: cmos: century support")

Cc: Alexandre Belloni 
Cc: Alessandro Zummo 
Cc: Sylvain Chouleur 
Cc: Patrick McDermott 
Cc: linux-...@vger.kernel.org
Signed-off-by: Eric Wong 
---
 drivers/rtc/rtc-mc146818-lib.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/rtc/rtc-mc146818-lib.c b/drivers/rtc/rtc-mc146818-lib.c
index 2f1772a358ca..18a6f15e313d 100644
--- a/drivers/rtc/rtc-mc146818-lib.c
+++ b/drivers/rtc/rtc-mc146818-lib.c
@@ -82,7 +82,7 @@ unsigned int mc146818_get_time(struct rtc_time *time)
time->tm_year += real_year - 72;
 #endif
 
-   if (century)
+   if (century > 20)
time->tm_year += (century - 19) * 100;
 
/*
-- 
EW


Re: [RFC] LKML Archive in Maildir Format

2019-01-03 Thread Eric Wong
Joey Pabalinas  wrote:
> My only comment on the public-mailbox choice is that the documentation
> is very sparse and erratic. Myself and a couple other people just
> couldn't figure out how to convert that format to Maildir or some other
> format you could feed into a reader like neomutt.

Sorry, I didn't notice this before.  I started making some attempts
at improving documentation (among other things, when time permits)
to public-inbox:

  https://public-inbox.org/meta/20190102083305.30473-...@80x24.org/

And without knowing anything about git or public-inbox, you can
get NNTP messages into Maildir or mboxrd pretty easily.  Nothing
new to learn :)

I wrote a one-off Ruby years ago (before public-inbox) for
converting slrnspools to Maildir (sample slrnpull.conf below).
But yeah, I wouldn't recommend 3M+ messages in a Maildir...

==> slrnspool2maildir <==
#!/usr/bin/ruby
require 'socket'
require 'fileutils'
HOSTNAME = Socket.gethostname

usage = "Usage #$0  "
spooldir = ARGV[0] or abort usage
maildir = ARGV[1] or abort usage

f = base = nil
nr = 0
%w(cur new tmp).each { |x| FileUtils.mkpath("#{maildir}/#{x}") }
Dir.glob("#{spooldir}/*").each do |src|
  File.file?(src) or next
  base = File.basename(src)
  dest = "#{maildir}/new/#{Time.now.to_i}_#{base}_0.#{HOSTNAME}:2,"
  begin
File.link(src, dest)
  rescue Errno::EEXIST
warn "#{dest} already exists"
next
  end
  File.unlink(src)
end
__END__
==> slrnpull.conf <==
# group_name maxexpire headers_only
inbox.com.example.news.group.name10 10 0
# usage: slrnpull -d $PWD -h news.example.com --no-post

# Wouldn't be hard to script something using Net::NNTP in Perl
# to write directly to Maildirs, either.


Re: iommu_intel or i915 regression in 4.18, 4.19.12 and drm-tip

2019-01-03 Thread Eric Wong
Joonas Lahtinen  wrote:
> Quoting Eric Wong (2018-12-27 13:49:48)
> > I just got a used Thinkpad X201 (Core i5 M 520, Intel QM57
> > chipset) and hit some kernel panics while trying to view
> > image/animation-intensive stuff in Firefox (X11) unless I use
> > "iommu_intel=igfx_off".
> > 
> > With Debian stable backport kernels, "linux-image-4.17.0-0.bpo.3-amd64"
> > (4.17.17-1~bpo9+1) has no problems.  But "linux-image-4.18.0-0.bpo.3-amd64"
> > (4.18.20-2~bpo9+1) gives a blank screen before I can login via agetty
> > and run startx.

> Most confusing about this is that 4.17 would have worked to begin with,
> without intel_iommu=igfx_off (unless it was the default for older
> kernel?)

Yeah, so the Debian bpo 4.17(.17) kernel did not set
CONFIG_INTEL_IOMMU_DEFAULT_ON, so I didn't encounter problems.
My self-built kernels all set CONFIG_INTEL_IOMMU_DEFAULT_ON.

Booting the Debian 4.17 kernel with "intel_iommu=on" gives the
same hanging problem I hit with self-built 4.19.{12,13} kernels.

I'm not sure how far back the problem goes (maybe forever),
since I only got this hardware.  Not sure what's the problem
with Debian 4.18, either; but (self-built) 4.19.13 is fine w/o
CONFIG_INTEL_IOMMU_DEFAULT_ON.

Debian backports doesn't have kernels for 4.19 or 4.20, yet.

> Did you maybe update other parts of the system while updating the
> kernel?

Definitely not; just the kernel + headers ("make bindeb-pkg)".

> If you could attach full boot dmesg from working and non-working kernel +
> have config file of both kernel's in Bugzilla. That'd be a good start!

Sorry, I get anxiety attacks when it comes to logins and forms.
Anyways, I managed to get the Debian kernel dmesg output uploaded
with and without iommu_intel=on:
https://bugs.freedesktop.org/attachment.cgi?bugid=109219


iommu_intel or i915 regression in 4.18, 4.19.12 and drm-tip

2018-12-27 Thread Eric Wong
I just got a used Thinkpad X201 (Core i5 M 520, Intel QM57
chipset) and hit some kernel panics while trying to view
image/animation-intensive stuff in Firefox (X11) unless I use
"iommu_intel=igfx_off".

With Debian stable backport kernels, "linux-image-4.17.0-0.bpo.3-amd64"
(4.17.17-1~bpo9+1) has no problems.  But "linux-image-4.18.0-0.bpo.3-amd64"
(4.18.20-2~bpo9+1) gives a blank screen before I can login via agetty
and run startx.

Building 4.19.12 myself got me into X11 and able to start
Firefox to panic the kernel.  I also updated to the latest BIOS
(1.40), but it's an EOL laptop (but it's still the most powerful
laptop I use).  I intend to replace the BIOS with Coreboot soon...

Initially, I thought I was hitting another GPU hang from 4.18+:

https://bugs.freedesktop.org/show_bug.cgi?id=107945

But building drm-tip @ commit 28bb1fc015cedadf3b099b8bd0bb27609849f362
("drm-tip: 2018y-12m-25d-08h-12m-37s UTC integration manifest")
I was still able to reproduce the panic unless I use iommu_intel=igfx_off
"i915.reset=1" did not help matters, either.

Below is what I got from netconsole while on drm-tip:

Kernel panic - not syncing: DMAR hardware is malfunctioning
Shutting down cpus with NMI
Kernel Offset: disabled
---[ end Kernel panic - not syncing: DMAR hardware is malfunctioning ]---
[ cut here ]
sched: Unexpected reschedule of offline CPU#3!
WARNING: CPU: 0 PID: 105 at native_smp_send_reschedule+0x34/0x40
Modules linked in: netconsole ccm snd_hda_codec_hdmi snd_hda_codec_conexant 
snd_hda_codec_generic intel_powerclamp coretemp kvm_intel kvm irqbypass 
crc32_pclmul crc32c_intel ghash_clmulni_intel arc4 iwldvm aesni_intel 
aes_x86_64 crypto_simd cryptd mac80211 glue_helper intel_cstate iwlwifi 
intel_uncore i915 intel_gtt i2c_algo_bit iosf_mbi drm_kms_helper cfbfillrect 
syscopyarea intel_ips cfbimgblt sysfillrect sysimgblt fb_sys_fops cfbcopyarea 
thinkpad_acpi prime_numbers cfg80211 ledtrig_audio i2c_i801 sg snd_hda_intel 
led_class snd_hda_codec drm ac drm_panel_orientation_quirks snd_hwdep battery 
e1000e agpgart snd_hda_core snd_pcm snd_timer ptp snd soundcore pps_core 
ehci_pci ehci_hcd lpc_ich video mfd_core button acpi_cpufreq ecryptfs ip_tables 
x_tables ipv6 evdev thermal [last unloaded: netconsole]
CPU: 0 PID: 105 Comm: kworker/u8:3 Not tainted 4.20.0-rc7b1+ #1
Hardware name: LENOVO 3680FBU/3680FBU, BIOS 6QET70WW (1.40 ) 10/11/2012
Workqueue: i915 __i915_gem_free_work [i915]
RIP: 0010:native_smp_send_reschedule+0x34/0x40
Code: 05 69 c6 c9 00 73 15 48 8b 05 18 2d b3 00 be fd 00 00 00 48 8b 40 30 e9 
9a 58 7d 00 89 fe 48 c7 c7 78 73 af 81 e8 dc c2 01 00 <0f> 0b c3 66 0f 1f 84 00 
00 00 00 00 66 66 66 66 90 8b 05 0d 7d df
RSP: 0018:888075003d98 EFLAGS: 00010092
RAX: 002e RBX: 8880751a0740 RCX: 0006
RDX: 0007 RSI: 0082 RDI: 888075015440
RBP: 88806e823700 R08:  R09: 888072fc07c0
R10: 888075003d60 R11: fff5c002 R12: 8880751a0740
R13: 8880751a0740 R14:  R15: 0003
FS:  () GS:88807500() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 7fdb1f53f000 CR3: 01c0a004 CR4: 000206f0
Call Trace:
 
 ? check_preempt_curr+0x4e/0x90
 ? ttwu_do_wakeup.isra.19+0x14/0xf0
 ? try_to_wake_up+0x323/0x410
 ? autoremove_wake_function+0xe/0x30
 ? __wake_up_common+0x8d/0x140
 ? __wake_up_common_lock+0x6c/0x90
 ? irq_work_run_list+0x49/0x80
 ? tick_sched_handle.isra.6+0x50/0x50
 ? update_process_times+0x3b/0x50
 ? tick_sched_handle.isra.6+0x30/0x50
 ? tick_sched_timer+0x3b/0x80
 ? __hrtimer_run_queues+0xea/0x270
 ? hrtimer_interrupt+0x101/0x240
 ? smp_apic_timer_interrupt+0x6a/0x150
 ? apic_timer_interrupt+0xf/0x20
 
 ? panic+0x1ca/0x212
 ? panic+0x1c7/0x212
 ? __iommu_flush_iotlb+0x19e/0x1c0
 ? iommu_flush_iotlb_psi+0x96/0xf0
 ? intel_unmap+0xbf/0xf0
 ? i915_gem_object_put_pages_gtt+0x36/0x220 [i915]
 ? drm_ht_remove+0x20/0x20 [drm]
 ? drm_mm_remove_node+0x1ad/0x310 [drm]
 ? __pm_runtime_resume+0x54/0x70
 ? __i915_gem_object_unset_pages+0x129/0x170 [i915]
 ? __i915_gem_object_put_pages+0x70/0xa0 [i915]
 ? __i915_gem_free_objects+0x245/0x4e0 [i915]
 ? __switch_to_asm+0x24/0x60
 ? __i915_gem_free_work+0x65/0xa0 [i915]
 ? process_one_work+0x1fd/0x410
 ? worker_thread+0x49/0x3f0
 ? kthread+0xf8/0x130
 ? process_one_work+0x410/0x410
 ? kthread_park+0x90/0x90
 ? ret_from_fork+0x35/0x40
WARNING: CPU: 0 PID: 105 at native_smp_send_reschedule+0x34/0x40
---[ end trace 7dd2184d8c86cef5 ]---
[ cut here ]
sched: Unexpected reschedule of offline CPU#2!
WARNING: CPU: 0 PID: 105 at native_smp_send_reschedule+0x34/0x40
Modules linked in: netconsole ccm snd_hda_codec_hdmi snd_hda_codec_conexant 
snd_hda_codec_generic intel_powerclamp coretemp kvm_intel kvm irqbypass 
crc32_pclmul crc32c_intel ghash_clmulni_intel arc4 iwldvm aesni_intel 
aes_x86_64 crypto_simd cryptd mac80211 

Re: [RFC PATCH 1/1] epoll: use rwlock in order to reduce ep_poll_callback() contention

2018-12-06 Thread Eric Wong
Roman Penyaev  wrote:
> On 2018-12-06 00:46, Eric Wong wrote:
> > Roman Penyaev  wrote:
> > > Hi all,
> > > 
> > > The goal of this patch is to reduce contention of ep_poll_callback()
> > > which
> > > can be called concurrently from different CPUs in case of high events
> > > rates and many fds per epoll.  Problem can be very well reproduced by
> > > generating events (write to pipe or eventfd) from many threads, while
> > > consumer thread does polling.  In other words this patch increases the
> > > bandwidth of events which can be delivered from sources to the
> > > poller by
> > > adding poll items in a lockless way to the list.
> > 
> > Hi Roman,
> > 
> > I also tried to solve this problem many years ago with help of
> > the well-tested-in-userspace wfcqueue from Mathieu's URCU.
> > 
> > I was also looking to solve contention with parallel epoll_wait
> > callers with this.  AFAIK, it worked well; but needed the
> > userspace tests from wfcqueue ported over to the kernel and more
> > review.
> > 
> > I didn't have enough computing power to show the real-world
> > benefits or funding to continue:
> > 
> > https://lore.kernel.org/lkml/?q=wfcqueue+d:..20130501
> 
> Hi Eric,
> 
> Nice work.  That was a huge change by itself and by dependency
> on wfcqueue.  I could not find any valuable discussion on this,
> what was the reaction of the community?

Hi Roman, AFAIK there wasn't much reaction.  Mathieu was VERY
helpful with wfcqueue but there wasn't much else.  Honestly, I'm
surprised wfcqueue hasn't made it into more places; I love it :)

(More recently, I started an effort to get glibc malloc to use wfcqueue:
https://public-inbox.org/libc-alpha/20180731084936.g4yw6wnvt677miti@dcvr/ )

> > It might not be too much trouble for you to brush up the wait-free
> > patches and test them against the rwlock implementation.
> 
> Ha :)  I may try to cherry-pick these patches, let's see how many
> conflicts I have to resolve, eventpoll.c has been changed a lot
> since that (6 years passed, right?)

AFAIK not, epoll remains a queue with a key-value mapping.
I'm not a regular/experienced kernel hacker and I had no trouble
understanding eventpoll.c years ago.

> But reading your work description I can assume that epoll_wait() calls
> should be faster, because they do not content with ep_poll_callback(),
> and I did not try to solve this, only contention between producers,
> which make my change tiny.

Yes, I recall that was it.  My real-world programs[1], even without
slow HDD access, didn't show it, though.

> I also found your https://yhbt.net/eponeshotmt.c , where you count
> number of bare epoll_wait() calls, which IMO is not correct, because
> we need to count how many events are delivered, but not how fast
> you've returned from epoll_wait().  But as I said no doubts that
> getting rid of contention between consumer and producers will show
> even better results.

"epoll_wait calls" == "events delivered" in my case since I (ab)use
epoll_wait with max_events=1 as a work-distribution mechanism
between threads.  Not a common use-case, I admit.

My design was terrible from a syscall overhead POV, but my
bottleneck for real-world use for cmogstored[1] was dozens of
rotational HDDs in JBOD configuration; so I favored elimination
of head-of-line blocking over throughput of epoll itself.

My motivation for hacking on epoll back then was only to look
better on synthetic benchmarks that didn't hit slow HDDs :)



[1] git clone https://bogomips.org/cmogstored.git/
the Ragel-generated HTTP parser was also a bottleneck in
synthetic benchmarks, as we


Re: [RFC PATCH 1/1] epoll: use rwlock in order to reduce ep_poll_callback() contention

2018-12-06 Thread Eric Wong
Roman Penyaev  wrote:
> On 2018-12-06 00:46, Eric Wong wrote:
> > Roman Penyaev  wrote:
> > > Hi all,
> > > 
> > > The goal of this patch is to reduce contention of ep_poll_callback()
> > > which
> > > can be called concurrently from different CPUs in case of high events
> > > rates and many fds per epoll.  Problem can be very well reproduced by
> > > generating events (write to pipe or eventfd) from many threads, while
> > > consumer thread does polling.  In other words this patch increases the
> > > bandwidth of events which can be delivered from sources to the
> > > poller by
> > > adding poll items in a lockless way to the list.
> > 
> > Hi Roman,
> > 
> > I also tried to solve this problem many years ago with help of
> > the well-tested-in-userspace wfcqueue from Mathieu's URCU.
> > 
> > I was also looking to solve contention with parallel epoll_wait
> > callers with this.  AFAIK, it worked well; but needed the
> > userspace tests from wfcqueue ported over to the kernel and more
> > review.
> > 
> > I didn't have enough computing power to show the real-world
> > benefits or funding to continue:
> > 
> > https://lore.kernel.org/lkml/?q=wfcqueue+d:..20130501
> 
> Hi Eric,
> 
> Nice work.  That was a huge change by itself and by dependency
> on wfcqueue.  I could not find any valuable discussion on this,
> what was the reaction of the community?

Hi Roman, AFAIK there wasn't much reaction.  Mathieu was VERY
helpful with wfcqueue but there wasn't much else.  Honestly, I'm
surprised wfcqueue hasn't made it into more places; I love it :)

(More recently, I started an effort to get glibc malloc to use wfcqueue:
https://public-inbox.org/libc-alpha/20180731084936.g4yw6wnvt677miti@dcvr/ )

> > It might not be too much trouble for you to brush up the wait-free
> > patches and test them against the rwlock implementation.
> 
> Ha :)  I may try to cherry-pick these patches, let's see how many
> conflicts I have to resolve, eventpoll.c has been changed a lot
> since that (6 years passed, right?)

AFAIK not, epoll remains a queue with a key-value mapping.
I'm not a regular/experienced kernel hacker and I had no trouble
understanding eventpoll.c years ago.

> But reading your work description I can assume that epoll_wait() calls
> should be faster, because they do not content with ep_poll_callback(),
> and I did not try to solve this, only contention between producers,
> which make my change tiny.

Yes, I recall that was it.  My real-world programs[1], even without
slow HDD access, didn't show it, though.

> I also found your https://yhbt.net/eponeshotmt.c , where you count
> number of bare epoll_wait() calls, which IMO is not correct, because
> we need to count how many events are delivered, but not how fast
> you've returned from epoll_wait().  But as I said no doubts that
> getting rid of contention between consumer and producers will show
> even better results.

"epoll_wait calls" == "events delivered" in my case since I (ab)use
epoll_wait with max_events=1 as a work-distribution mechanism
between threads.  Not a common use-case, I admit.

My design was terrible from a syscall overhead POV, but my
bottleneck for real-world use for cmogstored[1] was dozens of
rotational HDDs in JBOD configuration; so I favored elimination
of head-of-line blocking over throughput of epoll itself.

My motivation for hacking on epoll back then was only to look
better on synthetic benchmarks that didn't hit slow HDDs :)



[1] git clone https://bogomips.org/cmogstored.git/
the Ragel-generated HTTP parser was also a bottleneck in
synthetic benchmarks, as we


Re: [RFC PATCH 1/1] epoll: use rwlock in order to reduce ep_poll_callback() contention

2018-12-05 Thread Eric Wong
Roman Penyaev  wrote:
> Hi all,
> 
> The goal of this patch is to reduce contention of ep_poll_callback() which
> can be called concurrently from different CPUs in case of high events
> rates and many fds per epoll.  Problem can be very well reproduced by
> generating events (write to pipe or eventfd) from many threads, while
> consumer thread does polling.  In other words this patch increases the
> bandwidth of events which can be delivered from sources to the poller by
> adding poll items in a lockless way to the list.

Hi Roman,

I also tried to solve this problem many years ago with help of
the well-tested-in-userspace wfcqueue from Mathieu's URCU.

I was also looking to solve contention with parallel epoll_wait
callers with this.  AFAIK, it worked well; but needed the
userspace tests from wfcqueue ported over to the kernel and more
review.

I didn't have enough computing power to show the real-world
benefits or funding to continue:

https://lore.kernel.org/lkml/?q=wfcqueue+d:..20130501

It might not be too much trouble for you to brush up the wait-free
patches and test them against the rwlock implementation.

(I only noticed this thread since I was catching up on some
 public-inbox work :>)


Re: [RFC PATCH 1/1] epoll: use rwlock in order to reduce ep_poll_callback() contention

2018-12-05 Thread Eric Wong
Roman Penyaev  wrote:
> Hi all,
> 
> The goal of this patch is to reduce contention of ep_poll_callback() which
> can be called concurrently from different CPUs in case of high events
> rates and many fds per epoll.  Problem can be very well reproduced by
> generating events (write to pipe or eventfd) from many threads, while
> consumer thread does polling.  In other words this patch increases the
> bandwidth of events which can be delivered from sources to the poller by
> adding poll items in a lockless way to the list.

Hi Roman,

I also tried to solve this problem many years ago with help of
the well-tested-in-userspace wfcqueue from Mathieu's URCU.

I was also looking to solve contention with parallel epoll_wait
callers with this.  AFAIK, it worked well; but needed the
userspace tests from wfcqueue ported over to the kernel and more
review.

I didn't have enough computing power to show the real-world
benefits or funding to continue:

https://lore.kernel.org/lkml/?q=wfcqueue+d:..20130501

It might not be too much trouble for you to brush up the wait-free
patches and test them against the rwlock implementation.

(I only noticed this thread since I was catching up on some
 public-inbox work :>)


[PATCH] platform/x86: thinkpad_acpi: add adaptive_kbd_modes parameter

2018-11-14 Thread Eric Wong
This bitmap parameter allows the user to add/remove modes for
DFR_CHANGE_ROW to cycle through.

Users who wish to cycle through WEB_BROWSER_MODE and/or
WEB_CONFERENCE_MODE may now do so by enabling corresponding bits.

While some appreciate more modes, I made this feature because I
wanted to lock the keyboard into FUNCTION_MODE.  This is because
my Fn key (DFR_CHANGE_ROW) is mapped to Escape, and my "Esc" key
is mapped to grave/asciitilde to match the layout of a regular
keyboard.

The default remains unchanged with the DFR_CHANGE_ROW key
toggling between HOME_MODE and FUNCTION_MODE.  Thus the
default "adaptive_kbd_modes" value is 9, but I use 8:

echo 8 >/sys/devices/platform/thinkpad_acpi/adaptive_kbd_modes

The above setting with this change and the following keymap
preserves my sanity on the atrocious adaptive keyboard on
the 2nd-gen X1 Carbon:

{
echo keymaps 0-255
# Esc key maps to '`' or '~'
echo keycode  1 = grave asciitilde

# Fn key maps to Escape
echo keycode 143 = Escape

# Home and End on the keyboard map to Control
echo keycode 102 = Control
echo keycode 107 = Control
} | loadkeys -

Or with the following xmodmaprc:

remove Control = Control_L
remove Lock = Control_L
keycode   9 = grave asciitilde grave
keycode 110 = Control_L NoSymbol Control_L
keycode 115 = Control_L NoSymbol Control_L
keycode 151 = Escape NoSymbol Escape
add Control = Control_L

Signed-off-by: Eric Wong 
---
 Documentation/laptops/thinkpad-acpi.txt | 11 +
 drivers/platform/x86/thinkpad_acpi.c| 79 ++---
 2 files changed, 64 insertions(+), 26 deletions(-)

diff --git a/Documentation/laptops/thinkpad-acpi.txt 
b/Documentation/laptops/thinkpad-acpi.txt
index 6cced88de6da..36c8731b6919 100644
--- a/Documentation/laptops/thinkpad-acpi.txt
+++ b/Documentation/laptops/thinkpad-acpi.txt
@@ -1378,6 +1378,17 @@ For more details about which buttons will appear 
depending on the mode, please
 review the laptop's user guide:
 http://www.lenovo.com/shop/americas/content/user_guides/x1carbon_2_ug_en.pdf
 
+sysfs device attribute: adaptive_kbd_modes
+
+This bitmap attribute controls the modes the "Fn" key is allowed
+to cycle through.  The value can be read and set.  Enabled bits
+correspond to the modes above (that is, the first bit is "Home mode"
+and the fourth bit "Function mode").
+
+The default value is 9, which allows cycling between Home and Function modes.
+Setting and unsetting corresponding bits allows adding or removing modes
+to cycle through.
+
 Multiple Commands, Module Parameters
 
 
diff --git a/drivers/platform/x86/thinkpad_acpi.c 
b/drivers/platform/x86/thinkpad_acpi.c
index fde08a997557..77b4f00e0443 100644
--- a/drivers/platform/x86/thinkpad_acpi.c
+++ b/drivers/platform/x86/thinkpad_acpi.c
@@ -3094,8 +3094,44 @@ static ssize_t adaptive_kbd_mode_store(struct device 
*dev,
 
 static DEVICE_ATTR_RW(adaptive_kbd_mode);
 
+/* Thinkpad X1 Carbon support 5 modes including Home mode, Web browser
+ * mode, Web conference mode, Function mode and Lay-flat mode.
+ * We support cycling between Home mode and Function mode by default.
+ *
+ * Users may enable and disable other modes by changing the
+ * adaptive_kbd_modes bitmap attribute
+ */
+static unsigned adaptive_kbd_modes =
+   1 << HOME_MODE |
+/* 1 << WEB_BROWSER_MODE |
+   1 << WEB_CONFERENCE_MODE | */
+   1 << FUNCTION_MODE;
+
+static ssize_t adaptive_kbd_modes_show(struct device *dev,
+   struct device_attribute *attr,
+   char *buf)
+{
+   return snprintf(buf, PAGE_SIZE, "%u\n", adaptive_kbd_modes);
+}
+
+static ssize_t adaptive_kbd_modes_store(struct device *dev,
+   struct device_attribute *attr,
+   const char *buf, size_t count)
+{
+   unsigned long t;
+
+   if (parse_strtoul(buf, (1 << LAYFLAT_MODE) - 1, ))
+   return -EINVAL;
+
+   adaptive_kbd_modes = (unsigned int)t;
+   return count;
+}
+
+static DEVICE_ATTR_RW(adaptive_kbd_modes);
+
 static struct attribute *adaptive_kbd_attributes[] = {
_attr_adaptive_kbd_mode.attr,
+   _attr_adaptive_kbd_modes.attr,
NULL
 };
 
@@ -3763,20 +3799,7 @@ static int __init hotkey_init(struct ibm_init_struct 
*iibm)
return (res < 0) ? res : 1;
 }
 
-/* Thinkpad X1 Carbon support 5 modes including Home mode, Web browser
- * mode, Web conference mode, Function mode and Lay-flat mode.
- * We support Home mode and Function mode currently.
- *
- * Will consider support rest of modes in future.
- *
- */
-static const int adaptive_keyboard_modes[] = {
-   HOME_MODE,
-/* WEB_BROWSER_MODE = 2,
-   WEB_CONFEREN

[PATCH] platform/x86: thinkpad_acpi: add adaptive_kbd_modes parameter

2018-11-14 Thread Eric Wong
This bitmap parameter allows the user to add/remove modes for
DFR_CHANGE_ROW to cycle through.

Users who wish to cycle through WEB_BROWSER_MODE and/or
WEB_CONFERENCE_MODE may now do so by enabling corresponding bits.

While some appreciate more modes, I made this feature because I
wanted to lock the keyboard into FUNCTION_MODE.  This is because
my Fn key (DFR_CHANGE_ROW) is mapped to Escape, and my "Esc" key
is mapped to grave/asciitilde to match the layout of a regular
keyboard.

The default remains unchanged with the DFR_CHANGE_ROW key
toggling between HOME_MODE and FUNCTION_MODE.  Thus the
default "adaptive_kbd_modes" value is 9, but I use 8:

echo 8 >/sys/devices/platform/thinkpad_acpi/adaptive_kbd_modes

The above setting with this change and the following keymap
preserves my sanity on the atrocious adaptive keyboard on
the 2nd-gen X1 Carbon:

{
echo keymaps 0-255
# Esc key maps to '`' or '~'
echo keycode  1 = grave asciitilde

# Fn key maps to Escape
echo keycode 143 = Escape

# Home and End on the keyboard map to Control
echo keycode 102 = Control
echo keycode 107 = Control
} | loadkeys -

Or with the following xmodmaprc:

remove Control = Control_L
remove Lock = Control_L
keycode   9 = grave asciitilde grave
keycode 110 = Control_L NoSymbol Control_L
keycode 115 = Control_L NoSymbol Control_L
keycode 151 = Escape NoSymbol Escape
add Control = Control_L

Signed-off-by: Eric Wong 
---
 Documentation/laptops/thinkpad-acpi.txt | 11 +
 drivers/platform/x86/thinkpad_acpi.c| 79 ++---
 2 files changed, 64 insertions(+), 26 deletions(-)

diff --git a/Documentation/laptops/thinkpad-acpi.txt 
b/Documentation/laptops/thinkpad-acpi.txt
index 6cced88de6da..36c8731b6919 100644
--- a/Documentation/laptops/thinkpad-acpi.txt
+++ b/Documentation/laptops/thinkpad-acpi.txt
@@ -1378,6 +1378,17 @@ For more details about which buttons will appear 
depending on the mode, please
 review the laptop's user guide:
 http://www.lenovo.com/shop/americas/content/user_guides/x1carbon_2_ug_en.pdf
 
+sysfs device attribute: adaptive_kbd_modes
+
+This bitmap attribute controls the modes the "Fn" key is allowed
+to cycle through.  The value can be read and set.  Enabled bits
+correspond to the modes above (that is, the first bit is "Home mode"
+and the fourth bit "Function mode").
+
+The default value is 9, which allows cycling between Home and Function modes.
+Setting and unsetting corresponding bits allows adding or removing modes
+to cycle through.
+
 Multiple Commands, Module Parameters
 
 
diff --git a/drivers/platform/x86/thinkpad_acpi.c 
b/drivers/platform/x86/thinkpad_acpi.c
index fde08a997557..77b4f00e0443 100644
--- a/drivers/platform/x86/thinkpad_acpi.c
+++ b/drivers/platform/x86/thinkpad_acpi.c
@@ -3094,8 +3094,44 @@ static ssize_t adaptive_kbd_mode_store(struct device 
*dev,
 
 static DEVICE_ATTR_RW(adaptive_kbd_mode);
 
+/* Thinkpad X1 Carbon support 5 modes including Home mode, Web browser
+ * mode, Web conference mode, Function mode and Lay-flat mode.
+ * We support cycling between Home mode and Function mode by default.
+ *
+ * Users may enable and disable other modes by changing the
+ * adaptive_kbd_modes bitmap attribute
+ */
+static unsigned adaptive_kbd_modes =
+   1 << HOME_MODE |
+/* 1 << WEB_BROWSER_MODE |
+   1 << WEB_CONFERENCE_MODE | */
+   1 << FUNCTION_MODE;
+
+static ssize_t adaptive_kbd_modes_show(struct device *dev,
+   struct device_attribute *attr,
+   char *buf)
+{
+   return snprintf(buf, PAGE_SIZE, "%u\n", adaptive_kbd_modes);
+}
+
+static ssize_t adaptive_kbd_modes_store(struct device *dev,
+   struct device_attribute *attr,
+   const char *buf, size_t count)
+{
+   unsigned long t;
+
+   if (parse_strtoul(buf, (1 << LAYFLAT_MODE) - 1, ))
+   return -EINVAL;
+
+   adaptive_kbd_modes = (unsigned int)t;
+   return count;
+}
+
+static DEVICE_ATTR_RW(adaptive_kbd_modes);
+
 static struct attribute *adaptive_kbd_attributes[] = {
_attr_adaptive_kbd_mode.attr,
+   _attr_adaptive_kbd_modes.attr,
NULL
 };
 
@@ -3763,20 +3799,7 @@ static int __init hotkey_init(struct ibm_init_struct 
*iibm)
return (res < 0) ? res : 1;
 }
 
-/* Thinkpad X1 Carbon support 5 modes including Home mode, Web browser
- * mode, Web conference mode, Function mode and Lay-flat mode.
- * We support Home mode and Function mode currently.
- *
- * Will consider support rest of modes in future.
- *
- */
-static const int adaptive_keyboard_modes[] = {
-   HOME_MODE,
-/* WEB_BROWSER_MODE = 2,
-   WEB_CONFEREN

[RFC] pipe: prevent compiler reordering in pipe_poll

2018-08-24 Thread Eric Wong
The pipe_poll function does not use locks, and adding an entry
to the waitqueue is not guaranteed to happen before pipe->nrbufs
(or other fields) are read, leading to missed wakeups.

Looking at Ruby CI build logs and backtraces, I've noticed
occasional instances where processes are stuck in select(2) or
ppoll(2) with a pipe.

I don't have access to the systems where this is happening to
test/reproduce the problem, and haven't been able to reproduce
it locally on less-powerful hardware, either.  However, it seems
like a problem based on similar comments in
fs/eventfd.c::eventfd_poll made by Paolo.

Signed-off-by: Eric Wong 
Cc: Paolo Bonzini 
---
 fs/pipe.c | 32 ++--
 1 file changed, 30 insertions(+), 2 deletions(-)

diff --git a/fs/pipe.c b/fs/pipe.c
index 39d6f431da83..1a904d941cf1 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -509,7 +509,7 @@ static long pipe_ioctl(struct file *filp, unsigned int cmd, 
unsigned long arg)
}
 }
 
-/* No kernel lock held - fine */
+/* No kernel lock held - fine, but a compiler barrier is required */
 static __poll_t
 pipe_poll(struct file *filp, poll_table *wait)
 {
@@ -519,7 +519,35 @@ pipe_poll(struct file *filp, poll_table *wait)
 
poll_wait(filp, >wait, wait);
 
-   /* Reading only -- no need for acquiring the semaphore.  */
+   /*
+* Reading only -- no need for acquiring the semaphore, but
+* we need a compiler barrier to ensure the compiler does
+* not reorder reads to pipe->nrbufs, pipe->writers,
+* pipe->readers, filp->f_version, pipe->w_counter, and
+* pipe->buffers before poll_wait to avoid missing wakeups
+* from compiler reordering.  In other words, we need to
+* prevent the following situation:
+*
+* pipe_poll  pipe_write
+* -  
+* nrbufs = pipe->nrbufs (INVALID!)
+*
+*__pipe_lock
+*pipe->nrbufs = ++bufs;
+*__pipe_unlock
+*wake_up_interruptible_sync_poll
+*  pipe->wait is empty, no wakeup
+*
+* lock pipe->wait.lock (in poll_wait)
+* __add_wait_queue
+* unlock pipe->wait.lock
+*
+*  // pipe->nrbufs should be read here, NOT above
+*
+* pipe_poll returns 0 (WRONG)
+*/
+   barrier();
+
nrbufs = pipe->nrbufs;
mask = 0;
if (filp->f_mode & FMODE_READ) {
-- 
EW


[RFC] pipe: prevent compiler reordering in pipe_poll

2018-08-24 Thread Eric Wong
The pipe_poll function does not use locks, and adding an entry
to the waitqueue is not guaranteed to happen before pipe->nrbufs
(or other fields) are read, leading to missed wakeups.

Looking at Ruby CI build logs and backtraces, I've noticed
occasional instances where processes are stuck in select(2) or
ppoll(2) with a pipe.

I don't have access to the systems where this is happening to
test/reproduce the problem, and haven't been able to reproduce
it locally on less-powerful hardware, either.  However, it seems
like a problem based on similar comments in
fs/eventfd.c::eventfd_poll made by Paolo.

Signed-off-by: Eric Wong 
Cc: Paolo Bonzini 
---
 fs/pipe.c | 32 ++--
 1 file changed, 30 insertions(+), 2 deletions(-)

diff --git a/fs/pipe.c b/fs/pipe.c
index 39d6f431da83..1a904d941cf1 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -509,7 +509,7 @@ static long pipe_ioctl(struct file *filp, unsigned int cmd, 
unsigned long arg)
}
 }
 
-/* No kernel lock held - fine */
+/* No kernel lock held - fine, but a compiler barrier is required */
 static __poll_t
 pipe_poll(struct file *filp, poll_table *wait)
 {
@@ -519,7 +519,35 @@ pipe_poll(struct file *filp, poll_table *wait)
 
poll_wait(filp, >wait, wait);
 
-   /* Reading only -- no need for acquiring the semaphore.  */
+   /*
+* Reading only -- no need for acquiring the semaphore, but
+* we need a compiler barrier to ensure the compiler does
+* not reorder reads to pipe->nrbufs, pipe->writers,
+* pipe->readers, filp->f_version, pipe->w_counter, and
+* pipe->buffers before poll_wait to avoid missing wakeups
+* from compiler reordering.  In other words, we need to
+* prevent the following situation:
+*
+* pipe_poll  pipe_write
+* -  
+* nrbufs = pipe->nrbufs (INVALID!)
+*
+*__pipe_lock
+*pipe->nrbufs = ++bufs;
+*__pipe_unlock
+*wake_up_interruptible_sync_poll
+*  pipe->wait is empty, no wakeup
+*
+* lock pipe->wait.lock (in poll_wait)
+* __add_wait_queue
+* unlock pipe->wait.lock
+*
+*  // pipe->nrbufs should be read here, NOT above
+*
+* pipe_poll returns 0 (WRONG)
+*/
+   barrier();
+
nrbufs = pipe->nrbufs;
mask = 0;
if (filp->f_mode & FMODE_READ) {
-- 
EW


Re: [PATCH 2/2] send-email: supply a --send-delay=1 by default

2018-03-27 Thread Eric Wong
Ævar Arnfjörð Bjarmason  wrote:
> Good point. I also see that (via git log --author=Ævar --grep='^\[PATCH
> ') that this series itself arrived out of order (0 -> 2 -> 1), but I
> don't know to what extent public-inbox itself might be batching things.

public-inbox doesn't batch, aside from when the
public-inbox-watch process gets restarted and needs to catch up
using readdir.  Once it's done catching up with readdir, it
gets into an inotify loop which injects messages in the order
the MTA (or offlineimap) puts them in a Maildir.

Right now, public-inbox only sorts by Date: header in the mail.

The next Xapian schema revision of public-inbox will use
internally sorts search results(*) by the date in the newest
Received: header.  That is analogous to git committer date.  The
displayed message date will still be sorted by the Date: header
(analogous to git author date); since git-send-email already
alters the Date: in a series for sorting.

This allow messages/threads which are actually new get bumped to
the top of the homepage; regardless of how wrong the original
sender's clock was.

It should help prevent kernel developers from crafting message
dates with optimization for classic^Wextra reviews in mind :)


(*) all the look-at-a-bunch-of-messages operations, including
the landing page (e.g. https://public-inbox.org/git/)
is a Xapian search query, nowadays; but "git log"


Re: [PATCH 2/2] send-email: supply a --send-delay=1 by default

2018-03-27 Thread Eric Wong
Ævar Arnfjörð Bjarmason  wrote:
> Good point. I also see that (via git log --author=Ævar --grep='^\[PATCH
> ') that this series itself arrived out of order (0 -> 2 -> 1), but I
> don't know to what extent public-inbox itself might be batching things.

public-inbox doesn't batch, aside from when the
public-inbox-watch process gets restarted and needs to catch up
using readdir.  Once it's done catching up with readdir, it
gets into an inotify loop which injects messages in the order
the MTA (or offlineimap) puts them in a Maildir.

Right now, public-inbox only sorts by Date: header in the mail.

The next Xapian schema revision of public-inbox will use
internally sorts search results(*) by the date in the newest
Received: header.  That is analogous to git committer date.  The
displayed message date will still be sorted by the Date: header
(analogous to git author date); since git-send-email already
alters the Date: in a series for sorting.

This allow messages/threads which are actually new get bumped to
the top of the homepage; regardless of how wrong the original
sender's clock was.

It should help prevent kernel developers from crafting message
dates with optimization for classic^Wextra reviews in mind :)


(*) all the look-at-a-bunch-of-messages operations, including
the landing page (e.g. https://public-inbox.org/git/)
is a Xapian search query, nowadays; but "git log"


Re: blog.cloudflare.com/the-sad-state-of-linux-socket-balancing/

2017-10-24 Thread Eric Wong
Hi Marek, I'm replying to
http://blog.cloudflare.com/the-sad-state-of-linux-socket-balancing/
via email so Jason and linux-kernel see it.  I also don't believe in
using centralized, proprietary messaging like Disqus for discussing
Open Source, nor do I deal with JavaScript.

I still believe the best way to balance connections across
multiple processes in any server with persistent socket
connections is to create a dedicated thread performing nothing
but blocking accept4() + EPOLL_CTL_ADD calls in each worker
process, acting as the queue producer with EPOLLONESHOT.

(the "queue" in this case is the epoll or kqueue file description)

The usual worker thread(s) act as the queue consumer, calling
epoll_wait as usual with minimal modification.  These usual
worker threads only change the epoll watch set with
EPOLL_CTL_MOD (and maybe _DEL); but not _ADD; with EPOLLONESHOT.

Perhaps some pseudo code can describe this better:

thread_acceptor: /* this thread never does epoll_wait */

while (running) {
/*
 * blocking accept, but create non-blocking client socket
 * lfd may be shared across any number of processes
 */
int cfd = accept4(lfd, ..., SOCK_NONBLOCK);

if (cfd >= 0) {
struct epoll_event event;

event.events = EPOLLONESHOT|EPOLLIN;
event.data.ptr = client_new(cfd);

/* epfd is per-process */
epoll_ctl(epfd, EPOLL_CTL_ADD, cfd, );
}
}

thread_worker: /* this never does EPOLL_CTL_ADD */

/*
 * If there's multiple worker threads; maxevents can be 1
 * for optimal fairness (at the expense of throughput)
 */
while (running) {
int i;
int n = epoll_wait(epfd, events, maxevents, timeout);

for (i = 0; i < n; i++) {
struct client *client = events[i].data.ptr;

/*
 * The usual non-blocking server processing,
 * any socket read/writes are non-blocking here:
 */
enum next_action next = client_read_write(client);

int want = 0;

switch (next) {
case NEXT_RDONLY: want = EPOLLIN; break;
case NEXT_WRONLY: want = EPOLLOUT; break;
case NEXT_RDWR: want = EPOLLOUT|EPOLLIN; break;
case NEXT_CLOSE:
close(client->fd);
}
if (want) {
events[i].events = want | EPOLLONESHOT;
epoll_ctl(epfd, EPOLL_CTL_MOD,
  client->fd, [i]);
}
}
}


I came up with this design back around 2011 before EPOLLEXCLUSIVE
and SO_REUSEPORT came about.  I instead based my design on the
ancient-but-still-accurate document around blocking accept() and
exclusive wakeups:

  http://www.citi.umich.edu/projects/linux-scalability/reports/accept.html

All this is applied in cmogstored which has had the same basic
design since 2012.  Since then, it has evenly distributed
persistent connections across multiple processes.

cmogstored supports hundreds of rotational disks in a JBOD
configuration while maintaining thousands of persistent
connections indefinitely for both fast and slow clients over
HTTP and a MogileFS-specific sidechannel protocol (TCP).  It
applies the Waitrose "combined queue model" in every aspect to
avoid worst case latency.  For cmogstored, any new latency from
lock contention (ep->mtx and ep->lock) is inconsequential
compared to the huge latencies of storage devices.

Some documentation around the design is here:

https://bogomips.org/cmogstored/queues.txt

Multiple processes isn't documented as it wasn't in the original
Perl mogstored, but it's there since since I figured somebody
might run into contention with FD allocation and it provides
some safety in case of segfaults[1].

All the code (GPL-3.0+) is available at:

git clone git://bogomips.org/cmogstored/

It also works with kqueue and is in the FreeBSD ports collection.


[1] Ironically, the only segfault I've encountered in cmogstored
is because I accidentally shared a DIR * (from opendir) across
processes :x   And, really, cmogstored probably doesn't
benefit rom multiple processes like nginx does as
cmogstored was always designed to be MT.


Anyways, I believe nginx can apply this design around dedicated
blocking acceptor threads for each worker process to its
existing model to improve client balancing across worker processes.

However, this change breaks the nginx SIGUSR2 upgrade/backout
because old workers depend on 

Re: blog.cloudflare.com/the-sad-state-of-linux-socket-balancing/

2017-10-24 Thread Eric Wong
Hi Marek, I'm replying to
http://blog.cloudflare.com/the-sad-state-of-linux-socket-balancing/
via email so Jason and linux-kernel see it.  I also don't believe in
using centralized, proprietary messaging like Disqus for discussing
Open Source, nor do I deal with JavaScript.

I still believe the best way to balance connections across
multiple processes in any server with persistent socket
connections is to create a dedicated thread performing nothing
but blocking accept4() + EPOLL_CTL_ADD calls in each worker
process, acting as the queue producer with EPOLLONESHOT.

(the "queue" in this case is the epoll or kqueue file description)

The usual worker thread(s) act as the queue consumer, calling
epoll_wait as usual with minimal modification.  These usual
worker threads only change the epoll watch set with
EPOLL_CTL_MOD (and maybe _DEL); but not _ADD; with EPOLLONESHOT.

Perhaps some pseudo code can describe this better:

thread_acceptor: /* this thread never does epoll_wait */

while (running) {
/*
 * blocking accept, but create non-blocking client socket
 * lfd may be shared across any number of processes
 */
int cfd = accept4(lfd, ..., SOCK_NONBLOCK);

if (cfd >= 0) {
struct epoll_event event;

event.events = EPOLLONESHOT|EPOLLIN;
event.data.ptr = client_new(cfd);

/* epfd is per-process */
epoll_ctl(epfd, EPOLL_CTL_ADD, cfd, );
}
}

thread_worker: /* this never does EPOLL_CTL_ADD */

/*
 * If there's multiple worker threads; maxevents can be 1
 * for optimal fairness (at the expense of throughput)
 */
while (running) {
int i;
int n = epoll_wait(epfd, events, maxevents, timeout);

for (i = 0; i < n; i++) {
struct client *client = events[i].data.ptr;

/*
 * The usual non-blocking server processing,
 * any socket read/writes are non-blocking here:
 */
enum next_action next = client_read_write(client);

int want = 0;

switch (next) {
case NEXT_RDONLY: want = EPOLLIN; break;
case NEXT_WRONLY: want = EPOLLOUT; break;
case NEXT_RDWR: want = EPOLLOUT|EPOLLIN; break;
case NEXT_CLOSE:
close(client->fd);
}
if (want) {
events[i].events = want | EPOLLONESHOT;
epoll_ctl(epfd, EPOLL_CTL_MOD,
  client->fd, [i]);
}
}
}


I came up with this design back around 2011 before EPOLLEXCLUSIVE
and SO_REUSEPORT came about.  I instead based my design on the
ancient-but-still-accurate document around blocking accept() and
exclusive wakeups:

  http://www.citi.umich.edu/projects/linux-scalability/reports/accept.html

All this is applied in cmogstored which has had the same basic
design since 2012.  Since then, it has evenly distributed
persistent connections across multiple processes.

cmogstored supports hundreds of rotational disks in a JBOD
configuration while maintaining thousands of persistent
connections indefinitely for both fast and slow clients over
HTTP and a MogileFS-specific sidechannel protocol (TCP).  It
applies the Waitrose "combined queue model" in every aspect to
avoid worst case latency.  For cmogstored, any new latency from
lock contention (ep->mtx and ep->lock) is inconsequential
compared to the huge latencies of storage devices.

Some documentation around the design is here:

https://bogomips.org/cmogstored/queues.txt

Multiple processes isn't documented as it wasn't in the original
Perl mogstored, but it's there since since I figured somebody
might run into contention with FD allocation and it provides
some safety in case of segfaults[1].

All the code (GPL-3.0+) is available at:

git clone git://bogomips.org/cmogstored/

It also works with kqueue and is in the FreeBSD ports collection.


[1] Ironically, the only segfault I've encountered in cmogstored
is because I accidentally shared a DIR * (from opendir) across
processes :x   And, really, cmogstored probably doesn't
benefit rom multiple processes like nginx does as
cmogstored was always designed to be MT.


Anyways, I believe nginx can apply this design around dedicated
blocking acceptor threads for each worker process to its
existing model to improve client balancing across worker processes.

However, this change breaks the nginx SIGUSR2 upgrade/backout
because old workers depend on 

Re: BUG in git diff-index

2017-09-26 Thread Eric Wong
Marc Herbert  wrote:
> PS: I used NNTP and http://dir.gmane.org/gmane.comp.version-control.git
> to quickly find this old thread (what could we do without NNTP?). Then
> I googled for a web archive of this thread and Google could only find
> this one: 
> http://git.661346.n2.nabble.com/BUG-in-git-diff-index-tt7652105.html#none
> Is there a robots.txt to block indexing on
> https://public-inbox.org/git/1459432667.2124.2.ca...@dwim.me ?

There's no blocks on public-inbox.org and I'm completely against
any sort of blocking/throttling.  Maybe there's too many pages
to index?  Or the Message-IDs in URLs are too ugly/scary?  Not
sure what to do about that...

Anyways, I just put up a robots.txt with Crawl-Delay: 1, since I
seem to recall crawlers use a more conservative delay by default:

==> https://public-inbox.org/robots.txt <==
User-Agent: *
Crawl-Delay: 1


I don't know much about SEO other than keeping a site up and
responsive; so perhaps there's more to be done about getting
things indexed...


Re: BUG in git diff-index

2017-09-26 Thread Eric Wong
Marc Herbert  wrote:
> PS: I used NNTP and http://dir.gmane.org/gmane.comp.version-control.git
> to quickly find this old thread (what could we do without NNTP?). Then
> I googled for a web archive of this thread and Google could only find
> this one: 
> http://git.661346.n2.nabble.com/BUG-in-git-diff-index-tt7652105.html#none
> Is there a robots.txt to block indexing on
> https://public-inbox.org/git/1459432667.2124.2.ca...@dwim.me ?

There's no blocks on public-inbox.org and I'm completely against
any sort of blocking/throttling.  Maybe there's too many pages
to index?  Or the Message-IDs in URLs are too ugly/scary?  Not
sure what to do about that...

Anyways, I just put up a robots.txt with Crawl-Delay: 1, since I
seem to recall crawlers use a more conservative delay by default:

==> https://public-inbox.org/robots.txt <==
User-Agent: *
Crawl-Delay: 1


I don't know much about SEO other than keeping a site up and
responsive; so perhaps there's more to be done about getting
things indexed...


Re: epoll_wait inaccurate timeout

2016-12-12 Thread Eric Wong
+Cc folks who may know about timer stuff on epoll.

Dmitry Banschikov  wrote:
> Hi!
> 
> I have a problem caused by inaccurate timeouts in epoll_wait(2).
> Here are some parts of strace -tt output:

Which kernel version are you using?

> 22578 09:33:46.959791 epoll_wait(5,  
> 22578 09:33:50.010794 <... epoll_wait resumed> [], 128, 1498) = 0
> ...
> 22034 09:35:07.686896 epoll_wait(5,  
> 22034 09:35:09.482526 <... epoll_wait resumed> [{EPOLLIN,
> {u32=151458248, u64=151458248}}], 128, 362) = 1
> ...
> 22036 09:35:41.433241 epoll_wait(5,  
> 22036 09:35:43.176881 <... epoll_wait resumed> [], 128, 97) = 0
> 
> In each example epoll_wait is blocked for too longer then asked in timeout.
> 
> Is it normal?

I don't think so, unless you have a huge /proc//timerslack_ns
set.  But I mainly use -1 or 0 as the timeout value.

> Please CC me.

That's standard procedure, here :)


Re: epoll_wait inaccurate timeout

2016-12-12 Thread Eric Wong
+Cc folks who may know about timer stuff on epoll.

Dmitry Banschikov  wrote:
> Hi!
> 
> I have a problem caused by inaccurate timeouts in epoll_wait(2).
> Here are some parts of strace -tt output:

Which kernel version are you using?

> 22578 09:33:46.959791 epoll_wait(5,  
> 22578 09:33:50.010794 <... epoll_wait resumed> [], 128, 1498) = 0
> ...
> 22034 09:35:07.686896 epoll_wait(5,  
> 22034 09:35:09.482526 <... epoll_wait resumed> [{EPOLLIN,
> {u32=151458248, u64=151458248}}], 128, 362) = 1
> ...
> 22036 09:35:41.433241 epoll_wait(5,  
> 22036 09:35:43.176881 <... epoll_wait resumed> [], 128, 97) = 0
> 
> In each example epoll_wait is blocked for too longer then asked in timeout.
> 
> Is it normal?

I don't think so, unless you have a huge /proc//timerslack_ns
set.  But I mainly use -1 or 0 as the timeout value.

> Please CC me.

That's standard procedure, here :)


Re: BYD TouchPad driver (4.8.1) misdetects a Logitech mouse

2016-11-10 Thread Eric Wong
Sorry for the late response, forwarding to Dmitry and linux-input.
I haven't dealt with input in years...

Michael Shell  wrote:
> 
> Not a big deal, but something nonetheless ...
> 
> I recently upgraded my kernel from 4.3 to 4.8.1 (this is a Linux From
> Scratch build). There was only one obvious hiccup - X would not
> start because it no longer saw the mouse.
> 
> I have a Logitech RX250 PS2/USB mouse on the PS/2 port (via a USB->PS2
> adapter):
> 
> http://support.logitech.com/en_gb/product/rx250-mouse
> 
> The RX250 is a wheel mouse, but with an added left/right scroll wheel
> "tilt" feature to provide for horizontal scrolling.
> 
> In my xorg.conf I had:
> 
> 
> Section "InputDevice"
> Identifier  "Mouse0"
> Driver  "evdev"
> Option  "Name"  "ImExPS/2 Logitech Explorer Mouse"
> EndSection
> 
> 
> However, under the 4.8.1 kernel, cat /proc/bus/input/devices shows:
> 
> 
> I: Bus=0011 Vendor=0002 Product=0006 Version=0073
> N: Name="ImExPS/2 BYD TouchPad"
> P: Phys=isa0060/serio1/input0
> S: Sysfs=/devices/platform/i8042/serio1/input/input3
> U: Uniq=
> H: Handlers=mouse0 event2 
> B: PROP=1
> B: EV=7
> B: KEY=1f 0 0 0 0 0 0 0 0
> B: REL=143
> 
> 
> and dmesg shows the lines:
> 
> 
> psmouse serio1: logips2pp: Detected unknown Logitech mouse model 115
> input: ImExPS/2 BYD TouchPad as /devices/platform/i8042/serio1/input/input3
> 
> 
> Changing the mouse name in xorg.conf to "ImExPS/2 BYD TouchPad" did
> allow the mouse to work correctly under X, including the tilt buttons.
> FWIW, in the 4.8.1 kernel config, the byd.c driver does default to
> "Yes", a default which I accepted under make oldconfig.
> 
> Apparently, the detection code (byd_detect) in the new byd.c driver,
> which did not even exist in 4.3, (drivers/input/mouse/byd.c) falsely
> sees the Logitech RX250 as being a BYD TouchPad and thus alters the
> vendor and model names.
> 
> I do not know if it is also desirable to add an entry for a Logitech
> model 115 in get_model_info in logips2pp.c
> (drivers/input/mouse/logips2pp.c). But if so, I would be willing to
> test any such new code. Even as an unknown Logitech model, it does
> work just fine. I am curious as to how the tilt feature buttons would
> be declared in the model info (would these be PS2PP_EXTRA_BTN,
> or PS2PP_NAV_BTN, etc.?)
> 
> 
> 
>   Cheers and thanks in advance,
>   
>   Mike Shell
> 


Re: BYD TouchPad driver (4.8.1) misdetects a Logitech mouse

2016-11-10 Thread Eric Wong
Sorry for the late response, forwarding to Dmitry and linux-input.
I haven't dealt with input in years...

Michael Shell  wrote:
> 
> Not a big deal, but something nonetheless ...
> 
> I recently upgraded my kernel from 4.3 to 4.8.1 (this is a Linux From
> Scratch build). There was only one obvious hiccup - X would not
> start because it no longer saw the mouse.
> 
> I have a Logitech RX250 PS2/USB mouse on the PS/2 port (via a USB->PS2
> adapter):
> 
> http://support.logitech.com/en_gb/product/rx250-mouse
> 
> The RX250 is a wheel mouse, but with an added left/right scroll wheel
> "tilt" feature to provide for horizontal scrolling.
> 
> In my xorg.conf I had:
> 
> 
> Section "InputDevice"
> Identifier  "Mouse0"
> Driver  "evdev"
> Option  "Name"  "ImExPS/2 Logitech Explorer Mouse"
> EndSection
> 
> 
> However, under the 4.8.1 kernel, cat /proc/bus/input/devices shows:
> 
> 
> I: Bus=0011 Vendor=0002 Product=0006 Version=0073
> N: Name="ImExPS/2 BYD TouchPad"
> P: Phys=isa0060/serio1/input0
> S: Sysfs=/devices/platform/i8042/serio1/input/input3
> U: Uniq=
> H: Handlers=mouse0 event2 
> B: PROP=1
> B: EV=7
> B: KEY=1f 0 0 0 0 0 0 0 0
> B: REL=143
> 
> 
> and dmesg shows the lines:
> 
> 
> psmouse serio1: logips2pp: Detected unknown Logitech mouse model 115
> input: ImExPS/2 BYD TouchPad as /devices/platform/i8042/serio1/input/input3
> 
> 
> Changing the mouse name in xorg.conf to "ImExPS/2 BYD TouchPad" did
> allow the mouse to work correctly under X, including the tilt buttons.
> FWIW, in the 4.8.1 kernel config, the byd.c driver does default to
> "Yes", a default which I accepted under make oldconfig.
> 
> Apparently, the detection code (byd_detect) in the new byd.c driver,
> which did not even exist in 4.3, (drivers/input/mouse/byd.c) falsely
> sees the Logitech RX250 as being a BYD TouchPad and thus alters the
> vendor and model names.
> 
> I do not know if it is also desirable to add an entry for a Logitech
> model 115 in get_model_info in logips2pp.c
> (drivers/input/mouse/logips2pp.c). But if so, I would be willing to
> test any such new code. Even as an unknown Logitech model, it does
> work just fine. I am curious as to how the tilt feature buttons would
> be declared in the model info (would these be PS2PP_EXTRA_BTN,
> or PS2PP_NAV_BTN, etc.?)
> 
> 
> 
>   Cheers and thanks in advance,
>   
>   Mike Shell
> 


Re: patch mail rejected by vger.kernel.org

2016-08-11 Thread Eric Wong
Zhouyi Zhou  wrote:
> Hi,
>   Yesterday, I cced 5 patches to linux-kernel@vger.kernel.org using 
> send-email, but all of them were
> rejected by mail server at vger.kernel.org as follows: 
>   "Dear yizhouz...@ict.ac.cn, Your message to linux-kernel@vger.kernel.org 
> was rejected by the recipient domain. The error that the other server 
> returned was: " SMTP error, RCPT TO: 451 4.7.1 Hello [159.226.251.20], for 
> recipient address  the policy analysis 
> reported: zpostgrey: connect: Connection refused". Your original message is 
> included as attachment."
>   I have successfully posted to linux-kernel@vger.kernel.org previously. And 
> maintainers have received my patch
> correctly yesterday. Do you know what is going on?

+Cc: some folks who may actually know what's going on...

It looks like the zpostgrey instance goes down and is not
restarting.  I've noticed it on the git@vger list a few times
recently, too; but I'm not sure if any messages got rejected
for git@vger (we get a lot fewer messages than LKML).


Re: patch mail rejected by vger.kernel.org

2016-08-11 Thread Eric Wong
Zhouyi Zhou  wrote:
> Hi,
>   Yesterday, I cced 5 patches to linux-kernel@vger.kernel.org using 
> send-email, but all of them were
> rejected by mail server at vger.kernel.org as follows: 
>   "Dear yizhouz...@ict.ac.cn, Your message to linux-kernel@vger.kernel.org 
> was rejected by the recipient domain. The error that the other server 
> returned was: " SMTP error, RCPT TO: 451 4.7.1 Hello [159.226.251.20], for 
> recipient address  the policy analysis 
> reported: zpostgrey: connect: Connection refused". Your original message is 
> included as attachment."
>   I have successfully posted to linux-kernel@vger.kernel.org previously. And 
> maintainers have received my patch
> correctly yesterday. Do you know what is going on?

+Cc: some folks who may actually know what's going on...

It looks like the zpostgrey instance goes down and is not
restarting.  I've noticed it on the git@vger list a few times
recently, too; but I'm not sure if any messages got rejected
for git@vger (we get a lot fewer messages than LKML).


Re: [ANNOUNCE] git-series: track changes to a patch series over time

2016-08-11 Thread Eric Wong
Josh Triplett  wrote:
> On August 9, 2016 11:37:31 PM HST, Richard Ipsum 
>  wrote:
> 
> >Maybe there's a better solution to this problem than git-candidate
> >then,
> >maybe we can just invent some wonderful new subcommand that fetches
> >a mailing list archive into a git repo, for those that want that,
> >I don't know.
> 
> public-inbox seems to address that use case. I'd love to see a
> public-inbox version of LKML, with full history. I don't think
> that fully solves the review storage and interchange problem,
> but it seems like an *excellent* solution for email archiving,
> and for distribution of archives.

Thanks, I'd like to see an LKML version, too :)  First, I want
to ensure public-inbox can handle large repos better, first.
public-inbox.org/git has been doing well so far, even on a
low-end VM with 2 cores and 2GB RAM.

I don't have anything close to full history of LKML, and
download.gmane.org is down, right now :<  I'd use NNTP, but
news.gmane.org gets overloaded from slrnpull and I even got
temporarily banned there before discovering download.gmane
:x

Maybe I'll do what was done with linux.git in 2005 and just
ignore old mail for a while...


Re: [ANNOUNCE] git-series: track changes to a patch series over time

2016-08-11 Thread Eric Wong
Josh Triplett  wrote:
> On August 9, 2016 11:37:31 PM HST, Richard Ipsum 
>  wrote:
> 
> >Maybe there's a better solution to this problem than git-candidate
> >then,
> >maybe we can just invent some wonderful new subcommand that fetches
> >a mailing list archive into a git repo, for those that want that,
> >I don't know.
> 
> public-inbox seems to address that use case. I'd love to see a
> public-inbox version of LKML, with full history. I don't think
> that fully solves the review storage and interchange problem,
> but it seems like an *excellent* solution for email archiving,
> and for distribution of archives.

Thanks, I'd like to see an LKML version, too :)  First, I want
to ensure public-inbox can handle large repos better, first.
public-inbox.org/git has been doing well so far, even on a
low-end VM with 2 cores and 2GB RAM.

I don't have anything close to full history of LKML, and
download.gmane.org is down, right now :<  I'd use NNTP, but
news.gmane.org gets overloaded from slrnpull and I even got
temporarily banned there before discovering download.gmane
:x

Maybe I'll do what was done with linux.git in 2005 and just
ignore old mail for a while...


Re: [ANNOUNCE] git-series: track changes to a patch series over time

2016-08-01 Thread Eric Wong
Josh Triplett <j...@joshtriplett.org> wrote:
> On Mon, Aug 01, 2016 at 07:55:54AM +0000, Eric Wong wrote:
> > Christian Couder <christian.cou...@gmail.com> wrote:
> > > On Fri, Jul 29, 2016 at 12:10 PM, Richard Ipsum
> > > <richard.ip...@codethink.co.uk> wrote:
> > > > On Thu, Jul 28, 2016 at 11:40:55PM -0700, Josh Triplett wrote:
> > > > [snip]
> > > >>
> > > >> I'd welcome any feedback, whether on the interface and workflow, the
> > > >> internals and collaboration, ideas on presenting diffs of patch series,
> > > >> or anything else.
> > 
> > > > I'm particularly interested in trying to establish a standard for
> > > > storing review data in git. I've got a prototype for doing that[3],
> > > > and an example tool that uses it[4]. The tool is still incomplete/buggy 
> > > > though.

> > I'm not convinced another format/standard is needed besides the
> > email workflow we already use for git and kernel development.
> 
> Not all projects use a patches-by-email workflow, or want to.  To the
> extent that tools and projects use some other workflow, standardizing
> the format they use to store patch reviews (including per-line
> annotations, approvals, test results, etc) seems preferable to having
> each tool use its own custom format.

I think standardizing on email conventions (such as what we
already do with format-patch, request-pull, S-o-b trailers) would
be a step in this direction and a good step to take.

But yeah, I also hope git adopters can somehow be convinced to
also adopt the workflow that built git itself.

> > I also see the reliance on an after-the-fact search engine
> > (which can be tuned/replaced) as philosophically inline with
> > what git does, too, such as not having rename tracking and
> > doing delayed deltafication.
> 
> Storing review data in git doesn't mean it needs to end up in the
> history of the project itself; it can use after-the-fact annotations on
> a commit.

Right.  So on public-inbox.org/git today, one could search for
after-the-fact annotations based on commit titles and maybe
exact commit ID matches.

A future goal might be to get search indexing working on commit
ID substrings.  So finding references to commit
deadbeefcafe01234567890123467890abcdef00 could be done by
searching for "commit deadbeefcafe" or even a shorter ID, and
the following results could still be returned:

  1. commit deadbeefcafe broke my cat feeder
  2. commit deadbeef killed my cow

> > Email also has the advantage of having existing tooling, and
> > being (at least for now) federated without a single point of
> > failure.
> 
> Storing review data in git makes it easy to push and pull it, which can
> provide the basis for a federated system.

Every public-inbox exposed over HTTP(S) is git clonable[1], so
it's possible to push/pull or have developers merge/combine
inboxes with index-only operations.  There's no UI for that,
yet, and having a working tree checked out is inefficient with
300K uncompressed mails...

But there needs to be way to message others about the existence
of new pushes/pull-requests/reviews/etc; including users
unable to clone or host 800M git repos; so that messaging
system might as well be email.



[1] git clone --mirror https://public-inbox.org/git/
That's not efficient, yet, though, at around 800M when the
gzipped fast-export dump is around half that:
https://public-inbox.org/git/20160710034745.ga20...@dcvr.yhbt.net/T/#u


Re: [ANNOUNCE] git-series: track changes to a patch series over time

2016-08-01 Thread Eric Wong
Josh Triplett  wrote:
> On Mon, Aug 01, 2016 at 07:55:54AM +0000, Eric Wong wrote:
> > Christian Couder  wrote:
> > > On Fri, Jul 29, 2016 at 12:10 PM, Richard Ipsum
> > >  wrote:
> > > > On Thu, Jul 28, 2016 at 11:40:55PM -0700, Josh Triplett wrote:
> > > > [snip]
> > > >>
> > > >> I'd welcome any feedback, whether on the interface and workflow, the
> > > >> internals and collaboration, ideas on presenting diffs of patch series,
> > > >> or anything else.
> > 
> > > > I'm particularly interested in trying to establish a standard for
> > > > storing review data in git. I've got a prototype for doing that[3],
> > > > and an example tool that uses it[4]. The tool is still incomplete/buggy 
> > > > though.

> > I'm not convinced another format/standard is needed besides the
> > email workflow we already use for git and kernel development.
> 
> Not all projects use a patches-by-email workflow, or want to.  To the
> extent that tools and projects use some other workflow, standardizing
> the format they use to store patch reviews (including per-line
> annotations, approvals, test results, etc) seems preferable to having
> each tool use its own custom format.

I think standardizing on email conventions (such as what we
already do with format-patch, request-pull, S-o-b trailers) would
be a step in this direction and a good step to take.

But yeah, I also hope git adopters can somehow be convinced to
also adopt the workflow that built git itself.

> > I also see the reliance on an after-the-fact search engine
> > (which can be tuned/replaced) as philosophically inline with
> > what git does, too, such as not having rename tracking and
> > doing delayed deltafication.
> 
> Storing review data in git doesn't mean it needs to end up in the
> history of the project itself; it can use after-the-fact annotations on
> a commit.

Right.  So on public-inbox.org/git today, one could search for
after-the-fact annotations based on commit titles and maybe
exact commit ID matches.

A future goal might be to get search indexing working on commit
ID substrings.  So finding references to commit
deadbeefcafe01234567890123467890abcdef00 could be done by
searching for "commit deadbeefcafe" or even a shorter ID, and
the following results could still be returned:

  1. commit deadbeefcafe broke my cat feeder
  2. commit deadbeef killed my cow

> > Email also has the advantage of having existing tooling, and
> > being (at least for now) federated without a single point of
> > failure.
> 
> Storing review data in git makes it easy to push and pull it, which can
> provide the basis for a federated system.

Every public-inbox exposed over HTTP(S) is git clonable[1], so
it's possible to push/pull or have developers merge/combine
inboxes with index-only operations.  There's no UI for that,
yet, and having a working tree checked out is inefficient with
300K uncompressed mails...

But there needs to be way to message others about the existence
of new pushes/pull-requests/reviews/etc; including users
unable to clone or host 800M git repos; so that messaging
system might as well be email.



[1] git clone --mirror https://public-inbox.org/git/
That's not efficient, yet, though, at around 800M when the
gzipped fast-export dump is around half that:
https://public-inbox.org/git/20160710034745.ga20...@dcvr.yhbt.net/T/#u


Re: [ANNOUNCE] git-series: track changes to a patch series over time

2016-08-01 Thread Eric Wong
Christian Couder  wrote:
> On Fri, Jul 29, 2016 at 12:10 PM, Richard Ipsum
>  wrote:
> > On Thu, Jul 28, 2016 at 11:40:55PM -0700, Josh Triplett wrote:
> > [snip]
> >>
> >> I'd welcome any feedback, whether on the interface and workflow, the
> >> internals and collaboration, ideas on presenting diffs of patch series,
> >> or anything else.

> > I'm particularly interested in trying to establish a standard for
> > storing review data in git. I've got a prototype for doing that[3],
> > and an example tool that uses it[4]. The tool is still incomplete/buggy 
> > though.
> 
> There is also git-appraise (https://github.com/google/git-appraise)
> written in Go to store code review data in Git.
> It looks like it stores its data in git notes and can be integrated
> with Rust (https://github.com/Nemo157/git-appraise-rs).

I'm not convinced another format/standard is needed besides the
email workflow we already use for git and kernel development.

Rather, better ways to archive/search the emails is desirable.
Fortunately, commit titles are rather unique :)

I started archiving the git ML with public-inbox (which uses git):

  https://public-inbox.org/git/20160710004813.ga20...@dcvr.yhbt.net/T/

It can be easy to search by Subject (commit titles):

  https://public-inbox.org/git/?q=s:%22more+archives+of+this+list%22

Search (currently Xapian) will be tuned to parse things like
filenames and diffs to allow searching within those.  It is
already somewhat email-aware, such as deprioritizing quoted
text; and having a code repository browser with mail archive
integration is in the works.

I also see the reliance on an after-the-fact search engine
(which can be tuned/replaced) as philosophically inline with
what git does, too, such as not having rename tracking and
doing delayed deltafication.

Email also has the advantage of having existing tooling, and
being (at least for now) federated without a single point of
failure.

vger.kernel.org can still be a major point of failure, which is
why the "archives first" approach of public-inbox favors readers
pulling messages over NNTP/HTTP/git (and maybe soon, POP3).


Re: [ANNOUNCE] git-series: track changes to a patch series over time

2016-08-01 Thread Eric Wong
Christian Couder  wrote:
> On Fri, Jul 29, 2016 at 12:10 PM, Richard Ipsum
>  wrote:
> > On Thu, Jul 28, 2016 at 11:40:55PM -0700, Josh Triplett wrote:
> > [snip]
> >>
> >> I'd welcome any feedback, whether on the interface and workflow, the
> >> internals and collaboration, ideas on presenting diffs of patch series,
> >> or anything else.

> > I'm particularly interested in trying to establish a standard for
> > storing review data in git. I've got a prototype for doing that[3],
> > and an example tool that uses it[4]. The tool is still incomplete/buggy 
> > though.
> 
> There is also git-appraise (https://github.com/google/git-appraise)
> written in Go to store code review data in Git.
> It looks like it stores its data in git notes and can be integrated
> with Rust (https://github.com/Nemo157/git-appraise-rs).

I'm not convinced another format/standard is needed besides the
email workflow we already use for git and kernel development.

Rather, better ways to archive/search the emails is desirable.
Fortunately, commit titles are rather unique :)

I started archiving the git ML with public-inbox (which uses git):

  https://public-inbox.org/git/20160710004813.ga20...@dcvr.yhbt.net/T/

It can be easy to search by Subject (commit titles):

  https://public-inbox.org/git/?q=s:%22more+archives+of+this+list%22

Search (currently Xapian) will be tuned to parse things like
filenames and diffs to allow searching within those.  It is
already somewhat email-aware, such as deprioritizing quoted
text; and having a code repository browser with mail archive
integration is in the works.

I also see the reliance on an after-the-fact search engine
(which can be tuned/replaced) as philosophically inline with
what git does, too, such as not having rename tracking and
doing delayed deltafication.

Email also has the advantage of having existing tooling, and
being (at least for now) federated without a single point of
failure.

vger.kernel.org can still be a major point of failure, which is
why the "archives first" approach of public-inbox favors readers
pulling messages over NNTP/HTTP/git (and maybe soon, POP3).


Re: List corruption on epoll_ctl(EPOLL_CTL_DEL) an AF_UNIX socket

2015-09-13 Thread Eric Wong
+cc Jason Baron since he might be able to provide more insight into
epoll.

Mathias Krause  wrote:
> Hi,
> 
> this is an attempt to resurrect the thread initially started here:
> 
>   http://thread.gmane.org/gmane.linux.network/353003
> 
> As that patch fixed the issue for the mentioned reproducer, it did not
> fix the bug for the production code Olivier is using. :(
> 
> Changing the reproducer only slightly allows me to trigger the following
> list debug splat (CONFIG_DEBUG_LIST=y) reliable within seconds -- even
> with the above linked patch applied:
> 
> [   50.264249] [ cut here ]
> [   50.264249] WARNING: CPU: 0 PID: 214 at lib/list_debug.c:59 
> __list_del_entry+0xa4/0xd0()
> [   50.264249] list_del corruption. prev->next should be 88003c2c1bb8, 
> but was 88003f07bbb8
> [   50.264249] Modules linked in: ipv6 pcspkr serio_raw microcode virtio_net 
> virtio_pci virtio_ring virtio sr_mod cdrom
> [   50.264249] CPU: 0 PID: 214 Comm: epoll_bug Not tainted 4.2.0 #75
> [   50.264249] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> 1.7.5-20140531_083030-gandalf 04/01/2014
> [   50.264249]  817e902e 88087d08 8155593c 
> 0007
> [   50.264249]  88087d58 88087d48 8105202a 
> 0001
> [   50.264249]  88003c2c1bb8 88003f07bb80 0286 
> 88003f736640
> [   50.264249] Call Trace:
> [   50.264249]  [] dump_stack+0x4c/0x65
> [   50.264249]  [] warn_slowpath_common+0x8a/0xc0
> [   50.264249]  [] warn_slowpath_fmt+0x46/0x50
> [   50.264249]  [] __list_del_entry+0xa4/0xd0
> [   50.264249]  [] list_del+0x11/0x40
> [   50.264249]  [] remove_wait_queue+0x29/0x40
> [   50.264249]  [] ep_unregister_pollwait.isra.6+0x58/0x1a0
> [   50.264249]  [] ? 
> ep_unregister_pollwait.isra.6+0xa9/0x1a0
> [   50.264249]  [] ep_remove+0x22/0x110
> [   50.264249]  [] SyS_epoll_ctl+0x62b/0xf70
> [   50.264249]  [] ? lockdep_sys_exit_thunk+0x12/0x14
> [   50.264249]  [] entry_SYSCALL_64_fastpath+0x12/0x6f
> [   50.264249] ---[ end trace d9af9b915df9667e ]---
> [   50.572100] [ cut here ]
> [   50.572100] WARNING: CPU: 1 PID: 212 at lib/list_debug.c:62 
> __list_del_entry+0xc3/0xd0()
> [   50.584263] list_del corruption. next->prev should be 88003f664c90, 
> but was 88003f0cb5b8
> [   50.584263] Modules linked in: ipv6 pcspkr serio_raw microcode virtio_net 
> virtio_pci virtio_ring virtio sr_mod cdrom
> [   50.584263] CPU: 1 PID: 212 Comm: epoll_bug Tainted: GW   
> 4.2.0 #75
> [   50.584263] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> 1.7.5-20140531_083030-gandalf 04/01/2014
> [   50.584263]  817e902e 88003d37fce8 8155593c 
> 0007
> [   50.584263]  88003d37fd38 88003d37fd28 8105202a 
> 0001
> [   50.584263]  88003f664c90 88003f0cb580 0282 
> 88003f731740
> [   50.584263] Call Trace:
> [   50.584263]  [] dump_stack+0x4c/0x65
> [   50.584263]  [] warn_slowpath_common+0x8a/0xc0
> [   50.584263]  [] warn_slowpath_fmt+0x46/0x50
> [   50.584263]  [] __list_del_entry+0xc3/0xd0
> [   50.584263]  [] list_del+0x11/0x40
> [   50.584263]  [] remove_wait_queue+0x29/0x40
> [   50.584263]  [] ep_unregister_pollwait.isra.6+0x58/0x1a0
> [   50.584263]  [] ? 
> ep_unregister_pollwait.isra.6+0xa9/0x1a0
> [   50.584263]  [] ep_remove+0x22/0x110
> [   50.584263]  [] eventpoll_release_file+0x62/0xa0
> [   50.584263]  [] __fput+0x1af/0x200
> [   50.584263]  [] ? int_very_careful+0x5/0x3f
> [   50.584263]  [] fput+0xe/0x10
> [   50.584263]  [] task_work_run+0x8d/0xc0
> [   50.584263]  [] do_notify_resume+0x4f/0x60
> [   50.584263]  [] int_signal+0x12/0x17
> [   50.584263] ---[ end trace d9af9b915df9667f ]---
> [   50.584263] BUG: spinlock already unlocked on CPU#1, epoll_bug/212
> [   50.584263]  lock: 0x88003f0cb580, .magic: dead4ead, .owner: 
> /-1, .owner_cpu: -1
> [   50.584263] CPU: 1 PID: 212 Comm: epoll_bug Tainted: GW   
> 4.2.0 #75
> [   50.584263] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> 1.7.5-20140531_083030-gandalf 04/01/2014
> [   50.584263]  88003f0cb580 88003d37fd38 8155593c 
> 0007
> [   50.584263]   88003d37fd58 810a3375 
> 88003f0cb580
> [   50.584263]  817b9cc8 88003d37fd78 810a33f6 
> 88003f0cb580
> [   50.584263] Call Trace:
> [   50.584263]  [] dump_stack+0x4c/0x65
> [   50.584263]  [] spin_dump+0x85/0xe0
> [   50.584263]  [] spin_bug+0x26/0x30
> [   50.584263]  [] do_raw_spin_unlock+0x75/0xa0
> [   50.584263]  [] _raw_spin_unlock_irqrestore+0x2c/0x50
> [   50.584263]  [] remove_wait_queue+0x34/0x40
> [   50.584263]  [] ep_unregister_pollwait.isra.6+0x58/0x1a0
> [   50.584263]  [] ? 
> ep_unregister_pollwait.isra.6+0xa9/0x1a0
> [   50.584263]  [] ep_remove+0x22/0x110
> [   50.584263]  [] eventpoll_release_file+0x62/0xa0
> [   50.584263]  

Re: List corruption on epoll_ctl(EPOLL_CTL_DEL) an AF_UNIX socket

2015-09-13 Thread Eric Wong
+cc Jason Baron since he might be able to provide more insight into
epoll.

Mathias Krause  wrote:
> Hi,
> 
> this is an attempt to resurrect the thread initially started here:
> 
>   http://thread.gmane.org/gmane.linux.network/353003
> 
> As that patch fixed the issue for the mentioned reproducer, it did not
> fix the bug for the production code Olivier is using. :(
> 
> Changing the reproducer only slightly allows me to trigger the following
> list debug splat (CONFIG_DEBUG_LIST=y) reliable within seconds -- even
> with the above linked patch applied:
> 
> [   50.264249] [ cut here ]
> [   50.264249] WARNING: CPU: 0 PID: 214 at lib/list_debug.c:59 
> __list_del_entry+0xa4/0xd0()
> [   50.264249] list_del corruption. prev->next should be 88003c2c1bb8, 
> but was 88003f07bbb8
> [   50.264249] Modules linked in: ipv6 pcspkr serio_raw microcode virtio_net 
> virtio_pci virtio_ring virtio sr_mod cdrom
> [   50.264249] CPU: 0 PID: 214 Comm: epoll_bug Not tainted 4.2.0 #75
> [   50.264249] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> 1.7.5-20140531_083030-gandalf 04/01/2014
> [   50.264249]  817e902e 88087d08 8155593c 
> 0007
> [   50.264249]  88087d58 88087d48 8105202a 
> 0001
> [   50.264249]  88003c2c1bb8 88003f07bb80 0286 
> 88003f736640
> [   50.264249] Call Trace:
> [   50.264249]  [] dump_stack+0x4c/0x65
> [   50.264249]  [] warn_slowpath_common+0x8a/0xc0
> [   50.264249]  [] warn_slowpath_fmt+0x46/0x50
> [   50.264249]  [] __list_del_entry+0xa4/0xd0
> [   50.264249]  [] list_del+0x11/0x40
> [   50.264249]  [] remove_wait_queue+0x29/0x40
> [   50.264249]  [] ep_unregister_pollwait.isra.6+0x58/0x1a0
> [   50.264249]  [] ? 
> ep_unregister_pollwait.isra.6+0xa9/0x1a0
> [   50.264249]  [] ep_remove+0x22/0x110
> [   50.264249]  [] SyS_epoll_ctl+0x62b/0xf70
> [   50.264249]  [] ? lockdep_sys_exit_thunk+0x12/0x14
> [   50.264249]  [] entry_SYSCALL_64_fastpath+0x12/0x6f
> [   50.264249] ---[ end trace d9af9b915df9667e ]---
> [   50.572100] [ cut here ]
> [   50.572100] WARNING: CPU: 1 PID: 212 at lib/list_debug.c:62 
> __list_del_entry+0xc3/0xd0()
> [   50.584263] list_del corruption. next->prev should be 88003f664c90, 
> but was 88003f0cb5b8
> [   50.584263] Modules linked in: ipv6 pcspkr serio_raw microcode virtio_net 
> virtio_pci virtio_ring virtio sr_mod cdrom
> [   50.584263] CPU: 1 PID: 212 Comm: epoll_bug Tainted: GW   
> 4.2.0 #75
> [   50.584263] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> 1.7.5-20140531_083030-gandalf 04/01/2014
> [   50.584263]  817e902e 88003d37fce8 8155593c 
> 0007
> [   50.584263]  88003d37fd38 88003d37fd28 8105202a 
> 0001
> [   50.584263]  88003f664c90 88003f0cb580 0282 
> 88003f731740
> [   50.584263] Call Trace:
> [   50.584263]  [] dump_stack+0x4c/0x65
> [   50.584263]  [] warn_slowpath_common+0x8a/0xc0
> [   50.584263]  [] warn_slowpath_fmt+0x46/0x50
> [   50.584263]  [] __list_del_entry+0xc3/0xd0
> [   50.584263]  [] list_del+0x11/0x40
> [   50.584263]  [] remove_wait_queue+0x29/0x40
> [   50.584263]  [] ep_unregister_pollwait.isra.6+0x58/0x1a0
> [   50.584263]  [] ? 
> ep_unregister_pollwait.isra.6+0xa9/0x1a0
> [   50.584263]  [] ep_remove+0x22/0x110
> [   50.584263]  [] eventpoll_release_file+0x62/0xa0
> [   50.584263]  [] __fput+0x1af/0x200
> [   50.584263]  [] ? int_very_careful+0x5/0x3f
> [   50.584263]  [] fput+0xe/0x10
> [   50.584263]  [] task_work_run+0x8d/0xc0
> [   50.584263]  [] do_notify_resume+0x4f/0x60
> [   50.584263]  [] int_signal+0x12/0x17
> [   50.584263] ---[ end trace d9af9b915df9667f ]---
> [   50.584263] BUG: spinlock already unlocked on CPU#1, epoll_bug/212
> [   50.584263]  lock: 0x88003f0cb580, .magic: dead4ead, .owner: 
> /-1, .owner_cpu: -1
> [   50.584263] CPU: 1 PID: 212 Comm: epoll_bug Tainted: GW   
> 4.2.0 #75
> [   50.584263] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> 1.7.5-20140531_083030-gandalf 04/01/2014
> [   50.584263]  88003f0cb580 88003d37fd38 8155593c 
> 0007
> [   50.584263]   88003d37fd58 810a3375 
> 88003f0cb580
> [   50.584263]  817b9cc8 88003d37fd78 810a33f6 
> 88003f0cb580
> [   50.584263] Call Trace:
> [   50.584263]  [] dump_stack+0x4c/0x65
> [   50.584263]  [] spin_dump+0x85/0xe0
> [   50.584263]  [] spin_bug+0x26/0x30
> [   50.584263]  [] do_raw_spin_unlock+0x75/0xa0
> [   50.584263]  [] _raw_spin_unlock_irqrestore+0x2c/0x50
> [   50.584263]  [] remove_wait_queue+0x34/0x40
> [   50.584263]  [] ep_unregister_pollwait.isra.6+0x58/0x1a0
> [   50.584263]  [] ? 
> ep_unregister_pollwait.isra.6+0xa9/0x1a0
> [   50.584263]  [] ep_remove+0x22/0x110
> [   50.584263]  [] 

Re: epoll and multiple processes - eliminate unneeded process wake-ups

2015-08-03 Thread Eric Wong
Madars Vitolins  wrote:
> Hi Folks,
> 
> I am developing kind of open systems application, which uses
> multiple processes/executables where each of them monitors some set
> of resources (in this case POSIX Queues) via epoll interface. For
> example when 10 processes on same queue are in state of epoll_wait()
> and one message arrives, all 10 processes gets woken up and all of
> them tries to read the message from Q. One succeeds, the others gets
> EAGAIN error. The problem is with those others, which generates
> extra context switches - useless CPU usage. With more processes
> inefficiency gets higher.
> 
> I tried to use EPOLLONESHOT, but no help. Seems this is suitable for
> multi-threaded application and not for multi-process application.

Correct.  Most FDs are not shared across processes.

> Ideal mechanism for this would be:
> 1. If multiple epoll sets in kernel matches same event and one or
> more processes are in state of epoll_wait() - then send event only
> to one waiter.
> 2. If none of processes are in wait state, then send the event to
> all epoll sets (as it is currently). Then the first free process
> will grab the event.

Jason Baron was working on this (search LKML archives for
EPOLLEXCLUSIVE, EPOLLROUNDROBIN, EPOLL_ROTATE)

However, I was unconvinced about modifying epoll.

Perhaps I may be more easily convinced about your mqueue case than his
case for listen sockets, though[*]

Typical applications have few (probably only one) listen sockets or
POSIX mqueues; so I would rather use dedicated threads to issue
blocking syscalls (accept4 or mq_timedreceive).

Making blocking syscalls allows exclusive wakeups to avoid thundering
herds.

> How do you think, would it be real to implement this? How about
> concurrency?
> Can you please give me some hints from which points in code to start
> to implement these changes?

For now, I suggest dedicating a thread in each process to do
mq_timedreceive/mq_receive, assuming you only have a small amount
of queues in your system.


[*] mq_timedreceive may copy a largish buffer which benefits from
staying on the same CPU as much as possible.
Contrary, accept4 only creates a client socket.  With a C10K+
socket server (e.g. http/memcached/DB), a typical new client
socket spends a fair amount of time idle.  Thus I don't believe
memory locality inside the kernel is much concern when there's
thousands of accepted client sockets.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: epoll and multiple processes - eliminate unneeded process wake-ups

2015-08-03 Thread Eric Wong
Madars Vitolins m...@silodev.com wrote:
 Hi Folks,
 
 I am developing kind of open systems application, which uses
 multiple processes/executables where each of them monitors some set
 of resources (in this case POSIX Queues) via epoll interface. For
 example when 10 processes on same queue are in state of epoll_wait()
 and one message arrives, all 10 processes gets woken up and all of
 them tries to read the message from Q. One succeeds, the others gets
 EAGAIN error. The problem is with those others, which generates
 extra context switches - useless CPU usage. With more processes
 inefficiency gets higher.
 
 I tried to use EPOLLONESHOT, but no help. Seems this is suitable for
 multi-threaded application and not for multi-process application.

Correct.  Most FDs are not shared across processes.

 Ideal mechanism for this would be:
 1. If multiple epoll sets in kernel matches same event and one or
 more processes are in state of epoll_wait() - then send event only
 to one waiter.
 2. If none of processes are in wait state, then send the event to
 all epoll sets (as it is currently). Then the first free process
 will grab the event.

Jason Baron was working on this (search LKML archives for
EPOLLEXCLUSIVE, EPOLLROUNDROBIN, EPOLL_ROTATE)

However, I was unconvinced about modifying epoll.

Perhaps I may be more easily convinced about your mqueue case than his
case for listen sockets, though[*]

Typical applications have few (probably only one) listen sockets or
POSIX mqueues; so I would rather use dedicated threads to issue
blocking syscalls (accept4 or mq_timedreceive).

Making blocking syscalls allows exclusive wakeups to avoid thundering
herds.

 How do you think, would it be real to implement this? How about
 concurrency?
 Can you please give me some hints from which points in code to start
 to implement these changes?

For now, I suggest dedicating a thread in each process to do
mq_timedreceive/mq_receive, assuming you only have a small amount
of queues in your system.


[*] mq_timedreceive may copy a largish buffer which benefits from
staying on the same CPU as much as possible.
Contrary, accept4 only creates a client socket.  With a C10K+
socket server (e.g. http/memcached/DB), a typical new client
socket spends a fair amount of time idle.  Thus I don't believe
memory locality inside the kernel is much concern when there's
thousands of accepted client sockets.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] ALSA: usb-audio: don't try to get Outlaw RR2150 sample rate

2015-05-30 Thread Eric Wong
This quirk allows us to avoid the noisy:

current rate 0 is different from the runtime rate

message every time playback starts.  While USB DAC in the RR2150
supports reading the sample rate, it never returns a sample rate
other than zero in my observation with common sample rates.

Signed-off-by: Eric Wong 
Cc: Joe Turner 
Cc: Takashi Iwai 
---
 sound/usb/quirks.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/sound/usb/quirks.c b/sound/usb/quirks.c
index 32631a8..2ea5b33 100644
--- a/sound/usb/quirks.c
+++ b/sound/usb/quirks.c
@@ -1118,6 +1118,7 @@ bool snd_usb_get_sample_rate_quirk(struct snd_usb_audio 
*chip)
case USB_ID(0x045E, 0x075D): /* MS Lifecam Cinema  */
case USB_ID(0x045E, 0x076D): /* MS Lifecam HD-5000 */
case USB_ID(0x04D8, 0xFEEA): /* Benchmark DAC1 Pre */
+   case USB_ID(0x074D, 0x3553): /* Outlaw RR2150 (Micronas UAC3553B) */
return true;
}
return false;
-- 
EW

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] ALSA: usb-audio: don't try to get Outlaw RR2150 sample rate

2015-05-30 Thread Eric Wong
This quirk allows us to avoid the noisy:

current rate 0 is different from the runtime rate

message every time playback starts.  While USB DAC in the RR2150
supports reading the sample rate, it never returns a sample rate
other than zero in my observation with common sample rates.

Signed-off-by: Eric Wong normalper...@yhbt.net
Cc: Joe Turner j...@oampo.co.uk
Cc: Takashi Iwai ti...@suse.de
---
 sound/usb/quirks.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/sound/usb/quirks.c b/sound/usb/quirks.c
index 32631a8..2ea5b33 100644
--- a/sound/usb/quirks.c
+++ b/sound/usb/quirks.c
@@ -1118,6 +1118,7 @@ bool snd_usb_get_sample_rate_quirk(struct snd_usb_audio 
*chip)
case USB_ID(0x045E, 0x075D): /* MS Lifecam Cinema  */
case USB_ID(0x045E, 0x076D): /* MS Lifecam HD-5000 */
case USB_ID(0x04D8, 0xFEEA): /* Benchmark DAC1 Pre */
+   case USB_ID(0x074D, 0x3553): /* Outlaw RR2150 (Micronas UAC3553B) */
return true;
}
return false;
-- 
EW

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC] net: support SOCK_DONTWAIT for non-blocking accept4

2015-05-12 Thread Eric Wong
It may occasionally be useful to share a listen socket between two
or more tasks with different processing models (blocking and
non-blocking).

This may happen during a software upgrade when an older process
(using blocking I/O) shares the listen socket with a new process
which wants to use non-blocking I/O, but still provides a path for
the old process to fall back to blocking I/O and avoiding poll()
for exclusive wakeup if the upgrade does not work out.

Proposed manpage addtion:

  SOCK_DONTWAIT

  Enable non-blocking operation on the listen socket for this call only.
  Unlike SOCK_NONBLOCK, this does not affect the accepted socket, nor does
  it change the file status flag of the listen socket for other calls.

Signed-off-by: Eric Wong 
---
  RFC since this seems a bit esoteric, and I'm not sure if it'd be
  useful to others.  I've certainly wished I've had it a few times
  along with an opposite SOCK_MUSTWAIT flag to ignore O_NONBLOCK on
  a listen socket.

 include/linux/net.h |  3 +++
 net/socket.c| 10 --
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/include/linux/net.h b/include/linux/net.h
index 17d8339..9a71654 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -76,6 +76,9 @@ enum sock_type {
 #ifndef SOCK_NONBLOCK
 #define SOCK_NONBLOCK  O_NONBLOCK
 #endif
+#ifndef SOCK_DONTWAIT
+#define SOCK_DONTWAIT  MSG_DONTWAIT
+#endif
 
 #endif /* ARCH_HAS_SOCKET_TYPES */
 
diff --git a/net/socket.c b/net/socket.c
index 245330c..52395df 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -1294,6 +1294,7 @@ SYSCALL_DEFINE3(socket, int, family, int, type, int, 
protocol)
BUILD_BUG_ON((SOCK_MAX | SOCK_TYPE_MASK) != SOCK_TYPE_MASK);
BUILD_BUG_ON(SOCK_CLOEXEC & SOCK_TYPE_MASK);
BUILD_BUG_ON(SOCK_NONBLOCK & SOCK_TYPE_MASK);
+   BUILD_BUG_ON(SOCK_DONTWAIT & SOCK_TYPE_MASK);
 
flags = type & ~SOCK_TYPE_MASK;
if (flags & ~(SOCK_CLOEXEC | SOCK_NONBLOCK))
@@ -1502,8 +1503,9 @@ SYSCALL_DEFINE4(accept4, int, fd, struct sockaddr __user 
*, upeer_sockaddr,
struct file *newfile;
int err, len, newfd, fput_needed;
struct sockaddr_storage address;
+   int f_flags;
 
-   if (flags & ~(SOCK_CLOEXEC | SOCK_NONBLOCK))
+   if (flags & ~(SOCK_CLOEXEC | SOCK_NONBLOCK | SOCK_DONTWAIT))
return -EINVAL;
 
if (SOCK_NONBLOCK != O_NONBLOCK && (flags & SOCK_NONBLOCK))
@@ -1513,6 +1515,10 @@ SYSCALL_DEFINE4(accept4, int, fd, struct sockaddr __user 
*, upeer_sockaddr,
if (!sock)
goto out;
 
+   f_flags = sock->file->f_flags;
+   if (flags & SOCK_DONTWAIT)
+   f_flags |= O_NONBLOCK;
+
err = -ENFILE;
newsock = sock_alloc();
if (!newsock)
@@ -1545,7 +1551,7 @@ SYSCALL_DEFINE4(accept4, int, fd, struct sockaddr __user 
*, upeer_sockaddr,
if (err)
goto out_fd;
 
-   err = sock->ops->accept(sock, newsock, sock->file->f_flags);
+   err = sock->ops->accept(sock, newsock, f_flags);
if (err < 0)
goto out_fd;
 
-- 
EW
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC] net: support SOCK_DONTWAIT for non-blocking accept4

2015-05-12 Thread Eric Wong
It may occasionally be useful to share a listen socket between two
or more tasks with different processing models (blocking and
non-blocking).

This may happen during a software upgrade when an older process
(using blocking I/O) shares the listen socket with a new process
which wants to use non-blocking I/O, but still provides a path for
the old process to fall back to blocking I/O and avoiding poll()
for exclusive wakeup if the upgrade does not work out.

Proposed manpage addtion:

  SOCK_DONTWAIT

  Enable non-blocking operation on the listen socket for this call only.
  Unlike SOCK_NONBLOCK, this does not affect the accepted socket, nor does
  it change the file status flag of the listen socket for other calls.

Signed-off-by: Eric Wong normalper...@yhbt.net
---
  RFC since this seems a bit esoteric, and I'm not sure if it'd be
  useful to others.  I've certainly wished I've had it a few times
  along with an opposite SOCK_MUSTWAIT flag to ignore O_NONBLOCK on
  a listen socket.

 include/linux/net.h |  3 +++
 net/socket.c| 10 --
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/include/linux/net.h b/include/linux/net.h
index 17d8339..9a71654 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -76,6 +76,9 @@ enum sock_type {
 #ifndef SOCK_NONBLOCK
 #define SOCK_NONBLOCK  O_NONBLOCK
 #endif
+#ifndef SOCK_DONTWAIT
+#define SOCK_DONTWAIT  MSG_DONTWAIT
+#endif
 
 #endif /* ARCH_HAS_SOCKET_TYPES */
 
diff --git a/net/socket.c b/net/socket.c
index 245330c..52395df 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -1294,6 +1294,7 @@ SYSCALL_DEFINE3(socket, int, family, int, type, int, 
protocol)
BUILD_BUG_ON((SOCK_MAX | SOCK_TYPE_MASK) != SOCK_TYPE_MASK);
BUILD_BUG_ON(SOCK_CLOEXEC  SOCK_TYPE_MASK);
BUILD_BUG_ON(SOCK_NONBLOCK  SOCK_TYPE_MASK);
+   BUILD_BUG_ON(SOCK_DONTWAIT  SOCK_TYPE_MASK);
 
flags = type  ~SOCK_TYPE_MASK;
if (flags  ~(SOCK_CLOEXEC | SOCK_NONBLOCK))
@@ -1502,8 +1503,9 @@ SYSCALL_DEFINE4(accept4, int, fd, struct sockaddr __user 
*, upeer_sockaddr,
struct file *newfile;
int err, len, newfd, fput_needed;
struct sockaddr_storage address;
+   int f_flags;
 
-   if (flags  ~(SOCK_CLOEXEC | SOCK_NONBLOCK))
+   if (flags  ~(SOCK_CLOEXEC | SOCK_NONBLOCK | SOCK_DONTWAIT))
return -EINVAL;
 
if (SOCK_NONBLOCK != O_NONBLOCK  (flags  SOCK_NONBLOCK))
@@ -1513,6 +1515,10 @@ SYSCALL_DEFINE4(accept4, int, fd, struct sockaddr __user 
*, upeer_sockaddr,
if (!sock)
goto out;
 
+   f_flags = sock-file-f_flags;
+   if (flags  SOCK_DONTWAIT)
+   f_flags |= O_NONBLOCK;
+
err = -ENFILE;
newsock = sock_alloc();
if (!newsock)
@@ -1545,7 +1551,7 @@ SYSCALL_DEFINE4(accept4, int, fd, struct sockaddr __user 
*, upeer_sockaddr,
if (err)
goto out_fd;
 
-   err = sock-ops-accept(sock, newsock, sock-file-f_flags);
+   err = sock-ops-accept(sock, newsock, f_flags);
if (err  0)
goto out_fd;
 
-- 
EW
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC] ALSA: usb-audio: reduce "cannot get freq at ep" spew

2015-03-31 Thread Eric Wong
Eric Wong  wrote:
>   I also had another generic patch prepared before I noticed Joe's
>   quirk addition for the MS Lifecam HD-5000.

This is the generic patch I prepared before I noticed Joe's quirk
addition (applies cleanly to 3.19 and 4.0-rc6):

--8<---
Subject: [PATCH] ALSA: usb-audio: reduce "cannot get freq at ep" spew

If a device fails to support reading the sample rate, it will likely
fail again and there is no point in logging the message every time the
sample rate is set.

This reduces dmesg noise when using the Benchmark DAC1 PRE as a USB
sound card.

Signed-off-by: Eric Wong 
---
 sound/usb/clock.c| 7 +--
 sound/usb/usbaudio.h | 3 ++-
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/sound/usb/clock.c b/sound/usb/clock.c
index 03fed66..512fe12 100644
--- a/sound/usb/clock.c
+++ b/sound/usb/clock.c
@@ -307,8 +307,11 @@ static int set_sample_rate_v1(struct snd_usb_audio *chip, 
int iface,
   USB_TYPE_CLASS | USB_RECIP_ENDPOINT | 
USB_DIR_IN,
   UAC_EP_CS_ATTR_SAMPLE_RATE << 8, ep,
   data, sizeof(data))) < 0) {
-   dev_err(>dev, "%d:%d: cannot get freq at ep %#x\n",
-   iface, fmt->altsetting, ep);
+   if (!chip->no_get_freq) {
+   dev_err(>dev, "%d:%d: cannot get freq at ep %#x\n",
+   iface, fmt->altsetting, ep);
+   chip->no_get_freq = 1;
+   }
return 0; /* some devices don't support reading */
}
 
diff --git a/sound/usb/usbaudio.h b/sound/usb/usbaudio.h
index 91d0380..be65671 100644
--- a/sound/usb/usbaudio.h
+++ b/sound/usb/usbaudio.h
@@ -43,7 +43,8 @@ struct snd_usb_audio {
unsigned int in_pm:1;
unsigned int autosuspended:1;   
unsigned int txfr_quirk:1; /* Subframe boundaries on transfers */
-   
+   unsigned int no_get_freq:1;
+
int num_interfaces;
int num_suspended_intf;
 
-- 
EW

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] ALSA: usb-audio: don't try to get Benchmark DAC1 sample rate

2015-03-31 Thread Eric Wong
Adding this quirk allows us to avoid the noisy
"cannot get freq at ep 0x1" message in dmesg output every time
playback starts.

This ought to affect other Benchmark DAC1 variations using the same
"Microchip Technology, Inc." chip as well, but I have only tested
with the "Pre" variant.

Signed-off-by: Eric Wong 
Cc: Joe Turner 
Cc: Takashi Iwai 
---
  I also had another generic patch prepared before I noticed Joe's
  quirk addition for the MS Lifecam HD-5000.

 sound/usb/quirks.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/sound/usb/quirks.c b/sound/usb/quirks.c
index 753a47d..9a28365 100644
--- a/sound/usb/quirks.c
+++ b/sound/usb/quirks.c
@@ -1113,8 +1113,13 @@ void snd_usb_set_format_quirk(struct snd_usb_substream 
*subs,
 
 bool snd_usb_get_sample_rate_quirk(struct snd_usb_audio *chip)
 {
-   /* MS Lifecam HD-5000 doesn't support reading the sample rate. */
-   return chip->usb_id == USB_ID(0x045E, 0x076D);
+   /* devices which do not support reading the sample rate. */
+   switch (chip->usb_id) {
+   case USB_ID(0x045E, 0x076D): /* MS Lifecam HD-5000 */
+   case USB_ID(0x04D8, 0xFEEA): /* Benchmark DAC1 Pre */
+   return true;
+   }
+   return false;
 }
 
 /* Marantz/Denon USB DACs need a vendor cmd to switch
-- 
EW

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC] ALSA: usb-audio: reduce cannot get freq at ep spew

2015-03-31 Thread Eric Wong
Eric Wong normalper...@yhbt.net wrote:
   I also had another generic patch prepared before I noticed Joe's
   quirk addition for the MS Lifecam HD-5000.

This is the generic patch I prepared before I noticed Joe's quirk
addition (applies cleanly to 3.19 and 4.0-rc6):

--8---
Subject: [PATCH] ALSA: usb-audio: reduce cannot get freq at ep spew

If a device fails to support reading the sample rate, it will likely
fail again and there is no point in logging the message every time the
sample rate is set.

This reduces dmesg noise when using the Benchmark DAC1 PRE as a USB
sound card.

Signed-off-by: Eric Wong normalper...@yhbt.net
---
 sound/usb/clock.c| 7 +--
 sound/usb/usbaudio.h | 3 ++-
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/sound/usb/clock.c b/sound/usb/clock.c
index 03fed66..512fe12 100644
--- a/sound/usb/clock.c
+++ b/sound/usb/clock.c
@@ -307,8 +307,11 @@ static int set_sample_rate_v1(struct snd_usb_audio *chip, 
int iface,
   USB_TYPE_CLASS | USB_RECIP_ENDPOINT | 
USB_DIR_IN,
   UAC_EP_CS_ATTR_SAMPLE_RATE  8, ep,
   data, sizeof(data)))  0) {
-   dev_err(dev-dev, %d:%d: cannot get freq at ep %#x\n,
-   iface, fmt-altsetting, ep);
+   if (!chip-no_get_freq) {
+   dev_err(dev-dev, %d:%d: cannot get freq at ep %#x\n,
+   iface, fmt-altsetting, ep);
+   chip-no_get_freq = 1;
+   }
return 0; /* some devices don't support reading */
}
 
diff --git a/sound/usb/usbaudio.h b/sound/usb/usbaudio.h
index 91d0380..be65671 100644
--- a/sound/usb/usbaudio.h
+++ b/sound/usb/usbaudio.h
@@ -43,7 +43,8 @@ struct snd_usb_audio {
unsigned int in_pm:1;
unsigned int autosuspended:1;   
unsigned int txfr_quirk:1; /* Subframe boundaries on transfers */
-   
+   unsigned int no_get_freq:1;
+
int num_interfaces;
int num_suspended_intf;
 
-- 
EW

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] ALSA: usb-audio: don't try to get Benchmark DAC1 sample rate

2015-03-31 Thread Eric Wong
Adding this quirk allows us to avoid the noisy
cannot get freq at ep 0x1 message in dmesg output every time
playback starts.

This ought to affect other Benchmark DAC1 variations using the same
Microchip Technology, Inc. chip as well, but I have only tested
with the Pre variant.

Signed-off-by: Eric Wong normalper...@yhbt.net
Cc: Joe Turner j...@oampo.co.uk
Cc: Takashi Iwai ti...@suse.de
---
  I also had another generic patch prepared before I noticed Joe's
  quirk addition for the MS Lifecam HD-5000.

 sound/usb/quirks.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/sound/usb/quirks.c b/sound/usb/quirks.c
index 753a47d..9a28365 100644
--- a/sound/usb/quirks.c
+++ b/sound/usb/quirks.c
@@ -1113,8 +1113,13 @@ void snd_usb_set_format_quirk(struct snd_usb_substream 
*subs,
 
 bool snd_usb_get_sample_rate_quirk(struct snd_usb_audio *chip)
 {
-   /* MS Lifecam HD-5000 doesn't support reading the sample rate. */
-   return chip-usb_id == USB_ID(0x045E, 0x076D);
+   /* devices which do not support reading the sample rate. */
+   switch (chip-usb_id) {
+   case USB_ID(0x045E, 0x076D): /* MS Lifecam HD-5000 */
+   case USB_ID(0x04D8, 0xFEEA): /* Benchmark DAC1 Pre */
+   return true;
+   }
+   return false;
 }
 
 /* Marantz/Denon USB DACs need a vendor cmd to switch
-- 
EW

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN

2015-02-21 Thread Eric Wong
Jason Baron  wrote:
> On 02/18/2015 12:51 PM, Ingo Molnar wrote:
> > * Ingo Molnar  wrote:
> >
> >>> [...] However, I think the userspace API change is less 
> >>> clear since epoll_wait() doesn't currently have an 
> >>> 'input' events argument as epoll_ctl() does.
> >> ... but the change would be a bit clearer and somewhat 
> >> more flexible: LIFO or FIFO queueing, right?
> >>
> >> But having the queueing model as part of the epoll 
> >> context is a legitimate approach as well.
> > Btw., there's another optimization that the networking code 
> > already does when processing incoming packets: waking up a 
> > thread on the local CPU, where the wakeup is running.
> >
> > Doing the same on epoll would have real scalability 
> > advantages where incoming events are IRQ driven and are 
> > distributed amongst multiple CPUs.
> >
> > Where events are task driven the scheduler will already try 
> > to pair up waker and wakee so it might not show up in 
> > measurements that markedly.
> >
> 
> Right, so this makes me think that we may want to potentially
> support a variety of wakeup policies. Adding these to the
> generic wake up code is just going to be too messy. So, perhaps
> a better approach here would be to register a single
> wait_queue_t with the event source queue that will always
> be woken up, and then layer any epoll balancing/irq affinity
> policies on top of that. So in essence we end up with sort of
> two queues layers, but I think it provides much nicer isolation
> between layers. Also, the bulk of the changes are going to be
> isolated to the epoll code, and we avoid Andy's concern about
> missing, or starving out wakeups.
> 
> So here's a stab at how this API could look:
> 
> 1. ep1 = epoll_create1(EPOLL_POLICY);
> 
> So EPOLL_POLICY here could the round robin policy described
> here, or the irq affinity or other ideas. The idea is to create
> an fd that is local to the process, such that other processes
> can not subsequently attach to it and affect our policy.

I'm not against defining more policies if needed.
Maybe FIFO vs LIFO is a good case for this.

For affinity, it could probably be done transparently based on
epoll_wait retrievals + EPOLL_CTL_MOD operations.

> 2. epoll_ctl(ep1, EPOLL_CTL_ADD, fd_source, NULL);
> 
> This associates ep1 with the event source. ep1 can be
> associated with or added to at most 1 wakeup source. This call
> would largely just form the association, but not queue anything
> to the fd_source wait queue.

This would mean one extra FD for every fd_source, but that's
only a handful of FDs (listen sockets), correct?

> 3. epoll_ctl(ep2, EPOLL_CTL_ADD, ep1, event);
> epoll_ctl(ep3, EPOLL_CTL_ADD, ep1, event);
> epoll_ctl(ep4, EPOLL_CTL_ADD, ep1, event);
>  .
>  .
>  .
> 
> Finally, we add the epoll sets to the event source (indirectly via
> ep1). So the first add would actually queue the callback to the
> fd_source. While the subsequent calls would simply queue things
> to the 'nested' wakeup queue associated with ep1.

I'm not sure I follow, wouldn't this increase the number of wakeups?

> So any existing epoll/poll/select calls could be queued as well
> to fd_source and will operate independenly from this mechanism,
> as the fd_source queue continues to be 'wake all'. Also, there
> should be no changes necessary to __wake_up_common(), other
> than potentially passing more back though the
> wait_queue_func_t, such as 'nr_exclusive'.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN

2015-02-21 Thread Eric Wong
Jason Baron jba...@akamai.com wrote:
 On 02/18/2015 12:51 PM, Ingo Molnar wrote:
  * Ingo Molnar mi...@kernel.org wrote:
 
  [...] However, I think the userspace API change is less 
  clear since epoll_wait() doesn't currently have an 
  'input' events argument as epoll_ctl() does.
  ... but the change would be a bit clearer and somewhat 
  more flexible: LIFO or FIFO queueing, right?
 
  But having the queueing model as part of the epoll 
  context is a legitimate approach as well.
  Btw., there's another optimization that the networking code 
  already does when processing incoming packets: waking up a 
  thread on the local CPU, where the wakeup is running.
 
  Doing the same on epoll would have real scalability 
  advantages where incoming events are IRQ driven and are 
  distributed amongst multiple CPUs.
 
  Where events are task driven the scheduler will already try 
  to pair up waker and wakee so it might not show up in 
  measurements that markedly.
 
 
 Right, so this makes me think that we may want to potentially
 support a variety of wakeup policies. Adding these to the
 generic wake up code is just going to be too messy. So, perhaps
 a better approach here would be to register a single
 wait_queue_t with the event source queue that will always
 be woken up, and then layer any epoll balancing/irq affinity
 policies on top of that. So in essence we end up with sort of
 two queues layers, but I think it provides much nicer isolation
 between layers. Also, the bulk of the changes are going to be
 isolated to the epoll code, and we avoid Andy's concern about
 missing, or starving out wakeups.
 
 So here's a stab at how this API could look:
 
 1. ep1 = epoll_create1(EPOLL_POLICY);
 
 So EPOLL_POLICY here could the round robin policy described
 here, or the irq affinity or other ideas. The idea is to create
 an fd that is local to the process, such that other processes
 can not subsequently attach to it and affect our policy.

I'm not against defining more policies if needed.
Maybe FIFO vs LIFO is a good case for this.

For affinity, it could probably be done transparently based on
epoll_wait retrievals + EPOLL_CTL_MOD operations.

 2. epoll_ctl(ep1, EPOLL_CTL_ADD, fd_source, NULL);
 
 This associates ep1 with the event source. ep1 can be
 associated with or added to at most 1 wakeup source. This call
 would largely just form the association, but not queue anything
 to the fd_source wait queue.

This would mean one extra FD for every fd_source, but that's
only a handful of FDs (listen sockets), correct?

 3. epoll_ctl(ep2, EPOLL_CTL_ADD, ep1, event);
 epoll_ctl(ep3, EPOLL_CTL_ADD, ep1, event);
 epoll_ctl(ep4, EPOLL_CTL_ADD, ep1, event);
  .
  .
  .
 
 Finally, we add the epoll sets to the event source (indirectly via
 ep1). So the first add would actually queue the callback to the
 fd_source. While the subsequent calls would simply queue things
 to the 'nested' wakeup queue associated with ep1.

I'm not sure I follow, wouldn't this increase the number of wakeups?

 So any existing epoll/poll/select calls could be queued as well
 to fd_source and will operate independenly from this mechanism,
 as the fd_source queue continues to be 'wake all'. Also, there
 should be no changes necessary to __wake_up_common(), other
 than potentially passing more back though the
 wait_queue_func_t, such as 'nr_exclusive'.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN

2015-02-18 Thread Eric Wong
Ingo Molnar  wrote:
> 
> * Ingo Molnar  wrote:
> 
> > > [...] However, I think the userspace API change is less 
> > > clear since epoll_wait() doesn't currently have an 
> > > 'input' events argument as epoll_ctl() does.
> > 
> > ... but the change would be a bit clearer and somewhat 
> > more flexible: LIFO or FIFO queueing, right?
> > 
> > But having the queueing model as part of the epoll 
> > context is a legitimate approach as well.
> 
> Btw., there's another optimization that the networking code 
> already does when processing incoming packets: waking up a 
> thread on the local CPU, where the wakeup is running.
> 
> Doing the same on epoll would have real scalability 
> advantages where incoming events are IRQ driven and are 
> distributed amongst multiple CPUs.

Right.  One thing in the back of my mind has been to have CPU
affinity for epoll.  Either having everything in an epoll set
favor a certain CPU or even having affinity down to the epitem
level (so concurrent epoll_wait callers end up favoring the
same epitems).

I'm not convinced this series is worth doing without a
comparison against my previous suggestion to use a dedicated
thread which only makes blocking accept4 + EPOLL_CTL_ADD calls.

The majority of epoll events in a typical server should not be
for listen sockets, so I'd rather not bloat existing code paths
for them.  For web servers nowadays, the benefits of maintaining
long-lived connections to avoid handshakes is even more
beneficial with increasing HTTPS and HTTP2 adoption; so
listen socket events should become less common.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN

2015-02-18 Thread Eric Wong
Ingo Molnar mi...@kernel.org wrote:
 
 * Ingo Molnar mi...@kernel.org wrote:
 
   [...] However, I think the userspace API change is less 
   clear since epoll_wait() doesn't currently have an 
   'input' events argument as epoll_ctl() does.
  
  ... but the change would be a bit clearer and somewhat 
  more flexible: LIFO or FIFO queueing, right?
  
  But having the queueing model as part of the epoll 
  context is a legitimate approach as well.
 
 Btw., there's another optimization that the networking code 
 already does when processing incoming packets: waking up a 
 thread on the local CPU, where the wakeup is running.
 
 Doing the same on epoll would have real scalability 
 advantages where incoming events are IRQ driven and are 
 distributed amongst multiple CPUs.

Right.  One thing in the back of my mind has been to have CPU
affinity for epoll.  Either having everything in an epoll set
favor a certain CPU or even having affinity down to the epitem
level (so concurrent epoll_wait callers end up favoring the
same epitems).

I'm not convinced this series is worth doing without a
comparison against my previous suggestion to use a dedicated
thread which only makes blocking accept4 + EPOLL_CTL_ADD calls.

The majority of epoll events in a typical server should not be
for listen sockets, so I'd rather not bloat existing code paths
for them.  For web servers nowadays, the benefits of maintaining
long-lived connections to avoid handshakes is even more
beneficial with increasing HTTPS and HTTP2 adoption; so
listen socket events should become less common.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN

2015-02-10 Thread Eric Wong
Jason Baron  wrote:
> On 02/09/2015 11:49 PM, Eric Wong wrote:
> > Do you have a userland use case to share?
> 
> I've been trying to describe the use case, maybe I haven't been doing a good
> job :(

Sorry, I meant if you had any public code.

Anyways, I've restarted work on another project which I'll hopefully be
able to share in a few weeks which might be a good public candidate for
epoll performance testing.

> > Did you try my suggestion of using a dedicated thread (or thread pool)
> > which does nothing but loop on accept() + EPOLL_CTL_ADD?
> >
> > Those dedicated threads could do its own round-robin in userland to pick
> > a different epollfd to call EPOLL_CTL_ADD on.
> 
> Thanks for your suggestion! I'm not actively working on the user-space
> code here, but I will pass it along.
> 
> I would prefer though not to have to context switch the 'accept' thread
> on and off the cpu every time there is a new connection. So the approach
> suggested here essentially moves this dedicated thread (threads), down
> into the kernel and avoids the creation of these threads entirely.

For cmogstored, I stopped using TCP_DEFER_ACCEPT when using the
dedicated thread.  This approach offloads to epoll and ends up giving
similar behavior to what used to be infinite in TCP_DEFER_ACCEPT in
Linux <= 2.6.31
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN

2015-02-10 Thread Eric Wong
Jason Baron jba...@akamai.com wrote:
 On 02/09/2015 11:49 PM, Eric Wong wrote:
  Do you have a userland use case to share?
 
 I've been trying to describe the use case, maybe I haven't been doing a good
 job :(

Sorry, I meant if you had any public code.

Anyways, I've restarted work on another project which I'll hopefully be
able to share in a few weeks which might be a good public candidate for
epoll performance testing.

  Did you try my suggestion of using a dedicated thread (or thread pool)
  which does nothing but loop on accept() + EPOLL_CTL_ADD?
 
  Those dedicated threads could do its own round-robin in userland to pick
  a different epollfd to call EPOLL_CTL_ADD on.
 
 Thanks for your suggestion! I'm not actively working on the user-space
 code here, but I will pass it along.
 
 I would prefer though not to have to context switch the 'accept' thread
 on and off the cpu every time there is a new connection. So the approach
 suggested here essentially moves this dedicated thread (threads), down
 into the kernel and avoids the creation of these threads entirely.

For cmogstored, I stopped using TCP_DEFER_ACCEPT when using the
dedicated thread.  This approach offloads to epoll and ends up giving
similar behavior to what used to be infinite in TCP_DEFER_ACCEPT in
Linux = 2.6.31
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN

2015-02-09 Thread Eric Wong
Jason Baron  wrote:
> On 02/09/2015 05:45 PM, Andy Lutomirski wrote:
> > On Mon, Feb 9, 2015 at 1:32 PM, Jason Baron  wrote:
> >> On 02/09/2015 03:18 PM, Andy Lutomirski wrote:
> >>> On 02/09/2015 12:06 PM, Jason Baron wrote:
>  Epoll file descriptors that are added to a shared wakeup source are 
>  always
>  added in a non-exclusive manner. That means that when we have multiple 
>  epoll
>  fds attached to a shared wakeup source they are all woken up. This can
>  lead to excessive cpu usage and uneven load distribution.
> 
>  This patch introduces two new 'events' flags that are intended to be used
>  with EPOLL_CTL_ADD operations. EPOLLEXCLUSIVE, adds the epoll fd to the 
>  event
>  source in an exclusive manner such that the minimum number of threads are
>  woken. EPOLLROUNDROBIN, which depends on EPOLLEXCLUSIVE also being set, 
>  can
>  also be added to the 'events' flag, such that we round robin around the 
>  set
>  of waiting threads.
> 
>  An implementation note is that in the epoll wakeup routine,
>  'ep_poll_callback()', if EPOLLROUNDROBIN is set, we return 1, for a 
>  successful
>  wakeup, only when there are current waiters. The idea is to use this 
>  additional
>  heuristic in order minimize wakeup latencies.
> >>> I don't understand what this is intended to do.
> >>>
> >>> If an event has EPOLLONESHOT, then this only one thread should be woken 
> >>> regardless, right?  If not, isn't that just a bug that should be fixed?
> >>>
> >> hmm...so with EPOLLONESHOT you basically get notified once about an event. 
> >> If i have multiple epoll fds (say 1 per-thread) attached to a single 
> >> source in EPOLLONESHOT, then all threads will potentially get woken up 
> >> once per event. Then, I would have to re-arm all of them. So I don't think 
> >> this addresses this particular usecase...what I am trying to avoid is this 
> >> mass wakeup or thundering herd for a shared event source.
> > Now I understand.  Why are you using multiple epollfds?
> >
> > --Andy
> 
> So the multiple epollfds is really a way to partition the set of
> events. Otherwise, I have all the threads contending on all the events
> that are being generated. So I'm not sure if that is scalable.

I wonder if EPOLLONESHOT + epoll_wait with a sufficiently large
maxevents value is sufficient for you.  All events would be shared, so
they can migrate between threads(*).  Each thread takes a largish set of
events on every epoll_wait call and doesn't call epoll_wait again until
it's done with the whole set it got.

You'll hit more contention on EPOLL_CTL_MOD with shared events and a
single epoll, but I think it's a better goal to make that lock-free.

(*) Too large a maxevents will lead to head-of-line blocking, but from
what I'm inferring, you already risk that with multiple epollfds and
separate threads working on them.

Do you have a userland use case to share?

> In the use-case I'm trying to describe, I've partitioned a large set
> of the events, but there may still be some event sources that we wish
> to share among all of the threads (or even subsets of them), so as not
> to overload any one in particular.
 
> More specifically, in the case of a single listen socket, its natural
> to call accept() on the thread that has been woken up, but without
> doing round robin, you quickly get into a very unbalanced load, and in
> addition you waste a lot of cpu doing unnecessary wakeups. There are
> other approaches to solve this, specifically using SO_REUSEPORT, which
> creates a separate socket per-thread and gets one back to the
> separately partitioned events case previously described. However,
> SO_REUSEPORT, I believe is very specific to tcp/udp, and in addition
> does not have knowledge of the threads that are actively waiting as
> the epoll code does.

Did you try my suggestion of using a dedicated thread (or thread pool)
which does nothing but loop on accept() + EPOLL_CTL_ADD?

Those dedicated threads could do its own round-robin in userland to pick
a different epollfd to call EPOLL_CTL_ADD on.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN

2015-02-09 Thread Eric Wong
Jason Baron jba...@akamai.com wrote:
 On 02/09/2015 05:45 PM, Andy Lutomirski wrote:
  On Mon, Feb 9, 2015 at 1:32 PM, Jason Baron jba...@akamai.com wrote:
  On 02/09/2015 03:18 PM, Andy Lutomirski wrote:
  On 02/09/2015 12:06 PM, Jason Baron wrote:
  Epoll file descriptors that are added to a shared wakeup source are 
  always
  added in a non-exclusive manner. That means that when we have multiple 
  epoll
  fds attached to a shared wakeup source they are all woken up. This can
  lead to excessive cpu usage and uneven load distribution.
 
  This patch introduces two new 'events' flags that are intended to be used
  with EPOLL_CTL_ADD operations. EPOLLEXCLUSIVE, adds the epoll fd to the 
  event
  source in an exclusive manner such that the minimum number of threads are
  woken. EPOLLROUNDROBIN, which depends on EPOLLEXCLUSIVE also being set, 
  can
  also be added to the 'events' flag, such that we round robin around the 
  set
  of waiting threads.
 
  An implementation note is that in the epoll wakeup routine,
  'ep_poll_callback()', if EPOLLROUNDROBIN is set, we return 1, for a 
  successful
  wakeup, only when there are current waiters. The idea is to use this 
  additional
  heuristic in order minimize wakeup latencies.
  I don't understand what this is intended to do.
 
  If an event has EPOLLONESHOT, then this only one thread should be woken 
  regardless, right?  If not, isn't that just a bug that should be fixed?
 
  hmm...so with EPOLLONESHOT you basically get notified once about an event. 
  If i have multiple epoll fds (say 1 per-thread) attached to a single 
  source in EPOLLONESHOT, then all threads will potentially get woken up 
  once per event. Then, I would have to re-arm all of them. So I don't think 
  this addresses this particular usecase...what I am trying to avoid is this 
  mass wakeup or thundering herd for a shared event source.
  Now I understand.  Why are you using multiple epollfds?
 
  --Andy
 
 So the multiple epollfds is really a way to partition the set of
 events. Otherwise, I have all the threads contending on all the events
 that are being generated. So I'm not sure if that is scalable.

I wonder if EPOLLONESHOT + epoll_wait with a sufficiently large
maxevents value is sufficient for you.  All events would be shared, so
they can migrate between threads(*).  Each thread takes a largish set of
events on every epoll_wait call and doesn't call epoll_wait again until
it's done with the whole set it got.

You'll hit more contention on EPOLL_CTL_MOD with shared events and a
single epoll, but I think it's a better goal to make that lock-free.

(*) Too large a maxevents will lead to head-of-line blocking, but from
what I'm inferring, you already risk that with multiple epollfds and
separate threads working on them.

Do you have a userland use case to share?

 In the use-case I'm trying to describe, I've partitioned a large set
 of the events, but there may still be some event sources that we wish
 to share among all of the threads (or even subsets of them), so as not
 to overload any one in particular.
 
 More specifically, in the case of a single listen socket, its natural
 to call accept() on the thread that has been woken up, but without
 doing round robin, you quickly get into a very unbalanced load, and in
 addition you waste a lot of cpu doing unnecessary wakeups. There are
 other approaches to solve this, specifically using SO_REUSEPORT, which
 creates a separate socket per-thread and gets one back to the
 separately partitioned events case previously described. However,
 SO_REUSEPORT, I believe is very specific to tcp/udp, and in addition
 does not have knowledge of the threads that are actively waiting as
 the epoll code does.

Did you try my suggestion of using a dedicated thread (or thread pool)
which does nothing but loop on accept() + EPOLL_CTL_ADD?

Those dedicated threads could do its own round-robin in userland to pick
a different epollfd to call EPOLL_CTL_ADD on.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 5/5] epoll: introduce epoll connected components (remove the epmutex)

2015-01-15 Thread Eric Wong
Jason Baron  wrote:
> I've done a bit of performance evaluation on a dual socket, 10 core, hyper
> threading enabled box: Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz. For the
> simple epfdN->epfdN->pipefdN topology case where each thread has its
> own unique files and is doing EPOLL_CTL_ADD and EPOLL_CTL_DEL on the pipefd,
> I see an almost 300% improvement. This is obviously a very contrived case,
> but shows the motivation for this patch.

Any improvements for non-contrived cases? :)

> +++ b/include/linux/fs.h
> @@ -835,6 +835,9 @@ struct file {
>   /* Used by fs/eventpoll.c to link all the hooks to this file */
>   struct list_headf_ep_links;
>   struct list_headf_tfile_llink;
> + /* connected component */
> + struct list_headf_ep_cc_link;
> + struct ep_cc __rcu  *f_ep_cc;
>  #endif /* #ifdef CONFIG_EPOLL */

This size increase worries me.  Perhaps this can be a separately
allocated struct to avoid penalizing non-epoll users?

struct file_eventpoll {
struct list_headf_ep_links;
struct list_headf_tfile_llink;
/* connected component */
struct list_headf_ep_cc_link;
struct ep_cc __rcu  *f_ep_cc;
};

But I wish Linux never allowed nesting epoll in the first place :/
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 5/5] epoll: introduce epoll connected components (remove the epmutex)

2015-01-15 Thread Eric Wong
Jason Baron jba...@akamai.com wrote:
 I've done a bit of performance evaluation on a dual socket, 10 core, hyper
 threading enabled box: Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz. For the
 simple epfdN-epfdN-pipefdN topology case where each thread has its
 own unique files and is doing EPOLL_CTL_ADD and EPOLL_CTL_DEL on the pipefd,
 I see an almost 300% improvement. This is obviously a very contrived case,
 but shows the motivation for this patch.

Any improvements for non-contrived cases? :)

 +++ b/include/linux/fs.h
 @@ -835,6 +835,9 @@ struct file {
   /* Used by fs/eventpoll.c to link all the hooks to this file */
   struct list_headf_ep_links;
   struct list_headf_tfile_llink;
 + /* connected component */
 + struct list_headf_ep_cc_link;
 + struct ep_cc __rcu  *f_ep_cc;
  #endif /* #ifdef CONFIG_EPOLL */

This size increase worries me.  Perhaps this can be a separately
allocated struct to avoid penalizing non-epoll users?

struct file_eventpoll {
struct list_headf_ep_links;
struct list_headf_tfile_llink;
/* connected component */
struct list_headf_ep_cc_link;
struct ep_cc __rcu  *f_ep_cc;
};

But I wish Linux never allowed nesting epoll in the first place :/
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] eventpoll: fix uninitialized variable in epoll_ctl

2014-08-20 Thread Eric Wong
Nicolas Iooss  wrote:
> When calling epoll_ctl with operation EPOLL_CTL_DEL, structure epds is
> not initialized but ep_take_care_of_epollwakeup reads its event field.
> When this unintialized field has EPOLLWAKEUP bit set, a capability check
> is done for CAP_BLOCK_SUSPEND in ep_take_care_of_epollwakeup.  This
> produces unexpected messages in the audit log, such as (on a system
> running SELinux):
> 
> type=AVC msg=audit(1408212798.866:410): avc:  denied
> { block_suspend } for  pid=7754 comm="dbus-daemon" capability=36
> scontext=unconfined_u:unconfined_r:unconfined_t
> tcontext=unconfined_u:unconfined_r:unconfined_t
> tclass=capability2 permissive=1
> 
> type=SYSCALL msg=audit(1408212798.866:410): arch=c03e syscall=233
> success=yes exit=0 a0=3 a1=2 a2=9 a3=7fffd4d66ec0 items=0 ppid=1
> pid=7754 auid=1000 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0
> fsgid=0 tty=(none) ses=3 comm="dbus-daemon"
> exe="/usr/bin/dbus-daemon"
> subj=unconfined_u:unconfined_r:unconfined_t key=(null)
> 
> ("arch=c03e syscall=233 a1=2" means "epoll_ctl(op=EPOLL_CTL_DEL)")
> 
> Remove use of epds in epoll_ctl when op == EPOLL_CTL_DEL.
> 
> Fixes: 4d7e30d98939 ("epoll: Add a flag, EPOLLWAKEUP, to prevent suspend
> while epoll events are ready")
> Signed-off-by: Nicolas Iooss 

Looks good to me.  Tested without regressions on several of my
(non-EPOLLWAKEUP) projects.  Adding a few more folks to the Cc: list.

Acked-by: Eric Wong 

> ---
>  fs/eventpoll.c |3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> index b10b48c..7bcfff9 100644
> --- a/fs/eventpoll.c
> +++ b/fs/eventpoll.c
> @@ -1852,7 +1852,8 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
>   goto error_tgt_fput;
>  
>   /* Check if EPOLLWAKEUP is allowed */
> - ep_take_care_of_epollwakeup();
> + if (ep_op_has_event(op))
> + ep_take_care_of_epollwakeup();
>  
>   /*
>* We have to check that the file structure underneath the file 
> descriptor
> -- 
> 1.7.10.4
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] eventpoll: fix uninitialized variable in epoll_ctl

2014-08-20 Thread Eric Wong
Nicolas Iooss nicolas.iooss_li...@m4x.org wrote:
 When calling epoll_ctl with operation EPOLL_CTL_DEL, structure epds is
 not initialized but ep_take_care_of_epollwakeup reads its event field.
 When this unintialized field has EPOLLWAKEUP bit set, a capability check
 is done for CAP_BLOCK_SUSPEND in ep_take_care_of_epollwakeup.  This
 produces unexpected messages in the audit log, such as (on a system
 running SELinux):
 
 type=AVC msg=audit(1408212798.866:410): avc:  denied
 { block_suspend } for  pid=7754 comm=dbus-daemon capability=36
 scontext=unconfined_u:unconfined_r:unconfined_t
 tcontext=unconfined_u:unconfined_r:unconfined_t
 tclass=capability2 permissive=1
 
 type=SYSCALL msg=audit(1408212798.866:410): arch=c03e syscall=233
 success=yes exit=0 a0=3 a1=2 a2=9 a3=7fffd4d66ec0 items=0 ppid=1
 pid=7754 auid=1000 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0
 fsgid=0 tty=(none) ses=3 comm=dbus-daemon
 exe=/usr/bin/dbus-daemon
 subj=unconfined_u:unconfined_r:unconfined_t key=(null)
 
 (arch=c03e syscall=233 a1=2 means epoll_ctl(op=EPOLL_CTL_DEL))
 
 Remove use of epds in epoll_ctl when op == EPOLL_CTL_DEL.
 
 Fixes: 4d7e30d98939 (epoll: Add a flag, EPOLLWAKEUP, to prevent suspend
 while epoll events are ready)
 Signed-off-by: Nicolas Iooss nicolas.iooss_li...@m4x.org

Looks good to me.  Tested without regressions on several of my
(non-EPOLLWAKEUP) projects.  Adding a few more folks to the Cc: list.

Acked-by: Eric Wong normalper...@yhbt.net

 ---
  fs/eventpoll.c |3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)
 
 diff --git a/fs/eventpoll.c b/fs/eventpoll.c
 index b10b48c..7bcfff9 100644
 --- a/fs/eventpoll.c
 +++ b/fs/eventpoll.c
 @@ -1852,7 +1852,8 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
   goto error_tgt_fput;
  
   /* Check if EPOLLWAKEUP is allowed */
 - ep_take_care_of_epollwakeup(epds);
 + if (ep_op_has_event(op))
 + ep_take_care_of_epollwakeup(epds);
  
   /*
* We have to check that the file structure underneath the file 
 descriptor
 -- 
 1.7.10.4
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Does anyone know when FUTEX_WAIT can fail with EAGAIN?

2014-08-16 Thread Eric Wong
Steven Stewart-Gallus  wrote:
> I'm trying to debug a hangup where my program loops with FUTEX_WAIT (actually
> FUTEX_WAIT_PRIVATE but same thing)  endlessly erring out with EAGAIN. I would
> like to know if anyone on the mailing list knows when FUTEX_WAIT can fail with
> EAGAIN.

FUTEX_WAIT fails with EAGAIN if the value of *uaddr no longer matches
the expected val.
---
#include 
#include 
#include 
#include 
int main(void) {
int op = FUTEX_WAIT;
int exp = 1;
int *addr = 
int val = 1; /* XXX change this to anything else to get EAGAIN */
int rc = syscall(SYS_futex, addr, op, val, 0, 0, 0);

if (rc < 0 && errno == EAGAIN)
write(1, "EAGAIN\n", 7);

return 0;
}
---
I just encountered this myself yesterday while implementing futex-based
locks/condvars for Ruby:
https://bugs.ruby-lang.org/issues/10009#change-48372
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Does anyone know when FUTEX_WAIT can fail with EAGAIN?

2014-08-16 Thread Eric Wong
Steven Stewart-Gallus sstewartgallu...@mylangara.bc.ca wrote:
 I'm trying to debug a hangup where my program loops with FUTEX_WAIT (actually
 FUTEX_WAIT_PRIVATE but same thing)  endlessly erring out with EAGAIN. I would
 like to know if anyone on the mailing list knows when FUTEX_WAIT can fail with
 EAGAIN.

FUTEX_WAIT fails with EAGAIN if the value of *uaddr no longer matches
the expected val.
---
#include linux/futex.h
#include sys/syscall.h
#include unistd.h
#include errno.h
int main(void) {
int op = FUTEX_WAIT;
int exp = 1;
int *addr = exp;
int val = 1; /* XXX change this to anything else to get EAGAIN */
int rc = syscall(SYS_futex, addr, op, val, 0, 0, 0);

if (rc  0  errno == EAGAIN)
write(1, EAGAIN\n, 7);

return 0;
}
---
I just encountered this myself yesterday while implementing futex-based
locks/condvars for Ruby:
https://bugs.ruby-lang.org/issues/10009#change-48372
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: eventpoll __list_del_entry corruption

2014-06-16 Thread Eric Wong
Sasha Levin  wrote:
> On 05/15/2014 02:11 PM, Peter Zijlstra wrote:
> > On Mon, May 12, 2014 at 11:42:33AM -0400, Sasha Levin wrote:
> >> Hi all,
> >>
> >> While fuzzing with trinity inside a KVM tools guest running the latest 
> >> -next
> >> kernel I've stumbled on the following spew. Maybe related to the very 
> >> recent
> >> change in freeing on task exit?

> > [ 5823.690004]  [] do_exit+0x2d4/0xa90
> > [ 5823.690004]  [] ? lockdep_sys_exit_thunk+0x35/0x67
> > [ 5823.690004]  [] do_group_exit+0x4c/0xc0
> > [ 5823.690004]  [] SyS_exit_group+0x17/0x20
> > [ 5823.690004]  [] system_call_fastpath+0x16/0x1b
> > [ 5823.690004] ---[ end trace 515b7fa3169c0906 ]---
> 
> Dave reported something similar to that last year(!) and that never got fixed
> AFAIK: https://lkml.org/lkml/2013/10/14/353.

Both stack traces happen during exit, so it may be a destruction
ordering problem when the eventpoll and files it references are
all released together.

I've never seen this in normal use, perhaps because I'm in the habit of
closing all descriptors before exit to keep valgrind happy :)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: eventpoll __list_del_entry corruption

2014-06-16 Thread Eric Wong
Sasha Levin sasha.le...@oracle.com wrote:
 On 05/15/2014 02:11 PM, Peter Zijlstra wrote:
  On Mon, May 12, 2014 at 11:42:33AM -0400, Sasha Levin wrote:
  Hi all,
 
  While fuzzing with trinity inside a KVM tools guest running the latest 
  -next
  kernel I've stumbled on the following spew. Maybe related to the very 
  recent
  change in freeing on task exit?

  [ 5823.690004]  [8109a544] do_exit+0x2d4/0xa90
  [ 5823.690004]  [813825c4] ? lockdep_sys_exit_thunk+0x35/0x67
  [ 5823.690004]  [8109ae2c] do_group_exit+0x4c/0xc0
  [ 5823.690004]  [8109aeb7] SyS_exit_group+0x17/0x20
  [ 5823.690004]  [8168a2c2] system_call_fastpath+0x16/0x1b
  [ 5823.690004] ---[ end trace 515b7fa3169c0906 ]---
 
 Dave reported something similar to that last year(!) and that never got fixed
 AFAIK: https://lkml.org/lkml/2013/10/14/353.

Both stack traces happen during exit, so it may be a destruction
ordering problem when the eventpoll and files it references are
all released together.

I've never seen this in normal use, perhaps because I'm in the habit of
closing all descriptors before exit to keep valgrind happy :)
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH for-next 2/4] epoll: epoll() syscall declaration

2014-02-25 Thread Eric Wong
Nathaniel Yazdani  wrote:
> On Sun, Feb 23, 2014 at 9:32 PM, Eric Wong  wrote:
> > Nathaniel Yazdani  wrote:
> >> +asmlinkage long sys_epoll(int ep, struct epoll __user *in,
> >> +   unsigned int inc, struct epoll __user *out,
> >> +   unsigned int outc, int timeout);
> >
> > I can understand using the new struct for 'in', but 'out' could just be
> > "struct epoll_event *" like sys_epoll_wait, right?
> >
> >>  asmlinkage long sys_epoll_wait(int epfd, struct epoll_event __user 
> >> *events,
> 
> Yeah and I went back and forth on that, it just seemed to me that the
> inconsistency could be confusing to others... maybe instead of defining a new
> struct to begin with it might make me sense to just have an 'infd' array of 
> file
> descriptors in addition to an 'in' array of epoll_event struct
> (obviously the length
> of these would be identical).

I don't think a separate array for in is a good idea, too error prone
and you lose locality.

For output, some users either end up allocating more memory/retrieve
fewer items with the larger struct for *out.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH for-next 4/4] epoll: epoll() syscall definition

2014-02-25 Thread Eric Wong
Nathaniel Yazdani  wrote:
> + * stores triggered eventpoll entries in the 'out' array. The input array is
> + * _not_ read-only, because the resulting event mask gets written back to 
> each
> + * entry's ->ep_events field. When successful, this will be the same as 
> before
> + * (plus EPOLLERR & EPOLLHUP). If ->ep_events gets cleared, then it is 
> reasonable
> + * to infer that the entry's ->ep_fildes was a bad file descriptor.
> + */

> + if (!access_ok(VERIFY_WRITE, in, inc * sizeof(struct epoll)))
> + goto out;
> + for (i = 0; i < inc; ++i) {
> + int fd, io;
> + long long id;
> +
> + ret = -EFAULT;
> + if (__get_user(fd, [i].ep_fildes) ||
> + __get_user(io, [i].ep_events) ||
> + __get_user(id, [i].ep_ident))
> + goto out;
> +
> + ep_control(file->private_data, fd, , id, 0);
> + ret = -EFAULT;
> + if (__put_user(io, [i].ep_events))
> + goto out;

I don't think we should waste cycles writing to 'in' on success.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH for-next 4/4] epoll: epoll() syscall definition

2014-02-25 Thread Eric Wong
Nathaniel Yazdani n1ght.4nd@gmail.com wrote:
 + * stores triggered eventpoll entries in the 'out' array. The input array is
 + * _not_ read-only, because the resulting event mask gets written back to 
 each
 + * entry's -ep_events field. When successful, this will be the same as 
 before
 + * (plus EPOLLERR  EPOLLHUP). If -ep_events gets cleared, then it is 
 reasonable
 + * to infer that the entry's -ep_fildes was a bad file descriptor.
 + */

 + if (!access_ok(VERIFY_WRITE, in, inc * sizeof(struct epoll)))
 + goto out;
 + for (i = 0; i  inc; ++i) {
 + int fd, io;
 + long long id;
 +
 + ret = -EFAULT;
 + if (__get_user(fd, in[i].ep_fildes) ||
 + __get_user(io, in[i].ep_events) ||
 + __get_user(id, in[i].ep_ident))
 + goto out;
 +
 + ep_control(file-private_data, fd, io, id, 0);
 + ret = -EFAULT;
 + if (__put_user(io, in[i].ep_events))
 + goto out;

I don't think we should waste cycles writing to 'in' on success.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH for-next 2/4] epoll: epoll() syscall declaration

2014-02-25 Thread Eric Wong
Nathaniel Yazdani n1ght.4nd@gmail.com wrote:
 On Sun, Feb 23, 2014 at 9:32 PM, Eric Wong normalper...@yhbt.net wrote:
  Nathaniel Yazdani n1ght.4nd@gmail.com wrote:
  +asmlinkage long sys_epoll(int ep, struct epoll __user *in,
  +   unsigned int inc, struct epoll __user *out,
  +   unsigned int outc, int timeout);
 
  I can understand using the new struct for 'in', but 'out' could just be
  struct epoll_event * like sys_epoll_wait, right?
 
   asmlinkage long sys_epoll_wait(int epfd, struct epoll_event __user 
  *events,
 
 Yeah and I went back and forth on that, it just seemed to me that the
 inconsistency could be confusing to others... maybe instead of defining a new
 struct to begin with it might make me sense to just have an 'infd' array of 
 file
 descriptors in addition to an 'in' array of epoll_event struct
 (obviously the length
 of these would be identical).

I don't think a separate array for in is a good idea, too error prone
and you lose locality.

For output, some users either end up allocating more memory/retrieve
fewer items with the larger struct for *out.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH for-next 2/4] epoll: epoll() syscall declaration

2014-02-23 Thread Eric Wong
Nathaniel Yazdani  wrote:
> +asmlinkage long sys_epoll(int ep, struct epoll __user *in,
> +   unsigned int inc, struct epoll __user *out,
> +   unsigned int outc, int timeout);

I can understand using the new struct for 'in', but 'out' could just be
"struct epoll_event *" like sys_epoll_wait, right?

>  asmlinkage long sys_epoll_wait(int epfd, struct epoll_event __user *events,
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH for-next 1/4] epoll: struct epoll userspace definiton

2014-02-23 Thread Eric Wong
Nathaniel Yazdani  wrote:
> +++ b/include/uapi/linux/eventpoll.h
> @@ -61,6 +61,12 @@ struct epoll_event {
>   __u64 data;
>  } EPOLL_PACKED;
>  
> +struct epoll {
> + int ep_fildes; /* file descriptor */
> + int ep_events; /* triggering events */
> + long long ep_ident; /* entry ID (cf. epoll_event->data) */
> +} EPOLL_PACKED;

Why not reuse the existing epoll_event struct?

struct epoll {
int ep_fildes; /* file descriptor */
struct epoll_event event;
} EPOLL_PACKED;
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH for-next 1/4] epoll: struct epoll userspace definiton

2014-02-23 Thread Eric Wong
Nathaniel Yazdani n1ght.4nd@gmail.com wrote:
 +++ b/include/uapi/linux/eventpoll.h
 @@ -61,6 +61,12 @@ struct epoll_event {
   __u64 data;
  } EPOLL_PACKED;
  
 +struct epoll {
 + int ep_fildes; /* file descriptor */
 + int ep_events; /* triggering events */
 + long long ep_ident; /* entry ID (cf. epoll_event-data) */
 +} EPOLL_PACKED;

Why not reuse the existing epoll_event struct?

struct epoll {
int ep_fildes; /* file descriptor */
struct epoll_event event;
} EPOLL_PACKED;
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH for-next 2/4] epoll: epoll() syscall declaration

2014-02-23 Thread Eric Wong
Nathaniel Yazdani n1ght.4nd@gmail.com wrote:
 +asmlinkage long sys_epoll(int ep, struct epoll __user *in,
 +   unsigned int inc, struct epoll __user *out,
 +   unsigned int outc, int timeout);

I can understand using the new struct for 'in', but 'out' could just be
struct epoll_event * like sys_epoll_wait, right?

  asmlinkage long sys_epoll_wait(int epfd, struct epoll_event __user *events,
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 0/3] epoll: read(),write(),ioctl() interface

2014-02-03 Thread Eric Wong
Andy Lutomirski  wrote:
> >> On 02/02/2014 06:17 PM, Nathaniel Yazdani wrote:
> > So are you saying that those features you mentioned are specifically sought
> > after for the kernel? If so I'd like to take a crack at some of them,
> > may as well
> > get some use out of my new knowledge of epoll internals :)
> 
> If by "sought after", you mean "is there at least one epoll user who
> wants them", then yes :)
> 
> I think that EPOLLET and EPOLLONESHOT are giant hacks, and that what
> everyone really wants is the ability to very efficiently toggle events
> on and off.  The ability to do it simultaneously and inexpensively
> with epoll_wait would make it happen.

Everybody using single-threaded epoll, you mean?  I suppose there's
quite a few of those.

I've pondered an epoll_xchg syscall which would behave like *BSD kevent
to satisfy single-threaded users, but never got around to it.  All my
epoll uses are multithreaded w/ oneshot nowadays, so xchg would only
save one syscall per thread.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 0/3] epoll: read(),write(),ioctl() interface

2014-02-03 Thread Eric Wong
Andy Lutomirski l...@amacapital.net wrote:
  On 02/02/2014 06:17 PM, Nathaniel Yazdani wrote:
  So are you saying that those features you mentioned are specifically sought
  after for the kernel? If so I'd like to take a crack at some of them,
  may as well
  get some use out of my new knowledge of epoll internals :)
 
 If by sought after, you mean is there at least one epoll user who
 wants them, then yes :)
 
 I think that EPOLLET and EPOLLONESHOT are giant hacks, and that what
 everyone really wants is the ability to very efficiently toggle events
 on and off.  The ability to do it simultaneously and inexpensively
 with epoll_wait would make it happen.

Everybody using single-threaded epoll, you mean?  I suppose there's
quite a few of those.

I've pondered an epoll_xchg syscall which would behave like *BSD kevent
to satisfy single-threaded users, but never got around to it.  All my
epoll uses are multithreaded w/ oneshot nowadays, so xchg would only
save one syscall per thread.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: epoll oops.

2013-10-16 Thread Eric Wong
Oleg Nesterov  wrote:
> Yes. Before that 971316f0503a hack epoll can't even know if the task
> which did signalfd_poll() exits and frees the active signalfd_wqh.
> If for example that task forked a child before exit.
> 
> And the whole RCU logic is only needed if exit/ep_remove_wait_queue
> actually race with each other.

Is there any chance this oops is caused by (or at least more easily
exposed by) commit 91cf5ab60ff82ecf4550a596867787c1e360dd3f ?
(epoll: add a reschedule point in ep_free())

I thought 91cf5ab would be benign, except...

> Yes, ugly, agreed. d80e731ecab4 even tries to docunent that this all
> is the hack.

.. the following sentence from d80e731ecab4 caught my eye:

It also assumes that nobody can take tasklist_lock under epoll
locks, this seems to be true.

I haven't been able to trace if cond_resched() can take tasklist_lock.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: epoll oops.

2013-10-16 Thread Eric Wong
Oleg Nesterov o...@redhat.com wrote:
 Yes. Before that 971316f0503a hack epoll can't even know if the task
 which did signalfd_poll() exits and frees the active signalfd_wqh.
 If for example that task forked a child before exit.
 
 And the whole RCU logic is only needed if exit/ep_remove_wait_queue
 actually race with each other.

Is there any chance this oops is caused by (or at least more easily
exposed by) commit 91cf5ab60ff82ecf4550a596867787c1e360dd3f ?
(epoll: add a reschedule point in ep_free())

I thought 91cf5ab would be benign, except...

 Yes, ugly, agreed. d80e731ecab4 even tries to docunent that this all
 is the hack.

.. the following sentence from d80e731ecab4 caught my eye:

It also assumes that nobody can take tasklist_lock under epoll
locks, this seems to be true.

I haven't been able to trace if cond_resched() can take tasklist_lock.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 09/11] epoll: Remove unnecessary error path

2013-09-30 Thread Eric Wong
Andi Kleen  wrote:
> From: Andi Kleen 
> 
> A static checker was pointing out that nothing can possible set
> nwait < 0 in this path. The comment and the check appears to be
> outdated. Remove it.

I don't think so...

> Cc: v...@zeniv.linux.org.uk
> Signed-off-by: Andi Kleen 
> ---
>  fs/eventpoll.c | 10 --
>  1 file changed, 10 deletions(-)
> 
> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> index 473e09d..f72bf55 100644
> --- a/fs/eventpoll.c
> +++ b/fs/eventpoll.c
> @@ -1276,15 +1276,6 @@ static int ep_insert(struct eventpoll *ep, struct 
> epoll_event *event,
>*/
>   revents = ep_item_poll(epi, );

ep_item_poll calls f_op->poll, which calls poll_wait().
poll_wait() will call ep_ptable_queue_proc.

> - /*
> -  * We have to check if something went wrong during the poll wait queue
> -  * install process. Namely an allocation for a wait queue failed due
> -  * high memory pressure.
> -  */
> - error = -ENOMEM;
> - if (epi->nwait < 0)
> - goto error_unregister;
> -
>   /* Add the current item to the list of active epoll hook for this file 
> */
>   spin_lock(>f_lock);
>   list_add_tail(>fllink, >f_ep_links);
> @@ -1334,7 +1325,6 @@ error_remove_epi:
>  
>   rb_erase(>rbn, >rbr);
>  
> -error_unregister:
>   ep_unregister_pollwait(ep, epi);
>  
>   /*
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 09/11] epoll: Remove unnecessary error path

2013-09-30 Thread Eric Wong
Andi Kleen a...@firstfloor.org wrote:
 From: Andi Kleen a...@linux.intel.com
 
 A static checker was pointing out that nothing can possible set
 nwait  0 in this path. The comment and the check appears to be
 outdated. Remove it.

I don't think so...

 Cc: v...@zeniv.linux.org.uk
 Signed-off-by: Andi Kleen a...@linux.intel.com
 ---
  fs/eventpoll.c | 10 --
  1 file changed, 10 deletions(-)
 
 diff --git a/fs/eventpoll.c b/fs/eventpoll.c
 index 473e09d..f72bf55 100644
 --- a/fs/eventpoll.c
 +++ b/fs/eventpoll.c
 @@ -1276,15 +1276,6 @@ static int ep_insert(struct eventpoll *ep, struct 
 epoll_event *event,
*/
   revents = ep_item_poll(epi, epq.pt);

ep_item_poll calls f_op-poll, which calls poll_wait().
poll_wait() will call ep_ptable_queue_proc.

 - /*
 -  * We have to check if something went wrong during the poll wait queue
 -  * install process. Namely an allocation for a wait queue failed due
 -  * high memory pressure.
 -  */
 - error = -ENOMEM;
 - if (epi-nwait  0)
 - goto error_unregister;
 -
   /* Add the current item to the list of active epoll hook for this file 
 */
   spin_lock(tfile-f_lock);
   list_add_tail(epi-fllink, tfile-f_ep_links);
 @@ -1334,7 +1325,6 @@ error_remove_epi:
  
   rb_erase(epi-rbn, ep-rbr);
  
 -error_unregister:
   ep_unregister_pollwait(ep, epi);
  
   /*
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


  1   2   3   4   >