Re: [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1

2015-02-24 Thread Fam Zheng
On Wed, 02/18 19:49, Ingo Molnar wrote:
> 
> * Fam Zheng  wrote:
> 
> > On Sun, 02/15 15:00, Jonathan Corbet wrote:
> > > On Fri, 13 Feb 2015 17:03:56 +0800
> > > Fam Zheng  wrote:
> > > 
> > > > SYNOPSIS
> > > > 
> > > >#include 
> > > > 
> > > >int epoll_pwait1(int epfd, int flags,
> > > > struct epoll_event *events,
> > > > int maxevents,
> > > > struct epoll_wait_params *params);
> > > 
> > > Quick, possibly dumb question: might it make sense to also pass in 
> > > sizeof(struct epoll_wait_params)?  That way, when somebody wants to add
> > > another parameter in the future, the kernel can tell which version is in
> > > use and they won't have to do an epoll_pwait2()?
> > > 
> > 
> > Flags can be used for that, if the change is not 
> > radically different.
> 
> Passing in size is generally better than flags, because 
> that way an extension of the ABI (new field[s]) 
> automatically signals towards the kernel what to do with 
> old binaries - while extending the functionality of new 
> binaries, without sacrificing functionality.
> 
> With flags you are either limited to the same structure 
> size - or have to decode a 'size' value from the flags 
> value - which is fragile (and in which case a real 'size' 
> parameter is better).
> 
> in the perf ABI we use something like that: there's a 
> perf_attr.size parameter that iterates the ABI forward, 
> while still being binary compatible with older software.
> 
> If old binaries pass in a smaller structure to a newer 
> kernel then the kernel pads the new fields with zero by 
> default - that way the kernel internals are never burdened 
> with compatibility details and data format versions.
> 
> If new user-space passes in a large structure than the 
> kernel can handle then the kernel returns an error - this 
> way user-space can transparently support conditional 
> features and fallback logic.
> 
> It works really well, we've done literally a hundred perf 
> ABI extensions this way in the last 4+ years, in a pretty 
> natural fashion, without littering the kernel (or 
> user-space) with version legacies and without breaking 
> existing perf tooling.
> 
> Other syscall ABIs already get painful when trying to 
> handle 2-3 data structure versions, so people either give 
> up, or add flags kludges or go to new syscall entries: 
> which is painful in its own fashion and adds unnecessary 
> latency to feature introduction as well.
> 

Excellent. This now makes a lot of sense to me, thanks to your explanations,
Ingo.

I'll add the "size" field in the next revision.

Thanks,
Fam
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1

2015-02-24 Thread Fam Zheng
On Wed, 02/18 19:49, Ingo Molnar wrote:
 
 * Fam Zheng f...@redhat.com wrote:
 
  On Sun, 02/15 15:00, Jonathan Corbet wrote:
   On Fri, 13 Feb 2015 17:03:56 +0800
   Fam Zheng f...@redhat.com wrote:
   
SYNOPSIS

   #include sys/epoll.h

   int epoll_pwait1(int epfd, int flags,
struct epoll_event *events,
int maxevents,
struct epoll_wait_params *params);
   
   Quick, possibly dumb question: might it make sense to also pass in 
   sizeof(struct epoll_wait_params)?  That way, when somebody wants to add
   another parameter in the future, the kernel can tell which version is in
   use and they won't have to do an epoll_pwait2()?
   
  
  Flags can be used for that, if the change is not 
  radically different.
 
 Passing in size is generally better than flags, because 
 that way an extension of the ABI (new field[s]) 
 automatically signals towards the kernel what to do with 
 old binaries - while extending the functionality of new 
 binaries, without sacrificing functionality.
 
 With flags you are either limited to the same structure 
 size - or have to decode a 'size' value from the flags 
 value - which is fragile (and in which case a real 'size' 
 parameter is better).
 
 in the perf ABI we use something like that: there's a 
 perf_attr.size parameter that iterates the ABI forward, 
 while still being binary compatible with older software.
 
 If old binaries pass in a smaller structure to a newer 
 kernel then the kernel pads the new fields with zero by 
 default - that way the kernel internals are never burdened 
 with compatibility details and data format versions.
 
 If new user-space passes in a large structure than the 
 kernel can handle then the kernel returns an error - this 
 way user-space can transparently support conditional 
 features and fallback logic.
 
 It works really well, we've done literally a hundred perf 
 ABI extensions this way in the last 4+ years, in a pretty 
 natural fashion, without littering the kernel (or 
 user-space) with version legacies and without breaking 
 existing perf tooling.
 
 Other syscall ABIs already get painful when trying to 
 handle 2-3 data structure versions, so people either give 
 up, or add flags kludges or go to new syscall entries: 
 which is painful in its own fashion and adds unnecessary 
 latency to feature introduction as well.
 

Excellent. This now makes a lot of sense to me, thanks to your explanations,
Ingo.

I'll add the size field in the next revision.

Thanks,
Fam
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1

2015-02-18 Thread Ingo Molnar

* Fam Zheng  wrote:

> On Sun, 02/15 15:00, Jonathan Corbet wrote:
> > On Fri, 13 Feb 2015 17:03:56 +0800
> > Fam Zheng  wrote:
> > 
> > > SYNOPSIS
> > > 
> > >#include 
> > > 
> > >int epoll_pwait1(int epfd, int flags,
> > > struct epoll_event *events,
> > > int maxevents,
> > > struct epoll_wait_params *params);
> > 
> > Quick, possibly dumb question: might it make sense to also pass in 
> > sizeof(struct epoll_wait_params)?  That way, when somebody wants to add
> > another parameter in the future, the kernel can tell which version is in
> > use and they won't have to do an epoll_pwait2()?
> > 
> 
> Flags can be used for that, if the change is not 
> radically different.

Passing in size is generally better than flags, because 
that way an extension of the ABI (new field[s]) 
automatically signals towards the kernel what to do with 
old binaries - while extending the functionality of new 
binaries, without sacrificing functionality.

With flags you are either limited to the same structure 
size - or have to decode a 'size' value from the flags 
value - which is fragile (and in which case a real 'size' 
parameter is better).

in the perf ABI we use something like that: there's a 
perf_attr.size parameter that iterates the ABI forward, 
while still being binary compatible with older software.

If old binaries pass in a smaller structure to a newer 
kernel then the kernel pads the new fields with zero by 
default - that way the kernel internals are never burdened 
with compatibility details and data format versions.

If new user-space passes in a large structure than the 
kernel can handle then the kernel returns an error - this 
way user-space can transparently support conditional 
features and fallback logic.

It works really well, we've done literally a hundred perf 
ABI extensions this way in the last 4+ years, in a pretty 
natural fashion, without littering the kernel (or 
user-space) with version legacies and without breaking 
existing perf tooling.

Other syscall ABIs already get painful when trying to 
handle 2-3 data structure versions, so people either give 
up, or add flags kludges or go to new syscall entries: 
which is painful in its own fashion and adds unnecessary 
latency to feature introduction as well.

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1

2015-02-18 Thread Ingo Molnar

* Fam Zheng f...@redhat.com wrote:

 On Sun, 02/15 15:00, Jonathan Corbet wrote:
  On Fri, 13 Feb 2015 17:03:56 +0800
  Fam Zheng f...@redhat.com wrote:
  
   SYNOPSIS
   
  #include sys/epoll.h
   
  int epoll_pwait1(int epfd, int flags,
   struct epoll_event *events,
   int maxevents,
   struct epoll_wait_params *params);
  
  Quick, possibly dumb question: might it make sense to also pass in 
  sizeof(struct epoll_wait_params)?  That way, when somebody wants to add
  another parameter in the future, the kernel can tell which version is in
  use and they won't have to do an epoll_pwait2()?
  
 
 Flags can be used for that, if the change is not 
 radically different.

Passing in size is generally better than flags, because 
that way an extension of the ABI (new field[s]) 
automatically signals towards the kernel what to do with 
old binaries - while extending the functionality of new 
binaries, without sacrificing functionality.

With flags you are either limited to the same structure 
size - or have to decode a 'size' value from the flags 
value - which is fragile (and in which case a real 'size' 
parameter is better).

in the perf ABI we use something like that: there's a 
perf_attr.size parameter that iterates the ABI forward, 
while still being binary compatible with older software.

If old binaries pass in a smaller structure to a newer 
kernel then the kernel pads the new fields with zero by 
default - that way the kernel internals are never burdened 
with compatibility details and data format versions.

If new user-space passes in a large structure than the 
kernel can handle then the kernel returns an error - this 
way user-space can transparently support conditional 
features and fallback logic.

It works really well, we've done literally a hundred perf 
ABI extensions this way in the last 4+ years, in a pretty 
natural fashion, without littering the kernel (or 
user-space) with version legacies and without breaking 
existing perf tooling.

Other syscall ABIs already get painful when trying to 
handle 2-3 data structure versions, so people either give 
up, or add flags kludges or go to new syscall entries: 
which is painful in its own fashion and adds unnecessary 
latency to feature introduction as well.

Thanks,

Ingo
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1

2015-02-16 Thread Fam Zheng
Hi Seymour,

On Mon, 02/16 07:25, Seymour, Shane M wrote:
> I found the manual pages really confusing so I had a go at rewriting
> them - there were places in the manual page that didn't match the
> functionality provided by your code as well as I could tell).

Could you point which places don't match the code?

> 
> My apologies for a few formatting issues though. I still don't like
> parts of epoll_pwait1 but it's less confusing than it was.

Any other than the timespec question don't you like?

> 
> You are free to take some or all or none of the changes.
> 
> I did have a question I marked with  below about what you
> describe and what your code does.
> 



>The timeout member specifies the minimum time that epoll_wait(2) will
>block. The time spent waiting will be rounded up to the clock
>granularity. Kernel scheduling delays mean that the blocking
>interval may overrun by a small amount. Specifying a -1 for either
>tv_sec or tv_nsec member of the struct timespec timeout will cause
>causes epoll_pwait1(2) to block indefinitely. Specifying a timeout
>equal to zero (both tv_sec or tv_nsec member of the struct timespec
>timeout are zero) causes epoll_wait(2) to return immediately, even
>if no events are available.
> 
>  Are you really really sure about this for the -1 stuff? your code copies
> in the timespec and just passes it to timespec_to_ktime:
> 
> + if (copy_from_user(, params, sizeof(p)))
> + return -EFAULT;
> ...
> + kt = timespec_to_ktime(p.timeout);
> 
> Compare that to something like the futex syscall which does this:
> 
>   if (copy_from_user(, utime, sizeof(ts)) != 0)
>   return -EFAULT;
>   if (!timespec_valid())
>   return -EINVAL;
> 
>   t = timespec_to_ktime(ts);
> 
> If the timespec is not valid it returns -EINVAL back to user space. With your
> settings of tv_sec and/or tv_usec to -1 are you relying on a side effect of
> the conversion that could break your code in the future if in the unlikely
> event someone changes timespec_to_ktime() and should it be:
> 
> + if (copy_from_user(, params, sizeof(p)))
> + return -EFAULT;
> +   if ((p.timeout.tv_sec == -1) || (p.timeout.tv_nsec == -1)) {
> +  /* this is off the top of my head no idea if it will compile */
> + p.timeout.tv_sec = KTIME_SEC_MAX;
> + p.timeout.tv_nsec = 0;
> + }
> +   if (!timespec_valid())
> + return -EINVAL;
> ...
> + kt = timespec_to_ktime(p.timeout);

OK. timespec_valid() is clear about this: negative tv_sec is invalid, so I
don't think accepting -1 from user is the right thing to do.

We cannot do pointer check as ppoll already because the structure is embedded
in epoll_wait_params.

Maybe it's best to use a flags bit (#define EPOLL_PWAIT1_BLOCK 1).  What do you
think?

Fam


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1

2015-02-16 Thread Fam Zheng
Hi Seymour,

On Mon, 02/16 07:25, Seymour, Shane M wrote:
 I found the manual pages really confusing so I had a go at rewriting
 them - there were places in the manual page that didn't match the
 functionality provided by your code as well as I could tell).

Could you point which places don't match the code?

 
 My apologies for a few formatting issues though. I still don't like
 parts of epoll_pwait1 but it's less confusing than it was.

Any other than the timespec question don't you like?

 
 You are free to take some or all or none of the changes.
 
 I did have a question I marked with  below about what you
 describe and what your code does.
 

snip

The timeout member specifies the minimum time that epoll_wait(2) will
block. The time spent waiting will be rounded up to the clock
granularity. Kernel scheduling delays mean that the blocking
interval may overrun by a small amount. Specifying a -1 for either
tv_sec or tv_nsec member of the struct timespec timeout will cause
causes epoll_pwait1(2) to block indefinitely. Specifying a timeout
equal to zero (both tv_sec or tv_nsec member of the struct timespec
timeout are zero) causes epoll_wait(2) to return immediately, even
if no events are available.
 
  Are you really really sure about this for the -1 stuff? your code copies
 in the timespec and just passes it to timespec_to_ktime:
 
 + if (copy_from_user(p, params, sizeof(p)))
 + return -EFAULT;
 ...
 + kt = timespec_to_ktime(p.timeout);
 
 Compare that to something like the futex syscall which does this:
 
   if (copy_from_user(ts, utime, sizeof(ts)) != 0)
   return -EFAULT;
   if (!timespec_valid(ts))
   return -EINVAL;
 
   t = timespec_to_ktime(ts);
 
 If the timespec is not valid it returns -EINVAL back to user space. With your
 settings of tv_sec and/or tv_usec to -1 are you relying on a side effect of
 the conversion that could break your code in the future if in the unlikely
 event someone changes timespec_to_ktime() and should it be:
 
 + if (copy_from_user(p, params, sizeof(p)))
 + return -EFAULT;
 +   if ((p.timeout.tv_sec == -1) || (p.timeout.tv_nsec == -1)) {
 +  /* this is off the top of my head no idea if it will compile */
 + p.timeout.tv_sec = KTIME_SEC_MAX;
 + p.timeout.tv_nsec = 0;
 + }
 +   if (!timespec_valid(p.timeout))
 + return -EINVAL;
 ...
 + kt = timespec_to_ktime(p.timeout);

OK. timespec_valid() is clear about this: negative tv_sec is invalid, so I
don't think accepting -1 from user is the right thing to do.

We cannot do pointer check as ppoll already because the structure is embedded
in epoll_wait_params.

Maybe it's best to use a flags bit (#define EPOLL_PWAIT1_BLOCK 1).  What do you
think?

Fam

snip
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1

2015-02-15 Thread Seymour, Shane M
I found the manual pages really confusing so I had a go at rewriting
them - there were places in the manual page that didn't match the
functionality provided by your code as well as I could tell).

My apologies for a few formatting issues though. I still don't like
parts of epoll_pwait1 but it's less confusing than it was.

You are free to take some or all or none of the changes.

I did have a question I marked with  below about what you
describe and what your code does.

1) epoll_ctl_batch
--

NAME
   epoll_ctl_batch - batch control interface for an epoll descriptor

SYNOPSIS

   #include 

   int epoll_ctl_batch(int epfd, int flags,
   int ncmds, struct epoll_ctl_cmd *cmds);

DESCRIPTION

   This system call is an extension of epoll_ctl(). The primary
   difference is that this system call allows you to batch multiple
   operations with the one system call. This provides a more efficient
   interface for updating events on this epoll file descriptor epfd.

   The flags argument is reserved and must be 0.

   The argument ncmds is the number of cmds entries being passed in.
   This number must be greater than 0.

   Each operation is specified as an element in the cmds array, defined as:

   struct epoll_ctl_cmd {

  /* Reserved flags for future extension, must be 0. */
  int flags;

  /* The same as epoll_ctl() op parameter. */
  int op;

  /* The same as epoll_ctl() fd parameter. */
  int fd;

  /* The same as the "events" field in struct epoll_event. */
  uint32_t events;

  /* The same as the "data" field in struct epoll_event. */
  uint64_t data;

  /* Output field, will be set to the return code after this
   * command is executed by kernel */
  int result;
   };

   This system call is not atomic when updating the epoll descriptor.
   All entries in cmds are executed in the order provided. If any cmds
   entry fails to be processed no further entries are processed and 
   the number of successfully processed entries is returned.

   Each single operation defined by a struct epoll_ctl_cmd has the same 
   semantics as an epoll_ctl(2) call. See the epoll_ctl() manual page
   for more information about how to correctly setup the members of a
   struct epoll_ctl_cmd.

   Upon completion of the call the result member of each struct 
epoll_ctl_cmd
   may be set to 0 (sucessfully completed) or an error code depending on the
   result of the command. If the kernel fails to change the result (for
   example the location of the cmds argument is fully or partly read only)
   the result member of each struct epoll_ctl_cmd may be unchanged. 

RETURN VALUE

   epoll_ctl_batch() returns a number greater than 0 to indicate the number
   of cmnd entries processed. If all entries have been processed this will
   equal the ncmds parameter passed in.

   If one or more parameters are incorrect the value returned is -1 with
   errno set appropriately - no cmds entries have been processed when this
   happens.

   If processing any entry in the cmds argument results in an error the
   number returned is the number of the failing entry - this number will be
   less than ncmds. Since ncmds must be greater than 0 a return value of
   0 indicates an error associated with the very first cmds entry. 
   A return value of 0 does not indicate a successful system call.

   To correctly test the return value from epoll_ctl_batch() use code 
similar
   to the following:

ret=epoll_ctl_batch(epfd, flags, ncmds, );
if (ret < ncmds) {
if (ret == -1) {
/* An argument was invalid */
} else {
/* ret contains the number of successful entries
 * processed. If you (mis?)use it as a C index 
it
 * will index directly to the failing entry to
 * get the result use cmds[ret].result which 
may 
 * contain the errno value associated with the
 * entry.
 */
}
} else {
/* Success */
}

ERRORS

   EINVAL flags is non-zero, or ncmds is less than or equal to zero, or
  cmds is NULL.

   ENOMEM There was insufficient memory to handle the requested op control
  operation.

   EFAULT The memory area pointed to by cmds is not accessible with read
  permissions.

   In the event that the 

Re: [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1

2015-02-15 Thread Fam Zheng
On Sun, 02/15 15:00, Jonathan Corbet wrote:
> On Fri, 13 Feb 2015 17:03:56 +0800
> Fam Zheng  wrote:
> 
> > SYNOPSIS
> > 
> >#include 
> > 
> >int epoll_pwait1(int epfd, int flags,
> > struct epoll_event *events,
> > int maxevents,
> > struct epoll_wait_params *params);
> 
> Quick, possibly dumb question: might it make sense to also pass in 
> sizeof(struct epoll_wait_params)?  That way, when somebody wants to add
> another parameter in the future, the kernel can tell which version is in
> use and they won't have to do an epoll_pwait2()?
> 

Flags can be used for that, if the change is not radically different.

Fam
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1

2015-02-15 Thread Jonathan Corbet
On Fri, 13 Feb 2015 17:03:56 +0800
Fam Zheng  wrote:

> SYNOPSIS
> 
>#include 
> 
>int epoll_pwait1(int epfd, int flags,
> struct epoll_event *events,
> int maxevents,
> struct epoll_wait_params *params);

Quick, possibly dumb question: might it make sense to also pass in 
sizeof(struct epoll_wait_params)?  That way, when somebody wants to add
another parameter in the future, the kernel can tell which version is in
use and they won't have to do an epoll_pwait2()?

Thanks,

jon
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1

2015-02-15 Thread Michael Kerrisk (man-pages)
On 02/13/2015 10:53 AM, Omar Sandoval wrote:
> On Fri, Feb 13, 2015 at 05:03:56PM +0800, Fam Zheng wrote:
>> Hi all,
>>
>> This is the updated series for the new epoll system calls, with the cover
>> letter rewritten which includes some more explanation. Comments are very
>> welcome!
>>
>> Original Motivation
>> ===
>>
>> QEMU, and probably many more select/poll based applications, will consider
>> epoll as an alternative, when its event loop needs to handle a big number of
>> fds. However, there are currently two concerns with epoll which prevents the
>> switching from ppoll to epoll. 
>>
>> The major one is the timeout precision.
>>
>> For example in QEMU, the main loop takes care of calling callbacks at a
>> specific timeout - the QEMU timer API. The timeout value in ppoll depends on
>> the next firing timer. epoll_pwait's millisecond timeout is so coarse that
>> rounding up the timeout will hurt performance badly.
>>
>> The minor one is the number of system call to update fd set.
>>
>> While epoll can handle a large number of fds quickly, it still requires one
>> epoll_ctl per fd update, compared to the one-shot call to select/poll with an
>> fd array. This may as well make epoll inferior to ppoll in the cases where a
>> small, but frequently changing set of fds are polled by the event loop.
>>
>> This series introduces two new epoll APIs to address them respectively. The
>> idea of epoll_ctl_batch is suggested by Andy Lutomirski in [1], who also
>> suggested clockid as a parameter in epoll_pwait1.
>>
>> Discussion
>> ==
>>
>> [Note: This is the question part regarding the interface contract of
>> epoll_ctl_batch.  If you don't have the context of what is epoll_ctl_batch 
>> yet,
>> please skip this part and probably start with the man page style 
>> documentation.
>> You can resume to this section later.]
>>
>> [Thanks to Omar Sandoval , who pointed out this in
>> reviewing v1]
>>
>> We try to report status for each command in epoll_ctl_batch, by writting to
>> user provided command array (pointed to cmds). The tricky thing in the
>> implementation is that, copying the results back to userspace comes last, 
>> after
>> the commands are executed. At this point, if the copy_to_user fails, the
>> effects are done and no return - or if we add some mechanism to revert it, 
>> the
>> code will be too complicated and slow.
>>
>> In above corner case, the return value of epoll_ctl_batch is smaller than
>> ncmds, which assures our caller that last N commands failed, where N = ncmds 
>> -
>> ret.  But they'll also find that cmd.result is not changed to error code.
>>
>> I suppose this is not a big deal, because 1) it should happen very rarely. 2)
>> user does know the actual number of commands that succeed.
>>
>> So, do we leave it as is? Or is there any way to improve?
>>
>> One tiny improvement (not a complete fix) in my mind is a testing 
>> copy_to_user
>> before we execute the commands. If it succeeds, it's even less likely the 
>> last
>> copy_to_user could fail, so that we can even probably assert it won't. The
>> testing 'copy_to_user(cmds, , ...)' will not hurt functionality if do 
>> it
>> right after 'copy_from_user(, cmds, ...)'. But I'm not sure about the
>> performance impact, especially when @ncmds is big.
>>
> I don't think this is the right thing to do, since, for example, another
> thread could unmap the memory region holding buffer between the "check"
> copy_to_user and the actual one.
> 
> The two alternatives that I see are:
> 
> 1. If copy_to_user fails, ignore it and return the number of commands
> that succeeded (i.e., what the code in your patch does).
> 2. If copy_to_user fails, return -EFAULT, regardless of how many
> commands succeeded.
> 
> The problem with 1 is that it could potentially mask bugs in a user
> program. You could imagine a buggy program that passes a read-only
> buffer to epoll_ctl_batch and never finds out about it because they
> don't get the error. (Then, when there's a real error in one of the
> epoll_ctl_cmds, but .result is 0, someone will be very confused.)
> 
> So I think 2 is the better option. Sure, the user will have no idea how
> many commands were executed, but when would EFAULT on this system call
> be part of normal operation, anyways? You'll find the memory bug, fix
> it, and rest easy knowing that nothing is silently failing behind your
> back.

What Omar says makes sense to me too. Best to have the user get a clear 
error indication for this case.

Cheers,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1

2015-02-15 Thread Michael Kerrisk (man-pages)
On 02/13/2015 10:53 AM, Omar Sandoval wrote:
 On Fri, Feb 13, 2015 at 05:03:56PM +0800, Fam Zheng wrote:
 Hi all,

 This is the updated series for the new epoll system calls, with the cover
 letter rewritten which includes some more explanation. Comments are very
 welcome!

 Original Motivation
 ===

 QEMU, and probably many more select/poll based applications, will consider
 epoll as an alternative, when its event loop needs to handle a big number of
 fds. However, there are currently two concerns with epoll which prevents the
 switching from ppoll to epoll. 

 The major one is the timeout precision.

 For example in QEMU, the main loop takes care of calling callbacks at a
 specific timeout - the QEMU timer API. The timeout value in ppoll depends on
 the next firing timer. epoll_pwait's millisecond timeout is so coarse that
 rounding up the timeout will hurt performance badly.

 The minor one is the number of system call to update fd set.

 While epoll can handle a large number of fds quickly, it still requires one
 epoll_ctl per fd update, compared to the one-shot call to select/poll with an
 fd array. This may as well make epoll inferior to ppoll in the cases where a
 small, but frequently changing set of fds are polled by the event loop.

 This series introduces two new epoll APIs to address them respectively. The
 idea of epoll_ctl_batch is suggested by Andy Lutomirski in [1], who also
 suggested clockid as a parameter in epoll_pwait1.

 Discussion
 ==

 [Note: This is the question part regarding the interface contract of
 epoll_ctl_batch.  If you don't have the context of what is epoll_ctl_batch 
 yet,
 please skip this part and probably start with the man page style 
 documentation.
 You can resume to this section later.]

 [Thanks to Omar Sandoval osan...@osandov.com, who pointed out this in
 reviewing v1]

 We try to report status for each command in epoll_ctl_batch, by writting to
 user provided command array (pointed to cmds). The tricky thing in the
 implementation is that, copying the results back to userspace comes last, 
 after
 the commands are executed. At this point, if the copy_to_user fails, the
 effects are done and no return - or if we add some mechanism to revert it, 
 the
 code will be too complicated and slow.

 In above corner case, the return value of epoll_ctl_batch is smaller than
 ncmds, which assures our caller that last N commands failed, where N = ncmds 
 -
 ret.  But they'll also find that cmd.result is not changed to error code.

 I suppose this is not a big deal, because 1) it should happen very rarely. 2)
 user does know the actual number of commands that succeed.

 So, do we leave it as is? Or is there any way to improve?

 One tiny improvement (not a complete fix) in my mind is a testing 
 copy_to_user
 before we execute the commands. If it succeeds, it's even less likely the 
 last
 copy_to_user could fail, so that we can even probably assert it won't. The
 testing 'copy_to_user(cmds, kcmds, ...)' will not hurt functionality if do 
 it
 right after 'copy_from_user(kcmds, cmds, ...)'. But I'm not sure about the
 performance impact, especially when @ncmds is big.

 I don't think this is the right thing to do, since, for example, another
 thread could unmap the memory region holding buffer between the check
 copy_to_user and the actual one.
 
 The two alternatives that I see are:
 
 1. If copy_to_user fails, ignore it and return the number of commands
 that succeeded (i.e., what the code in your patch does).
 2. If copy_to_user fails, return -EFAULT, regardless of how many
 commands succeeded.
 
 The problem with 1 is that it could potentially mask bugs in a user
 program. You could imagine a buggy program that passes a read-only
 buffer to epoll_ctl_batch and never finds out about it because they
 don't get the error. (Then, when there's a real error in one of the
 epoll_ctl_cmds, but .result is 0, someone will be very confused.)
 
 So I think 2 is the better option. Sure, the user will have no idea how
 many commands were executed, but when would EFAULT on this system call
 be part of normal operation, anyways? You'll find the memory bug, fix
 it, and rest easy knowing that nothing is silently failing behind your
 back.

What Omar says makes sense to me too. Best to have the user get a clear 
error indication for this case.

Cheers,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1

2015-02-15 Thread Fam Zheng
On Sun, 02/15 15:00, Jonathan Corbet wrote:
 On Fri, 13 Feb 2015 17:03:56 +0800
 Fam Zheng f...@redhat.com wrote:
 
  SYNOPSIS
  
 #include sys/epoll.h
  
 int epoll_pwait1(int epfd, int flags,
  struct epoll_event *events,
  int maxevents,
  struct epoll_wait_params *params);
 
 Quick, possibly dumb question: might it make sense to also pass in 
 sizeof(struct epoll_wait_params)?  That way, when somebody wants to add
 another parameter in the future, the kernel can tell which version is in
 use and they won't have to do an epoll_pwait2()?
 

Flags can be used for that, if the change is not radically different.

Fam
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1

2015-02-15 Thread Jonathan Corbet
On Fri, 13 Feb 2015 17:03:56 +0800
Fam Zheng f...@redhat.com wrote:

 SYNOPSIS
 
#include sys/epoll.h
 
int epoll_pwait1(int epfd, int flags,
 struct epoll_event *events,
 int maxevents,
 struct epoll_wait_params *params);

Quick, possibly dumb question: might it make sense to also pass in 
sizeof(struct epoll_wait_params)?  That way, when somebody wants to add
another parameter in the future, the kernel can tell which version is in
use and they won't have to do an epoll_pwait2()?

Thanks,

jon
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1

2015-02-15 Thread Seymour, Shane M
I found the manual pages really confusing so I had a go at rewriting
them - there were places in the manual page that didn't match the
functionality provided by your code as well as I could tell).

My apologies for a few formatting issues though. I still don't like
parts of epoll_pwait1 but it's less confusing than it was.

You are free to take some or all or none of the changes.

I did have a question I marked with  below about what you
describe and what your code does.

1) epoll_ctl_batch
--

NAME
   epoll_ctl_batch - batch control interface for an epoll descriptor

SYNOPSIS

   #include sys/epoll.h

   int epoll_ctl_batch(int epfd, int flags,
   int ncmds, struct epoll_ctl_cmd *cmds);

DESCRIPTION

   This system call is an extension of epoll_ctl(). The primary
   difference is that this system call allows you to batch multiple
   operations with the one system call. This provides a more efficient
   interface for updating events on this epoll file descriptor epfd.

   The flags argument is reserved and must be 0.

   The argument ncmds is the number of cmds entries being passed in.
   This number must be greater than 0.

   Each operation is specified as an element in the cmds array, defined as:

   struct epoll_ctl_cmd {

  /* Reserved flags for future extension, must be 0. */
  int flags;

  /* The same as epoll_ctl() op parameter. */
  int op;

  /* The same as epoll_ctl() fd parameter. */
  int fd;

  /* The same as the events field in struct epoll_event. */
  uint32_t events;

  /* The same as the data field in struct epoll_event. */
  uint64_t data;

  /* Output field, will be set to the return code after this
   * command is executed by kernel */
  int result;
   };

   This system call is not atomic when updating the epoll descriptor.
   All entries in cmds are executed in the order provided. If any cmds
   entry fails to be processed no further entries are processed and 
   the number of successfully processed entries is returned.

   Each single operation defined by a struct epoll_ctl_cmd has the same 
   semantics as an epoll_ctl(2) call. See the epoll_ctl() manual page
   for more information about how to correctly setup the members of a
   struct epoll_ctl_cmd.

   Upon completion of the call the result member of each struct 
epoll_ctl_cmd
   may be set to 0 (sucessfully completed) or an error code depending on the
   result of the command. If the kernel fails to change the result (for
   example the location of the cmds argument is fully or partly read only)
   the result member of each struct epoll_ctl_cmd may be unchanged. 

RETURN VALUE

   epoll_ctl_batch() returns a number greater than 0 to indicate the number
   of cmnd entries processed. If all entries have been processed this will
   equal the ncmds parameter passed in.

   If one or more parameters are incorrect the value returned is -1 with
   errno set appropriately - no cmds entries have been processed when this
   happens.

   If processing any entry in the cmds argument results in an error the
   number returned is the number of the failing entry - this number will be
   less than ncmds. Since ncmds must be greater than 0 a return value of
   0 indicates an error associated with the very first cmds entry. 
   A return value of 0 does not indicate a successful system call.

   To correctly test the return value from epoll_ctl_batch() use code 
similar
   to the following:

ret=epoll_ctl_batch(epfd, flags, ncmds, cmds);
if (ret  ncmds) {
if (ret == -1) {
/* An argument was invalid */
} else {
/* ret contains the number of successful entries
 * processed. If you (mis?)use it as a C index 
it
 * will index directly to the failing entry to
 * get the result use cmds[ret].result which 
may 
 * contain the errno value associated with the
 * entry.
 */
}
} else {
/* Success */
}

ERRORS

   EINVAL flags is non-zero, or ncmds is less than or equal to zero, or
  cmds is NULL.

   ENOMEM There was insufficient memory to handle the requested op control
  operation.

   EFAULT The memory area pointed to by cmds is not accessible with read
  permissions.

   In the event 

Re: [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1

2015-02-14 Thread Fam Zheng
On Fri, 02/13 01:53, Omar Sandoval wrote:

> > Discussion
> > ==
> > 
> > [Note: This is the question part regarding the interface contract of
> > epoll_ctl_batch.  If you don't have the context of what is epoll_ctl_batch 
> > yet,
> > please skip this part and probably start with the man page style 
> > documentation.
> > You can resume to this section later.]
> > 
> > [Thanks to Omar Sandoval , who pointed out this in
> > reviewing v1]
> > 
> > We try to report status for each command in epoll_ctl_batch, by writting to
> > user provided command array (pointed to cmds). The tricky thing in the
> > implementation is that, copying the results back to userspace comes last, 
> > after
> > the commands are executed. At this point, if the copy_to_user fails, the
> > effects are done and no return - or if we add some mechanism to revert it, 
> > the
> > code will be too complicated and slow.
> > 
> > In above corner case, the return value of epoll_ctl_batch is smaller than
> > ncmds, which assures our caller that last N commands failed, where N = 
> > ncmds -
> > ret.  But they'll also find that cmd.result is not changed to error code.
> > 
> > I suppose this is not a big deal, because 1) it should happen very rarely. 
> > 2)
> > user does know the actual number of commands that succeed.
> > 
> > So, do we leave it as is? Or is there any way to improve?
> > 
> > One tiny improvement (not a complete fix) in my mind is a testing 
> > copy_to_user
> > before we execute the commands. If it succeeds, it's even less likely the 
> > last
> > copy_to_user could fail, so that we can even probably assert it won't. The
> > testing 'copy_to_user(cmds, , ...)' will not hurt functionality if do 
> > it
> > right after 'copy_from_user(, cmds, ...)'. But I'm not sure about the
> > performance impact, especially when @ncmds is big.
> > 
> I don't think this is the right thing to do, since, for example, another
> thread could unmap the memory region holding buffer between the "check"
> copy_to_user and the actual one.
> 
> The two alternatives that I see are:
> 
> 1. If copy_to_user fails, ignore it and return the number of commands
> that succeeded (i.e., what the code in your patch does).
> 2. If copy_to_user fails, return -EFAULT, regardless of how many
> commands succeeded.
> 
> The problem with 1 is that it could potentially mask bugs in a user
> program. You could imagine a buggy program that passes a read-only
> buffer to epoll_ctl_batch and never finds out about it because they
> don't get the error. (Then, when there's a real error in one of the
> epoll_ctl_cmds, but .result is 0, someone will be very confused.)
> 
> So I think 2 is the better option. Sure, the user will have no idea how
> many commands were executed, but when would EFAULT on this system call
> be part of normal operation, anyways? You'll find the memory bug, fix
> it, and rest easy knowing that nothing is silently failing behind your
> back.
> 

OK, I agree with you here. I'm going to change this to -EFAULT in the next
revision.

Thanks!
Fam
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1

2015-02-14 Thread Fam Zheng
On Fri, 02/13 01:53, Omar Sandoval wrote:
snip
  Discussion
  ==
  
  [Note: This is the question part regarding the interface contract of
  epoll_ctl_batch.  If you don't have the context of what is epoll_ctl_batch 
  yet,
  please skip this part and probably start with the man page style 
  documentation.
  You can resume to this section later.]
  
  [Thanks to Omar Sandoval osan...@osandov.com, who pointed out this in
  reviewing v1]
  
  We try to report status for each command in epoll_ctl_batch, by writting to
  user provided command array (pointed to cmds). The tricky thing in the
  implementation is that, copying the results back to userspace comes last, 
  after
  the commands are executed. At this point, if the copy_to_user fails, the
  effects are done and no return - or if we add some mechanism to revert it, 
  the
  code will be too complicated and slow.
  
  In above corner case, the return value of epoll_ctl_batch is smaller than
  ncmds, which assures our caller that last N commands failed, where N = 
  ncmds -
  ret.  But they'll also find that cmd.result is not changed to error code.
  
  I suppose this is not a big deal, because 1) it should happen very rarely. 
  2)
  user does know the actual number of commands that succeed.
  
  So, do we leave it as is? Or is there any way to improve?
  
  One tiny improvement (not a complete fix) in my mind is a testing 
  copy_to_user
  before we execute the commands. If it succeeds, it's even less likely the 
  last
  copy_to_user could fail, so that we can even probably assert it won't. The
  testing 'copy_to_user(cmds, kcmds, ...)' will not hurt functionality if do 
  it
  right after 'copy_from_user(kcmds, cmds, ...)'. But I'm not sure about the
  performance impact, especially when @ncmds is big.
  
 I don't think this is the right thing to do, since, for example, another
 thread could unmap the memory region holding buffer between the check
 copy_to_user and the actual one.
 
 The two alternatives that I see are:
 
 1. If copy_to_user fails, ignore it and return the number of commands
 that succeeded (i.e., what the code in your patch does).
 2. If copy_to_user fails, return -EFAULT, regardless of how many
 commands succeeded.
 
 The problem with 1 is that it could potentially mask bugs in a user
 program. You could imagine a buggy program that passes a read-only
 buffer to epoll_ctl_batch and never finds out about it because they
 don't get the error. (Then, when there's a real error in one of the
 epoll_ctl_cmds, but .result is 0, someone will be very confused.)
 
 So I think 2 is the better option. Sure, the user will have no idea how
 many commands were executed, but when would EFAULT on this system call
 be part of normal operation, anyways? You'll find the memory bug, fix
 it, and rest easy knowing that nothing is silently failing behind your
 back.
 

OK, I agree with you here. I'm going to change this to -EFAULT in the next
revision.

Thanks!
Fam
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1

2015-02-13 Thread Omar Sandoval
On Fri, Feb 13, 2015 at 05:03:56PM +0800, Fam Zheng wrote:
> Hi all,
> 
> This is the updated series for the new epoll system calls, with the cover
> letter rewritten which includes some more explanation. Comments are very
> welcome!
> 
> Original Motivation
> ===
> 
> QEMU, and probably many more select/poll based applications, will consider
> epoll as an alternative, when its event loop needs to handle a big number of
> fds. However, there are currently two concerns with epoll which prevents the
> switching from ppoll to epoll. 
> 
> The major one is the timeout precision.
> 
> For example in QEMU, the main loop takes care of calling callbacks at a
> specific timeout - the QEMU timer API. The timeout value in ppoll depends on
> the next firing timer. epoll_pwait's millisecond timeout is so coarse that
> rounding up the timeout will hurt performance badly.
> 
> The minor one is the number of system call to update fd set.
> 
> While epoll can handle a large number of fds quickly, it still requires one
> epoll_ctl per fd update, compared to the one-shot call to select/poll with an
> fd array. This may as well make epoll inferior to ppoll in the cases where a
> small, but frequently changing set of fds are polled by the event loop.
> 
> This series introduces two new epoll APIs to address them respectively. The
> idea of epoll_ctl_batch is suggested by Andy Lutomirski in [1], who also
> suggested clockid as a parameter in epoll_pwait1.
> 
> Discussion
> ==
> 
> [Note: This is the question part regarding the interface contract of
> epoll_ctl_batch.  If you don't have the context of what is epoll_ctl_batch 
> yet,
> please skip this part and probably start with the man page style 
> documentation.
> You can resume to this section later.]
> 
> [Thanks to Omar Sandoval , who pointed out this in
> reviewing v1]
> 
> We try to report status for each command in epoll_ctl_batch, by writting to
> user provided command array (pointed to cmds). The tricky thing in the
> implementation is that, copying the results back to userspace comes last, 
> after
> the commands are executed. At this point, if the copy_to_user fails, the
> effects are done and no return - or if we add some mechanism to revert it, the
> code will be too complicated and slow.
> 
> In above corner case, the return value of epoll_ctl_batch is smaller than
> ncmds, which assures our caller that last N commands failed, where N = ncmds -
> ret.  But they'll also find that cmd.result is not changed to error code.
> 
> I suppose this is not a big deal, because 1) it should happen very rarely. 2)
> user does know the actual number of commands that succeed.
> 
> So, do we leave it as is? Or is there any way to improve?
> 
> One tiny improvement (not a complete fix) in my mind is a testing copy_to_user
> before we execute the commands. If it succeeds, it's even less likely the last
> copy_to_user could fail, so that we can even probably assert it won't. The
> testing 'copy_to_user(cmds, , ...)' will not hurt functionality if do it
> right after 'copy_from_user(, cmds, ...)'. But I'm not sure about the
> performance impact, especially when @ncmds is big.
> 
I don't think this is the right thing to do, since, for example, another
thread could unmap the memory region holding buffer between the "check"
copy_to_user and the actual one.

The two alternatives that I see are:

1. If copy_to_user fails, ignore it and return the number of commands
that succeeded (i.e., what the code in your patch does).
2. If copy_to_user fails, return -EFAULT, regardless of how many
commands succeeded.

The problem with 1 is that it could potentially mask bugs in a user
program. You could imagine a buggy program that passes a read-only
buffer to epoll_ctl_batch and never finds out about it because they
don't get the error. (Then, when there's a real error in one of the
epoll_ctl_cmds, but .result is 0, someone will be very confused.)

So I think 2 is the better option. Sure, the user will have no idea how
many commands were executed, but when would EFAULT on this system call
be part of normal operation, anyways? You'll find the memory bug, fix
it, and rest easy knowing that nothing is silently failing behind your
back.

> Links
> =
> 
> [1]: http://lists.openwall.net/linux-kernel/2015/01/08/542
> 
> Changelog
> =
> 
> Changes from v2 (https://lkml.org/lkml/2015/2/4/105)
> 
> 
>   - Rename epoll_ctl_cmd.error_hint to "result". [Michael]
> 
>   - Add background introduction in cover letter. [Michael]
> 
>   - Expand the last struct of epoll_pwait1, add clockid and timespec.
>   
>   - Update man page in cover letter accordingly:
> 
> * "error_hint" -> "result".
> * The result field's caveat in "RETURN VALUE" secion of epoll_ctl_batch.
> 
>   Please review!
> 
> Changes from v1 (https://lkml.org/lkml/2015/1/20/189)
> -
> 
>   - As discussed 

Re: [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1

2015-02-13 Thread Omar Sandoval
On Fri, Feb 13, 2015 at 05:03:56PM +0800, Fam Zheng wrote:
 Hi all,
 
 This is the updated series for the new epoll system calls, with the cover
 letter rewritten which includes some more explanation. Comments are very
 welcome!
 
 Original Motivation
 ===
 
 QEMU, and probably many more select/poll based applications, will consider
 epoll as an alternative, when its event loop needs to handle a big number of
 fds. However, there are currently two concerns with epoll which prevents the
 switching from ppoll to epoll. 
 
 The major one is the timeout precision.
 
 For example in QEMU, the main loop takes care of calling callbacks at a
 specific timeout - the QEMU timer API. The timeout value in ppoll depends on
 the next firing timer. epoll_pwait's millisecond timeout is so coarse that
 rounding up the timeout will hurt performance badly.
 
 The minor one is the number of system call to update fd set.
 
 While epoll can handle a large number of fds quickly, it still requires one
 epoll_ctl per fd update, compared to the one-shot call to select/poll with an
 fd array. This may as well make epoll inferior to ppoll in the cases where a
 small, but frequently changing set of fds are polled by the event loop.
 
 This series introduces two new epoll APIs to address them respectively. The
 idea of epoll_ctl_batch is suggested by Andy Lutomirski in [1], who also
 suggested clockid as a parameter in epoll_pwait1.
 
 Discussion
 ==
 
 [Note: This is the question part regarding the interface contract of
 epoll_ctl_batch.  If you don't have the context of what is epoll_ctl_batch 
 yet,
 please skip this part and probably start with the man page style 
 documentation.
 You can resume to this section later.]
 
 [Thanks to Omar Sandoval osan...@osandov.com, who pointed out this in
 reviewing v1]
 
 We try to report status for each command in epoll_ctl_batch, by writting to
 user provided command array (pointed to cmds). The tricky thing in the
 implementation is that, copying the results back to userspace comes last, 
 after
 the commands are executed. At this point, if the copy_to_user fails, the
 effects are done and no return - or if we add some mechanism to revert it, the
 code will be too complicated and slow.
 
 In above corner case, the return value of epoll_ctl_batch is smaller than
 ncmds, which assures our caller that last N commands failed, where N = ncmds -
 ret.  But they'll also find that cmd.result is not changed to error code.
 
 I suppose this is not a big deal, because 1) it should happen very rarely. 2)
 user does know the actual number of commands that succeed.
 
 So, do we leave it as is? Or is there any way to improve?
 
 One tiny improvement (not a complete fix) in my mind is a testing copy_to_user
 before we execute the commands. If it succeeds, it's even less likely the last
 copy_to_user could fail, so that we can even probably assert it won't. The
 testing 'copy_to_user(cmds, kcmds, ...)' will not hurt functionality if do it
 right after 'copy_from_user(kcmds, cmds, ...)'. But I'm not sure about the
 performance impact, especially when @ncmds is big.
 
I don't think this is the right thing to do, since, for example, another
thread could unmap the memory region holding buffer between the check
copy_to_user and the actual one.

The two alternatives that I see are:

1. If copy_to_user fails, ignore it and return the number of commands
that succeeded (i.e., what the code in your patch does).
2. If copy_to_user fails, return -EFAULT, regardless of how many
commands succeeded.

The problem with 1 is that it could potentially mask bugs in a user
program. You could imagine a buggy program that passes a read-only
buffer to epoll_ctl_batch and never finds out about it because they
don't get the error. (Then, when there's a real error in one of the
epoll_ctl_cmds, but .result is 0, someone will be very confused.)

So I think 2 is the better option. Sure, the user will have no idea how
many commands were executed, but when would EFAULT on this system call
be part of normal operation, anyways? You'll find the memory bug, fix
it, and rest easy knowing that nothing is silently failing behind your
back.

 Links
 =
 
 [1]: http://lists.openwall.net/linux-kernel/2015/01/08/542
 
 Changelog
 =
 
 Changes from v2 (https://lkml.org/lkml/2015/2/4/105)
 
 
   - Rename epoll_ctl_cmd.error_hint to result. [Michael]
 
   - Add background introduction in cover letter. [Michael]
 
   - Expand the last struct of epoll_pwait1, add clockid and timespec.
   
   - Update man page in cover letter accordingly:
 
 * error_hint - result.
 * The result field's caveat in RETURN VALUE secion of epoll_ctl_batch.
 
   Please review!
 
 Changes from v1 (https://lkml.org/lkml/2015/1/20/189)
 -
 
   - As discussed in previous thread [1], split the call to epoll_ctl_batch and
 epoll_pwait.