Re: [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1
On Wed, 02/18 19:49, Ingo Molnar wrote: > > * Fam Zheng wrote: > > > On Sun, 02/15 15:00, Jonathan Corbet wrote: > > > On Fri, 13 Feb 2015 17:03:56 +0800 > > > Fam Zheng wrote: > > > > > > > SYNOPSIS > > > > > > > >#include > > > > > > > >int epoll_pwait1(int epfd, int flags, > > > > struct epoll_event *events, > > > > int maxevents, > > > > struct epoll_wait_params *params); > > > > > > Quick, possibly dumb question: might it make sense to also pass in > > > sizeof(struct epoll_wait_params)? That way, when somebody wants to add > > > another parameter in the future, the kernel can tell which version is in > > > use and they won't have to do an epoll_pwait2()? > > > > > > > Flags can be used for that, if the change is not > > radically different. > > Passing in size is generally better than flags, because > that way an extension of the ABI (new field[s]) > automatically signals towards the kernel what to do with > old binaries - while extending the functionality of new > binaries, without sacrificing functionality. > > With flags you are either limited to the same structure > size - or have to decode a 'size' value from the flags > value - which is fragile (and in which case a real 'size' > parameter is better). > > in the perf ABI we use something like that: there's a > perf_attr.size parameter that iterates the ABI forward, > while still being binary compatible with older software. > > If old binaries pass in a smaller structure to a newer > kernel then the kernel pads the new fields with zero by > default - that way the kernel internals are never burdened > with compatibility details and data format versions. > > If new user-space passes in a large structure than the > kernel can handle then the kernel returns an error - this > way user-space can transparently support conditional > features and fallback logic. > > It works really well, we've done literally a hundred perf > ABI extensions this way in the last 4+ years, in a pretty > natural fashion, without littering the kernel (or > user-space) with version legacies and without breaking > existing perf tooling. > > Other syscall ABIs already get painful when trying to > handle 2-3 data structure versions, so people either give > up, or add flags kludges or go to new syscall entries: > which is painful in its own fashion and adds unnecessary > latency to feature introduction as well. > Excellent. This now makes a lot of sense to me, thanks to your explanations, Ingo. I'll add the "size" field in the next revision. Thanks, Fam -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1
On Wed, 02/18 19:49, Ingo Molnar wrote: * Fam Zheng f...@redhat.com wrote: On Sun, 02/15 15:00, Jonathan Corbet wrote: On Fri, 13 Feb 2015 17:03:56 +0800 Fam Zheng f...@redhat.com wrote: SYNOPSIS #include sys/epoll.h int epoll_pwait1(int epfd, int flags, struct epoll_event *events, int maxevents, struct epoll_wait_params *params); Quick, possibly dumb question: might it make sense to also pass in sizeof(struct epoll_wait_params)? That way, when somebody wants to add another parameter in the future, the kernel can tell which version is in use and they won't have to do an epoll_pwait2()? Flags can be used for that, if the change is not radically different. Passing in size is generally better than flags, because that way an extension of the ABI (new field[s]) automatically signals towards the kernel what to do with old binaries - while extending the functionality of new binaries, without sacrificing functionality. With flags you are either limited to the same structure size - or have to decode a 'size' value from the flags value - which is fragile (and in which case a real 'size' parameter is better). in the perf ABI we use something like that: there's a perf_attr.size parameter that iterates the ABI forward, while still being binary compatible with older software. If old binaries pass in a smaller structure to a newer kernel then the kernel pads the new fields with zero by default - that way the kernel internals are never burdened with compatibility details and data format versions. If new user-space passes in a large structure than the kernel can handle then the kernel returns an error - this way user-space can transparently support conditional features and fallback logic. It works really well, we've done literally a hundred perf ABI extensions this way in the last 4+ years, in a pretty natural fashion, without littering the kernel (or user-space) with version legacies and without breaking existing perf tooling. Other syscall ABIs already get painful when trying to handle 2-3 data structure versions, so people either give up, or add flags kludges or go to new syscall entries: which is painful in its own fashion and adds unnecessary latency to feature introduction as well. Excellent. This now makes a lot of sense to me, thanks to your explanations, Ingo. I'll add the size field in the next revision. Thanks, Fam -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1
* Fam Zheng wrote: > On Sun, 02/15 15:00, Jonathan Corbet wrote: > > On Fri, 13 Feb 2015 17:03:56 +0800 > > Fam Zheng wrote: > > > > > SYNOPSIS > > > > > >#include > > > > > >int epoll_pwait1(int epfd, int flags, > > > struct epoll_event *events, > > > int maxevents, > > > struct epoll_wait_params *params); > > > > Quick, possibly dumb question: might it make sense to also pass in > > sizeof(struct epoll_wait_params)? That way, when somebody wants to add > > another parameter in the future, the kernel can tell which version is in > > use and they won't have to do an epoll_pwait2()? > > > > Flags can be used for that, if the change is not > radically different. Passing in size is generally better than flags, because that way an extension of the ABI (new field[s]) automatically signals towards the kernel what to do with old binaries - while extending the functionality of new binaries, without sacrificing functionality. With flags you are either limited to the same structure size - or have to decode a 'size' value from the flags value - which is fragile (and in which case a real 'size' parameter is better). in the perf ABI we use something like that: there's a perf_attr.size parameter that iterates the ABI forward, while still being binary compatible with older software. If old binaries pass in a smaller structure to a newer kernel then the kernel pads the new fields with zero by default - that way the kernel internals are never burdened with compatibility details and data format versions. If new user-space passes in a large structure than the kernel can handle then the kernel returns an error - this way user-space can transparently support conditional features and fallback logic. It works really well, we've done literally a hundred perf ABI extensions this way in the last 4+ years, in a pretty natural fashion, without littering the kernel (or user-space) with version legacies and without breaking existing perf tooling. Other syscall ABIs already get painful when trying to handle 2-3 data structure versions, so people either give up, or add flags kludges or go to new syscall entries: which is painful in its own fashion and adds unnecessary latency to feature introduction as well. Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1
* Fam Zheng f...@redhat.com wrote: On Sun, 02/15 15:00, Jonathan Corbet wrote: On Fri, 13 Feb 2015 17:03:56 +0800 Fam Zheng f...@redhat.com wrote: SYNOPSIS #include sys/epoll.h int epoll_pwait1(int epfd, int flags, struct epoll_event *events, int maxevents, struct epoll_wait_params *params); Quick, possibly dumb question: might it make sense to also pass in sizeof(struct epoll_wait_params)? That way, when somebody wants to add another parameter in the future, the kernel can tell which version is in use and they won't have to do an epoll_pwait2()? Flags can be used for that, if the change is not radically different. Passing in size is generally better than flags, because that way an extension of the ABI (new field[s]) automatically signals towards the kernel what to do with old binaries - while extending the functionality of new binaries, without sacrificing functionality. With flags you are either limited to the same structure size - or have to decode a 'size' value from the flags value - which is fragile (and in which case a real 'size' parameter is better). in the perf ABI we use something like that: there's a perf_attr.size parameter that iterates the ABI forward, while still being binary compatible with older software. If old binaries pass in a smaller structure to a newer kernel then the kernel pads the new fields with zero by default - that way the kernel internals are never burdened with compatibility details and data format versions. If new user-space passes in a large structure than the kernel can handle then the kernel returns an error - this way user-space can transparently support conditional features and fallback logic. It works really well, we've done literally a hundred perf ABI extensions this way in the last 4+ years, in a pretty natural fashion, without littering the kernel (or user-space) with version legacies and without breaking existing perf tooling. Other syscall ABIs already get painful when trying to handle 2-3 data structure versions, so people either give up, or add flags kludges or go to new syscall entries: which is painful in its own fashion and adds unnecessary latency to feature introduction as well. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1
Hi Seymour, On Mon, 02/16 07:25, Seymour, Shane M wrote: > I found the manual pages really confusing so I had a go at rewriting > them - there were places in the manual page that didn't match the > functionality provided by your code as well as I could tell). Could you point which places don't match the code? > > My apologies for a few formatting issues though. I still don't like > parts of epoll_pwait1 but it's less confusing than it was. Any other than the timespec question don't you like? > > You are free to take some or all or none of the changes. > > I did have a question I marked with below about what you > describe and what your code does. > >The timeout member specifies the minimum time that epoll_wait(2) will >block. The time spent waiting will be rounded up to the clock >granularity. Kernel scheduling delays mean that the blocking >interval may overrun by a small amount. Specifying a -1 for either >tv_sec or tv_nsec member of the struct timespec timeout will cause >causes epoll_pwait1(2) to block indefinitely. Specifying a timeout >equal to zero (both tv_sec or tv_nsec member of the struct timespec >timeout are zero) causes epoll_wait(2) to return immediately, even >if no events are available. > > Are you really really sure about this for the -1 stuff? your code copies > in the timespec and just passes it to timespec_to_ktime: > > + if (copy_from_user(, params, sizeof(p))) > + return -EFAULT; > ... > + kt = timespec_to_ktime(p.timeout); > > Compare that to something like the futex syscall which does this: > > if (copy_from_user(, utime, sizeof(ts)) != 0) > return -EFAULT; > if (!timespec_valid()) > return -EINVAL; > > t = timespec_to_ktime(ts); > > If the timespec is not valid it returns -EINVAL back to user space. With your > settings of tv_sec and/or tv_usec to -1 are you relying on a side effect of > the conversion that could break your code in the future if in the unlikely > event someone changes timespec_to_ktime() and should it be: > > + if (copy_from_user(, params, sizeof(p))) > + return -EFAULT; > + if ((p.timeout.tv_sec == -1) || (p.timeout.tv_nsec == -1)) { > + /* this is off the top of my head no idea if it will compile */ > + p.timeout.tv_sec = KTIME_SEC_MAX; > + p.timeout.tv_nsec = 0; > + } > + if (!timespec_valid()) > + return -EINVAL; > ... > + kt = timespec_to_ktime(p.timeout); OK. timespec_valid() is clear about this: negative tv_sec is invalid, so I don't think accepting -1 from user is the right thing to do. We cannot do pointer check as ppoll already because the structure is embedded in epoll_wait_params. Maybe it's best to use a flags bit (#define EPOLL_PWAIT1_BLOCK 1). What do you think? Fam -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1
Hi Seymour, On Mon, 02/16 07:25, Seymour, Shane M wrote: I found the manual pages really confusing so I had a go at rewriting them - there were places in the manual page that didn't match the functionality provided by your code as well as I could tell). Could you point which places don't match the code? My apologies for a few formatting issues though. I still don't like parts of epoll_pwait1 but it's less confusing than it was. Any other than the timespec question don't you like? You are free to take some or all or none of the changes. I did have a question I marked with below about what you describe and what your code does. snip The timeout member specifies the minimum time that epoll_wait(2) will block. The time spent waiting will be rounded up to the clock granularity. Kernel scheduling delays mean that the blocking interval may overrun by a small amount. Specifying a -1 for either tv_sec or tv_nsec member of the struct timespec timeout will cause causes epoll_pwait1(2) to block indefinitely. Specifying a timeout equal to zero (both tv_sec or tv_nsec member of the struct timespec timeout are zero) causes epoll_wait(2) to return immediately, even if no events are available. Are you really really sure about this for the -1 stuff? your code copies in the timespec and just passes it to timespec_to_ktime: + if (copy_from_user(p, params, sizeof(p))) + return -EFAULT; ... + kt = timespec_to_ktime(p.timeout); Compare that to something like the futex syscall which does this: if (copy_from_user(ts, utime, sizeof(ts)) != 0) return -EFAULT; if (!timespec_valid(ts)) return -EINVAL; t = timespec_to_ktime(ts); If the timespec is not valid it returns -EINVAL back to user space. With your settings of tv_sec and/or tv_usec to -1 are you relying on a side effect of the conversion that could break your code in the future if in the unlikely event someone changes timespec_to_ktime() and should it be: + if (copy_from_user(p, params, sizeof(p))) + return -EFAULT; + if ((p.timeout.tv_sec == -1) || (p.timeout.tv_nsec == -1)) { + /* this is off the top of my head no idea if it will compile */ + p.timeout.tv_sec = KTIME_SEC_MAX; + p.timeout.tv_nsec = 0; + } + if (!timespec_valid(p.timeout)) + return -EINVAL; ... + kt = timespec_to_ktime(p.timeout); OK. timespec_valid() is clear about this: negative tv_sec is invalid, so I don't think accepting -1 from user is the right thing to do. We cannot do pointer check as ppoll already because the structure is embedded in epoll_wait_params. Maybe it's best to use a flags bit (#define EPOLL_PWAIT1_BLOCK 1). What do you think? Fam snip -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1
I found the manual pages really confusing so I had a go at rewriting them - there were places in the manual page that didn't match the functionality provided by your code as well as I could tell). My apologies for a few formatting issues though. I still don't like parts of epoll_pwait1 but it's less confusing than it was. You are free to take some or all or none of the changes. I did have a question I marked with below about what you describe and what your code does. 1) epoll_ctl_batch -- NAME epoll_ctl_batch - batch control interface for an epoll descriptor SYNOPSIS #include int epoll_ctl_batch(int epfd, int flags, int ncmds, struct epoll_ctl_cmd *cmds); DESCRIPTION This system call is an extension of epoll_ctl(). The primary difference is that this system call allows you to batch multiple operations with the one system call. This provides a more efficient interface for updating events on this epoll file descriptor epfd. The flags argument is reserved and must be 0. The argument ncmds is the number of cmds entries being passed in. This number must be greater than 0. Each operation is specified as an element in the cmds array, defined as: struct epoll_ctl_cmd { /* Reserved flags for future extension, must be 0. */ int flags; /* The same as epoll_ctl() op parameter. */ int op; /* The same as epoll_ctl() fd parameter. */ int fd; /* The same as the "events" field in struct epoll_event. */ uint32_t events; /* The same as the "data" field in struct epoll_event. */ uint64_t data; /* Output field, will be set to the return code after this * command is executed by kernel */ int result; }; This system call is not atomic when updating the epoll descriptor. All entries in cmds are executed in the order provided. If any cmds entry fails to be processed no further entries are processed and the number of successfully processed entries is returned. Each single operation defined by a struct epoll_ctl_cmd has the same semantics as an epoll_ctl(2) call. See the epoll_ctl() manual page for more information about how to correctly setup the members of a struct epoll_ctl_cmd. Upon completion of the call the result member of each struct epoll_ctl_cmd may be set to 0 (sucessfully completed) or an error code depending on the result of the command. If the kernel fails to change the result (for example the location of the cmds argument is fully or partly read only) the result member of each struct epoll_ctl_cmd may be unchanged. RETURN VALUE epoll_ctl_batch() returns a number greater than 0 to indicate the number of cmnd entries processed. If all entries have been processed this will equal the ncmds parameter passed in. If one or more parameters are incorrect the value returned is -1 with errno set appropriately - no cmds entries have been processed when this happens. If processing any entry in the cmds argument results in an error the number returned is the number of the failing entry - this number will be less than ncmds. Since ncmds must be greater than 0 a return value of 0 indicates an error associated with the very first cmds entry. A return value of 0 does not indicate a successful system call. To correctly test the return value from epoll_ctl_batch() use code similar to the following: ret=epoll_ctl_batch(epfd, flags, ncmds, ); if (ret < ncmds) { if (ret == -1) { /* An argument was invalid */ } else { /* ret contains the number of successful entries * processed. If you (mis?)use it as a C index it * will index directly to the failing entry to * get the result use cmds[ret].result which may * contain the errno value associated with the * entry. */ } } else { /* Success */ } ERRORS EINVAL flags is non-zero, or ncmds is less than or equal to zero, or cmds is NULL. ENOMEM There was insufficient memory to handle the requested op control operation. EFAULT The memory area pointed to by cmds is not accessible with read permissions. In the event that the
Re: [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1
On Sun, 02/15 15:00, Jonathan Corbet wrote: > On Fri, 13 Feb 2015 17:03:56 +0800 > Fam Zheng wrote: > > > SYNOPSIS > > > >#include > > > >int epoll_pwait1(int epfd, int flags, > > struct epoll_event *events, > > int maxevents, > > struct epoll_wait_params *params); > > Quick, possibly dumb question: might it make sense to also pass in > sizeof(struct epoll_wait_params)? That way, when somebody wants to add > another parameter in the future, the kernel can tell which version is in > use and they won't have to do an epoll_pwait2()? > Flags can be used for that, if the change is not radically different. Fam -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1
On Fri, 13 Feb 2015 17:03:56 +0800 Fam Zheng wrote: > SYNOPSIS > >#include > >int epoll_pwait1(int epfd, int flags, > struct epoll_event *events, > int maxevents, > struct epoll_wait_params *params); Quick, possibly dumb question: might it make sense to also pass in sizeof(struct epoll_wait_params)? That way, when somebody wants to add another parameter in the future, the kernel can tell which version is in use and they won't have to do an epoll_pwait2()? Thanks, jon -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1
On 02/13/2015 10:53 AM, Omar Sandoval wrote: > On Fri, Feb 13, 2015 at 05:03:56PM +0800, Fam Zheng wrote: >> Hi all, >> >> This is the updated series for the new epoll system calls, with the cover >> letter rewritten which includes some more explanation. Comments are very >> welcome! >> >> Original Motivation >> === >> >> QEMU, and probably many more select/poll based applications, will consider >> epoll as an alternative, when its event loop needs to handle a big number of >> fds. However, there are currently two concerns with epoll which prevents the >> switching from ppoll to epoll. >> >> The major one is the timeout precision. >> >> For example in QEMU, the main loop takes care of calling callbacks at a >> specific timeout - the QEMU timer API. The timeout value in ppoll depends on >> the next firing timer. epoll_pwait's millisecond timeout is so coarse that >> rounding up the timeout will hurt performance badly. >> >> The minor one is the number of system call to update fd set. >> >> While epoll can handle a large number of fds quickly, it still requires one >> epoll_ctl per fd update, compared to the one-shot call to select/poll with an >> fd array. This may as well make epoll inferior to ppoll in the cases where a >> small, but frequently changing set of fds are polled by the event loop. >> >> This series introduces two new epoll APIs to address them respectively. The >> idea of epoll_ctl_batch is suggested by Andy Lutomirski in [1], who also >> suggested clockid as a parameter in epoll_pwait1. >> >> Discussion >> == >> >> [Note: This is the question part regarding the interface contract of >> epoll_ctl_batch. If you don't have the context of what is epoll_ctl_batch >> yet, >> please skip this part and probably start with the man page style >> documentation. >> You can resume to this section later.] >> >> [Thanks to Omar Sandoval , who pointed out this in >> reviewing v1] >> >> We try to report status for each command in epoll_ctl_batch, by writting to >> user provided command array (pointed to cmds). The tricky thing in the >> implementation is that, copying the results back to userspace comes last, >> after >> the commands are executed. At this point, if the copy_to_user fails, the >> effects are done and no return - or if we add some mechanism to revert it, >> the >> code will be too complicated and slow. >> >> In above corner case, the return value of epoll_ctl_batch is smaller than >> ncmds, which assures our caller that last N commands failed, where N = ncmds >> - >> ret. But they'll also find that cmd.result is not changed to error code. >> >> I suppose this is not a big deal, because 1) it should happen very rarely. 2) >> user does know the actual number of commands that succeed. >> >> So, do we leave it as is? Or is there any way to improve? >> >> One tiny improvement (not a complete fix) in my mind is a testing >> copy_to_user >> before we execute the commands. If it succeeds, it's even less likely the >> last >> copy_to_user could fail, so that we can even probably assert it won't. The >> testing 'copy_to_user(cmds, , ...)' will not hurt functionality if do >> it >> right after 'copy_from_user(, cmds, ...)'. But I'm not sure about the >> performance impact, especially when @ncmds is big. >> > I don't think this is the right thing to do, since, for example, another > thread could unmap the memory region holding buffer between the "check" > copy_to_user and the actual one. > > The two alternatives that I see are: > > 1. If copy_to_user fails, ignore it and return the number of commands > that succeeded (i.e., what the code in your patch does). > 2. If copy_to_user fails, return -EFAULT, regardless of how many > commands succeeded. > > The problem with 1 is that it could potentially mask bugs in a user > program. You could imagine a buggy program that passes a read-only > buffer to epoll_ctl_batch and never finds out about it because they > don't get the error. (Then, when there's a real error in one of the > epoll_ctl_cmds, but .result is 0, someone will be very confused.) > > So I think 2 is the better option. Sure, the user will have no idea how > many commands were executed, but when would EFAULT on this system call > be part of normal operation, anyways? You'll find the memory bug, fix > it, and rest easy knowing that nothing is silently failing behind your > back. What Omar says makes sense to me too. Best to have the user get a clear error indication for this case. Cheers, Michael -- Michael Kerrisk Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/ Linux/UNIX System Programming Training: http://man7.org/training/ -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1
On 02/13/2015 10:53 AM, Omar Sandoval wrote: On Fri, Feb 13, 2015 at 05:03:56PM +0800, Fam Zheng wrote: Hi all, This is the updated series for the new epoll system calls, with the cover letter rewritten which includes some more explanation. Comments are very welcome! Original Motivation === QEMU, and probably many more select/poll based applications, will consider epoll as an alternative, when its event loop needs to handle a big number of fds. However, there are currently two concerns with epoll which prevents the switching from ppoll to epoll. The major one is the timeout precision. For example in QEMU, the main loop takes care of calling callbacks at a specific timeout - the QEMU timer API. The timeout value in ppoll depends on the next firing timer. epoll_pwait's millisecond timeout is so coarse that rounding up the timeout will hurt performance badly. The minor one is the number of system call to update fd set. While epoll can handle a large number of fds quickly, it still requires one epoll_ctl per fd update, compared to the one-shot call to select/poll with an fd array. This may as well make epoll inferior to ppoll in the cases where a small, but frequently changing set of fds are polled by the event loop. This series introduces two new epoll APIs to address them respectively. The idea of epoll_ctl_batch is suggested by Andy Lutomirski in [1], who also suggested clockid as a parameter in epoll_pwait1. Discussion == [Note: This is the question part regarding the interface contract of epoll_ctl_batch. If you don't have the context of what is epoll_ctl_batch yet, please skip this part and probably start with the man page style documentation. You can resume to this section later.] [Thanks to Omar Sandoval osan...@osandov.com, who pointed out this in reviewing v1] We try to report status for each command in epoll_ctl_batch, by writting to user provided command array (pointed to cmds). The tricky thing in the implementation is that, copying the results back to userspace comes last, after the commands are executed. At this point, if the copy_to_user fails, the effects are done and no return - or if we add some mechanism to revert it, the code will be too complicated and slow. In above corner case, the return value of epoll_ctl_batch is smaller than ncmds, which assures our caller that last N commands failed, where N = ncmds - ret. But they'll also find that cmd.result is not changed to error code. I suppose this is not a big deal, because 1) it should happen very rarely. 2) user does know the actual number of commands that succeed. So, do we leave it as is? Or is there any way to improve? One tiny improvement (not a complete fix) in my mind is a testing copy_to_user before we execute the commands. If it succeeds, it's even less likely the last copy_to_user could fail, so that we can even probably assert it won't. The testing 'copy_to_user(cmds, kcmds, ...)' will not hurt functionality if do it right after 'copy_from_user(kcmds, cmds, ...)'. But I'm not sure about the performance impact, especially when @ncmds is big. I don't think this is the right thing to do, since, for example, another thread could unmap the memory region holding buffer between the check copy_to_user and the actual one. The two alternatives that I see are: 1. If copy_to_user fails, ignore it and return the number of commands that succeeded (i.e., what the code in your patch does). 2. If copy_to_user fails, return -EFAULT, regardless of how many commands succeeded. The problem with 1 is that it could potentially mask bugs in a user program. You could imagine a buggy program that passes a read-only buffer to epoll_ctl_batch and never finds out about it because they don't get the error. (Then, when there's a real error in one of the epoll_ctl_cmds, but .result is 0, someone will be very confused.) So I think 2 is the better option. Sure, the user will have no idea how many commands were executed, but when would EFAULT on this system call be part of normal operation, anyways? You'll find the memory bug, fix it, and rest easy knowing that nothing is silently failing behind your back. What Omar says makes sense to me too. Best to have the user get a clear error indication for this case. Cheers, Michael -- Michael Kerrisk Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/ Linux/UNIX System Programming Training: http://man7.org/training/ -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1
On Sun, 02/15 15:00, Jonathan Corbet wrote: On Fri, 13 Feb 2015 17:03:56 +0800 Fam Zheng f...@redhat.com wrote: SYNOPSIS #include sys/epoll.h int epoll_pwait1(int epfd, int flags, struct epoll_event *events, int maxevents, struct epoll_wait_params *params); Quick, possibly dumb question: might it make sense to also pass in sizeof(struct epoll_wait_params)? That way, when somebody wants to add another parameter in the future, the kernel can tell which version is in use and they won't have to do an epoll_pwait2()? Flags can be used for that, if the change is not radically different. Fam -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1
On Fri, 13 Feb 2015 17:03:56 +0800 Fam Zheng f...@redhat.com wrote: SYNOPSIS #include sys/epoll.h int epoll_pwait1(int epfd, int flags, struct epoll_event *events, int maxevents, struct epoll_wait_params *params); Quick, possibly dumb question: might it make sense to also pass in sizeof(struct epoll_wait_params)? That way, when somebody wants to add another parameter in the future, the kernel can tell which version is in use and they won't have to do an epoll_pwait2()? Thanks, jon -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1
I found the manual pages really confusing so I had a go at rewriting them - there were places in the manual page that didn't match the functionality provided by your code as well as I could tell). My apologies for a few formatting issues though. I still don't like parts of epoll_pwait1 but it's less confusing than it was. You are free to take some or all or none of the changes. I did have a question I marked with below about what you describe and what your code does. 1) epoll_ctl_batch -- NAME epoll_ctl_batch - batch control interface for an epoll descriptor SYNOPSIS #include sys/epoll.h int epoll_ctl_batch(int epfd, int flags, int ncmds, struct epoll_ctl_cmd *cmds); DESCRIPTION This system call is an extension of epoll_ctl(). The primary difference is that this system call allows you to batch multiple operations with the one system call. This provides a more efficient interface for updating events on this epoll file descriptor epfd. The flags argument is reserved and must be 0. The argument ncmds is the number of cmds entries being passed in. This number must be greater than 0. Each operation is specified as an element in the cmds array, defined as: struct epoll_ctl_cmd { /* Reserved flags for future extension, must be 0. */ int flags; /* The same as epoll_ctl() op parameter. */ int op; /* The same as epoll_ctl() fd parameter. */ int fd; /* The same as the events field in struct epoll_event. */ uint32_t events; /* The same as the data field in struct epoll_event. */ uint64_t data; /* Output field, will be set to the return code after this * command is executed by kernel */ int result; }; This system call is not atomic when updating the epoll descriptor. All entries in cmds are executed in the order provided. If any cmds entry fails to be processed no further entries are processed and the number of successfully processed entries is returned. Each single operation defined by a struct epoll_ctl_cmd has the same semantics as an epoll_ctl(2) call. See the epoll_ctl() manual page for more information about how to correctly setup the members of a struct epoll_ctl_cmd. Upon completion of the call the result member of each struct epoll_ctl_cmd may be set to 0 (sucessfully completed) or an error code depending on the result of the command. If the kernel fails to change the result (for example the location of the cmds argument is fully or partly read only) the result member of each struct epoll_ctl_cmd may be unchanged. RETURN VALUE epoll_ctl_batch() returns a number greater than 0 to indicate the number of cmnd entries processed. If all entries have been processed this will equal the ncmds parameter passed in. If one or more parameters are incorrect the value returned is -1 with errno set appropriately - no cmds entries have been processed when this happens. If processing any entry in the cmds argument results in an error the number returned is the number of the failing entry - this number will be less than ncmds. Since ncmds must be greater than 0 a return value of 0 indicates an error associated with the very first cmds entry. A return value of 0 does not indicate a successful system call. To correctly test the return value from epoll_ctl_batch() use code similar to the following: ret=epoll_ctl_batch(epfd, flags, ncmds, cmds); if (ret ncmds) { if (ret == -1) { /* An argument was invalid */ } else { /* ret contains the number of successful entries * processed. If you (mis?)use it as a C index it * will index directly to the failing entry to * get the result use cmds[ret].result which may * contain the errno value associated with the * entry. */ } } else { /* Success */ } ERRORS EINVAL flags is non-zero, or ncmds is less than or equal to zero, or cmds is NULL. ENOMEM There was insufficient memory to handle the requested op control operation. EFAULT The memory area pointed to by cmds is not accessible with read permissions. In the event
Re: [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1
On Fri, 02/13 01:53, Omar Sandoval wrote: > > Discussion > > == > > > > [Note: This is the question part regarding the interface contract of > > epoll_ctl_batch. If you don't have the context of what is epoll_ctl_batch > > yet, > > please skip this part and probably start with the man page style > > documentation. > > You can resume to this section later.] > > > > [Thanks to Omar Sandoval , who pointed out this in > > reviewing v1] > > > > We try to report status for each command in epoll_ctl_batch, by writting to > > user provided command array (pointed to cmds). The tricky thing in the > > implementation is that, copying the results back to userspace comes last, > > after > > the commands are executed. At this point, if the copy_to_user fails, the > > effects are done and no return - or if we add some mechanism to revert it, > > the > > code will be too complicated and slow. > > > > In above corner case, the return value of epoll_ctl_batch is smaller than > > ncmds, which assures our caller that last N commands failed, where N = > > ncmds - > > ret. But they'll also find that cmd.result is not changed to error code. > > > > I suppose this is not a big deal, because 1) it should happen very rarely. > > 2) > > user does know the actual number of commands that succeed. > > > > So, do we leave it as is? Or is there any way to improve? > > > > One tiny improvement (not a complete fix) in my mind is a testing > > copy_to_user > > before we execute the commands. If it succeeds, it's even less likely the > > last > > copy_to_user could fail, so that we can even probably assert it won't. The > > testing 'copy_to_user(cmds, , ...)' will not hurt functionality if do > > it > > right after 'copy_from_user(, cmds, ...)'. But I'm not sure about the > > performance impact, especially when @ncmds is big. > > > I don't think this is the right thing to do, since, for example, another > thread could unmap the memory region holding buffer between the "check" > copy_to_user and the actual one. > > The two alternatives that I see are: > > 1. If copy_to_user fails, ignore it and return the number of commands > that succeeded (i.e., what the code in your patch does). > 2. If copy_to_user fails, return -EFAULT, regardless of how many > commands succeeded. > > The problem with 1 is that it could potentially mask bugs in a user > program. You could imagine a buggy program that passes a read-only > buffer to epoll_ctl_batch and never finds out about it because they > don't get the error. (Then, when there's a real error in one of the > epoll_ctl_cmds, but .result is 0, someone will be very confused.) > > So I think 2 is the better option. Sure, the user will have no idea how > many commands were executed, but when would EFAULT on this system call > be part of normal operation, anyways? You'll find the memory bug, fix > it, and rest easy knowing that nothing is silently failing behind your > back. > OK, I agree with you here. I'm going to change this to -EFAULT in the next revision. Thanks! Fam -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1
On Fri, 02/13 01:53, Omar Sandoval wrote: snip Discussion == [Note: This is the question part regarding the interface contract of epoll_ctl_batch. If you don't have the context of what is epoll_ctl_batch yet, please skip this part and probably start with the man page style documentation. You can resume to this section later.] [Thanks to Omar Sandoval osan...@osandov.com, who pointed out this in reviewing v1] We try to report status for each command in epoll_ctl_batch, by writting to user provided command array (pointed to cmds). The tricky thing in the implementation is that, copying the results back to userspace comes last, after the commands are executed. At this point, if the copy_to_user fails, the effects are done and no return - or if we add some mechanism to revert it, the code will be too complicated and slow. In above corner case, the return value of epoll_ctl_batch is smaller than ncmds, which assures our caller that last N commands failed, where N = ncmds - ret. But they'll also find that cmd.result is not changed to error code. I suppose this is not a big deal, because 1) it should happen very rarely. 2) user does know the actual number of commands that succeed. So, do we leave it as is? Or is there any way to improve? One tiny improvement (not a complete fix) in my mind is a testing copy_to_user before we execute the commands. If it succeeds, it's even less likely the last copy_to_user could fail, so that we can even probably assert it won't. The testing 'copy_to_user(cmds, kcmds, ...)' will not hurt functionality if do it right after 'copy_from_user(kcmds, cmds, ...)'. But I'm not sure about the performance impact, especially when @ncmds is big. I don't think this is the right thing to do, since, for example, another thread could unmap the memory region holding buffer between the check copy_to_user and the actual one. The two alternatives that I see are: 1. If copy_to_user fails, ignore it and return the number of commands that succeeded (i.e., what the code in your patch does). 2. If copy_to_user fails, return -EFAULT, regardless of how many commands succeeded. The problem with 1 is that it could potentially mask bugs in a user program. You could imagine a buggy program that passes a read-only buffer to epoll_ctl_batch and never finds out about it because they don't get the error. (Then, when there's a real error in one of the epoll_ctl_cmds, but .result is 0, someone will be very confused.) So I think 2 is the better option. Sure, the user will have no idea how many commands were executed, but when would EFAULT on this system call be part of normal operation, anyways? You'll find the memory bug, fix it, and rest easy knowing that nothing is silently failing behind your back. OK, I agree with you here. I'm going to change this to -EFAULT in the next revision. Thanks! Fam -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1
On Fri, Feb 13, 2015 at 05:03:56PM +0800, Fam Zheng wrote: > Hi all, > > This is the updated series for the new epoll system calls, with the cover > letter rewritten which includes some more explanation. Comments are very > welcome! > > Original Motivation > === > > QEMU, and probably many more select/poll based applications, will consider > epoll as an alternative, when its event loop needs to handle a big number of > fds. However, there are currently two concerns with epoll which prevents the > switching from ppoll to epoll. > > The major one is the timeout precision. > > For example in QEMU, the main loop takes care of calling callbacks at a > specific timeout - the QEMU timer API. The timeout value in ppoll depends on > the next firing timer. epoll_pwait's millisecond timeout is so coarse that > rounding up the timeout will hurt performance badly. > > The minor one is the number of system call to update fd set. > > While epoll can handle a large number of fds quickly, it still requires one > epoll_ctl per fd update, compared to the one-shot call to select/poll with an > fd array. This may as well make epoll inferior to ppoll in the cases where a > small, but frequently changing set of fds are polled by the event loop. > > This series introduces two new epoll APIs to address them respectively. The > idea of epoll_ctl_batch is suggested by Andy Lutomirski in [1], who also > suggested clockid as a parameter in epoll_pwait1. > > Discussion > == > > [Note: This is the question part regarding the interface contract of > epoll_ctl_batch. If you don't have the context of what is epoll_ctl_batch > yet, > please skip this part and probably start with the man page style > documentation. > You can resume to this section later.] > > [Thanks to Omar Sandoval , who pointed out this in > reviewing v1] > > We try to report status for each command in epoll_ctl_batch, by writting to > user provided command array (pointed to cmds). The tricky thing in the > implementation is that, copying the results back to userspace comes last, > after > the commands are executed. At this point, if the copy_to_user fails, the > effects are done and no return - or if we add some mechanism to revert it, the > code will be too complicated and slow. > > In above corner case, the return value of epoll_ctl_batch is smaller than > ncmds, which assures our caller that last N commands failed, where N = ncmds - > ret. But they'll also find that cmd.result is not changed to error code. > > I suppose this is not a big deal, because 1) it should happen very rarely. 2) > user does know the actual number of commands that succeed. > > So, do we leave it as is? Or is there any way to improve? > > One tiny improvement (not a complete fix) in my mind is a testing copy_to_user > before we execute the commands. If it succeeds, it's even less likely the last > copy_to_user could fail, so that we can even probably assert it won't. The > testing 'copy_to_user(cmds, , ...)' will not hurt functionality if do it > right after 'copy_from_user(, cmds, ...)'. But I'm not sure about the > performance impact, especially when @ncmds is big. > I don't think this is the right thing to do, since, for example, another thread could unmap the memory region holding buffer between the "check" copy_to_user and the actual one. The two alternatives that I see are: 1. If copy_to_user fails, ignore it and return the number of commands that succeeded (i.e., what the code in your patch does). 2. If copy_to_user fails, return -EFAULT, regardless of how many commands succeeded. The problem with 1 is that it could potentially mask bugs in a user program. You could imagine a buggy program that passes a read-only buffer to epoll_ctl_batch and never finds out about it because they don't get the error. (Then, when there's a real error in one of the epoll_ctl_cmds, but .result is 0, someone will be very confused.) So I think 2 is the better option. Sure, the user will have no idea how many commands were executed, but when would EFAULT on this system call be part of normal operation, anyways? You'll find the memory bug, fix it, and rest easy knowing that nothing is silently failing behind your back. > Links > = > > [1]: http://lists.openwall.net/linux-kernel/2015/01/08/542 > > Changelog > = > > Changes from v2 (https://lkml.org/lkml/2015/2/4/105) > > > - Rename epoll_ctl_cmd.error_hint to "result". [Michael] > > - Add background introduction in cover letter. [Michael] > > - Expand the last struct of epoll_pwait1, add clockid and timespec. > > - Update man page in cover letter accordingly: > > * "error_hint" -> "result". > * The result field's caveat in "RETURN VALUE" secion of epoll_ctl_batch. > > Please review! > > Changes from v1 (https://lkml.org/lkml/2015/1/20/189) > - > > - As discussed
Re: [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1
On Fri, Feb 13, 2015 at 05:03:56PM +0800, Fam Zheng wrote: Hi all, This is the updated series for the new epoll system calls, with the cover letter rewritten which includes some more explanation. Comments are very welcome! Original Motivation === QEMU, and probably many more select/poll based applications, will consider epoll as an alternative, when its event loop needs to handle a big number of fds. However, there are currently two concerns with epoll which prevents the switching from ppoll to epoll. The major one is the timeout precision. For example in QEMU, the main loop takes care of calling callbacks at a specific timeout - the QEMU timer API. The timeout value in ppoll depends on the next firing timer. epoll_pwait's millisecond timeout is so coarse that rounding up the timeout will hurt performance badly. The minor one is the number of system call to update fd set. While epoll can handle a large number of fds quickly, it still requires one epoll_ctl per fd update, compared to the one-shot call to select/poll with an fd array. This may as well make epoll inferior to ppoll in the cases where a small, but frequently changing set of fds are polled by the event loop. This series introduces two new epoll APIs to address them respectively. The idea of epoll_ctl_batch is suggested by Andy Lutomirski in [1], who also suggested clockid as a parameter in epoll_pwait1. Discussion == [Note: This is the question part regarding the interface contract of epoll_ctl_batch. If you don't have the context of what is epoll_ctl_batch yet, please skip this part and probably start with the man page style documentation. You can resume to this section later.] [Thanks to Omar Sandoval osan...@osandov.com, who pointed out this in reviewing v1] We try to report status for each command in epoll_ctl_batch, by writting to user provided command array (pointed to cmds). The tricky thing in the implementation is that, copying the results back to userspace comes last, after the commands are executed. At this point, if the copy_to_user fails, the effects are done and no return - or if we add some mechanism to revert it, the code will be too complicated and slow. In above corner case, the return value of epoll_ctl_batch is smaller than ncmds, which assures our caller that last N commands failed, where N = ncmds - ret. But they'll also find that cmd.result is not changed to error code. I suppose this is not a big deal, because 1) it should happen very rarely. 2) user does know the actual number of commands that succeed. So, do we leave it as is? Or is there any way to improve? One tiny improvement (not a complete fix) in my mind is a testing copy_to_user before we execute the commands. If it succeeds, it's even less likely the last copy_to_user could fail, so that we can even probably assert it won't. The testing 'copy_to_user(cmds, kcmds, ...)' will not hurt functionality if do it right after 'copy_from_user(kcmds, cmds, ...)'. But I'm not sure about the performance impact, especially when @ncmds is big. I don't think this is the right thing to do, since, for example, another thread could unmap the memory region holding buffer between the check copy_to_user and the actual one. The two alternatives that I see are: 1. If copy_to_user fails, ignore it and return the number of commands that succeeded (i.e., what the code in your patch does). 2. If copy_to_user fails, return -EFAULT, regardless of how many commands succeeded. The problem with 1 is that it could potentially mask bugs in a user program. You could imagine a buggy program that passes a read-only buffer to epoll_ctl_batch and never finds out about it because they don't get the error. (Then, when there's a real error in one of the epoll_ctl_cmds, but .result is 0, someone will be very confused.) So I think 2 is the better option. Sure, the user will have no idea how many commands were executed, but when would EFAULT on this system call be part of normal operation, anyways? You'll find the memory bug, fix it, and rest easy knowing that nothing is silently failing behind your back. Links = [1]: http://lists.openwall.net/linux-kernel/2015/01/08/542 Changelog = Changes from v2 (https://lkml.org/lkml/2015/2/4/105) - Rename epoll_ctl_cmd.error_hint to result. [Michael] - Add background introduction in cover letter. [Michael] - Expand the last struct of epoll_pwait1, add clockid and timespec. - Update man page in cover letter accordingly: * error_hint - result. * The result field's caveat in RETURN VALUE secion of epoll_ctl_batch. Please review! Changes from v1 (https://lkml.org/lkml/2015/1/20/189) - - As discussed in previous thread [1], split the call to epoll_ctl_batch and epoll_pwait.