Re: Linux 4.9.256
On 2/8/21 8:57 PM, Sasha Levin wrote:
> On Mon, Feb 08, 2021 at 05:50:21PM +0200, Avi Kivity wrote:
>> On 05/02/2021 16.26, Greg Kroah-Hartman wrote:
>>> I'm announcing the release of the 4.9.256 kernel.
>>>
>>> This, and the 4.4.256 release are a little bit "different" than normal.
>>> This contains only 1 patch, just the version bump from .255 to .256
>>> which ends up causing the userspace-visible LINUX_VERSION_CODE to
>>> behave a bit differently than normal due to the "overflow". With this
>>> release, KERNEL_VERSION(4, 9, 256) is the same as
>>> KERNEL_VERSION(4, 10, 0).
>>
>> I think this is a bad idea. Many kernel features can only be discovered
>> by checking the kernel version. If a feature was introduced in 4.10,
>> then an application can be tricked into thinking a 4.9 kernel has it.
>>
>> IMO, better to stop LINUX_VERSION_CODE at 255 and introduce a
>
> In the upstream (and new -stable fix) we did this part.
>
>> LINUX_VERSION_CODE_IMPROVED that has more bits for patchlevel.
>
> Do you have a usecase where it's actually needed? i.e. userspace that
> checks for -stable patchlevels?

Not stable patchlevels, but minors. So a change from 4.9 to 4.10 could be
harmful. I have two such examples (not on the 4.9->4.10 boundary), but
they test the runtime version from uname(), not LINUX_VERSION_CODE, so
they would be vulnerable to such a change.
Re: Linux 4.9.256
On 05/02/2021 16.26, Greg Kroah-Hartman wrote:
> I'm announcing the release of the 4.9.256 kernel.
>
> This, and the 4.4.256 release are a little bit "different" than normal.
> This contains only 1 patch, just the version bump from .255 to .256
> which ends up causing the userspace-visible LINUX_VERSION_CODE to
> behave a bit differently than normal due to the "overflow". With this
> release, KERNEL_VERSION(4, 9, 256) is the same as
> KERNEL_VERSION(4, 10, 0).

I think this is a bad idea. Many kernel features can only be discovered
by checking the kernel version. If a feature was introduced in 4.10, then
an application can be tricked into thinking a 4.9 kernel has it.

IMO, better to stop LINUX_VERSION_CODE at 255 and introduce a
LINUX_VERSION_CODE_IMPROVED that has more bits for patchlevel.

> Nothing in the kernel build itself breaks with this change, but given
> that this is a userspace visible change, and some crazy tools (like
> glibc and gcc) have logic that checks the kernel version for different
> reasons, I wanted to do this release as an "empty" release to ensure
> that everything still works properly.

Even if glibc and gcc work, other programs may not. I have two such
cases. They don't depend on 4.9, but they're examples of features that
are not discoverable by other means.

> So, this is a YOU MUST UPGRADE requirement of a release. If you rely on
> the 4.9.y kernel, please throw this release into your test builds and
> rebuild the world and let us know if anything breaks, or if all is well.
>
> Go forth and do full system rebuilds! Yocto and Gentoo are great for
> this, as will systems that use buildroot.
>
> I'll try to hold off on doing a "real" 4.9.y release for a week to give
> everyone a chance to test this out and get back to me. The pending
> patches in the 4.9.y queue are pretty serious, so I am loath to wait
> longer than that, consider yourself warned...
>
> The updated 4.9.y git tree can be found at:
>     git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git linux-4.9.y
> and can be browsed at the normal kernel.org git web browser:
>     https://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=summary
>
> thanks,
>
> greg k-h
>
>  Makefile | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> Greg Kroah-Hartman (1):
>       Linux 4.9.256
Re: Spurious EIO on AIO+DIO+RWF_NOWAIT
On 12/10/18 2:48 PM, Goldwyn Rodrigues wrote:
> On 13:19 09/12, Avi Kivity wrote:
>> I have an application that receives spurious EIO when running with
>> RWF_NOWAIT enabled. Removing RWF_NOWAIT causes those EIOs to
>> disappear. The application uses AIO+DIO, and errors were seen on both
>> xfs and ext4.
>>
>> I suspect the following code:
>>
>> /*
>>  * Process one completed BIO.  No locks are held.
>>  */
>> static blk_status_t dio_bio_complete(struct dio *dio, struct bio *bio)
>> {
>> 	struct bio_vec *bvec;
>> 	unsigned i;
>> 	blk_status_t err = bio->bi_status;
>>
>> 	if (err) {
>> 		if (err == BLK_STS_AGAIN && (bio->bi_opf & REQ_NOWAIT))
>> 			dio->io_error = -EAGAIN;
>> 		else
>> 			dio->io_error = -EIO;
>> 	}
>>
>> Could it be that REQ_NOWAIT was dropped from bio->bi_opf? or that
>> bio->bi_status got changed along the way?
>
> I don't think REQ_NOWAIT is dropped. I am assuming bio->bi_status error
> is set differently. Is the blk queue being stopped?
>
> Is it possible to instrument the kernel in your testcase?

I traced the function, and I see bio->bi_status == BLK_STS_NOTSUPP and
bio->bi_opf == REQ_OP_WRITE|REQ_SYNC|REQ_NOMERGE|REQ_FUA|REQ_NOWAIT.

Presumably the NOTSUPP is the result of NOWAIT not being supported down
the stack, but shouldn't it be detected earlier? And not converted to
EIO?
Re: Spurious EIO on AIO+DIO+RWF_NOWAIT
On 10/12/2018 14.48, Goldwyn Rodrigues wrote:
> On 13:19 09/12, Avi Kivity wrote:
>> I have an application that receives spurious EIO when running with
>> RWF_NOWAIT enabled. Removing RWF_NOWAIT causes those EIOs to
>> disappear. The application uses AIO+DIO, and errors were seen on both
>> xfs and ext4.
>>
>> I suspect the following code:
>>
>> /*
>>  * Process one completed BIO.  No locks are held.
>>  */
>> static blk_status_t dio_bio_complete(struct dio *dio, struct bio *bio)
>> {
>> 	struct bio_vec *bvec;
>> 	unsigned i;
>> 	blk_status_t err = bio->bi_status;
>>
>> 	if (err) {
>> 		if (err == BLK_STS_AGAIN && (bio->bi_opf & REQ_NOWAIT))
>> 			dio->io_error = -EAGAIN;
>> 		else
>> 			dio->io_error = -EIO;
>> 	}
>>
>> Could it be that REQ_NOWAIT was dropped from bio->bi_opf? or that
>> bio->bi_status got changed along the way?
>
> I don't think REQ_NOWAIT is dropped. I am assuming bio->bi_status error
> is set differently. Is the blk queue being stopped?

Not that I know of.

> Is it possible to instrument the kernel in your testcase?

I'm happy to apply patches, or run systemtap (or similar) scripts.
Spurious EIO on AIO+DIO+RWF_NOWAIT
I have an application that receives spurious EIO when running with
RWF_NOWAIT enabled. Removing RWF_NOWAIT causes those EIOs to disappear.
The application uses AIO+DIO, and errors were seen on both xfs and ext4.

I suspect the following code:

/*
 * Process one completed BIO.  No locks are held.
 */
static blk_status_t dio_bio_complete(struct dio *dio, struct bio *bio)
{
	struct bio_vec *bvec;
	unsigned i;
	blk_status_t err = bio->bi_status;

	if (err) {
		if (err == BLK_STS_AGAIN && (bio->bi_opf & REQ_NOWAIT))
			dio->io_error = -EAGAIN;
		else
			dio->io_error = -EIO;
	}

Could it be that REQ_NOWAIT was dropped from bio->bi_opf? or that
bio->bi_status got changed along the way?

Reducing the test case may be a lot of work, but I can package a binary
using docker if a reproducer is needed.

Seen on 4.18.19 and 4.19.something.
[PATCH v1] Revert "eventfd: only return events requested in poll_mask()"
This reverts commit 4d572d9f46507be8cfe326aa5bc3698babcbdfa7. It is
superseded by the more general 2739b807b0885a09996659be82f813af219c7360
("aio: only return events requested in poll_mask() for IOCB_CMD_POLL").

Unfortunately, hch nacked it on the bug report rather than on the patch
itself, so it was picked up.
---
 fs/eventfd.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/eventfd.c b/fs/eventfd.c
index ceb1031f1cac..61c9514da5e9 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -154,15 +154,15 @@ static __poll_t eventfd_poll_mask(struct file *file, __poll_t eventmask)
 	 * eventfd_poll returns 0
 	 */
 	count = READ_ONCE(ctx->count);
 	if (count > 0)
-		events |= (EPOLLIN & eventmask);
+		events |= EPOLLIN;
 	if (count == ULLONG_MAX)
 		events |= EPOLLERR;
 	if (ULLONG_MAX - 1 > count)
-		events |= (EPOLLOUT & eventmask);
+		events |= EPOLLOUT;

 	return events;
 }

 static void eventfd_ctx_do_read(struct eventfd_ctx *ctx, __u64 *cnt)
--
2.14.4
[PATCH v1] eventfd: only return events requested in poll_mask()
The ->poll_mask() operation has a mask of events that the caller is
interested in, but we're returning all events regardless. Change to
return only the events the caller is interested in.

This fixes aio IOCB_CMD_POLL returning immediately when called with
POLLIN on an eventfd, since an eventfd is almost always ready for a
write.

Signed-off-by: Avi Kivity
---
 fs/eventfd.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/eventfd.c b/fs/eventfd.c
index 61c9514da5e9..ceb1031f1cac 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -154,15 +154,15 @@ static __poll_t eventfd_poll_mask(struct file *file, __poll_t eventmask)
 	 * eventfd_poll returns 0
 	 */
 	count = READ_ONCE(ctx->count);
 	if (count > 0)
-		events |= EPOLLIN;
+		events |= (EPOLLIN & eventmask);
 	if (count == ULLONG_MAX)
 		events |= EPOLLERR;
 	if (ULLONG_MAX - 1 > count)
-		events |= EPOLLOUT;
+		events |= (EPOLLOUT & eventmask);

 	return events;
 }

 static void eventfd_ctx_do_read(struct eventfd_ctx *ctx, __u64 *cnt)
--
2.14.4
[PATCH v1] aio: mark __aio_sigset::sigmask const
io_pgetevents() will not change the signal mask. Mark it const to make it
clear and to reduce the need for casts in user code.

Signed-off-by: Avi Kivity
---
 include/uapi/linux/aio_abi.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/uapi/linux/aio_abi.h b/include/uapi/linux/aio_abi.h
index ed0185945bb2..cdf115b03761 100644
--- a/include/uapi/linux/aio_abi.h
+++ b/include/uapi/linux/aio_abi.h
@@ -106,11 +106,11 @@ struct iocb {
 #undef IFBIG
 #undef IFLITTLE

 struct __aio_sigset {
-	sigset_t __user	*sigmask;
+	const sigset_t __user	*sigmask;
 	size_t		sigsetsize;
 };

 #endif /* __LINUX__AIO_ABI_H */
--
2.14.4
Re: aio poll, io_pgetevents and a new in-kernel poll API V3
On 01/18/2018 07:51 PM, Avi Kivity wrote:
> FYI, this kernel has issues. It will boot up, but I don't have
> networking, and even rebooting doesn't succeed. I'm looking into it.
>
> On 01/18/2018 05:46 PM, Jeff Moyer wrote:
>> FWIW, I'm running an older version of this patchset on my desktop with
>> no problems so far.

(A Fedora 27)
Re: aio poll, io_pgetevents and a new in-kernel poll API V3
FYI, this kernel has issues. It will boot up, but I don't have
networking, and even rebooting doesn't succeed. I'm looking into it.

On 01/18/2018 05:46 PM, Jeff Moyer wrote:
> FWIW, I'm running an older version of this patchset on my desktop with
> no problems so far.
>
> -Jeff
>
> Christoph Hellwig writes:
>> Hi all,
>>
>> this series adds support for the IOCB_CMD_POLL operation to poll for
>> the readiness of file descriptors using the aio subsystem. The API is
>> based on patches that existed in RHAS2.1 and RHEL3, which means it
>> already is supported by libaio.
>>
>> To implement the poll support efficiently new methods to poll are
>> introduced in struct file_operations: get_poll_head and poll_mask.
>> The first one returns a wait_queue_head to wait on (lifetime is bound
>> by the file), and the second does a non-blocking check for the POLL*
>> events. This allows aio poll to work without any additional context
>> switches, unlike epoll.
>>
>> To make the interface fully useful a new io_pgetevents system call is
>> added, which atomically saves and restores the signal mask over the
>> io_pgetevents system call. It is the logical equivalent to pselect and
>> ppoll for io_pgetevents.
>>
>> The corresponding libaio changes for io_pgetevents support and
>> documentation, as well as a test case will be posted in a separate
>> series.
>>
>> The changes were sponsored by Scylladb, and improve performance of the
>> seastar framework up to 10%, while also removing the need for a
>> privileged SCHED_FIFO epoll listener thread.
>>
>> The patches are on top of Al's __poll_t annotations, so I've also
>> prepared a git branch on top of those here:
>>
>>     git://git.infradead.org/users/hch/vfs.git aio-poll.3
>>
>> Gitweb:
>>
>>     http://git.infradead.org/users/hch/vfs.git/shortlog/refs/heads/aio-poll.3
>>
>> Libaio changes: https://pagure.io/libaio.git io-poll
>>
>> Seastar changes (not updated for the new io_pgetevents ABI yet):
>> https://github.com/avikivity/seastar/commits/aio
>>
>> Changes since V2:
>>  - removed a double initialization
>>  - new vfs_get_poll_head helper
>>  - document that ->get_poll_head can return NULL
>>  - call ->poll_mask before sleeping
>>  - various ACKs
>>  - add conversion of random to ->poll_mask
>>  - add conversion of af_alg to ->poll_mask
>>  - lacking ->poll_mask support now returns -EINVAL for IOCB_CMD_POLL
>>  - reshuffled the series so that prep patches and everything not
>>    requiring the new in-kernel poll API is in the beginning
>>
>> Changes since V1:
>>  - handle the NULL ->poll case in vfs_poll
>>  - dropped the file argument to the ->poll_mask socket operation
>>  - replace the ->pre_poll socket operation with ->get_poll_head as in
>>    the file operations
Re: Proposal: CAP_PAYLOAD to reduce Meltdown and Spectre mitigation costs
On 01/07/2018 04:36 PM, Alan Cox wrote:
>> I'm interested in participating to working on such a solution, given
>> that haproxy is severely impacted by "pti=on" and that for now we'll
>> have to run with "pti=off" on the whole system until a more suitable
>> solution is found.
>
> I'm still trying to work out what cases there are for this. I can see
> the cases for pti-off. I've got minecraft servers for example where
> there isn't anyone running untrusted code on the box (*) and the only
> data of value is owned by the minecraft processes. If someone gets to
> the point pti matters then I already lost.

You don't want, say, a local out-of-process dns resolver compromised and
exploited, then PTI used to read all of the machine's memory.

> What I struggle to see is why I'd want to nominate specific processes
> for this except in very special cases

Suppose you are running a database (I hope you'll agree that's not a
special case). Then, "all of your data was stolen but the good news is
that your ssh keys are safe" is not something that brightens your day.
Your ssh keys are worth a lot less than your data.

In that case disabling PTI just for the database can be a good trade-off
between security and performance. You lose almost nothing by disabling
PTI for the database, yet you're still secure* from a remote exploit in
any supporting processes, or from a malicious local user.

* as secure as you were with full PTI

> (like your packet generator). Even then it would make me nervous as the
> packet generator if that trusted is effectively CAP_SYS_RAWIO or close
> to it and can steal any ssh keys or similar on that guest.
>
> I still prefer cgroups because once you include the branch predictions
> it suddenly becomes very interesting to be able to say 'this pile of
> stuff trusts itself' and avoid user/user protection costs while keeping
> user/kernel.

By "this pile of stuff" do you mean a group of mutually-trusting
processes?

I don't see how that can be implemented, because any cross-process
switch has to cross the kernel boundary. In any case a cross-process
switch will destroy the effectiveness of branch prediction history, so
preserving it doesn't buy you much.

> Alan
>
> (*) I am sure that java programs can do sandbox breaking via spectre
> just as js can. Bonus points to anyone however who can do spectre
> through java from redstone 8)
Re: Proposal: CAP_PAYLOAD to reduce Meltdown and Spectre mitigation costs
On 01/07/2018 02:29 PM, Theodore Ts'o wrote:
> On Sun, Jan 07, 2018 at 11:16:28AM +0200, Avi Kivity wrote:
>> I think capabilities will work just as well as cgroups. The container
>> manager will set CAP_PAYLOAD to payload containers; and if those run
>> an init system or a container manager themselves, they'll drop
>> CAP_PAYLOAD for all process/sub-containers but their payloads.
>
> The reason why cgroups are better is Spectre can be used to steal
> information from within the same privilege level --- e.g., you could
> use Javascript to steal a user's Coindesk credentials or Lastpass data,
> which is going to be *way* more lucrative than trying to mine
> cryptocurrency on the sly in a user's browser. :-) As a result, you
> probably want Spectre mitigations to be enabled in a root process ---
> which means capabilities aren't the right answer.

I don't see the connection. The browser wouldn't run with CAP_PAYLOAD
set.

In a desktop system, only init retains CAP_PAYLOAD.

On a server that runs one application (and some supporting processes),
only init and that one application have CAP_PAYLOAD (if the sysadmin
makes it so).

On a containerized server that happens to run just one application, init
will retain CAP_PAYLOAD, as well as the process in the container (if the
sysadmin makes it so).

On a containerized server that happens to run just one application,
which itself runs an init system, the two inits will retain CAP_PAYLOAD,
as well as the application process (if the sysadmin makes it so).
Re: Proposal: CAP_PAYLOAD to reduce Meltdown and Spectre mitigation costs
On 01/06/2018 10:02 PM, Alan Cox wrote:
>> I propose to create a new capability, CAP_PAYLOAD, that allows the
>> system administrator to designate an application as the main workload
>> in that system. Other processes (like sshd or monitoring daemons)
>> exist to support it, and so it makes sense to protect the rest of the
>> system from their being compromised.
>
> Much more general would be to do this with cgroups both for group-group
> trust and group-kernel trust levels.

I think capabilities will work just as well as cgroups. The container
manager will set CAP_PAYLOAD to payload containers; and if those run an
init system or a container manager themselves, they'll drop CAP_PAYLOAD
for all process/sub-containers but their payloads.
Re: Proposal: CAP_PAYLOAD to reduce Meltdown and Spectre mitigation costs
On 01/06/2018 10:24 PM, Willy Tarreau wrote:
> Hi Avi,
>
> On Sat, Jan 06, 2018 at 09:33:28PM +0200, Avi Kivity wrote:
>> Meltdown and Spectre mitigations focus on protecting the kernel from a
>> hostile userspace. However, it's not a given that the kernel is the
>> most important target in the system. It is common in server workloads
>> that a single userspace application contains the valuable data on a
>> system, and if it were hostile, the game would already be over,
>> without the need to compromise the kernel. In these workloads, a
>> single application performs most system calls, and so it pays the cost
>> of protection, without benefiting from it directly (since it is the
>> target, rather than the kernel).
>
> Definitely :-)
>
>> I propose to create a new capability, CAP_PAYLOAD, that allows the
>> system administrator to designate an application as the main workload
>> in that system. Other processes (like sshd or monitoring daemons)
>> exist to support it, and so it makes sense to protect the rest of the
>> system from their being compromised.
>
> Initially I was thinking about letting applications disable PTI using
> prctl() when running under a certain capability (I initially thought
> about CAP_SYSADMIN though I changed my mind). One advantage of
> proceeding like this is that it would have to be explicitly implemented
> in the application, which limits the risk of running by default.
>
> I later thought that we could use CAP_RAWIO for this, given that such
> processes already have access to the hardware anyway. We could even
> imagine not switching the page tables on such a capability without
> requiring prctl(), though it would mean that processes running as root
> (as is often found on a number of servers) would automatically present
> a risk for the system. But maybe CAP_RAWIO + prctl() could be a good
> solution.

CAP_RAWIO is like CAP_PAYLOAD in that both allow you to read stuff you
shouldn't have access to on a vulnerable CPU. But CAP_PAYLOAD won't give
you that access on a non-vulnerable CPU, so it's safer.

The advantage of not requiring prctl() is that it will work on
unmodified applications, requiring only sysadmin intervention (and it's
the sysadmin's role to designate an application as payload, not the
application's).

> I'm interested in participating to working on such a solution, given
> that haproxy is severely impacted by "pti=on" and that for now we'll
> have to run with "pti=off" on the whole system until a more suitable
> solution is found. I'd rather not rush anything and let things calm
> down for a while to avoid adding disturbance to the current situation.
> But I'm willing to continue this discussion and even test patches.

Then you might want to test
https://www.spinics.net/lists/kernel/msg2689101.html and its companion
patchset https://www.spinics.net/lists/kernel/msg2689134.html, which as
a side effect significantly reduce KPTI impact on C10K applications (and
as their main effect improve their performance).
Re: Proposal: CAP_PAYLOAD to reduce Meltdown and Spectre mitigation costs
On 01/06/2018 10:24 PM, Willy Tarreau wrote: Hi Avi, On Sat, Jan 06, 2018 at 09:33:28PM +0200, Avi Kivity wrote: Meltdown and Spectre mitigations focus on protecting the kernel from a hostile userspace. However, it's not a given that the kernel is the most important target in the system. It is common in server workloads that a single userspace application contains the valuable data on a system, and if it were hostile, the game would already be over, without the need to compromise the kernel. In these workloads, a single application performs most system calls, and so it pays the cost of protection, without benefiting from it directly (since it is the target, rather than the kernel). Definitely :-) I propose to create a new capability, CAP_PAYLOAD, that allows the system administrator to designate an application as the main workload in that system. Other processes (like sshd or monitoring daemons) exist to support it, and so it makes sense to protect the rest of the system from their being compromised. Initially I was thinking about letting applications disable PTI using prctl() when running under a certain capability (I initially thought about CAP_SYSADMIN though I changed my mind). One advantage of proceeding like this is that it would have to be explicitly implemented in the application, which limits the risk of running by default. I later thought that we could use CAP_RAWIO for this, given that such processes already have access to the hardware anyway. We could even imagine not switching the page tables on such a capability without requiring prctl(), though it would mean that processes running as root (as is often found on a number of servers) would automatically present a risk for the system. But maybe CAP_RAWIO + prctl() could be a good solution. CAP_RAWIO is like CAP_PAYLOAD in that both allow you to read stuff you shouldn't have access to on a vulnerable CPU. But CAP_PAYLOAD won't give you that access on a non-vulnerable CPU, so it's safer. 
The advantage of not requiring prctl() is that it will work on unmodified applications, requiring only sysadmin intervention (and it's the sysadmin's role to designate an application as payload, not the application's). I'm interested in participating in working on such a solution, given that haproxy is severely impacted by "pti=on" and that for now we'll have to run with "pti=off" on the whole system until a more suitable solution is found. I'd rather not rush anything and let things calm down for a while to avoid adding disturbance to the current situation. But I'm willing to continue this discussion and even test patches. Then you might want to test https://www.spinics.net/lists/kernel/msg2689101.html and its companion patchset https://www.spinics.net/lists/kernel/msg2689134.html, which as a side effect significantly reduce KPTI impact on C10K applications (and as their main effect improve their performance).
Proposal: CAP_PAYLOAD to reduce Meltdown and Spectre mitigation costs
Meltdown and Spectre mitigations focus on protecting the kernel from a hostile userspace. However, it's not a given that the kernel is the most important target in the system. It is common in server workloads that a single userspace application contains the valuable data on a system, and if it were hostile, the game would already be over, without the need to compromise the kernel. In these workloads, a single application performs most system calls, and so it pays the cost of protection, without benefiting from it directly (since it is the target, rather than the kernel). I propose to create a new capability, CAP_PAYLOAD, that allows the system administrator to designate an application as the main workload in that system. Other processes (like sshd or monitoring daemons) exist to support it, and so it makes sense to protect the rest of the system from their being compromised. When the kernel switches to user mode of a CAP_PAYLOAD process, it doesn't switch page tables and instead leaves the kernel mapped into the address space (still with supervisor protection, of course). This reduces context switch cost, and will also reduce interrupt costs if the interrupt happens while that process executes. Since a CAP_PAYLOAD process is likely to consume the majority of CPU time, the costs associated with Meltdown mitigation are almost completely nullified. CAP_PAYLOAD has potential to be abused; every software vendor will be absolutely certain that their application is the reason the universe (let alone that server) exists and they will turn it on, so init systems will have to work to make sure it doesn't get turned on without administrator opt-in. It's also not perfect, since if there is a payload application compromise, in addition to stealing the application's data, ssh keys can be stolen too. But I think it's better than having to choose between significantly reduced performance and security. 
You get performance for your important application, and protection against the possibility that a remote exploit against a supporting process turns into a remote exploit against that important application.
Re: Detecting RWF_NOWAIT support
On 12/18/2017 05:28 AM, Goldwyn Rodrigues wrote: On 12/16/2017 08:49 AM, Avi Kivity wrote: On 12/14/2017 09:15 PM, Goldwyn Rodrigues wrote: On 12/14/2017 11:38 AM, Avi Kivity wrote: I'm looking to add support for RWF_NOWAIT within a linux-aio iocb. Naturally, I need to detect at runtime whether the kernel supports RWF_NOWAIT or not. The only method I could find was to issue an I/O with RWF_NOWAIT set, and look for errors. This is somewhat less than perfect: - from the error, I can't tell whether RWF_NOWAIT was the problem, or something else. If I enable a number of new features, I have to run through all combinations to figure out which ones are supported and which are not. Here are the return codes for RWF_NOWAIT: EINVAL - not supported (older kernel) EOPNOTSUPP - not supported EAGAIN - supported but could not complete because I/O will be delayed Which of these are returned from io_submit() and which are returned in the iocb? These are returned in iocb. Thanks. 0 - supported and I/O completed (success). - RWF_NOWAIT support is per-filesystem, so I can't just remember not to enable RWF_NOWAIT globally, I have to track it per file. Yes, the support is per filesystem. So, the application must know if the filesystem supports it, possibly by performing a small I/O. So the application must know about filesystem mount points, and be prepared to create a file and try to write it (in case the filesystem is empty) or alter its behavior during runtime depending on the errors it sees. Well yes. Hopefully, the application knows what it is doing when it performs RWF_NOWAIT. This type of interface makes it very hard to consume new kernel facilities in a backward compatible way. The kernel should advertise what support it provides; for example this support could be advertised via statx(2). For examples of facilities that advertise their capabilities, see membarrier(2) and KVM.
Re: Detecting RWF_NOWAIT support
On 12/16/2017 08:12 PM, vcap...@pengaru.com wrote: On Sat, Dec 16, 2017 at 10:03:38AM -0800, vcap...@pengaru.com wrote: On Sat, Dec 16, 2017 at 04:49:08PM +0200, Avi Kivity wrote: On 12/14/2017 09:15 PM, Goldwyn Rodrigues wrote: On 12/14/2017 11:38 AM, Avi Kivity wrote: I'm looking to add support for RWF_NOWAIT within a linux-aio iocb. Naturally, I need to detect at runtime whether the kernel supports RWF_NOWAIT or not. The only method I could find was to issue an I/O with RWF_NOWAIT set, and look for errors. This is somewhat less than perfect: - from the error, I can't tell whether RWF_NOWAIT was the problem, or something else. If I enable a number of new features, I have to run through all combinations to figure out which ones are supported and which are not. Here are the return codes for RWF_NOWAIT: EINVAL - not supported (older kernel) EOPNOTSUPP - not supported EAGAIN - supported but could not complete because I/O will be delayed Which of these are returned from io_submit() and which are returned in the iocb? 0 - supported and I/O completed (success). - RWF_NOWAIT support is per-filesystem, so I can't just remember not to enable RWF_NOWAIT globally, I have to track it per file. Yes, the support is per filesystem. So, the application must know if the filesystem supports it, possibly by performing a small I/O. So the application must know about filesystem mount points, and be prepared to create a file and try to write it (in case the filesystem is empty) or alter its behavior during runtime depending on the errors it sees. Can't the application simply add a "nowait" flag to its open file descriptor encapsulation struct, then in the constructor perform a zero-length RWF_NOWAIT write immediately after opening the fd to set the flag? Then all writes branch according to the flag. According to write(2): If count is zero and fd refers to a regular file, then write() may return a failure status if one of the errors below is detected. 
If no errors are detected, or error detection is not performed, 0 will be returned without causing any other effect. If count is zero and fd refers to a file other than a regular file, the results are not specified. So the zero-length RWF_NOWAIT write should return zero, unless it's not supported. Oh, I assumed this new flag applied to pwritev2() flags. Disregard my comment, I see the ambiguity causing your question Avi and do not know the best approach. Actually it's not a bad idea. I'm using AIO, not p{read,write}v2, but I can assume that the response will be the same and that a zero-length read will return immediately.
Re: Detecting RWF_NOWAIT support
On 12/14/2017 09:15 PM, Goldwyn Rodrigues wrote: On 12/14/2017 11:38 AM, Avi Kivity wrote: I'm looking to add support for RWF_NOWAIT within a linux-aio iocb. Naturally, I need to detect at runtime whether the kernel supports RWF_NOWAIT or not. The only method I could find was to issue an I/O with RWF_NOWAIT set, and look for errors. This is somewhat less than perfect: - from the error, I can't tell whether RWF_NOWAIT was the problem, or something else. If I enable a number of new features, I have to run through all combinations to figure out which ones are supported and which are not. Here are the return codes for RWF_NOWAIT: EINVAL - not supported (older kernel) EOPNOTSUPP - not supported EAGAIN - supported but could not complete because I/O will be delayed Which of these are returned from io_submit() and which are returned in the iocb? 0 - supported and I/O completed (success). - RWF_NOWAIT support is per-filesystem, so I can't just remember not to enable RWF_NOWAIT globally, I have to track it per file. Yes, the support is per filesystem. So, the application must know if the filesystem supports it, possibly by performing a small I/O. So the application must know about filesystem mount points, and be prepared to create a file and try to write it (in case the filesystem is empty) or alter its behavior during runtime depending on the errors it sees.
Detecting RWF_NOWAIT support
I'm looking to add support for RWF_NOWAIT within a linux-aio iocb. Naturally, I need to detect at runtime whether the kernel supports RWF_NOWAIT or not. The only method I could find was to issue an I/O with RWF_NOWAIT set, and look for errors. This is somewhat less than perfect: - from the error, I can't tell whether RWF_NOWAIT was the problem, or something else. If I enable a number of new features, I have to run through all combinations to figure out which ones are supported and which are not. - RWF_NOWAIT support is per-filesystem, so I can't just remember not to enable RWF_NOWAIT globally, I have to track it per file. Did I miss another method of detecting RWF_NOWAIT?
Re: [RFC PATCH 0/2] x86: Fix missing core serialization on migration
On 11/14/2017 06:49 PM, Mathieu Desnoyers wrote: - On Nov 14, 2017, at 11:08 AM, Peter Zijlstra pet...@infradead.org wrote: On Tue, Nov 14, 2017 at 05:05:41PM +0100, Peter Zijlstra wrote: On Tue, Nov 14, 2017 at 03:17:12PM +, Mathieu Desnoyers wrote: I've tried to create a small single-threaded self-modifying loop in user-space to trigger a trace cache or speculative execution quirk, but I have not succeeded yet. I suspect that I would need to know more about the internals of the processor architecture to create the right stalls that would allow speculative execution to move further ahead, and trigger an incoherent execution flow. Ideas on how to trigger this would be welcome. I thought the whole problem was per definition multi-threaded. Single-threaded stuff can't get out of sync with itself; you'll always observe your own stores. And even if you could, you can always execute a local serializing instruction like CPUID to force things. What I'm trying to reproduce is something that breaks in single-threaded case if I explicitly leave out the CPUID core serializing instruction when doing code modification on upcoming code, in a loop. AFAIU, Intel requires a core serializing instruction to be issued even in single-threaded scenarios between code update and execution, to ensure that speculative execution does not observe incoherent code. Now the question we all have for Intel is: is this requirement too strong, or required by reality? In single-threaded execution, a jump is enough. "As processor microarchitectures become more complex and start to speculatively execute code ahead of the retirement point (as in P6 and more recent processor families), the rules regarding which code should execute, pre- or post-modification, become blurred. 
To write self-modifying code and ensure that it is compliant with current and future versions of the IA-32 architectures, use one of the following coding options: (* OPTION 1 *) Store modified code (as data) into code segment; Jump to new code or an intermediate location; Execute new code;"
Re: [RFC PATCH 0/2] x86: Fix missing core serialization on migration
On 11/14/2017 05:17 PM, Mathieu Desnoyers wrote: - On Nov 14, 2017, at 9:53 AM, Avi Kivity a...@scylladb.com wrote: On 11/13/2017 06:56 PM, Mathieu Desnoyers wrote: - On Nov 10, 2017, at 4:57 PM, Mathieu Desnoyers mathieu.desnoy...@efficios.com wrote: - On Nov 10, 2017, at 4:36 PM, Linus Torvalds torva...@linux-foundation.org wrote: On Fri, Nov 10, 2017 at 1:12 PM, Mathieu Desnoyers <mathieu.desnoy...@efficios.com> wrote: x86 can return to user-space through sysexit and sysretq, which are not core serializing. This breaks expectations from user-space about sequential consistency from a single-threaded self-modifying program point of view in specific migration patterns. Feedback is welcome, We should check with Intel. I would actually be surprised if the I$ can be out of sync with the D$ after a sysretq. It would actually break things like "read code from disk" too in theory. That core serializing instruction is not that much about I$ vs D$ consistency, but rather about the processor speculatively executing code ahead of its retirement point. Ref. Intel Architecture Software Developer's Manual, Volume 3: System Programming. 7.1.3. "Handling Self- and Cross-Modifying Code": "The act of a processor writing data into a currently executing code segment with the intent of executing that data as code is called self-modifying code. Intel Architecture processors exhibit model-specific behavior when executing self-modified code, depending upon how far ahead of the current execution pointer the code has been modified. As processor architectures become more complex and start to speculatively execute code ahead of the retirement point (as in the P6 family processors), the rules regarding which code should execute, pre- or post-modification, become blurred. 
[...]" AFAIU, this core serializing instruction seems to be needed for use-cases of self-modifying code, but not for the initial load of a program from disk, as the processor has no way to have speculatively executed any of its instructions. I figured out what you're pointing to: if exec() is executed by a previously running thread, and there is no core serializing instruction between program load and return to user-space, the kernel ends up acting like a JIT, indeed. I think that's safe. The kernel has to execute a MOV CR3 instruction before it can execute code loaded by exec, and that is a serializing instruction. Loading and unloading shared libraries is made safe by the IRET executed by page faults (loading) and TLB shootdown IPIs (unloading). Very good points! Perhaps those guarantees should be documented somewhere ? Directly modifying code in userspace is unsafe if there is some non-coherent instruction cache. Instruction fetch and speculative execution are non-coherent, but they're probably too short (in current processors) to matter. Trace caches are probably large enough, but I don't know whether they are coherent or not. Android guys at Google have reproducers of context synchronization issues on arm 64 in JIT scenarios. Based on the information I got, flushing the instruction caches is not enough: they also need to issue a context synchronizing instruction. Perhaps the current Intel processors may have short enough speculative execution and small enough trace caches, but relying on this without a clear statement from Intel seems fragile. A small trace cache is still vulnerable, the question is whether it is coherent or not. I've tried to create a small single-threaded self-modifying loop in user-space to trigger a trace cache or speculative execution quirk, but I have not succeeded yet. 
I suspect that I would need to know more about the internals of the processor architecture to create the right stalls that would allow speculative execution to move further ahead, and trigger an incoherent execution flow. Ideas on how to trigger this would be welcome. Intel processors resynchronize as soon as you jump (in single-threaded execution), so you need to update ahead of the current instruction pointer to see something. Not sure what quirk you're interested in seeing: executing the old code? That's not very exciting.
Re: [RFC PATCH 0/2] x86: Fix missing core serialization on migration
On 11/13/2017 06:56 PM, Mathieu Desnoyers wrote: - On Nov 10, 2017, at 4:57 PM, Mathieu Desnoyers mathieu.desnoy...@efficios.com wrote: - On Nov 10, 2017, at 4:36 PM, Linus Torvalds torva...@linux-foundation.org wrote: On Fri, Nov 10, 2017 at 1:12 PM, Mathieu Desnoyers wrote: x86 can return to user-space through sysexit and sysretq, which are not core serializing. This breaks expectations from user-space about sequential consistency from a single-threaded self-modifying program point of view in specific migration patterns. Feedback is welcome, We should check with Intel. I would actually be surprised if the I$ can be out of sync with the D$ after a sysretq. It would actually break things like "read code from disk" too in theory. That core serializing instruction is not that much about I$ vs D$ consistency, but rather about the processor speculatively executing code ahead of its retirement point. Ref. Intel Architecture Software Developer's Manual, Volume 3: System Programming. 7.1.3. "Handling Self- and Cross-Modifying Code": "The act of a processor writing data into a currently executing code segment with the intent of executing that data as code is called self-modifying code. Intel Architecture processors exhibit model-specific behavior when executing self-modified code, depending upon how far ahead of the current execution pointer the code has been modified. As processor architectures become more complex and start to speculatively execute code ahead of the retirement point (as in the P6 family processors), the rules regarding which code should execute, pre- or post-modification, become blurred. [...]" AFAIU, this core serializing instruction seems to be needed for use-cases of self-modifying code, but not for the initial load of a program from disk, as the processor has no way to have speculatively executed any of its instructions. 
I figured out what you're pointing to: if exec() is executed by a previously running thread, and there is no core serializing instruction between program load and return to user-space, the kernel ends up acting like a JIT, indeed. I think that's safe. The kernel has to execute a MOV CR3 instruction before it can execute code loaded by exec, and that is a serializing instruction. Loading and unloading shared libraries is made safe by the IRET executed by page faults (loading) and TLB shootdown IPIs (unloading). Directly modifying code in userspace is unsafe if there is some non-coherent instruction cache. Instruction fetch and speculative execution are non-coherent, but they're probably too short (in current processors) to matter. Trace caches are probably large enough, but I don't know whether they are coherent or not. Therefore, we'd also need to invoke sync_core_before_usermode() after loading the program. Let's wait to hear back from hpa, Thanks, Mathieu Hopefully hpa can tell us more about this, Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [RFC PATCH 0/2] x86: Fix missing core serialization on migration
On 11/13/2017 06:56 PM, Mathieu Desnoyers wrote: - On Nov 10, 2017, at 4:57 PM, Mathieu Desnoyers mathieu.desnoy...@efficios.com wrote: - On Nov 10, 2017, at 4:36 PM, Linus Torvalds torva...@linux-foundation.org wrote: On Fri, Nov 10, 2017 at 1:12 PM, Mathieu Desnoyers wrote: x86 can return to user-space through sysexit and sysretq, which are not core serializing. This breaks expectations from user-space about sequential consistency from a single-threaded self-modifying program point of view in specific migration patterns. Feedback is welcome, We should check with Intel. I would actually be surprised if the I$ can be out of sync with the D$ after a sysretq. It would actually break things like "read code from disk" too in theory. That core serializing instruction is not that much about I$ vs D$ consistency, but rather about the processor speculatively executing code ahead of its retirement point. Ref. Intel Architecture Software Developer's Manual, Volume 3: System Programming. 7.1.3. "Handling Self- and Cross-Modifying Code": "The act of a processor writing data into a currently executing code segment with the intent of executing that data as code is called self-modifying code. Intel Architecture processors exhibit model-specific behavior when executing self-modified code, depending upon how far ahead of the current execution pointer the code has been modified. As processor architectures become more complex and start to speculatively execute code ahead of the retirement point (as in the P6 family processors), the rules regarding which code should execute, pre- or post-modification, become blurred. [...]" AFAIU, this core serializing instruction seems to be needed for use-cases of self-modifying code, but not for the initial load of a program from disk, as the processor has no way to have speculatively executed any of its instructions. 
I figured out what you're pointing to: if exec() is executed by a previously running thread, and there is no core serializing instruction between program load and return to user-space, the kernel ends up acting like a JIT, indeed. I think that's safe. The kernel has to execute a MOV CR3 instruction before it can execute code loaded by exec, and that is a serializing instruction. Loading and unloading shared libraries is made safe by the IRET executed by page faults (loading) and TLB shootdown IPIs (unloading). Directly modifying code in userspace is unsafe if there is some non-coherent instruction cache. Instruction fetch and speculative execution are non-coherent, but they're probably too short (in current processors) to matter. Trace caches are probably large enough, but I don't know whether they are coherent or not. Therefore, we'd also need to invoke sync_core_before_usermode() after loading the program. Let's wait to hear back from hpa, Thanks, Mathieu Hopefully hpa can tell us more about this, Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [PATCH tip/core/rcu 1/3] membarrier: Provide register expedited private command
On 10/05/2017 07:23 AM, Nicholas Piggin wrote: On Wed, 4 Oct 2017 14:37:53 -0700 "Paul E. McKenney" wrote: From: Mathieu Desnoyers Provide a new command allowing processes to register their intent to use the private expedited command. This allows PowerPC to skip the full memory barrier in switch_mm(), and only issue the barrier when scheduling into a task belonging to a process that has registered to use expedited private. Processes are now required to register before using MEMBARRIER_CMD_PRIVATE_EXPEDITED, otherwise that command returns EPERM. Changes since v1: - Use test_ti_thread_flag(next, ...) instead of test_thread_flag() in powerpc membarrier_arch_sched_in(), given that we want to specifically check the next thread state. - Add missing ARCH_HAS_MEMBARRIER_HOOKS in Kconfig. - Use task_thread_info() to pass thread_info from task to *_ti_thread_flag(). Changes since v2: - Move membarrier_arch_sched_in() call to finish_task_switch(). - Check for NULL t->mm in membarrier_arch_fork(). - Use membarrier_sched_in() in generic code, which invokes the arch-specific membarrier_arch_sched_in(). This fixes allnoconfig build on PowerPC. - Move asm/membarrier.h include under CONFIG_MEMBARRIER, fixing allnoconfig build on PowerPC. - Build and runtime tested on PowerPC. Changes since v3: - Simply rely on copy_mm() to copy the membarrier_private_expedited mm field on fork. - powerpc: test thread flag instead of reading membarrier_private_expedited in membarrier_arch_fork(). - powerpc: skip memory barrier in membarrier_arch_sched_in() if coming from kernel thread, since mmdrop() implies a full barrier. - Set membarrier_private_expedited to 1 only after arch registration code, thus eliminating a race where concurrent commands could succeed when they should fail if issued concurrently with process registration. - Use READ_ONCE() for membarrier_private_expedited field access in membarrier_private_expedited. Matches WRITE_ONCE() performed in process registration. 
Changes since v4: - Move powerpc hook from sched_in() to switch_mm(), based on feedback from Nicholas Piggin. For now, the powerpc approach is okay by me. I plan to test others (e.g., taking runqueue locks) on larger systems, but that can be sent as an incremental patch at a later time. The main thing I would like is for people to review the userspace API. As a future satisfied user of the expedited private membarrier syscall, I am happy with the change.
Re: [RFC PATCH v2] membarrier: expedited private command
On 08/01/2017 01:22 PM, Peter Zijlstra wrote: If mm cpumask is used, I think it's okay. You can cause quite similar kind of iteration over CPUs and lots of IPIs, tlb flushes, etc using munmap/mprotect/etc, or context switch IPIs, etc. Are we reaching the stage where we're controlling those kinds of ops in terms of impact to the rest of the system? So x86 has a tight mm_cpumask(), we only broadcast TLB invalidate IPIs to those CPUs actually running threads of our process (or very recently). So while there can be the sporadic stray IPI for a CPU that recently ran a thread of the target process, it will not get another one until it switches back into the process. On machines that need manual TLB broadcasts and don't keep a tight mask, yes you can interfere at will, but if they care they can fix by tightening the mask. In either case, the mm_cpumask() will be bounded by the set of CPUs the threads are allowed to run on and will not interfere with the rest of the system. As to scheduler IPIs, those are limited to the CPUs the user is limited to and are rate limited by the wakeup-latency of the tasks. After all, all the time a task is runnable but not running, wakeups are no-ops. Trouble is of course, that not everybody even sets a single bit in mm_cpumask() and those that never clear bits will end up with a fairly wide mask, still interfering with work that isn't hard partitioned. I hate to propose a way to make this more complicated, but this could be fixed by a process first declaring its intent to use expedited process-wide membarrier; if it does, then every context switch updates a process-wide cpumask indicating which cpus are currently running threads of that process: if (prev->mm != next->mm) if (prev->mm->running_cpumask) cpumask_clear(...); else if (next->mm->running_cpumask) cpumask_set(...); now only processes that want expedited process-wide membarrier pay for it (in other than some predictable branches). 
You can even have threads opt-in, so unrelated threads that don't participate in the party don't cause those bits to be set.
Re: Updated sys_membarrier() speedup patch, FYI
On 07/31/2017 11:37 AM, Peter Zijlstra wrote: On Mon, Jul 31, 2017 at 09:03:09AM +0300, Avi Kivity wrote: I remembered that current->mm does not change when switching to a kernel task, but my Kernlish is very rusty, or maybe it has changed. kernel threads do indeed preserve the mm of the old userspace task, but we track that in ->active_mm. Their ->mm will be NULL. Gah, I'm so rusty, if I were any rustier I'd be doing bitcoin.
Re: Updated sys_membarrier() speedup patch, FYI
On 07/28/2017 12:02 AM, Mathieu Desnoyers wrote: - On Jul 27, 2017, at 4:58 PM, Mathieu Desnoyers mathieu.desnoy...@efficios.com wrote: - On Jul 27, 2017, at 4:37 PM, Paul E. McKenney paul...@linux.vnet.ibm.com wrote: On Thu, Jul 27, 2017 at 11:04:13PM +0300, Avi Kivity wrote: [...] + + this_cpu = raw_smp_processor_id(); + for_each_online_cpu(cpu) { + struct task_struct *p; + + if (cpu == this_cpu) + continue; + rcu_read_lock(); + p = task_rcu_dereference(_rq(cpu)->curr); + if (p && p->mm == current->mm) + __cpumask_set_cpu(cpu, tmpmask); This gets you some false positives, if the CPU idled then mm will not have changed. Good point! The battery-powered embedded guys would probably prefer we not needlessly IPI idle CPUs. We cannot rely on RCU's dyntick-idle state in nohz_full cases. Not sure if is_idle_task() can be used safely, given things like play_idle(). Would changing the check in this loop to: if (p && !is_idle_task(p) && p->mm == current->mm) { work for you ? Avi, is there an optimization that allows current->mm to be non-null when the idle task is scheduled that I am missing ? I would have expected current->mm to be always NULL for idle tasks. I remembered that current->mm does not change when switching to a kernel task, but my Kernlish is very rusty, or maybe it has changed.
Re: Updated sys_membarrier() speedup patch, FYI
On 07/27/2017 10:43 PM, Paul E. McKenney wrote: On Thu, Jul 27, 2017 at 10:20:14PM +0300, Avi Kivity wrote: On 07/27/2017 09:12 PM, Paul E. McKenney wrote: Hello! Please see below for a prototype sys_membarrier() speedup patch. Please note that there is some controversy on this subject, so the final version will probably be quite a bit different than this prototype. But my main question is whether the throttling shown below is acceptable for your use cases, namely only one expedited sys_membarrier() permitted per scheduling-clock period (1 millisecond on many platforms), with any excess being silently converted to non-expedited form. The reason for the throttling is concerns about DoS attacks based on user code with a tight loop invoking this system call. Thoughts? Silent throttling would render it useless for me. -EAGAIN is a little better, but I'd be forced to spin until either I get kicked out of my loop, or it succeeds. IPIing only running threads of my process would be perfect. In fact I might even be able to make use of "membarrier these threads please" to reduce IPIs, when I change the topology from fully connected to something more sparse, on larger machines. My previous implementations were a signal (but that's horrible on large machines) and trylock + mprotect (but that doesn't work on ARM). OK, how about the following patch, which IPIs only the running threads of the process doing the sys_membarrier()? Works for me. Thanx, Paul From: Mathieu Desnoyers <mathieu.desnoy...@efficios.com> To: Peter Zijlstra <pet...@infradead.org> Cc: linux-kernel@vger.kernel.org, Mathieu Desnoyers <mathieu.desnoy...@efficios.com>, "Paul E . 
McKenney" <paul...@linux.vnet.ibm.com>, Boqun Feng <boqun.f...@gmail.com> Subject: [RFC PATCH] membarrier: expedited private command Date: Thu, 27 Jul 2017 14:59:43 -0400 Message-Id: <20170727185943.11570-1-mathieu.desnoy...@efficios.com> Implement MEMBARRIER_CMD_PRIVATE_EXPEDITED with IPIs using cpumask built from all runqueues for which current thread's mm is the same as our own. Scheduler-wise, it requires that we add a memory barrier after context switching between processes (which have different mm). It would be interesting to benchmark the overhead of this added barrier on the performance of context switching between processes. If the preexisting overhead of switching between mm is high enough, the overhead of adding this extra barrier may be insignificant. [ Compile-tested only! ] CC: Peter Zijlstra <pet...@infradead.org> CC: Paul E. McKenney <paul...@linux.vnet.ibm.com> CC: Boqun Feng <boqun.f...@gmail.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoy...@efficios.com> --- include/uapi/linux/membarrier.h | 8 +++-- kernel/membarrier.c | 76 - kernel/sched/core.c | 21 3 files changed, 102 insertions(+), 3 deletions(-) diff --git a/include/uapi/linux/membarrier.h b/include/uapi/linux/membarrier.h index e0b108bd2624..6a33c5852f6b 100644 --- a/include/uapi/linux/membarrier.h +++ b/include/uapi/linux/membarrier.h @@ -40,14 +40,18 @@ * (non-running threads are de facto in such a * state). This covers threads from all processes * running on the system. This command returns 0. + * TODO: documentation. * * Command to be passed to the membarrier system call. The commands need to * be a single bit each, except for MEMBARRIER_CMD_QUERY which is assigned to * the value 0. 
*/ enum membarrier_cmd { - MEMBARRIER_CMD_QUERY = 0, - MEMBARRIER_CMD_SHARED = (1 << 0), + MEMBARRIER_CMD_QUERY = 0, + MEMBARRIER_CMD_SHARED = (1 << 0), + /* reserved for MEMBARRIER_CMD_SHARED_EXPEDITED (1 << 1) */ + /* reserved for MEMBARRIER_CMD_PRIVATE (1 << 2) */ + MEMBARRIER_CMD_PRIVATE_EXPEDITED = (1 << 3), }; #endif /* _UAPI_LINUX_MEMBARRIER_H */ diff --git a/kernel/membarrier.c b/kernel/membarrier.c index 9f9284f37f8d..8c6c0f96f617 100644 --- a/kernel/membarrier.c +++ b/kernel/membarrier.c @@ -19,10 +19,81 @@ #include /* + * XXX For cpu_rq(). Should we rather move + * membarrier_private_expedited() to sched/core.c or create + * sched/membarrier.c ? + */ +#include "sched/sched.h" + +/* * Bitmask made from a "or" of all commands within enum membarrier_cmd, * except MEMBARRIER_CMD_QUERY. */ -#define MEMBARRIER_CMD_BITMASK (MEMBARRIER_CMD_SHARED) +#define MEMBARRIER_CMD_BITMASK \ + (MEMBARRIER_CMD_SHARED | MEMBARRIER_CMD_PRIVATE_EXPEDITED) + rcu_read_unlock(); + } +} + +static void membarrier_private_expedited(void) +{ + int cpu, this_cpu; + cpumask_var_t tmpmask; + + if (num_online_cpus() == 1) + return; + + /* + * Matches memory barriers around rq->curr modification in + * scheduler. + */ + smp_mb();
Re: Updated sys_membarrier() speedup patch, FYI
On 07/27/2017 09:12 PM, Paul E. McKenney wrote: Hello! Please see below for a prototype sys_membarrier() speedup patch. Please note that there is some controversy on this subject, so the final version will probably be quite a bit different than this prototype. But my main question is whether the throttling shown below is acceptable for your use cases, namely only one expedited sys_membarrier() permitted per scheduling-clock period (1 millisecond on many platforms), with any excess being silently converted to non-expedited form. The reason for the throttling is concerns about DoS attacks based on user code with a tight loop invoking this system call. Thoughts? Silent throttling would render it useless for me. -EAGAIN is a little better, but I'd be forced to spin until either I get kicked out of my loop, or it succeeds. IPIing only running threads of my process would be perfect. In fact I might even be able to make use of "membarrier these threads please" to reduce IPIs, when I change the topology from fully connected to something more sparse, on larger machines. My previous implementations were a signal (but that's horrible on large machines) and trylock + mprotect (but that doesn't work on ARM). Thanx, Paul commit 4cd5253094b6d7f9501e21e13aa4e2e78e8a70cd Author: Paul E. McKenney <paul...@linux.vnet.ibm.com> Date: Tue Jul 18 13:53:32 2017 -0700 sys_membarrier: Add expedited option The sys_membarrier() system call has proven too slow for some use cases, which has prompted users to instead rely on TLB shootdown. Although TLB shootdown is much faster, it has the slight disadvantage of not working at all on arm and arm64 and also of being vulnerable to reasonable optimizations that might skip some IPIs. However, the Linux kernel does not currently provide a reasonable alternative, so it is hard to criticize these users for doing what works for them on a given piece of hardware at a given time.
This commit therefore adds an expedited option to the sys_membarrier() system call, thus providing a faster mechanism that is portable and is not subject to death by optimization. Note that if more than one MEMBARRIER_CMD_SHARED_EXPEDITED sys_membarrier() call happens within the same jiffy, all but the first will use synchronize_sched() instead of synchronize_sched_expedited(). Signed-off-by: Paul E. McKenney <paul...@linux.vnet.ibm.com> [ paulmck: Fix code style issue pointed out by Boqun Feng. ] Tested-by: Avi Kivity <a...@scylladb.com> Cc: Maged Michael <maged.mich...@gmail.com> Cc: Andrew Hunter <a...@google.com> Cc: Geoffrey Romer <gro...@google.com> diff --git a/include/uapi/linux/membarrier.h b/include/uapi/linux/membarrier.h index e0b108bd2624..5720386d0904 100644 --- a/include/uapi/linux/membarrier.h +++ b/include/uapi/linux/membarrier.h @@ -40,6 +40,16 @@ * (non-running threads are de facto in such a * state). This covers threads from all processes * running on the system. This command returns 0. + * @MEMBARRIER_CMD_SHARED_EXPEDITED: Execute a memory barrier on all + * running threads, but in an expedited fashion. + * Upon return from system call, the caller thread + * is ensured that all running threads have passed + * through a state where all memory accesses to + * user-space addresses match program order between + * entry to and return from the system call + * (non-running threads are de facto in such a + * state). This covers threads from all processes + * running on the system. This command returns 0. * * Command to be passed to the membarrier system call. 
The commands need to * be a single bit each, except for MEMBARRIER_CMD_QUERY which is assigned to @@ -48,6 +58,7 @@ enum membarrier_cmd { MEMBARRIER_CMD_QUERY = 0, MEMBARRIER_CMD_SHARED = (1 << 0), + MEMBARRIER_CMD_SHARED_EXPEDITED = (1 << 1), }; #endif /* _UAPI_LINUX_MEMBARRIER_H */ diff --git a/kernel/membarrier.c b/kernel/membarrier.c index 9f9284f37f8d..587e3bbfae7e 100644 --- a/kernel/membarrier.c +++ b/kernel/membarrier.c @@ -22,7 +22,8 @@ * Bitmask made from a "or" of all commands within enum membarrier_cmd, * except MEMBARRIER_CMD_QUERY. */ -#define MEMBARRIER_CMD_BITMASK (MEMBARRIER_CMD_SHARED) +#define MEMBARRIER_CMD_BITMASK (MEMBARRIER_CMD_SHARED |\ +MEMBARRIER_CMD_SHARED_EXPEDITED)
Re: Udpated sys_membarrier() speedup patch, FYI
On 07/27/2017 09:12 PM, Paul E. McKenney wrote: Hello! Please see below for a prototype sys_membarrier() speedup patch. Please note that there is some controversy on this subject, so the final version will probably be quite a bit different than this prototype. But my main question is whether the throttling shown below is acceptable for your use cases, namely only one expedited sys_membarrier() permitted per scheduling-clock period (1 millisecond on many platforms), with any excess being silently converted to non-expedited form. The reason for the throttling is concerns about DoS attacks based on user code with a tight loop invoking this system call. Thoughts? Silent throttling would render it useless for me. -EAGAIN is a little better, but I'd be forced to spin until either I get kicked out of my loop, or it succeeds. IPIing only running threads of my process would be perfect. In fact I might even be able to make use of "membarrier these threads please" to reduce IPIs, when I change the topology from fully connected to something more sparse, on larger machines. My previous implementations were a signal (but that's horrible on large machines) and trylock + mprotect (but that doesn't work on ARM). Thanx, Paul commit 4cd5253094b6d7f9501e21e13aa4e2e78e8a70cd Author: Paul E. McKenney Date: Tue Jul 18 13:53:32 2017 -0700 sys_membarrier: Add expedited option The sys_membarrier() system call has proven too slow for some use cases, which has prompted users to instead rely on TLB shootdown. Although TLB shootdown is much faster, it has the slight disadvantage of not working at all on arm and arm64 and also of being vulnerable to reasonable optimizations that might skip some IPIs. However, the Linux kernel does not currrently provide a reasonable alternative, so it is hard to criticize these users from doing what works for them on a given piece of hardware at a given time. 
This commit therefore adds an expedited option to the sys_membarrier() system call, thus providing a faster mechanism that is portable and is not subject to death by optimization. Note that if more than one MEMBARRIER_CMD_SHARED_EXPEDITED sys_membarrier() call happens within the same jiffy, all but the first will use synchronize_sched() instead of synchronize_sched_expedited().

Signed-off-by: Paul E. McKenney
[ paulmck: Fix code style issue pointed out by Boqun Feng. ]
Tested-by: Avi Kivity
Cc: Maged Michael
Cc: Andrew Hunter
Cc: Geoffrey Romer

diff --git a/include/uapi/linux/membarrier.h b/include/uapi/linux/membarrier.h
index e0b108bd2624..5720386d0904 100644
--- a/include/uapi/linux/membarrier.h
+++ b/include/uapi/linux/membarrier.h
@@ -40,6 +40,16 @@
  *                          (non-running threads are de facto in such a
  *                          state). This covers threads from all processes
  *                          running on the system. This command returns 0.
+ * @MEMBARRIER_CMD_SHARED_EXPEDITED: Execute a memory barrier on all
+ *                          running threads, but in an expedited fashion.
+ *                          Upon return from system call, the caller thread
+ *                          is ensured that all running threads have passed
+ *                          through a state where all memory accesses to
+ *                          user-space addresses match program order between
+ *                          entry to and return from the system call
+ *                          (non-running threads are de facto in such a
+ *                          state). This covers threads from all processes
+ *                          running on the system. This command returns 0.
  *
  * Command to be passed to the membarrier system call.
 * The commands need to be a single bit each, except for MEMBARRIER_CMD_QUERY
 * which is assigned to
@@ -48,6 +58,7 @@
 enum membarrier_cmd {
 	MEMBARRIER_CMD_QUERY = 0,
 	MEMBARRIER_CMD_SHARED = (1 << 0),
+	MEMBARRIER_CMD_SHARED_EXPEDITED = (1 << 1),
 };

 #endif /* _UAPI_LINUX_MEMBARRIER_H */

diff --git a/kernel/membarrier.c b/kernel/membarrier.c
index 9f9284f37f8d..587e3bbfae7e 100644
--- a/kernel/membarrier.c
+++ b/kernel/membarrier.c
@@ -22,7 +22,8 @@
  * Bitmask made from a "or" of all commands within enum membarrier_cmd,
  * except MEMBARRIER_CMD_QUERY.
  */
-#define MEMBARRIER_CMD_BITMASK	(MEMBARRIER_CMD_SHARED)
+#define MEMBARRIER_CMD_BITMASK	(MEMBARRIER_CMD_SHARED |	\
+				 MEMBARRIER_CMD_SHARED_EXPEDITED)

/**
 * sys_membarrier - issue memory barriers on a set of threads
@@ -64,6 +65,20 @@ SYSCALL_DEFINE2(membarrier, int, cmd, int, flags)
Re: MAP_POPULATE vs. MADV_HUGEPAGES
On 03/16/2017 04:48 PM, Michal Hocko wrote: On Thu 16-03-17 15:26:54, Avi Kivity wrote: On 03/16/2017 02:34 PM, Michal Hocko wrote: On Wed 15-03-17 18:50:32, Avi Kivity wrote: A user is trying to allocate 1TB of anonymous memory in parallel on 48 cores (4 NUMA nodes). The kernel ends up spinning in isolate_freepages_block(). Which kernel version is that? A good question; it was 3.10.something-el.something. The user mentioned above updated to 4.4, and the problem was gone, so it looks like it is a Red Hat specific problem. I would really like the 3.10.something kernel to handle this workload well, but I understand that's not this list's concern. What is the THP defrag mode (/sys/kernel/mm/transparent_hugepage/defrag)? The default (always). The default has changed since then because the THP fault latencies were just too large. Currently we only allow madvised VMAs to stall and even then we try hard to back off sooner rather than later. See 444eb2a449ef ("mm: thp: set THP defrag by default to madvise and add a stall-free defrag option") merged in 4.4 I see, thanks. So the 4.4 behavior is better mostly due to not trying so hard. I thought to help it along by using MAP_POPULATE, but then my MADV_HUGEPAGE won't be seen until after mmap() completes, with pages already populated. Are MAP_POPULATE and MADV_HUGEPAGE mutually exclusive? Why do you need MADV_HUGEPAGE? So that I get huge pages even if transparent_hugepage/enabled=madvise. I'm allocating almost all of the memory of that machine to be used as a giant cache, so I want it backed by hugepages. Is there any strong reason to not use hugetlb then? You probably want that memory reclaimable, right? Did you mean hugetlbfs? It's a pain to configure, and often requires a reboot. We support it via an option, but we prefer the user's first experience with the application not to be "configure this kernel parameter and reboot".
We don't particularly need that memory to be reclaimable (and in fact we have an option to mlock() it; if it gets swapped, application performance tanks). Is my only option to serialize those memory allocations, and fault in those pages manually? Or perhaps use mlock()? I am still not 100% sure I see what you are trying to achieve, though. So you do not want all those processes to contend inside the compaction while still allocate as many huge pages as possible? Since the process starts with all of that memory free, there should not be any compaction going on (or perhaps very minimal eviction/movement of a few pages here and there). And since it's fixed in later kernels, it looks like the contention was not really mandated by the workload, just an artifact of the implementation. It is possible. A lot has changed since 3.10 times. Like the default behavior :).
Re: MAP_POPULATE vs. MADV_HUGEPAGES
On 03/16/2017 02:34 PM, Michal Hocko wrote: On Wed 15-03-17 18:50:32, Avi Kivity wrote: A user is trying to allocate 1TB of anonymous memory in parallel on 48 cores (4 NUMA nodes). The kernel ends up spinning in isolate_freepages_block(). Which kernel version is that? A good question; it was 3.10.something-el.something. The user mentioned above updated to 4.4, and the problem was gone, so it looks like it is a Red Hat specific problem. I would really like the 3.10.something kernel to handle this workload well, but I understand that's not this list's concern. What is the THP defrag mode (/sys/kernel/mm/transparent_hugepage/defrag)? The default (always). I thought to help it along by using MAP_POPULATE, but then my MADV_HUGEPAGE won't be seen until after mmap() completes, with pages already populated. Are MAP_POPULATE and MADV_HUGEPAGE mutually exclusive? Why do you need MADV_HUGEPAGE? So that I get huge pages even if transparent_hugepage/enabled=madvise. I'm allocating almost all of the memory of that machine to be used as a giant cache, so I want it backed by hugepages. Is my only option to serialize those memory allocations, and fault in those pages manually? Or perhaps use mlock()? I am still not 100% sure I see what you are trying to achieve, though. So you do not want all those processes to contend inside the compaction while still allocate as many huge pages as possible? Since the process starts with all of that memory free, there should not be any compaction going on (or perhaps very minimal eviction/movement of a few pages here and there). And since it's fixed in later kernels, it looks like the contention was not really mandated by the workload, just an artifact of the implementation. To explain the workload again, the process starts, clones as many threads as there are logical processors, and each of those threads mmap()s (and mbind()s) a chunk of memory and then proceeds to touch it.
MAP_POPULATE vs. MADV_HUGEPAGES
A user is trying to allocate 1TB of anonymous memory in parallel on 48 cores (4 NUMA nodes). The kernel ends up spinning in isolate_freepages_block(). I thought to help it along by using MAP_POPULATE, but then my MADV_HUGEPAGE won't be seen until after mmap() completes, with pages already populated. Are MAP_POPULATE and MADV_HUGEPAGE mutually exclusive? Is my only option to serialize those memory allocations, and fault in those pages manually? Or perhaps use mlock()?
Re: [PATCH] vfio: Include No-IOMMU mode
On 11/16/2015 07:06 PM, Alex Williamson wrote: On Wed, 2015-10-28 at 15:21 -0600, Alex Williamson wrote: There is really no way to safely give a user full access to a DMA capable device without an IOMMU to protect the host system. There is also no way to provide DMA translation, for use cases such as device assignment to virtual machines. However, there are still those users that want userspace drivers even under those conditions. The UIO driver exists for this use case, but does not provide the degree of device access and programming that VFIO has. In an effort to avoid code duplication, this introduces a No-IOMMU mode for VFIO. This mode requires building VFIO with CONFIG_VFIO_NOIOMMU and enabling the "enable_unsafe_noiommu_mode" option on the vfio driver. This should make it very clear that this mode is not safe. Additionally, CAP_SYS_RAWIO privileges are necessary to work with groups and containers using this mode. Groups making use of this support are named /dev/vfio/noiommu-$GROUP and can only make use of the special VFIO_NOIOMMU_IOMMU for the container. Use of this mode, specifically binding a device without a native IOMMU group to a VFIO bus driver will taint the kernel and should therefore not be considered supported. This patch includes no-iommu support for the vfio-pci bus driver only. Signed-off-by: Alex Williamson --- This is pretty well the same as RFCv2, I've changed the pr_warn to a dev_warn and added another, printing the pid and comm of the task when it actually opens the device. If Stephen can port the driver code over and prove that this actually works sometime next week, and there aren't any objections to this code, I'll include it in a pull request for the next merge window. MST, I dropped your ack due to the changes, but I'll be happy to add it back if you like. 
Thanks,
Alex

 drivers/vfio/Kconfig        |  15 +++
 drivers/vfio/pci/vfio_pci.c |   8 +-
 drivers/vfio/vfio.c         | 186 ++-
 include/linux/vfio.h        |   3 +
 include/uapi/linux/vfio.h   |   7 ++
 5 files changed, 209 insertions(+), 10 deletions(-)

FYI, this is now in v4.4-rc1 (the slightly modified v2 version). I want to give fair warning though that while we seem to agree on this idea, it hasn't been proven with a userspace driver port. I've opted to include it in this merge window rather than delaying it until v4.5, but I really need to see a user for this before the end of the v4.4 cycle or I think we'll need to revert and revisit for v4.5 anyway. I don't really have any interest in adding and maintaining code that has no users. Please keep me informed of progress with a dpdk port. Thanks,

Thanks Alex. Copying the dpdk mailing list, where the users live.

dpdk-ers: vfio-noiommu is a replacement for uio_pci_generic and uio_igb. It supports MSI-X and so can be used on SR/IOV VF devices. The intent is that you can use dpdk without an external module, using vfio, whether you are on bare metal with an iommu, bare metal without an iommu, or virtualized. However, dpdk needs modification to support this.

-- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH 2/2] vfio: Include no-iommu mode
On 10/12/2015 07:23 PM, Alex Williamson wrote: Also, although you think the long option will set the bar high enough it probably will not satisfy anyone. It is annoying enough that I would just carry a patch to remove the silly requirement. And the people who believe all user mode DMA is evil won't be satisfied either. I find that many users blindly follow howtos and only sometimes do they question the options if they sound scary enough. So yeah, I would intend to make the option upstream sound scary enough for people to think twice about using it and maybe even read the description. That still doesn't prevent pasting it into modprobe.d and forgetting about it. I think we need to allow for packages to work with this. I.e. drop files into config directories rather than require editing of config files. I think this is mostly workable via modprobe.d. (for our own use case, we don't require the extreme performance of many L2/L3 dpdk apps so we'll work with regular iommued vfio for bare-metal installations and ship prebuilt virtual machine images for clouds, so it doesn't matter that much to me).
Re: [RFC PATCH 2/2] vfio: Include no-iommu mode
On 10/11/2015 11:57 AM, Michael S. Tsirkin wrote: On Sun, Oct 11, 2015 at 11:12:14AM +0300, Avi Kivity wrote: Mixing no-iommu and secure VFIO is also unsupported, as are any VFIO IOMMU backends other than the vfio-noiommu backend. Furthermore, unsafe group files are relocated to /dev/vfio-noiommu/. Upon successful loading in this mode, the kernel is tainted due to the dummy IOMMU put in place. Unloading of the module in this mode is also unsupported and will BUG due to the lack of support for unregistering an IOMMU for a bus type. I did not see an API for detecting whether memory translation is provided or not. We can have the caller guess this by looking at the device name, or by requiring the user to specify this, but I think it's cleaner to provide programmatic access to this attribute. It seems that the caller can just check for VFIO_NOIOMMU_IOMMU. Isn't this why it's there? That just means the capability is there, not that it's active. But since you must pass the same value to open(), you already know that you're using noiommu. VFIO_IOMMU_MAP_DMA, VFIO_IOMMU_ENABLE and VFIO_IOMMU_DISABLE will probably also fail ... Don't you have to call MAP_DMA to pin the memory?
Re: [RFC PATCH 2/2] vfio: Include no-iommu mode
On 10/09/2015 09:41 PM, Alex Williamson wrote: There is really no way to safely give a user full access to a PCI device without an IOMMU to protect the host from errant DMA. There is also no way to provide DMA translation, for use cases such as device assignment to virtual machines. However, there are still those users that want userspace drivers under those conditions. The UIO driver exists for this use case, but does not provide the degree of device access and programming that VFIO has. In an effort to avoid code duplication, this introduces a No-IOMMU mode for VFIO. This mode requires enabling CONFIG_VFIO_NOIOMMU and loading the vfio module with the option "enable_unsafe_pci_noiommu_mode". This should make it very clear that this mode is not safe. In this mode, there is no support for unprivileged users, CAP_SYS_ADMIN is required for access to the necessary dev files. CAP_SYS_RAWIO seems a better match (in particular, it allows access to /dev/mem, which is the same thing). Mixing no-iommu and secure VFIO is also unsupported, as are any VFIO IOMMU backends other than the vfio-noiommu backend. Furthermore, unsafe group files are relocated to /dev/vfio-noiommu/. Upon successful loading in this mode, the kernel is tainted due to the dummy IOMMU put in place. Unloading of the module in this mode is also unsupported and will BUG due to the lack of support for unregistering an IOMMU for a bus type. I did not see an API for detecting whether memory translation is provided or not. We can have the caller guess this by looking at the device name, or by requiring the user to specify this, but I think it's cleaner to provide programmatic access to this attribute.
Re: [PATCH v3 2/3] uio_pci_generic: add MSI/MSI-X support
On 10/08/2015 01:26 PM, Michael S. Tsirkin wrote: On Thu, Oct 08, 2015 at 12:19:20PM +0300, Avi Kivity wrote: On 10/08/2015 11:32 AM, Michael S. Tsirkin wrote: On Thu, Oct 08, 2015 at 08:33:45AM +0300, Avi Kivity wrote: On 08/10/15 00:05, Michael S. Tsirkin wrote: On Wed, Oct 07, 2015 at 07:39:16PM +0300, Avi Kivity wrote: That's what I thought as well, but apparently adding msix support to the already insecure uio drivers is even worse. I'm glad you finally agree that what these drivers are doing is insecure. And basically the kernel cares about security, no one wants to maintain insecure stuff. So you guys should think harder whether this code makes any sense upstream. You simply ignore everything I write, cherry-picking the word "insecure" as if it makes your point. That is very frustrating. And I'm sorry about the frustration. I didn't intend to twist your words. It's just that I had to spend literally hours trying to explain that security matters in the kernel, and all I was getting back was a summary "there's no security issue because there are other ways to corrupt memory". The word security has several meanings. The primary meaning is "defense against a malicious attacker". In that sense, there is no added value at all, because the attacker is already root, and can already access all of kernel and user memory. Even if the attacker is not root, and just has access to a non-iommu-protected device, they can still DMA to and from any memory they like. This sense of the word however is irrelevant for this conversation; the user already gave up on it when they chose to use uio_pci_generic (either because they have no iommu, or because they need the extra performance). Do we agree that security, in the sense of defense against a malicious attacker, is irrelevant for this conversation? No. uio_pci_generic currently can be used in a secure way in the sense that it's protected against a malicious attacker, assuming you bind it to a device that does not do DMA.
The context of the conversation is dpdk, which only supports DMA. Do we agree that security, in the sense of defense against a malicious attacker, is irrelevant for this conversation, taking this under consideration? A secondary meaning is protection against inadvertent bugs. Yes, a faulty memory write that happens to land in the msix page can cause a random memory word to be overwritten. But so can a faulty memory write into the rings, or the data structures that support virtual->physical translation, the data structures that describe the packets before translation, the memory allocator or pool. The patch extends the vulnerable surface, but by a negligible amount. So I was glad when it looked like there's finally an agreement that yes, there's value in validating userspace input and yes, it's insecure not to do this. It is good practice to defend against root oopsing the kernel, but in some cases it cannot be achieved. I originally included ways to fix issues that I pointed out, ranging from harder to implement with more overhead but more secure to easier to implement with less overhead but less secure. There didn't seem to be an understanding that the issues are there at all, so I stopped doing that - seemed like a waste of time. For example, will it kill your performance to reset devices cleanly, on open and close, I don't recall this being mentioned at all. http://mid.gmane.org/20151006005527-mutt-send-email-...@redhat.com Down at the moment for me. But really, this is just off the top of my head. These are all issues VFIO developers encountered and fixed over the years. Go into that code, read it, and you will discover the issues and the solutions. vfio is solving a different problem, the problem of security against a malicious attacker, one that I'm hoping to agree here that we aren't attempting to solve. People have been happily using uio_pci_generic despite all those issues. All they were missing was msix support.
You can't use that to force them to overhaul that driver, or to add a new subsystem to vfio. It seems completely unrelated to a patch adding msix support to uio_pci_generic. It isn't unrelated. It's because with the MSIX patch you are enabling bus mastering in the kernel. So if you start the device in a bad state it will corrupt kernel memory. You are right, this patch can regress secure users. I'd be surprised if there are msix-capable pci devices that do not rely on DMA, though. protect them from writes into MSI config, BAR registers and related capabilities, etc.? Obviously the userspace driver has to write to the BAR area. If you're talking about the BAR setup registers, yes there is some (tiny) value in that, but how is it related to this patch? If you don't, moving BARs will move the MSI-X region and protecting it won't help. Won't it just become invisible if you do? If userspace starts playing with BARs, you lost already, whe
Re: [PATCH v3 2/3] uio_pci_generic: add MSI/MSI-X support
On 10/08/2015 12:16 PM, Michael S. Tsirkin wrote: On Thu, Oct 08, 2015 at 11:46:30AM +0300, Avi Kivity wrote: On 10/08/2015 10:32 AM, Michael S. Tsirkin wrote: On Thu, Oct 08, 2015 at 08:33:45AM +0300, Avi Kivity wrote: It is good practice to defend against root oopsing the kernel, but in some cases it cannot be achieved. Absolutely. That's one of the issues with these patches. They don't even try where it's absolutely possible. Are you referring to blocking the maps of the msix BAR areas? For example. There are more. I listed some of the issues on the mailing list, and I might have missed some. VFIO has code to address all this, people should share code to avoid duplication, or at least read it to understand the issues. All but one of those are unrelated to the patch that adds msix support. I think there is value in that. The value is small because a corruption is more likely in the dynamic memory responsible for tens of millions of DMA operations per second, rather than a static 4K area, but it exists. There are other bugs which will hurt e.g. each time application does not exit gracefully. uio_pci_generic disables DMA when the device is removed, so we're safe here, at least if files are released before the address space. But well, heh :) That's precisely my feeling about the whole "running userspace drivers without an IOMMU" project. The value is small since modern hardware has fast IOMMUs, but it exists. For users that don't have iommus at all (usually because it is taken by the hypervisor), it has great value. I can't comment on iommu overhead; for my use case it is likely negligible and we will use an iommu when available; but apparently it matters for others. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 2/3] uio_pci_generic: add MSI/MSI-X support
On 10/08/2015 11:32 AM, Michael S. Tsirkin wrote: On Thu, Oct 08, 2015 at 08:33:45AM +0300, Avi Kivity wrote: On 08/10/15 00:05, Michael S. Tsirkin wrote: On Wed, Oct 07, 2015 at 07:39:16PM +0300, Avi Kivity wrote: That's what I thought as well, but apparently adding msix support to the already insecure uio drivers is even worse. I'm glad you finally agree that what these drivers are doing is insecure. And basically the kernel cares about security; no one wants to maintain insecure stuff. So you guys should think harder about whether this code makes any sense upstream. You simply ignore everything I write, cherry-picking the word "insecure" as if it makes your point. That is very frustrating. And I'm sorry about the frustration. I didn't intend to twist your words. It's just that I had to spend literally hours trying to explain that security matters in the kernel, and all I was getting back was a summary "there's no security issue because there are other ways to corrupt memory". The word security has several meanings. The primary meaning is "defense against a malicious attacker". In that sense, there is no added value at all, because the attacker is already root, and can already access all of kernel and user memory. Even if the attacker is not root, and just has access to a non-iommu-protected device, they can still DMA to and from any memory they like. This sense of the word, however, is irrelevant for this conversation; the user already gave up on it when they chose to use uio_pci_generic (either because they have no iommu, or because they need the extra performance). Do we agree that security, in the sense of defense against a malicious attacker, is irrelevant for this conversation? A secondary meaning is protection against inadvertent bugs. Yes, a faulty memory write that happens to land in the msix page can cause a random memory word to be overwritten. 
But so can a faulty memory write into the rings, or the data structures that support virtual->physical translation, the data structures that describe the packets before translation, the memory allocator or pool. The patch extends the vulnerable surface, but by a negligible amount. So I was glad when it looked like there's finally an agreement that yes, there's value in validating userspace input and yes, it's insecure not to do this. It is good practice to defend against root oopsing the kernel, but in some cases it cannot be achieved. I originally included ways to fix the issues that I pointed out, ranging from harder to implement with more overhead but more secure, to easier to implement with less overhead but less secure. There didn't seem to be an understanding that the issues are there at all, so I stopped doing that - it seemed like a waste of time. For example, will it kill your performance to reset devices cleanly, on open and close, I don't recall this being mentioned at all. It seems completely unrelated to a patch adding msix support to uio_pci_generic. protect them from writes into MSI config, BAR registers and related capabilities, etc.? Obviously the userspace driver has to write to the BAR area. If you're talking about the BAR setup registers, yes there is some (tiny) value in that, but how is it related to this patch? Protecting the MSI area in the BARs _is_ related to the patch. I agree it adds value, if small. And if not, why are you people wasting time arguing about that? If you want to use your position as maintainer of uio_pci_generic to get people to overhaul the driver for you with unrelated changes, they will object. I can understand a maintainer pointing out the right way to do something rather than the wrong way. But piling on a list of unrelated features as prerequisites is, in my opinion, abuse. Let me repeat that pci_uio_generic is already used for userspace drivers, with all the issues that you point out, for a long while now. 
These issues are not exposed by the requirement to use msix. You are not protecting the kernel in any way by blocking the patch, you are only protecting people with iommu-less configurations from using their hardware. The only thing I heard is that it's a hassle. That's true (though if you follow my advice and try to share code with vfio/pci you get a lot of this logic for free). My thinking was that vfio was for secure use (in the "defense against malicious attackers" sense) while uio_pci_generic was, de facto at least, for use by trusted users. We are in the strange situation that Alex is open to adding an insecure mode to vfio, while you object to a patch which does not change the security of uio_pci_generic in any way; it only makes it more usable at the cost of a tiny increase in the bug surface. So it's an understandable argument if you just need something that works, quickly. But if it's such a stopgap hack, there's no need to insist on it upstream. It is n
Re: [PATCH v3 2/3] uio_pci_generic: add MSI/MSI-X support
On 10/08/2015 10:32 AM, Michael S. Tsirkin wrote: On Thu, Oct 08, 2015 at 08:33:45AM +0300, Avi Kivity wrote: It is good practice to defend against root oopsing the kernel, but in some cases it cannot be achieved. Absolutely. That's one of the issues with these patches. They don't even try where it's absolutely possible. Are you referring to blocking the maps of the msix BAR areas? I think there is value in that. The value is small, because a corruption is more likely in the dynamic memory responsible for tens of millions of DMA operations per second, rather than a static 4K area, but it exists.
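For context on what "maps of the msix BAR areas" means in practice: userspace drivers typically obtain their BAR mappings by mmapping the resourceN files that sysfs exposes for each PCI device. The sketch below shows the mechanism; the sysfs path is only an example, and since mapping a real BAR needs root privileges and a bound device, the demo stands in a plain 4 KiB file for the resource file.

```python
import mmap
import os

def map_pci_bar(path, length=None):
    """Map a PCI BAR exposed by sysfs (e.g. .../resource0) into this process.

    The same call works on any mmap-able file, which is what the demo
    below relies on, since a real BAR needs hardware and privileges.
    """
    fd = os.open(path, os.O_RDWR | os.O_SYNC)
    try:
        if length is None:
            length = os.fstat(fd).st_size
        return mmap.mmap(fd, length, mmap.MAP_SHARED,
                         mmap.PROT_READ | mmap.PROT_WRITE)
    finally:
        os.close(fd)          # the mapping stays valid after the fd closes

if __name__ == "__main__":
    # Stand-in for /sys/bus/pci/devices/0000:00:02.0/resource0 (example
    # path): a 4 KiB regular file playing the role of a 4 KiB BAR.
    import tempfile
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"\x00" * 4096)
        path = f.name
    bar = map_pci_bar(path)
    bar[0:4] = b"\xef\xbe\xad\xde"    # a "register write" into the mapping
    print(bar[0:4].hex())
    bar.close()
    os.remove(path)
```

Blocking the msix BAR maps, as discussed above, would mean the kernel refusing (or punching a hole in) exactly this kind of mapping for the pages containing the MSI-X table.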
Re: [PATCH v3 2/3] uio_pci_generic: add MSI/MSI-X support
On 08/10/15 00:05, Michael S. Tsirkin wrote: On Wed, Oct 07, 2015 at 07:39:16PM +0300, Avi Kivity wrote: That's what I thought as well, but apparently adding msix support to the already insecure uio drivers is even worse. I'm glad you finally agree that what these drivers are doing is insecure. And basically the kernel cares about security; no one wants to maintain insecure stuff. So you guys should think harder about whether this code makes any sense upstream. You simply ignore everything I write, cherry-picking the word "insecure" as if it makes your point. That is very frustrating. The kernel is not secure against root, even in the restricted "will it oops" sense. You can oops it easily; try dd if=/dev/urandom of=/dev/mem (or of=/dev/sda for a more satisfying oops). Getting support from the kernel is probably the biggest reason to put code upstream, and this driver taints the kernel unconditionally, so you don't get that. The biggest reason is that if a driver gets upstream, in a year or two it is universally available. Alternatively, most of the problem you are trying to solve is for virtualization - and it is better addressed at the hypervisor level. There are enough open-source hypervisors out there - work on IOMMU support there would be time well spent. It is not. The problem we are trying to solve, and please consider the following as if written in all caps, is that some configurations do not have an iommu or cannot use it for performance reasons. It is good practice to defend against root oopsing the kernel, but in some cases it cannot be achieved. A trivial example is a nommu kernel; this is another. In these cases we can give up on this goal, because it is not the only reason for the kernel's existence; there are others.
Re: [PATCH v3 2/3] uio_pci_generic: add MSI/MSI-X support
On 10/07/2015 07:31 PM, Alex Williamson wrote: I guess the no-iommu map would error if the IOVA isn't simply the bus address of the page mapped. Of course this is entirely unsafe and this no-iommu driver should taint the kernel, but it at least standardizes on one userspace API and you're already doing completely unsafe things with uio. vfio should be enlightened at least to the point that it allows only privileged users access to devices under such a (lack of) iommu. There is an additional complication. With an iommu, userspace programs the device with virtual addresses, but without it, they have to program physical addresses. So vfio would need to communicate this bit of information. We can go further and define a better translation API than the current one (reading /proc/pagemap). But it's going to be a bigger change to vfio than I thought at first. It sounds like a separate vfio iommu backend from type1, one that just pins the page and returns the bus address. The curse and benefit would be that existing type1 users wouldn't "just work" in an insecure mode, the DMA mapping code would need to be aware of the difference. Still, I do really prefer to keep vfio as only exposing a secure, iommu protected device to the user because surely someone will try and users would expect that removing iommu restrictions from vfio means they can do device assignment to VMs w/o an iommu. That's what I thought as well, but apparently adding msix support to the already insecure uio drivers is even worse.
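The "current translation API" mentioned above - reading /proc/pagemap - works as sketched below. Each virtual page has one little-endian 64-bit entry in /proc/<pid>/pagemap; bit 63 says whether the page is present and bits 0-54 hold the page frame number. (One caveat worth noting: since Linux 4.0 the PFN field reads as zero unless the reader has CAP_SYS_ADMIN, which is part of why this interface is awkward for unprivileged drivers.)

```python
import os
import struct

PAGE_SIZE = os.sysconf("SC_PAGE_SIZE")

def pagemap_entry(vaddr, pagemap_path="/proc/self/pagemap"):
    """Read the 64-bit pagemap entry covering a virtual address."""
    with open(pagemap_path, "rb") as f:
        f.seek((vaddr // PAGE_SIZE) * 8)   # one 8-byte entry per virtual page
        return struct.unpack("<Q", f.read(8))[0]

def virt_to_phys(vaddr, entry):
    """Translate vaddr given its pagemap entry; None if the page is absent.

    Bit 63 = page present; bits 0-54 = page frame number (PFN).
    """
    if not entry & (1 << 63):
        return None
    pfn = entry & ((1 << 55) - 1)
    return pfn * PAGE_SIZE + (vaddr % PAGE_SIZE)

if __name__ == "__main__":
    # Decode a synthetic entry rather than a live one, since the real
    # PFN is hidden from unprivileged readers on modern kernels.
    entry = (1 << 63) | 0x1234            # present, PFN 0x1234
    print(hex(virt_to_phys(0x10, entry)))
```

A userspace driver does this lookup for every buffer it hands to the device, which is why the thread floats the idea of vfio providing a better-defined translation API instead.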
Re: [PATCH v3 0/3] uio: add MSI/MSI-X support to uio_pci_generic driver
On 10/07/2015 01:25 PM, Michael S. Tsirkin wrote: On Tue, Oct 06, 2015 at 07:09:11PM +0300, Avi Kivity wrote: On 10/06/2015 06:15 PM, Michael S. Tsirkin wrote: While it is possible that userspace malfunctions and accidentally programs MSI incorrectly, the risk is dwarfed by the ability of userspace to program DMA incorrectly. That seems to imply that for the upstream kernel this is not a valid use case at all. That is trivially incorrect; upstream pci_uio_generic has been used with dpdk for years. dpdk used to do polling for years. A patch to use interrupts was posted in June 2015. dpdk used interrupts long before that. Are dpdk applications an invalid use case? The way dpdk is using UIO/sysfs is borderline at best, and can't be used to justify new interfaces. They have a more secure mode using VFIO. That one's more reasonable. Maybe this was not stressed enough times, but not all configurations have an iommu, or want to use one.
Re: [PATCH v3 2/3] uio_pci_generic: add MSI/MSI-X support
On 10/06/2015 09:51 PM, Alex Williamson wrote: On Tue, 2015-10-06 at 18:23 +0300, Avi Kivity wrote: On 10/06/2015 05:56 PM, Michael S. Tsirkin wrote: On Tue, Oct 06, 2015 at 05:43:50PM +0300, Vlad Zolotarov wrote: The only "like VFIO" behavior we implement here is binding the MSI-X interrupt notification to eventfd descriptor. There will be more if you add some basic memory protections. Besides, that's not true. Your patch queries MSI capability, sets # of vectors. You even hinted you want to add BAR mapping down the road. BAR mapping is already available from sysfs; it is not mandatory. VFIO does all of that. Copying vfio maintainer Alex (hi!). vfio's charter is modern iommu-capable configurations. It is designed to be secure enough to be usable by an unprivileged user. For performance and hardware reasons, many dpdk deployments use uio_pci_generic. They are willing to trade off the security provided by vfio for the performance and deployment flexibility of pci_uio_generic. Forcing these features into vfio will compromise its security and needlessly complicate its code (I guess it can be done with a "null" iommu, but then vfio will have to decide whether it is secure or not). It's not just the iommu model vfio uses, it's that vfio is built around iommu groups. For instance to use a device in vfio, the user opens the vfio group file and asks for the device within that group. That's a fairly fundamental part of the mechanics to sidestep. However, is there an opportunity at a lower level? Systems without an iommu typically have dma ops handled via a software iotlb (ie. bounce buffers), but I think they simply don't have iommu ops registered. Could a no-iommu, iommu subsystem provide enough dummy iommu ops to fake out vfio? It would need to iterate the devices on the bus and come up with dummy iommu groups and dummy versions of iommu_map and unmap. The grouping is easy, one device per group, there's no isolation anyway. 
The vfio type1 iommu backend will do pinning, which seems like an improvement over the mlock that uio users probably try to do now. Right now, people use hugetlbfs maps, which both lock the memory and provide better performance. I guess the no-iommu map would error if the IOVA isn't simply the bus address of the page mapped. Of course this is entirely unsafe and this no-iommu driver should taint the kernel, but it at least standardizes on one userspace API and you're already doing completely unsafe things with uio. vfio should be enlightened at least to the point that it allows only privileged users access to devices under such a (lack of) iommu. There is an additional complication. With an iommu, userspace programs the device with virtual addresses, but without it, they have to program physical addresses. So vfio would need to communicate this bit of information. We can go further and define a better translation API than the current one (reading /proc/pagemap). But it's going to be a bigger change to vfio than I thought at first.
Re: [PATCH v3 2/3] uio_pci_generic: add MSI/MSI-X support
On 10/06/2015 09:51 PM, Alex Williamson wrote: On Tue, 2015-10-06 at 18:23 +0300, Avi Kivity wrote: On 10/06/2015 05:56 PM, Michael S. Tsirkin wrote: On Tue, Oct 06, 2015 at 05:43:50PM +0300, Vlad Zolotarov wrote: The only "like VFIO" behavior we implement here is binding the MSI-X interrupt notification to eventfd descriptor. There will be more if you add some basic memory protections. Besides, that's not true. Your patch queries MSI capability, sets # of vectors. You even hinted you want to add BAR mapping down the road. BAR mapping is already available from sysfs; it is not mandatory. VFIO does all of that. Copying vfio maintainer Alex (hi!). vfio's charter is modern iommu-capable configurations. It is designed to be secure enough to be usable by an unprivileged user. For performance and hardware reasons, many dpdk deployments use uio_pci_generic. They are willing to trade off the security provided by vfio for the performance and deployment flexibility of pci_uio_generic. Forcing these features into vfio will compromise its security and needlessly complicate its code (I guess it can be done with a "null" iommu, but then vfio will have to decide whether it is secure or not). It's not just the iommu model vfio uses, it's that vfio is built around iommu groups. For instance to use a device in vfio, the user opens the vfio group file and asks for the device within that group. That's a fairly fundamental part of the mechanics to sidestep. However, is there an opportunity at a lower level? Systems without an iommu typically have dma ops handled via a software iotlb (ie. bounce buffers), but I think they simply don't have iommu ops registered. Could a no-iommu, iommu subsystem provide enough dummy iommu ops to fake out vfio? It would need to iterate the devices on the bus and come up with dummy iommu groups and dummy versions of iommu_map and unmap. The grouping is easy, one device per group, there's no isolation anyway. 
The vfio type1 iommu backend will do pinning, which seems like an improvement over the mlock that uio users probably try to do now. Right now, people use hugetlbfs maps, which both locks the memory and provides better performance. I guess the no-iommu map would error if the IOVA isn't simply the bus address of the page mapped. Of course this is entirely unsafe and this no-iommu driver should taint the kernel, but it at least standardizes on one userspace API and you're already doing completely unsafe things with uio. vfio should be enlightened at least to the point that it allows only privileged users access to devices under such a (lack of) iommu. There is an additional complication. With an iommu, userspace programs the device with virtual addresses, but without it, they have to program physical addresses. So vfio would need to communicate this bit of information. We can go further and define a better translation API than the current one (reading /proc/pagemap). But it's going to be a bigger change to vfio than I thought at first. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 2/3] uio_pci_generic: add MSI/MSI-X support
On 08/10/15 00:05, Michael S. Tsirkin wrote: On Wed, Oct 07, 2015 at 07:39:16PM +0300, Avi Kivity wrote: That's what I thought as well, but apparently adding msix support to the already insecure uio drivers is even worse. I'm glad you finally agree what these drivers are doing is insecure. And basically kernel cares about security, no one wants to maintain insecure stuff. So you guys should think harder whether this code makes any sense upstream. You simply ignore everything I write, cherry-picking the word "insecure" as if it makes your point. That is very frustrating. The kernel is not secure against root, even in the restricted "will it oops" sense. You can oops it easily: try dd if=/dev/urandom of=/dev/mem (or of=/dev/sda for a more satisfying oops). Getting support from kernel is probably the biggest reason to put code upstream, and this driver taints kernel unconditionally so you don't get that. The biggest reason is that if a driver gets upstream, in a year or two it is universally available. Alternatively, most of the problem you are trying to solve is for virtualization - and it is better addressed at the hypervisor level. There are enough opensource hypervisors out there - work on IOMMU support there would be time well spent. It is not. The problem we are trying to solve, and please consider the following as if written in all caps, is that some configurations do not have an iommu or cannot use it for performance reasons. It is good practice to defend against root oopsing the kernel, but in some cases it cannot be achieved. A trivial example is a nommu kernel; this is another. In these cases we can give up on this goal, because it is not the only reason for the kernel's existence; there are others.
Re: [PATCH v3 0/3] uio: add MSI/MSI-X support to uio_pci_generic driver
On 10/07/2015 01:25 PM, Michael S. Tsirkin wrote: On Tue, Oct 06, 2015 at 07:09:11PM +0300, Avi Kivity wrote: On 10/06/2015 06:15 PM, Michael S. Tsirkin wrote: While it is possible that userspace malfunctions and accidentally programs MSI incorrectly, the risk is dwarfed by the ability of userspace to program DMA incorrectly. That seems to imply that for the upstream kernel this is not a valid usecase at all. That is trivially incorrect: upstream pci_uio_generic has been used with dpdk for years. dpdk used to do polling for years; a patch to use interrupts was posted in June 2015. dpdk used interrupts long before that. Are dpdk applications an invalid use case? The way dpdk is using UIO/sysfs is borderline at best, and can't be used to justify new interfaces. They have a more secure mode using VFIO. That one's more reasonable. Maybe this was not stressed enough times, but not all configurations have an iommu, or want to use one.
Re: [PATCH v3 2/3] uio_pci_generic: add MSI/MSI-X support
On 10/07/2015 07:31 PM, Alex Williamson wrote: I guess the no-iommu map would error if the IOVA isn't simply the bus address of the page mapped. Of course this is entirely unsafe and this no-iommu driver should taint the kernel, but it at least standardizes on one userspace API and you're already doing completely unsafe things with uio. vfio should be enlightened at least to the point that it allows only privileged users access to devices under such a (lack of) iommu. There is an additional complication. With an iommu, userspace programs the device with virtual addresses, but without it, they have to program physical addresses. So vfio would need to communicate this bit of information. We can go further and define a better translation API than the current one (reading /proc/pagemap). But it's going to be a bigger change to vfio than I thought at first. It sounds like a separate vfio iommu backend from type1, one that just pins the page and returns the bus address. The curse and benefit would be that existing type1 users wouldn't "just work" in an insecure mode, the DMA mapping code would need to be aware of the difference. Still, I do really prefer to keep vfio as only exposing a secure, iommu protected device to the user because surely someone will try and users would expect that removing iommu restrictions from vfio means they can do device assignment to VMs w/o an iommu. That's what I thought as well, but apparently adding msix support to the already insecure uio drivers is even worse.
Re: [PATCH v3 0/3] uio: add MSI/MSI-X support to uio_pci_generic driver
On 10/06/2015 06:15 PM, Michael S. Tsirkin wrote: While it is possible that userspace malfunctions and accidentally programs MSI incorrectly, the risk is dwarfed by the ability of userspace to program DMA incorrectly. That seems to imply that for the upstream kernel this is not a valid usecase at all. That is trivially incorrect: upstream pci_uio_generic has been used with dpdk for years. Are dpdk applications an invalid use case? Again:

- Security is not compromised: you need to be root to (ab)use this.
- All of the potentially compromising functionality has been there from day 1.
- uio_pci_generic is the only way to provide the required performance on some configurations (where kernel drivers, or userspace drivers + iommu, are too slow).
- uio_pci_generic + msix is the only way to enable userspace drivers on some configurations (SRIOV).

The proposed functionality does not increase the attack surface. The proposed functionality marginally increases the bug surface. The proposed functionality is a natural evolution of uio_pci_generic. There is a new class of applications (network function virtualization) which require this. They can't use the kernel drivers because they are too slow. They can't use the iommu because it is either too slow, or taken over by the hypervisor. They are willing to live with less kernel protection, because they are a single-user application anyway (and since they use a kernel bypass, they don't really care that much about the kernel). The kernel serves more use cases than a desktop or a multi-user server. Some of these users are willing to trade off protection for performance or functionality (an extreme, yet similar, example is linux-nommu, which allows any application to access any bit of memory, due to the lack of protection hardware).
Re: [dpdk-dev] [PATCH 2/2] uio: new driver to support PCI MSI-X
On 10/06/2015 05:07 PM, Michael S. Tsirkin wrote: On Tue, Oct 06, 2015 at 03:15:57PM +0300, Avi Kivity wrote: btw, (2) doesn't really add any insecurity. The user could already poke at the msix tables (as well as perform DMA); they just couldn't get a useful interrupt out of them. Poking at msix tables won't cause memory corruption unless msix and bus mastering is enabled. It's a given that bus mastering is enabled. It's true that msix is unlikely to be enabled, unless msix support is added. It's true root can enable msix and bus mastering through sysfs - but that's easy to block or detect. Even if you don't buy a security story, it seems less likely to trigger as a result of a userspace bug. If you're doing DMA, that's the least of your worries. Still, zero-mapping the msix space seems reasonable, and can protect userspace from silly stuff. It can't be considered to have anything to do with security though, as long as users can simply DMA to every bit of RAM in the system they want to.
Re: [PATCH v3 2/3] uio_pci_generic: add MSI/MSI-X support
On 10/06/2015 05:46 PM, Michael S. Tsirkin wrote: On Mon, Oct 05, 2015 at 11:28:03AM +0300, Avi Kivity wrote: Eventfd is a natural enough representation of an interrupt; both kvm and vfio use it, and are also able to share the eventfd, allowing a vfio interrupt to generate a kvm interrupt, without userspace intervention, and one day without even kernel intervention. eventfd without kernel intervention sounds unlikely. kvm might configure the cpu such that an interrupt will not trigger a vmexit. eventfd seems like an unlikely interface to do that: with the eventfd, device triggering it has no info about the interrupt so it can't send it to the correct VM. https://lwn.net/Articles/650863/
Re: [PATCH v3 2/3] uio_pci_generic: add MSI/MSI-X support
On 10/06/2015 05:56 PM, Michael S. Tsirkin wrote: On Tue, Oct 06, 2015 at 05:43:50PM +0300, Vlad Zolotarov wrote: The only "like VFIO" behavior we implement here is binding the MSI-X interrupt notification to an eventfd descriptor. There will be more if you add some basic memory protections. Besides, that's not true. Your patch queries MSI capability, sets # of vectors. You even hinted you want to add BAR mapping down the road. BAR mapping is already available from sysfs; it is not mandatory. VFIO does all of that. Copying vfio maintainer Alex (hi!). vfio's charter is modern iommu-capable configurations. It is designed to be secure enough to be usable by an unprivileged user. For performance and hardware reasons, many dpdk deployments use uio_pci_generic. They are willing to trade off the security provided by vfio for the performance and deployment flexibility of pci_uio_generic. Forcing these features into vfio will compromise its security and needlessly complicate its code (I guess it can be done with a "null" iommu, but then vfio will have to decide whether it is secure or not). This doesn't justify the hassle of implementing an IOMMU-less VFIO mode. This applies to both VFIO and UIO really. I'm not sure the hassle of maintaining this functionality in tree is justified. It remains to be seen whether there are any users that won't taint the kernel. Apparently not in the current form of the patch, but who knows. It is not msix that taints the kernel, it's uio_pci_generic. Msix is a tiny feature addition that doesn't change the security situation one bit. btw, currently you can map BARs and dd to /dev/mem to your heart's content without tainting the kernel. I don't see how you can claim that msix support makes the situation worse, when root can access every bit of physical memory, either directly or via DMA.
Re: [PATCH v3 0/3] uio: add MSI/MSI-X support to uio_pci_generic driver
On 10/06/2015 05:30 PM, Michael S. Tsirkin wrote: On Tue, Oct 06, 2015 at 11:37:59AM +0300, Vlad Zolotarov wrote: Bus mastering is easily enabled from the user space (taken from DPDK code):

static int pci_uio_set_bus_master(int dev_fd)
{
	uint16_t reg;
	int ret;

	ret = pread(dev_fd, &reg, sizeof(reg), PCI_COMMAND);
	if (ret != sizeof(reg)) {
		RTE_LOG(ERR, EAL, "Cannot read command from PCI config space!\n");
		return -1;
	}

	/* return if bus mastering is already on */
	if (reg & PCI_COMMAND_MASTER)
		return 0;

	reg |= PCI_COMMAND_MASTER;

	ret = pwrite(dev_fd, &reg, sizeof(reg), PCI_COMMAND);
	if (ret != sizeof(reg)) {
		RTE_LOG(ERR, EAL, "Cannot write command to PCI config space!\n");
		return -1;
	}

	return 0;
}

So, this is a non-issue. ;) There might be valid reasons for DPDK to do this, e.g. if using VFIO. DPDK does this when using vfio, and when using uio_pci_generic. All of the network cards that DPDK supports require DMA. I'm guessing it doesn't enable MSI though, does it? It does not enable MSI, because the main kernel driver used for interacting with the device, pci_uio_generic, does not support MSI. In some configurations, PCI INTA is not available, while MSI(X) is, hence the desire that pci_uio_generic support MSI. While it is possible that userspace malfunctions and accidentally programs MSI incorrectly, the risk is dwarfed by the ability of userspace to program DMA incorrectly. Under normal operation userspace programs tens of millions of DMA operations per second, while it never touches the MSI BARs (it is the kernel that programs them).
Re: [dpdk-dev] [PATCH 2/2] uio: new driver to support PCI MSI-X
On 10/06/2015 10:33 AM, Stephen Hemminger wrote: Other than implementation objections, so far the two main arguments against this reduce to:

1. If you allow UIO ioctl then it opens an API hook for all the crap out-of-tree UIO drivers to do what they want.
2. If you allow UIO MSI-X then you are expanding the usage of userspace device access in an insecure manner.

Another alternative which I explored was making a version of VFIO that works without IOMMU. It solves #1 but actually increases the likely negative response to argument #2. This would keep the same API, and avoid having to modify UIO. But we would still have the same (if not more) resistance from IOMMU developers who believe all systems have to be secure against root. vfio's charter was explicitly aiming for modern setups with iommus. This could be revisited, but I agree it will have even more resistance, justified IMO. btw, (2) doesn't really add any insecurity. The user could already poke at the msix tables (as well as perform DMA); they just couldn't get a useful interrupt out of them. Maybe a module parameter "allow_insecure_dma" can be added to uio_pci_generic. Without the parameter, bus mastering and msix are disabled; with the parameter, they are allowed. This requires the sysadmin to take a positive step in order to make use of their hardware.
Re: [PATCH v3 2/3] uio_pci_generic: add MSI/MSI-X support
On 10/05/2015 02:41 PM, Vlad Zolotarov wrote: On 10/05/15 13:57, Greg KH wrote: On Mon, Oct 05, 2015 at 01:48:39PM +0300, Vlad Zolotarov wrote: On 10/05/15 10:56, Greg KH wrote: On Mon, Oct 05, 2015 at 10:41:39AM +0300, Vlad Zolotarov wrote:

+struct msix_info {
+	int num_irqs;
+	struct msix_entry *table;
+struct uio_msix_irq_ctx {
+	struct eventfd_ctx *trigger;	/* MSI-x vector to eventfd */

Why are you using eventfd for msi vectors? What's the reason for needing this? A small correction - for MSI-X vectors. There may be only one MSI vector per PCI function, and if it's used it would use the same interface as a legacy INT#x interrupt uses at the moment. So, for the MSI-X case the reason is that there may be (in most cases there will be) more than one interrupt vector. Thus, as I've explained in the PATCH1 thread, we need a way to indicate each of them separately. eventfd seems like a good way of doing so. If you have better ideas, please share. You need to document what you are doing here, I don't see any explanation for using eventfd at all. And no, I don't know of any other solution as I don't know what you are trying to do here (hint, the changelog didn't document it...) You haven't documented how this api works at all; you are going to have to do a lot more work to justify this, as this greatly increases the complexity of the user/kernel api in unknown ways. I actually did document it a bit. Please check PATCH3 out. That provided no information at all about how to use the api. If it did, you would see that your api is broken for 32/64-bit kernels and will fall over into nasty pieces the first time you try to use it there, which means it hasn't been tested at all :( It has been tested of course ;) I tested it only in a 64-bit environment, however, where both kernel and user space applications were compiled on the same machine with the same compiler, and it could be that "int" had the same number of bytes both in the kernel and in the user space application.
Therefore it worked perfectly - I patched DPDK to use the new uio_pci_generic MSI-X API to test this, and I have verified that all 3 interrupt modes work: MSI-X with an SR-IOV VF device in an Amazon EC2 guest, and INT#x and MSI with a PF device on a bare metal server. However I agree using uint32_t for "vec" and "fd" would be much more correct. I don't think file descriptors are __u32 on a 64-bit arch, are they? I think they are "int" on all platforms, and as far as I know u32 should be enough to contain an int on any platform. You need to make sure structures have the same layout on both 32-bit and 64-bit systems, or you'll have to code compat ioctl translations for them. The best way to do that is to use __u32 so the sizes are obvious, even for int, and to pad everything to 64 bit:

+struct msix_info {
+	__u32 num_irqs;
+	__u32 pad; // so pointer below is aligned to 64-bit on both 32-bit and 64-bit userspace
+	struct msix_entry *table;
+struct uio_msix_irq_ctx {
+	struct eventfd_ctx *trigger;	/* MSI-x vector to eventfd */

And NEVER use the _t types in kernel code, Never meant it - it was for a user space interface. For a kernel it's u32 of course. For interfaces, use __u32. You can't use uint32_t because if someone uses C89 in 2015, they may not have <stdint.h>.
Re: [PATCH v3 2/3] uio_pci_generic: add MSI/MSI-X support
On 10/05/2015 01:57 PM, Greg KH wrote: On Mon, Oct 05, 2015 at 01:48:39PM +0300, Vlad Zolotarov wrote: On 10/05/15 10:56, Greg KH wrote: On Mon, Oct 05, 2015 at 10:41:39AM +0300, Vlad Zolotarov wrote:

+struct msix_info {
+	int num_irqs;
+	struct msix_entry *table;
+struct uio_msix_irq_ctx {
+	struct eventfd_ctx *trigger;	/* MSI-x vector to eventfd */

Why are you using eventfd for msi vectors? What's the reason for needing this? A small correction - for MSI-X vectors. There may be only one MSI vector per PCI function, and if it's used it would use the same interface as a legacy INT#x interrupt uses at the moment. So, for the MSI-X case the reason is that there may be (in most cases there will be) more than one interrupt vector. Thus, as I've explained in the PATCH1 thread, we need a way to indicate each of them separately. eventfd seems like a good way of doing so. If you have better ideas, please share. You need to document what you are doing here, I don't see any explanation for using eventfd at all. And no, I don't know of any other solution as I don't know what you are trying to do here (hint, the changelog didn't document it...) You haven't documented how this api works at all; you are going to have to do a lot more work to justify this, as this greatly increases the complexity of the user/kernel api in unknown ways. I actually did document it a bit. Please check PATCH3 out. That provided no information at all about how to use the api. If it did, you would see that your api is broken for 32/64-bit kernels and will fall over into nasty pieces the first time you try to use it there, which means it hasn't been tested at all :( It has been tested of course ;) I tested it only in a 64-bit environment, however, where both kernel and user space applications were compiled on the same machine with the same compiler, and it could be that "int" had the same number of bytes both in the kernel and in the user space application.
Therefore it worked perfectly - I patched DPDK to use the new uio_pci_generic MSI-X API to test this, and I have verified that all 3 interrupt modes work: MSI-X with an SR-IOV VF device in an Amazon EC2 guest, and INT#x and MSI with a PF device on a bare metal server. However I agree using uint32_t for "vec" and "fd" would be much more correct. I don't think file descriptors are __u32 on a 64-bit arch, are they? And NEVER use the _t types in kernel code, the namespace is all wrong and it is not applicable for us, sorry. Wasn't the real reason that they aren't defined (or reserved) by C89, and therefore could clash with a user identifier, rather than some inherent wrongness?
Re: [PATCH v3 2/3] uio_pci_generic: add MSI/MSI-X support
On 10/05/2015 12:49 PM, Greg KH wrote: On Mon, Oct 05, 2015 at 11:28:03AM +0300, Avi Kivity wrote: Of course it has to be documented, but this just follows vfio. Eventfd is a natural enough representation of an interrupt; both kvm and vfio use it, and are also able to share the eventfd, allowing a vfio interrupt to generate a kvm interrupt, without userspace intervention, and one day without even kernel intervention. That's nice and wonderful, but it's not how UIO works today, so this is now going to be a mix and match type interface, with no justification so far as to why to create this new api and exactly how this is all going to be used from userspace. The intended user is dpdk (http://dpdk.org), which is a family of userspace networking drivers for high performance networking applications. The natural device driver for dpdk is vfio, which both provides memory protection and exposes msi/msix interrupts. However, in many cases vfio cannot be used, either due to the lack of an iommu (for example, in virtualized environments) or out of a desire to avoid the iommu's performance impact. The challenge in exposing msix interrupts to user space is that there are many of them, so you can't simply poll the device fd. If you do, how do you know which interrupt was triggered? The solution that vfio adopted was to associate each interrupt with an eventfd, allowing it to be individually polled. Since you can pass an eventfd with SCM_RIGHTS, and since kvm can trigger guest interrupts using an eventfd, the solution is very flexible. Example code would be even better... This is the vfio dpdk interface code: http://dpdk.org/browse/dpdk/tree/lib/librte_eal/linuxapp/eal/eal_pci_vfio.c Basically, the equivalent uio msix code would be very similar if uio adopts a similar interface: http://dpdk.org/browse/dpdk/tree/lib/librte_eal/linuxapp/eal/eal_pci_uio.c (current code lacks msi/msix support, of course).
Re: [PATCH v3 2/3] uio_pci_generic: add MSI/MSI-X support
On 10/05/2015 06:11 AM, Greg KH wrote: On Sun, Oct 04, 2015 at 11:43:17PM +0300, Vlad Zolotarov wrote: Add support for MSI and MSI-X interrupt modes:
- Interrupt mode selection order is: INT#X (for backward compatibility) -> MSI-X -> MSI.
- Add ioctl() commands:
  - UIO_PCI_GENERIC_INT_MODE_GET: query the current interrupt mode.
  - UIO_PCI_GENERIC_IRQ_NUM_GET: query the maximum number of IRQs.
  - UIO_PCI_GENERIC_IRQ_SET: bind the IRQ to an eventfd (similar to vfio).
- Add mappings to all BARs (memory and portio): some devices have registers related to MSI/MSI-X handling outside BAR0.

Signed-off-by: Vlad Zolotarov

New in v3: Add __iomem qualifier to temp buffer receiving ioremap value.
New in v2: Added #include to uio_pci_generic.c

 drivers/uio/uio_pci_generic.c   | 410 +---
 include/linux/uio_pci_generic.h |  36
 2 files changed, 423 insertions(+), 23 deletions(-)
 create mode 100644 include/linux/uio_pci_generic.h

diff --git a/drivers/uio/uio_pci_generic.c b/drivers/uio/uio_pci_generic.c index d0b508b..6b8b1789 100644 --- a/drivers/uio/uio_pci_generic.c +++ b/drivers/uio/uio_pci_generic.c @@ -22,16 +22,32 @@ #include #include #include +#include #include #include +#include +#include +#include

#define DRIVER_VERSION "0.01.0"
#define DRIVER_AUTHOR "Michael S. Tsirkin "
#define DRIVER_DESC "Generic UIO driver for PCI 2.3 devices"

+struct msix_info {
+	int num_irqs;
+	struct msix_entry *table;
+	struct uio_msix_irq_ctx {
+		struct eventfd_ctx *trigger; /* MSI-X vector to eventfd */

Why are you using eventfd for msi vectors? What's the reason for needing this? You haven't documented how this api works at all; you are going to have to do a lot more work to justify this, as this greatly increases the complexity of the user/kernel api in unknown ways.

Of course it has to be documented, but this just follows vfio.
Eventfd is a natural enough representation of an interrupt; both kvm and vfio use it, and are also able to share the eventfd, allowing a vfio interrupt to generate a kvm interrupt without userspace intervention, and one day without even kernel intervention.
Re: [PATCH v3 2/3] uio_pci_generic: add MSI/MSI-X support
On 10/05/2015 02:41 PM, Vlad Zolotarov wrote: On 10/05/15 13:57, Greg KH wrote: On Mon, Oct 05, 2015 at 01:48:39PM +0300, Vlad Zolotarov wrote: On 10/05/15 10:56, Greg KH wrote: On Mon, Oct 05, 2015 at 10:41:39AM +0300, Vlad Zolotarov wrote:

+struct msix_info {
+	int num_irqs;
+	struct msix_entry *table;
+	struct uio_msix_irq_ctx {
+		struct eventfd_ctx *trigger; /* MSI-X vector to eventfd */

Why are you using eventfd for msi vectors? What's the reason for needing this?

A small correction - for MSI-X vectors. There may be only one MSI vector per PCI function, and if it's used it would use the same interface as a legacy INT#x interrupt uses at the moment. So, for the MSI-X case the reason is that there may be (in most cases there will be) more than one interrupt vector. Thus, as I've explained in the PATCH1 thread, we need a way to indicate each of them separately. eventfd seems like a good way of doing so. If you have better ideas, please share.

You need to document what you are doing here; I don't see any explanation for using eventfd at all. And no, I don't know of any other solution, as I don't know what you are trying to do here (hint, the changelog didn't document it...) You haven't documented how this api works at all; you are going to have to do a lot more work to justify this, as this greatly increases the complexity of the user/kernel api in unknown ways.

I actually did document it a bit. Please check PATCH3 out.

That provided no information at all about how to use the api. If it did, you would see that your api is broken for 32/64-bit kernels and will fall over into nasty pieces the first time you try to use it there, which means it hasn't been tested at all :(

It has been tested, of course ;) I tested it only in a 64-bit environment, however, where both the kernel and the userspace application were compiled on the same machine with the same compiler, and it could be that "int" had the same number of bytes both in the kernel and in the userspace application.
Therefore it worked perfectly - I patched DPDK to use the new uio_pci_generic MSI-X API to test this, and I have verified that all 3 interrupt modes work: MSI-X with an SR-IOV VF device in an Amazon EC2 guest, and INT#x and MSI with a PF device on a bare metal server. However, I agree that using uint32_t for "vec" and "fd" would be much more correct.

I don't think file descriptors are __u32 on a 64-bit arch, are they?

I think they are "int" on all platforms, and as far as I know u32 should be enough to contain an int on any platform.

You need to make sure structures have the same layout on both 32-bit and 64-bit systems, or you'll have to code compat ioctl translations for them. The best way to do that is to use __u32 so the sizes are obvious, even for int, and to pad everything to 64 bit:

+struct msix_info {
+	__u32 num_irqs;
+	__u32 pad; /* so the pointer below is aligned to 64 bits for both 32-bit and 64-bit userspace */
+	struct msix_entry *table;
+	struct uio_msix_irq_ctx {
+		struct eventfd_ctx *trigger; /* MSI-X vector to eventfd */

And NEVER use the _t types in kernel code.

Never meant it - it was for a userspace interface. For the kernel it's u32, of course.

For interfaces, use __u32. You can't use uint32_t because if someone uses C89 in 2015, they may not have <stdint.h>.