Re: Linux 4.9.256

2021-02-09 Thread Avi Kivity

On 2/8/21 8:57 PM, Sasha Levin wrote:

On Mon, Feb 08, 2021 at 05:50:21PM +0200, Avi Kivity wrote:

On 05/02/2021 16.26, Greg Kroah-Hartman wrote:

I'm announcing the release of the 4.9.256 kernel.

This, and the 4.4.256 release are a little bit "different" than normal.

This contains only 1 patch, just the version bump from .255 to .256, which
ends up causing the userspace-visible LINUX_VERSION_CODE to behave a bit
differently than normal due to the "overflow".

With this release, KERNEL_VERSION(4, 9, 256) is the same as
KERNEL_VERSION(4, 10, 0).



I think this is a bad idea. Many kernel features can only be 
discovered by checking the kernel version. If a feature was 
introduced in 4.10, then an application can be tricked into thinking 
a 4.9 kernel has it.



IMO, better to stop LINUX_VERSION_CODE at 255 and introduce a
LINUX_VERSION_CODE_IMPROVED that has more bits for patchlevel.


In the upstream (and new -stable fix) we did the first part, capping
LINUX_VERSION_CODE at 255.


Do you have a usecase where it's actually needed? i.e. userspace that
checks for -stable patchlevels?



Not stable patchlevels, but minors. So a change from 4.9 to 4.10 could 
be harmful.



I have two such examples (not on the 4.9->4.10 boundary), but they test 
the runtime version from uname(), not LINUX_VERSION_CODE, so they would 
be vulnerable to such a change.





Re: Linux 4.9.256

2021-02-08 Thread Avi Kivity

On 05/02/2021 16.26, Greg Kroah-Hartman wrote:

I'm announcing the release of the 4.9.256 kernel.

This, and the 4.4.256 release are a little bit "different" than normal.

This contains only 1 patch, just the version bump from .255 to .256, which ends
up causing the userspace-visible LINUX_VERSION_CODE to behave a bit differently
than normal due to the "overflow".

With this release, KERNEL_VERSION(4, 9, 256) is the same as
KERNEL_VERSION(4, 10, 0).



I think this is a bad idea. Many kernel features can only be discovered 
by checking the kernel version. If a feature was introduced in 4.10, 
then an application can be tricked into thinking a 4.9 kernel has it.



IMO, better to stop LINUX_VERSION_CODE at 255 and introduce a 
LINUX_VERSION_CODE_IMPROVED that has more bits for patchlevel.




Nothing in the kernel build itself breaks with this change, but given that this
is a userspace visible change, and some crazy tools (like glibc and gcc) have
logic that checks the kernel version for different reasons, I wanted to do this
release as an "empty" release to ensure that everything still works properly.



Even if glibc and gcc work, other programs may not.


I have two such cases. They don't depend on 4.9, but they're examples of 
features that are not discoverable by other means.





So, this is a YOU MUST UPGRADE requirement of a release.  If you rely on the
4.9.y kernel, please throw this release into your test builds and rebuild the
world and let us know if anything breaks, or if all is well.

Go forth and do full system rebuilds!  Yocto and Gentoo are great for this, as
are systems that use buildroot.

I'll try to hold off on doing a "real" 4.9.y release for a week to give
everyone a chance to test this out and get back to me.  The pending patches in
the 4.9.y queue are pretty serious, so I am loath to wait longer than that,
consider yourself warned...

The updated 4.9.y git tree can be found at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git 
linux-4.9.y
and can be browsed at the normal kernel.org git web browser:

https://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=summary

thanks,

greg k-h



  Makefile |2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

Greg Kroah-Hartman (1):
   Linux 4.9.256




Re: Spurious EIO on AIO+DIO+RWF_NOWAIT

2018-12-12 Thread Avi Kivity

On 12/10/18 2:48 PM, Goldwyn Rodrigues wrote:

On 13:19 09/12, Avi Kivity wrote:

I have an application that receives spurious EIO when running with
RWF_NOWAIT enabled. Removing RWF_NOWAIT causes those EIOs to disappear. The
application uses AIO+DIO, and errors were seen on both xfs and ext4.


I suspect the following code:


/*
 * Process one completed BIO.  No locks are held.
 */
static blk_status_t dio_bio_complete(struct dio *dio, struct bio *bio)
{
	struct bio_vec *bvec;
	unsigned i;
	blk_status_t err = bio->bi_status;

	if (err) {
		if (err == BLK_STS_AGAIN && (bio->bi_opf & REQ_NOWAIT))
			dio->io_error = -EAGAIN;
		else
			dio->io_error = -EIO;
	}

Could it be that REQ_NOWAIT was dropped from bio->bi_opf? or that
bio->bi_status got changed along the way?


I don't think REQ_NOWAIT is dropped. I am assuming bio->bi_status error
is set differently. Is the blk queue being stopped? Is it possible to
instrument the kernel in your testcase?



I traced the function, and I see bio->bi_status == BLK_STS_NOTSUPP and 
bio->bi_opf == REQ_OP_WRITE|REQ_SYNC|REQ_NOMERGE|REQ_FUA|REQ_NOWAIT. 
Presumably the NOTSUPP is the result of NOWAIT not being supported down 
the stack, but shouldn't it be detected earlier? And not converted to EIO?




Re: Spurious EIO on AIO+DIO+RWF_NOWAIT

2018-12-10 Thread Avi Kivity



On 10/12/2018 14.48, Goldwyn Rodrigues wrote:

On 13:19 09/12, Avi Kivity wrote:

I have an application that receives spurious EIO when running with
RWF_NOWAIT enabled. Removing RWF_NOWAIT causes those EIOs to disappear. The
application uses AIO+DIO, and errors were seen on both xfs and ext4.


I suspect the following code:


/*
 * Process one completed BIO.  No locks are held.
 */
static blk_status_t dio_bio_complete(struct dio *dio, struct bio *bio)
{
	struct bio_vec *bvec;
	unsigned i;
	blk_status_t err = bio->bi_status;

	if (err) {
		if (err == BLK_STS_AGAIN && (bio->bi_opf & REQ_NOWAIT))
			dio->io_error = -EAGAIN;
		else
			dio->io_error = -EIO;
	}

Could it be that REQ_NOWAIT was dropped from bio->bi_opf? or that
bio->bi_status got changed along the way?


I don't think REQ_NOWAIT is dropped. I am assuming bio->bi_status error
is set differently. Is the blk queue being stopped?



Not that I know of.



Is it possible to
instrument the kernel in your testcase?



I'm happy to apply patches, or run systemtap (or similar) scripts.




Spurious EIO on AIO+DIO+RWF_NOWAIT

2018-12-09 Thread Avi Kivity
I have an application that receives spurious EIO when running with 
RWF_NOWAIT enabled. Removing RWF_NOWAIT causes those EIOs to disappear. 
The application uses AIO+DIO, and errors were seen on both xfs and ext4.



I suspect the following code:


/*
 * Process one completed BIO.  No locks are held.
 */
static blk_status_t dio_bio_complete(struct dio *dio, struct bio *bio)
{
	struct bio_vec *bvec;
	unsigned i;
	blk_status_t err = bio->bi_status;

	if (err) {
		if (err == BLK_STS_AGAIN && (bio->bi_opf & REQ_NOWAIT))
			dio->io_error = -EAGAIN;
		else
			dio->io_error = -EIO;
	}

Could it be that REQ_NOWAIT was dropped from bio->bi_opf? or that
bio->bi_status got changed along the way?



Reducing the test case may be a lot of work, but I can package a binary 
using docker if a reproducer is needed. Seen on 4.18.19 and 4.19.something.




[PATCH v1] Revert "eventfd: only return events requested in poll_mask()"

2018-06-17 Thread Avi Kivity
This reverts commit 4d572d9f46507be8cfe326aa5bc3698babcbdfa7. It is
superseded by the more general
2739b807b0885a09996659be82f813af219c7360 ("aio: only return events
requested in poll_mask() for IOCB_CMD_POLL"). Unfortunately, hch
nacked it on the bug report rather than on the patch itself, so it
was picked up.
---
 fs/eventfd.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/eventfd.c b/fs/eventfd.c
index ceb1031f1cac..61c9514da5e9 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -154,15 +154,15 @@ static __poll_t eventfd_poll_mask(struct file *file, __poll_t eventmask)
 	 * eventfd_poll returns 0
 	 */
 	count = READ_ONCE(ctx->count);
 
 	if (count > 0)
-		events |= (EPOLLIN & eventmask);
+		events |= EPOLLIN;
 	if (count == ULLONG_MAX)
 		events |= EPOLLERR;
 	if (ULLONG_MAX - 1 > count)
-		events |= (EPOLLOUT & eventmask);
+		events |= EPOLLOUT;
 
 	return events;
 }
 
 static void eventfd_ctx_do_read(struct eventfd_ctx *ctx, __u64 *cnt)
-- 
2.14.4





[PATCH v1] eventfd: only return events requested in poll_mask()

2018-06-08 Thread Avi Kivity
The ->poll_mask() operation has a mask of events that the caller
is interested in, but we're returning all events regardless.

Change to return only the events the caller is interested in. This
fixes aio IOCB_CMD_POLL returning immediately when called with POLLIN
on an eventfd, since an eventfd is almost always ready for a write.

Signed-off-by: Avi Kivity 
---
 fs/eventfd.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/eventfd.c b/fs/eventfd.c
index 61c9514da5e9..ceb1031f1cac 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -154,15 +154,15 @@ static __poll_t eventfd_poll_mask(struct file *file, __poll_t eventmask)
 	 * eventfd_poll returns 0
 	 */
 	count = READ_ONCE(ctx->count);
 
 	if (count > 0)
-		events |= EPOLLIN;
+		events |= (EPOLLIN & eventmask);
 	if (count == ULLONG_MAX)
 		events |= EPOLLERR;
 	if (ULLONG_MAX - 1 > count)
-		events |= EPOLLOUT;
+		events |= (EPOLLOUT & eventmask);
 
 	return events;
 }
 
 static void eventfd_ctx_do_read(struct eventfd_ctx *ctx, __u64 *cnt)
-- 
2.14.4






[PATCH v1] aio: mark __aio_sigset::sigmask const

2018-06-08 Thread Avi Kivity
io_pgetevents() will not change the signal mask. Mark it const
to make it clear and to reduce the need for casts in user code.

Signed-off-by: Avi Kivity 
---
 include/uapi/linux/aio_abi.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/uapi/linux/aio_abi.h b/include/uapi/linux/aio_abi.h
index ed0185945bb2..cdf115b03761 100644
--- a/include/uapi/linux/aio_abi.h
+++ b/include/uapi/linux/aio_abi.h
@@ -106,11 +106,11 @@ struct iocb {
 
 #undef IFBIG
 #undef IFLITTLE
 
 struct __aio_sigset {
-	sigset_t __user		*sigmask;
+	const sigset_t __user	*sigmask;
 	size_t			sigsetsize;
 };
 
 #endif /* __LINUX__AIO_ABI_H */
 
-- 
2.14.4






Re: aio poll, io_pgetevents and a new in-kernel poll API V3

2018-01-18 Thread Avi Kivity

On 01/18/2018 07:51 PM, Avi Kivity wrote:

On 01/18/2018 05:46 PM, Jeff Moyer wrote:

FYI, this kernel has issues.  It will boot up, but I don't have
networking, and even rebooting doesn't succeed.  I'm looking into it.


FWIW, I'm running an older version of this patchset on my desktop with 
no problems so far. 


(A Fedora 27)



Re: aio poll, io_pgetevents and a new in-kernel poll API V3

2018-01-18 Thread Avi Kivity

On 01/18/2018 05:46 PM, Jeff Moyer wrote:

FYI, this kernel has issues.  It will boot up, but I don't have
networking, and even rebooting doesn't succeed.  I'm looking into it.


FWIW, I'm running an older version of this patchset on my desktop with 
no problems so far.



-Jeff

Christoph Hellwig  writes:


Hi all,

this series adds support for the IOCB_CMD_POLL operation to poll for the
readiness of file descriptors using the aio subsystem.  The API is based
on patches that existed in RHAS2.1 and RHEL3, which means it already is
supported by libaio.  To implement the poll support efficiently, new
methods to poll are introduced in struct file_operations:  get_poll_head
and poll_mask.  The first one returns a wait_queue_head to wait on
(lifetime is bound by the file), and the second does a non-blocking
check for the POLL* events.  This allows aio poll to work without
any additional context switches, unlike epoll.

To make the interface fully useful a new io_pgetevents system call is
added, which atomically saves and restores the signal mask over the
io_pgetevents system call.  It is the logical equivalent to pselect and
ppoll for io_pgetevents.

The corresponding libaio changes for io_pgetevents support and
documentation, as well as a test case will be posted in a separate
series.

The changes were sponsored by ScyllaDB, and improve performance
of the Seastar framework by up to 10%, while also removing the need
for a privileged SCHED_FIFO epoll listener thread.

The patches are on top of Al's __poll_t annotations, so I've also
prepared a git branch on top of those here:

 git://git.infradead.org/users/hch/vfs.git aio-poll.3

Gitweb:

 http://git.infradead.org/users/hch/vfs.git/shortlog/refs/heads/aio-poll.3

Libaio changes:

 https://pagure.io/libaio.git io-poll

Seastar changes (not updated for the new io_pgetevents ABI yet):

 https://github.com/avikivity/seastar/commits/aio

Changes since V2:
  - removed a double initialization
  - new vfs_get_poll_head helper
  - document that ->get_poll_head can return NULL
  - call ->poll_mask before sleeping
  - various ACKs
  - add conversion of random to ->poll_mask
  - add conversion of af_alg to ->poll_mask
  - lacking ->poll_mask support now returns -EINVAL for IOCB_CMD_POLL
  - reshuffled the series so that prep patches and everything not
requiring the new in-kernel poll API is in the beginning

Changes since V1:
  - handle the NULL ->poll case in vfs_poll
  - dropped the file argument to the ->poll_mask socket operation
  - replace the ->pre_poll socket operation with ->get_poll_head as
in the file operations

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majord...@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: a...@kvack.org







Re: Proposal: CAP_PAYLOAD to reduce Meltdown and Spectre mitigation costs

2018-01-07 Thread Avi Kivity



On 01/07/2018 04:36 PM, Alan Cox wrote:

I'm interested in participating in working on such a solution, given
that haproxy is severely impacted by "pti=on" and that for now we'll
have to run with "pti=off" on the whole system until a more suitable
solution is found.

I'm still trying to work out what cases there are for this. I can see the
cases for pti-off. I've got minecraft servers for example where there
isn't anyone running untrusted code on the box (*) and the only data of
value is owned by the minecraft processes. If someone gets to the point
pti matters then I already lost.


You don't want, say, a local out-of-process dns resolver compromised and
exploited, and then Meltdown used to read all of the machine's memory.




What I struggle to see is why I'd want to nominate specific processes for
this except in very special cases



Suppose you are running a database (I hope you'll agree that's not a 
special case). Then, "all of your data was stolen but the good news is 
that your ssh keys are safe" is not something that brightens your day. 
Your ssh keys are worth a lot less than your data.


In that case disabling PTI just for the database can be a good trade-off 
between security and performance. You lose almost nothing by disabling 
PTI for the database, yet you're still secure* from a remote exploit in 
any supporting processes, or from a malicious local user.



* as secure as you were with full PTI


  (like your packet generator). Even then
it would make me nervous, as the packet generator, if trusted like that, is
effectively CAP_SYS_RAWIO or close to it and can steal any ssh keys or
similar on that guest.

I still prefer cgroups because once you include the branch predictions it
suddenly becomes very interesting to be able to say 'this pile of stuff
trusts itself' and avoid user/user protection costs while keeping
user/kernel.


By "this pile of stuff" do you mean a group of mutually-trusting 
processes? I don't see how that can be implemented, because any 
cross-process switch has to cross the kernel boundary. In any case a 
cross-process switch will destroy the effectiveness of branch prediction 
history, so preserving it doesn't buy you much.




Alan
(*) I am sure that java programs can do sandbox breaking via spectre just
as js can. Bonus points to anyone however who can do spectre through java
from redstone 8)






Re: Proposal: CAP_PAYLOAD to reduce Meltdown and Spectre mitigation costs

2018-01-07 Thread Avi Kivity



On 01/07/2018 02:29 PM, Theodore Ts'o wrote:

On Sun, Jan 07, 2018 at 11:16:28AM +0200, Avi Kivity wrote:

I think capabilities will work just as well with cgroups. The container
manager will set CAP_PAYLOAD to payload containers; and if those run an init
system or a container manager themselves, they'll drop CAP_PAYLOAD for all
process/sub-containers but their payloads.

The reason why cgroups are better is Spectre can be used to steal
information from within the same privilege level --- e.g., you could
use Javascript to steal a user's Coindesk credentials or Lastpass
data, which is going to be *way* more lucrative than trying to mine
cryptocurrency on the sly in a user's browser.  :-)

As a result, you probably want Spectre mitigations to be enabled in a
root process --- which means capabilities aren't the right answer.




I don't see the connection. The browser wouldn't run with CAP_PAYLOAD set.

In a desktop system, only init retains CAP_PAYLOAD.

On a server that runs one application (and some supporting processes), 
only init and that one application have CAP_PAYLOAD (if the sysadmin 
makes it so).


On a containerized server that happens to run just one application, init 
will retain CAP_PAYLOAD, as well as the process in the container (if the 
sysadmin makes it so).


On a containerized server that happens to run just one application, 
which itself runs an init system, the two inits will retain CAP_PAYLOAD, 
as well as the application process (if the sysadmin makes it so).




Re: Proposal: CAP_PAYLOAD to reduce Meltdown and Spectre mitigation costs

2018-01-07 Thread Avi Kivity

On 01/06/2018 10:02 PM, Alan Cox wrote:

I propose to create a new capability, CAP_PAYLOAD, that allows the
system administrator to designate an application as the main workload in
that system. Other processes (like sshd or monitoring daemons) exist to
support it, and so it makes sense to protect the rest of the system from
their being compromised.

Much more general would be to do this with cgroups both for group-group
trust and group-kernel trust levels.



I think capabilities will work just as well with cgroups. The container 
manager will set CAP_PAYLOAD to payload containers; and if those run an 
init system or a container manager themselves, they'll drop CAP_PAYLOAD 
for all process/sub-containers but their payloads.





Re: Proposal: CAP_PAYLOAD to reduce Meltdown and Spectre mitigation costs

2018-01-07 Thread Avi Kivity



On 01/06/2018 10:24 PM, Willy Tarreau wrote:

Hi Avi,

On Sat, Jan 06, 2018 at 09:33:28PM +0200, Avi Kivity wrote:

Meltdown and Spectre mitigations focus on protecting the kernel from a
hostile userspace. However, it's not a given that the kernel is the most
important target in the system. It is common in server workloads that a
single userspace application contains the valuable data on a system, and if
it were hostile, the game would already be over, without the need to
compromise the kernel.

In these workloads, a single application performs most system calls, and so
it pays the cost of protection, without benefiting from it directly (since
it is the target, rather than the kernel).

Definitely :-)


I propose to create a new capability, CAP_PAYLOAD, that allows the system
administrator to designate an application as the main workload in that
system. Other processes (like sshd or monitoring daemons) exist to support
it, and so it makes sense to protect the rest of the system from their being
compromised.

Initially I was thinking about letting applications disable PTI using
prctl() when running under a certain capability (I initially thought
about CAP_SYS_ADMIN though I changed my mind). One advantage of
proceeding like this is that it would have to be explicitly implemented
in the application, which limits the risk of running by default.

I later thought that we could use CAP_RAWIO for this, given that such
processes already have access to the hardware anyway. We could even
imagine not switching the page tables on such a capability without
requiring prctl(), though it would mean that processes running as root
(as is often found on a number of servers) would automatically present
a risk for the system. But maybe CAP_RAWIO + prctl() could be a good
solution.


CAP_RAWIO is like CAP_PAYLOAD in that both allow you to read stuff you 
shouldn't have access to on a vulnerable CPU. But CAP_PAYLOAD won't give 
you that access on a non-vulnerable CPU, so it's safer.


The advantage of not requiring prctl() is that it will work on 
unmodified applications, requiring only sysadmin intervention (and it's 
the sysadmin's role to designate an application as payload, not the 
application's).




I'm interested in participating in working on such a solution, given
that haproxy is severely impacted by "pti=on" and that for now we'll
have to run with "pti=off" on the whole system until a more suitable
solution is found.

I'd rather not rush anything and let things calm down for a while to
avoid adding disturbance to the current situation. But I'm willing to
continue this discussion and even test patches.




Then you might want to test 
https://www.spinics.net/lists/kernel/msg2689101.html and its companion 
patchset https://www.spinics.net/lists/kernel/msg2689134.html, which as 
a side effect significantly reduce KPTI impact on C10K applications (and 
as their main effect improve their performance).




Proposal: CAP_PAYLOAD to reduce Meltdown and Spectre mitigation costs

2018-01-06 Thread Avi Kivity
Meltdown and Spectre mitigations focus on protecting the kernel from a 
hostile userspace. However, it's not a given that the kernel is the most 
important target in the system. It is common in server workloads that a 
single userspace application contains the valuable data on a system, and 
if it were hostile, the game would already be over, without the need to 
compromise the kernel.



In these workloads, a single application performs most system calls, and 
so it pays the cost of protection, without benefiting from it directly 
(since it is the target, rather than the kernel).



I propose to create a new capability, CAP_PAYLOAD, that allows the 
system administrator to designate an application as the main workload in 
that system. Other processes (like sshd or monitoring daemons) exist to 
support it, and so it makes sense to protect the rest of the system from 
their being compromised.



When the kernel switches to user mode of a CAP_PAYLOAD process, it 
doesn't switch page tables and instead leaves the kernel mapped into the 
address space (still with supervisor protection, of course). This 
reduces context switch cost, and will also reduce interrupt costs if the 
interrupt happens while that process executes. Since a CAP_PAYLOAD 
process is likely to consume the majority of CPU time, the costs 
associated with Meltdown mitigation are almost completely nullified.



CAP_PAYLOAD has potential to be abused; every software vendor will be 
absolutely certain that their application is the reason the universe 
(let alone that server) exists and they will turn it on, so init systems 
will have to work to make sure it doesn't get turned on without 
administrator opt-in. It's also not perfect: if there is a payload 
application compromise, then in addition to the application's data, ssh 
keys can be stolen too. But I think it's better than having to choose between 
significantly reduced performance and security. You get performance for 
your important application, and protection against the possibility that 
a remote exploit against a supporting process turns into a remote 
exploit against that important application.







Re: Detecting RWF_NOWAIT support

2017-12-17 Thread Avi Kivity



On 12/18/2017 05:28 AM, Goldwyn Rodrigues wrote:


On 12/16/2017 08:49 AM, Avi Kivity wrote:


On 12/14/2017 09:15 PM, Goldwyn Rodrigues wrote:

On 12/14/2017 11:38 AM, Avi Kivity wrote:

I'm looking to add support for RWF_NOWAIT within a linux-aio iocb.
Naturally, I need to detect at runtime whether the kernel supports
RWF_NOWAIT or not.


The only method I could find was to issue an I/O with RWF_NOWAIT set,
and look for errors. This is somewhat less than perfect:

   - from the error, I can't tell whether RWF_NOWAIT was the problem, or
something else. If I enable a number of new features, I have to run
through all combinations to figure out which ones are supported and
which are not.

Here are the return codes for RWF_NOWAIT
EINVAL - not supported (older kernel)
EOPNOTSUPP - not supported
EAGAIN - supported but could not complete because I/O will be delayed

Which of these are returned from io_submit() and which are returned in
the iocb?

These are returned in iocb.


Thanks.




0 - supported and I/O completed (success).


   - RWF_NOWAIT support is per-filesystem, so I can't just remember
not to
enable RWF_NOWAIT globally, I have to track it per file.

Yes, the support is per filesystem. So, the application must know if the
filesystem supports it, possibly by performing a small I/O.

So the application must know about filesystem mount points, and be
prepared to create a file and try to write it (in case the filesystem is
empty) or alter its behavior during runtime depending on the errors it
sees.

Well yes. Hopefully, the application knows what it is doing when it
performs RWF_NOWAIT.


This type of interface makes it very hard to consume new kernel 
facilities in a backward compatible way. The kernel should advertise 
what support it provides; for example this support could be advertised 
via statx(2).


For examples of facilities that advertise their capabilities, see 
membarrier(2) and KVM.





Re: Detecting RWF_NOWAIT support

2017-12-16 Thread Avi Kivity



On 12/16/2017 08:12 PM, vcap...@pengaru.com wrote:

On Sat, Dec 16, 2017 at 10:03:38AM -0800, vcap...@pengaru.com wrote:

On Sat, Dec 16, 2017 at 04:49:08PM +0200, Avi Kivity wrote:


On 12/14/2017 09:15 PM, Goldwyn Rodrigues wrote:

On 12/14/2017 11:38 AM, Avi Kivity wrote:

I'm looking to add support for RWF_NOWAIT within a linux-aio iocb.
Naturally, I need to detect at runtime whether the kernel supports
RWF_NOWAIT or not.


The only method I could find was to issue an I/O with RWF_NOWAIT set,
and look for errors. This is somewhat less than perfect:

   - from the error, I can't tell whether RWF_NOWAIT was the problem, or
something else. If I enable a number of new features, I have to run
through all combinations to figure out which ones are supported and
which are not.

Here are the return codes for RWF_NOWAIT
EINVAL - not supported (older kernel)
EOPNOTSUPP - not supported
EAGAIN - supported but could not complete because I/O will be delayed

Which of these are returned from io_submit() and which are returned in the
iocb?


0 - supported and I/O completed (success).


   - RWF_NOWAIT support is per-filesystem, so I can't just remember not to
enable RWF_NOWAIT globally, I have to track it per file.

Yes, the support is per filesystem. So, the application must know if the
filesystem supports it, possibly by performing a small I/O.

So the application must know about filesystem mount points, and be prepared
to create a file and try to write it (in case the filesystem is empty) or
alter its behavior during runtime depending on the errors it sees.

Can't the application simply add a "nowait" flag to its open file
descriptor encapsulation struct, then in the constructor perform a
zero-length RWF_NOWAIT write immediately after opening the fd to set the
flag?  Then all writes branch according to the flag.

According to write(2):

If  count is zero and fd refers to a regular file, then write()
may return a failure status if one of the errors below is
detected.  If no errors are detected,  or error detection is not
performed, 0 will be returned without causing any other effect.
If count is zero and fd refers to a file  other than a regular
file, the results are not specified.

So the zero-length RWF_NOWAIT write should return zero, unless it's not
supported.


Oh, I assumed this new flag applied to pwritev2() flags.  Disregard my
comment, I see the ambiguity causing your question Avi and do not know
the best approach.



Actually it's not a bad idea. I'm using AIO, not p{read,write}v2, but I 
can assume that the response will be the same and that a zero-length 
read will return immediately.




Re: Detecting RWF_NOWAIT support

2017-12-16 Thread Avi Kivity



On 12/14/2017 09:15 PM, Goldwyn Rodrigues wrote:


On 12/14/2017 11:38 AM, Avi Kivity wrote:

I'm looking to add support for RWF_NOWAIT within a linux-aio iocb.
Naturally, I need to detect at runtime whether the kernel supports
RWF_NOWAIT or not.


The only method I could find was to issue an I/O with RWF_NOWAIT set,
and look for errors. This is somewhat less than perfect:

  - from the error, I can't tell whether RWF_NOWAIT was the problem, or
something else. If I enable a number of new features, I have to run
through all combinations to figure out which ones are supported and
which are not.

Here are the return codes for RWF_NOWAIT
EINVAL - not supported (older kernel)
EOPNOTSUPP - not supported
EAGAIN - supported but could not complete because I/O will be delayed


Which of these are returned from io_submit() and which are returned in 
the iocb?



0 - supported and I/O completed (success).


  - RWF_NOWAIT support is per-filesystem, so I can't just remember not to
enable RWF_NOWAIT globally, I have to track it per file.

Yes, the support is per filesystem. So, the application must know if the
filesystem supports it, possibly by performing a small I/O.


So the application must know about filesystem mount points, and be 
prepared to create a file and try to write it (in case the filesystem is 
empty) or alter its behavior during runtime depending on the errors it sees.




Detecting RWF_NOWAIT support

2017-12-14 Thread Avi Kivity
I'm looking to add support for RWF_NOWAIT within a linux-aio iocb. 
Naturally, I need to detect at runtime whether the kernel supports 
RWF_NOWAIT or not.



The only method I could find was to issue an I/O with RWF_NOWAIT set, 
and look for errors. This is somewhat less than perfect:


 - from the error, I can't tell whether RWF_NOWAIT was the problem, or 
something else. If I enable a number of new features, I have to run 
through all combinations to figure out which ones are supported and 
which are not.


 - RWF_NOWAIT support is per-filesystem, so I can't just remember not 
to enable RWF_NOWAIT globally, I have to track it per file.



Did I miss another method of detecting RWF_NOWAIT?






Re: [RFC PATCH 0/2] x86: Fix missing core serialization on migration

2017-11-14 Thread Avi Kivity



On 11/14/2017 06:49 PM, Mathieu Desnoyers wrote:

- On Nov 14, 2017, at 11:08 AM, Peter Zijlstra pet...@infradead.org wrote:


On Tue, Nov 14, 2017 at 05:05:41PM +0100, Peter Zijlstra wrote:

On Tue, Nov 14, 2017 at 03:17:12PM +, Mathieu Desnoyers wrote:

I've tried to create a small single-threaded self-modifying loop in
user-space to trigger a trace cache or speculative execution quirk,
but I have not succeeded yet. I suspect that I would need to know
more about the internals of the processor architecture to create the
right stalls that would allow speculative execution to move further
ahead, and trigger an incoherent execution flow. Ideas on how to
trigger this would be welcome.

I thought the whole problem was per definition multi-threaded.

Single-threaded stuff can't get out of sync with itself; you'll always
observe your own stores.

And even if you could, you can always execute a local serializing
instruction like CPUID to force things.

What I'm trying to reproduce is something that breaks in single-threaded
case if I explicitly leave out the CPUID core serializing instruction
when doing code modification on upcoming code, in a loop.

AFAIU, Intel requires a core serializing instruction to be issued even
in single-threaded scenarios between code update and execution, to ensure
that speculative execution does not observe incoherent code. Now the
question we all have for Intel is: is this requirement too strong, or
required by reality ?



In single-threaded execution, a jump is enough.

"As processor microarchitectures become more complex and start to 
speculatively execute code ahead of the retirement point (as in P6 and 
more recent processor families), the rules regarding which code should 
execute, pre- or post-modification, become blurred. To write 
self-modifying code and ensure that it is compliant with current and 
future versions of the IA-32 architectures, use one of the following 
coding options:


(* OPTION 1 *)
Store modified code (as data) into code segment;
Jump to new code or an intermediate location;
Execute new code;"






Re: [RFC PATCH 0/2] x86: Fix missing core serialization on migration

2017-11-14 Thread Avi Kivity



On 11/14/2017 05:17 PM, Mathieu Desnoyers wrote:

- On Nov 14, 2017, at 9:53 AM, Avi Kivity a...@scylladb.com wrote:


On 11/13/2017 06:56 PM, Mathieu Desnoyers wrote:

- On Nov 10, 2017, at 4:57 PM, Mathieu Desnoyers
mathieu.desnoy...@efficios.com wrote:


- On Nov 10, 2017, at 4:36 PM, Linus Torvalds torva...@linux-foundation.org
wrote:


On Fri, Nov 10, 2017 at 1:12 PM, Mathieu Desnoyers
<mathieu.desnoy...@efficios.com> wrote:

x86 can return to user-space through sysexit and sysretq, which are not
core serializing. This breaks expectations from user-space about
sequential consistency from a single-threaded self-modifying program
point of view in specific migration patterns.

Feedback is welcome,

We should check with Intel. I would actually be surprised if the I$
can be out of sync with the D$ after a sysretq.  It would actually
break things like "read code from disk" too in theory.

That core serializing instruction is not that much about I$ vs D$
consistency, but rather about the processor speculatively executing code
ahead of its retirement point. Ref. Intel Architecture Software Developer's
Manual, Volume 3: System Programming.

7.1.3. "Handling Self- and Cross-Modifying Code":

"The act of a processor writing data into a currently executing code segment
with the intent of
executing that data as code is called self-modifying code. Intel Architecture
processors exhibit
model-specific behavior when executing self-modified code, depending upon how
far ahead of
the current execution pointer the code has been modified. As processor
architectures become
more complex and start to speculatively execute code ahead of the retirement
point (as in the P6
family processors), the rules regarding which code should execute, pre- or
post-modification,
become blurred. [...]"

AFAIU, this core serializing instruction seems to be needed for use-cases of
self-modifying code, but not for the initial load of a program from disk,
as the processor has no way to have speculatively executed any of its
instructions.

I figured out what you're pointing to: if exec() is executed by a previously
running thread, and there is no core serializing instruction between program
load and return to user-space, the kernel ends up acting like a JIT, indeed.

I think that's safe. The kernel has to execute a MOV CR3 instruction
before it can execute code loaded by exec, and that is a serializing
instruction. Loading and unloading shared libraries is made safe by the
IRET executed by page faults (loading) and TLB shootdown IPIs (unloading).

Very good points! Perhaps those guarantees should be documented somewhere ?


Directly modifying code in userspace is unsafe if there is some
non-coherent instruction cache. Instruction fetch and speculative
execution are non-coherent, but they're probably too short (in current
processors) to matter. Trace caches are probably large enough, but I
don't know whether they are coherent or not.

Android guys at Google have reproducers of context synchronization issues
on arm 64 in JIT scenarios. Based on the information I got, flushing the
instruction caches is not enough: they also need to issue a context
synchronizing instruction.

Perhaps the current Intel processors may have short enough speculative
execution and small enough trace caches, but relying on this without
a clear statement from Intel seems fragile.


A small trace cache is still vulnerable, the question is whether it is 
coherent or not.



I've tried to create a small single-threaded self-modifying loop in
user-space to trigger a trace cache or speculative execution quirk,
but I have not succeeded yet. I suspect that I would need to know
more about the internals of the processor architecture to create the
right stalls that would allow speculative execution to move further
ahead, and trigger an incoherent execution flow. Ideas on how to
trigger this would be welcome.





Intel processors resynchronize as soon as you jump (in single-threaded execution), 
so you need to update ahead of the current instruction pointer to see 
something. Not sure what quirk you're interested in seeing, executing 
the old code? That's not very exciting.







Re: [RFC PATCH 0/2] x86: Fix missing core serialization on migration

2017-11-14 Thread Avi Kivity



On 11/13/2017 06:56 PM, Mathieu Desnoyers wrote:

- On Nov 10, 2017, at 4:57 PM, Mathieu Desnoyers 
mathieu.desnoy...@efficios.com wrote:


- On Nov 10, 2017, at 4:36 PM, Linus Torvalds torva...@linux-foundation.org
wrote:


On Fri, Nov 10, 2017 at 1:12 PM, Mathieu Desnoyers
 wrote:

x86 can return to user-space through sysexit and sysretq, which are not
core serializing. This breaks expectations from user-space about
sequential consistency from a single-threaded self-modifying program
point of view in specific migration patterns.

Feedback is welcome,

We should check with Intel. I would actually be surprised if the I$
can be out of sync with the D$ after a sysretq.  It would actually
break things like "read code from disk" too in theory.

That core serializing instruction is not that much about I$ vs D$
consistency, but rather about the processor speculatively executing code
ahead of its retirement point. Ref. Intel Architecture Software Developer's
Manual, Volume 3: System Programming.

7.1.3. "Handling Self- and Cross-Modifying Code":

"The act of a processor writing data into a currently executing code segment
with the intent of
executing that data as code is called self-modifying code. Intel Architecture
processors exhibit
model-specific behavior when executing self-modified code, depending upon how
far ahead of
the current execution pointer the code has been modified. As processor
architectures become
more complex and start to speculatively execute code ahead of the retirement
point (as in the P6
family processors), the rules regarding which code should execute, pre- or
post-modification,
become blurred. [...]"

AFAIU, this core serializing instruction seems to be needed for use-cases of
self-modifying code, but not for the initial load of a program from disk,
as the processor has no way to have speculatively executed any of its
instructions.

I figured out what you're pointing to: if exec() is executed by a previously
running thread, and there is no core serializing instruction between program
load and return to user-space, the kernel ends up acting like a JIT, indeed.


I think that's safe. The kernel has to execute a MOV CR3 instruction 
before it can execute code loaded by exec, and that is a serializing 
instruction. Loading and unloading shared libraries is made safe by the 
IRET executed by page faults (loading) and TLB shootdown IPIs (unloading).


Directly modifying code in userspace is unsafe if there is some 
non-coherent instruction cache. Instruction fetch and speculative 
execution are non-coherent, but they're probably too short (in current 
processors) to matter. Trace caches are probably large enough, but I 
don't know whether they are coherent or not.





Therefore, we'd also need to invoke sync_core_before_usermode() after loading
the program.

Let's wait to hear back from hpa,

Thanks,

Mathieu



Hopefully hpa can tell us more about this,

Thanks,

Mathieu


--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com




Re: [PATCH tip/core/rcu 1/3] membarrier: Provide register expedited private command

2017-10-05 Thread Avi Kivity



On 10/05/2017 07:23 AM, Nicholas Piggin wrote:

On Wed,  4 Oct 2017 14:37:53 -0700
"Paul E. McKenney"  wrote:


From: Mathieu Desnoyers 

Provide a new command allowing processes to register their intent to use
the private expedited command.

This allows PowerPC to skip the full memory barrier in switch_mm(), and
only issue the barrier when scheduling into a task belonging to a
process that has registered to use expedited private.

Processes are now required to register before using
MEMBARRIER_CMD_PRIVATE_EXPEDITED, otherwise that command returns EPERM.

Changes since v1:
- Use test_ti_thread_flag(next, ...) instead of test_thread_flag() in
   powerpc membarrier_arch_sched_in(), given that we want to specifically
   check the next thread state.
- Add missing ARCH_HAS_MEMBARRIER_HOOKS in Kconfig.
- Use task_thread_info() to pass thread_info from task to
   *_ti_thread_flag().

Changes since v2:
- Move membarrier_arch_sched_in() call to finish_task_switch().
- Check for NULL t->mm in membarrier_arch_fork().
- Use membarrier_sched_in() in generic code, which invokes the
   arch-specific membarrier_arch_sched_in(). This fixes allnoconfig
   build on PowerPC.
- Move asm/membarrier.h include under CONFIG_MEMBARRIER, fixing
   allnoconfig build on PowerPC.
- Build and runtime tested on PowerPC.

Changes since v3:
- Simply rely on copy_mm() to copy the membarrier_private_expedited mm
   field on fork.
- powerpc: test thread flag instead of reading
   membarrier_private_expedited in membarrier_arch_fork().
- powerpc: skip memory barrier in membarrier_arch_sched_in() if coming
   from kernel thread, since mmdrop() implies a full barrier.
- Set membarrier_private_expedited to 1 only after arch registration
   code, thus eliminating a race where concurrent commands could succeed
   when they should fail if issued concurrently with process
   registration.
- Use READ_ONCE() for membarrier_private_expedited field access in
   membarrier_private_expedited. Matches WRITE_ONCE() performed in
   process registration.

Changes since v4:
- Move powerpc hook from sched_in() to switch_mm(), based on feedback
   from Nicholas Piggin.

For now, the powerpc approach is okay by me. I plan to test
others (e.g., taking runqueue locks) on larger systems, but that can
be sent as an incremental patch at a later time.

The main thing I would like is for people to review the userspace API.



As a future satisfied user of the expedited private membarrier syscall, 
I am happy with the change.





Re: [RFC PATCH v2] membarrier: expedited private command

2017-08-01 Thread Avi Kivity



On 08/01/2017 01:22 PM, Peter Zijlstra wrote:



If mm cpumask is used, I think it's okay. You can cause quite similar
kind of iteration over CPUs and lots of IPIs, tlb flushes, etc using
munmap/mprotect/etc, or context switch IPIs, etc. Are we reaching the
stage where we're controlling those kinds of ops in terms of impact
to the rest of the system?

So x86 has a tight mm_cpumask(), we only broadcast TLB invalidate IPIs
to those CPUs actually running threads of our process (or very
recently). So while there can be the sporadic stray IPI for a CPU that
recently ran a thread of the target process, it will not get another one
until it switches back into the process.

On machines that need manual TLB broadcasts and don't keep a tight mask,
yes you can interfere at will, but if they care they can fix by
tightening the mask.

In either case, the mm_cpumask() will be bounded by the set of CPUs the
threads are allowed to run on and will not interfere with the rest of
the system.

As to scheduler IPIs, those are limited to the CPUs the user is limited
to and are rate limited by the wakeup-latency of the tasks. After all,
all the time a task is runnable but not running, wakeups are no-ops.

Trouble is of course, that not everybody even sets a single bit in
mm_cpumask() and those that never clear bits will end up with a fairly
wide mask, still interfering with work that isn't hard partitioned.


I hate to propose a way to make this more complicated, but this could be 
fixed by a process first declaring its intent to use expedited 
process-wide membarrier; if it does, then every context switch updates a 
process-wide cpumask indicating which cpus are currently running threads 
of that process:


  if (prev->mm != next->mm) {
          if (prev->mm && prev->mm->running_cpumask)
                  cpumask_clear_cpu(cpu, prev->mm->running_cpumask);
          if (next->mm && next->mm->running_cpumask)
                  cpumask_set_cpu(cpu, next->mm->running_cpumask);
  }

now only processes that want expedited process-wide membarrier pay for 
it (in other than some predictable branches). You can even have threads 
opt-in, so unrelated threads that don't participate in the party don't 
cause those bits to be set.



Re: Udpated sys_membarrier() speedup patch, FYI

2017-07-31 Thread Avi Kivity

On 07/31/2017 11:37 AM, Peter Zijlstra wrote:

On Mon, Jul 31, 2017 at 09:03:09AM +0300, Avi Kivity wrote:

I remembered that current->mm does not change when switching to a kernel
task, but my Kernlish is very rusty, or maybe it has changed.

kernel threads do indeed preserve the mm of the old userspace task, but
we track that in ->active_mm. Their ->mm will be NULL.


Gah, I'm so rusty, if I were any rustier I'd be doing bitcoin.



Re: Udpated sys_membarrier() speedup patch, FYI

2017-07-31 Thread Avi Kivity



On 07/28/2017 12:02 AM, Mathieu Desnoyers wrote:

- On Jul 27, 2017, at 4:58 PM, Mathieu Desnoyers 
mathieu.desnoy...@efficios.com wrote:


- On Jul 27, 2017, at 4:37 PM, Paul E. McKenney paul...@linux.vnet.ibm.com
wrote:


On Thu, Jul 27, 2017 at 11:04:13PM +0300, Avi Kivity wrote:

[...]

+
+   this_cpu = raw_smp_processor_id();
+   for_each_online_cpu(cpu) {
+   struct task_struct *p;
+
+   if (cpu == this_cpu)
+   continue;
+   rcu_read_lock();
+   p = task_rcu_dereference(_rq(cpu)->curr);
+   if (p && p->mm == current->mm)
+   __cpumask_set_cpu(cpu, tmpmask);

This gets you some false positives, if the CPU idled then mm will
not have changed.

Good point!  The battery-powered embedded guys would probably prefer
we not needlessly IPI idle CPUs.  We cannot rely on RCU's dyntick-idle
state in nohz_full cases.  Not sure if is_idle_task() can be used
safely, given things like play_idle().

Would changing the check in this loop to:

if (p && !is_idle_task(p) && p->mm == current->mm) {

work for you ?

Avi, is there an optimization that allows current->mm to be non-null
when the idle task is scheduled that I am missing ?

I would have expected current->mm to be always NULL for idle
tasks.




I remembered that current->mm does not change when switching to a kernel 
task, but my Kernlish is very rusty, or maybe it has changed.



Re: Udpated sys_membarrier() speedup patch, FYI

2017-07-27 Thread Avi Kivity

On 07/27/2017 10:43 PM, Paul E. McKenney wrote:

On Thu, Jul 27, 2017 at 10:20:14PM +0300, Avi Kivity wrote:

On 07/27/2017 09:12 PM, Paul E. McKenney wrote:

Hello!

Please see below for a prototype sys_membarrier() speedup patch.
Please note that there is some controversy on this subject, so the final
version will probably be quite a bit different than this prototype.

But my main question is whether the throttling shown below is acceptable
for your use cases, namely only one expedited sys_membarrier() permitted
per scheduling-clock period (1 millisecond on many platforms), with any
excess being silently converted to non-expedited form.  The reason for
the throttling is concerns about DoS attacks based on user code with a
tight loop invoking this system call.

Thoughts?

Silent throttling would render it useless for me. -EAGAIN is a
little better, but I'd be forced to spin until either I get kicked
out of my loop, or it succeeds.

IPIing only running threads of my process would be perfect. In fact
I might even be able to make use of "membarrier these threads
please" to reduce IPIs, when I change the topology from fully
connected to something more sparse, on larger machines.

My previous implementations were a signal (but that's horrible on
large machines) and trylock + mprotect (but that doesn't work on
ARM).

OK, how about the following patch, which IPIs only the running
threads of the process doing the sys_membarrier()?


Works for me.



Thanx, Paul



From: Mathieu Desnoyers <mathieu.desnoy...@efficios.com>
To: Peter Zijlstra <pet...@infradead.org>
Cc: linux-kernel@vger.kernel.org, Mathieu Desnoyers
  <mathieu.desnoy...@efficios.com>,
  "Paul E . McKenney" <paul...@linux.vnet.ibm.com>, Boqun Feng 
<boqun.f...@gmail.com>
Subject: [RFC PATCH] membarrier: expedited private command
Date: Thu, 27 Jul 2017 14:59:43 -0400
Message-Id: <20170727185943.11570-1-mathieu.desnoy...@efficios.com>

Implement MEMBARRIER_CMD_PRIVATE_EXPEDITED with IPIs using cpumask built
from all runqueues for which current thread's mm is the same as our own.

Scheduler-wise, it requires that we add a memory barrier after context
switching between processes (which have different mm).

It would be interesting to benchmark the overhead of this added barrier
on the performance of context switching between processes. If the
preexisting overhead of switching between mm is high enough, the
overhead of adding this extra barrier may be insignificant.

[ Compile-tested only! ]

CC: Peter Zijlstra <pet...@infradead.org>
CC: Paul E. McKenney <paul...@linux.vnet.ibm.com>
CC: Boqun Feng <boqun.f...@gmail.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoy...@efficios.com>
---
  include/uapi/linux/membarrier.h |  8 +++--
  kernel/membarrier.c | 76 -
  kernel/sched/core.c | 21 
  3 files changed, 102 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/membarrier.h b/include/uapi/linux/membarrier.h
index e0b108bd2624..6a33c5852f6b 100644
--- a/include/uapi/linux/membarrier.h
+++ b/include/uapi/linux/membarrier.h
@@ -40,14 +40,18 @@
   *  (non-running threads are de facto in such a
   *  state). This covers threads from all processes
   *  running on the system. This command returns 0.
+ * TODO: documentation.
   *
   * Command to be passed to the membarrier system call. The commands need to
   * be a single bit each, except for MEMBARRIER_CMD_QUERY which is assigned to
   * the value 0.
   */
  enum membarrier_cmd {
-   MEMBARRIER_CMD_QUERY = 0,
-   MEMBARRIER_CMD_SHARED = (1 << 0),
+   MEMBARRIER_CMD_QUERY= 0,
+   MEMBARRIER_CMD_SHARED   = (1 << 0),
+   /* reserved for MEMBARRIER_CMD_SHARED_EXPEDITED (1 << 1) */
+   /* reserved for MEMBARRIER_CMD_PRIVATE (1 << 2) */
+   MEMBARRIER_CMD_PRIVATE_EXPEDITED= (1 << 3),
  };

  #endif /* _UAPI_LINUX_MEMBARRIER_H */
diff --git a/kernel/membarrier.c b/kernel/membarrier.c
index 9f9284f37f8d..8c6c0f96f617 100644
--- a/kernel/membarrier.c
+++ b/kernel/membarrier.c
@@ -19,10 +19,81 @@
  #include 

  /*
+ * XXX For cpu_rq(). Should we rather move
+ * membarrier_private_expedited() to sched/core.c or create
+ * sched/membarrier.c ?
+ */
+#include "sched/sched.h"
+
+/*
   * Bitmask made from a "or" of all commands within enum membarrier_cmd,
   * except MEMBARRIER_CMD_QUERY.
   */
-#define MEMBARRIER_CMD_BITMASK (MEMBARRIER_CMD_SHARED)
+#define MEMBARRIER_CMD_BITMASK \
+   (MEMBARRIER_CMD_SHARED | MEMBARRIER_CMD_PRIVATE_EXPEDITED)
+



rcu_read_unlock();
+   }
+}
+
+static void membarrier_private_expedit


Re: Udpated sys_membarrier() speedup patch, FYI

2017-07-27 Thread Avi Kivity

On 07/27/2017 09:12 PM, Paul E. McKenney wrote:

Hello!

Please see below for a prototype sys_membarrier() speedup patch.
Please note that there is some controversy on this subject, so the final
version will probably be quite a bit different than this prototype.

But my main question is whether the throttling shown below is acceptable
for your use cases, namely only one expedited sys_membarrier() permitted
per scheduling-clock period (1 millisecond on many platforms), with any
excess being silently converted to non-expedited form.  The reason for
the throttling is concerns about DoS attacks based on user code with a
tight loop invoking this system call.

Thoughts?


Silent throttling would render it useless for me. -EAGAIN is a little 
better, but I'd be forced to spin until either I get kicked out of my 
loop, or it succeeds.


IPIing only running threads of my process would be perfect. In fact I 
might even be able to make use of "membarrier these threads please" to 
reduce IPIs, when I change the topology from fully connected to 
something more sparse, on larger machines.


My previous implementations were a signal (but that's horrible on large 
machines) and trylock + mprotect (but that doesn't work on ARM).




Thanx, Paul



commit 4cd5253094b6d7f9501e21e13aa4e2e78e8a70cd
Author: Paul E. McKenney <paul...@linux.vnet.ibm.com>
Date:   Tue Jul 18 13:53:32 2017 -0700

 sys_membarrier: Add expedited option
 
 The sys_membarrier() system call has proven too slow for some use cases,

 which has prompted users to instead rely on TLB shootdown.  Although TLB
 shootdown is much faster, it has the slight disadvantage of not working
 at all on arm and arm64 and also of being vulnerable to reasonable
 optimizations that might skip some IPIs.  However, the Linux kernel
 does not currently provide a reasonable alternative, so it is hard to
 criticize these users from doing what works for them on a given piece
 of hardware at a given time.
 
 This commit therefore adds an expedited option to the sys_membarrier()

 system call, thus providing a faster mechanism that is portable and
 is not subject to death by optimization.  Note that if more than one
 MEMBARRIER_CMD_SHARED_EXPEDITED sys_membarrier() call happens within
 the same jiffy, all but the first will use synchronize_sched() instead
 of synchronize_sched_expedited().
 
 Signed-off-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>

 [ paulmck: Fix code style issue pointed out by Boqun Feng. ]
 Tested-by: Avi Kivity <a...@scylladb.com>
 Cc: Maged Michael <maged.mich...@gmail.com>
 Cc: Andrew Hunter <a...@google.com>
 Cc: Geoffrey Romer <gro...@google.com>

diff --git a/include/uapi/linux/membarrier.h b/include/uapi/linux/membarrier.h
index e0b108bd2624..5720386d0904 100644
--- a/include/uapi/linux/membarrier.h
+++ b/include/uapi/linux/membarrier.h
@@ -40,6 +40,16 @@
   *  (non-running threads are de facto in such a
   *  state). This covers threads from all processes
   *  running on the system. This command returns 0.
+ * @MEMBARRIER_CMD_SHARED_EXPEDITED:  Execute a memory barrier on all
+ * running threads, but in an expedited fashion.
+ *  Upon return from system call, the caller thread
+ *  is ensured that all running threads have passed
+ *  through a state where all memory accesses to
+ *  user-space addresses match program order between
+ *  entry to and return from the system call
+ *  (non-running threads are de facto in such a
+ *  state). This covers threads from all processes
+ *  running on the system. This command returns 0.
   *
   * Command to be passed to the membarrier system call. The commands need to
   * be a single bit each, except for MEMBARRIER_CMD_QUERY which is assigned to
@@ -48,6 +58,7 @@
  enum membarrier_cmd {
MEMBARRIER_CMD_QUERY = 0,
MEMBARRIER_CMD_SHARED = (1 << 0),
+   MEMBARRIER_CMD_SHARED_EXPEDITED = (1 << 1),
  };
  
  #endif /* _UAPI_LINUX_MEMBARRIER_H */

diff --git a/kernel/membarrier.c b/kernel/membarrier.c
index 9f9284f37f8d..587e3bbfae7e 100644
--- a/kernel/membarrier.c
+++ b/kernel/membarrier.c
@@ -22,7 +22,8 @@
   * Bitmask made from a "or" of all commands within enum membarrier_cmd,
   * except MEMBARRIER_CMD_QUERY.
   */
-#define MEMBARRIER_CMD_BITMASK (MEMBARRIER_CMD_SHARED)
+#define MEMBARRIER_CMD_BITMASK (MEMBARRIER_CMD_SHARED |\
+MEMBARRIER_CMD_SHARED_EXPEDITED)
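For userspace, the prototype implies a query-then-fall-back pattern: ask the kernel which commands it supports, then prefer the expedited variant when advertised. A sketch follows; the raw-syscall wrapper and the MB_-prefixed names are mine, and the expedited bit (1 << 1) is only this prototype's proposed value, not a released ABI.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_membarrier
#define __NR_membarrier 324     /* x86_64; assumption for old headers */
#endif

/* Command values from the UAPI header in the patch above. The
 * EXPEDITED bit (1 << 1) is this prototype's value, not a stable ABI. */
#define MB_CMD_QUERY            0
#define MB_CMD_SHARED           (1 << 0)
#define MB_CMD_SHARED_EXPEDITED (1 << 1)

/* Returns the bitmask of supported commands, or -1 if the syscall
 * itself is unavailable. */
static inline int membarrier_query(void)
{
    return (int)syscall(__NR_membarrier, MB_CMD_QUERY, 0);
}

/* Issue a system-wide barrier, preferring the expedited variant when
 * the kernel advertises it and falling back to the slow path. */
static inline int membarrier_shared(void)
{
    int mask = membarrier_query();
    if (mask < 0)
        return mask;            /* membarrier() unsupported */
    int cmd = (mask & MB_CMD_SHARED_EXPEDITED) ? MB_CMD_SHARED_EXPEDITED
                                               : MB_CMD_SHARED;
    return (int)syscall(__NR_membarrier, cmd, 0);
}
```

The query step is what makes the throttling debate visible to applications: a caller that cares about latency can detect the expedited command and only then rely on it.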
  
  

Re: MAP_POPULATE vs. MADV_HUGEPAGES

2017-03-16 Thread Avi Kivity



On 03/16/2017 04:48 PM, Michal Hocko wrote:

On Thu 16-03-17 15:26:54, Avi Kivity wrote:


On 03/16/2017 02:34 PM, Michal Hocko wrote:

On Wed 15-03-17 18:50:32, Avi Kivity wrote:

A user is trying to allocate 1TB of anonymous memory in parallel on 48 cores
(4 NUMA nodes).  The kernel ends up spinning in isolate_freepages_block().

Which kernel version is that?

A good question; it was 3.10.something-el.something.  The user mentioned
above updated to 4.4, and the problem was gone, so it looks like it is a Red
Hat specific problem.  I would really like the 3.10.something kernel to
handle this workload well, but I understand that's not this list's concern.


What is the THP defrag mode
(/sys/kernel/mm/transparent_hugepage/defrag)?

The default (always).

the default has changed since then because the THP fault latencies were
just too large. Currently we only allow madvised VMAs to go stall and
even then we try hard to back off sooner rather than later. See
444eb2a449ef ("mm: thp: set THP defrag by default to madvise and add a
stall-free defrag option") merged in 4.4


I see, thanks.  So the 4.4 behavior is better mostly due to not trying 
so hard.


  

I thought to help it along by using MAP_POPULATE, but then my MADV_HUGEPAGE
won't be seen until after mmap() completes, with pages already populated.
Are MAP_POPULATE and MADV_HUGEPAGE mutually exclusive?

Why do you need MADV_HUGEPAGE?

So that I get huge pages even if transparent_hugepage/enabled=madvise.  I'm
allocating almost all of the memory of that machine to be used as a giant
cache, so I want it backed by hugepages.

Is there any strong reason to not use hugetlb then? You probably want
that memory reclaimable, right?


Did you mean hugetlbfs?  It's a pain to configure, and often requires a 
reboot.


We support it via an option, but we prefer the user's first experience 
with the application not to be "configure this kernel parameter and reboot".


We don't particularly need that memory to be reclaimable (and in fact we 
have an option to mlock() it; if it gets swapped, application 
performance tanks).





Is my only option to serialize those memory allocations, and fault in those
pages manually?  Or perhaps use mlock()?

I am still not 100% sure I see what you are trying to achieve, though.
So you do not want all those processes to contend inside the compaction
while still allocate as many huge pages as possible?

Since the process starts with all of that memory free, there should not be
any compaction going on (or perhaps very minimal eviction/movement of a few
pages here and there).  And since it's fixed in later kernels, it looks like
the contention was not really mandated by the workload, just an artifact of
the implementation.

It is possible. A lot has changed since 3.10 times.


Like the default behavior :).



Re: MAP_POPULATE vs. MADV_HUGEPAGES

2017-03-16 Thread Avi Kivity



On 03/16/2017 02:34 PM, Michal Hocko wrote:

On Wed 15-03-17 18:50:32, Avi Kivity wrote:

A user is trying to allocate 1TB of anonymous memory in parallel on 48 cores
(4 NUMA nodes).  The kernel ends up spinning in isolate_freepages_block().

Which kernel version is that?


A good question; it was 3.10.something-el.something.  The user mentioned 
above updated to 4.4, and the problem was gone, so it looks like it is a 
Red Hat specific problem.  I would really like the 3.10.something kernel 
to handle this workload well, but I understand that's not this list's 
concern.



What is the THP defrag mode
(/sys/kernel/mm/transparent_hugepage/defrag)?


The default (always).

  

I thought to help it along by using MAP_POPULATE, but then my MADV_HUGEPAGE
won't be seen until after mmap() completes, with pages already populated.
Are MAP_POPULATE and MADV_HUGEPAGE mutually exclusive?

Why do you need MADV_HUGEPAGE?


So that I get huge pages even if transparent_hugepage/enabled=madvise.  
I'm allocating almost all of the memory of that machine to be used as a 
giant cache, so I want it backed by hugepages.


  

Is my only option to serialize those memory allocations, and fault in those
pages manually?  Or perhaps use mlock()?

I am still not 100% sure I see what you are trying to achieve, though.
So you do not want all those processes to contend inside the compaction
while still allocate as many huge pages as possible?


Since the process starts with all of that memory free, there should not 
be any compaction going on (or perhaps very minimal eviction/movement of 
a few pages here and there).  And since it's fixed in later kernels, it 
looks like the contention was not really mandated by the workload, just 
an artifact of the implementation.


To explain the workload again, the process starts, clones as many 
threads as there are logical processors, and each of those threads 
mmap()s (and mbind()s) a chunk of memory and then proceeds to touch it.





MAP_POPULATE vs. MADV_HUGEPAGES

2017-03-15 Thread Avi Kivity
A user is trying to allocate 1TB of anonymous memory in parallel on 48 
cores (4 NUMA nodes).  The kernel ends up spinning in 
isolate_freepages_block().



I thought to help it along by using MAP_POPULATE, but then my 
MADV_HUGEPAGE won't be seen until after mmap() completes, with pages 
already populated.  Are MAP_POPULATE and MADV_HUGEPAGE mutually exclusive?



Is my only option to serialize those memory allocations, and fault in 
those pages manually?  Or perhaps use mlock()?




Re: [PATCH] vfio: Include No-IOMMU mode

2015-11-16 Thread Avi Kivity

On 11/16/2015 07:06 PM, Alex Williamson wrote:

On Wed, 2015-10-28 at 15:21 -0600, Alex Williamson wrote:

There is really no way to safely give a user full access to a DMA
capable device without an IOMMU to protect the host system.  There is
also no way to provide DMA translation, for use cases such as device
assignment to virtual machines.  However, there are still those users
that want userspace drivers even under those conditions.  The UIO
driver exists for this use case, but does not provide the degree of
device access and programming that VFIO has.  In an effort to avoid
code duplication, this introduces a No-IOMMU mode for VFIO.

This mode requires building VFIO with CONFIG_VFIO_NOIOMMU and enabling
the "enable_unsafe_noiommu_mode" option on the vfio driver.  This
should make it very clear that this mode is not safe.  Additionally,
CAP_SYS_RAWIO privileges are necessary to work with groups and
containers using this mode.  Groups making use of this support are
named /dev/vfio/noiommu-$GROUP and can only make use of the special
VFIO_NOIOMMU_IOMMU for the container.  Use of this mode, specifically
binding a device without a native IOMMU group to a VFIO bus driver
will taint the kernel and should therefore not be considered
supported.  This patch includes no-iommu support for the vfio-pci bus
driver only.

Signed-off-by: Alex Williamson 
---

This is pretty well the same as RFCv2, I've changed the pr_warn to a
dev_warn and added another, printing the pid and comm of the task when
it actually opens the device.  If Stephen can port the driver code
over and prove that this actually works sometime next week, and there
aren't any objections to this code, I'll include it in a pull request
for the next merge window.  MST, I dropped your ack due to the
changes, but I'll be happy to add it back if you like.  Thanks,

Alex

  drivers/vfio/Kconfig|   15 +++
  drivers/vfio/pci/vfio_pci.c |8 +-
  drivers/vfio/vfio.c |  186 ++-
  include/linux/vfio.h|3 +
  include/uapi/linux/vfio.h   |7 ++
  5 files changed, 209 insertions(+), 10 deletions(-)

FYI, this is now in v4.4-rc1 (the slightly modified v2 version).  I want
to give fair warning though that while we seem to agree on this idea, it
hasn't been proven with a userspace driver port.  I've opted to include
it in this merge window rather than delaying it until v4.5, but I really
need to see a user for this before the end of the v4.4 cycle or I think
we'll need to revert and revisit for v4.5 anyway.  I don't really have
any interest in adding and maintaining code that has no users.  Please
keep me informed of progress with a dpdk port.  Thanks,




Thanks Alex.  Copying the dpdk mailing list, where the users live.

dpdk-ers: vfio-noiommu is a replacement for uio_pci_generic and 
uio_igb.  It supports MSI-X and so can be used on SR/IOV VF devices.  
The intent is that you can use dpdk without an external module, using 
vfio, whether you are on bare metal with an iommu, bare metal without an 
iommu, or virtualized.  However, dpdk needs modification to support this.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 2/2] vfio: Include no-iommu mode

2015-10-12 Thread Avi Kivity



On 10/12/2015 07:23 PM, Alex Williamson wrote:

Also, although you think the long option will set the bar high
enough it probably will not satisfy anyone. It is annoying enough, that
I would just carry a patch to remove the silly requirement.
And the people who believe
all user mode DMA is evil won't be satisfied either.

I find that many users blindly follow howtos and only sometimes do they
question the options if they sound scary enough.  So yeah, I would
intend to make the option upstream sound scary enough for people to
think twice about using it and maybe even read the description.  That
still doesn't prevent pasting it into modprobe.d and forgetting about
it.


I think we need to allow for packages to work with this.  I.e. drop 
files into config directories rather than require editing of config 
files.  I think this is mostly workable via modprobe.d.


(for our own use case, we don't require the extreme performance of many 
L2/L3 dpdk apps so we'll work with regular iommued vfio for bare-metal 
installations and ship prebuilt virtual machine images for clouds, so it 
doesn't matter that much to me).





Re: [RFC PATCH 2/2] vfio: Include no-iommu mode

2015-10-11 Thread Avi Kivity



On 10/11/2015 11:57 AM, Michael S. Tsirkin wrote:

On Sun, Oct 11, 2015 at 11:12:14AM +0300, Avi Kivity wrote:

   Mixing no-iommu and secure VFIO is
also unsupported, as are any VFIO IOMMU backends other than the
vfio-noiommu backend.  Furthermore, unsafe group files are relocated
to /dev/vfio-noiommu/.  Upon successful loading in this mode, the
kernel is tainted due to the dummy IOMMU put in place.  Unloading of
the module in this mode is also unsupported and will BUG due to the
lack of support for unregistering an IOMMU for a bus type.

I did not see an API for detecting whether memory translation is provided or
not.  We can have the caller guess this by looking at the device name, or by
requiring the user to specify this, but I think it's cleaner to provide
programmatic access to this attribute.

It seems that caller can just check for VFIO_NOIOMMU_IOMMU.

Isn't this why it's there?


That's just means the capability is there, not that it's active.

But since you must pass the same value to open(), you already know that 
you're using noiommu.



VFIO_IOMMU_MAP_DMA, VFIO_IOMMU_ENABLE and VFIO_IOMMU_DISABLE
will probably also fail ...



Don't you have to call MAP_DMA to pin the memory?
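The detection approach agreed above — checking the container for the VFIO_NOIOMMU_IOMMU extension — can be sketched as a small probe. The ioctl numbers are copied from include/uapi/linux/vfio.h; the function name is mine.

```c
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Values from include/uapi/linux/vfio.h; VFIO_NOIOMMU_IOMMU is the
 * extension ID introduced by this series. */
#define VFIO_TYPE            (';')
#define VFIO_BASE            100
#define VFIO_CHECK_EXTENSION _IO(VFIO_TYPE, VFIO_BASE + 1)
#define VFIO_NOIOMMU_IOMMU   8

/* Returns 1 if the container supports the no-iommu backend, 0 if not,
 * -1 if /dev/vfio/vfio cannot be opened (vfio module not loaded). */
static int vfio_noiommu_available(void)
{
    int fd = open("/dev/vfio/vfio", O_RDWR);
    if (fd < 0)
        return -1;
    int r = (int)ioctl(fd, VFIO_CHECK_EXTENSION,
                       (unsigned long)VFIO_NOIOMMU_IOMMU);
    close(fd);
    return r > 0 ? 1 : 0;
}
```

As noted in the exchange, an application that must pass VFIO_NOIOMMU_IOMMU to set up the container already knows it is running without translation, so this probe mainly helps generic setup code choose a backend.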



Re: [RFC PATCH 2/2] vfio: Include no-iommu mode

2015-10-11 Thread Avi Kivity



On 10/09/2015 09:41 PM, Alex Williamson wrote:

There is really no way to safely give a user full access to a PCI device
without an IOMMU to protect the host from errant DMA.  There is also
no way to provide DMA translation, for use cases such as device
assignment to virtual machines.  However, there are still those users
that want userspace drivers under those conditions.  The UIO driver
exists for this use case, but does not provide the degree of device
access and programming that VFIO has.  In an effort to avoid code
duplication, this introduces a No-IOMMU mode for VFIO.

This mode requires enabling CONFIG_VFIO_NOIOMMU and loading the vfio
module with the option "enable_unsafe_pci_noiommu_mode".  This should
make it very clear that this mode is not safe.  In this mode, there is
no support for unprivileged users, CAP_SYS_ADMIN is required for
access to the necessary dev files.


CAP_SYS_RAWIO seems a better match (in particular, it allows access to 
/dev/mem, which is the same thing).



   Mixing no-iommu and secure VFIO is
also unsupported, as are any VFIO IOMMU backends other than the
vfio-noiommu backend.  Furthermore, unsafe group files are relocated
to /dev/vfio-noiommu/.  Upon successful loading in this mode, the
kernel is tainted due to the dummy IOMMU put in place.  Unloading of
the module in this mode is also unsupported and will BUG due to the
lack of support for unregistering an IOMMU for a bus type.


I did not see an API for detecting whether memory translation is 
provided or not.  We can have the caller guess this by looking at the 
device name, or by requiring the user to specify this, but I think it's 
cleaner to provide programmatic access to this attribute.




Re: [RFC PATCH 2/2] vfio: Include no-iommu mode

2015-10-11 Thread Avi Kivity



On 10/11/2015 11:57 AM, Michael S. Tsirkin wrote:

On Sun, Oct 11, 2015 at 11:12:14AM +0300, Avi Kivity wrote:

   Mixing no-iommu and secure VFIO is
also unsupported, as are any VFIO IOMMU backends other than the
vfio-noiommu backend.  Furthermore, unsafe group files are relocated
to /dev/vfio-noiommu/.  Upon successful loading in this mode, the
kernel is tainted due to the dummy IOMMU put in place.  Unloading of
the module in this mode is also unsupported and will BUG due to the
lack of support for unregistering an IOMMU for a bus type.

I did not see an API for detecting whether memory translation is provided or
not.  We can have the caller guess this by looking at the device name, or by
requiring the user to specify this, but I think it's cleaner to provide
programmatic access to this attribute.

It seems that the caller can just check for VFIO_NOIOMMU_IOMMU.

Isn't this why it's there?


That just means the capability is there, not that it's active.

But since you must pass the same value to open(), you already know that 
you're using noiommu.



VFIO_IOMMU_MAP_DMA, VFIO_IOMMU_ENABLE and VFIO_IOMMU_DISABLE
will probably also fail ...



Don't you have to call MAP_DMA to pin the memory?
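The VFIO_CHECK_EXTENSION probe discussed above can be sketched as follows. The ioctl numbers are hand-computed from the kernel's _IO(';', 100 + nr) scheme in linux/vfio.h, and VFIO_NOIOMMU_IOMMU = 8 is the extension id the no-iommu backend ended up with; treat both as assumptions to check against your headers:

```python
import fcntl
import os

VFIO_TYPE = ord(';')
VFIO_BASE = 100

def _vfio_io(nr):
    # Equivalent of the kernel's _IO(VFIO_TYPE, VFIO_BASE + nr):
    # direction and size fields are zero for argument-less ioctls.
    return (VFIO_TYPE << 8) | (VFIO_BASE + nr)

VFIO_GET_API_VERSION = _vfio_io(0)
VFIO_CHECK_EXTENSION = _vfio_io(1)
VFIO_NOIOMMU_IOMMU = 8          # extension id of the no-iommu backend

def noiommu_available():
    """True/False if vfio answers the probe, None if vfio is absent."""
    try:
        fd = os.open("/dev/vfio/vfio", os.O_RDWR)
    except OSError:
        return None
    try:
        return fcntl.ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_NOIOMMU_IOMMU) == 1
    finally:
        os.close(fd)

print(noiommu_available())
```

As the thread notes, this only says the capability exists in the kernel; whether a given group is actually running no-iommu is known from the IOMMU type passed to VFIO_SET_IOMMU.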



Re: [PATCH v3 2/3] uio_pci_generic: add MSI/MSI-X support

2015-10-08 Thread Avi Kivity

On 10/08/2015 01:26 PM, Michael S. Tsirkin wrote:

On Thu, Oct 08, 2015 at 12:19:20PM +0300, Avi Kivity wrote:


On 10/08/2015 11:32 AM, Michael S. Tsirkin wrote:

On Thu, Oct 08, 2015 at 08:33:45AM +0300, Avi Kivity wrote:

On 08/10/15 00:05, Michael S. Tsirkin wrote:

On Wed, Oct 07, 2015 at 07:39:16PM +0300, Avi Kivity wrote:

That's what I thought as well, but apparently adding msix support to the
already insecure uio drivers is even worse.

I'm glad you finally agree that what these drivers are doing is insecure.

And basically kernel cares about security, no one wants to maintain insecure 
stuff.

So you guys should think harder whether this code makes any sense upstream.

You simply ignore everything I write, cherry-picking the word "insecure" as
if it makes your point.  That is very frustrating.

And I'm sorry about the frustration.  I didn't intend to twist your
words. It's just that I had to spend literally hours trying to explain
that security matters in kernel, and all I was getting back was a
summary "there's no security issue because there are other ways to
corrupt memory".

The word security has several meanings.  The primary meaning is "defense
against a malicious attacker".  In that sense, there is no added value at
all, because the attacker is already root, and can already access all of
kernel and user memory.  Even if the attacker is not root, and just has
access to a non-iommu-protected device, they can still DMA to and from any
memory they like.

This sense of the word however is irrelevant for this conversation; the user
already gave up on it when they chose to use uio_pci_generic (either because
they have no iommu, or because they need the extra performance).

Do we agree that security, in the sense of defense against a malicious
attacker, is irrelevant for this conversation?

No. uio_pci_generic currently can be used in a secure way in
a sense that it's protected against a malicious attacker,
assuming you bind it to a device that does not do DMA.


The context of the conversation is dpdk, which only supports DMA.

Do we agree that security, in the sense of defense against a malicious 
attacker, is irrelevant for this conversation, taking this under 
consideration?






A secondary meaning is protection against inadvertent bugs.  Yes, a faulty
memory write that happens to land in the msix page, can cause a random
memory word to be overwritten.  But so can a faulty memory write into the
rings, or the data structures that support virtual->physical translation,
the data structures that describe the packets before translation, the memory
allocator or pool.  The patch extends the vulnerable surface, but by a
negligible amount.


So I was glad when it looked like there's finally an agreement that yes,
there's value in validating userspace input and yes, it's insecure
not to do this.




It is good practice to defend against root oopsing the kernel, but in some
cases it cannot be achieved.

I originally included ways to fix issues that I pointed out, ranging
from harder to implement with more overhead but more secure to easier to
implement with less overhead but less secure.  There didn't seem to be
an understanding that the issues are there at all, so I stopped doing
that - seemed like a waste of time.

For example, will it kill your performance to reset devices cleanly, on
open and close,

I don't recall this being mentioned at all.

http://mid.gmane.org/20151006005527-mutt-send-email-...@redhat.com


Down at the moment for me.


But really, this is just off the top of my head.
These are all issues VFIO developers encountered
and fixed over the years. Go into that code, read it,
and you will discover the issues and the solutions.


vfio is solving a different problem, the problem of security against a 
malicious attacker, one that I'm hoping to agree here that we aren't 
attempting to solve.


People have been happily using uio_pci_generic despite all those 
issues.  All they were missing was msix support.  You can't use that to 
force them to overhaul that driver, or to add a new subsystem to vfio.





  It seems completely unrelated
to a patch adding msix support to uio_pci_generic.

It isn't unrelated. It's because with MSIX patch you are enabling bus
mastering in kernel.  So if you start device in a bad state it will
corrupt kernel memory.


You are right, this patch can regress secure users.

I'd be surprised if there are msix-capable pci devices that do not rely 
on DMA, though.





  protect them from writes into MSI config, BAR registers
and related capabilities etc etc?

Obviously the userspace driver has to write to the BAR area.

If you're talking about the BAR setup registers, yes there is some (tiny)
value in that, but how is it related to this patch?

If you don't, moving BARs will move the MSI-X region and
protecting it won't help.


Won't it just become invisible if you do?

If userspace starts playing with BARs, you lost already, whe

Re: [PATCH v3 2/3] uio_pci_generic: add MSI/MSI-X support

2015-10-08 Thread Avi Kivity



On 10/08/2015 12:16 PM, Michael S. Tsirkin wrote:

On Thu, Oct 08, 2015 at 11:46:30AM +0300, Avi Kivity wrote:


On 10/08/2015 10:32 AM, Michael S. Tsirkin wrote:

On Thu, Oct 08, 2015 at 08:33:45AM +0300, Avi Kivity wrote:

It is good practice to defend against root oopsing the kernel, but in some
cases it cannot be achieved.

Absolutely. That's one of the issues with these patches. They don't even
try where it's absolutely possible.


Are you referring to blocking the maps of the msix BAR areas?

For example. There are more. I listed some of the issues on the mailing
list, and I might have missed some.  VFIO has code to address all this,
people should share code to avoid duplication, or at least read it
to understand the issues.


All but one of those are unrelated to the patch that adds msix support.




I think there is value in that.  The value is small because a
corruption is more likely in the dynamic memory responsible for tens
of millions of DMA operations per second, rather than a static 4K
area, but it exists.

There are other bugs which will hurt e.g. each time application does not
exit gracefully.


uio_pci_generic disables DMA when the device is removed, so we're safe 
here, at least if files are released before the address space.




But well, heh :) That's precisely my feeling about the whole "running
userspace drivers without an IOMMU" project. The value is small
since modern hardware has fast IOMMUs, but it exists.



For users that don't have iommus at all (usually because it is taken by 
the hypervisor), it has great value.


I can't comment on iommu overhead; for my use case it is likely 
negligible and we will use an iommu when available; but apparently it 
matters for others.



Re: [PATCH v3 2/3] uio_pci_generic: add MSI/MSI-X support

2015-10-08 Thread Avi Kivity



On 10/08/2015 11:32 AM, Michael S. Tsirkin wrote:

On Thu, Oct 08, 2015 at 08:33:45AM +0300, Avi Kivity wrote:

On 08/10/15 00:05, Michael S. Tsirkin wrote:

On Wed, Oct 07, 2015 at 07:39:16PM +0300, Avi Kivity wrote:

That's what I thought as well, but apparently adding msix support to the
already insecure uio drivers is even worse.

I'm glad you finally agree that what these drivers are doing is insecure.

And basically kernel cares about security, no one wants to maintain insecure 
stuff.

So you guys should think harder whether this code makes any sense upstream.

You simply ignore everything I write, cherry-picking the word "insecure" as
if it makes your point.  That is very frustrating.

And I'm sorry about the frustration.  I didn't intend to twist your
words. It's just that I had to spend literally hours trying to explain
that security matters in kernel, and all I was getting back was a
summary "there's no security issue because there are other ways to
corrupt memory".


The word security has several meanings.  The primary meaning is "defense 
against a malicious attacker".  In that sense, there is no added value 
at all, because the attacker is already root, and can already access all 
of kernel and user memory.  Even if the attacker is not root, and just 
has access to a non-iommu-protected device, they can still DMA to and 
from any memory they like.


This sense of the word however is irrelevant for this conversation; the 
user already gave up on it when they chose to use uio_pci_generic 
(either because they have no iommu, or because they need the extra 
performance).


Do we agree that security, in the sense of defense against a malicious 
attacker, is irrelevant for this conversation?


A secondary meaning is protection against inadvertent bugs.  Yes, a 
faulty memory write that happens to land in the msix page, can cause a 
random memory word to be overwritten.  But so can a faulty memory write 
into the rings, or the data structures that support virtual->physical 
translation, the data structures that describe the packets before 
translation, the memory allocator or pool.  The patch extends the 
vulnerable surface, but by a negligible amount.




So I was glad when it looked like there's finally an agreement that yes,
there's value in validating userspace input and yes, it's insecure
not to do this.







It is good practice to defend against root oopsing the kernel, but in some
cases it cannot be achieved.

I originally included ways to fix issues that I pointed out, ranging
from harder to implement with more overhead but more secure to easier to
implement with less overhead but less secure.  There didn't seem to be
an understanding that the issues are there at all, so I stopped doing
that - seemed like a waste of time.

For example, will it kill your performance to reset devices cleanly, on
open and close,


I don't recall this being mentioned at all.  It seems completely 
unrelated to a patch adding msix support to uio_pci_generic.



  protect them from writes into MSI config, BAR registers
and related capabilities etc etc?


Obviously the userspace driver has to write to the BAR area.

If you're talking about the BAR setup registers, yes there is some 
(tiny) value in that, but how is it related to this patch?


Protecting the MSI area in the BARs _is_ related to the patch.  I agree 
it adds value, if small.



   And if not, why are you people wasting
time arguing about that?


If you want to use your position as maintainer of uio_pci_generic to get 
people to overhaul the driver for you with unrelated changes, they will 
object.  I can understand a maintainer pointing out the right way to do 
something rather than the wrong way.  But piling on a list of unrelated 
features as prerequisites is, in my opinion, abuse.


Let me repeat that pci_uio_generic is already used for userspace 
drivers, with all the issues that you point out, for a long while now. 
These issues are not exposed by the requirement to use msix. You are not 
protecting the kernel in any way by blocking the patch, you are only 
protecting people with iommu-less configurations from using their hardware.



   The only thing I heard is that it's a hassle.
That's true (though if you follow my advice and try to share code with
vfio/pci you get a lot of this logic for free).


My thinking was that vfio was for secure (in the "defense against 
malicious attackers" sense) while uio_pci_generic was, de-facto at 
least, for use by trusted users.


We are in the strange situation that Alex is open to adding an 
insecure mode to vfio, while you object to a patch which does not change 
the security of uio_pci_generic in any way; it only makes it more usable 
at the cost of a tiny increase in the bug surface.



   So it's an
understandable argument if you just need something that works, quickly.
But if it's such a stopgap hack, there's no need to insist on it
upstream.


It is n

Re: [PATCH v3 2/3] uio_pci_generic: add MSI/MSI-X support

2015-10-08 Thread Avi Kivity



On 10/08/2015 10:32 AM, Michael S. Tsirkin wrote:

On Thu, Oct 08, 2015 at 08:33:45AM +0300, Avi Kivity wrote:

It is good practice to defend against root oopsing the kernel, but in some
cases it cannot be achieved.

Absolutely. That's one of the issues with these patches. They don't even
try where it's absolutely possible.



Are you referring to blocking the maps of the msix BAR areas?

I think there is value in that.  The value is small, because a 
corruption is more likely in the dynamic memory responsible for tens of 
millions of DMA operations per second, rather than a static 4K area, but 
it exists.



Re: [PATCH v3 2/3] uio_pci_generic: add MSI/MSI-X support

2015-10-07 Thread Avi Kivity



On 08/10/15 00:05, Michael S. Tsirkin wrote:

On Wed, Oct 07, 2015 at 07:39:16PM +0300, Avi Kivity wrote:

That's what I thought as well, but apparently adding msix support to the
already insecure uio drivers is even worse.

I'm glad you finally agree that what these drivers are doing is insecure.

And basically kernel cares about security, no one wants to maintain insecure 
stuff.

So you guys should think harder whether this code makes any sense upstream.


You simply ignore everything I write, cherry-picking the word "insecure" 
as if it makes your point.  That is very frustrating.


The kernel is not secure against root, even in the restricted "will it 
oops" sense.  You can oops it easily, try dd if=/dev/urandom of=/dev/mem 
(or of=/dev/sda for a more satisfying oops).



Getting support from kernel is probably the biggest reason to put code
upstream, and this driver taints kernel unconditionally so you don't get
that.


The biggest reason is that if a driver gets upstream, in a year or two 
it is universally available.




Alternatively, most of the problem you are trying to solve is for
virtualization - and it is is better addressed at the hypervisor level.
There are enough opensource hypervisors out there - work on IOMMU
support there would be time well spent.


It is not.  The problem we are trying to solve, and please consider the 
following as if written in all caps, is that some configurations do not 
have an iommu or cannot use it for performance reasons.


It is good practice to defend against root oopsing the kernel, but in 
some cases it cannot be achieved.  A trivial example is a nommu kernel, 
this is another.  In these cases we can give up on this goal, because it 
is not the only reason for the kernel's existence, there are others.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 2/3] uio_pci_generic: add MSI/MSI-X support

2015-10-07 Thread Avi Kivity



On 10/07/2015 07:31 PM, Alex Williamson wrote:

I
guess the no-iommu map would error if the IOVA isn't simply the bus
address of the page mapped.

Of course this is entirely unsafe and this no-iommu driver should taint
the kernel, but it at least standardizes on one userspace API and you're
already doing completely unsafe things with uio.  vfio should be
enlightened at least to the point that it allows only privileged users
access to devices under such a (lack of) iommu.

There is an additional complication.  With an iommu, userspace programs
the device with virtual addresses, but without it, they have to program
physical addresses.  So vfio would need to communicate this bit of
information.

We can go further and define a better translation API than the current
one (reading /proc/pagemap).  But it's going to be a bigger change to
vfio than I thought at first.

It sounds like a separate vfio iommu backend from type1, one that just
pins the page and returns the bus address.  The curse and benefit would
be that existing type1 users wouldn't "just work" in an insecure mode,
the DMA mapping code would need to be aware of the difference.  Still, I
do really prefer to keep vfio as only exposing a secure, iommu protected
device to the user because surely someone will try and users would
expect that removing iommu restrictions from vfio means they can do
device assignment to VMs w/o an iommu.


That's what I thought as well, but apparently adding msix support to the 
already insecure uio drivers is even worse.




Re: [PATCH v3 0/3] uio: add MSI/MSI-X support to uio_pci_generic driver

2015-10-07 Thread Avi Kivity



On 10/07/2015 01:25 PM, Michael S. Tsirkin wrote:

On Tue, Oct 06, 2015 at 07:09:11PM +0300, Avi Kivity wrote:


On 10/06/2015 06:15 PM, Michael S. Tsirkin wrote:

While it is possible that userspace malfunctions and accidentally programs
MSI incorrectly, the risk is dwarfed by the ability of userspace to program
DMA incorrectly.

That seems to imply that for the upstream kernel this is not a valid use case at 
all.


That is trivially incorrect; upstream uio_pci_generic has been used with dpdk
for years.

dpdk did polling for years; the patch to use interrupts was posted in
June 2015.


dpdk used interrupts long before that.




  Are dpdk applications an invalid use case?

The way dpdk is using UIO/sysfs is borderline at best, and can't be used
to justify new interfaces.  They have a more secure mode using VFIO.
That one's more reasonable.



Maybe this was not stressed enough times, but not all configurations 
have an iommu, or want to use one.




Re: [PATCH v3 2/3] uio_pci_generic: add MSI/MSI-X support

2015-10-07 Thread Avi Kivity



On 10/06/2015 09:51 PM, Alex Williamson wrote:

On Tue, 2015-10-06 at 18:23 +0300, Avi Kivity wrote:

On 10/06/2015 05:56 PM, Michael S. Tsirkin wrote:

On Tue, Oct 06, 2015 at 05:43:50PM +0300, Vlad Zolotarov wrote:

The only "like VFIO" behavior we implement here is binding the MSI-X
interrupt notification to eventfd descriptor.

There will be more if you add some basic memory protections.

Besides, that's not true.
Your patch queries MSI capability, sets # of vectors.
You even hinted you want to add BAR mapping down the road.

BAR mapping is already available from sysfs; it is not mandatory.


VFIO does all of that.


Copying vfio maintainer Alex (hi!).

vfio's charter is modern iommu-capable configurations. It is designed to
be secure enough to be usable by an unprivileged user.

For performance and hardware reasons, many dpdk deployments use
uio_pci_generic.  They are willing to trade off the security provided by
vfio for the performance and deployment flexibility of uio_pci_generic.
Forcing these features into vfio will compromise its security and
needlessly complicate its code (I guess it can be done with a "null"
iommu, but then vfio will have to decide whether it is secure or not).

It's not just the iommu model vfio uses, it's that vfio is built around
iommu groups.  For instance to use a device in vfio, the user opens the
vfio group file and asks for the device within that group.  That's a
fairly fundamental part of the mechanics to sidestep.

However, is there an opportunity at a lower level?  Systems without an
iommu typically have dma ops handled via a software iotlb (ie. bounce
buffers), but I think they simply don't have iommu ops registered.
Could a no-iommu, iommu subsystem provide enough dummy iommu ops to fake
out vfio?  It would need to iterate the devices on the bus and come up
with dummy iommu groups and dummy versions of iommu_map and unmap.  The
grouping is easy, one device per group, there's no isolation anyway.
The vfio type1 iommu backend will do pinning, which seems like an
improvement over the mlock that uio users probably try to do now.


Right now, people use hugetlbfs maps, which both lock the memory and 
provide better performance.



   I
guess the no-iommu map would error if the IOVA isn't simply the bus
address of the page mapped.

Of course this is entirely unsafe and this no-iommu driver should taint
the kernel, but it at least standardizes on one userspace API and you're
already doing completely unsafe things with uio.  vfio should be
enlightened at least to the point that it allows only privileged users
access to devices under such a (lack of) iommu.


There is an additional complication.  With an iommu, userspace programs 
the device with virtual addresses, but without it, they have to program 
physical addresses.  So vfio would need to communicate this bit of 
information.


We can go further and define a better translation API than the current 
one (reading /proc/pagemap).  But it's going to be a bigger change to 
vfio than I thought at first.




Re: [PATCH v3 0/3] uio: add MSI/MSI-X support to uio_pci_generic driver

2015-10-06 Thread Avi Kivity



On 10/06/2015 06:15 PM, Michael S. Tsirkin wrote:

While it is possible that userspace malfunctions and accidentally programs
MSI incorrectly, the risk is dwarfed by the ability of userspace to program
DMA incorrectly.

That seems to imply that for the upstream kernel this is not a valid use case at 
all.



That is trivially incorrect; upstream uio_pci_generic has been used with dpdk 
for years.  Are dpdk applications an invalid use case?


Again:

- security is not compromised.  you need to be root to (ab)use this.
- all of the potentially compromising functionality has been there from 
day 1
- uio_pci_generic is the only way to provide the required performance on 
some configurations (where kernel drivers, or userspace drivers + iommu 
are too slow)
- uio_pci_generic + msix is the only way to enable userspace drivers on 
some configurations (SRIOV)


The proposed functionality does not increase the attack surface.
The proposed functionality marginally increases the bug surface.
The proposed functionality is a natural evolution of uio_pci_generic.

There is a new class of applications (network function virtualization) 
which require this.  They can't use the kernel drivers because they are 
too slow.  They can't use the iommu because it is either too slow, or 
taken over by the hypervisor.  They are willing to live with less kernel 
protection, because they are a single user application anyway (and since 
they use a kernel bypass, they don't really care that much about the 
kernel).


The kernel serves more use cases than desktops or multi-user 
servers.  Some of these users are willing to trade off protection for 
performance or functionality (an extreme, yet similar, example is 
linux-nommu, which allows any application to access any bit of memory, 
due to the lack of protection hardware).



Re: [dpdk-dev] [PATCH 2/2] uio: new driver to support PCI MSI-X

2015-10-06 Thread Avi Kivity



On 10/06/2015 05:07 PM, Michael S. Tsirkin wrote:

On Tue, Oct 06, 2015 at 03:15:57PM +0300, Avi Kivity wrote:

btw, (2) doesn't really add any insecurity.  The user could already poke at
the msix tables (as well as perform DMA); they just couldn't get a useful
interrupt out of them.

Poking at msix tables won't cause memory corruption unless msix and bus
mastering are enabled.


It's a given that bus mastering is enabled.  It's true that msix is 
unlikely to be enabled, unless msix support is added.



   It's true root can enable msix and bus mastering
through sysfs - but that's easy to block or detect. Even if you don't
buy a security story, it seems less likely to trigger as a result
of a userspace bug.


If you're doing DMA, that's the least of your worries.

Still, zero-mapping the msix space seems reasonable, and can protect 
userspace from silly stuff.  It can't be considered to have anything to 
do with security though, as long as users can simply DMA to every bit of 
RAM in the system they want to.



Re: [PATCH v3 2/3] uio_pci_generic: add MSI/MSI-X support

2015-10-06 Thread Avi Kivity



On 10/06/2015 05:46 PM, Michael S. Tsirkin wrote:

On Mon, Oct 05, 2015 at 11:28:03AM +0300, Avi Kivity wrote:

Eventfd is a natural enough representation of an interrupt; both kvm and
vfio use it, and are also able to share the eventfd, allowing a vfio
interrupt to generate a kvm interrupt, without userspace intervention, and
one day without even kernel intervention.

eventfd without kernel intervention sounds unlikely.

kvm might configure the cpu such that an interrupt will not trigger a
vmexit.  eventfd seems like an unlikely interface to do that: with the
eventfd, device triggering it has no info about the interrupt so it
can't send it to the correct VM.


https://lwn.net/Articles/650863/


Re: [PATCH v3 2/3] uio_pci_generic: add MSI/MSI-X support

2015-10-06 Thread Avi Kivity



On 10/06/2015 05:56 PM, Michael S. Tsirkin wrote:

On Tue, Oct 06, 2015 at 05:43:50PM +0300, Vlad Zolotarov wrote:

The only "like VFIO" behavior we implement here is binding the MSI-X
interrupt notification to eventfd descriptor.

There will be more if you add some basic memory protections.

Besides, that's not true.
Your patch queries MSI capability, sets # of vectors.
You even hinted you want to add BAR mapping down the road.


BAR mapping is already available from sysfs; it is not mandatory.


VFIO does all of that.



Copying vfio maintainer Alex (hi!).

vfio's charter is modern iommu-capable configurations. It is designed to 
be secure enough to be usable by an unprivileged user.


For performance and hardware reasons, many dpdk deployments use 
uio_pci_generic.  They are willing to trade off the security provided by 
vfio for the performance and deployment flexibility of uio_pci_generic.  
Forcing these features into vfio will compromise its security and 
needlessly complicate its code (I guess it can be done with a "null" 
iommu, but then vfio will have to decide whether it is secure or not).



This doesn't justify the
hassle of implementing IOMMU-less VFIO mode.

This applies to both VFIO and UIO really.  I'm not sure the hassle of
maintaining this functionality in tree is justified.  It remains to be
seen whether there are any users that won't taint the kernel.
Apparently not in the current form of the patch, but who knows.


It is not msix that taints the kernel, it's uio_pci_generic.  Msix is a 
tiny feature addition that doesn't change the security situation one bit.


btw, currently you can map BARs and dd to /dev/mem to your heart's 
content without tainting the kernel.  I don't see how you can claim that 
msix support makes the situation worse, when root can access every bit 
of physical memory, either directly or via DMA.



Re: [PATCH v3 0/3] uio: add MSI/MSI-X support to uio_pci_generic driver

2015-10-06 Thread Avi Kivity



On 10/06/2015 05:30 PM, Michael S. Tsirkin wrote:

On Tue, Oct 06, 2015 at 11:37:59AM +0300, Vlad Zolotarov wrote:

Bus mastering is easily enabled from user space (taken from DPDK code):

static int
pci_uio_set_bus_master(int dev_fd)
{
	uint16_t reg;
	int ret;

	ret = pread(dev_fd, &reg, sizeof(reg), PCI_COMMAND);
	if (ret != sizeof(reg)) {
		RTE_LOG(ERR, EAL,
			"Cannot read command from PCI config space!\n");
		return -1;
	}

	/* return if bus mastering is already on */
	if (reg & PCI_COMMAND_MASTER)
		return 0;

	reg |= PCI_COMMAND_MASTER;

	ret = pwrite(dev_fd, &reg, sizeof(reg), PCI_COMMAND);
	if (ret != sizeof(reg)) {
		RTE_LOG(ERR, EAL,
			"Cannot write command to PCI config space!\n");
		return -1;
	}

	return 0;
}

So, this is a non-issue. ;)

There might be valid reasons for DPDK to do this, e.g. if using VFIO.


DPDK does this when using vfio, and when using uio_pci_generic. All of 
the network cards that DPDK supports require DMA.



I'm guessing it doesn't enable MSI though, does it?


It does not enable MSI, because the main kernel driver used for 
interacting with the device, uio_pci_generic, does not support MSI. In 
some configurations, PCI INTA is not available, while MSI(X) is, hence 
the desire that uio_pci_generic support MSI.


While it is possible that userspace malfunctions and accidentally 
programs MSI incorrectly, the risk is dwarfed by the ability of 
userspace to program DMA incorrectly.  Under normal operation userspace 
programs tens of millions of DMA operations per second, while it never 
touches the MSI BARs (it is the kernel that programs them).



Re: [dpdk-dev] [PATCH 2/2] uio: new driver to support PCI MSI-X

2015-10-06 Thread Avi Kivity

On 10/06/2015 10:33 AM, Stephen Hemminger wrote:

Other than implementation objections, so far the two main arguments
against this reduce to:
   1. If you allow UIO ioctl then it opens an API hook for all the crap out
  of tree UIO drivers to do what they want.
   2. If you allow UIO MSI-X then you are expanding the usage of userspace
  device access in an insecure manner.

Another alternative which I explored was making a version of VFIO that
works without an IOMMU. It solves #1 but actually increases the likely negative
response to argument #2. This would keep the same API, and avoid having to
modify UIO. But we would still have the same (if not more) resistance
from IOMMU developers who believe all systems have to be secure against
root.


vfio's charter was explicitly aiming for modern setups with iommus.

This could be revisited, but I agree it will have even more resistance, 
justified IMO.


btw, (2) doesn't really add any insecurity.  The user could already poke 
at the msix tables (as well as perform DMA); they just couldn't get a 
useful interrupt out of them.


Maybe a module parameter "allow_insecure_dma" can be added to 
uio_pci_generic.  Without the parameter, bus mastering and msix are 
disabled; with the parameter they are allowed.  This requires the sysadmin 
to take a positive step in order to make use of their hardware.




Re: [PATCH v3 0/3] uio: add MSI/MSI-X support to uio_pci_generic driver

2015-10-06 Thread Avi Kivity



On 10/06/2015 05:30 PM, Michael S. Tsirkin wrote:

On Tue, Oct 06, 2015 at 11:37:59AM +0300, Vlad Zolotarov wrote:

Bus mastering is easily enabled from the user space (taken from DPDK code):

static int
pci_uio_set_bus_master(int dev_fd)
{
uint16_t reg;
int ret;

ret = pread(dev_fd, , sizeof(reg), PCI_COMMAND);
if (ret != sizeof(reg)) {
RTE_LOG(ERR, EAL,
"Cannot read command from PCI config space!\n");
return -1;
}

/* return if bus mastering is already on */
if (reg & PCI_COMMAND_MASTER)
return 0;

reg |= PCI_COMMAND_MASTER;

ret = pwrite(dev_fd, , sizeof(reg), PCI_COMMAND);
if (ret != sizeof(reg)) {
RTE_LOG(ERR, EAL,
"Cannot write command to PCI config space!\n");
return -1;
}

return 0;
}

So, this is a non-issue. ;)

There might be valid reasons for DPDK to do this, e.g. if using VFIO.


DPDK does this when using vfio, and when using uio_pci_generic. All of 
the network cards that DPDK supports require DMA.



I'm guessing it doesn't enable MSI though, does it?


It does not enable MSI, because the main kernel driver used for 
interacting with the device, pci_uio_generic, does not support MSI. In 
some configurations, PCI INTA is not available, while MSI(X) is, hence 
the desire that pci_uio_generic support MSI.


While it is possible that userspace malfunctions and accidentally 
programs MSI incorrectly, the risk is dwarfed by the ability of 
userspace to program DMA incorrectly.  Under normal operation userspace 
programs tens of millions of DMA operations per second, while it never 
touches the MSI BARs (it is the kernel that programs them).

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 2/3] uio_pci_generic: add MSI/MSI-X support

2015-10-06 Thread Avi Kivity



On 10/06/2015 05:56 PM, Michael S. Tsirkin wrote:

On Tue, Oct 06, 2015 at 05:43:50PM +0300, Vlad Zolotarov wrote:

The only "like VFIO" behavior we implement here is binding the MSI-X
interrupt notification to eventfd descriptor.

There will be more if you add some basic memory protections.

Besides, that's not true.
Your patch queries MSI capability, sets # of vectors.
You even hinted you want to add BAR mapping down the road.


BAR mapping is already available from sysfs; it is not mandatory.


VFIO does all of that.



Copying vfio maintainer Alex (hi!).

vfio's charter is modern iommu-capable configurations. It is designed to 
be secure enough to be usable by an unprivileged user.


For performance and hardware reasons, many dpdk deployments use 
uio_pci_generic.  They are willing to trade off the security provided by 
vfio for the performance and deployment flexibility of pci_uio_generic.  
Forcing these features into vfio will compromise its security and 
needlessly complicate its code (I guess it can be done with a "null" 
iommu, but then vfio will have to decide whether it is secure or not).



This doesn't justifies the
hassle of implementing IOMMU-less VFIO mode.

This applies to both VFIO and UIO really.  I'm not sure the hassle of
maintaining this functionality in tree is justified.  It remains to be
seen whether there are any users that won't taint the kernel.
Apparently not in the current form of the patch, but who knows.


It is not msix that taints the kernel, it's uio_pci_generic.  Msix is a 
tiny feature addition that doesn't change the security situation one bit.


btw, currently you can map BARs and dd to /dev/mem to your heart's 
content without tainting the kernel.  I don't see how you can claim that 
msix support makes the situation worse, when root can access every bit 
of physical memory, either directly or via DMA.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [dpdk-dev] [PATCH 2/2] uio: new driver to support PCI MSI-X

2015-10-06 Thread Avi Kivity



On 10/06/2015 05:07 PM, Michael S. Tsirkin wrote:

On Tue, Oct 06, 2015 at 03:15:57PM +0300, Avi Kivity wrote:

btw, (2) doesn't really add any insecurity.  The user could already poke at
the msix tables (as well as perform DMA); they just couldn't get a useful
interrupt out of them.

Poking at msix tables won't cause memory corruption unless msix and bus
mastering is enabled.


It's a given that bus mastering is enabled.  It's true that msix is 
unlikely to be enabled, unless msix support is added.



   It's true root can enable msix and bus mastering
through sysfs - but that's easy to block or detect. Even if you don't
buy a security story, it seems less likely to trigger as a result
of a userspace bug.


If you're doing DMA, that's the least of your worries.

Still, zero-mapping the msix space seems reasonable, and can protect 
userspace from silly stuff.  It can't be considered to have anything to 
do with security though, as long as users can simply DMA to every bit of 
RAM in the system they want to.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 2/3] uio_pci_generic: add MSI/MSI-X support

2015-10-06 Thread Avi Kivity



On 10/06/2015 05:46 PM, Michael S. Tsirkin wrote:

On Mon, Oct 05, 2015 at 11:28:03AM +0300, Avi Kivity wrote:

Eventfd is a natural enough representation of an interrupt; both kvm and
vfio use it, and are also able to share the eventfd, allowing a vfio
interrupt to generate a kvm interrupt, without userspace intervention, and
one day without even kernel intervention.

eventfd without kernel intervention sounds unlikely.

kvm might configure the cpu such that an interrupt will not trigger a
vmexit.  eventfd seems like an unlikely interface to do that: with the
eventfd, the device triggering it has no info about the interrupt, so it
can't send it to the correct VM.


https://lwn.net/Articles/650863/


Re: [dpdk-dev] [PATCH 2/2] uio: new driver to support PCI MSI-X

2015-10-06 Thread Avi Kivity

On 10/06/2015 10:33 AM, Stephen Hemminger wrote:

Other than implementation objections, so far the two main arguments
against this reduce to:
   1. If you allow UIO ioctl then it opens an API hook for all the crap out
  of tree UIO drivers to do what they want.
   2. If you allow UIO MSI-X then you are expanding the usage of userspace
  device access in an insecure manner.

Another alternative which I explored was making a version of VFIO that
works without an IOMMU. It solves #1 but actually increases the likely negative
response to argument #2. This would keep the same API, and avoid having to
modify UIO. But we would still have the same (if not more) resistance
from IOMMU developers who believe all systems have to be secure against
root.


vfio's charter was explicitly aiming for modern setups with iommus.

This could be revisited, but I agree it will have even more resistance, 
justified IMO.


btw, (2) doesn't really add any insecurity.  The user could already poke 
at the msix tables (as well as perform DMA); they just couldn't get a 
useful interrupt out of them.


Maybe a module parameter "allow_insecure_dma" can be added to 
uio_pci_generic.  Without the parameter, bus mastering and msix are 
disabled; with the parameter they are allowed.  This requires the sysadmin 
to take a positive step in order to make use of their hardware.




Re: [PATCH v3 0/3] uio: add MSI/MSI-X support to uio_pci_generic driver

2015-10-06 Thread Avi Kivity



On 10/06/2015 06:15 PM, Michael S. Tsirkin wrote:

While it is possible that userspace malfunctions and accidentally programs
MSI incorrectly, the risk is dwarfed by the ability of userspace to program
DMA incorrectly.

That seems to imply that for the upstream kernel this is not a valid use case at 
all.



That is trivially incorrect; upstream uio_pci_generic has been used with dpdk 
for years.  Are dpdk applications an invalid use case?


Again:

- security is not compromised: you need to be root to (ab)use this.
- all of the potentially compromising functionality has been there from 
day 1.
- uio_pci_generic is the only way to provide the required performance on 
some configurations (where kernel drivers, or userspace drivers + iommu, 
are too slow).
- uio_pci_generic + msix is the only way to enable userspace drivers on 
some configurations (SR-IOV).


The proposed functionality does not increase the attack surface.
The proposed functionality marginally increases the bug surface.
The proposed functionality is a natural evolution of uio_pci_generic.

There is a new class of applications (network function virtualization) 
which requires this.  They can't use the kernel drivers because those are 
too slow.  They can't use the iommu because it is either too slow, or 
taken over by the hypervisor.  They are willing to live with less kernel 
protection, because they are single-user applications anyway (and since 
they use a kernel bypass, they don't really care that much about the 
kernel).


The kernel serves more use cases than desktops or multi-user 
servers.  Some of these users are willing to trade off protection for 
performance or functionality (an extreme, yet similar, example is 
linux-nommu, which allows any application to access any bit of memory, 
due to the lack of protection hardware).



Re: [PATCH v3 2/3] uio_pci_generic: add MSI/MSI-X support

2015-10-05 Thread Avi Kivity

On 10/05/2015 02:41 PM, Vlad Zolotarov wrote:



On 10/05/15 13:57, Greg KH wrote:

On Mon, Oct 05, 2015 at 01:48:39PM +0300, Vlad Zolotarov wrote:


On 10/05/15 10:56, Greg KH wrote:

On Mon, Oct 05, 2015 at 10:41:39AM +0300, Vlad Zolotarov wrote:

+struct msix_info {
+	int num_irqs;
+	struct msix_entry *table;
+	struct uio_msix_irq_ctx {
+		struct eventfd_ctx *trigger;	/* MSI-x vector to eventfd */

Why are you using eventfd for msi vectors?  What's the reason for
needing this?
A small correction - for MSI-X vectors. There may be only one MSI vector per
PCI function, and if it's used it would use the same interface as a legacy
INT#x interrupt uses at the moment.
So, for the MSI-X case the reason is that there may be (in most cases there
will be) more than one interrupt vector. Thus, as I've explained in the PATCH1
thread, we need a way to indicate each of them separately. eventfd seems
like a good way of doing so. If u have better ideas, pls., share.

You need to document what you are doing here, I don't see any
explanation for using eventfd at all.

And no, I don't know of any other solution, as I don't know what you are
trying to do here (hint, the changelog didn't document it...)

You haven't documented how this api works at all; you are going to have
to do a lot more work to justify this, as this greatly increases the
complexity of the user/kernel api in unknown ways.

I actually did document it a bit. Pls., check PATCH3 out.

That provided no information at all about how to use the api.

If it did, you would see that your api is broken for 32/64bit kernels
and will fall over into nasty pieces the first time you try to use it
there, which means it hasn't been tested at all :(

It has been tested of course ;)
I tested it only in a 64-bit environment, however, where both the kernel and
the user space application were compiled on the same machine with the same
compiler, and it could be that "int" had the same number of bytes both in the
kernel and in the user space application. Therefore it worked perfectly - I
patched DPDK to use the new uio_pci_generic MSI-X API to test this, and I have
verified that all 3 interrupt modes work: MSI-X with an SR-IOV VF device in an
Amazon EC2 guest, and INT#x and MSI with a PF device on a bare metal server.

However I agree using uint32_t for "vec" and "fd" would be much more
correct.

I don't think file descriptors are __u32 on a 64bit arch, are they?


I think they are "int" on all platforms and as far as I know u32 
should be enough to contain int on any platform.




You need to make sure structures have the same layout on both 32-bit and 
64-bit systems, or you'll have to code compat ioctl translations for 
them.  The best way to do that is to use __u32 so the sizes are obvious, 
even for int, and to pad everything to 64 bit:

+struct msix_info {
+	__u32 num_irqs;
+	__u32 pad;	/* so the pointer below is aligned to 64 bit on both 32-bit and 64-bit userspace */
+	struct msix_entry *table;
+	struct uio_msix_irq_ctx {
+		struct eventfd_ctx *trigger;	/* MSI-x vector to eventfd */





And NEVER use the _t types in kernel code,


Never meant it - it was for a user space interface. For a kernel it's 
u32 of course.




For interfaces, use __u32.  You can't use uint32_t because if someone 
uses C89 in 2015, they may not have <stdint.h>.




Re: [PATCH v3 2/3] uio_pci_generic: add MSI/MSI-X support

2015-10-05 Thread Avi Kivity

On 10/05/2015 01:57 PM, Greg KH wrote:

On Mon, Oct 05, 2015 at 01:48:39PM +0300, Vlad Zolotarov wrote:


On 10/05/15 10:56, Greg KH wrote:

On Mon, Oct 05, 2015 at 10:41:39AM +0300, Vlad Zolotarov wrote:

+struct msix_info {
+   int num_irqs;
+   struct msix_entry *table;
+   struct uio_msix_irq_ctx {
+   struct eventfd_ctx *trigger;/* MSI-x vector to eventfd */

Why are you using eventfd for msi vectors?  What's the reason for
needing this?

A small correction - for MSI-X vectors. There may be only one MSI vector per
PCI function and if it's used it would use the same interface as a legacy
INT#x interrupt uses at the moment.
So, for the MSI-X case the reason is that there may be (in most cases there will
be) more than one interrupt vector. Thus, as I've explained in the PATCH1
thread, we need a way to indicate each of them separately. eventfd seems
like a good way of doing so. If u have better ideas, pls., share.

You need to document what you are doing here, I don't see any
explanation for using eventfd at all.

And no, I don't know of any other solution as I don't know what you are
trying to do here (hint, the changelog didn't document it...)


You haven't documented how this api works at all; you are going to have
to do a lot more work to justify this, as this greatly increases the
complexity of the user/kernel api in unknown ways.

I actually did document it a bit. Pls., check PATCH3 out.

That provided no information at all about how to use the api.

If it did, you would see that your api is broken for 32/64bit kernels
and will fall over into nasty pieces the first time you try to use it
there, which means it hasn't been tested at all :(

It has been tested of course ;)
I tested it only in a 64-bit environment, however, where both the kernel and the
user space application were compiled on the same machine with the same compiler,
and it could be that "int" had the same number of bytes both in the kernel and
in the user space application. Therefore it worked perfectly - I patched DPDK to
use the new uio_pci_generic MSI-X API to test this, and I have verified that
all 3 interrupt modes work: MSI-X with an SR-IOV VF device in an Amazon EC2 guest,
and INT#x and MSI with a PF device on a bare metal server.

However I agree using uint32_t for "vec" and "fd" would be much more
correct.

I don't think file descriptors are __u32 on a 64bit arch, are they?

And NEVER use the _t types in kernel code; the namespace is all wrong
and it is not applicable for us, sorry.


Wasn't the real reason that they aren't defined (or reserved) by C89, 
and therefore could clash with a user identifier, rather than some 
inherent wrongness?




Re: [PATCH v3 2/3] uio_pci_generic: add MSI/MSI-X support

2015-10-05 Thread Avi Kivity

On 10/05/2015 12:49 PM, Greg KH wrote:

On Mon, Oct 05, 2015 at 11:28:03AM +0300, Avi Kivity wrote:

Of course it has to be documented, but this just follows vfio.

Eventfd is a natural enough representation of an interrupt; both kvm and
vfio use it, and are also able to share the eventfd, allowing a vfio
interrupt to generate a kvm interrupt, without userspace intervention, and
one day without even kernel intervention.

That's nice and wonderful, but it's not how UIO works today, so this is
now going to be a mix and match type interface, with no justification so
far as to why to create this new api and exactly how this is all going
to be used from userspace.


The intended user is dpdk (http://dpdk.org), which is a family of 
userspace networking drivers for high-performance networking applications.


The natural device driver for dpdk is vfio, which both provides memory 
protection and exposes msi/msix interrupts.  However, in many cases vfio 
cannot be used, either due to the lack of an iommu (for example, in 
virtualized environments) or out of a desire to avoid the iommu's 
performance impact.


The challenge in exposing msix interrupts to user space is that there 
are many of them, so you can't simply poll the device fd.  If you do, 
how do you know which interrupt was triggered?  The solution that vfio 
adopted was to associate each interrupt with an eventfd, allowing it to 
be individually polled.  Since you can pass an eventfd with SCM_RIGHTS, 
and since kvm can trigger guest interrupts using an eventfd, the 
solution is very flexible.



Example code would be even better...





This is the vfio dpdk interface code:

http://dpdk.org/browse/dpdk/tree/lib/librte_eal/linuxapp/eal/eal_pci_vfio.c

basically, the equivalent uio msix code would be very similar if uio 
adopts a similar interface:


http://dpdk.org/browse/dpdk/tree/lib/librte_eal/linuxapp/eal/eal_pci_uio.c

(current code lacks msi/msix support, of course).


Re: [PATCH v3 2/3] uio_pci_generic: add MSI/MSI-X support

2015-10-05 Thread Avi Kivity

On 10/05/2015 06:11 AM, Greg KH wrote:

On Sun, Oct 04, 2015 at 11:43:17PM +0300, Vlad Zolotarov wrote:

Add support for MSI and MSI-X interrupt modes:
- Interrupt mode selection order is:
 INT#X (for backward compatibility) -> MSI-X -> MSI.
- Add ioctl() commands:
   - UIO_PCI_GENERIC_INT_MODE_GET: query the current interrupt mode.
   - UIO_PCI_GENERIC_IRQ_NUM_GET: query the maximum number of IRQs.
   - UIO_PCI_GENERIC_IRQ_SET: bind the IRQ to eventfd (similar to vfio).
- Add mappings to all bars (memory and portio): some devices have
  registers related to MSI/MSI-X handling outside BAR0.

Signed-off-by: Vlad Zolotarov 
---
New in v3:
- Add __iomem qualifier to temp buffer receiving ioremap value.

New in v2:
- Added #include  to uio_pci_generic.c

Signed-off-by: Vlad Zolotarov 
---
  drivers/uio/uio_pci_generic.c   | 410 +---
  include/linux/uio_pci_generic.h |  36 
  2 files changed, 423 insertions(+), 23 deletions(-)
  create mode 100644 include/linux/uio_pci_generic.h

diff --git a/drivers/uio/uio_pci_generic.c b/drivers/uio/uio_pci_generic.c
index d0b508b..6b8b1789 100644
--- a/drivers/uio/uio_pci_generic.c
+++ b/drivers/uio/uio_pci_generic.c
@@ -22,16 +22,32 @@
  #include 
  #include 
  #include 
+#include 
  #include 
  #include 
+#include 
+#include 
+#include 
  
  #define DRIVER_VERSION	"0.01.0"

  #define DRIVER_AUTHOR "Michael S. Tsirkin "
  #define DRIVER_DESC   "Generic UIO driver for PCI 2.3 devices"
  
+struct msix_info {

+   int num_irqs;
+   struct msix_entry *table;
+   struct uio_msix_irq_ctx {
+   struct eventfd_ctx *trigger;/* MSI-x vector to eventfd */

Why are you using eventfd for msi vectors?  What's the reason for
needing this?

You haven't documented how this api works at all; you are going to have
to do a lot more work to justify this, as this greatly increases the
complexity of the user/kernel api in unknown ways.




Of course it has to be documented, but this just follows vfio.

Eventfd is a natural enough representation of an interrupt; both kvm and 
vfio use it, and are also able to share the eventfd, allowing a vfio 
interrupt to generate a kvm interrupt, without userspace intervention, 
and one day without even kernel intervention.



